I took a back seat on the webinar this week and left you in the very capable hands of our developer duo Chris A and Chris B (Bamford), who showed you some of our more advanced features. Now you may think that you need to be a developer to use these features, but I’m here to tell you (as a non-dev myself) that the concepts are actually pretty simple. With just a little bit of extra knowledge you can make our tool do some pretty crazy things. And if you get stuck you can always shoot me an email at firstname.lastname@example.org and I’ll help you figure it out!
Ok, enough of that… Let’s get this show on the road!
Advanced Column Settings
The first advanced feature we offer to developers is the ability to get more targeted data during your extraction. Xpaths are what our app uses to pick out data from a webpage: when you highlight things with your cursor, import.io is actually looking at the Xpath underneath. But sometimes you want data that can’t be seen on the page (meta tags, for example), or the data might move around from page to page on the site. This is where our Xpath override feature comes in. You can use Chrome’s developer tools to find the bit of the Xpath you need and then modify it to get exactly the data you want from the page!
Regular Expressions (Regex), on the other hand, are a way of filtering the results from the extraction into something more specific. So, if the browser is pulling too much data you can use a Regex to refine it to get just what you want.
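To make the difference between the two concrete, here is a minimal sketch in plain Python (not import.io itself). The page snippet, the meta tag contents and the price pattern are all made up for illustration. The Xpath picks out a meta tag you can't see on the rendered page, and the Regex then trims that text down to just the price.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page snippet; real pages are usually messier.
PAGE = """
<html>
  <head>
    <meta name="description" content="Acme Kettle - Price: $24.99 - In stock" />
  </head>
  <body><h1>Acme Kettle</h1></body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Xpath step: select a meta tag, which is not visible on the rendered page.
meta = root.find(".//meta[@name='description']")
raw = meta.get("content")

# Regex step: narrow the extracted text down to just the price digits.
match = re.search(r"\$(\d+\.\d{2})", raw)
price = match.group(1) if match else None

print(raw)
print(price)
```

The Xpath decides where on the page to look, and the Regex decides which part of what was found to keep.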
Here is the configuration for the Amazon crawler Chris created: Amazon Crawler
You’re a star
One of the other things you can use our advanced column settings for is to get star ratings. When you see a star rating on a page, you can usually get it as an image, but that’s not always very helpful. A lot of the time what you really want is the star value (i.e. 3 of 5). Using a combination of Xpath and Regex you can pull exactly that!
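As a rough illustration of the idea, the sketch below assumes the rating lives in the image's alt text in the form "3 out of 5 stars". That markup and wording are hypothetical, and the site you are extracting from may store the value somewhere else (a class name or a title attribute, for instance).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup: the star rating is an image with descriptive alt text.
SNIPPET = """
<div class="review">
  <img src="stars-3.png" alt="3 out of 5 stars" />
</div>
"""

root = ET.fromstring(SNIPPET.strip())

# Xpath step: grab the text hidden behind the star image.
alt_text = root.find(".//img").get("alt")

# Regex step: keep only the numeric rating and the scale.
m = re.search(r"(\d+(?:\.\d+)?)\s+out of\s+(\d+)", alt_text)
rating, scale = (float(m.group(1)), int(m.group(2))) if m else (None, None)

print(rating, scale)
```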
If you’re looking for some flowers, here is the API Chris built: Flowers
Note: when using Xpath on a multiple results row, your Xpath is relative to the specific row not the entire page (like it is on a single results page). Chris shows you how to work this out in the video of the webinar.
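Here is a small sketch of what "relative to the row" means, using Python's built-in XML parser on a made-up page. The element names and classes are assumptions for illustration only. The same kind of Xpath gives a different answer depending on whether it starts from the whole page or from one row.

```python
import xml.etree.ElementTree as ET

# Hypothetical results page with a banner heading and two result rows.
PAGE = """
<html><body>
  <h2 class="name">Site banner</h2>
  <ul>
    <li class="row"><h2 class="name">Roses</h2><span class="price">12.00</span></li>
    <li class="row"><h2 class="name">Tulips</h2><span class="price">9.50</span></li>
  </ul>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# Evaluated against the whole page, the path finds the banner first.
page_level = root.find(".//h2[@class='name']").text

# Evaluated against each row, the path only sees that row's own children.
rows = root.findall(".//li[@class='row']")
names = [row.find("./h2[@class='name']").text for row in rows]
prices = [row.find("./span[@class='price']").text for row in rows]

print(page_level)
print(list(zip(names, prices)))
```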
The development team has been very busy upgrading our client libraries for different languages to help you integrate your data more efficiently. Simply choose the programming language you want (from the Integrate page) and we tell you where we host our script (if applicable), how to configure it with your credentials, and finally how to execute a query.
Download Crawler and Dataset data over the API
You can download the data saved for Crawlers and Datasets programmatically using the API. Chris has written a more in-depth blog post on how to do it, but in essence you paste this link (http://api.import.io/store/connector), together with the GUID of your Data Source, into your URL bar. Then, from the information that comes back, you can use the GUID in the “snapshot” field to get the JSON file. You can also use this method to access your historical data from that source.
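If you would rather script the first step than paste it into a browser, the sketch below shows the general shape. It leans on some assumptions: that the GUID is appended to the link as a path segment, and that the response is a JSON object with a top-level "snapshot" field. The sample GUIDs and the sample response are invented, and authentication and the actual HTTP request are left out, so check Chris's post for the exact details.

```python
import json
from typing import Optional
from urllib.parse import quote

BASE_URL = "http://api.import.io/store/connector"  # link given in this post


def connector_url(guid: str) -> str:
    """Build the Data Source URL. Assumption: the GUID is a path segment."""
    return f"{BASE_URL}/{quote(guid, safe='')}"


def snapshot_guid(response_text: str) -> Optional[str]:
    """Read the "snapshot" field from the JSON text the API sends back."""
    return json.loads(response_text).get("snapshot")


# Hypothetical GUIDs and response body, for illustration only.
url = connector_url("11111111-2222-3333-4444-555555555555")
sample_response = '{"name": "Flowers", "snapshot": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"}'
snapshot = snapshot_guid(sample_response)

print(url)
print(snapshot)
```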
Which character encodings do import.io data sources support?
import.io detects a range of character encodings, including UTF-8, UTF-16, GBK, and many others, using a combination of HTTP headers and HTML meta tags. If you notice encoding issues, please drop us a line at email@example.com telling us which site is affected, and we’ll take a look.
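For the curious, here is a simplified sketch of how detection from headers and meta tags can work in general. This is not import.io's actual implementation. The function name, the order of precedence (header first, then meta tag) and the UTF-8 fallback are all assumptions.

```python
import re
from typing import Optional

# Matches both <meta charset="..."> and the older http-equiv form.
META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)


def detect_encoding(content_type: Optional[str], body: bytes, default: str = "utf-8") -> str:
    """Guess a page's encoding from its Content-Type header, then its meta tags."""
    if content_type:
        m = re.search(r"charset=[\"']?([\w\-]+)", content_type, re.IGNORECASE)
        if m:
            return m.group(1).lower()
    # Only scan the start of the document, where meta tags normally live.
    m = META_CHARSET.search(body[:2048])
    if m:
        return m.group(1).decode("ascii").lower()
    return default


print(detect_encoding("text/html; charset=GBK", b"<html></html>"))
print(detect_encoding("text/html", b'<html><head><meta charset="UTF-16"></head></html>'))
print(detect_encoding(None, b"<html></html>"))
```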
Is there a quick way to tell the difference between multiple versions of a Crawler or Dataset’s saved data?
Currently import.io does not provide this feature – but feel free to add it on our Ideas Forum (along with any other ideas you want!)
How do import.io Crawlers deal with robots.txt?
import.io’s Crawlers obey all robots.txt directives – there is no way to override this.
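If you want to check in advance whether a page is off limits, Python's standard library can read robots.txt rules for you. The sketch below uses a made-up robots.txt, a made-up crawler name and example.com URLs. It illustrates how the directives work in general, not how import.io's Crawlers are built.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything is allowed except the /private/ area.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "example-crawler" is a placeholder user agent name.
allowed = parser.can_fetch("example-crawler", "http://example.com/products/1")
blocked = parser.can_fetch("example-crawler", "http://example.com/private/data")

print(allowed, blocked)
```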
Can you use Xpaths and regular expressions at the same time?
Yes! Watch our webinar video embedded above to see how we combine these two tools together to extract star ratings from a particularly tricky data source.
Where can we find documentation for the API?
Also, a quick congratulations to Ralph for winning the data punk t-shirt for best question!
Join us next time…
Your favorite double act is back! Chris A and I will be doing another Tips & Tricks webinar to show you a few more tricks of the trade. Make sure you come prepared with questions, because we’ll be giving away another t-shirt for the best one!