Tips and tricks for using import.io

My original plan for this webinar was to look at voice activation and some of the hacks we made a few months ago. Unfortunately, a few technical difficulties got in the way. But, being the inventive guy I am, I decided to wing it and show you some more interesting tips and tricks you can use to pull data with our tool.

Crawling with infinite scroll

The first thing I showed you was how to get data from behind an infinite scroll. This is when some of the data on the page is hidden behind something like a “Load more” button. You can’t extract the additional pages using traditional methods, because in extraction mode you can’t click the button. Luckily, you can use Chrome’s developer tools to find the URLs that the button loads and use that pattern to build a Crawler. Then, using the CONCATENATE function in Google Sheets, you can generate a complete list of the URLs that hold the data and paste them into the “Where to crawl” box. I’ve written a whole blog post and a corresponding tutorial on this topic, so make sure to check those out.
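If spreadsheets aren’t your thing, the same list can be generated with a few lines of Python. This is only a sketch: the example.com address and the `page` parameter are made up for illustration, so swap in whatever pattern you actually see in Chrome’s developer tools.

```python
def build_page_urls(pattern, first_page, last_page):
    """Fill the page number into the URL pattern for every page in the range."""
    return [pattern.format(page=n) for n in range(first_page, last_page + 1)]


# Hypothetical pattern; replace it with the one your site really uses.
URL_PATTERN = "https://www.example.com/products?page={page}"

urls = build_page_urls(URL_PATTERN, 1, 5)

# One URL per line, ready to paste into the "Where to crawl" box.
print("\n".join(urls))
```

Change the first and last page numbers to match however many “load more” pages the site has.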

XPaths

Next I demonstrated how to get data that moves around from page to page. Our tool relies on the patterns in a website’s HTML to pull the data you need. So if, for example, the data you want is on line 7 of one page and line 8 of another, the tool will get confused. To get around this problem you can write a custom XPath for the data. In this case you would use the “following-sibling” axis to get the value that follows a specific label like “battery capacity”. For more detailed information you can read our blog post or check out our tutorial.
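To see why anchoring on the label helps, here is a small Python sketch. The two pages and their values are invented for illustration. An XPath for this markup might look like //td[normalize-space()='Battery capacity']/following-sibling::td[1], but Python’s standard library doesn’t support the following-sibling axis, so the function below mimics it by hand: find the cell holding the label, then return the next cell in the same row.

```python
import xml.etree.ElementTree as ET

# Two hypothetical product pages. The "Battery capacity" row is the
# second row on one page and the third row on the other.
PAGE_A = """<table>
<tr><td>Weight</td><td>140 g</td></tr>
<tr><td>Battery capacity</td><td>3000 mAh</td></tr>
</table>"""

PAGE_B = """<table>
<tr><td>Weight</td><td>155 g</td></tr>
<tr><td>Colour</td><td>Black</td></tr>
<tr><td>Battery capacity</td><td>2800 mAh</td></tr>
</table>"""


def value_after_label(markup, label):
    """Return the text of the cell right after the cell whose text is `label`.

    Returns None when the label is not found.
    """
    root = ET.fromstring(markup)
    for row in root.iter("tr"):
        cells = row.findall("td")
        # Stop one short of the end: the last cell has no following sibling.
        for index, cell in enumerate(cells[:-1]):
            if (cell.text or "").strip() == label:
                return (cells[index + 1].text or "").strip()
    return None


for page in (PAGE_A, PAGE_B):
    print(value_after_label(page, "Battery capacity"))
```

The row position differs between the two pages, but the lookup still finds the right value on both because it never depends on the row number.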

API publish failures

Finally, Bamford gave you a few insights into why you sometimes get API publish failures. There is a limit to the amount of JavaScript we can process per call, because processing too much ties up a lot of our server capacity. This means that if a site’s JavaScript takes too long to run (more than 5 seconds), we may not be able to create an API for it. The first thing to try is re-training the site with JavaScript turned off. If you still can’t get the site to work, send us a support request and we’ll do our best to get it working for you.
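Turning JavaScript off only helps if the data you want is already in the HTML the server sends. Here is a rough way to check that yourself, assuming you have saved the page source to a string. The sample pages and the helper name are hypothetical, and this is not how our servers decide anything; it is just a quick sanity check.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def in_static_html(markup, expected_text):
    """True if `expected_text` appears in the page text outside any script."""
    parser = VisibleTextParser()
    parser.feed(markup)
    parser.close()
    return expected_text in " ".join(parser.chunks)


# Hypothetical page sources: one has the price in plain HTML, the other
# only produces it by running JavaScript in the browser.
STATIC_PAGE = "<html><body><h1>Phone X</h1><p>Price: $199</p></body></html>"
SCRIPTED_PAGE = (
    '<html><body><div id="app"></div>'
    '<script>render({"price": "$199"});</script></body></html>'
)

print(in_static_html(STATIC_PAGE, "$199"))
print(in_static_html(SCRIPTED_PAGE, "$199"))
```

If the value only shows up inside a script, re-training with JavaScript off is unlikely to find it, and a support request is probably the better route.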

Next time

For our next webinar, we’re partnering with our good friends over at Infogr.am to do some cool data visualizations. It’s going to be a great look at how to use a simple but powerful tool to make your data look awesome. You can sign up for that one here.

Comments

Hey,
so i’m wondering: is there a way to make things work for the other type of infinite scrolling? (the one that loads images automatically once you reach the bottom of the page)
great tool by the way and very good work

Hi Ahmed,

It depends on the underlying HTML that loads the next bit of the site. If you send the URL to support@import.io, we’ll take a look at it for you and see what we can do.

