Why isn’t my Crawler working?

Hey everyone, Alex here again. For this week’s webinar I chose the ever-popular topic of…Crawlers. This one is always a big draw because a Crawler is the easiest way to get lots and lots of data very quickly. But instead of showing you how to build a Crawler (you can watch that webinar here), I want to talk about some of the most common Crawler issues and how you can solve them.

Crawler is missing data

If you’re running a Crawler and it starts looking like this:

Some of the columns are missing data

…you’ve got an issue. You can see that it hasn’t pulled back the data in some of the rows. This is because you haven’t trained the Crawler on enough pages. Training pages is about more than just making you do the same task over and over; it’s about refining the XPaths our tool uses to find the data. When you’re training your Crawler, give it a varied set of pages so it has seen all the different micro-variations of the page layout and knows where the data is in each of them.
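If you want to see why variety matters, here’s a minimal sketch in plain Python using the lxml library (not something built into import.io, and the HTML snippets are made up) showing how a tiny layout change breaks an overly specific XPath while a refined one survives it:

```python
from lxml import html

# Two made-up micro-variations of the "same" product page layout.
page_a = html.fromstring('<div class="info"><span class="price">$10</span></div>')
page_b = html.fromstring('<div class="info"><p><span class="price">$12</span></p></div>')

# An XPath learned from page A alone is too specific (direct child only)...
narrow = '//div[@class="info"]/span[@class="price"]/text()'
print(page_a.xpath(narrow))  # ['$10']
print(page_b.xpath(narrow))  # [] -- misses the price on page B

# ...while one refined against both variations (any descendant) finds both.
robust = '//div[@class="info"]//span[@class="price"]/text()'
print(page_a.xpath(robust))  # ['$10']
print(page_b.xpath(robust))  # ['$12']
```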

Solution: Make note of a few links that aren’t returning all the data and try adding those to your training examples.

Crawler is only returning my training pages

This is probably the most common issue and it has three root causes:

Not enough links

The first issue, and the simplest to figure out, is that the pages you have trained don’t have enough links (or the right links) to get the Crawler from that page to another page with data on it.

Use a directory page with lots of links

Solution: Try to find another page on the site (even if it’s not a page you want data from) that has lots of links on it, and use that as your Where to start URL.
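Before retraining, you can sanity-check a candidate start page yourself. Here’s a rough sketch in Python with requests and lxml (again, not an import.io feature, and the URL is a placeholder) that counts how many same-site links a page exposes:

```python
import requests
from lxml import html
from urllib.parse import urljoin, urlparse

# Placeholder URL: swap in the candidate start page you're evaluating.
start_url = "https://example.com/directory"
doc = html.fromstring(requests.get(start_url, timeout=10).text)

# Collect absolute, same-site links; a good start page should yield plenty.
links = {urljoin(start_url, href) for href in doc.xpath("//a/@href")}
same_site = [u for u in links if urlparse(u).netloc == urlparse(start_url).netloc]
print(f"{len(same_site)} same-site link(s) found on {start_url}")
```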

Wrong URL pattern

Based on your training pages, import.io tries to generate a URL pattern that will work across the whole site, but it doesn’t always get it exactly right because (like the micro-variations in page layout) sites don’t always follow an exact URL template. To diagnose this issue, download the Crawler log file, which will show you which URLs the Crawler has looked at versus which ones have actually been converted to data.

This URL has /?retailer=
This URL has ?retailer= (no /)

Solution: Using the URLs in the log file, modify your URL template so that it will convert the right links.

Your URL template should look like this – notice the {any}
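To see why {any} helps, here’s a rough illustration in plain Python (the template-to-regex conversion is my own approximation of how the matching works, and the log URLs are made up) that tests which URLs a template would convert:

```python
import re

def template_to_regex(template: str) -> re.Pattern:
    # Escape the literal parts of the template, then let {any} match anything.
    return re.compile(re.escape(template).replace(re.escape("{any}"), ".+") + "$")

# Hypothetical URLs copied out of a Crawler log file.
log_urls = [
    "http://example.com/stores/?retailer=acme",  # has the slash
    "http://example.com/stores?retailer=acme",   # no slash
    "http://example.com/about",                  # no retailer parameter at all
]

matcher = template_to_regex("http://example.com/stores{any}retailer={any}")
for url in log_urls:
    print(url, "->", "converted" if matcher.match(url) else "skipped")
```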

JavaScript links

This is a tricky one. If your page looks like it has lots of links and your URL pattern matches the URLs of the pages you want extracted, you probably have JavaScript links. This means that instead of normal HTML hyperlinks, the links you can see on the page are actually JavaScript buttons, which Crawlers can’t follow. It’s an incredibly annoying source-site issue (right up there with displaying the price as an image!).

Store names are JS buttons, not HTML links
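One way to confirm the diagnosis is to check whether the “links” are real anchors with an href or just elements with click handlers. A quick sketch in Python with lxml (made-up HTML, not an import.io feature):

```python
from lxml import html

# Made-up store listing where half the "links" are JavaScript buttons.
page = html.fromstring("""
<ul>
  <li><a href="/stores/1">Real HTML link</a></li>
  <li><span onclick="openStore(2)">JS button with no href</span></li>
</ul>
""")

real_links = page.xpath("//a[@href]")     # followable by a Crawler
js_buttons = page.xpath("//*[@onclick]")  # not followable
print(f"{len(real_links)} real link(s), {len(js_buttons)} JavaScript-only element(s)")
```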

Solution: Make your URL template as generic as possible (using {any}) and put your page depth up to 10. This will take a lot longer, but it should hopefully get you all the data.

Make your URL template as generic as possible

I’m getting lots of blocked pages

If your Crawler is consistently getting blocked, it probably means you have your Crawler settings turned up too high.

Solution 1: Decrease your simultaneous pages and increase your pause between pages.
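In crawling terms, those two settings are concurrency and delay. Here’s a toy sketch in plain Python (with placeholder URLs) of what a politer fetch loop looks like:

```python
import time
import requests

# Placeholder URLs standing in for the pages your Crawler would visit.
urls = [f"https://example.com/page/{i}" for i in range(1, 6)]

PAUSE_SECONDS = 5  # the "pause between pages" setting, turned up

# Fetching one page at a time (simultaneous pages = 1) with a pause between
# requests looks far less like an attack to the target site.
for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(PAUSE_SECONDS)
```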

Solution 2: If that doesn’t work, or you’re getting a robots.txt message, you’ll need to use an Extractor and manually put in the URLs you want to extract data from with our Bulk Extract feature. It’s a far more efficient method and will be much quicker at actually extracting the data.

Add your URLs manually into an Extractor using Bulk Extract
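If you suspect robots.txt is the blocker, you can check what a site disallows before reaching for Bulk Extract. A minimal sketch using Python’s standard urllib.robotparser (the site and path are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Placeholder site: point this at the site that's blocking you.
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()

url = "https://example.com/stores/1"
if parser.can_fetch("*", url):
    print(f"{url}: crawling is allowed")
else:
    print(f"{url}: disallowed by robots.txt -- feed it to Bulk Extract instead")
```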

Can’t get enough webinars?

Me! Alex Gimson 🙂

Every week, I host a live webinar where I give you some insights on how to use import.io to get data and answer all your burning questions. You can sign up to receive webinar invites here. If you can’t make it, don’t worry! Every webinar gets recorded and posted here on our blog and on our YouTube channel. Next week’s webinar is going to be a very special one on Data Journalism, where I’ll also be launching a brand new service.
