5 questions to ask when crawling

Crawling can be a bit of a mystery if you’re not familiar with the principles. I’ve done a few (rather long) full “How to Crawl” tutorials, but to keep things simple, here are the 5 questions you should ask yourself while building a crawler…

Do I need to crawl in the first place?

Is the data you need dispersed across more than 10 pages? If not, then the crawler tool probably isn’t the one for you. In that case, it would be far more efficient to use our extractor tool and simply add the URLs you need to get data from within your dataset page.

Do you want to query a site with lots of different search terms? If so, then the connector is the tool you need. It uses page interactions, such as searches, to get the resulting data.
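To make that idea concrete, here is a rough Python sketch (an illustration, not the connector itself) of what “one interaction, many search terms” means – the endpoint and the query parameter are assumptions, not a real site’s search API:

```python
# Rough illustration of the connector idea: the same page interaction (a
# search) repeated with many different terms. The endpoint and the "q"
# parameter are assumptions, not a real site's search API.
import requests

search_terms = ["jeans", "jackets", "trainers"]
for term in search_terms:
    response = requests.get(
        "http://www.example.com/search",  # hypothetical search page
        params={"q": term},
        timeout=10,
    )
    print(term, "->", len(response.text), "bytes of results")
```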

To truly take advantage of all the crawler has to offer, you need data from lots of pages within the same website that share a similar structure. The perfect example is wanting all the products from a site like Asos – there are thousands of them and each product page is laid out in roughly the same way.
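As a rough sketch of why “similar structure” matters (this is not how import.io works internally – the URLs and selectors below are made up), one extraction rule can be reused on every page:

```python
# Because every product page shares the same layout, one set of selectors
# works across all of them. URLs and selector names are made up for the sketch.
import requests
from bs4 import BeautifulSoup

product_urls = [
    "http://www.example.com/product/1",
    "http://www.example.com/product/2",
]

def extract_product(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return {
        "name": soup.select_one("h1.product-name").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
    }

for url in product_urls:
    print(extract_product(url))
```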

Are there enough links on the page?

In other words, is it possible to crawl at all? Don’t forget that when our crawler runs, it uses links from the pages you give it to navigate around the site. If the web pages don’t have any links leading off them (there are some out there!), the crawler won’t be able to reach the rest of the site.
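If you want to sanity-check this yourself, a few lines of Python (using requests and BeautifulSoup, purely as an illustration – the start URL is hypothetical) will list every link a crawler could follow from a page:

```python
# List every link a crawler could follow from a page. If this comes back
# empty, there is nothing for a crawler to navigate with.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def links_on_page(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

print(links_on_page("http://www.example.com/"))  # hypothetical start page
```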

If you do find that to be the case, an extractor, along with our batch search Google Sheet hack, is really the only way to get the data you need.

Is the URL generic?

The URL plays a MAJOR part when it comes to crawlers. If a site has generic URLs, it won’t be possible to hone the crawler in on a certain category of pages. For instance, take a site like Asos: if we were to train a crawler on five of its pages, we would end up with a URL template that looked something like this:

http://www.asos.com/{words}/{any}$

This isn’t very helpful if we only want items from, say, the jeans category, because the crawler won’t be able to differentiate between categories. The thing to do here is to generate the links for the crawler yourself and then paste them into the “Where to start”, “Where to crawl” and “Where to extract data from” boxes, creating a targeted crawler.
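As an illustration of “generate the links yourself” (the path pattern and page count below are assumptions, not real Asos URLs), a quick script can produce a category-specific list to paste into those boxes:

```python
# Build a targeted list of URLs for a single category instead of relying on a
# generic URL template. The path pattern and page count are assumptions.
category = "jeans"
urls = [
    "http://www.asos.com/{cat}/?page={page}".format(cat=category, page=page)
    for page in range(1, 21)  # listing pages 1-20 of the category
]

print("\n".join(urls))
# Paste this list into the "Where to start", "Where to crawl" and
# "Where to extract data from" boxes to keep the crawler on one category.
```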

Does the data move around on the page?

A common issue when crawling is that data sometimes moves around on a website from page to page. For instance, take these snippets from two computer spec pages:

[Screenshots: two spec tables – on the first page, Solid State Drive Capacity sits in row 5; on the second, it sits in row 8]

You can see that in the 1st example, Solid State Drive Capacity is the 5th row down, whereas in the 2nd example, the capacity is the 8th row down. Since the crawler can’t read, it will pull whatever is in the 5th row down – not necessarily what you want.

In this case, you need to use some XPath magic to anchor the tool on a certain word, which should get around the issue. For more information on XPaths, please check out my webinar on them.
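To give a flavour of what that anchoring looks like, here is a toy example using Python’s lxml with made-up HTML (not the real spec pages): a positional XPath grabs whatever happens to be in that row, while one anchored on the label text finds the right value wherever the row sits.

```python
# Toy example: a positional XPath breaks when the row moves, while an XPath
# anchored on the label text keeps working. The HTML snippets are made up.
from lxml import etree

page_one = ("<table>"
            "<tr><td>Solid State Drive Capacity</td><td>256 GB</td></tr>"
            "</table>")
page_two = ("<table>"
            "<tr><td>Weight</td><td>1.3 kg</td></tr>"
            "<tr><td>Solid State Drive Capacity</td><td>512 GB</td></tr>"
            "</table>")

positional = "//table/tr[1]/td[2]/text()"  # always row 1, column 2
anchored = '//tr[td[contains(., "Solid State Drive Capacity")]]/td[2]/text()'

for html in (page_one, page_two):
    tree = etree.HTML(html)
    print(tree.xpath(positional), tree.xpath(anchored))
# Output: ['256 GB'] ['256 GB']
#         ['1.3 kg'] ['512 GB']   <- the positional XPath grabbed the wrong row
```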

Can I hone the crawler?

If you find that your crawler is going to take hours and hours to finish, it might be an idea to train it on more pages to get a more exact URL pattern. Five pages is the minimum you can train the crawler on; however, I always suggest training on at least 10 pages, as this will give you a really accurate URL template. While I’m on the subject of honing your crawler, it might also be an idea to change the page depth. The page depth is the number of clicks away from the starting pages. The higher the page depth, the further into the website the crawler will travel – and sometimes it’ll go too far!

The higher the page depth, the more pages the crawler has to cover, because every extra click away from the start page adds another layer of links for it to follow.
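For the curious, here is roughly what a depth limit does in code – a generic Python sketch, not import.io’s crawler: each extra level of depth is one more “click” of links the crawler is allowed to follow from the start page.

```python
# Generic sketch of page depth: the crawler follows links level by level and
# stops expanding once a page is max_depth clicks away from the start page.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def crawl(start_url, max_depth=2):
    domain = urlparse(start_url).netloc
    seen = {start_url}
    frontier = [(start_url, 0)]            # (url, clicks away from the start)
    while frontier:
        url, depth = frontier.pop(0)
        if depth >= max_depth:             # the page depth setting
            continue
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return seen   # every page here is at most max_depth clicks from the start
```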

Hopefully this gives you some insight into what you should look for when creating a crawler. But, just in case things do go wrong, here are a few tips to give you a little guidance.

Save io files

These are VERY important. If the program crashes or you lose your internet connection, an io file saves the progress you have made up until that point. You can save io files by hitting ALT+E at any point in the workflow, and it will extract the progress made so far. If you want to load these files back up, simply open a new workflow and hit ALT+I; this will import the io file with all of your settings, meaning you haven’t lost anything!
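In import.io the io file is handled for you with those shortcuts; purely to illustrate the idea, a hand-rolled crawler would do the same thing by writing its state to disk (the file layout below is an assumption, not the io file format):

```python
# Illustration only: saving and restoring crawl progress so a crash doesn't
# cost you the pages already covered. This is NOT import.io's io file format.
import json

def save_progress(path, seen, frontier):
    with open(path, "w") as f:
        json.dump({"seen": sorted(seen), "frontier": frontier}, f)

def load_progress(path):
    with open(path) as f:
        state = json.load(f)
    return set(state["seen"]), [tuple(item) for item in state["frontier"]]
```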

Check out our tutorials!

We have a massive database of knowledge articles and webinars designed to educate and inform you about import.io. If you ever find yourself stuck, please check out our crawler tutorial – it might hold the key to creating an efficient crawler!
