Crawling Nemo – an import.io webinar

At the end of my last webinar I asked you guys to tell me what topic you were dying to hear about. Quite a few of you wrote in, and after careful analysis…it was clear that you wanted to hear more about Crawlers! More specifically, you said you wanted a more in-depth look at some of our more advanced features.

Well, I’m nothing if not a people pleaser, so I set to work and came up with a webinar I think you guys will love! If you’re not familiar with crawlers, don’t worry, you can watch this Crawling 101 webinar I did a little while ago, which should tell you all you need to know.

Why use a crawler?

The first question you should always ask yourself when getting data is: Do I really need a crawler?

We have three tools at import.io, and each has a different purpose. The crawler’s job is to help you get lots and lots of data into a static (non-refreshable) data set. It gives you maximum data for minimal effort: you train the crawler on a minimum of 5 pages and it goes through the whole site looking for data that matches what you trained. If you only want data from a few specific pages, it’s probably better to use an extractor; and if you need to fill in a form (like a search box) to reach the data, it’s better to use a connector.

How does a crawler work?

The training part of the crawler works the same as training any other tool, except that you need to train a minimum of 5 pages. Once you’ve trained your crawler, our app generates a URL template based on the URLs of the pages you’ve trained. Then, as the crawler moves through the site – following the links it finds on the pages you trained – it extracts data from any page whose URL matches that template.
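To make that idea concrete, here is a minimal sketch – not import.io’s actual template engine – of how a URL template acts as a filter: links that match the pattern get extracted, everything else is just used for navigation. The regex and the example links below are made up for illustration.

```python
import re

# A simplified stand-in for the kind of URL template the app infers from
# trained pages: the literal parts stay fixed, the variable parts become
# wildcards. (The real template syntax uses placeholders like {words} and
# {num}; this regex is just an illustration.)
URL_TEMPLATE = re.compile(
    r"^http://www\.asos\.com/[^/]+/[^/]+/Prod/pgeproduct\.aspx\?iid=\d+$"
)

def matches_template(url):
    """True if a discovered link looks like the product pages we trained."""
    return URL_TEMPLATE.match(url) is not None

# Links the crawler might discover while walking the site (invented).
found_links = [
    "http://www.asos.com/ridley/supersoft-skinny-jeans/Prod/pgeproduct.aspx?iid=12345",
    "http://www.asos.com/infobank/delivery.aspx",
]

for link in found_links:
    print(link, "->", "extract data" if matches_template(link) else "just follow links")
```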

Understanding how the crawler works is pretty important, because it helps to “think like a crawler” when you’re building one.

Back to square one

The second question you need to ask yourself is: Is this site crawlable? Not all sites can be crawled.

To illustrate an important point, I’m going to try building a crawler for this results page on Asos. You might think that if you just want the name, image and price of all the jeans on Asos, it would be better to crawl the results pages because there are lots of them. And you’d be right – in theory.

What happens in real life, though, is that when I try to train page 2, the site redirects me back to page 1. This is because of something called a canonical URL, which you can find in the HTML of the page. A canonical URL points the page back to a specific (master) URL, and crawlers get sent there instead. Sites do this for a number of reasons, but most commonly because the site builders want Google’s crawler to index certain pages and not others. For example, if you were A/B testing your homepage, you wouldn’t want Google to index it twice! So the canonical URL redirects the crawler to the “master” homepage.

In any case, crawlers can’t ignore canonical URLs – sorry, they just can’t. In the case of Asos we can crawl the individual product pages, because they don’t have a canonical URL. But if the site you want to crawl doesn’t offer another option, the best thing to do is build an extractor and then feed each URL into it using this bulk upload spreadsheet (feature coming soon!).
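If you’re curious what a canonical tag actually looks like, here’s a small sketch that pulls it out of a page’s HTML and compares it with the URL you asked for. Only the link rel="canonical" convention itself is real; the markup and URLs are invented to mirror the Asos situation above.

```python
from html.parser import HTMLParser

# Made-up markup for a "page 2" results page whose canonical tag points
# back at page 1 - the situation described above. The URLs are invented.
PAGE_HTML = """
<html><head>
  <link rel="canonical" href="http://www.asos.com/men/jeans/cat/pgecategory.aspx" />
</head><body>...</body></html>
"""

class CanonicalFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

requested_url = "http://www.asos.com/men/jeans/cat/pgecategory.aspx?pge=2"

finder = CanonicalFinder()
finder.feed(PAGE_HTML)

if finder.canonical and finder.canonical != requested_url:
    print("This page declares a canonical URL:", finder.canonical)
    print("A crawler will treat it as a copy of that page, not a new one.")
```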

More than meets the eye

Now, when you’re training your crawler, the page is essentially “locked” – that means you can’t navigate around. That can be a bit of a pain when you need data from another part of the page – like if I wanted the Delivery information on this Asos page, you can see that it’s in another tab within the product details section.

But even though we can’t see the data, that doesn’t mean it’s not on the page. If we go to the page in Chrome (or any other browser), right-click on the piece of data we want and inspect the element, we can copy the XPath for that piece of data.

Then you just paste that XPath into the advanced column settings in the crawler.
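Before you paste it in, it’s worth checking that the XPath you copied really selects the data you want. Here’s a quick sketch using the lxml library on a trimmed-down stand-in for the page; the markup and the XPath are just examples of the sort Chrome’s “Copy XPath” gives you, not the real ones for the Asos page.

```python
# pip install lxml
from lxml import html

# A trimmed-down stand-in for the product page source - in practice you'd
# load the real page. The XPath below is illustrative; yours will differ.
page_source = """
<div id="product">
  <div class="tabs">
    <div id="tab-delivery"><p>Free delivery on orders over £20</p></div>
  </div>
</div>
"""

tree = html.fromstring(page_source)
delivery = tree.xpath('//*[@id="tab-delivery"]/p/text()')
print(delivery)  # ['Free delivery on orders over £20']
```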

“Walk” like a crawler

When you’re training the subsequent pages of your crawler, it’s best to “act like a crawler” by navigating to the next page within the app (instead of copying and pasting the URL from Chrome). This helps you see whether there are enough links on your training pages to reach the rest of the data.

You shouldn’t need to go more than 2 or 3 clicks between where you are and the next page you want to train. The more clicks you have to make, the less efficient your crawler will be. If there’s no way to navigate (by clicking) to the next page you want to train, you will definitely need to change the “where to start” URL in the crawler’s settings later.

P.S. Using the back button is cheating! Your crawler can’t use the back button, so neither can you.

It’s also a good idea to train more than 5 pages (I usually train at least twice that). The more pages you train, the better the URL template the app will be able to generate, and the better your data extraction will be.

Run, Crawler, Run!

Before you run the crawler, toggle into the advanced settings section and have a look at the options to see where you can make it more efficient.

10,000 leagues under the crawler

Remember how I asked you to click through to each page from the one you were training? Did you count the clicks? If you did, you can use that number as your page depth, because the crawler won’t need to go more than that many clicks from the start pages.

Now, for safety, and to make sure you really can get all your data, you should probably add one or two to that number. Just remember that the higher the page depth, the longer the crawler will take and the less efficient it will be – you should very rarely need more than 5.
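Here’s a minimal sketch of what page depth means in practice: a breadth-first walk over a toy link graph (just a dictionary standing in for a real site) that stops following links once it’s more than max_depth clicks from the start page. The page names are invented.

```python
from collections import deque

# A toy link graph standing in for a site: each page lists the pages it links to.
SITE = {
    "home":         ["mens", "womens"],
    "mens":         ["jeans-page-1"],
    "jeans-page-1": ["product-a", "jeans-page-2"],
    "jeans-page-2": ["product-b"],
    "womens":       [],
    "product-a":    [],
    "product-b":    [],
}

def crawl(start, max_depth):
    """Breadth-first walk that never goes more than max_depth clicks from start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        print(f"depth {depth}: {page}")
        if depth == max_depth:
            continue  # deep enough - don't follow this page's links
        for link in SITE.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

crawl("home", max_depth=4)  # reaches product-b; max_depth=3 would stop one click short
```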

Save your work

Crawlers can sometimes take a while, so it’s a good idea to save your progress. That way if your internet goes down or the app crashes (hey, we’re still in beta – it happens occasionally), you don’t have to start crawling from scratch. There are two different ways to save as you crawl.

The first is the save log, which generates a file of all the URLs that have been visited and shows which ones converted or failed. This is also a good way to check whether your crawler is working: run it for a bit and then check the save log to see if it is converting the right URLs.

The second save method is the save stream, which creates a file with all the data that has been converted so far. A quick warning, though: this file can be quite large (depending on how much data you’re collecting) and can take up quite a bit of space on your laptop.

Multitasking like a crawler

Unlike me, crawlers can actually do more than one thing at a time. We call this feature simultaneous pages, and it refers to the number of pages the crawler will visit at once. We cap this at 3, because each page visited puts load on the site’s server – and the more pages at a time, the greater the load. The same goes for the pause between pages, which is how long the crawler will wait before going to the next link.

If you find you’re getting blocked by the site you’re crawling, the first thing to do is increase the pause between pages and lower the number of simultaneous pages.
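For anyone who likes to see the mechanics, here’s a rough sketch of what those two settings amount to: at most a handful of pages in flight at once, with a pause before each new request. The fetch function just pretends to download a page, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SIMULTANEOUS_PAGES = 3  # pages fetched at once (the app caps this at 3)
PAUSE_SECONDS = 1.0     # wait between requests, to go easy on the server

def fetch(url):
    """Stand-in for an HTTP request - a real crawler would download the page here."""
    print("visiting", url)
    time.sleep(0.2)            # pretend the request takes a moment
    time.sleep(PAUSE_SECONDS)  # pause before this worker takes its next page
    return url

urls = [f"http://example.com/product/{i}" for i in range(9)]

# At most SIMULTANEOUS_PAGES requests are in flight at any one time.
with ThreadPoolExecutor(max_workers=SIMULTANEOUS_PAGES) as pool:
    results = list(pool.map(fetch, urls))

print("crawled", len(results), "pages")
```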

O’Crawler where art thou

The “Where to crawl” box is pretty obvious – it’s where the crawler will journey to find new links. If it’s your first time crawling the site, go ahead and set this to the homepage. As you get better at crawling, try narrowing it down to something more specific, like the men’s section. A quick tip: if you set this to the same URL as the “Where to extract data from” field, it can sometimes make your crawl more efficient.

As you might imagine, “Where not to crawl” is the opposite of “Where to crawl” – it’s the links you don’t want the crawler to visit. The best way to use this is when you rerun your crawler to extract only the new (or unconverted) links: paste in the links that have already been converted (from your save log) so you don’t have to recrawl them – see the sketch below.
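Here’s a small sketch of that workflow. I’m assuming, purely for illustration, that the save log is a CSV with “url” and “status” columns – check your own file for the real layout – and pulling out the already-converted URLs ready to paste into “Where not to crawl”.

```python
import csv
import io

# A made-up save log. The "url" and "status" column names are assumptions
# for this sketch; your real save log may be laid out differently.
sample_log = io.StringIO(
    "url,status\n"
    "http://www.asos.com/ridley/skinny-jeans/Prod/pgeproduct.aspx?iid=1,converted\n"
    "http://www.asos.com/infobank/delivery.aspx,failed\n"
)

converted = [row["url"] for row in csv.DictReader(sample_log)
             if row["status"] == "converted"]

# Paste these into "Where not to crawl" so the rerun skips them.
print("\n".join(converted))
```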

Where to extract data from – we’ll get to this in the next section… which is now!

More than you bargained for

You can see when I start crawling that I’m getting a lot of things that aren’t jeans (for the purposes of this webinar, I only want jeans). Because the word “jeans” wasn’t in the URLs of the pages I trained in a structured way (i.e. asos.com/jeans/brandname), my URL template (or “where to extract data from”) was only looking for combinations of words and numbers – nothing jeans-specific.

But, I can change all that by doing a little URL template manipulation…

If I change it from this: www.asos.com/{words}/{words}/Prod/pgeproduct.aspx?iid={num}$

…to this: www.asos.com/{words}/{any}Jeans{any}/Prod/pgeproduct.aspx?iid={num}$

…by adding the word jeans, the crawler should ignore any URL that doesn’t contain jeans somewhere in it.
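If it helps to see why that works, here’s a rough regex reading of the two templates. The way I’ve mapped {words}, {any} and {num} to character classes is an assumption made for illustration, not import.io’s exact definition, and the URLs are made up.

```python
import re

# Rough regex readings of the two templates above. Mapping {words} to
# [\w-]+, {any} to .* and {num} to \d+ is an assumption for illustration.
before = re.compile(r"^www\.asos\.com/[\w-]+/[\w-]+/Prod/pgeproduct\.aspx\?iid=\d+$")
after  = re.compile(r"^www\.asos\.com/[\w-]+/.*Jeans.*/Prod/pgeproduct\.aspx\?iid=\d+$")

urls = [
    "www.asos.com/ridley/ridley-supersoft-skinny-Jeans/Prod/pgeproduct.aspx?iid=123",
    "www.asos.com/ridley/ridley-floral-dress/Prod/pgeproduct.aspx?iid=456",
]

for url in urls:
    print(url)
    print("  old template:", "match" if before.match(url) else "no match")
    print("  new template:", "match" if after.match(url) else "no match")
```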

Whew! That’s it. A comprehensive look at all the advanced crawler features.

Join us next time…

I am really excited for this next webinar. Andrew (our co-founder and Chief Evangelist) will be joining me to show you all the different Google Sheets integrations we’ve done in the last year. From the simple import of live data to the more complex chained APIs, and everything in between – this is going to be Google Sheets Madness!
