Everything you ever wanted to know about Crawlers in one webinar!

I do these webinars not because I like talking about my beard – although I certainly do – but because I love nothing more than helping you guys learn to get data from the web. So when I get something through on the request line, I always do my best to accommodate it! I got quite a few questions about Crawlers last time, so I decided to make this webinar all about them. I’m going to take you through what one is, how to build one, a few advanced features and then answer some FAQs. But first, a poem from a user!

“The itsy bitsy Crawler went up the data spout,

down came the JSON and washed the Crawler out,

out came the sun and dried up all the XPath

so the itsy bitsy Crawler went up the data spout again!”

– Fred Kaimann

What is a Crawler?

A Crawler is effectively an automated Extractor. You train it on 5 example pages and then let it go! Your Crawler will follow the links on the pages you trained to try and travel to all the other pages on the website to bring back data that matches the pattern you mapped in your example pages.

Right, let’s build one

As usual, I’m turning to my trusted Asos to show you how to build a simple Crawler to get all the product information on their site. For any of you who have built crawlers before, you’ll notice that with our new My Data page, we’ve skipped a few steps by launching you directly to the Detect Optimal Settings step of the workflow. You don’t have to do anything different, just follow the steps from here.

One of the most important steps in building your Crawler is determining if your data is in Single or Multiple rows. In this case – a product page – we are mapping single results, which means that our table will only have one row in it. Once we’ve mapped one page, we need to add at least 4 more examples. You shouldn’t have to do any more training on these, just validate that the data mapping has worked correctly.

Top tip: If you want to crawl an entire site, it helps to train pages from all over the site so you can try to validate all different types of data layouts.

Now that we’ve done the training on 5 pages we can upload it to import.io (which will save the schema) and then we’re ready to run the crawler.

Running the Crawler – Basic Settings

We automatically populate all the crawling fields for you, but you can change them quite easily to make your crawler more efficient. The first two things to look at are the “Where to Start” box and the page depth.

  • Where to start – by default, the crawler will start from the pages you gave as examples. However, it is sometimes more efficient to start from somewhere more central to the site (like the homepage).
  • Page depth – this is the maximum number of clicks from the start URL the crawler will travel to find data. I’ve even made a handy chart to help you understand what I’m on about…

In this example, we have taken one of the start URLs of the Crawler (orange circle). A page depth of 1 looks at all of the links on this page (the arrows) and goes to the resulting pages (yellow circles); in this instance there are 5. A page depth of 2 then looks at all of the links on those 5 pages and goes to the pages they point to (green circles). You can see that even with a page depth of just 2, the crawler has returned 25 results from one page. By default it is set to 10 (the maximum allowed) to enable you to get all the data. However, the fewer clicks the crawler needs to travel, the quicker your data will be returned.
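If you like to see these things in code, here is a minimal sketch of how a page-depth limit behaves. It is not import.io's actual implementation: the `crawl` function and the `links` dictionary are made up for illustration, and the toy site assumes every page in the chart has exactly 5 links.

```python
from collections import deque

def crawl(start_url, links, max_depth):
    """Breadth-first walk over a link graph, stopping at max_depth clicks.

    `links` is a hypothetical dict mapping each URL to the URLs it links to;
    a real crawler would fetch each page and read its links instead.
    Returns {url: depth at which it was first reached}.
    """
    depth_of = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if depth_of[url] == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in depth_of:  # visit each URL only once
                depth_of[target] = depth_of[url] + 1
                queue.append(target)
    return depth_of

# Toy site shaped like the chart: 1 start page (orange), 5 pages linked
# from it (yellow), and 5 pages linked from each of those (green).
links = {"start": [f"y{i}" for i in range(5)]}
for i in range(5):
    links[f"y{i}"] = [f"g{i}{j}" for j in range(5)]

reached = crawl("start", links, max_depth=2)
depth_2_pages = [url for url, depth in reached.items() if depth == 2]
print(len(depth_2_pages))  # 25
```

Dropping `max_depth` to 1 in this toy example stops the walk at the 5 yellow pages, which is why a lower page depth comes back quicker.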

Let’s kick it up a notch – Advanced Settings

If we toggle to advanced mode, you’ll see a bunch more options.

  • Simultaneous Pages – the number of pages the crawler will attempt to visit at the same time.
  • Pause Between Pages – how long the crawler will wait (in seconds) before moving from one page to the next.
  • Where to Crawl – the URL patterns of the pages you want the crawler to visit.
  • Where Not to Crawl – the URL patterns you don’t want the crawler to visit when looking for data.
  • Where to Extract Data From – the URL pattern generated from your example pages. The crawler will try to extract data from any page that matches that pattern.
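To make the three URL pattern settings a bit more concrete, here is a rough sketch of the decision a crawler makes for each link it finds. Everything in it is an assumption for illustration: the patterns are made-up Python regular expressions for an imaginary `www.example.com` shop, and the pattern syntax in the tool itself may differ.

```python
import re

# Hypothetical patterns; the real ones are generated from your example pages.
WHERE_TO_CRAWL = re.compile(r"^https?://www\.example\.com/")
WHERE_NOT_TO_CRAWL = re.compile(r"/(basket|login)")
WHERE_TO_EXTRACT = re.compile(r"^https?://www\.example\.com/product/\d+$")

def classify(url):
    """Return 'skip', 'extract' or 'follow' for a URL the crawler has found."""
    if not WHERE_TO_CRAWL.search(url) or WHERE_NOT_TO_CRAWL.search(url):
        return "skip"      # outside the site, or explicitly excluded
    if WHERE_TO_EXTRACT.search(url):
        return "extract"   # looks like one of the trained example pages
    return "follow"        # worth visiting for its links, but no data expected

for url in [
    "https://www.example.com/product/123",
    "https://www.example.com/women/dresses",
    "https://www.example.com/basket",
]:
    print(url, "->", classify(url))
```

The tighter the “Where to Crawl” and “Where Not to Crawl” patterns, the fewer pages end up in the "follow" pile, which is usually where a slow crawl spends most of its time.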

The final thing I want to point out is the save log, which will show you what pages have been converted successfully and which ones haven’t. If you find that you’re getting a lot of blocked pages you can check the save log to see where your Crawler is going wrong.

Once you’ve fiddled with all the settings, all you need to do is hit Go! For more on the advanced features, have a look at this advanced tutorial.

FAQs – Straight from the streets

In preparation for this webinar – yes, I do prep for these – I pulled a few of the most frequently asked questions I get through support.

Why does my crawler only bring back 5 results?

This is the question I get asked the most. Sometimes you’ve trained it on 5 pages, you hit “Go” and all it does is convert the 5 pages you trained it on in the first place. How annoying! Not to worry, there’s one very simple thing you can do to stop this from happening.

It all has to do with the “Where to start” box. Sometimes the URLs you trained the Crawler on don’t have enough links on them to get your Crawler off and running in the right place (remember my diagram from earlier). So, the first thing to try is changing the “Where to start” URLs. You want to change these to something relatively generic, like the home page, or a page that has a lot of links on it to pages you want to extract data from.

What do I do when my data moves around the page?

Internally we refer to this as the Wikipedia problem, and it happens when the data you want isn’t always in the same place (or row) within the HTML. Luckily, we’ve added the ability to write in your own XPaths, which allow you to pinpoint the specific data you want based on the underlying code from the site.

XPaths might sound a bit scary, but they’re surprisingly simple. The W3Schools XPath tutorial is super helpful and you can look at our own tutorial for a few of the most popular XPath issues.
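Here is a small sketch of why an XPath that targets a label copes with data that moves around. The two "pages" are made-up snippets, and the example uses Python's built-in `xml.etree.ElementTree`, which only supports a subset of XPath and needs well-formed markup, so treat it as an illustration of the idea rather than of how the tool evaluates your XPaths.

```python
import xml.etree.ElementTree as ET

# Two made-up pages where the "Born" row sits at a different position.
page_a = ("<table><tr><th>Born</th><td>1912</td></tr>"
          "<tr><th>Died</th><td>1954</td></tr></table>")
page_b = ("<table><tr><th>Name</th><td>Ada</td></tr>"
          "<tr><th>Born</th><td>1815</td></tr></table>")

def by_position(page):
    """Take the value from the first row, wherever 'Born' happens to be."""
    return ET.fromstring(page).find(".//tr[1]/td").text

def by_label(page):
    """Take the value from the row whose header cell says 'Born'."""
    cell = ET.fromstring(page).find(".//tr[th='Born']/td")
    return cell.text if cell is not None else None

print(by_position(page_a), by_position(page_b))  # 1912 Ada
print(by_label(page_a), by_label(page_b))        # 1912 1815
```

The position-based path picks up the wrong cell as soon as the row moves, while the label-based path follows the data.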

What can I do with my Crawled data?

Well, that’s up to you really. We’ve given you a load of options such as integrating it with Google Sheets, downloading it as a CSV, sharing it via social media and loads more! Whatever you decide to do, just remember that a Crawler returns a static data set, which means that it won’t update with new data unless you run the Crawler again.

Quick tip: if you start experiencing problems with the workflow, it’s a good idea to export an .io file before you exit the workflow and send me an email at support@import.io. This is essentially a “save as” and it allows me to see all the training you’ve done thus far so I know where it’s gone wrong.

To export an .io file press ALT + E (cmd + Shift + E on Mac) on your keyboard.

Question Time

Is there an option to download the images I’ve crawled?

Nope. We don’t allow you to download any of the images you extract. Mostly because this would take up an insane amount of server processing power! But also, because it might violate some copyright laws.

If the website changes their layout do I have to retrain the Crawler?

The answer to this depends on how much has changed. If there are only minor changes, you should find that our extraction algorithms are self-healing. For major site and layout changes, though, you will need to go back and retrain.

Can I schedule crawlers?

Yes you certainly can! We give you the ability to run your Crawler over the command line which means that you can schedule it to run in the future, update on a schedule and even POST your data to a URL.

Can I crawl pages that come up in pop-up windows?

It depends on how the pop-up window is structured. If you can access the data in the HTML or find a unique URL code for the window then it should be possible to pull data from it.

Can you crawl sites that require login?

We can get data from behind a login for Extractors and Connectors, but not Crawlers. This is because crawls don’t process cookies which you would need in order to remain logged in.

Can you save a crawl halfway through and start it later?

Not at the moment. It is an idea we’ve been toying with. If it’s a feature you’d like to see, vote for it on our ideas forum.

When you save your settings for a crawl, how do you go back to it if you want to recrawl or train pages again?

You can do this from your My Data page. Simply hit the edit button in the top right of the data set and we’ll put you back in the workflow where you can add more pages, change the mapping and recrawl the site.

What is the difference between crawling locally vs remotely?

If you choose to crawl locally, the Crawler will use your IP address, otherwise it will run through one of our proxy servers. In the case of Crawlers that need JavaScript to be turned on, you won’t be able to run the crawl locally.

Do you ever get redundant data from doing more depths?

You won’t get duplicate data unless the data is duplicated on the website itself under different URLs. There’s no connection between page depth and the likelihood of redundant data.

Join us next time

We’ve got something really special in store for you guys next webinar (Sept 23 @5pm)! Louis Dorard will be here to show you how to use data you get from import.io and analyse it in BigML to make predictions for the future!
