How to crawl a website the right way

The word “crawling” has become synonymous with any way of getting data from the web programmatically. But true crawling is actually a very specific method of finding URLs, and the term has become somewhat confusing.

Before we go into too much detail, let me just say that this post assumes that the reason you want to crawl a website is to get data from it, and that you are not technical enough to code your own crawler from scratch (or you’re looking for a better way). If one (or both) of those things are true, then read on, friend!

In order to get data from a website programmatically, you need a program that can take a URL as an input, read through the underlying code and extract the data into a spreadsheet, JSON feed or other structured format you can use. These programs – which can be written in almost any language – are generally referred to as web scrapers, but we prefer to call them Extractors (it just sounds friendlier).
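
To make that concrete, here is a minimal sketch of an Extractor in Python, using the requests and BeautifulSoup libraries. The URL and the CSS selectors are made-up placeholders – a real Extractor would be built to the specific page you care about.

```python
# A minimal Extractor sketch: take a URL, pull a few fields out of the page,
# and save them as a row in a spreadsheet. URL and selectors are placeholders.
import csv

import requests
from bs4 import BeautifulSoup


def extract(url):
    """Fetch a page and turn it into a dictionary of fields."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        "url": url,
        "name": soup.select_one("h1").get_text(strip=True),               # assumed selector
        "price": soup.select_one(".product-price").get_text(strip=True),  # assumed selector
    }


if __name__ == "__main__":
    rows = [extract("https://www.example.com/product/123")]  # placeholder URL
    with open("products.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "name", "price"])
        writer.writeheader()
        writer.writerows(rows)
```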

A crawler, on the other hand, is one way of generating a list of URLs you then feed through your Extractor. But, they’re not always the best way.

 

How a crawler works

Crawlers are URL discovery tools. You give them a webpage to start from and they will follow all the links they can find on that page. If the links they follow lead them to a page they haven’t been to before, they will follow all the links on that page as well. And so on, and so on, in a loop.

The hope is that if you repeat this process enough, you will eventually wind up with a list of all the possible URLs, usually restricted to a given domain (e.g. asos.com).
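
If you wanted to sketch that loop yourself, it might look something like this in Python – again using requests and BeautifulSoup, with a made-up start URL and an arbitrary page limit:

```python
# A bare-bones crawler sketch: start from one page, follow every link you haven't
# seen before, and stay on a single domain.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(start_url, max_pages=100):
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) <= max_pages:  # stop once we've discovered enough URLs
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip pages that fail to load
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            # only follow links we haven't visited, on the same domain
            if link not in seen and urlparse(link).netloc == domain:
                seen.add(link)
                queue.append(link)
    return sorted(seen)


urls = crawl("https://www.example.com/")  # placeholder start page
```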

The good thing about crawlers is they try to visit every page on a website, so they are very complete. The bad thing about crawlers is that they try to visit every page on a website, so they take a long time.

Crawling is very slow. And it produces a pretty heavy load on the site you are crawling. Not to mention, crawlers produce a static list of URLs, meaning if you want new information you have to recrawl the entire website all over again.

The final problem with crawlers is that a lot of the URLs they find won’t have data you want. Say you’re trying to build a catalogue of all the products on Asos. Asos has a lot of pages that don’t have products on them, but the crawler will visit them anyway.

When it comes to passing your crawled list of URLs through an Extractor, you can use URL patterning to try to weed out some of these unwanted pages (more on that later). You can also try to add some logic to your crawler to help guide it away from pages you know don’t have data.
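
As a taste of what URL patterning looks like, here is a tiny sketch that filters a crawled list down to product-style URLs – the pattern itself is invented for illustration:

```python
# URL patterning in its simplest form: keep only the crawled URLs that match
# an assumed product-page pattern, and discard the rest before extraction.
import re

crawled_urls = [
    "https://www.example.com/men/shirts",
    "https://www.example.com/help/delivery",
    "https://www.example.com/men/blue-shirt/prd/1234567",
]

product_pattern = re.compile(r"/prd/\d+$")  # assumed shape of a product URL
product_urls = [u for u in crawled_urls if product_pattern.search(u)]
# -> only the third URL survives; feed this filtered list to your Extractor
```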

But, crawling isn’t the only way to get a list of URLs. If you’re at all familiar with the site you want data from, there are much easier (and faster) ways of getting to your data.

 

Crawlers vs Extractors – finding the right tool for the job

Extractors have two benefits over crawlers. The first is that they are super targeted – i.e. much, much faster. And the second is that they are refreshable – no recrawling required; just run the program again.

When you build your Extractor (remember, that’s the thing that gets the data from your list of URLs), you build it to a specific webpage. You’re essentially training the program so that, given a URL with that page structure, it knows what the data looks like.

In addition to providing you with a program you can pass other URLs through, an Extractor also provides you with a spreadsheet of all the data on that page.

So if you were to build an Extractor to a page with lots of links on it, you would wind up with a spreadsheet with a list of links that you could then feed into another Extractor to get the specific data you’re after.


That all sounds rather theoretical, so let’s look at some examples:

 

Data on a single page

Obviously, if your data sits on just one page, crawling the whole site is pretty pointless. It will be much simpler to build a single Extractor to that single page.

Data on multiple consecutive pages

This is often referred to as pagination – i.e. the data you want is in a single list that is spread out across multiple pages. In this case, you need to look at the URLs of a few pages to see if you can detect a pattern. For example:

In this case you can see that each URL ends in page=X

If the pattern is replicable, you should:

  1. Build an Extractor to the first page
  2. Generate a list of URLs using the pattern in Excel
  3. Run your list of URLs through your Extractor

You should end up with a big spreadsheet of all the pages turned into data.
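
If you’d rather script step 2 than do it in Excel, here is a short sketch that generates the URLs from a page=X pattern and saves them as a one-column spreadsheet – the base URL and page count are placeholders:

```python
# Generate a list of paginated URLs from an assumed page=X pattern and save them
# as a one-column CSV that can be fed straight into an Extractor.
import csv

base = "https://www.example.com/products?page={}"    # assumed pattern ending in page=X
urls = [base.format(page) for page in range(1, 51)]  # pages 1..50 - adjust to the site

with open("paginated_urls.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url"])
    writer.writerows([u] for u in urls)
```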

Note: This method also works if you already have a list of URLs sitting in a spreadsheet somewhere.

Data on “profile” pages

A lot of times you’ll want to pull information from pages that look like this:

“Profile” page

And most of the time there is a list (or paginated list) that contains links to all of those pages.

List page with links to profiles

In that case you should:

  1. Create an Extractor to the list to grab all the links (you may need to follow the steps in the previous example if it’s across multiple pages)
  2. Create a second Extractor to pull the data from a “profile” page
  3. Run the list of URLs from your first Extractor through your second Extractor

In this case, you’re using the output of one Extractor as the input for another. This effectively chains one Extractor to the other and should get you all the data you need quickly and efficiently.
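
Here is roughly what that chaining looks like in code – a hypothetical sketch where the URLs and CSS selectors are stand-ins for whatever the real site uses:

```python
# Chaining two Extractors: the first pulls profile links off a list page,
# the second turns each profile page into a row of data.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def get_soup(url):
    return BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")


def extract_profile_links(list_url):
    """Extractor #1: the list page -> links to the individual profile pages."""
    soup = get_soup(list_url)
    return [urljoin(list_url, a["href"]) for a in soup.select("a.profile-link")]  # assumed selector


def extract_profile(profile_url):
    """Extractor #2: one profile page -> a row of data."""
    soup = get_soup(profile_url)
    return {
        "url": profile_url,
        "name": soup.select_one("h1").get_text(strip=True),   # assumed selector
        "bio": soup.select_one(".bio").get_text(strip=True),  # assumed selector
    }


# Chain them: the output of Extractor #1 becomes the input of Extractor #2.
rows = [extract_profile(link)
        for link in extract_profile_links("https://www.example.com/people?page=1")]
```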

Anything else

The cases above will probably cover 80-90% of your web data extraction needs. But if none of those cover it, then the crawler should be your fallback. Remember that it’s a true URL discovery tool, and so it will take a long time.

But all is not lost – there are several things you can do to make your crawler more efficient.

Building an efficient crawler

How long the crawling process takes depends on how targeted you make your crawler. You need to define both where you want it to go and – more importantly – where you don’t want it to go. For example, if you are only interested in Men’s clothes, there is no point letting your crawler visit pages with Women’s clothes. It’s a waste of time for you and an unnecessary load on the website. Not to mention it will bring you back a lot of data you don’t actually want, causing more work for you in post-processing.

Here are some of the controls you should look for:

  1. Crawl depth – How many clicks from the start page you want the crawler to travel. For the majority of websites, a crawl depth of 5 should be more than enough.
  2. Crawl exclusions – These are the parts of the site you do NOT want the crawler to visit – essentially, where not to crawl.
  3. Simultaneous pages – The number of pages the crawler will attempt to visit at the same time.
  4. Pause between pages – The length of time (in seconds) the crawler will pause before moving on to the next page.
  5. Crawl URL templates – This is how the crawler determines which pages you want data from (i.e. which ones to feed into the Extractor), so it’s important to make it as specific as possible.
  6. Save log – Crawlers can take a long time and you don’t want to lose your work if something goes wrong along the way. A save log will let you see which URLs were visited and which were converted into data. This log will help you troubleshoot your crawler if something goes wrong with your extraction. In addition, the URLs that were converted into data can be fed straight through an Extractor next time, so you don’t have to re-crawl the site.
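
To show how these controls fit together, here is a sketch using the open-source Scrapy framework – the domain, URL patterns and settings values are illustrative, not a recipe for any particular site:

```python
# One way the six controls above can be expressed, sketched with Scrapy.
# Domain, start URL and regex patterns are placeholders.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ProductSpider(CrawlSpider):
    name = "products"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/men"]

    custom_settings = {
        "DEPTH_LIMIT": 5,          # 1. crawl depth
        "CONCURRENT_REQUESTS": 2,  # 3. simultaneous pages
        "DOWNLOAD_DELAY": 5,       # 4. pause between pages (seconds)
        "LOG_FILE": "crawl.log",   # 6. save log
    }

    rules = (
        Rule(
            LinkExtractor(
                allow=(r"/men/.+/prd/\d+",),   # 5. URL template: pages to extract data from
                deny=(r"/women/", r"/help/"),  # 2. crawl exclusions
            ),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        # The Extractor step: this only runs on pages matching the URL template.
        yield {
            "url": response.url,
            "name": response.css("h1::text").get(),  # assumed selector
        }
```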

Best practices in crawling

Regardless of how targeted you make your crawler, you still need to be aware that every time it visits a page or clicks on a link, that action produces a load on the website. Cause too much of a load and you may end up blocked – or worse, take down the site entirely. So, here are a few guidelines you should stick to when crawling.

  1. Crawl speed – Despite how slow the overall process is, crawlers can move from page to page much faster than a person, because they only need the links, not the content. This means you can set them to go pretty quickly. But the faster you set it, the harder it will be on the server. We recommend waiting at least 5–10 seconds between page loads.
  2. Simultaneous pages – Unlike you, a crawler can visit more than one page at once. Again, the more pages it visits at once, the greater the strain on the website. We recommend only visiting 2–3 pages at a time.
  3. Robots.txt – Some websites use a robots.txt file to tell crawlers which pages not to visit. Some crawlers will let you get around this, but we always recommend respecting robots.txt.
  4. Terms of Service – Crawling is not illegal, but violating copyright is. It’s always best to double-check a website’s T&Cs before crawling it.
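
Putting a couple of those guidelines into code, here is a small politeness sketch that checks robots.txt (via Python’s built-in robotparser) and pauses between requests – the URLs and delay are placeholders:

```python
# A politeness sketch: respect robots.txt and wait between requests so the
# crawl doesn't hammer the site. URLs and delay are illustrative only.
import time
from urllib.robotparser import RobotFileParser

import requests

robots = RobotFileParser("https://www.example.com/robots.txt")
robots.read()

urls = [
    "https://www.example.com/men/shirts?page=1",
    "https://www.example.com/men/shirts?page=2",
]

for url in urls:
    if not robots.can_fetch("*", url):  # skip anything robots.txt disallows
        continue
    response = requests.get(url, timeout=10)
    # ... hand response.text to your Extractor here ...
    time.sleep(7)  # a 5-10 second pause keeps the load on the server reasonable
```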

Building an Extractor or Crawler

Luckily, building both Extractors and crawlers is a LOT easier than it used to be. You don’t even need to know how to code. We’ve created a set of tools (which you can use for free) to help you build APIs and crawlers using a simple point-and-click interface. Check out the video below to find out how to get started.

Or if you’re interested in getting a lot of data from complex websites, contact the data experts on our sales team to see how we can help you get accurate data at scale.
