Crawl sites with infinite scrolling

Getting data from sites with infinite scroll can be somewhat challenging, so I’ve created this guide to help you out. It’s really easy, and you don’t need to be an amazing coder to pull it off; you just need to be a bit of a detective and an Excel wizard.

When should you use this?

The problem lies in the “Load more” button, like the one found on this Kickstarter page.

If you wanted to get the whole list of companies, you’d want to use our Crawler. Unfortunately, when you’re in extraction mode you can’t interact with the site to click the button.

Put on Your Detective Hat

The first thing to do is find where the URLs that hold the data actually live. You can do this with the developer tools built into any modern browser like Chrome or Firefox. Once you have the dev tools open on the Network tab, click the “Load more” button to expose the URL of the request that fetches the next batch of results.

The URL should look like this: https://www.kickstarter.com/discover/advanced?page=2&sort=magic&seed=2340497

Excellent – you can take off the hat now! Now that we have a URL for the data that was previously hidden, we can build a Crawler on that URL and train it as usual. To train more pages, simply change the “page=X” part of the URL (page=2, page=3, etc.).
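If you want to sanity-check those paginated URLs before building the Crawler, a few lines of Python will do it. This is just a sketch using the third-party requests library; the seed and sort values are copied from the example URL above and will differ for your own session.

```python
import requests

# Base URL and query parameters discovered via the browser's dev tools.
# The seed value is copied from the example above; yours will differ.
BASE = "https://www.kickstarter.com/discover/advanced"
PARAMS = {"sort": "magic", "seed": "2340497"}

for page in range(1, 4):
    # Each "Load more" click is really just a GET request with an
    # incrementing page parameter, so we can fetch pages directly.
    resp = requests.get(BASE, params={**PARAMS, "page": page})
    print(page, resp.status_code, len(resp.text))
```

If each page comes back with a 200 status and a sensible amount of content, the URL pattern is good to use.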

Building the links to follow

Once you’ve built your Crawler extraction pattern, it’s best to fill the “Where to Crawl” box with all the links you want the Crawler to visit. To do this efficiently, you can use a handy function in Excel.

Time to get your Excel hat on.

First, separate the URL into its three distinct parts: everything before the page number goes in cell A1, the page number itself in B1, and everything after it in C1.

Next, fill column B with sequential page numbers, one row for each page you need from the site.

You will also need to fill columns A and C by copying the values from A1 and C1 down each column.

Finally, in D1 type this formula: =CONCATENATE(A1, B1, C1).

To fill all of column D with the formula, click the small box in the bottom-right corner of cell D1 (the fill handle) and drag it down.

This will generate a comprehensive list of URLs you want to crawl.
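If you’d rather skip the spreadsheet, the same list can be built with a short Python script. This is a sketch based on the example URL above; swap in your own prefix, suffix, and page count.

```python
# Build the full list of page URLs to paste into the Crawler settings.
# The prefix and suffix mirror columns A and C in the spreadsheet,
# and the page range mirrors column B; adjust all three for your site.
prefix = "https://www.kickstarter.com/discover/advanced?page="
suffix = "&sort=magic&seed=2340497"

urls = [prefix + str(page) + suffix for page in range(1, 51)]
print("\n".join(urls))
```

Either way, the result is the same plain list of URLs, ready to paste.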

You can then paste these URLs into the “Where to start”, “Where to crawl”, and “Where to extract data from” boxes in the Advanced section of the Crawler settings.

Time to Crawl

You can then run the Crawler with a page depth of 0. It will visit only the URLs you specified, letting you get all of the data on the companies from Kickstarter and neatly getting around the infinite scrolling issue!

You can take your hat off now…

For a full step-by-step on how to apply this method, see our tutorial.

Check out other great tutorials and use cases in our Knowledge Base.

Comments

Nice example. I’ve been trying to train a Connector that gets the info from a site with infinite scroll. I used your technique to access the second page, but it seems the Connector doesn’t know how to do it on its own. Is that possible? Am I doing something wrong? Thanks a lot

Hi Santi,
Without seeing the site you’re talking about, it’s hard to know for sure. If you shoot us an email with more details at support@import.io, someone will be able to look into it more closely :-).

Hi Alex,

Thanks for the great post. However, this doesn’t cover the scenario of progressive rendering, i.e. infinite scroll where the user doesn’t need to click “Load more” at all. What would be the option for such sites?

Many thanks in advance for your help

Hi Selwyn,

The basic principle should still apply. You will just need to look a little harder in the developer tools to find the URL that relates to page 2. If you need any more specific help, please get in touch with us at support@import.io.
