Getting data from sites with infinite scroll can be challenging, so I’ve created this guide to help you out. You don’t need to be an amazing coder to do it, just a detective and an Excel wizard.
When should you use this?
The problem lies in the “Load more” button, like the one found on this Kickstarter page: only the first batch of results is in the page, and the rest appears when the button is pressed.
If you wanted to get the whole list of companies, you’d want to use our Crawler. Unfortunately, when you’re in extracting mode you can’t interact with the site to press the button.
Put on Your Detective Hat
The first thing to do is find where the URLs that hold the data actually live. You can do this with the Inspect Element function in any modern browser, such as Chrome or Firefox. Once you have the dev tools open, switch to the Network tab and click the “Load more” button; the request that appears in the list exposes the URL.
The URL should look like this: https://www.kickstarter.com/discover/advanced?page=2&sort=magic&seed=2340497
Excellent – you can take off the hat now! Now that we have a URL for the data that was previously hidden, we can build a Crawler on that URL and train it as usual. To train more pages, simply change the “page=X” part of the URL (page=2, page=3, etc.).
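If you want to see the “change the page number” step spelled out, here is a minimal Python sketch. It assumes the page number lives in a query parameter called page, as in the example URL above; the helper name with_page is made up for illustration and is not part of the Crawler.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url: str, page: int) -> str:
    """Return `url` with its `page` query parameter set to `page`.

    Hypothetical helper for illustration. Assumes each query parameter
    appears only once, which is true for the example URL.
    """
    parts = urlsplit(url)
    # Keep the other parameters (sort, seed) exactly as they were.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["page"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


example = "https://www.kickstarter.com/discover/advanced?page=2&sort=magic&seed=2340497"

for n in (2, 3, 4):
    print(with_page(example, n))
```

Only the page value changes between the printed URLs; the sort and seed values stay whatever your browser showed you.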
Building the links to follow
Once you’ve built your Crawler extraction pattern, it’s best to fill the “Where to Crawl” box with all the links you want the Crawler to visit. To do this efficiently, you can use a handy function in Excel.
Time to get your Excel hat on.
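First, separate the URL into its 3 distinct parts: the text before the page number goes in A1, the page number itself in B1, and the text after it in C1.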
Next, fill column B with as many sequential numbers as you need pages from the site.
You will also need to fill columns A and C by copying the URL parts from A1 and C1, respectively, down into every row.
Finally, in D1 type this formula: =CONCATENATE(A1, B1, C1).
To fill all of column D with the formula, click the small square in the bottom right-hand corner of cell D1 and drag it down.
This will generate a comprehensive list of URLs you want to crawl.
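If you would rather skip the spreadsheet, the same list can be built in a few lines of Python. This is only a sketch of the CONCATENATE approach above: PREFIX and SUFFIX play the role of columns A and C, the page range plays the role of column B, and the function name build_urls is made up for illustration. The seed value is the one from the example URL, so swap in whatever your browser showed you.

```python
from typing import List

# Columns A and C from the spreadsheet: the fixed text either side of the page number.
# These values come from the example URL and are assumptions for your own site.
PREFIX = "https://www.kickstarter.com/discover/advanced?page="
SUFFIX = "&sort=magic&seed=2340497"


def build_urls(prefix: str, suffix: str, first_page: int, last_page: int) -> List[str]:
    """Join prefix + page number + suffix for every page in the range (inclusive)."""
    return [f"{prefix}{page}{suffix}" for page in range(first_page, last_page + 1)]


urls = build_urls(PREFIX, SUFFIX, 1, 5)

# One URL per line, ready to paste into the Crawler's boxes.
print("\n".join(urls))
```

Change the last two arguments to cover as many pages as you need.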
You can then paste these URLs into the “Where to start”, “Where to crawl” and “Where to extract data from” boxes in the Advanced section of the Crawler settings.
Time to Crawl
You can then run the Crawler with a page depth of 0, so that it goes through only the URLs you specified. This lets you get all of the data from the companies on Kickstarter and gets you around the infinite scrolling issue!
You can take your hat off now…
For a full step-by-step on how to apply this method, see our tutorial.
Check out other great tutorials and use cases in our Knowledge Base.