Extract live pricing data from the web

For this webinar, Chris and I pretend to be the owners of a clothes shop – we’ll call it Chris and Alex Inc – C&A for short! Because Chris and I are savvy businessmen, we know that we need to compare the prices our competitors are selling their products at so we can be competitive. Traditionally, we’d have to get someone to manually go through the whole website and note down (on paper?) the price and item name, leaving us with thousands of pieces of paper and some poor person in admin sorting it all out – not cool.

Luckily for us, we discovered import.io, which lets us get real-time data from the comfort of our desktop.

There are two ways to do this:

The First Way: Combine Connectors into a Mix

A Connector uses page interactions to get data, meaning you can record an action (such as a search) and import.io will extract the resulting data. You can then query that search box from the dataset, allowing you to get data, fast.

The first thing I showed you was how to record a search in the Connector workflow for socks on Next.co.uk. Remember that by default we have JavaScript turned off (it’s faster that way), but if you need JavaScript to see your data, simply click the “Not Working” button and we’ll turn it back on.

Tip: if you’re going to combine multiple Connectors, you’ll need to use the same schema (i.e. column names and search inputs) for each one.

With import.io Connectors you can get up to 10 pages of results, so if you’re interested in more than one page make sure you say “Yes, please” to pagination. The total results counter is less important, but if there is one, it’s a good idea to go ahead and train it.

When you’re training your columns, it’s a good idea to make use of our different column types. You can even train the same piece of data on the page as two different things. So for example, I can train one column for the name of the product as text and another one for it as a link.

Then you need to train pagination by going to the second page (by clicking the “next” button) and checking that import.io has pulled back the data correctly.

Once you’ve uploaded your data to import.io, you can query it directly from the Dataset page. Every time you press query, we’ll go to the site and bring back the live data!
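You can also hit a Connector programmatically rather than through the Dataset page. Here’s a minimal Python sketch of building such a query URL – the endpoint path and parameter names are illustrative assumptions, not documented values, so substitute the ones from your own account:

```python
from urllib.parse import urlencode

def build_query_url(connector_id, search_term, api_key):
    # The base URL and parameter names below are illustrative
    # assumptions -- replace them with the real values from your
    # import.io account before using this.
    base = "https://api.import.io/store/connector/{}/_query".format(connector_id)
    params = urlencode({"input/search": search_term, "_apikey": api_key})
    return "{}?{}".format(base, params)

url = build_query_url("abc-123", "socks", "MY_KEY")
print(url)
```

Every request built this way corresponds to one “query” press in the browser: the site is visited live and fresh data comes back.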

Now, Chris and I realize of course that we have more than one competitor, so we’ll create a Mix with all of our competitors. Simply click “Add New Mix”, choose the Connectors you built and you’ll be able to search for one term across all the sites!

Here it is: Price Comparison Mix

The Second Way: Crawl Your Competitors

Crawlers are one of our most powerful tools and allow you to turn entire websites into structured data incredibly quickly. Chris and I are going to use the Crawler to get as many products as possible, which we can then host in a central database and query with an Elasticsearch-style search to pull out the products that we want to compare.

So, as usual, I built a Crawler pointed at the product pages on Asos! Building a Crawler is pretty much the same as building a Connector, except you don’t have to record a query. For this Crawler, we’re extracting data from each product’s individual page, which means that we need to use the “Single Results” extraction pattern. And that of course means that we can go straight to adding columns.

While we were building our Crawler, Chris showed you how to use one of our advanced features: the manual XPath override, which lets you get data that you can’t necessarily see on the page. This tool is also great for getting at data that moves around from page to page.

Now, the Crawler only requires you to train 5 pages in order to work out the extraction pattern, but I always recommend training more. It just makes it more likely that we’ll get the extraction right.

Before you run the Crawler, go ahead and look at the Advanced settings, especially the URL template that import.io generates for you. This is also the first thing to check if your Crawler isn’t returning as much data as it should. You can learn more about our Advanced Crawler Settings in this tutorial.
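To see why the URL template matters, here’s a small stand-alone sketch of the idea – pages whose URLs match the template get extracted, everything else gets skipped. The site and pattern here are made-up examples, not anything import.io generates:

```python
import re

# Hypothetical template: product pages that look like /prod/<numeric id>.
url_template = re.compile(r"^https?://www\.example-shop\.com/prod/\d+$")

urls = [
    "https://www.example-shop.com/prod/12345",  # matches -> data extracted
    "https://www.example-shop.com/sale",        # no match -> page skipped
]

matched = [u for u in urls if url_template.match(u)]
print(matched)
```

If your Crawler is skipping pages you expected, an over-narrow template like this is the usual culprit.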

If we let this run for a little while, we would have lots and lots of data – and I mean loads of data. Then you just pop that into a database and query it using an Elasticsearch-style search tool, saving you the time and money it takes to get the data that you need.
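As a toy stand-in for that “crawl, store, search” workflow, here’s what the query step might look like in Python. The product rows are invented sample data, and a simple keyword filter plays the role of the Elasticsearch query:

```python
# Invented sample rows standing in for the crawled product database.
products = [
    {"site": "asos.com", "name": "Ankle socks 5-pack", "price": 7.50},
    {"site": "next.co.uk", "name": "Wool socks", "price": 9.00},
    {"site": "asos.com", "name": "Leather belt", "price": 20.00},
]

def search(rows, keyword):
    """Return rows whose name contains the keyword, cheapest first."""
    hits = [r for r in rows if keyword.lower() in r["name"].lower()]
    return sorted(hits, key=lambda r: r["price"])

for hit in search(products, "socks"):
    print(hit["site"], hit["price"])
```

In a real setup you’d swap the list for a database table or Elasticsearch index, but the comparison logic – filter by product, sort by price – is exactly this.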

Question Time

Do you have to do this in the browser each time?

You do have to build each of your data sources using our tool, but once you’ve built them you can access your Datasets from any browser! This means you can create and query a Mix from anywhere you want.

Is it possible to run a Crawler in which you feed in all the URLs directly?

Yes. You can tell your Crawler exactly where you want data from by pasting all the URLs into the “Where to crawl” box and setting the page depth to 0. In that case, the Crawler will only visit the pages that you specified. This should also make your crawl much quicker.
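If the pages you want follow a predictable pattern, you can generate that URL list instead of typing it out. A quick sketch – the URL pattern here is hypothetical, so adapt it to your target site:

```python
# Generate an explicit list of page URLs to paste into "Where to crawl".
# The base URL and ?page= pattern are hypothetical examples.
def page_urls(base, pages):
    return ["{}?page={}".format(base, n) for n in range(1, pages + 1)]

urls = page_urls("https://www.example-shop.com/socks", 3)
print("\n".join(urls))
```

Paste the printed list into the “Where to crawl” box, set the depth to 0, and the Crawler will visit exactly those pages and nothing else.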

Can I get around robots.txt using import.io?

Nope. We try to support both the site owners and the people who need data, and that means that we obey all robots.txt files.

What are the legalities of extracting prices?

The legalities of extracting data from the web can be a bit confusing. In general though, so long as you are obeying the Terms and Conditions of the site you are pulling data from, you should be fine.

How do you stop sites from blocking you?

We’ve designed our extraction to be as gentle on the site as possible. As long as you are respectful of a site and don’t pass through a lot of queries at once or run your Crawler at maximum speed, it is relatively unlikely that you will get blocked.

Does import.io allow you to scrape anonymously?

We use proxy servers to process your data extraction requests. Site owners will be able to see that someone is using import to collect data, but they will not see any of your user information. Internally, we can match data request times with your User GUID, which means we can work out who was getting data from which site and when.

Is it possible to schedule the Crawler?

If you use the Crawler over the command line you can schedule it to run whenever you want. For more on how to do that, visit our tutorial.
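As a sketch of what that scheduling might look like on Linux or macOS, here’s a cron entry for a nightly run. The command shown is a placeholder – the real command-line invocation is covered in the tutorial:

```shell
# Crontab entry: run the crawler every night at 2am.
# "importio-crawler" and its flags are placeholders, not the real
# command -- see the command-line tutorial for the actual invocation.
0 2 * * * /usr/local/bin/importio-crawler --crawler my-crawler-id >> /var/log/crawl.log 2>&1
```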

Next Time

There’ll be no webinar next week, since we’re all going to be away at our company offsite. But you can join me and Chris again in two weeks’ time (July 29) for our next webinar, all about how we used import.io and Google’s new voice activation API to create a super cool hack project! You can sign up for that one here.
