For today’s webinar I am joined by my favorite Northerner and technical marketing expert, Dan Cave. In light of all the new signups we got after our last joint webinar with infogr.am, we thought it’d be a good idea to give you guys a quick overview of all the different data extraction tools we have to help you get data from the web.
The first, and simplest, way to get data from the web is by using our Extractor tool. Extractors are really great for when you want to turn just one page into structured data.
The first decision you need to make when mapping any page of data is whether the page you are structuring has single or multiple rows. Single row is for things like product pages, where there is only one subject you want to extract data about. Multiple rows, on the other hand, are great for things like results pages, where there are many different products on the same page.
For this Extractor I’m only interested in data from this one BBC article, which means I want to use single row. Remember that choosing single row means I don’t have to train rows at all and can go straight to training columns.
I’m going to get the title (text), picture (image), first paragraph (text) and date (date format). When you’re training your columns, make sure you check out the column field options. If you’re using the date field, for instance, you need to tell us what the date format is by using the drop down menu and choosing the right format codes.
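To see why the date format matters, here is a minimal Python sketch (not our own tool's code) that parses a date string with an explicit format. The sample strings and the `strptime`-style codes are assumptions for illustration; the codes in the drop down menu may be written differently.

```python
from datetime import datetime

def parse_date(raw: str, fmt: str) -> datetime:
    """Parse a date string using an explicit format code."""
    return datetime.strptime(raw, fmt)

# Hypothetical date text as it might appear on an article page.
article_date = parse_date("12 August 2014", "%d %B %Y")
print(article_date.date().isoformat())

# The same digits mean different days under different formats.
uk = parse_date("03/04/2014", "%d/%m/%Y")  # day first: 3 April
us = parse_date("03/04/2014", "%m/%d/%Y")  # month first: 4 March
print(uk.date().isoformat(), us.date().isoformat())
```

The second pair shows the risk of picking the wrong format: the text parses without any error, but the date that comes out is a different day.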
Once you’ve finished building your extractor, you can access it in a dataset. From here you can either refresh it to get the latest data, or you can add another URL to a different BBC story and we’ll turn that story into data.
Here it is: BBC Article Extractor
Auto Table Extract
One of the quickest ways to get data is by using our newest beta feature, Auto Table Extract. Whenever you come across a table on a webpage, it should turn green when you detect optimal settings, which means we can extract it automagically!
All you need to do is click on the table to select it and then hit the train button and all that lovely data will be extracted without you needing to train any rows or columns.
Oh hey look: BBC Football Table
Integrating your data
While we’ve got some awesome premier football data (as evidenced by the fact that Arsenal is at the top), I thought it’d be a good idea to show you how to integrate it into Google Sheets. From your dataset page, simply hit the integrate button, select Google Sheets from the side bar and pop your password in the box to get your API key. Then copy and paste the formula we generate into an empty cell in a Google Sheet. Now in a week’s time, when the matches start, all I need to do is come back and refresh my Google Sheet and I’ll have all the latest stats in my sheet.
Crawlers are by far the most popular tool we have in our toolbox. They are great for getting data from all the pages on a particular website. As usual, I built a quick Crawler for Asos to get product information for all the products on their site.
The main difference between training a Crawler and an Extractor is that with a Crawler you need to train 5 example pages. This training helps us to generate a URL pattern, which we then use to traverse the site and look for any pages that contain the same data pattern that you mapped. Training 5 pages should be a pretty painless process, because the tool should be able to pull the data in automatically without you needing to do any actual training.
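If you’re curious how a URL pattern can come out of a handful of examples, here is a rough Python sketch of the general idea. It is not how our Crawler actually works under the hood; the `build_url_pattern` helper and the `shop.example.com` URLs are made up for illustration, and it assumes every example page sits at the same depth on one site.

```python
import re
from urllib.parse import urlparse

def build_url_pattern(examples: list[str]) -> re.Pattern:
    """Keep path segments shared by every example; wildcard the rest."""
    parsed = [urlparse(u) for u in examples]
    hosts = {p.netloc for p in parsed}
    if len(hosts) != 1:
        raise ValueError("examples must come from one site")
    paths = [p.path.strip("/").split("/") for p in parsed]
    if len({len(p) for p in paths}) != 1:
        raise ValueError("this sketch needs paths of equal depth")
    parts = []
    for segments in zip(*paths):
        if len(set(segments)) == 1:
            parts.append(re.escape(segments[0]))  # same on every page
        else:
            parts.append(r"[^/]+")  # varies from page to page
    host = re.escape(hosts.pop())
    return re.compile(r"^https?://" + host + "/" + "/".join(parts) + r"/?$")

# Five hypothetical product pages, standing in for the 5 trained examples.
examples = [
    "http://shop.example.com/product/red-dress/1001",
    "http://shop.example.com/product/blue-jeans/1002",
    "http://shop.example.com/product/white-shirt/1003",
    "http://shop.example.com/product/black-boots/1004",
    "http://shop.example.com/product/green-scarf/1005",
]
pattern = build_url_pattern(examples)
is_product = bool(pattern.match("http://shop.example.com/product/grey-hat/2040"))
is_help = bool(pattern.match("http://shop.example.com/help/returns"))
print(is_product, is_help)
```

The more the examples differ in the parts that genuinely vary (product name, ID), the easier it is to tell those parts from the fixed ones, which is one reason a single example would not be enough.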
When you get to the run Crawler section, there are a few different settings you can change to make your crawl a bit faster and more efficient. Here you can put in your own URLs for “Where to Start” and “Where to Crawl” as well as some more advanced options.
Once you’ve checked your crawler settings all you have to do is hit “Go” and watch the data come back! And in just a few minutes I got back over 100 rows of data.
See for yourself: Asos Crawler
Connectors are one of my favorite tools to use because they let you record yourself interacting with a site to get data. For example, I can go to the Tesco website, record myself searching for “beer” (always a classic) and then I can extract all the data from the search results. Connectors also let you train pagination (up to 10 pages), which is great for when you want to get a load of search results.
The coolest bit – I always think – is when you add an example search and the Connector actually types in your search and clicks “Go” – all by itself! Once you’ve finished training your Connector, you can also search (or “Query” as we like to call it) it from the Dataset page.
The real deal: Let’s Go Tesco Connector
That’s all pretty awesome, but the absolute best thing about Connectors is that you can combine several of them into what we call a Mix, and then search one term across multiple sites.
How can I extract data from a page where I need to click a drop down and make a selection to see the data?
The great thing about Connectors is that they aren’t just for search boxes. You can record any kind of interaction you want, whether that be making a drop down selection, typing in a box or selecting a date.
Can I extract a price range?
If the price you are trying to extract is a range or has a regular and a sales price, you have a few options. You can map one or both prices in the same column by highlighting one or both and pressing train. If you highlight both prices, they will be displayed on different lines within the same column. You can also map each price in a separate column, by highlighting just the price you want. To highlight a very specific piece of data (like the max price in a range), simply drag your cursor over it the same way you would highlight something in Word.
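If you do end up with both prices in one column, you can also split them after extraction. Here is a minimal Python sketch of that post-processing step. It is not a feature of the tool; the `split_prices` helper and the sample cell values are assumptions for illustration.

```python
import re
from decimal import Decimal

# Matches numbers like 12, 12.00 or 1,299.99 and ignores currency symbols.
PRICE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def split_prices(cell: str) -> tuple[Decimal, Decimal]:
    """Return (lowest, highest) price found in one extracted cell."""
    values = [Decimal(m.replace(",", "")) for m in PRICE.findall(cell)]
    if not values:
        raise ValueError(f"no price found in {cell!r}")
    return min(values), max(values)

# A range on one line, and two prices on separate lines in the same column.
range_low, range_high = split_prices("£12.00 - £18.50")
sale_price, regular_price = split_prices("£40.00\n£25.00")
print(range_low, range_high)
print(sale_price, regular_price)
```

Using `Decimal` rather than floats keeps the prices exact, which matters if you plan to compare or total them later.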
Is there a maximum number of rows you can import?
There’s no limit to the number of rows you can extract. However, we do limit you to an upload of 25 MB.
Is there a way to get around the 10 page pagination limit?
If you’re using your Connector over the API you can set both the max pages you extract and the start page. So for example you could send two requests, one with the start page set to 1 and the other with the start page set to 11. If both have the max pages set to 10 you should get back 20 pages. Another good tip, especially for product pages, is to also record yourself changing the “number of results shown” to the max for that website. This will get more data onto each page and hopefully reduce the number of pages of results to 10 or fewer.
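The batching itself is simple arithmetic, sketched below in Python. This only works out which pages each request should cover; it does not call the API. The names `start_page` and `max_pages` are placeholders rather than the real parameter names, and the sketch assumes pages are numbered from 1 and that each request returns consecutive pages, so the second batch begins at page 11 and no page is fetched twice.

```python
def page_batches(total_pages: int, max_pages: int = 10, first_page: int = 1):
    """Split a long run of result pages into request-sized batches."""
    batches = []
    start = first_page
    last = first_page + total_pages - 1
    while start <= last:
        size = min(max_pages, last - start + 1)
        # Hypothetical parameter names; check the API docs for the real ones.
        batches.append({"start_page": start, "max_pages": size})
        start += size
    return batches

twenty_pages = page_batches(total_pages=20)
for batch in twenty_pages:
    print(batch)
```

Each dictionary describes one request, so 20 pages become two requests and 25 pages become three, with the last one covering only the 5 pages left over.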
What do I do if incorrect data appears in any of the columns?
If for any reason you map a column incorrectly you can hit the undo button to undo your last action. You can also use the trash icon to ditch all the mapping for that column and start again.
How do I extract an embedded video?
The best way to extract videos is to extract them as a link. In some cases, depending on how the video has been embedded, this may require you to write a custom XPath.
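To give a feel for what such an XPath is doing, here is a small Python sketch that pulls the link out of an embedded video. The markup is made up and much tidier than a real page, and the standard library’s `ElementTree` only understands a subset of XPath. With a full XPath engine the equivalent expression would be `//iframe/@src`; whether that exact expression suits your page depends on how the video has been embedded.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a page with an embedded video.
page = """
<html>
  <body>
    <div class="story">
      <h1>Example headline</h1>
      <iframe src="https://video.example.com/embed/abc123" width="640"></iframe>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(page)
# Find every iframe that has a src attribute and keep the link it points to.
video_links = [frame.get("src") for frame in root.findall(".//iframe[@src]")]
print(video_links)
```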
How can you edit the dataset once you have it uploaded?
From either your My Data page or within the Dataset itself simply click on the edit icon and you will be launched back into the workflow where you can edit your Extractor/Crawler/Connector to your heart’s content.
What are some examples of uses for import.io?
There are thousands of things you can do with the data you collect using our tools. Our blog is full of user stories and use cases to help inspire you. Here are a few to get your gears turning:
- LeadChat: Using Data to Generate Sales Leads
- Roadless: Find Cool Things To Do When You Travel
- 87seconds: Get Animation Inspiration
- Recommend: Get High-quality Recommendations
- ThatGift: Build Your Own Affiliate Site
- Tableau: Visualize Your Data
We love hearing new and innovative ways people are using data! If you’d like to share your story, email us at email@example.com and you could be featured on our blog!
Join us next time
I’ll be taking a break next Tuesday (Aug 19th) and placing you in the capable hands of our Co-founder and Product Evangelist, Andrew Fogg. He’ll be showing you a cool workaround which lets you use Google Sheets to submit multiple queries to any Connector or Extractor.