Become a data extraction master

At import.io we have a lot of different options and tools for getting data from the web, and navigating them can sometimes be a bit tricky. In this webinar, I take you on a comprehensive journey through all that import.io has to offer. From simply pasting a URL to automating actions on a page, by the time you’re done watching this video you will be a data extraction master!

Magic – a quick fix

If you’re totally new to extracting data (or short on time), Magic is the perfect tool for you. It runs on our website, so there’s nothing to install or download – you don’t even need to train anything! To use Magic all you do is copy and paste the URL of the site you want data from into Magic and press “Go”.

Our extraction algorithms will go to that page and pull out the data from the primary list on the page (the one we detect as having the most data in it) into a table. You can then download that data (up to 20 pages) or use it over an API – including a Google Sheets integration.
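On the API side, using your Magic data boils down to an HTTP request that returns your rows as JSON. Here’s a minimal Python sketch – the endpoint URL, parameter names and response shape below are illustrative assumptions, not import.io’s documented API:

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only -- your account's API
# docs show the real query URL to use.
MAGIC_API = "https://api.example.com/magic"

def magic_query_url(page_url, api_key):
    """Build the query URL: the page you want data from, plus your key."""
    return MAGIC_API + "?" + urlencode({"url": page_url, "_apikey": api_key})

def rows_from_response(body):
    """Parse a JSON response body into a list of row dicts."""
    return json.loads(body).get("results", [])

print(magic_query_url("http://example.com/products", "MY-KEY"))
```

Fetching that URL (with `urllib.request` or any HTTP client) and feeding the body to `rows_from_response` gives you the same table you see on the site.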

Create a word cloud

We’ve been really into word clouds this past week. Using Magic, you can create one in less than a minute – and most of that time is spent playing with fonts! Here’s one I made during the webinar from my friend’s blog, Whiskey Times.

Step-by-step word cloud guide

Extractor – create an API

An extractor is a lot like Magic in that it allows you to turn a URL into an API. The difference is that the extractor is far more powerful and can extract many more types of data (including single pages). To access this functionality, you’ll need to download our free web app – and, in fact, everything we do from here onwards will be in the app.

Building an extractor, like this one I built to get data on NFL Super Bowls, is still pretty easy and quick to do. You just name your column and click on the first bit of data in that column. The extraction algorithms will use that information to figure out the rows and extract the rest of the data in that column as well. Then, you just repeat the process until you have all the data you want. 
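The end result of that training is just a plain table: the columns you named, one row per repeated item. As a rough sketch (with made-up values, not the real Super Bowl figures), the extracted data maps naturally onto CSV:

```python
import csv
import io

# Columns you named in the extractor, and rows its algorithms pulled out
# (illustrative values only).
columns = ["game", "attendance", "prize_money"]
rows = [
    {"game": "I", "attendance": 60000, "prize_money": 15000},
    {"game": "II", "attendance": 75000, "prize_money": 15000},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text)
```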

There are also a number of advanced options for the extractor, such as setting an XPath override and skipping unwanted rows, but those are for another time.

Step-by-step Extractor guide

Graph your data

Once you’ve got some data, a great thing to do is graph it with our built-in Plot.ly integration. Just click on the Plot.ly tab on the My Data page, select the columns you want to graph and then click “Export to Plot.ly”. Plot.ly will open in a new tab where you can select the type of graph you want and choose your X and Y axes. Here’s a plot of my Super Bowl data…

attendance vs prize_money

Step-by-step Plot.ly integration guide
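Under the hood, “select the columns you want to graph” just means pulling two parallel lists out of your table, one per axis. A quick sketch with placeholder numbers standing in for the extracted Super Bowl rows:

```python
# Placeholder rows standing in for the extracted Super Bowl table.
rows = [
    {"attendance": 61000, "prize_money": 15000},
    {"attendance": 75000, "prize_money": 15000},
    {"attendance": 79000, "prize_money": 18000},
]

# One list per axis, in row order -- this is what the plotting
# integration hands over as your X and Y series.
x = [row["attendance"] for row in rows]
y = [row["prize_money"] for row in rows]
assert len(x) == len(y)  # every point needs both coordinates
```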

Crawler – build a large data set

A crawler is great for when you want to get lots of static data from an entire website. You just have to train it on 5 pages and it will learn what your data looks like based on the URL pattern it generates from your training data. Then the crawler will travel through the links on those pages to find all the pages with similar data and extract those too.

Step-by-step crawler guide
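The “URL pattern” idea can be sketched in a few lines: take the part your training URLs share and treat the rest as a wildcard, then keep only discovered links that fit. This is a rough illustration of the concept, not import.io’s actual algorithm:

```python
import os.path
import re

def pattern_from_examples(urls):
    """Derive a crude URL pattern: keep the prefix shared by the training
    pages and wildcard the final segment that varies."""
    prefix = os.path.commonprefix(urls)
    return re.compile(re.escape(prefix) + r"[^/]+$")

training = [
    "http://example.com/jeans/item-1",
    "http://example.com/jeans/item-2",
]
pattern = pattern_from_examples(training)

# A newly discovered link is crawled only if it matches the pattern.
print(bool(pattern.match("http://example.com/jeans/item-999")))
```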

During the webinar, I showed you how to build a crawler for Asos to extract all of their jeans. I also gave you a crash course in the settings you can tweak before you run your crawler to make it ultra-efficient. Efficient crawlers are good for you because they take less time, and good for the website because they create less server load – which means you don’t get blocked.

Step-by-step crawler settings guide

Connector – automate actions

A connector allows you to automate actions on a website. For example, if I wanted data for different products on Tesco, I could build a connector by recording myself searching for a product, say bread, and extracting the corresponding data. Then, using the API I created, I can search for any of their other products and get back the corresponding data automatically (with no further training).

Step-by-step connector instructions
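In other words, the connector turns your recorded search into a parameterised API: change the input, get fresh data. A minimal sketch of the query-building side – the endpoint path and parameter name here are assumptions for illustration, not the documented API:

```python
from urllib.parse import urlencode

def connector_query_url(connector_id, search_term, api_key):
    """Build a query URL for a connector, substituting any search term
    for the one recorded during training (hypothetical URL shape)."""
    base = "https://api.example.com/connector/%s/_query" % connector_id
    return base + "?" + urlencode({"input": search_term, "_apikey": api_key})

# Same connector, different product -- no retraining needed.
print(connector_query_url("my-tesco-connector", "bread", "MY-KEY"))
```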

Integrating your data with Google Sheets

At import.io, we have a number of integration options, but the simplest and easiest to use is Google Sheets. If you click on the GS tab on the My Data page, you’ll notice there is a formula. To make it work, all you need to do is enter your password (to retrieve your API key) and paste the formula into a cell in Google Sheets.

Then, with a little GS magic, I can modify the formula so that instead of just returning the training data from import.io, it takes a cell reference as the connector input.

Step-by-step Google Sheets integration guide
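That “GS magic” amounts to string-splicing: instead of hard-coding the recorded input into the formula’s query URL, you concatenate a cell reference in its place, so changing the cell re-runs the query. Here’s an illustrative Python sketch of how the two formula variants differ – the formula shape is an assumption, not import.io’s exact template:

```python
def sheets_formula(query_url, api_key, input_cell=None):
    """Build an IMPORTDATA-style formula string. With input_cell set,
    the cell reference is spliced into the URL so changing that cell
    re-queries the connector (illustrative formula shape)."""
    if input_cell is None:
        # Static: always fetches the training data.
        return '=IMPORTDATA("%s?_apikey=%s")' % (query_url, api_key)
    # Dynamic: Sheets concatenates the cell's value into the URL.
    return '=IMPORTDATA("%s?input="&%s&"&_apikey=%s")' % (
        query_url, input_cell, api_key)

print(sheets_formula("https://api.example.com/query", "MY-KEY", input_cell="A1"))
```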

Join us next time…

Next week we have some very special guest hosts. The team from text analysis startup MonkeyLearn will be here to create an employment analytics visualization. Ever wondered which city has the most arts jobs? Or recruitment openings? Will your career be better off in NYC or SF? Turns out you can answer these and many more questions by doing some simple data analysis. Using three awesome free tools (import.io, MonkeyLearn and Plot.ly), we can obtain, categorize and visualize all the data we need in just 10 minutes!

