More tips & tricks for extracting data from the web

Tips & Tricks: The Sequel, an import.io Webinar Production

We went old school for this week’s webinar, bringing back the usual suspects: myself (Alex) and our Developer Experience Engineer, Chris A. Since the last Tips & Tricks webinar was so popular (sold out in fact), we thought we’d do it again – this time with a moustache! These webinars are all about you, so our main aim for this one was to answer as many of your questions as possible.

If you have more questions as you’re using the tool, you can always click on the little pink question mark on our site or in the app and type your question in the box. It’ll search our knowledgebase for you, and if you can’t find what you’re looking for you can submit your question to me and the rest of the support team.

Extractors

I’m taking it back to basics to start with, by showing you how to build a simple Extractor. An Extractor turns pages that look the same into data. So you should be able to train it on one page, then give it another URL from the same site and have that page turned into data automagically (see what I did there?).

Tip: make sure you choose the right option when you get to the single or multiple rows stage. It’s an essential, but often overlooked, step in creating an API. If you only have one thing on the page, like a product page or a news story, you want single row. If you have a list of things, like when you search for “jeans” in Asos, you want multiple rows.

Tip: import.io has a number of different column formats to make normalising and combining your data super easy. One of the most useful is the date format. When you use this column type, you need to tell import.io how that source writes its dates. The drop-down menu helps you find the right keys for the date format. For example, March 15, 2010 is MMMM d, yyyy.
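To see what a date pattern like this does, here is a small Python sketch. It is not import.io code: Python’s `strptime` uses its own directives, so `%B %d, %Y` stands in for the MMMM d, yyyy pattern (full month name, day of the month, four-digit year). The helper name `normalise_date` is made up for this example.

```python
from datetime import datetime

# Python equivalent of the pattern "MMMM d, yyyy":
# %B = full month name, %d = day of the month, %Y = four-digit year.
SOURCE_FORMAT = "%B %d, %Y"

def normalise_date(text, source_format=SOURCE_FORMAT):
    """Parse a date written in the source's format and return it as YYYY-MM-DD."""
    return datetime.strptime(text.strip(), source_format).date().isoformat()

print(normalise_date("March 15, 2010"))  # 2010-03-15
```

Once every source is parsed with its own pattern, all the dates end up in one format and can be sorted or combined.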

When you’re building an Extractor, you can test that import.io has understood the extraction pattern by adding another page to the training. When you go to another page, import.io should be able to get all the data automatically. You can also do this from your Dataset page by clicking Add URL, pasting in another URL and clicking the “Refresh” button.

Here is the Premier League Extractor I showed you.

Tip: if you click the little plus icon in the top left hand corner of your Dataset table you’ll be able to see which source the data is from and its URL.

API Integration

Then, Chris showed you how to integrate the Extractor I just built over the API using our Integrate page. You just choose what language you want to integrate with (or Excel/Google Sheets) and we generate all the example code and query objects you need to plug it into your app. We even dynamically generate the code based on the data you’ve got in your Dataset page! The best thing about querying over our APIs is that we never cache your data, so when you access it over the API, import.io is performing those queries live!

Crawlers

Another great tool that import.io has is the Crawler. A Crawler lets you get all the data from a site just by training it on a few example pages. This is great if you are creating a database or you need a bunch of static data. For this Crawler I was getting some prices, and you’ll notice that I chose currency as the column type and the app knows that “£” means GBP.
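If you want to picture what a currency column does with a price, here is a rough Python sketch. It is an illustration only, not how import.io implements it: the symbol table and the `parse_price` helper are invented for the example, and treating “$” as USD is an assumption.

```python
import re
from decimal import Decimal

# Hypothetical symbol-to-code table; "$" is assumed to mean USD here.
CURRENCY_CODES = {"£": "GBP", "$": "USD", "€": "EUR"}

def parse_price(text):
    """Split a price string such as '£1,299.50' into (Decimal amount, currency code)."""
    match = re.search(r"([£$€])\s*([\d,]+(?:\.\d+)?)", text)
    if match is None:
        raise ValueError(f"no recognisable price in {text!r}")
    symbol, digits = match.groups()
    # Drop thousands separators before converting to a number.
    return Decimal(digits.replace(",", "")), CURRENCY_CODES[symbol]

print(parse_price("£35.00"))  # (Decimal('35.00'), 'GBP')
```

Keeping the amount and the currency code apart is what makes prices from different sites comparable later on.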

Tip: import.io is full of hotkeys! For example, you can press ENTER on your keyboard instead of hitting the extract button.

Tip: when you’re selecting your column data, you can make it super specific by clicking and dragging with your cursor to highlight just the bit of data you’re interested in, instead of just clicking on a highlighted element.

Now, there are a lot of options on the Crawler launch panel, but the one I want to draw your attention to is “Save Stream”. It’s great if you know you’re going to be doing a long crawl, because if your computer crashes you won’t lose any of your data.

Here is the Asos Crawler I built.

Connectors

The final (but possibly most exciting) tool we have in our bag over here at import.io is the Connector. A Connector lets you record yourself interacting with a site (like typing in a search box) to get to the data that you need. The Connector I built in the webinar was a great opportunity for me to show you how to train pagination and total results. To do this, all I have to do is highlight the total results and hit the train button when asked, and then go on to the second page of search results when prompted. When you’re training a Connector you will need to give it two tests if there are multiple rows and five if it’s a single result page.

Here is the Tesco Connector I made.

Mixes

So Connectors are pretty cool, but what if you want to combine a couple of them together? Well, you can create a Mix, which is a combination of two or more Connectors you can query from one box. I did a whole webinar on Mixes a few weeks ago, so I only did a super quick demo in this one.

Here is the Particle Mix Chris demoed for you.

Tip: if you use a Mix, you can integrate all the sources over one API, which is what we call Federation.
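As a mental model of what a Mix does, here is a short Python sketch. It is not import.io’s implementation: the two stub sources and their rows are made up, and `federate` is a hypothetical helper that sends one query to every source and labels each row with where it came from.

```python
def federate(query, sources):
    """Run one query against every source and tag each row with its source name."""
    combined = []
    for name, search in sources.items():
        for row in search(query):
            combined.append({"source": name, **row})
    return combined

# Stub sources standing in for two Connectors; names and rows are invented.
sources = {
    "store_a": lambda q: [{"title": q + " (regular)"}],
    "store_b": lambda q: [{"title": q + " (slim)"}, {"title": q + " (skinny)"}],
}

for row in federate("jeans", sources):
    print(row)
```

The caller only ever deals with one function and one list of rows, however many sources sit behind it.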

My Data

Finally, I showed you how to check the status of your data sources on your My Data page. We run periodic tests on your data sources to make sure that they are still working and if we find something amiss we will notify you on your My Data page. From there you can edit to re-train the source so it is working again.

Question Time

How can I access data over the API?

We’ve made accessing data via an API super easy. Just choose the data source you want and hit the Integrate button – this is available on the Dataset and My Data pages. Then, select the language you want to use, put in your password to retrieve your API key and we’ll automatically generate all the code you need to execute a query over the API. You can learn more about the nuts and bolts of the Integrate page by watching our advanced features webinar.
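The generated code will differ for each data source, but a query usually boils down to something like the Python sketch below. Everything specific in it is a placeholder and an assumption: the endpoint `https://api.example.com/query` and the parameter names `url` and `apikey` are not import.io’s real ones, so copy the actual endpoint, parameters and key from the Integrate page.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_query_url(endpoint, page_url, api_key):
    """Append the page to extract and the API key to the endpoint as query parameters."""
    # "url" and "apikey" are placeholder parameter names, not import.io's real ones.
    return endpoint + "?" + urlencode({"url": page_url, "apikey": api_key})

def run_query(endpoint, page_url, api_key, timeout=30):
    """Run one live query and return the decoded JSON response."""
    with urlopen(build_query_url(endpoint, page_url, api_key), timeout=timeout) as response:
        return json.load(response)

# Placeholder values only; nothing is sent over the network here.
print(build_query_url("https://api.example.com/query",
                      "http://www.example.com/table",
                      "YOUR_API_KEY"))
```

Building the URL with `urlencode` means the page address is escaped properly, which matters as soon as it contains its own `?` or `&`.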

Can you refresh a data source from Excel?

You sure can! Here is a tutorial that will walk you through the process. This feature is only available for the PC version of Excel, but you can do a live integration into Google Sheets as well.

How do you get data from a product page that requires multiple inputs?

This is an excellent question. Sometimes you need to give a site two inputs to get all the way to the data you want. For example, on some US retail sites you need to search for the product first, then input your zip code in order to get the price. Well, you’re in luck! Our Connectors can take multiple inputs, meaning you can keep recording all the way to the end of the process and we’ll let you manage both inputs. This example (Internship Connector) asks you for two inputs, your major and your zip code, in order to display the data.

Can I use import.io to access data from behind a captcha box?

We believe in being good data citizens. If someone has put up a captcha box, it’s because they don’t want people accessing that data over an API, and we respect that. That is why we don’t support getting data from behind a captcha. You can read more about our policy on captcha from our CTO Matt Painter.

What’s the difference between local and remote crawl?

Classic import.io Crawlers are all run locally from your machine, meaning the requests to the site originate from your computer rather than from import.io servers. When we introduced the ability to crawl JavaScript-enabled websites (i.e. sites that need to run JavaScript in order to show you the data you want), we also introduced the ability to run some Crawlers on import.io servers. While the crawl is marshalled and controlled by your machine, the actual data acquisition is performed through the import.io cloud platform.

How can I prevent my crawler from being blocked?

By default we set our Crawlers to a conservative speed, but if you find that you are getting blocked from a site because you are making too many requests, you can always turn the speed down in the advanced settings section. We encourage everyone to be good data citizens and to be kind to the sites they are crawling.
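If you are wondering what slowing a crawl down actually means, the idea is simply to leave a gap between requests. Here is a generic Python sketch of that idea. It is not the import.io Crawler, and `throttled` and its `pause_seconds` parameter are names invented for the example.

```python
import time

def throttled(urls, pause_seconds=1.0, sleep=time.sleep, clock=time.monotonic):
    """Yield URLs one at a time, keeping consecutive yields at least pause_seconds apart."""
    last = None
    for url in urls:
        if last is not None:
            # Only wait for whatever part of the pause has not already elapsed.
            wait = pause_seconds - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield url

pages = ["http://example.com/page1", "http://example.com/page2"]
for url in throttled(pages, pause_seconds=0.1):
    print("fetching", url)  # a real crawler would request the page here
```

A longer pause means a slower crawl, but far fewer requests hitting the site at once, which is usually what gets you unblocked.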

How do I export pictures into Excel?

Excel isn’t great at importing pictures, especially if there are lots of them. My best tip is to export the data into a CSV, which won’t try to render the images and will load a lot faster when you open it with Excel.

Can I edit my data from the Dataset page?

Right now we don’t allow you to do any data manipulation inside our app. If you need to do a little data cleansing, you can export your data into Excel/CSV and tidy it up there, or do it directly in your own app once you’ve got the data over the API.
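For the CSV route, a little cleansing can be as simple as the Python sketch below. The sample data and column names are invented, and `clean_rows` is a hypothetical helper: it trims stray whitespace from every cell and drops rows that are exact duplicates.

```python
import csv
import io

# Stand-in for an exported file; the column names are invented for this example.
EXPORTED = "name,price\n Slim Jeans ,£35.00\nSlim Jeans,£35.00\nSkinny Jeans,£28.00\n"

def clean_rows(csv_file):
    """Strip whitespace from every cell and drop exact duplicate rows, keeping order."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(csv_file):
        tidy = {key: value.strip() for key, value in row.items()}
        fingerprint = tuple(tidy.items())
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(tidy)
    return cleaned

for row in clean_rows(io.StringIO(EXPORTED)):
    print(row)
```

With a real export you would pass an opened file, for example `open("export.csv", newline="", encoding="utf-8")`, instead of the `StringIO` sample.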

How do I find out what features are coming up?

We’re always looking for ways we can make our tool better for you. That’s why we started the ideas forum. You can post an idea, ask questions about upcoming features and vote for features you want to see. If an idea gets enough support, we’ll add it to our list of features we’re working on. If you vote for an idea you’ll also be notified when we start building it and when it’s released.

Are you guys available to consult on a specific problem?

Glad you asked! If we didn’t get to your question here or you have a more specific question about a particular data source, get in touch with me on support@import.io. Good support is really important to us, so there is always a human (usually me) on the other end to answer your questions.

Congratulations to Juri Loo for winning best question, and getting a coveted Data Punk t-shirt!

If we didn’t answer your question here or you want more information, please check out our knowledgebase or email us on support@import.io.

Next Time…

We’re giving you a chance to channel your inner investigative journalist and uncover the big story using data! Join Bea and me on 27th May and learn all about how to do data journalism by creating datasets which may lead to a breaking story. Sign up here!
