Auto Extraction
As part of the algorithms team at import·io, my job is to always be on the lookout for ways to make getting data out of webpages even easier. A while back I was looking at the training stage of the workflow, and I thought to myself: wouldn’t it be cool if we could automate this?
In our latest feature release we are giving you the ability to start doing just that. The newest way to extract data with import·io is called Auto Extract, and it uses a set of algorithms to detect the data on the page automatically. Right now, in its Beta stage, it only works on sites where the data is contained in an HTML table.
What do you mean by an HTML Table?
Ok, I’ll try to keep this as brief and non-techie as possible. One of the ways in which web pages can be laid out behind the scenes (i.e. in the code) is in a table. It’s exactly what it sounds like: rows and columns of data written into the HTML. Our app can recognize data laid out in this manner, so we don’t need a human to train rows and columns to tell us where the data is; we can even determine what format to extract the data in (number, link, currency, image, etc.).
Example of data in an HTML table
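For the more technically minded, here is a minimal sketch of the general idea in Python. It is not the algorithm import·io actually uses; the class name `TableParser`, the helper `guess_type`, and the sample markup are all made up for illustration. It reads the rows and cells out of an HTML table with the standard library parser, then makes a rough guess at the format of each column.

```python
from html.parser import HTMLParser
import re


class TableParser(HTMLParser):
    """Hypothetical helper: collects the text of each <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def guess_type(value):
    """Very rough format guess (an assumption, not import.io's rules)."""
    if re.fullmatch(r"https?://\S+", value):
        return "link"
    if re.fullmatch(r"[$£€]\s?\d[\d,]*(\.\d+)?", value):
        return "currency"
    if re.fullmatch(r"-?\d[\d,]*(\.\d+)?", value):
        return "number"
    return "text"


# Made-up sample data, standing in for a real page.
SAMPLE = """
<table>
  <tr><th>Team</th><th>Points</th><th>Prize</th></tr>
  <tr><td>Reds</td><td>42</td><td>£1,000</td></tr>
  <tr><td>Blues</td><td>39</td><td>£750</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE)
header, *body = parser.rows
# Guess each column's format from the first data row.
column_types = [guess_type(cell) for cell in body[0]]
print(dict(zip(header, column_types)))
```

Because the rows and columns are already spelled out in the markup, nothing needs to be trained: the structure comes from the tags and the format guess comes from the cell text.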
Keep in mind, though, that not everything that is laid out as a table in the code looks like one on the page, and not everything that looks like a table on the page is laid out as a table in the code. Got it?
No? Ok, here is an example of a website that looks like it has a table but really doesn’t: A-Level Results
And here’s one that has a table in the code but doesn’t look like it on the page: IMDB Top 250
And just to make things even more exciting, here’s one that looks like it has a table and actually has a table: BBC Football
Anyway, none of this is really important to you, the user. It doesn’t matter if you can tell whether there is really a table there or not, because the app will tell you. The thing you need to remember when you start using it is this: just because it looks like a table doesn’t mean it is one.
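If you are curious what "a table in the code" means in practice, the check can be sketched in a few lines of Python. This is only an illustration under my own assumptions (the names `TableCounter` and `has_html_table` and both snippets of markup are invented), not how the app detects tables. It simply asks whether the HTML contains a table tag at all, regardless of how the page looks.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Hypothetical helper: counts opening <table> tags in a document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.count += 1


def has_html_table(html):
    """Return True if the markup contains at least one <table> element."""
    counter = TableCounter()
    counter.feed(html)
    return counter.count > 0


# Could be styled to look like a table on the page, but is built from divs.
LOOKS_LIKE_TABLE = '<div class="row"><span>Maths</span><span>A*</span></div>'
# A real table in the code, however it ends up being styled.
REAL_TABLE = "<table><tr><td>Maths</td><td>A*</td></tr></table>"

print(has_html_table(LOOKS_LIKE_TABLE))
print(has_html_table(REAL_TABLE))
```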
How to Use It
Let’s move on to the exciting bit: how you actually use it. If you haven’t already, you’ll need to download our web app and create an account. Then, just follow the steps as you normally would. When you get to the single or multiple rows section, you’ll notice that there is a third option, Auto Extraction (this will only appear if we have detected a table). After you’ve selected this option, click on the table you want to extract (it will turn green), press Extract Table, and voila!
You can read a more in-depth tutorial here.
This Feature is in Beta
Please keep in mind that this feature is still in Beta and inevitably there will be some kinks to work out. At the moment it can only extract data from HTML tables, but I’m working hard to expand its functionality to include other types of data.
This is a totally new feature for us, so I’d love to get your feedback on it! And of course, if you have any issues, contact the lovely people at support@import.io, who will do their best to get you the data you need.
Now get out there and Auto Extract (tables)!
Here are a few sites to get you started