Today we are pleased to announce the November release of Import.io. This release is focused mainly on a newly updated version of our extraction platform. The data extraction platform is the brains of the operation: it runs your web pages through your Extractors and converts those pages into data.
The way that you build an Extractor is not changing. You still select the data that you want from a web page in the same simple point-and-click way; but now, when you run a batch of URLs through your Extractor, your queries will run faster and with a much improved success rate.
Queries in the cloud
The biggest change that we are bringing to you with this release is that now both the data extraction and the management of your queries are performed entirely in the cloud. Data extraction in the cloud has always been core to what we do at Import.io, but now we are managing your queries there too.
The biggest benefit of moving queries to the cloud is that you don’t have to keep our old desktop product up-and-running on your laptop in order to feed URL queries into Import.io. In fact, your laptop no longer has to be involved in data extraction at all. Not only does this free up resources on your laptop, it makes extraction much more reliable.
Previously, if you were using our desktop crawler to feed queries into Import.io and your laptop ran out of battery (or whatever), your extraction would fail and you would have to start over. Now you can set Import.io to run multiple URL queries through an Extractor, shut your laptop for the night, and let us manage the execution of those queries for you in the cloud while you sleep.
We have increased the number of cloud servers that are available for data extraction by several orders of magnitude. This is the biggest increase that we have ever put in place. More servers means more capacity and an increased speed and success rate for your data extraction. This capacity increase is immediately available for all users on both free and paid plans. The next time that you run queries through Import.io your run will use this new larger infrastructure.
Automatic query retries (for free)
We now detect extraction failures for you and automatically retry all associated queries. In addition, failed queries and query retries no longer count against your monthly query limit: we only count queries that successfully return data.
Previously you would pass URLs into Import.io using our desktop product and then our cloud servers would try to extract the data for you live, there and then. If for some reason the extraction didn’t work, we would just pass you back an error message and move on. The most common reason for such an extraction failure would be a timeout caused by the web page taking too long to respond.
Many users discovered that they could overcome such timeouts and other extraction failures by simply retrying all of their failed URLs, completing their data extraction with a series of manual retries. But managing these retries yourself is difficult and time consuming. You first have to work out which URLs failed by using a couple of spreadsheets – one with your queries and one with your results – and doing complicated lookups between the two. Once you have worked out which URLs failed, you have to copy and paste those URLs back into your Extractor to run them again. You repeat this entire process until you have no more failures and a completed extraction. Hard work! And, more importantly, those failed queries and retry queries counted against your monthly query limit!
That all changes with November’s release. We retry failed queries for you and only count the queries that successfully return data against your monthly query limit.
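To make the idea concrete, here is a minimal sketch of bounded automatic retries in Python. This is an illustration only: the `extract` callable, the `TimeoutError` failure mode, and the `MAX_ATTEMPTS` value are all assumptions, and the real retry logic runs inside our cloud platform.

```python
# A sketch of bounded retries, assuming a hypothetical extract(url)
# callable that returns data on success or raises TimeoutError.
MAX_ATTEMPTS = 3  # assumed for illustration; the real cap is internal

def run_query(extract, url, max_attempts=MAX_ATTEMPTS):
    """Retry a failed query a bounded number of times, then give up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return extract(url)
        except TimeoutError:
            if attempt == max_attempts:
                raise  # out of attempts: the query is reported as failed
```

The key point is the bound: a query that keeps timing out is retried a few times and then surfaced as a failure rather than retried forever.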
Be aware that even with automatic query retries, it is still possible to experience query failures: we retry a failed query a number of times during your extraction, but we don’t retry it indefinitely. That said, if you do still experience query failures, we have made it much easier for you to manage retries yourself. Simply download the log file associated with your run, select the failed queries from the log file, and resubmit them to your Extractor.
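If you want to script that last step, a few lines of Python will pull the failed URLs out of a run log. Note that the column names below (`url`, `status`) and the `SUCCESS` marker are assumptions for illustration; check the actual headers in your downloaded log file.

```python
import csv
import io

# Hypothetical run-log format; the real column names may differ.
SAMPLE_LOG = """url,status
https://example.com/page1,SUCCESS
https://example.com/page2,FAILED
https://example.com/page3,SUCCESS
https://example.com/page4,FAILED
"""

def failed_urls(log_text):
    """Return the URLs of queries whose status indicates failure."""
    reader = csv.DictReader(io.StringIO(log_text))
    return [row["url"] for row in reader if row["status"] != "SUCCESS"]

retry_list = failed_urls(SAMPLE_LOG)
print(retry_list)  # paste these URLs back into your Extractor
```

The resulting list is exactly what you would paste back into your Extractor to rerun only the queries that failed.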
Slow down to speed up
From today, we dynamically adjust the speed at which we submit your queries to the website from which you are extracting data. We slow down and speed up extraction depending upon the ability of the underlying website to respond to your data requests. We only allow a certain number of connections from all of our users to a single website at any one time, so as not to overwhelm the website. If we judge that there are too many connections open to the website, then we put your queries into a queue for processing later. This may slightly slow down your real-time query rate, but we have found that it improves both the speed and quality of the overall extraction.
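The cap-and-queue behaviour described above can be sketched in a few lines of Python. This is a toy model, not our production scheduler: the connection cap of 3 and the class names are invented for illustration, and the real limits are tuned per website.

```python
from collections import deque

class SiteScheduler:
    """Toy per-website scheduler: run queries up to a connection cap,
    queue the rest for later processing."""

    def __init__(self, max_open=3):  # cap of 3 is assumed for illustration
        self.max_open = max_open
        self.open_connections = 0
        self.pending = deque()  # queries waiting for a free slot

    def submit(self, query):
        """Run the query now if the site has capacity, else queue it."""
        if self.open_connections < self.max_open:
            self.open_connections += 1
            return "running"
        self.pending.append(query)
        return "queued"

    def on_complete(self):
        """A query finished: free its slot and promote a queued query."""
        self.open_connections -= 1
        if self.pending:
            self.pending.popleft()
            self.open_connections += 1
```

Queued queries trade a little real-time latency for a website that stays responsive, which is why the overall extraction ends up both faster and more reliable.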
What you train is what you get
The web browser that is used to render a web page for training is now exactly the same web browser that is used to render a web page for data extraction. This is a very important point.
When using the old version of Import.io there were always two web browsers in play: the browser in the desktop product with which you would train your Extractor and the browser on our servers that would render a web page for extraction. Anyone who has used the web in recent years knows that different web browsers render web pages differently. This matters for web data extraction because not only the appearance but the actual contents of a web page can change depending upon the web browser that you are using.
Differences in how our two browsers rendered web pages were the number one reason for extraction failure on our old platform. Long-time users may remember the dreaded “Publish Request Failure” error message. Well, we have entirely eliminated “Publish Request Failures” with this release. With our new synchronized browser sitting at the heart of Import.io, you will experience a much higher extraction success rate.
Extraction time remaining
One last thing. When an extraction is running we now display the estimated time that it will take for your run to complete. If you judge the completion time to be too great then you can interrupt your run mid-flight and the data that has been extracted up to that point will be available for you to download.
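For the curious, a simple way to estimate time remaining is to assume the run continues at its observed average rate. The function below is a sketch of that idea, not the exact formula our platform uses.

```python
def estimated_seconds_remaining(completed, total, elapsed_seconds):
    """Naive ETA: assume the average query rate so far continues."""
    if completed == 0:
        return None  # no completed queries yet, so no basis for an estimate
    rate = completed / elapsed_seconds  # queries per second so far
    return (total - completed) / rate

# 250 of 1000 queries done after 500 s -> 0.5 queries/s -> 1500 s to go
print(estimated_seconds_remaining(250, 1000, 500.0))  # → 1500.0
```

An estimate like this naturally drifts as the target website speeds up or slows down, which is exactly why being able to stop mid-run and keep the data extracted so far is useful.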