At Import.io, data is not just what we deliver through our product; it informs everything we do, including how we evaluate our work with customers. The following data story comes from one of our larger e-commerce clients and shows the difference in data quality you can expect when working with Import.io on your web data project.
To set the scene, our customer is a very large, global e-commerce software vendor that had been working with web data for over a decade before working with Import.io. They had created their own custom web-scraping software and had hired a very large team of web data engineers to maintain and support it. In fact, because of how the business had evolved over the years, they had multiple web-scraping teams working on different web data problems.
At the heart of our customer’s product offering is a suite of e-commerce analytics. It needs complete, regularly updated product data from the websites of many different retailers, so that the manufacturers who use our customer’s product have direct, real-time insight into the pricing, product page quality and search ranking of their products on every retailer site that sells them online.
Our customer began working with us for a number of different reasons:
- They wanted to consolidate their web data solution with a single team
- They wanted to expand their web data collection efforts to a scale and frequency that their internal team was unable to support
- They wanted to solve a number of web data quality issues that they were having with their own web scraping
Web data quality
The main web data quality problem that they were facing was that their web scraping software was collecting incomplete web data from each product page more than half of the time. Different product-data field values would be missing from nearly 60% of all product records. These missing field values would go undetected and would propagate through to the e-commerce analytics reports and dashboards consumed by the end-users, distorting the insights that the end-users needed in order to effectively plan and execute their retailer strategies.
Measuring the problem
When we started working together, our customer knew that they had this data quality problem but didn’t know how big it was or how badly it was affecting their customers. All that they knew was that it was something that all of their customers complained about. As a first step to solving any problem it is always good to take some measurements.
We collected product data for 2.6 million unique products, every day, for 90 days using both Import.io and the customer’s legacy web scraping solution. We then compared how often complete product data records were collected on the first attempt using each system and how many retries were required with each solution in order to get complete product data.
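The two measurements described above can be sketched in a few lines of Python. This is a minimal illustration and not the tooling used in the study: the `Attempt` structure, the field names in `REQUIRED_FIELDS` and both function names are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical set of fields that make a product record "complete".
REQUIRED_FIELDS = ("title", "price", "availability")


@dataclass
class Attempt:
    product_id: str
    attempt_number: int  # 1 for the first try, 2 for the first retry, ...
    record: dict


def is_complete(record, required=REQUIRED_FIELDS):
    """A record is complete when every required field has a non-empty value."""
    return all(record.get(field) not in (None, "") for field in required)


def first_attempt_success_rate(attempts):
    """Share of first attempts that returned a complete record."""
    first = [a for a in attempts if a.attempt_number == 1]
    if not first:
        return 0.0
    return sum(is_complete(a.record) for a in first) / len(first)


def mean_attempts_to_complete(attempts):
    """Average attempts needed, over products that eventually got a complete record."""
    needed = {}
    for a in sorted(attempts, key=lambda a: a.attempt_number):
        if a.product_id not in needed and is_complete(a.record):
            needed[a.product_id] = a.attempt_number
    return sum(needed.values()) / len(needed) if needed else 0.0
```

Running both functions over the attempt logs of each system gives two directly comparable numbers per system: how often the first try was enough, and how many tries a complete record cost on average.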
Don’t get blocked
We found that web data extraction via Import.io was twice as successful at getting complete product data on the first attempt as the customer’s own web scraping. Upon investigation it became clear that the reason for this difference was that our customer’s web scrapes were getting blocked by retailer websites much more often than Import.io’s were.
Not getting blocked by websites is one of the fundamental success factors for high-performance web data extraction at scale. If you want your web data extraction to succeed, you need to make sure that you are not blocked. The work required to avoid being blocked can be significant: it means investing in network proxies and in proxy management infrastructure and tooling. At Import.io we have an entire team devoted to the development of the Import.io Traffic Manager, whose single job is to ensure that as many web browsing requests as possible are accepted by target websites the first time.
If a website detects that you are browsing automatically then the website may choose to block you. There are many different strategies that websites take to block automatic browsing; they can be grouped into three main categories:
- The website serves up an error page instead of the requested HTML page
- The website serves up a CAPTCHA that needs to be completed before it will serve the requested HTML page
- The website serves up the requested HTML page but with false or missing data
It is fairly easy to determine whether your automatic browsing has been detected and blocked with an error page or a CAPTCHA. It is harder to determine whether you have been detected and blocked with a valid HTML page that contains false or missing information.
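To illustrate why the first two categories are the easy ones, here is a rough sketch of a response classifier. The `Outcome` names, the `classify_response` function and the marker phrases are assumptions chosen for the example; real sites vary widely, and a simple text match like this can produce false positives.

```python
import re
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    ERROR_PAGE = "error_page"
    CAPTCHA = "captcha"


# Hypothetical phrases that often appear on CAPTCHA challenge pages.
CAPTCHA_MARKERS = re.compile(r"captcha|verify you are (a )?human", re.IGNORECASE)


def classify_response(status_code: int, body: str) -> Outcome:
    """Classify a fetched page using only its HTTP status code and HTML body."""
    # Check for a CAPTCHA first, since challenge pages may also carry a 4xx status.
    if CAPTCHA_MARKERS.search(body):
        return Outcome.CAPTCHA
    if status_code >= 400:
        return Outcome.ERROR_PAGE
    # OK only means "no obvious block". A page with false or missing
    # data looks exactly like this and needs checks on the data itself.
    return Outcome.OK
```

Note that the third category passes straight through as `Outcome.OK`, which is why it cannot be caught at the HTTP level.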
At Import.io, when we begin extracting data for a customer project, we start at a low frequency in order to guarantee that we will not be blocked by any method, and then we perform automated data profiling over what we know to be accurate web data. This profiling creates expectations about the statistical shape and properties of the datasets that we expect to see. We review these statistical profiles with customers to make sure that they are consistent with the intuitions of the subject matter experts. Then, as we increase the frequency of data extraction, we use these expectations to automatically monitor the data for anomalies. Unusual patterns trigger an alert, because they might mean that a website is blocking us by serving HTML pages with false or missing information.
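As a simplified sketch of this idea, the example below profiles only one property, the fill rate of each field, and flags fields whose fill rate drops in a later batch. It is not Import.io's profiling implementation; the function names and the `tolerance` threshold are assumptions, and a real profile would likely cover more properties, such as value distributions.

```python
def fill_rates(records, fields):
    """Fraction of records with a non-empty value, per field."""
    if not records:
        raise ValueError("cannot profile an empty set of records")
    total = len(records)
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "")) / total
        for field in fields
    }


def build_profile(baseline_records, fields):
    """Expected fill rates, learned from slow, trusted baseline extraction."""
    return fill_rates(baseline_records, fields)


def find_anomalies(profile, batch, tolerance=0.10):
    """Return fields whose fill rate fell more than `tolerance` below the baseline.

    Maps each flagged field to a (expected_rate, observed_rate) pair.
    """
    observed = fill_rates(batch, list(profile))
    return {
        field: (expected, observed[field])
        for field, expected in profile.items()
        if expected - observed[field] > tolerance
    }
```

A non-empty result from `find_anomalies` does not prove that a site is blocking the extraction, but it marks the batch as one worth retrying or investigating.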
If you can tell when you have been blocked (even when you have been blocked with false or missing information), then you can retry your web data extraction in order to get a complete and accurate dataset. It is important, though, that the retry takes a different extraction approach, so as to avoid being detected and blocked again. The Import.io Traffic Manager automatically retries blocked web data extraction attempts with a different extraction strategy.
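The retry pattern can be sketched generically as follows. This is not how the Import.io Traffic Manager is implemented; `extract_with_retries`, the `Strategy` type and the `is_valid` callback are hypothetical names, and each strategy stands in for a different approach, for example a different proxy pool or browser configuration.

```python
from typing import Callable, Iterable, Optional, Tuple

# A strategy takes a URL and returns an extracted record, or None if it failed.
Strategy = Callable[[str], Optional[dict]]


def extract_with_retries(
    url: str,
    strategies: Iterable[Strategy],
    is_valid: Callable[[dict], bool],
) -> Tuple[Optional[dict], int]:
    """Try each strategy in turn until one yields a record that passes validation.

    Returns the record (or None if every strategy failed) and the attempt count.
    """
    attempts = 0
    for strategy in strategies:
        attempts += 1
        record = strategy(url)
        # A retry happens both on outright failure and on a record that
        # looks blocked (for example, one with missing field values).
        if record is not None and is_valid(record):
            return record, attempts
    return None, attempts
```

The key design choice is that every retry uses the next strategy in the list instead of repeating the one that was just blocked.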
Before working with us our customer had three problems that were affecting web data quality:
- Their web data extraction attempts were getting blocked
- They were not monitoring for when their web data extraction attempts were getting blocked
- They were not retrying their blocked web data extraction attempts
By using a combination of advanced network routing algorithms, statistical methods for detecting blocked web data extraction attempts, and automatic retries, we were able to halve the number of network queries needed to create a complete e-commerce product dataset.
Import.io extracts and analyzes web data at scale for the largest companies in the world. We believe that web data can solve the most difficult business problems and answer the most interesting questions. If web data is core to your business, schedule an appointment to talk to us about how we can help you.