Why one company stopped building their own web scrapers and decided to work with Import.io instead
What follows is the story of one Import.io customer who recently joined us after having built and operated their own web scraping team for over 15 years. Names and some key details have been changed in order to protect the identity of the customer.
A company that relies on web data
My name is James and I am the CFO of our company. We employ over 150 people and enjoy annual revenues of $40m. In addition to being the finance guy, I am also responsible for our business operations team of 60 people, which includes (or I should say “included”) our web data team. We offer a SaaS e-commerce analytics platform that provides real-time consumer purchasing insights for some of the largest brand manufacturers and retailers around the world. At the heart of our platform sits a database of over 30 million product pages that has been gathered from the web over the last 15 years. We run a variety of ML analyses on these product pages and the associated product images to power the analytics products that we offer to our customers. To keep our insights fresh and relevant, our web data team has to keep our product database constantly updated, which means extracting new data on millions of products from over 3,000 websites, every single day. Web data is the absolute beating heart of our business!
At the beginning of 2020 we had 18 full-time engineers solely dedicated to scraping product data from websites, but it wasn’t a happy picture. We were spending too much money on this web scraping team, we were unable to gather all the data that we needed, and we were never able to plan properly: every year the team would exceed its allocated budget, and only a couple of months into 2020 we were already projected to spend double what we had planned for the year. On top of this, we would regularly miss scheduled data collections for some of our most important target websites, so we resolved to find a different way of doing things.
Problems with doing our own web scraping
Monitoring and maintenance
The biggest challenge that our web scraping team faced was monitoring and maintenance. Our web scrapers would break, a lot. Sometimes this would be due to website changes, but other times we simply wouldn’t know what was causing the break. This meant that the team spent a lot of time fixing and maintaining scrapers, and that was only if we noticed the break at all. Critically, there was no easy way to monitor whether our web scrapers had stopped extracting data properly.
The team built unit tests into all of our web scrapers as a matter of course, but writing such tests means anticipating the different ways in which web scrapers can fail, which is harder than it sounds. If a web scraper encountered a behavior on a website that we hadn’t thought of, the scraper would fail silently and send corrupt data to our database without us realizing. It turns out that monitoring for web data quality is a hard problem.
Over time, we developed a process whereby we would perform a quarterly audit of all of our web scrapers: manually running each web scraper individually and having a human compare its output with the website to make sure that the data looked correct. It was a very labor-intensive process and difficult to automate, but very necessary: we would find at least 100 broken scrapers (broken for who knows how long) every time that we performed this audit. Worse than having to perform this audit was when customers themselves would find errors in our web data and call us up to complain. This undermined our customers’ confidence in our product and in what we do for them.
Complexity and scale
Websites are now also much better at blocking automated access. To get data from websites at the speed and scale our business needs, we were spending an inordinate and unpredictable amount of money on proxy networks and other such infrastructure. This was one of the primary reasons why we found it so difficult to model the financial needs of our web data operations team.
Moving to Import.io
At the beginning of this year we began a vendor evaluation process. We wanted to find a strategic partner that we could build a relationship with for the long term, who would handle all of our web data collection and who could scale with us as we grew. We ended up doing a proof of concept with two of the vendors from our shortlist and selected Import.io as the company that we wanted to do business with.
The thing that distinguished Import.io during the proof of concept was the Data Operations Center. We are now able to see, at a glance, the real-time status of all of our scheduled web data extraction jobs across all of our target websites. It is critical for our business that all of these scheduled jobs complete within 24 hours, and we are now alerted ahead of time if any issues arise that might affect the successful and timely arrival of that data. In addition, because Import.io monitors our web data collection at the value level (anomalous values, unexpected nulls, value changes compared to last time) rather than doing just a simple “did it run” test, we are confident that we are feeding high quality data into our products.
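To make the distinction concrete, here is a minimal Python sketch of what a value-level check might look like, as opposed to a “did it run” test. It is an illustration only and not Import.io’s implementation: the function name `check_record`, the field names `title` and `price`, and the 3x price-change threshold are all assumptions made for the example.

```python
from typing import Any, Dict, List, Optional, Tuple


def check_record(
    current: Dict[str, Any],
    previous: Optional[Dict[str, Any]] = None,
    required: Tuple[str, ...] = ("title", "price"),  # hypothetical field names
    max_price_ratio: float = 3.0,  # hypothetical threshold for a suspicious change
) -> List[str]:
    """Return a list of data-quality issues found in one extracted record."""
    issues: List[str] = []

    # Unexpected nulls: a required field is missing or empty.
    for field in required:
        if current.get(field) in (None, ""):
            issues.append(f"unexpected null: {field}")

    # Anomalous values: a price that is zero or negative.
    price = current.get("price")
    if isinstance(price, (int, float)) and price <= 0:
        issues.append(f"anomalous value: price {price}")

    # Value changes compared to last time: a jump larger than the threshold.
    if previous is not None:
        old = previous.get("price")
        both_numeric = isinstance(price, (int, float)) and isinstance(old, (int, float))
        if both_numeric and price > 0 and old > 0:
            ratio = max(price, old) / min(price, old)
            if ratio > max_price_ratio:
                issues.append(f"large change: price {old} -> {price}")

    return issues


# Compare today's extraction with the previous run of the same product page.
today = {"title": "Widget", "price": 99.0}
yesterday = {"title": "Widget", "price": 10.0}
print(check_record(today, yesterday))
```

A scraper that ran to completion but returned an empty title, or a price ten times higher than yesterday’s, would pass a “did it run” test yet be flagged here.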
The team at Import.io took great care to ensure that they understood how our business works and led us through a very rapid onboarding process, moving over 3,500 web scrapers onto the Import.io platform in just 6 weeks (including all of the websites on our “too complex to do” list, which we just weren’t getting data from before). All of our web data operations are now entirely managed by Import.io, and I am happy to say that we have repurposed our team of 18 engineers, who were previously running web data operations. They are now building valuable features in our core product for the benefit of our customers.