Everyone is talking about the usefulness of data. Data can bring you more revenue, save you money, help you understand your market and lead you to better insights. But what people often ignore (or maybe just forget to talk about) is that all of the above is true ONLY IF you have accurate data. And as anyone who has worked with data can tell you, data is often a dirty, unstructured mess. Especially when it comes from the web.
At Import.io, we’ve spent a lot of time focused on data quality – when we build out data sets for our clients, data quality is incredibly important. The first step in ensuring quality data is obviously building your APIs to be as accurate as possible: using the right column fields, marking up multiple pages or generating your own XPaths. But even after all that, you still need to check your data for accuracy. This article will show you how our data delivery team monitors and checks the quality of the data sets we create for corporate clients, and give you some guidelines for checking your own data.
Serving the data as is
Before we can get into the details of our data acceptance tests, you have to realize that the data you collect will only ever be as good as your source website. Import.io can only extract what is on the page in front of us (or in the HTML). We have no way of knowing whether that information is accurate or not.
Data on the web isn’t perfect. But then, neither is the world. Both businesses and people need to accept that at least some part of the data may be wrong. And that’s OK. Given enough data you should be able to spot trends and make accurate decisions. Big data is all about looking at the aggregate and for that you don’t need 100% accurate information. Once you accept this fact and find ways of working with it, not fighting against it, you’ll have a whole lot more success with web data.
Whew! Now that the philosophy is out of the way, we can talk more specifically about the kind of data acceptance tests you should be performing to check your data’s accuracy.
Manual testing

As much as we love to rely on technology and statistics, nothing (yet) replaces good old-fashioned manual testing by a human. Get someone who didn’t build the data set to manually check at least 10 records (rows of data) against the source website. You’re simply looking to see whether the data in your spreadsheet matches the data on the page. This is the fastest way to spot any obvious problems with your data extraction.
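One quick way to pick records for a spot check is to sample them at random. A minimal sketch in Python, assuming each extracted row is a dict of field values (the `rows` data here is invented for illustration):

```python
import random

def sample_records(rows, k=10, seed=42):
    """Pick k random records for a manual spot check against the source site."""
    random.seed(seed)  # fixed seed so the same records come up on a re-check
    return random.sample(rows, min(k, len(rows)))

# Hypothetical extracted data: one dict per row.
rows = [{"store_id": i, "address": f"{i} Main St"} for i in range(100)]
for record in sample_records(rows):
    print(record)
```

Fixing the seed means the reviewer and the builder can look at the same ten records independently.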
Coverage testing

Coverage is about testing whether your data set has the right number of records. Before you extract data, you need to know how many results there should be – this is your expected data set size. For example, if you want the address of every 7-Eleven in the US, you need to know how many 7-Elevens there are. Wikipedia and Bloomberg are good places to find this information.
Then you need to know how many records you actually have (remember to de-duplicate them first), known as your actual data set size. To calculate the coverage simply divide the actual size by the expected size.
So, if you expect 7-Eleven to have 14,000 stores, but your actual store count is 13,950, your coverage calculation would be 13950/14000 = 99.64%. It’s pretty rare that this number will ever be 100% – for most use cases, we consider anything over 95% to be acceptable.
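That arithmetic is trivial to automate once you have both counts. A sketch, assuming the actual count has already been de-duplicated:

```python
def coverage(actual, expected):
    """Coverage = de-duplicated actual record count / expected record count."""
    return actual / expected

# The 7-Eleven example from above.
print(f"{coverage(13950, 14000):.2%}")  # 99.64%
```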
Completeness testing

Just because you have roughly the right number of rows doesn’t mean you have the right number of results (cells with the right data in them). Completeness is about measuring which columns should always have something in them (e.g. every item should have a price), which should sometimes have something in them (e.g. some items will have a sale price) and which should never have something in them (e.g. you expect no items from a particular source to have reviews).
How important completeness is to you depends on what you’re doing with the data. If you’re plotting 7-Elevens on a map, the address would be essential while the opening hours might be more of a bonus. Either way, if you find a lot of missing fields that should be complete you need to go back and look at your data source to make sure the information really is on the page and – if it is – retrain your API. Sometimes if the page structure changes slightly from result to result, you need to do a little extra training to expand the XPath.
For fields that you expect to be partially complete, you should still manually check a few of the missing ones against the website to make sure they really aren’t on the page and that it’s not an error with your API.
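A per-column completeness score can be computed the same way as coverage. A minimal sketch, again assuming rows are dicts (the sample values are invented):

```python
def completeness(rows, column):
    """Fraction of rows whose value in `column` is actually filled in."""
    filled = sum(1 for r in rows if r.get(column) not in (None, ""))
    return filled / len(rows)

# Invented sample: every store has an address, but not every one a sale price.
stores = [
    {"address": "1 Main St", "sale_price": "0.99"},
    {"address": "2 Oak Ave", "sale_price": ""},
    {"address": "3 Elm Rd",  "sale_price": None},
]
print(completeness(stores, "address"))     # 1.0
print(completeness(stores, "sale_price"))  # roughly 0.33
```

A column you expect to be fully complete should score 1.0; a partially complete column scoring 0 probably means a broken XPath rather than missing data.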
Data type testing
Each property (column field) of your data set should have a type – which you define when you create your schema. In addition to simple things like number, text or currency, types can also define validation rules, such as expected string patterns. For example, a US zip code contains five digits, optionally followed by four more (usually separated by a hyphen), and a price is valid if it consists of a number and possibly a currency sign.
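Pattern rules like these are easy to express as regular expressions. A sketch of the two examples above – the exact patterns are my assumptions, not Import.io’s schema rules:

```python
import re

ZIP_RE = re.compile(r"^\d{5}(-?\d{4})?$")          # 12345, 123456789 or 12345-6789
PRICE_RE = re.compile(r"^[$€£]?\d+(\.\d{1,2})?$")  # 4, 4.99, $4.99 ...

def is_valid_zip(value):
    """True if the string looks like a five- or nine-digit US zip code."""
    return bool(ZIP_RE.match(value))

def is_valid_price(value):
    """True if the string is a number with an optional currency sign."""
    return bool(PRICE_RE.match(value))

print(is_valid_zip("12345"), is_valid_zip("12345-6789"), is_valid_zip("1234"))
print(is_valid_price("$4.99"), is_valid_price("free"))
```

Running every cell in a column through its validator gives you a pass rate you can track over time.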
In addition to testing the data type, you should also check for null values in your data – especially the sneaky kind that don’t look empty but really are. For example, instead of the expected male/female for gender you might get a dash or a default value like N/A, which will make your cell look full when it really isn’t.
The best way to test for this is to run a frequency distribution over the column’s values to see if it is skewed (obviously this only works if you know what all the expected values are) and then filter for unexpected uniques.
Make data quality a priority
Whether you’re collecting data for a big company or a personal project, it’s essential that you make data quality a priority. Halo conducted a survey of 140 companies, which estimated they lost an average of $8.2 million per year due to dirty or incorrect data.
Your data will likely never be perfect (especially when it’s from the web), but by following the methods above you should be able to control for the majority of issues. Of course, it’s also important to collect data from reputable sources in the first place.
If you’re interested in receiving quality, process-ready data sets that are custom built for your business needs, get in touch with our sales team by filling out the form below. If you’re more of a do-it-yourself type, we recommend Trifacta as a great tool for testing data quality.