We all know that time is money, especially when you are paying an expensive data scientist. But the New York Times reports that…
“Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in [the] mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”
– Steve Lohr, NYT
According to the co-founder of Trifacta, even the most powerful algorithms can’t derive insights from raw data, so datasets have to be cleaned and combined with one another before they can deliver value. This means that those expensive data scientists are forced to act more like data janitors than data scientists. Cleaning and combining different datasets, a process commonly referred to as “data wrangling”, is a shockingly large part of a data scientist’s daily work.
“It’s something that is not appreciated by data civilians. At times, it feels like [data wrangling] is everything we do.”
– Monica Rogati, VP for Data Science at Jawbone
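To make “data wrangling” a little more concrete, here is a minimal Python sketch of the kind of cleaning and combining described above. The two datasets, their field names and the helper functions (clean_email, clean_customers, clean_orders, combine) are all hypothetical and made up for illustration; real-world wrangling is usually far messier than this.

```python
# Hypothetical raw data with the usual problems: stray whitespace,
# inconsistent casing, missing keys, duplicates and unparseable numbers.
raw_customers = [
    {"email": " Alice@Example.com ", "name": "Alice"},
    {"email": "bob@example.com", "name": " Bob"},
    {"email": "", "name": "No Email"},                # missing key
    {"email": "alice@example.com", "name": "Alice"},  # duplicate once cleaned
]
raw_orders = [
    {"email": "ALICE@example.com", "total": "19.99"},
    {"email": "bob@example.com ", "total": "5"},
    {"email": "carol@example.com", "total": "n/a"},   # unparseable total
]


def clean_email(value):
    """Normalize an email so the same address always compares equal."""
    return value.strip().lower()


def clean_customers(rows):
    """Return a dict of customers keyed by cleaned email, dropping blanks and duplicates."""
    cleaned = {}
    for row in rows:
        email = clean_email(row["email"])
        if not email:
            continue  # skip rows without a usable key
        # setdefault keeps the first occurrence of each email
        cleaned.setdefault(email, {"email": email, "name": row["name"].strip()})
    return cleaned


def clean_orders(rows):
    """Return orders with normalized emails and numeric totals, dropping bad rows."""
    cleaned = []
    for row in rows:
        try:
            total = float(row["total"])
        except ValueError:
            continue  # skip totals that are not numbers
        cleaned.append({"email": clean_email(row["email"]), "total": total})
    return cleaned


def combine(customers, orders):
    """Join orders to customers on email, keeping only orders with a known customer."""
    return [
        {"name": customers[o["email"]]["name"], "email": o["email"], "total": o["total"]}
        for o in orders
        if o["email"] in customers
    ]


customers = clean_customers(raw_customers)
orders = clean_orders(raw_orders)
combined = combine(customers, orders)

for row in combined:
    print(row)
```

Even in this toy case, most of the code is cleanup rather than analysis, which is the point the quotes above are making.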
This is a major issue for the data industry, because it means that at least half of all the time spent working with data is not spent actually analyzing it. If Big Data is ever going to deliver on its promise of smarter, data-driven decision-making in every field, there has to be a better, faster way of getting data ready for analysis.
Changing the way we gather data
We think it is ridiculous that data scientists spend so much of their time on data collection and cleaning. Of course, not all data wrangling is wasted time, but the vast majority of it is done because the traditional methods used to collect data from the web are not very accurate.
What is needed is a better way to collect process-ready data. If you’re looking to get process-ready data from the web, then we think we can help. Over the course of building our data extraction technology, we have become pretty good at extracting clean data from the web. Now, with our newest offering, Import Data, you can order process-ready datasets that are custom-built for your data needs.