The term “wrangling” evokes images of cowboys lassoing runaway cattle, getting them under control, and bringing them back to the herd. Data wrangling is much the same: lassoing scattered data, getting it under control, and putting it to work in your business.
According to IBM, the world was producing roughly 2.5 quintillion bytes of data a day in 2017, and by 2020 an estimated 1.7 megabytes of new information will be created every second for every human being on the planet.
As we deal with such an ever-increasing amount of data, we need to find better and more efficient ways of rounding up, cleaning and organizing that data, and using it to our advantage.
What is Data Wrangling?
Data wrangling is the process of taking messy, incomplete or complex data, and cleaning and simplifying it so it’s easier to access, organize and analyze.
In practical terms, that might mean targeting a field, row or column in a data set and applying an action such as joining, parsing, cleansing, consolidating or filtering to create the desired output for use downstream.
This wrangled data could then be used by businesses, data architects, or data scientists, who may process the data further or reuse it in other formats.
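To make that concrete, here’s a minimal sketch of those operations in Python with pandas. The file names (customers_raw.csv, orders_raw.csv) and column names are hypothetical, invented purely for illustration.

```python
import pandas as pd

# Load a messy source file (file and column names here are hypothetical)
raw = pd.read_csv("customers_raw.csv")

# Cleanse: normalize column names, tidy values, drop exact duplicates
raw.columns = raw.columns.str.strip().str.lower().str.replace(" ", "_")
raw["email"] = raw["email"].str.strip().str.lower()
raw = raw.drop_duplicates()

# Filter: keep only rows with a usable email address
clean = raw.dropna(subset=["email"])

# Join: enrich the cleaned set with a second data set keyed on customer_id
orders = pd.read_csv("orders_raw.csv")
combined = clean.merge(orders, on="customer_id", how="left")

# The wrangled output is now ready for use downstream
combined.to_csv("customers_wrangled.csv", index=False)
```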
Data wrangling has emerged in recent years as a fast-growing segment of the analytics industry. Messy and complex data sets were once a roadblock to data analysis, but the ability to wrangle data has transformed the process, replacing tedious, time-consuming efforts to master diverse data sources.
In data science, it’s common to deal with disorganized or incomplete data, and the raw data obtained initially is often unusable in its native form. Wrangling the data makes it more appropriate and valuable for other purposes, such as analytics.
According to a 2018 IDC Business Analytic Solutions survey, data scientists spent 73% of their time just wrangling data – so it’s an essential part of the data science role.
But as important as this function is, it’s not exactly seen as fun. As Lukas Biewald wrote in Computerworld, “Data wrangling is cleaning data, connecting tools and getting data into a usable format; the sexy stuff is predictive analysis and modeling. Considering that the first is sometimes referred to as ‘janitor work,’ you can guess which one is a bit more enjoyable.”
The Benefits of Data Wrangling
Regardless of how unexciting data wrangling might be, it’s still critical because it makes your data useful. Properly wrangled data can provide value through analysis or, once it has been conformed to the target format, be fed into a collaboration and workflow tool to drive downstream action.
Conformance, or transforming disparate data elements into the same format, also addresses the problem of siloed data. Siloed data assets cannot “talk” to each other without translating data elements between the different formats, which is often time- or cost-prohibitive.
Another benefit of data wrangling is that it can be organized into a standardized and repeatable process that moves and transforms data sources into a common format, which can be reused multiple times. Once your data has been conformed to a standard format, you’re in a position to do some very valuable, cross-data set analytics.
Conformance is even more valuable when multiple data sources are wrangled into the same format. For example, if you’re a healthcare organization, you may need to compare physician performance across a large set of patients using health insurance claims data. If there are multiple health insurers involved, their membership and claims data will be in different formats. Wrangling data into a common, conformed format will enable you to compare physician performance metrics, regardless of payer.
Or if you’re a nationwide retailer, you may need to aggregate sales data from different regions to identify and respond to regional trends. Not every region may be using the same system to capture sales data, and regions may even have different product catalogs, which creates problems. In this case, wrangling data into a conformed format will enable you to make “apples to apples” comparisons between regions.
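Here’s what that kind of conformance might look like in practice, as a minimal Python/pandas sketch. The regional file layouts and column mappings are invented for illustration.

```python
import pandas as pd

# Two regional extracts with different layouts (all names are hypothetical)
east = pd.read_csv("sales_east.csv")   # columns: sku, units, revenue_usd
west = pd.read_csv("sales_west.csv")   # columns: product_code, qty_sold, sales_total

# Map each source schema onto a single conformed schema
east = east.rename(columns={"sku": "product_id", "units": "quantity", "revenue_usd": "revenue"})
west = west.rename(columns={"product_code": "product_id", "qty_sold": "quantity", "sales_total": "revenue"})
east["region"] = "east"
west["region"] = "west"

# Once the data is conformed, "apples to apples" comparison is a simple group-by
sales = pd.concat([east, west], ignore_index=True)
print(sales.groupby("region")["revenue"].sum())
```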
How to Wrangle Data
There are three general methods to wrangle data: manual, semi-automated and fully automated.
1. Manual
With this method, you open, inspect, cleanse, manipulate, test and distribute the data by hand. This method can yield some quick and dirty results to get you started. However, it’s not very efficient or enjoyable. You only want to do this for a one-time analysis, or as a proof of concept for an ongoing analysis.
Regularly wrangling data by hand for a production process is a time-consuming, error-prone nightmare because of the very human element: using a manual process introduces the possibility that the analyst will forget a key step and end up sending bad data to end users.
2. Semi-Automated
Thankfully, there are code-based tools that can make data wrangling easier. The first is the old standby from the relational database world, Structured Query Language (SQL). If you can pull the source data into a SQL table, you’ll be in a much better position to profile the data, analyze it for trends, manipulate it, and run calculations and summary queries on it than you would in a spreadsheet.
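As a rough illustration, here’s one way to pull a source file into a SQL table and profile it, using Python’s built-in sqlite3 module together with pandas. The file, table and column names (orders, amount, region) are assumptions.

```python
import sqlite3
import pandas as pd

# Pull a source file into a SQL table; sqlite3 ships with Python
# (the file, table and column names here are assumptions)
conn = sqlite3.connect("wrangling.db")
pd.read_csv("orders_raw.csv").to_sql("orders", conn, if_exists="replace", index=False)

# Profile and summarize the data with plain SQL
queries = [
    "SELECT COUNT(*) AS total_rows FROM orders",
    "SELECT COUNT(*) AS missing_amount FROM orders WHERE amount IS NULL",
    "SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue "
    "FROM orders GROUP BY region ORDER BY revenue DESC",
]
for sql in queries:
    print(pd.read_sql_query(sql, conn))

conn.close()
```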
In addition, if you have a recurring task that involves a small number of data sources, you could also create SQL stored procedures to process your data wrangling for you.
The next step up from stored procedures is Extract/Transform/Load (ETL) tools. These tools extract data from its source format, transform it to match the target format, then load it into the target location.
There are many ETL tools to choose from; some are even free. ETL tools are a step above SQL stored procedures because 1) they tend to handle many different data sources better than SQL, 2) they feature a graphical development interface, 3) they are generally better for complex transformations and lookups, and 4) they tend to manage memory better, which can be an issue for larger data sets.
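For a sense of the pattern these tools automate, here’s a stripped-down extract/transform/load sketch in Python with pandas. The file paths and the order_date column are assumptions; real ETL tools add the connectors, graphical interface and memory management described above.

```python
import pandas as pd

# A toy extract/transform/load pipeline; paths and columns are assumptions
def extract(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Match the target format: conformed names, typed dates, no duplicates
    df = df.rename(columns=str.lower).drop_duplicates()
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    return df.dropna(subset=["order_date"])

def load(df: pd.DataFrame, path: str) -> None:
    df.to_csv(path, index=False)  # write to the target location

load(transform(extract("orders_raw.csv")), "orders_conformed.csv")
```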
3. Fully Automated
If you have a need for repetitive and complex data wrangling, you should consider developing an enterprise data warehouse and scheduling automated ETL workflows. This approach builds the data wrangling logic upfront into a reusable, automated ETL process. The ETL process is then run on a schedule to automate ongoing data loads from a given data source into a conformed format.
While this approach requires more analysis, design and development up front, as well as ongoing data stewardship and governance, it not only lets you reuse the ETL logic but also lets you reuse the conformed data for multiple business scenarios.
According to Margy Ross, President of DecisionWorks Consulting, conforming data can “support the ability to drill across and integrate data from multiple business processes. Reusing conformed dimensions ultimately shortens the time-to-market by eliminating redundant design and development efforts.”
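As a minimal sketch of what such an automated process might look like, the script below conforms two hypothetical sources into one format and would be triggered on a schedule rather than run by hand. All file names and column mappings are illustrative.

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)

# Each source gets its own mapping into the conformed schema
# (file names and mappings are illustrative)
SOURCES = {
    "sales_east.csv": {"sku": "product_id", "units": "quantity", "revenue_usd": "revenue"},
    "sales_west.csv": {"product_code": "product_id", "qty_sold": "quantity", "sales_total": "revenue"},
}
CONFORMED_COLUMNS = ["product_id", "quantity", "revenue"]

def run_etl() -> None:
    frames = []
    for path, mapping in SOURCES.items():
        logging.info("Extracting %s", path)
        frames.append(pd.read_csv(path).rename(columns=mapping)[CONFORMED_COLUMNS])
    conformed = pd.concat(frames, ignore_index=True)
    conformed.to_csv("sales_conformed.csv", index=False)
    logging.info("Loaded %d conformed rows", len(conformed))

if __name__ == "__main__":
    # In production this script would be triggered on a schedule
    # (e.g., cron or a workflow orchestrator) rather than run manually
    run_etl()
```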
For any organization that needs to produce anything but the most narrow and basic business intelligence analysis, data wrangling is a fact of life. It can be done by hand, or on an ad hoc basis using automated tools such as Import.io, which not only extracts data but also transforms it into a structured, common format that can be reused multiple times.
But the best way to tame your unruly data is to create and schedule automated processes to do most of the data wrangling for you, and conform disparate data elements to a common format – freeing up your analysts for greater value-added activities.