Data creation and consumption are becoming a way of life. According to a recent IBM report, the world produced approximately 2.5 quintillion bytes of data a day in 2017. By 2020, analysts predict, about 1.7 megabytes of new information will be created every second for every person on earth.
Most of this data is stored on the internet, making it the largest database on earth. Google, Amazon, Microsoft, and Facebook alone store 1,200 petabytes (1.2 million terabytes).
But using data comes with risks. The MIT Sloan Management Review reported that financial losses due to bad data equal between 15% and 25% of a company’s revenue. And according to a 2018 IDC Business Analytic Solutions survey, data scientists spend 73% of their time on the hard work of preparing data rather than on more value-added activities like predictive analytics or forecasting.
With so much potential for adverse business outcomes (lost time, lost sales, lost market share, lost customers, and more), businesses seeking to use data analytics to grow their bottom line really need to grasp the concepts of data cleansing and transformation.
While traditional web scraping methods may supply a large volume of data, it’s often messy and disorganized. Web data integration (WDI), however, focuses on data quality and controls. WDI has built-in Excel-like transform functions that allow you to normalize data right within the web application.
This allows you to extract, prepare, integrate, and consume data all within the same environment, taking you from idea to insights in one simple solution. In turn, you’re able to use the data with a high level of trust and confidence and take full advantage of its value.
Before Starting With Data Cleansing and Transformation
Oftentimes, analysts are tempted to jump into cleaning data without completing some essential tasks. The items listed below set the stage for data wrangling by helping the analyst identify all of the data elements they need to address (and only those elements):
- Define the business case: Knowing the business objective is the first step toward proper data wrangling. A good business case lays out the alignment with corporate strategy, the customer problems to be solved, new or updated business processes, the estimated costs, and the projected return on investment. These parameters help identify necessary (and unnecessary) data insights.
- Data source investigation: A well-designed data model sheds light on the possible sources of data, such as websites and web pages, that can populate that model. Specifically, a thorough data source examination includes:
- Identifying the data required by the business case
- Knowing whether the data will be integrated directly into an application or business process or if it will be used to drive an analytical investigation
- Identifying what trends project team members anticipate seeing as web data is collected over time
- Cataloging possible data sources and their data stewards in a mature IT environment
- Understanding the delivery mechanism and frequency of refreshed data from the source
Also note that the value of web data increases over time: as data accumulates, it becomes possible to perform time-series and trend analyses. As a result, your decision-making improves, and you gain a deeper understanding of how significant events, such as a celebrity endorsement or a limited-time sale, impact your company.
- Data profiling: This step entails really getting to know the data before transforming it. Data profiling reveals data structure, null records, outliers, junk data, and other potential data quality issues. A thorough inspection of the data can help determine whether a data source is worthy of inclusion in the data transformation effort, which data quality issues it carries, and how much wrangling is required to make the data usable for business analytics (see the sketch below).
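To make this concrete, here is a minimal profiling sketch in Python with pandas. It assumes the extracted web data has already been exported to a CSV file; the file name and column names (such as “price”) are hypothetical, and the checks shown are only a starting point.

```python
import pandas as pd

# Hypothetical example: web data extracted to a CSV export.
df = pd.read_csv("extracted_products.csv")

# Structure: row count, column names, and data types.
print(f"{len(df)} rows, {df.shape[1]} columns")
print(df.dtypes)

# Null records: how many values are missing in each column?
print(df.isna().sum())

# Summary statistics help surface outliers and junk values.
print(df.describe(include="all"))

# Spot-check suspicious records, e.g. negative prices (column name is assumed).
print(df[df["price"] < 0].head())
```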
The process of defining the business case, developing the data model, and finding and profiling data sources performs a valuable winnowing function: it identifies only the data that is needed and the processing work necessary to make that data usable. The stage is now set for data cleansing.
Data Cleansing
Only after the data source is evaluated and profiled can data cleansing proceed. Data cleansing depends on thorough and continuous data profiling to identify data quality issues that must be addressed.
Generally speaking, cleansing, transformation, profiling, discovery, and wrangling should all be framed in terms of data captured and extracted from the web. Treat each website as a data source in its own right, rather than approaching the work through the traditional ETL and data integration lens of enterprise data management and conventional sources.
Common data cleansing best practices include (but are not limited to) the following; a short code sketch after the list illustrates several of them:
- Defining a data quality plan: Derived from the business case (see above), the quality plan may also entail some conversation with business stakeholders to tease out answers to questions like “What are our data extraction standards?”, “What opportunities do we have to automate the data pipeline?”, “What data elements are key to downstream products and processes?”, “Who is responsible for ensuring data quality?”, and “How do we determine accuracy?”
- Validating accuracy: One aspect of accuracy is taking steps to ensure data is captured correctly at the point of collection. For example, a website may have changed so that a value is no longer present, or a product’s price may only appear once the item is added to a shopping cart because of a promotion.
- Deduplicating: No source data set is perfect, and sometimes source systems send duplicate rows. The key here is to know the “natural key” of each record, meaning the field or fields that uniquely identify each row. If an inbound data set includes records having the same natural key, all but one of the rows could be removed.
- Handling blank values: Are blank values represented as “NA,” “Null,” “-1,” or “TBD”? If so, deciding on a single value for consistency’s sake will help eliminate stakeholder confusion. A more advanced approach is imputing values. This means using populated cells in a column to make a reasonable guess at the missing values, such as finding the average of the populated cells and assigning that to the blank cells.
- Reformatting values: If the source data’s date fields are in the MM-DD-YYYY format, and your target date fields are in the YYYY/MM/DD format, update the source date fields to match the target format.
- Threshold checking: This is a more nuanced data cleansing approach. It includes comparing a current data set to historical values and record counts. For example, in the health care world, let’s say a monthly claims data source averages total allowed amounts of $2M and unique claim counts of 100K. If a subsequent data load arrives with a total allowed amount of $10M and 500K unique claims, those amounts exceed the normal expected threshold of variance, and should trigger additional scrutiny.
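As a rough illustration of several of the practices above, here is a hedged sketch in Python with pandas covering deduplication on a natural key, blank-value handling with simple imputation, date reformatting, and a basic threshold check. The file name, column names, and threshold figures are hypothetical assumptions, not a prescribed pipeline.

```python
import pandas as pd

# Hypothetical source file and column names, used for illustration only.
df = pd.read_csv("extracted_products.csv")

# Deduplicate on the natural key (assumed here to be product_id + retailer).
df = df.drop_duplicates(subset=["product_id", "retailer"], keep="first")

# Handle blank values: normalize the various "missing" markers, then impute
# missing prices with the column mean as a simple example.
df = df.replace({"NA": pd.NA, "Null": pd.NA, "TBD": pd.NA, "-1": pd.NA})
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["price"] = df["price"].fillna(df["price"].mean())

# Reformat values: source dates in MM-DD-YYYY become the target YYYY/MM/DD.
df["listed_date"] = pd.to_datetime(df["listed_date"], format="%m-%d-%Y")
df["listed_date"] = df["listed_date"].dt.strftime("%Y/%m/%d")

# Threshold check: flag a load whose record count is far outside historical norms.
EXPECTED_ROWS, TOLERANCE = 100_000, 0.25  # assumed historical average and variance
if abs(len(df) - EXPECTED_ROWS) / EXPECTED_ROWS > TOLERANCE:
    raise ValueError(f"Row count {len(df)} exceeds the expected variance threshold")
```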
Upfront data cleansing provides accurate, consistent data to downstream processes and analytics, which increases customer confidence in the data. Import’s WDI aids in data cleansing by exploring, assessing, and refining the quality of extracted data, then cleansing, normalizing, and enriching it using more than 100 spreadsheet-style functions and formulas.
Data Wrangling
Data wrangling (sometimes called “data preparation” or “data munging”) is the practice of converting cleansed data into the dimensional model required by a particular business case. It involves two key components of the WDI process: extraction and preparation. The former involves rendering CSS, processing JavaScript, interpreting network traffic, and so on; the latter harmonizes the data and ensures its quality.
Here are some data wrangling best practices, several of which are illustrated in the sketch after the list:
- Start with a small test set: One of the challenges of Big Data is working with large data sets, especially early in the data transformation process, when analysts need to iterate quickly through many different exploratory techniques. To help tame the unruly beast of 500 million rows, apply random sampling to the data set to explore the data and lay out the preparation steps. This method will greatly accelerate data exploration and quickly set the stage for further transformation.
- Understand the columns and data types: Having a data dictionary (a document that describes a data set’s column names, business definitions, and data types) can really help with this step. It’s necessary to ensure that the values actually stored in a column match the business definition of that column. For example, a column called “date_of_birth” should contain valid dates in a consistent format such as MM/DD/YYYY. Combining this practice with data profiling, described above, should help the analyst really get to know the data.
- Visualize source data: Using common graphing tools and techniques can help bring the “current state” of the data to life. Histograms show distributions, scatter plots help find outliers, pie graphs show percentage to whole, and line graphs can show trends in key fields over time. Showing how data looks in visual form is also a great way to explain exploratory findings and needed transformations to non-technical users.
- Zero in on only the needed data elements: This is where having a well-defined business case can really help. Since most source data sets have far more columns than are actually needed, it’s imperative to wrangle only the columns required by the business case. Proper application of this practice will save untold amounts of time, money, and credibility.
- Turn it into actionable data: The steps above shed light on the manipulations, transformations, calculations, reformatting, etc. needed to convert the web source data into the target format. A skilled analyst can create repeatable workflows that translate the required business rules into data wrangling action.
- Test early and often: Ideally, reliable expected values are available to test the results of a data wrangling effort. A good business case could include expected values for validation purposes. But even if not, knowing the business question and iteratively testing the results of data wrangling should help testers surface data transformation issues for resolution early in the process.
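To tie these practices together, here is a hedged sketch in Python with pandas: it samples the data, checks columns against a hypothetical data dictionary, draws a quick histogram, keeps only the columns the business case needs, and runs a couple of simple tests. All file, column, and dictionary names are illustrative assumptions rather than a prescribed workflow.

```python
import pandas as pd

# Hypothetical extracted data set; file and column names are assumptions.
df = pd.read_csv("extracted_products.csv")

# Start with a small test set: a random sample keeps exploration fast.
sample = df.sample(n=min(10_000, len(df)), random_state=42)

# Understand the columns and data types against a (hypothetical) data dictionary.
data_dictionary = {"product_id": "int64", "price": "float64", "listed_date": "object"}
for column, expected_type in data_dictionary.items():
    actual = str(sample[column].dtype) if column in sample.columns else "missing"
    if actual != expected_type:
        print(f"{column}: expected {expected_type}, found {actual}")

# Visualize source data: a quick histogram helps spot outliers (requires matplotlib).
sample["price"].plot.hist(bins=50)

# Zero in on only the data elements the business case requires.
needed_columns = ["product_id", "retailer", "price", "listed_date"]
wrangled = sample[needed_columns]

# Test early and often: assert a few expectations drawn from the business case.
assert (wrangled["price"] > 0).all(), "Found non-positive prices"
assert wrangled["product_id"].is_unique, "Duplicate product IDs after wrangling"
```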
The unprecedented amount, type, and immediacy of data available today is a huge opportunity for businesses to improve their revenue, market share, competitive position, and customer relationships. Skimping on data cleansing or quality, however, runs the risk of bad data, bad decisions, and lost credibility. In this regard, the value of traditional web scraping is marginal at best.
That’s where WDI comes in. With a focus on data quality and controls, WDI enables you to fully realize the potential of data through well-thought-out, rigorous, and consistent data cleansing and data wrangling. Investing in the right tools means you can earn trust in the data and make data insights available to the right people at the right time.