“Big data” isn’t just a buzzword; it’s an accurate description of the data challenge organizations face today. Consider these statistics from Forbes:
- Between 2015 and 2020, the total digital universe of data will grow by a factor of 10
- By 2020, there will be 6.1 billion mobile phone users, using devices that collect a large amount of data every day
- Less than 0.5% of all data is ever analyzed and used
Clearly, the volume and variety of data available are growing at an astonishing rate, which makes that last statistic all the more startling: organizations gain information and insights from less than 0.5% of this data.
What is Data Transformation?
Data transformation is the process of converting a raw data source into a cleansed, validated, and ready-to-use form. To begin making sense of their data, companies must master the art and science of data transformation, which turns raw data into timely insights that positively impact the business. Data transformation is the “T” in ETL (Extract/Transform/Load), the broader process that extracts data from source systems, transforms it, and loads it into a target store.
Appropriately transformed data is accessible, consistent, secure, and seen as a trustworthy source by the intended user community. Data transformation is an important step in the full web data integration process.
With the ever-growing volume of data available to and about your business, you have a great opportunity to use it to find new business value. But harnessing this data requires a conscientious data transformation strategy that orients the data around the needs of business users.
7 data transformation best practices:
The following are best practices to keep in mind when conducting data transformations. Data transformation tools and ETL tools can help with the process as well; Import.io’s Web Data Integration tool, for example, includes automatic data transformation in addition to automatic identification, extraction, preparation, and cleansing of data.
1. Start with the End in Mind: Design the Target
When faced with an ocean of data to process, it’s tempting to jump right into the nuts and bolts of data transformation. However, before transforming data into insights, you need to engage business users to understand the business processes you’re trying to analyze, and design the target format.
This process, known as “dimensional modeling,” results in two types of target tables for transformed data:
- Dimension tables: These provide the “Who, what, where, when, why and how” context for the data. Data warehousing guru Ralph Kimball calls dimension tables “the soul of the data warehouse because they contain the entry points and descriptive labels that enable the [Data Warehouse/Business Intelligence] system to be leveraged for business analysis.”
- Fact tables: These store the results of the events being measured, and answer the “How many” questions. Fact table types include Transaction (records events at the atomic level), Periodic Snapshot (summarizes events over a regular interval of time), and Accumulating Snapshot (captures the execution of a process, the steps of which could occur at irregular intervals, within a single record).
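To make the two table types concrete, here is a minimal sketch in Python with pandas; every table and column name below is an illustrative assumption, not the output of any particular tool:

```python
import pandas as pd

# Dimension tables: the "who, what, where, when" context for the data.
dim_customer = pd.DataFrame({"customer_key": [1], "customer_name": ["Acme Co."]})
dim_product = pd.DataFrame({"product_key": [10], "product_name": ["Widget"]})
dim_date = pd.DataFrame({"date_key": [20190102], "calendar_date": ["2019-01-02"]})

# Transaction-grain fact table: one row per atomic sales event, linking
# to each dimension by its surrogate key.
fact_sales = pd.DataFrame({
    "customer_key": [1],
    "product_key": [10],
    "date_key": [20190102],
    "sales_amount": [120.00],  # the measure that answers "how many"
})
```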
Dimensional modeling with target users yields several advantages. It:
- Engages users early and gives them a sense of ownership over the end result
- Scopes the overall data transformation effort by identifying only the data needed to fulfill user needs
- Provides the “star schema” relationship of facts to dimensions, which is easy for most users to grasp
- Provides a “target” for the data transformation effort
2. Speed Date Your Data with Data Profiling
Knowing the business process you want to analyze usually points to the data source(s) to transform. For example, to analyze sales trends, you’d need to access the customer database and the product database, and then pull in sales results from a point-of-sale system. Once the data sources are known, you can extract the raw data into a usable format.
Before jumping into transformation, data profiling can help you understand the state of that raw data and the amount of work needed to make it ready for analysis. Data profiling is simply the practice of getting to know your data before transforming it. You’ll want to know the size of the data set you’re working with: the number of rows, the column headers, the data type of each column, the relationships between columns, the range of values in each column, and the frequency of missing or junk data.
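As a sketch of what that first look can involve, the pandas snippet below computes most of the profile attributes listed above; the file name is a placeholder for whatever raw extract you are working with:

```python
import pandas as pd

df = pd.read_csv("raw_extract.csv")  # placeholder path for your raw data

print(df.shape)         # number of rows and columns
print(df.dtypes)        # data type of each column
print(df.describe())    # range of values in each numeric column
print(df.isna().sum())  # frequency of missing values per column
print(df.nunique())     # distinct values per column, a hint at candidate keys
```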
3. Cleanse: When Your Data Needs a Bath
Armed with insights from data profiling, you can better understand how much and what kind of data transformation work you need to do on the data to make it usable. For example, if the source data’s date fields are in the YYYY/MM/DD format, and your target date fields are in the MM-DD-YYYY format, you’ll have to transform the source date fields to match the target format. Or, if some columns show a large frequency of missing values or junk data, you may need to have a conversation with business stakeholders to determine if you want to estimate values for missing data or exclude these records.
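Both fixes are straightforward in pandas. The sketch below assumes a hypothetical order_date column holding dates in the source format described above:

```python
import pandas as pd

df = pd.DataFrame({"order_date": ["2019/03/05", "2019/03/06", None]})

# Parse the source YYYY/MM/DD format; unparseable values become NaT.
parsed = pd.to_datetime(df["order_date"], format="%Y/%m/%d", errors="coerce")

# Render the MM-DD-YYYY target format.
df["order_date"] = parsed.dt.strftime("%m-%d-%Y")

# Per the stakeholder decision: exclude records with missing dates here,
# or impute an estimated value instead.
df = df.dropna(subset=["order_date"])
```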
Cleansing data early in the data transformation process ensures that obviously bad data never reaches end users and improves business users’ confidence in the data.
4. Conform Data to the Target Format
The previous three steps set the stage for transforming the data into the target format, also known as conforming the data. Here, the data transformation team’s knowledge of the source data meets the users’ need for data attributes to analyze a business process. Starting with mapping source columns to target columns, the data transformation team then uses ETL tools to automate the data flow for those columns on successive data loads.
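In its simplest form, that source-to-target mapping can be expressed directly in code. The sketch below uses hypothetical column names to show the idea:

```python
import pandas as pd

# Hypothetical mapping from cryptic source columns to conformed target names.
COLUMN_MAP = {
    "cust_nm": "customer_name",
    "prod_cd": "product_code",
    "sls_amt": "sales_amount",
}

source = pd.read_csv("pos_extract.csv")  # placeholder source extract

# Keep only the mapped columns and rename them to the target schema,
# so every successive load conforms to the same format.
conformed = source[list(COLUMN_MAP)].rename(columns=COLUMN_MAP)
```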
One of the great benefits of conformed data is that it breaks down the silos between data that have long vexed business intelligence users. Data siloed between multiple data stores usually requires a lot of work to merge in order to create a meaningful analysis. Conformed data requires much less pre-processing and frees up analysts for more value-added work.
5. Build Dimensions, Then Facts
As we mentioned above, dimensions put context around the data; facts explain what happened within the dimensional context. For example, customers, products, and dates could be dimensions; sales results and measures could be facts.
The benefit of loading dimension tables first is that newly-loaded fact records can then link to meaningful dimensional records. Sales data would not be very useful if the linkages to customer, product, and date dimensions didn’t exist. So, customer, product, and date dimensions need to be updated first with every data load, followed by the sales fact table.
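The pandas sketch below shows why the order matters: once the customer dimension is loaded, each incoming fact row can swap its natural key for the dimension’s surrogate key (all names are illustrative):

```python
import pandas as pd

# Customer dimension, loaded first and keyed by a surrogate customer_key.
dim_customer = pd.DataFrame({
    "customer_key": [1, 2],
    "customer_id": ["C100", "C200"],  # natural key from the source system
})

# Incoming sales records carry only the natural key.
sales = pd.DataFrame({
    "customer_id": ["C100", "C200", "C100"],
    "sales_amount": [120.0, 75.5, 30.0],
})

# Resolve each fact row to its dimension record; a failed lookup would
# surface as a null customer_key, which a quality screen can catch.
fact_sales = (
    sales.merge(dim_customer, on="customer_id", how="left")
         .drop(columns="customer_id")
)
```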
Based on stakeholder engagement performed before data transformation, you may discover a need to create special fact tables of common aggregations to save users time in analysis. This is where the periodic and accumulating snapshot facts discussed earlier may come into play. For example, you could pre-aggregate sales volume by product and week/month/quarter.
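As a sketch, a weekly periodic snapshot of sales by product might be built like this (column names carried over from the earlier examples):

```python
import pandas as pd

fact_sales = pd.DataFrame({
    "order_date": pd.to_datetime(["2019-01-02", "2019-01-09", "2019-01-10"]),
    "product_code": ["P1", "P1", "P2"],
    "sales_amount": [100.0, 40.0, 60.0],
})

# Periodic snapshot: total sales per product per calendar week.
weekly_snapshot = (
    fact_sales
    .groupby(["product_code", pd.Grouper(key="order_date", freq="W")])
    ["sales_amount"]
    .sum()
    .reset_index()
)
```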
6. Record Audit and Data Quality Events
Tracking audit and data quality metrics during the data transformation process provides big benefits. Audit tracking captures the number of records loaded at each step of the transformation process, and the time at which those steps happened. Data quality tests include the following types of screens performed on inbound data:
- Column: Does the column include nulls, junk data, or out-of-bounds values?
- Structure: Do the relationships between columns and tables hold? For example, do all sales fact records link back to a valid customer record?
- Business Rule: Does the data set as a whole pass sanity checks? For instance, does the current data load’s sales volume fall within a statistically plausible range of previous loads?
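A minimal sketch of all three screen types in Python, assuming the fact and dimension tables from the earlier examples:

```python
import pandas as pd

def column_screen(df: pd.DataFrame) -> dict:
    """Column screen: count nulls and out-of-bounds values."""
    return {
        "null_sales": int(df["sales_amount"].isna().sum()),
        "negative_sales": int((df["sales_amount"] < 0).sum()),
    }

def structure_screen(fact: pd.DataFrame, dim: pd.DataFrame) -> int:
    """Structure screen: count fact rows that fail to link to a valid customer."""
    orphans = ~fact["customer_key"].isin(dim["customer_key"])
    return int(orphans.sum())

def business_rule_screen(current_total: float, history: pd.Series,
                         z: float = 3.0) -> bool:
    """Business-rule screen: flag a load whose sales total deviates from
    the historical mean by more than z standard deviations."""
    return abs(current_total - history.mean()) > z * history.std()
```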
Capturing data quality test results, including them in the audit records for a data load, and linking fact records to audit records makes it possible to reconstruct the lineage of fact data and helps prove the validity of metrics calculated from it.
This approach enables analysts to “work backwards” to answer common stakeholder questions like “Where did this data come from?” and “How do I know these metrics are right?” Having ready and reliable answers to these questions builds user confidence in the transformed data, and puts the data transformation team on solid ground for ongoing engagement with end users.
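One lightweight way to make that lineage traceable, sketched here with assumed field names, is to write one audit record per load and stamp every fact row with its identifier:

```python
import uuid
from datetime import datetime, timezone

import pandas as pd

fact_sales = pd.DataFrame({"sales_amount": [120.0, 75.5]})

# One audit record per data load (fields are illustrative).
audit_record = {
    "audit_id": str(uuid.uuid4()),
    "loaded_at": datetime.now(timezone.utc).isoformat(),
    "rows_loaded": len(fact_sales),
    "screens_passed": True,  # outcome of the quality screens above
}

# Stamping every fact row with the audit_id lets analysts work backwards
# from a metric to the exact load that produced it.
fact_sales["audit_id"] = audit_record["audit_id"]
```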
7. Continually Engage the User Community
The ultimate measure of the value of data transformation is the extent to which the target user community accepts and continually uses the transformed data asset. So, making newly-minted, conformed data available to end users isn’t the end of your data transformation; it’s only the end of the beginning. Transformed data must undergo rigorous user acceptance testing, and the data transformation team must quickly address defects found by business users “in the wild.”
Today’s proliferation of data is a potential gold mine for businesses. Yet, like gold, this data must be carefully extracted, examined, refined, and delivered to maximize its value. Understanding the basics of data transformation, like dimensional modeling, profiling, cleansing, conforming, testing, and presentation, will put you in a position to unearth valuable insights from your data that can have a big impact on your business.
Import.io offers the world’s only web data integration platform for your business. The platform identifies, extracts, transforms, integrates, and consumes data from any website to provide valuable insights for organizations. Want to try it out? You can sign up for a seven-day free trial, or have us do the work for you.
Recommended Reading
What is Data Cleansing and Transformation/Wrangling?