Data – a collection of facts (numbers, words, measurements, observations, etc.) that has been translated into a form that computers can process
Whichever industry you work in, or whatever your interests, you will almost certainly have come across a story about how “data” is changing the face of our world. It might be part of a study helping to cure a disease, boost a company’s revenue, make a building more efficient or be responsible for those targeted ads you keep seeing.
In general, data is simply another word for information. But in computing and business (most of what you read about in the news when it comes to data – especially if it’s about Big Data), data refers to information that is machine-readable as opposed to human-readable.
Humans vs Machines
Human-readable (also known as unstructured data) refers to information that only humans can interpret and study, such as an image or the meaning of a block of text. If it requires a person to interpret it, that information is human-readable.
Machine-readable (or structured data) refers to information that computer programs can process. A program is a set of instructions for manipulating data. And when we take data and apply a set of programs, we get software. In order for a program to perform instructions on data, that data must have some kind of uniform structure.
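To make the distinction concrete, here is a minimal sketch (the field names and values are invented for illustration) showing the same information in human-readable and machine-readable form. Because the structured records share a uniform shape, a program can process many of them at once:

```python
# Human-readable: free text that needs a person to interpret it.
note = "Departed Norfolk at dawn; strong westerly winds; heading south-east."

# Machine-readable: the same facts as a uniformly structured record.
# (Field names are illustrative, not from any real schema.)
records = [
    {"port": "Norfolk", "wind": "W", "heading_deg": 135},
    {"port": "Boston", "wind": "NE", "heading_deg": 180},
]

# A program can now apply one instruction to every record.
south_bound = [r["port"] for r in records if 90 <= r["heading_deg"] <= 270]
print(south_bound)  # ['Norfolk', 'Boston']
```

This is essentially what Maury did by hand: once each log entry became a record with the same fields, the whole collection could be processed en masse.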
For example, US Naval Officer Matthew Maury turned years of old hand-written shipping logs (human-readable) into a large collection of coordinate routes (machine-readable). He was then able to process these routes en masse, reducing the average naval journey by 33%.
Data in the news
When it comes to the types of structured data that are in Forbes articles and McKinsey reports, there are a few different types which tend to get the most attention…
Personal data is anything that is specific to you. It covers your demographics, your location, your email address and other identifying factors. It’s usually in the news when it gets leaked (like the Ashley Madison scandal) or is being used in a controversial way (when Uber worked out who was having an affair).
Lots of different companies collect your personal data (especially social media sites). Any time you enter your email address or credit card details, you are giving away your personal data. Often they’ll use that data to provide you with personalized suggestions to keep you engaged. Facebook, for example, uses your personal information to suggest content you might like to see based on what other people similar to you like.
In addition, personal data is aggregated (to depersonalize it somewhat) and then sold to other companies, mostly for advertising and competitive research purposes. That’s one of the ways you get targeted ads and content from companies you’ve never even heard of.
Transactional data is anything that requires an action to collect. For instance, you might click on an ad, make a purchase, visit a certain web page, etc.
Pretty much every website you visit collects transactional data of some kind, either through Google Analytics, another third-party system, or its own internal data-capture system.
Transactional data is incredibly important for businesses because it helps them to expose variability and optimize their operations for the highest quality results. By examining large amounts of data, it is possible to uncover hidden patterns and correlations. These patterns can create competitive advantages, and result in business benefits like more effective marketing and increased revenue.
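As a toy illustration of this kind of pattern-finding (the events and amounts below are made up), aggregating transactions by marketing channel can reveal which channel actually drives revenue:

```python
from collections import defaultdict

# Toy transactional data: each event records a customer action.
transactions = [
    {"channel": "email", "amount": 40.0},
    {"channel": "search", "amount": 25.0},
    {"channel": "email", "amount": 60.0},
    {"channel": "social", "amount": 10.0},
]

# Aggregate revenue per channel to expose the hidden pattern.
revenue = defaultdict(float)
for t in transactions:
    revenue[t["channel"]] += t["amount"]

best = max(revenue, key=revenue.get)
print(best, revenue[best])  # email 100.0
```

Real analyses run the same idea over millions of events, but the principle is identical: group, aggregate, compare.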
Web data is a collective term which refers to any type of data you might pull from the internet. That might be data on what your competitors are selling, published government data, football scores, etc. It’s a catchall for anything you can find on the web that is public facing (i.e. not stored in some internal database). Studying this data can be very informative, especially when communicated well to management.
Web data is important because it’s one of the main ways businesses can access information they don’t generate themselves. To build quality business models and make sound BI decisions, businesses need to know both what is happening inside their organization and what is happening in the wider market.
Web data can be used to monitor competitors, track potential customers, keep track of channel partners, generate leads, build apps, and much more. Its uses are still being discovered as the technology for turning unstructured data into structured data improves.
Web data can be collected by writing your own web scrapers, using a scraping tool, or paying a third party to do the scraping for you. A web scraper is a computer program that takes a URL as an input and pulls the data out in a structured format – usually a JSON feed or CSV.
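The core of a scraper is the step that turns markup into rows. Here is a minimal sketch using only the standard library; the page structure, the `price` class name, and the hard-coded HTML (standing in for a response body fetched from a URL) are all invented for illustration:

```python
import csv
import io
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collects the text of every <li class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "price") in attrs:
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item and data.strip():
            self.items.append(data.strip())

# Hypothetical page fragment in place of a live HTTP response.
html = '<ul><li class="price">9.99</li><li class="price">12.50</li></ul>'
parser = PriceParser()
parser.feed(html)

# Emit the extracted values as structured CSV.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["price"])
for price in parser.items:
    writer.writerow([price])
print(out.getvalue().strip())
```

Production scrapers usually lean on dedicated libraries for fetching and parsing, but the pipeline is the same: fetch, extract, write out a structured format.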
Sensor data is produced by connected objects – the network often referred to as the Internet of Things. It covers everything from your smartwatch measuring your heart rate to a building fitted with external sensors that measure the weather.
So far, sensor data has mostly been used to help optimize processes. For example, Air Asia saved $30-50 million by using GE sensors and technology to help reduce operating costs and increase aircraft usage. By measuring what is happening around them, machines can make smart changes to increase productivity and alert people when they are in need of maintenance.
When does data become Big Data?
Technically all of the types of data above contribute to Big Data. There’s no official size that makes data “big”. The term simply represents the increasing amount and the varied types of data that are now being gathered as part of data collection.
As more and more of the world’s information moves online and becomes digitized, it means that analysts can start to use it as data. Things like social media, online books, music, videos and the increased amount of sensors have all added to the astounding increase in the amount of data that has become available for analysis.
The thing that differentiates Big Data from the “regular data” we were analyzing before is that the tools we use to collect, store and analyze it have had to change to accommodate the increase in size and complexity. With the latest tools on the market, we no longer have to rely on sampling. Instead, we can process datasets in their entirety and gain a far more complete picture of the world around us.
The importance of data collection
Data collection differs from data mining in that it is a process by which data is gathered and measured. All this must be done before high quality research can begin and answers to lingering questions can be found. Data collection is usually done with software and there are many different data collection procedures, strategies, and techniques. Most data collection is centered on electronic data, and since this type of data collection encompasses so much information, it usually crosses into the realm of big data.
So why is data collection important? It is through data collection that a business or management has the quality information they need to make informed decisions from further analysis, study, and research. Without data collection, companies would stumble around in the dark using outdated methods to make their decisions. Data collection instead allows them to stay on top of trends, provide answers to problems, and analyze new insights to great effect.
The sexiest job of the 21st century?
After data collection, all that data needs to be processed, researched, and interpreted by someone before it can be used for insights. No matter what kind of data you’re talking about, that someone is usually a data scientist.
Data scientists are now among the most sought-after hires. A former Google exec even went so far as to call it the “sexiest job of the 21st century”.
To become a data scientist you need a solid foundation in computer science, modeling, statistics, analytics and math. What sets data scientists apart from traditional job titles is an understanding of business processes and the ability to communicate findings to both business management and IT leaders in a way that can influence how an organization approaches a business challenge.
If you’re interested in learning more about big data, data collection, or want to start taking advantage of all it has to offer, check out these blogs, events, companies and more.
- Flowing Data – run by Nathan Yau, PhD, it has tutorials, visualizations, resources, book recommendations and humorous discussions on challenges faced by the industry
- FiveThirtyEight – run by data-wiz Nate Silver, it offers data analysis on popular news topics in politics, culture, sports and economics
- Edwin Chen – the eponymous blog from the head data scientist at Dropbox offers hands-on tips for using algorithms and analysis
- Data Science Weekly – for the latest news in data science, this is the ultimate email newsletter
- No Free Hunch (Kaggle) – hosts a number of predictive modeling competitions. Their competition and data science blog covers all things related to the sport of data science.
- SmartData Collective – an online community moderated by Social Media Today that provides information on the latest trends in business intelligence, data management, and data collection.
- KDnuggets – a comprehensive resource for anyone with a vested interest in the data science community.
- Data Elixir – a great roundup of data news from across the web; you can get a weekly digest sent straight to your inbox.
- Marcus Borba (CTO Spark) – his feed is stacked with visualizations of complex concepts like the Internet of Things (IoT) and several incarnations of NoSQL
- Lillian Pierson (Author, Data Science for Dummies) – she links to a bevy of informative articles, from news clips on the latest companies taking advantage of Big Data, to helpful blog posts from influencers in both the data science and business space
- Kirk Borne (Principal Data Scientist at BoozAllen) – posts and retweets links to fascinating articles on Big Data and data science
- 40 data mavericks under 40 – this list encompasses the who’s who of the bright and innovative in data and startups
- Udemy – free and paid online courses to teach you everything you need to know
- Code School – learn coding online by following these simple step by step tutorials and courses
- Decoded – essential introduction to code that unlocks the immense potential of the digital world
- Data Camp – build a solid foundation in data science, and strengthen your R programming skills.
- Coursera – partnering with top universities and organizations to offer courses online
- W3schools – has great online tutorials for learning basic coding and data analysis skills.
- OpenRefine – a data cleaning software that allows you to pre-process your data for analysis.
- WolframAlpha – provides detailed responses to technical searches and does very complex calculations. For business users, it presents information charts and graphs, and is excellent for high level pricing history, commodity information, and topic overviews.
- Import.io – turns the unstructured data displayed on web pages into structured tables of data that can be accessed over an API.
- Trifacta – clean and wrangle data from files and databases you couldn’t handle in Excel, with easy-to-use statistical tools
- Tableau – a visualization tool that makes it easy to look at your data in new ways.
- Google Fusion Tables – a versatile tool for data analysis, large data set visualization and mapping.
- Blockspring – get live data, create interactive maps, get street view images, run image recognition, and save to Dropbox with this Google Sheets plugin
- Plot.ly – visualize your data easily to quickly spot trends and insights
- Luminoso – identify the relationships between keywords and concepts within your data set and glean insight about product perception
- BigML – Build a model of your market, with all the variables like pricing, product features and geography