Data Scientist is a relatively new – but already highly sought after – position, so it can be hard to know where Data Analytics ends and Data Science begins. Is it science? Statistics? Programming? Analytics? Black magic? Or some strange and wonderful combination?
Luckily for us, Thomson Nguyen is here to help. In this quick 10-minute presentation – given at last year’s Extract San Francisco – the CEO and Co-founder of Framed Data clearly outlines what makes a true Data Scientist and discusses how they differ from a traditional Analyst.
Data scientists vs data analysts
Do you know the difference between a Data Scientist and a Data Analyst? To be honest, before I started doing research for this post, I’m not sure I really knew either.
Here’s what the Twittersphere thinks:
— Michael E. Driscoll (@medriscoll) July 17, 2012
I think that’s a pretty good one, but some people were more skeptical…
As can be seen, questions and disagreements abound when talking about data scientists and data analysts. What is a data analyst and how are they different from data scientists? Do data analyst qualifications differ that much from data scientist qualifications? Does the difference actually matter in the world of data science, or among businesses for that matter?
I think a lot of the ambiguity – and some of the animosity – is simply because data science is such a new term and a new field. It’s not like being a Data Analyst or a BI Analyst; we’ve had 20 years to understand those job roles. A big data scientist is truly blazing his or her own trail, taking businesses in exciting new directions.
Complicating the problem is that lots of companies have different definitions of what a data scientist is and what they do.
After doing some research, I came up with a theory for what a data scientist is, one that I hope will help to disambiguate the term and differentiate a data scientist from a data analyst.
Understanding the difference
So what does a data analyst do when compared to a data scientist? This Venn diagram (above) is a good first cut at describing how the two jobs overlap and how they differ, providing a good visual aid in the data scientist vs. data analyst discussion. Data analysts are generally well versed in SQL, they know some regular expressions, they can slice and dice data, they can use analytics or BI packages – like Tableau or Pentaho or an in-house analytics solution – and they can tell a story from the data. They should also have some level of scientific curiosity. With these traits in mind, it’s easier to get a clear picture of how to be a data analyst.
On the other end of the spectrum, a Data Scientist will have quite a bit of machine learning and engineering or programming skill and will be able to bend data to his or her will. By specializing in these skills, they make it clear they intend to go down the big data scientist career path.
The T-shaped skill set
Valve Software – a software company in the Seattle area that makes computer games – has a good definition of their ideal employee. It’s this “T”-shaped employee who is a generalist in a variety of different areas but has deep domain experience in one vertical.
That’s how we should think of Data Scientists as well.
A Data Scientist should have a wide breadth of abilities: academic curiosity, storytelling, product sense, engineering experience, business sense and just a catch-all I call cleverness. But he or she should also have deep domain expertise in Statistical and Machine Learning Knowledge.
Let’s look at each of those areas in greater depth…
To me, academic curiosity is a desire to go beneath the surface and distill a problem into a very clear set of hypotheses that can be tested. Much like how scientists in the research lab will have a very amorphous charter of improving science, data scientists in a business will have an amorphous charter of improving their company’s product somehow.
He or she will use this academic curiosity to look at the available data sets and sources to figure out an experiment or a model that solves one of the company’s problems. In a sense, big data scientists can be regarded as extensive problem solvers with a keen eye toward improving the business.
Storytelling is the ability to communicate your findings effectively to non-technical stakeholders.
For example, Mosaic took the entire UK population and ran a machine learning model over it. Based on what they found, they were able to split the entire UK population into 61 clusters. But if you have 61 different clusters, you need a good (easy to explain) way to differentiate between each cluster.
One of those categories is called Golden Empty Nesters, which is a good title because without me explaining anything to you, it evokes some sort of image about the person who would fit into it. Specifically, they are financially secure couples, many close to retirement, living in sought after suburbs.
Think about the complexity of the information involved in big data analytics. A data analyst may be able to interpret that data and explain it to those already in the data science field, but often it takes a data scientist to turn the numbers into a worthwhile storytelling opportunity. This ability to distill a quantitative result from a machine learning model into something (be it words, pictures, charts, etc) that everyone can understand immediately is actually a very important skill for data scientists. Some data analysts may also engage in storytelling, but data scientists must truly excel in it.
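To make the clustering-then-naming idea concrete, here is a minimal sketch in Python. It is not how Mosaic works: the data points, the starting centroids, the thresholds and the "Young Starters" label are all made up for illustration, and plain k-means stands in for whatever model was really used. The point is only the last step, where a numeric cluster center is turned into a name a stakeholder can picture.

```python
from statistics import mean

def kmeans(points, centroids, rounds=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(rounds):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            (mean(x for x, _ in g), mean(y for _, y in g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups

def describe(centroid):
    """Turn a cluster center into a human-friendly label (hypothetical rules)."""
    age, savings = centroid
    if age >= 55 and savings >= 100:
        return "Golden Empty Nesters"
    return "Young Starters"

# Hypothetical people as (age, savings in thousands)
people = [(25, 5), (30, 10), (28, 8), (60, 200), (65, 250), (62, 220)]
centroids, groups = kmeans(people, centroids=[people[0], people[3]])
labels = [describe(c) for c in centroids]
print(labels)
```

The model only produces two centers; the storytelling is the `describe` step, which is where a data scientist decides what to call each group.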
Product sense is the ability to use the story to create a new business product or change an existing product in a way that improves company goals and metrics. In this way, data scientists can demonstrate just how valuable they are to the business side of things, especially if some within the organization wonder, “What does a data scientist do exactly?”
As a Data Scientist at, say, Amazon, it’s not enough to have built a collaborative filter to create a recommendation engine; you should also know how to mold it into a product. For example, the “customers who bought this item also bought” section is an 800 by 20 pixel box which outlines the result of this machine learning model in a way that is visually appealing to customers.
Even if you’re not the product manager – or the engineer that creates these products – as a Data Scientist, whatever you create, in code or in algorithms, will need to translate into one of these products. So having a good sense of what that might look like will get you a long way.
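As a rough illustration of what sits behind a "customers who bought this item also bought" box, here is a small Python sketch. It assumes a simple co-occurrence count over shopping baskets, which is a stripped-down stand-in for item-based collaborative filtering and not Amazon's actual system. The basket data and the function names `build_also_bought` and `also_bought` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_also_bought(baskets):
    """Count, for every item, how often each other item shares a basket with it."""
    co_counts = defaultdict(Counter)
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            co_counts[a][b] += 1
    return co_counts

def also_bought(co_counts, item, k=3):
    """Return up to k items most often bought together with `item`."""
    counts = co_counts.get(item, Counter())
    # Sort by count (highest first), breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:k]]

# Hypothetical purchase history
baskets = [
    ["kettle", "tea", "mug"],
    ["kettle", "tea"],
    ["tea", "mug"],
    ["kettle", "toaster"],
]
co_counts = build_also_bought(baskets)
print(also_bought(co_counts, "kettle", k=2))  # ['tea', 'mug']
```

The parameter `k` is where product sense comes in: the model can rank every item in the catalogue, but the box on the page only has room for a handful.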
Statistical and machine learning knowledge
Statistical and machine learning knowledge is the domain expertise required to acquire data from different sources, create a model, optimize its accuracy, validate its purpose and confirm its significance. This is the deep domain expertise in the T-shaped Data Scientist I mentioned earlier.
As a Data Scientist, if you know nothing else, you need to know how to take some data, munge it, clean it, filter it, mine it, visualize it and then validate it. It’s a very long process, but one only a big data scientist can do.
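A toy version of that munge, clean, model and validate loop might look like the Python sketch below. Everything in it is an assumption for illustration: the messy (ad spend, revenue) rows are invented, the model is a plain least-squares line, and validation is a single holdout split rather than anything more rigorous.

```python
import random
import statistics

# Hypothetical raw rows: (ad spend, revenue) as messy strings
raw = [("10", "25"), (" 20 ", "41"), ("30", ""), ("40", "83"), ("n/a", "50"),
       ("50", "99"), ("60", "121"), ("70", "142"), ("80", "161")]

def clean(rows):
    """Munge and clean: strip whitespace, parse numbers, drop unparseable rows."""
    out = []
    for x, y in rows:
        try:
            out.append((float(x.strip()), float(y.strip())))
        except ValueError:
            continue  # missing or non-numeric value
    return out

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    xs, ys = zip(*points)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def mean_abs_error(model, points):
    """Validate: average absolute prediction error on the given points."""
    slope, intercept = model
    return statistics.mean(abs(y - (slope * x + intercept)) for x, y in points)

data = clean(raw)
random.Random(0).shuffle(data)          # fixed seed so the split is repeatable
train, holdout = data[:5], data[5:]     # fit on one part, validate on the rest
model = fit_line(train)
print(len(data), "clean rows; holdout error:", round(mean_abs_error(model, holdout), 2))
```

Even at this scale the shape of the work is visible: most of the code deals with getting the data into a usable state, and the model itself is a few lines.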
Engineering Experience refers to the coding chops necessary to implement and execute statistical models.
For a lot of big companies this means knowing intense amounts of Scala, Java, Clojure, etc. to deploy your models into production. For startups this can be as simple as implementing a model in R.
R is a great language for scaffolding models and visualization, but it’s not so great for writing production-ready code – it breaks whenever you throw anything more than 10 megabytes in front of it.
But it’s a great language to set up a proof of concept, and the ability to create something out of nothing and to prove that it works is a skill that I think most data scientists ought to have.
The last skill on my list I call cleverness, or the creativity to do all these things on a deadline or on constrained resources.
The difference between research scientists in academia and Data Scientists in the real world is that scientists in academia (given funding) have all the time in the world to figure out problems. The whole point of academia is to move the boundary of knowledge forward at all costs.
The goal of a Big Data Scientist in a startup or a tech company is to move the product forward at minimal cost, yesterday. So the ability to take on deadlines, constrained resources – even your company’s political climate – and push a product out in a reasonable amount of time is a really important skill. The best big data scientists will internalize this trait and always be ready to meet and overcome these business demands and obstacles.
A little something extra
If you want more information on what being a Data Scientist means or how to build Data Science teams, O’Reilly has three great pamphlets:
- Analyzing the Analyzers is a meta analysis of the 21 different types of Data Scientists
- Building Data Science Teams is a great tutorial on how to build a data science team once you’ve identified in your company that data is something you want to double down on
- Data Jujitsu shows you how to turn data into actual products.
On a more abstract level…
- Competing on Analytics is a general business book on exactly how data turns companies into more valuable companies
- Data Driven covers not just Data Analytics and Data Science, but also Data Warehousing, Data Project Management and a whole host of other data related stuff
About the author
Thomson Nguyen is the founder and CEO of Framed Data, a predictive analytics product. They use machine learning to analyze your analytics and tell you which customers are about to churn and for what reasons. They are used by SaaS companies, mobile apps, game developers and more to determine which of their users are at a very high risk of leaving.
What is Extract?
Extract is one full day jam-packed with data stories that will entertain, educate and inspire you. It’s everything you’ve ever wanted to know about data, told by the people who know it best. Our speakers hail from some of the most successful and innovative companies in the business. You’ll hear data-driven talks on everything from beating the competition to creating the next unicorn. And our workshops will showcase the best of the best in data tooling. You’ll get an exclusive look at some of the latest technologies and pick up first-hand tips on implementing new strategies.