20 questions to detect fake data scientists

Now that the Data Scientist is officially the sexiest job of the 21st century, everyone wants a piece of the pie.

That means there are a few data posers out there. People who call themselves Data Scientists, but who don’t actually have the right skill set.

This isn’t always done out of a desire to deceive. The newness of data science and the lack of a widely understood job description mean that many people may think they are data scientists purely because they deal with data.


“Fake data scientists are often experts in one particular discipline and insist that their discipline is the one and only true data science. That belief misses the point that data science refers to the application of the full arsenal of scientific tools and techniques (mathematical, computational, visual, analytic, statistical, experimental, problem definition, model-building and validation, etc.) to derive discoveries, insights, and value from data collections.”

Kirk Borne, Principal Data Scientist at Booz Allen Hamilton and founder of RocketDataScience.org

The first way to detect fake Data Scientists is to understand the skill set you should be looking for. Knowing the difference between a Data Scientist, a Data Analyst and a Data Engineer is important, especially if you’re planning on hiring one of these rare specimens.

To help you sort the true data scientist from the fake (or misguided) one, we’ve compiled a list of 20 interview questions you can ask when interviewing data scientists.

  1. Explain what regularization is and why it is useful.
  2. Which data scientists do you admire most? Which startups?
  3. How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?
  4. Explain what precision and recall are. How do they relate to the ROC curve?
  5. How can you prove that one improvement you’ve brought to an algorithm is really an improvement over not doing anything?
  6. What is root cause analysis?
  7. Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples.
  8. What is statistical power?
  9. Explain what resampling methods are and why they are useful. Also explain their limitations.
  10. Is it better to have too many false positives, or too many false negatives? Explain.
  11. What is selection bias, why is it important and how can you avoid it?
  12. Give an example of how you would use experimental design to answer a question about user behavior.
  13. What is the difference between “long” and “wide” format data?
  14. What method do you use to determine whether the statistics published in an article (e.g. newspaper) are either wrong or presented to support the author’s point of view, rather than correct, comprehensive factual information on a specific subject?
  15. Explain Edward Tufte’s concept of “chart junk.”
  16. How would you screen for outliers and what should you do if you find one?
  17. How would you use extreme value theory, Monte Carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
  18. What is a recommendation engine? How does it work?
  19. Explain what a false positive and a false negative are. Why is it important to differentiate these from each other?
  20. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How would you efficiently represent 5 dimensions in a chart (or in a video)?
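
Questions 4, 10 and 19 all rest on the same four counts from a binary classifier's confusion matrix. As a minimal sketch (the function names and the toy labels below are illustrative assumptions, not part of the original list), precision and recall can be computed in plain Python like this:

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data, made up purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]

print(confusion_counts(y_true, y_pred))
print(precision_recall(y_true, y_pred))
```

A false positive lowers precision and a false negative lowers recall, so whether one is "worse" than the other depends on which of those two costs matters more in the application at hand.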

“A “real” data scientist knows how to apply mathematics, statistics, how to build and validate models using proper experimental designs. Having IT skills without statistics skills makes you a data scientist as much as it makes you a surgeon to know how to build a scalpel.”

~ Lisa Winter, Senior Analyst at Towers Watson

How do you quantify a real data scientist?

Let us know in the comments below!


I think using the term “Fake” data scientist is in itself misleading. Data science is as broad a subject as the data itself. I think the only qualifier here is to discern anyone claiming that their expertise is the ONLY true data science. In that regard, the interview questions you posted are rendered useless, because they can be constrained to just one particular field and not applicable to another. The article assumes that the author has definitively defined what data is (and therefore data science), what information can be collected from the data, and further what knowledge can be derived from the information. There is data everywhere, and there is no such thing as useless data just because you have not learned anything from it (Dark Data).

I wonder why it is more critical to demonstrate why you have a favorite data scientist or that you have read Tufte but not that you have the slightest idea of how to write good code.

I know a few data scientists who would be misclassified by some of the domain-specific questions above.

They would ask instead:

1. How would you use Partial Least Squares to estimate the titer in a biotechnology upstream process? (domain-specific statistics)

2. Could you explain how plant historians achieve their extraordinary data compression? (domain-specific computer science)

Yep. A lot of good data scientists would be incorrectly labeled as fake too.

Great comments gentlemen…

My 2 cents in this context: The question “How do you Quantify a Real Data Scientist” in itself sounds very discriminating and unfair. I guess there can be no yardstick to measure the effectiveness of a Data Scientist’s approach and contribution. As it stands, the Data Scientist role spans various specializations, from a basic understanding of Statistics to Advanced Computational techniques and algorithms. An effective Data Scientist is one who can either apply his own limited knowledge of these areas (it is very uncommon for an individual to master Stats, Computational Algorithms, Data visualization techniques and AI) to come up with meaningful and applicable models, or team up with people who can contribute their expertise and give the Data Scientist the inputs he needs to do his part of the job. I am personally concerned that the terms Data Analyst, Data Engineer and Data Scientist are used interchangeably and put into practice without recognizing the boundaries of their roles and responsibilities.

The word “fake” is a bit harsh, as there really is no definitive definition of a “data scientist”. Instead, it is a conglomerate of many skills and methodologies synthesized into a cohesive list that results in a role for which very few people would qualify. 2cents.

The questions are very good, but I can think of a few very effective people who fail the test.

The best way to distinguish real data scientists from fake data scientists: give the person a problem and some data, and ask for a solution. The real data scientists solve the problem; the fake data scientists give you excuses.

Nice list… #19 and 10 are too similar to one another, and they even overlap really with #4, though. I’m also not a big fan of #7, as it’s sort of domain specific… I do a lot of health mining and while I could probably get a good answer out I feel like it’s got a very strong industry bias.

The rest seem pretty relevant, although sometimes academic, and this highlights the overlap with statistics a bit much. I think you’ll get a lot of statisticians who don’t really consider themselves “data scientists” in the current sense, breezing through these.

If I were qualifying a “real” data scientist, I’d worry about a few things: No questions about programming? No questions about specific models – SVM, GBM, linear regression or how to choose which model (Question #3 touches this, but not deeply – I guess ANOVA and similar would go there). #15 and #20 touch on visualization, but I may sacrifice one of those to hit scientific visualization that goes beyond “chart junk”, no disrespect to Tufte. Nothing at all on clustering, or dimensionality reduction (k-means, Hierarchical Clustering, SOM, t-SNE, PCA) which I’d list as important concepts. Nothing on deep learning, neural networks. Nothing on how to deal with abnormal data — raw text mining, image mining.

Those are the areas I see being “data science” specific that don’t fall squarely under more traditional statistics or modeling, but the line is very mushy and it depends a lot on what you’re trying to accomplish.

Actually, we need to distinguish between Data Science and Decision Science….
The million-dollar question is whether you can build real-world Predictive Analytics Models and operationalize them.
If yes, you are an Accomplished Decision Scientist….
