In our quest to make every business decision data-driven, getting the data is only half the battle. Next you’ve got to get people to actually use it.
In this short video, Daniel Frank from Stripe gives away his secret ingredient for getting the team at Stripe to love quality data reporting.
For more great stories like this, check out Extract!
Reproducibility: the secret ingredient for data decisions
The most important thing you can do if you want to be a data-driven company is to make data accessible to everyone in your organization.
Your data science teams shouldn’t be gatekeepers where people go to have their data questions answered from on high. They should be the tool builders and maintainers of a data stack that allows people to make large-scale queries.
To be truly data-driven, you have to empower people to use a self-service model with data. Everyone in your organization should be encouraged to use data when producing their reports or researching a question.
But in order to actually get them to do it – and not hate you – you have to make it easy. This article is about how to get people to actually write reproducible data reports.
A quick anecdote from Stripe
At Stripe, one of the services we offer is called Checkout. It’s essentially a pre-built UI that you can embed on your site, so you don’t have to go through all the work of building one yourself.
This means that at Stripe we own the checkout process. We get to see not just that a transaction happened, but how people move through that process: for example, whether someone opened the checkout window and tried to enter their information in the wrong place.
All this data allows us to optimize conversions by understanding where people get tripped up with different kinds of businesses, something that no individual business would be able to do on their own.
Around 6 months ago we were thinking about adding a big feature to Checkout. It was a bit of a contentious feature, so we decided to do some A/B testing around it. A few days later we got an email that said the new feature increased conversions by 8% and that we should roll it out to everyone. And we just had to take their word for it. There was no data reporting to back it up.
Of course, I completely trust my co-workers, but I still had some questions. Like, what exactly improved by 8%? Was it at 80% and now it’s at 88%? Did you control for different countries? For recurring customers? And there was no way for me to answer these questions.
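To make the ambiguity concrete, here is a minimal sketch of the two readings of "increased conversions by 8%". The function names and the starting rate of 80% are illustrative assumptions borrowed from the question above, not figures from the actual test.

```python
def absolute_lift_points(before: float, after: float) -> float:
    """Change in conversion rate, measured in percentage points."""
    return (after - before) * 100


def relative_lift_percent(before: float, after: float) -> float:
    """Change in conversion rate, measured relative to the starting rate."""
    return (after - before) / before * 100


# Reading 1: the rate moved from 80% to 88%.
# That is about 8 percentage points, but about a 10% relative improvement.
reading_1 = (absolute_lift_points(0.80, 0.88), relative_lift_percent(0.80, 0.88))

# Reading 2: the rate improved by 8% relative to 80%, landing at 86.4%.
# That is about 6.4 percentage points.
reading_2 = (absolute_lift_points(0.80, 0.864), relative_lift_percent(0.80, 0.864))

print(reading_1)
print(reading_2)
```

A report that ships with its code removes the guesswork, because the reader can see which of these two calculations was actually run.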
I realized, though, that it wasn’t really the fault of the person doing the analysis. We (the data team) were responsible for not providing the interface. It was a failure on our part that there wasn’t sufficient tooling to express all the steps the analyst must have gone through.
At Stripe, we want to enable people to write reports that will satisfy any questions that come up. Which made me think:
Wouldn’t it be nice if there were a data workspace that felt totally natural for this person to work in? A place where they could find everything they needed to inform this decision and, with no effort on their part, share not just the results but the methodology as well.
Learning from actual scientists
As data scientists, I think it’s time we take a page out of the book of scientists in the lab. In academia and scientific research you can’t just tell people that something is true; you need to show your methodology.
In classical science, they even demand that your experiments be reproducible. It’s not just about showing your methodology; you have to describe it in sufficient detail that people can reproduce exactly what you did.
In terms of data science, reproducibility means a lot of great things. If you are the report writer, it means you can revisit your original analysis with new data – if you don’t write down how you did it you will probably do it differently a second time. It allows your readers to deep dive into the methodology and understand exactly what you did and how you came to your conclusions. And it allows a new kind of collaboration because anybody can reproduce what you did and push it forward even more.
The open source model
In terms of actually making reproducible reports within your company, the open source model is probably a better fit than the academic one.
Open source projects are reproducible by necessity. You cannot expect two people to work together on a software project if only one of them knows how to run the test. It just doesn’t make any sense.
The ideal would be to harness the programmatic reproducibility that works in open source and use it as a forcing function that moves people towards the ideal of scientific reproducibility.
Putting it into practice
Taking all these ideas from academia and open source, I set out to give the people I worked with at Stripe a workspace where they could do this kind of reproducible data work.
Building on the work of Fernando Perez – creator of IPython – I built a publishing environment where all the analysis is re-executed every time someone wants to create a report or show their work. So they don’t share the results, they share their code. Then I re-run their code, and I share the results that I get from that.
This method means they have to work within the constraints of reproducibility, but in exchange they get a beautiful publishing environment – which gives them an incentive to buy into this process. For the report viewers, they get to see the methodology – which is great – and they’re guaranteed that the results are in fact reproducible.
I’m basically trying to use good tooling as a guide towards good behavior. Users want their report to be published, and we enable that, but on the condition that the report is reproducible. We’re aligning incentives between the users (anybody who wants to write a report) and the consumers (anyone who needs to buy into what that report says).
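The publish-by-re-execution idea can be sketched in a few lines. This is a simplified illustration using only the Python standard library, not the environment built at Stripe; the names `publish` and `PublishedReport` are hypothetical, and a real system would likely run notebooks in a sandbox rather than a bare script.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class PublishedReport:
    source: str  # the methodology: the exact code that was run
    output: str  # the results produced by re-running that code at publish time


def publish(source: str, timeout_s: int = 60) -> PublishedReport:
    """Re-execute the author's code in a fresh interpreter and publish what it prints."""
    completed = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    if completed.returncode != 0:
        # Refuse to publish a report whose code does not run from scratch.
        raise RuntimeError(f"report is not reproducible:\n{completed.stderr}")
    return PublishedReport(source=source, output=completed.stdout)


report = publish("print(1 + 1)")
print(report.output)
```

Because the author hands over code rather than results, anything that only worked on their machine fails at publish time instead of reaching readers.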
In order to get anyone to actually use this reporting system that I created, I had to make sure that it was easy to use. So how do you get anybody to buy into this?
For example, data accessibility is a big thing for many data scientists because data is often scattered all over the place. So my way of automating the process of collecting all that data was to give people a Python function into which they can stick an SQL query and have it “just work”.
That’s not an easy process, and there’s no one size fits all solution, but when you get to the point where everybody realizes that this is the easiest way to get data, you’ve hooked them. And if I can give them this functionality in a nice environment, people will actually be excited to use it.
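As a rough sketch of what such a function might look like, the example below binds the connection details once and hands report authors a single `query` callable. Everything here is an assumption for illustration: `make_query` is a hypothetical helper, the `checkouts` table and its rows are made up, and an in-memory SQLite database stands in for whatever warehouse your organization uses.

```python
import sqlite3
from typing import Any, Callable


def make_query(conn: sqlite3.Connection) -> Callable[..., list[dict[str, Any]]]:
    """Bind connection details once so report authors only ever write SQL."""

    def query(sql: str, params: tuple = ()) -> list[dict[str, Any]]:
        cursor = conn.execute(sql, params)
        columns = [col[0] for col in cursor.description]
        # Return plain dicts so results are easy to inspect or hand to other tools.
        return [dict(zip(columns, row)) for row in cursor.fetchall()]

    return query


# Stand-in for a real warehouse: an in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checkouts (country TEXT, converted INTEGER)")
conn.executemany(
    "INSERT INTO checkouts VALUES (?, ?)",
    [("US", 1), ("US", 0), ("FR", 1)],
)
query = make_query(conn)

rows = query(
    "SELECT country, COUNT(*) AS n FROM checkouts GROUP BY country ORDER BY country"
)
print(rows)
```

The design choice worth noting is that the author never sees credentials, hosts, or drivers; those live in one place that the data team maintains.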
A kind of bonus that’s fallen out of this, which I didn’t entirely anticipate, is that because it was a lot easier to write these reports, a lot more of them started showing up. Stuff that people had previously just done on their own and didn’t bother sharing, they now publish and share around the organization.
This kind of sharing is great for our organization because it means we can get way more depth than we ever could have in an email or a meeting. That kind of detail about things like where we have financial losses with our partners allows the engineers to implement exactly the solutions that are suggested in the report.
Try it yourself
I’ve published some code that’s an approximation of what I built for Stripe. Not all of this is one size fits all, so you’ll need to fill in the details for your own organization. I really encourage you to give it a look, try it out, and submit a pull request. I’m really excited to see more people doing this kind of analysis because I think it really helps push your company forward.
What is Extract?
Extract is one full day jam-packed with data stories that will entertain, educate and inspire you. It’s everything you’ve ever wanted to know about data, told by the people who know it best. Our speakers hail from some of the most successful and innovative companies in the business. You’ll hear data-driven talks on everything from beating the competition to creating the next unicorn. And our workshops will showcase the best of the best in data tooling. You’ll get an exclusive look at some of the latest technologies and pick up first-hand tips on implementing new strategies.