March 16, 2020

Building a Naive Bayes classifier from scratch with NumPy

Goal: While learning about Naive Bayes classifiers, I decided to implement the algorithm from scratch to help solidify my understanding of the math. So the goal of this notebook is to implement a simplified and easily interpretable version of the sklearn.naive_bayes.MultinomialNB estimator which produces identical results on a sample dataset. While I generally find scikit-learn documentation very helpful, its source code is a bit trickier to grok, since it optimizes for efficiency, both computational and in maintenance, across a wide family of models. Read more

December 2, 2019

A clean way to share results from a Jupyter Notebook

I love Jupyter notebooks. As a data scientist, notebooks are probably the fundamental tool in my daily workflow. They fulfill multiple roles: documenting what I have tried in a lab notebook for the benefit of my future self, and also serving as a self-contained format for the final version of an analysis, which can be committed to our team Git repo and then discovered or reproduced later by other members of the team. Read more

November 25, 2019

Can you run an A/B test with unequal sample sizes?

I got an interesting question from a PM this week, asking if we could run an experiment with a traffic allocation of 10% to control and 90% to the variation, rather than a traditional 50–50 split. Most sample size calculators, including our own internal one, assume an equal split between 2+ variations, so I had to take a step back to answer this question. TL;DR You can run an experiment with an unequal allocation (e. Read more
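As a rough illustration of what a 10/90 split costs, here is a minimal sketch that compares the total sample size needed under an equal and an unequal allocation. It assumes a two-sided two-sample z-test on means with a common standard deviation; the function name `total_sample_size` and the example values for `delta` and `sigma` are hypothetical and do not come from the post.

```python
from statistics import NormalDist


def total_sample_size(delta, sigma, control_share=0.5, alpha=0.05, power=0.8):
    """Total sample size (both arms) for a two-sided two-sample z-test.

    Assumes both arms share the same standard deviation `sigma` and that
    `control_share` is the fraction of traffic sent to control.
    """
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha / 2) + z(power)
    p = control_share
    # Var(difference in means) = sigma^2 * (1/n_c + 1/n_t)
    #                          = sigma^2 / (N * p * (1 - p))
    return (z_sum * sigma / delta) ** 2 / (p * (1 - p))


# Hypothetical effect size and standard deviation, for illustration only.
equal = total_sample_size(delta=0.1, sigma=1.0, control_share=0.5)
unequal = total_sample_size(delta=0.1, sigma=1.0, control_share=0.1)

# The ratio depends only on the split: 0.25 / (p * (1 - p)).
inflation = unequal / equal
print(f"equal split: {equal:.0f}, 10/90 split: {unequal:.0f}, ratio: {inflation:.2f}")
```

Under these assumptions the inflation factor depends only on the allocation, so a 10/90 split needs about 2.78 times the total traffic of a 50/50 split to reach the same power.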

November 11, 2019

Planning A/B tests with a symmetric risk profile (α=β)

Here is a somewhat unconventional recommendation for the design of online experiments: Set your default parameters for alpha (α) and beta (β) to the same value. This implies that you incur equal cost from a false positive as from a false negative. I am not suggesting you necessarily use these parameters for every experiment you run, only that you set them as the default. As humans, we are inescapably influenced by default choices¹, so it is worthwhile to pick a set of default risk parameters that most closely match the structure of our decision-making. Read more
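To give a feel for what the symmetric default costs in traffic, the sketch below compares per-arm sample sizes for a conventional α=0.05, β=0.20 design against α=β=0.05. It assumes a two-sided two-sample z-test on means with equal allocation and a common standard deviation; `per_arm_n` and the example `delta` and `sigma` values are hypothetical, not taken from the post.

```python
from statistics import NormalDist


def per_arm_n(delta, sigma, alpha, beta):
    """Per-arm sample size for a two-sided two-sample z-test, equal split.

    `alpha` is the false positive rate, `beta` the false negative rate
    (so power is 1 - beta). Both arms are assumed to share `sigma`.
    """
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha / 2) + z(1 - beta)
    return 2 * (z_sum * sigma / delta) ** 2


# Hypothetical effect size and standard deviation, for illustration only.
conventional = per_arm_n(delta=0.1, sigma=1.0, alpha=0.05, beta=0.20)
symmetric = per_arm_n(delta=0.1, sigma=1.0, alpha=0.05, beta=0.05)

print(f"alpha=0.05, beta=0.20: {conventional:.0f} per arm")
print(f"alpha=0.05, beta=0.05: {symmetric:.0f} per arm")
```

With these inputs the symmetric design needs roughly 2,600 observations per arm versus roughly 1,570 for the conventional one, which is the price of treating a missed effect as seriously as a false alarm.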

October 21, 2019

Making beautiful experiment visualizations with Matplotlib

Netflix recently posted an article on their tech blog titled Reimagining Experimentation Analysis at Netflix. Most of the post is about their experimentation infrastructure, but their example of a visualization of an experiment result caught my eye. A/B test results are notoriously difficult to visualize in an intuitive (but still correct) way. I’ve searched for best practices before, and the only reasonable template I could find is built for Excel, which doesn’t fit my Python workflow. Read more

© Geoff Ruddock 2019