# Scraping unlisted stock prices with BeautifulSoup

##### Dec 14, 2019
There are a number of good free sources for market data, such as Yahoo Finance or Google Finance. It is easy to pull this data into Python using something like the yfinance package. But these sources generally only contain data for currently listed stocks.

# A clean way to share results from a Jupyter Notebook

##### Dec 02, 2019
I love Jupyter notebooks. As a data scientist, notebooks are probably the fundamental tool in my daily workflow. They fulfill multiple roles: documenting what I have tried in a lab notebook for the benefit of my future self, and also serving as a self-contained format for the final version of an analysis, which can be committed to our team git repo and then discovered or reproduced later by other members of the team.

# Can you run an A/B test with unequal sample sizes?

##### Nov 25, 2019
I got an interesting question from a PM this week, asking if we could run an experiment with a traffic allocation of 10% to control and 90% to the variation, rather than a traditional 50–50 split. TL;DR: Probably best to keep it 50–50, since your typical A/B test design involves enough factors to consider already.
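A quick sketch of why the skewed split costs you: for a fixed total sample size, the variance of the estimated difference in means is proportional to 1/n_control + 1/n_variant, which is minimized by an equal split. The numbers below are illustrative, not from the post.

```python
def variance_factor(total, control_share):
    """Relative variance of the estimated difference for a given split."""
    n_control = total * control_share
    n_variant = total * (1 - control_share)
    return 1 / n_control + 1 / n_variant

equal = variance_factor(10_000, 0.5)   # 50-50 split
skewed = variance_factor(10_000, 0.1)  # 10-90 split

print(equal)           # 0.0004
print(skewed / equal)  # ~2.78: the skewed split nearly triples the variance
```

Same total traffic, almost triple the variance, so the 10–90 design needs a much longer runtime to reach the same power.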

# When Python's built-in random module is faster than NumPy

##### Nov 19, 2019
TL;DR: If you need a single random number (or up to 5), use the built-in random module instead of np.random.

**An instinct to vectorize**

An early learning for any aspiring pandas user is to always prefer "vectorized" operations over iteratively looping over individual values in some dataframe. These operations, which include most built-in methods, are implemented in Cython and executed at blazing-fast speeds behind the scenes. It is very often worth the effort of massaging your logic into a slightly less expressive form if you can leverage vectorized functions to avoid the performance hit of for-loops.
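A minimal timing sketch of the claim. Every call into np.random pays fixed overhead (array machinery, dtype handling) that the built-in random module skips, so for a single draw the stdlib wins; for many draws at once, NumPy's vectorized generation wins easily.

```python
import random
import timeit

import numpy as np

# One draw at a time: the built-in module avoids NumPy's per-call overhead.
t_builtin = timeit.timeit(random.random, number=10_000)
t_numpy = timeit.timeit(np.random.random, number=10_000)

# Many draws at once: NumPy's vectorized path is the clear winner.
t_numpy_vec = timeit.timeit(lambda: np.random.random(10_000), number=100)

print(f"builtin, 10k single draws: {t_builtin:.4f}s")
print(f"np,      10k single draws: {t_numpy:.4f}s")
print(f"np,      vectorized draws: {t_numpy_vec:.4f}s")
```

Exact timings vary by machine, but the single-draw gap is typically several-fold in favor of `random.random`.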

# Planning A/B tests with a symmetric risk profile (α=β)

##### Nov 11, 2019
Here is a somewhat unconventional recommendation for the design of online experiments: Set your default parameters for alpha (α) and beta (β) to the same value. This implies that you incur equal cost from a false positive as from a false negative. I am not suggesting you necessarily use these parameters for every experiment you run, only that you set them as the default. As humans, we are inescapably influenced by default choices, so it is worthwhile to pick a set of default risk parameters that most closely match the structure of our decision-making.
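To make the symmetric design concrete, here is a sketch of the standard normal-approximation sample-size formula for a two-sample test of means, with α = β as the default. The function name and default values of 0.1 are my own illustration, not from the post; only the stdlib `statistics.NormalDist` is used.

```python
from statistics import NormalDist

def per_arm_sample_size(delta, sigma, alpha=0.1, beta=0.1):
    """Per-arm n for a two-sample z-test of means (normal approximation).

    Setting alpha == beta encodes the symmetric risk profile: a false
    positive and a false negative are treated as equally costly.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = z.inv_cdf(1 - beta)
    return 2 * (sigma ** 2) * (z_alpha + z_beta) ** 2 / delta ** 2

# e.g. detect a lift of 0.1 standard deviations with alpha = beta = 0.1
n = per_arm_sample_size(delta=0.1, sigma=1.0)
print(round(n))  # 1713 per arm
```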

# Making beautiful experiment visualizations with Matplotlib

##### Oct 21, 2019
Netflix recently posted an article on their tech blog titled Reimagining Experimentation Analysis at Netflix. Most of the post is about their experimentation infrastructure, but their example of a visualization of an experiment result caught my eye. A/B test results are notoriously difficult to visualize in an intuitive (but still correct) way. I've searched for best practices before, and the only reasonable template I could find is built for Excel, which doesn't fit my Python workflow.

# Sampling from an iteratively built array in Python

##### Oct 07, 2019
While coding up a reinforcement learning algorithm in Python, I came across a problem I had never considered before… What's the fastest way to sample from an array while building it? If you're reading this, you should first question whether you actually need to iteratively build and sample from a Python array in the first place. If you can build the array first and then sample a vector from it using np.
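A minimal sketch of the problem setup: at each step we append a value and immediately need a sample from everything accumulated so far. One simple baseline is a plain list with `random.choice`, which is O(1) per draw with no array-rebuilding cost. The toy values below are my own, not from the post.

```python
import random

values = []
samples = []
for step in range(1000):
    values.append(step)                    # iteratively build the array
    samples.append(random.choice(values))  # sample from what we have so far

# Each draw can only have seen values appended up to that step.
print(len(samples))  # 1000
print(all(s <= i for i, s in enumerate(samples)))  # True
```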

# Building a hurdle regression estimator in scikit-learn

##### Sep 16, 2019
Multimodal distributions are commonly seen when analyzing composite variables such as insurance claims, where some large proportion are zero, but the non-zero values take on a distribution of their own. Breaking down these sorts of distributions into their component parts allows us to more effectively model each piece and then recombine them at a later stage.
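A hedged sketch of the recombination step: a classifier estimates P(y > 0) and a regressor estimates E[y | y > 0], and the hurdle prediction is their product. The numbers below are made up for illustration.

```python
# Classifier output: probability each claim is non-zero
p_nonzero = [0.10, 0.80, 0.45]
# Regressor output: expected claim size, given it is non-zero
mu_nonzero = [500.0, 1200.0, 900.0]

# E[y] = P(y > 0) * E[y | y > 0]
predictions = [p * mu for p, mu in zip(p_nonzero, mu_nonzero)]
print(predictions)  # [50.0, 960.0, 405.0]
```

Wrapping this pattern in a class with `fit`/`predict` methods is what makes it slot into the scikit-learn ecosystem.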

# Creating a monthly + daily DAG pattern in Airflow

##### Aug 15, 2019
With a daily schedule, backfilling data from 5 years ago will take days to complete. Running the job less frequently (monthly?) would make backfills easier, but the data would be less fresh. We want to eat our cake and have it too. We can achieve this by creating two separate DAGs—one daily and one monthly—using the same underlying logic.
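The pattern can be sketched with a small DAG factory; the function and task names here are my own, and the shared callable is left as a stub, so treat this as a shape rather than a drop-in DAG file.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_data(**context):
    """Shared job logic, parameterized by the schedule's data interval."""
    ...  # extract/load for the dates covered by this run

def build_dag(dag_id, schedule):
    dag = DAG(
        dag_id=dag_id,
        schedule_interval=schedule,
        start_date=datetime(2014, 1, 1),
        catchup=True,
    )
    PythonOperator(task_id="load_data", python_callable=load_data, dag=dag)
    return dag

# One definition, two schedules: monthly for cheap backfills,
# daily for fresh data going forward.
monthly_dag = build_dag("load_data_monthly", "@monthly")
daily_dag = build_dag("load_data_daily", "@daily")
```

The backfill then runs ~60 monthly tasks instead of ~1,800 daily ones, while new data still lands every day.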

# One-hot encoding + linear regression = multi-collinearity

##### Jul 29, 2019
You don't need to be a dummy to fall for the 'Dummy Variable Trap' while fitting a linear model, especially if you are using default parameters for one-hot encoding in scikit-learn. By default, `OneHotEncoder` sets the parameter `drop=None`, which causes it to emit $k$ columns for a feature with $k$ categories. When these are then used to fit a linear model with an intercept, we have perfect multicollinearity, and the model overfits the data using unrealistic coefficients.
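A small NumPy sketch of why keeping all $k$ columns breaks things: the dummy columns always sum to one, which is exactly the intercept column, so the design matrix is rank-deficient. The toy data is my own.

```python
import numpy as np

categories = np.array([0, 1, 2, 0, 1, 2])
dummies = np.eye(3)[categories]          # one-hot with all k columns kept
intercept = np.ones((len(categories), 1))
X = np.hstack([intercept, dummies])      # design matrix with intercept

print(np.linalg.matrix_rank(X))                 # 3, not 4: perfectly collinear
print(bool(np.allclose(dummies.sum(axis=1), 1)))  # True: dummies sum to intercept
```

Dropping one level, e.g. `OneHotEncoder(drop='first')` in scikit-learn, removes the redundant column and restores full rank.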