December 2, 2019

A clean way to share results from a Jupyter Notebook

I love jupyter notebooks. As a data scientist, notebooks are probably the fundamental tool in my daily worflow. They fulfill multiple roles: documenting what I have tried in a lab notebook for the benefit of my future self, and also serving as a self-contained format for the final version of an analysis, which can be committed to our team git repo and then discovered or reproduced later by other members of the team. Read more

October 21, 2019

Making beautiful experiment visualizations with Matplotlib

Netflix recently posted an article on their tech blog titled Reimagining Experimentation Analysis at Netflix. Most of the post is about their experimentation infrastructure, but their example of a visualization of an experiment result caught my eye. A/B test results are notoriously difficult to visualize in an intuitive (but still correct) way. I’ve searched for best practices before, and the the only reasonable template I could find is built for Excel, which doesn’t fit my python workflow. Read more

October 7, 2019

Calculating the bearing (angle) between coordinates in Redshift

I fielded an interesting request recently from our PR team, who wanted to generate a creative representation of our data based on the direction and distance of trips booked on our platform. Distance a key attribute of interest for a travel business, so it is naturally easy to retrieve this data. However the direction of a trip is something that had not been previously analyzed, and so it was not available off-the-shelf in our data warehouse. Read more

September 16, 2019

Building a hurdle regression estimator in scikit-learn

What are hurdle models? Google explains best, The hurdle model is a two-part model that specifies one process for zero counts and another process for positive counts. The idea is that positive counts occur once a threshold is crossed, or put another way, a hurdle is cleared. — Getting started with hurdle models [University of Virginia Library] What are hurdle models useful for? Many statistical learning models—particularly linear models—assume some level of normality in the response variable being predicted. Read more

July 29, 2019

One-hot encoding + linear regression = multi-collinearity

My coefficients are bigger than your coefficients I was attempting to fit a simple linear regression model the other day with sklearn.linear_model.LinearRegression but the model was making terribly inaccurate predictions on the test dataset. Upon inspecting the estimated coefficients, I noticed that they were of a crazy magnitude, on the order of billions. For reference, I was predicting a response which was approximately normally distributed with a mean value of 100. Read more

© Geoff Ruddock 2019