September 16, 2019

Building a hurdle regression estimator in scikit-learn

What are hurdle models? Google explains best, The hurdle model is a two-part model that specifies one process for zero counts and another process for positive counts. The idea is that positive counts occur once a threshold is crossed, or put another way, a hurdle is cleared. — Getting started with hurdle models [University of Virginia Library] What are hurdle models useful for? Many statistical learning models—particularly linear models—assume some level of normality in the response variable being predicted. Read more

August 15, 2019

Creating a monthly + daily DAG pattern in Airflow

Problem You initially built a data pipeline for a project you were working on, but eventually other members of your team started using it as well. You move the logic into Airflow, so that the pipeline is updated automatically on some regular basis. You’d like to set schedule_interval to daily so that the data is always fresh, but you’d also like the ability to execute relatively quick backfills. With a daily schedule, backfilling data from 5 years ago will take days to complete. Read more

July 29, 2019

One-hot encoding + linear regression = multi-collinearity

My coefficients are bigger than your coefficients I was attempting to fit a simple linear regression model the other day with sklearn.linear_model.LinearRegression but the model was making terribly inaccurate predictions on the test dataset. Upon inspecting the estimated coefficients, I noticed that they were of a crazy magnitude, on the order of billions. For reference, I was predicting a response which was approximately normally distributed with a mean value of 100. Read more

June 17, 2019

Reflections on three years of spaced repetition with Anki

I was looking at my Anki deck stats the other day and realized that I have been using it for just over three years now. During that time I have added 20k cards and reviewed 140k. On average I spent 17 minutes each day to review 130 cards. Since this amounts to over 300 hours of my life at this point, I figured it would be worth reflecting on this habit and deciding whether it is a worthwhile investment of time going forward. Read more

May 13, 2019

Embed markdown documentation directly into your Airflow DAGs

Why you should do it I recently discovered that Apache Airflow allows you to embed markdown documentation directly into the Web UI. This is very neat feature, because it enables you locate your documentation as close as possible to the thing itself, rather than hiding it away in some google doc or confluence wiki. This, in turn, increases the chance it is actually read, rather than being promptly forgotten about and undiscovered by new team members. Read more

© Geoff Ruddock 2019