September 2, 2019

Save Entire Webpages for Reference With SingleFile

I’ve been reading through a lot of Tiago Forte’s writing on his members-only publication Praxis. Since reading through his series on progressive summarization, I have become more conscientious about saving the “work-in-progress” artifacts of my thinking process to Evernote. Often this involves a link to a piece of content, a couple of highlights, and a bullet point or two about key takeaways.

The problem

It’s pretty easy to surface relevant notes using the Search function if I’ve added enough contextual info to the note, but less so if it’s just a link.

August 15, 2019

Creating a Monthly + Daily DAG Pattern in Airflow

Problem

You initially built a data pipeline for a project you were working on, but eventually other members of your team started using it as well. You move the logic into Airflow, so that the pipeline is updated automatically on a regular schedule. You’d like to set schedule_interval to daily so that the data is always fresh, but you’d also like the ability to execute relatively quick backfills. With a daily schedule, backfilling data from 5 years ago will take days to complete.
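To get a rough sense of why schedule granularity matters for backfills, the sketch below counts how many DAG runs a five-year backfill would need on a daily versus a monthly schedule. The date range is a hypothetical example, and the actual wall-clock time depends on how long each run takes, which the excerpt does not state.

```python
from datetime import date

# Hypothetical backfill window: five years ending on the post's date.
start = date(2014, 8, 15)
end = date(2019, 8, 15)

# One run per day on a daily schedule.
daily_runs = (end - start).days

# One run per calendar month on a monthly schedule.
monthly_runs = (end.year - start.year) * 12 + (end.month - start.month)

print(f"daily runs:   {daily_runs}")
print(f"monthly runs: {monthly_runs}")
```

Under these assumptions the daily schedule needs roughly thirty times as many runs as the monthly one, which is the gap a combined monthly + daily pattern is meant to close.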

July 29, 2019

One-hot encoding + linear regression = Multi-collinearity

My coefficients are bigger than your coefficients

I was attempting to fit a simple linear regression model the other day with sklearn.linear_model.LinearRegression, but the model was making terribly inaccurate predictions on the test dataset. Upon inspecting the estimated coefficients, I noticed that they were of a crazy magnitude, on the order of billions. For reference, I was predicting a response which was approximately normally distributed with a mean value of 100.
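The excerpt stops before explaining the cause, but the title points at multi-collinearity from one-hot encoding. As a minimal sketch of that idea (not the author's code), the example below uses only the standard library and a hypothetical `colors` feature to show that an intercept column plus a full set of dummy columns is rank-deficient, while dropping one reference level restores full column rank. The helper names `one_hot` and `matrix_rank` are illustrative, not part of sklearn.

```python
from fractions import Fraction

def one_hot(values, drop_first=False):
    """Encode category labels as 0/1 columns, one row per value."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]  # drop one reference level
    return [[1 if v == level else 0 for level in levels] for v in values]

def matrix_rank(rows):
    """Rank via exact Gaussian elimination (Fractions avoid float round-off)."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

colors = ["red", "green", "blue", "green", "red", "blue"]  # hypothetical data

# Design matrix = intercept column followed by the dummy columns.
full = [[1] + row for row in one_hot(colors)]
reduced = [[1] + row for row in one_hot(colors, drop_first=True)]

full_rank = matrix_rank(full)
reduced_rank = matrix_rank(reduced)

print(f"full encoding:    {len(full[0])} columns, rank {full_rank}")
print(f"one level dropped: {len(reduced[0])} columns, rank {reduced_rank}")
```

In the full encoding the dummy columns sum to the intercept column in every row, so the least-squares coefficients are not uniquely determined, and a numerical solver can return very large values that nearly cancel each other.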

May 13, 2019

Embed markdown documentation directly into your Airflow DAGs

Why you should do it

I recently discovered that Apache Airflow allows you to embed markdown documentation directly into the Web UI. This is a very neat feature, because it enables you to locate your documentation as close as possible to the thing itself, rather than hiding it away in some Google Doc or Confluence wiki. This, in turn, increases the chance that it is actually read, rather than being promptly forgotten and never discovered by new team members.

February 11, 2019

The best way to manage dependencies between DAGs in Airflow

Airflow provides a few different sensors and operators which enable you to coordinate scheduling between different DAGs, including ExternalTaskSensor, TriggerDagRunOperator, and SubDagOperator. Which one is the best to use? I have previously written about how to use ExternalTaskSensor in Airflow but have since realized that this is not always the best tool for the job. Depending on your specific decision criteria, one of the other approaches may be more suitable to your problem.

© Geoff Ruddock 2019