## Geoff Ruddock

👋   Hi, I’m Geoff.

I work at the intersection of data science and product. I help teams to build narratives around user behaviour at scale using quantitative data. These are some of my notes around work, personal projects, and general learning. I write primarily as a way of clarifying my own thinking, but I hope you’ll find some value in here as well!

# How to soundproof a Synology NAS

##### Mar 14, 2021
After years of using SSDs, I had almost entirely forgotten how annoying the sound of actual, spinning hard disk platters is. That is, until I bought a NAS earlier this year to set up as a home media server. Lockdown projects, yay! Below are some reflections from my noise reduction journey.

##### Dec 03, 2020
My main Black Friday purchase this year was a Tado° system (thermostat + smart radiator valves), which I acquired with the goal of regulating the temperature in my bedroom: turning the heat down early enough that it is consistently cold at night, and turning the heat up in the morning (ideally 1h before) to make it easier to wake up. The first part is easy, but the second can’t quite be achieved, at least out-of-the-box.

# Accidental abstract art (ft. matplotlib)

##### Oct 10, 2020
A collection of accidental art that I have created while trying to plot something actually useful with matplotlib or other tools. Available as NFTs upon request.

# Keep your SQL queries DRY with Jinja templating

##### Jul 01, 2020
A use case for templating your SQL queries: suppose you have a table raw_events which contains events related to an email marketing campaign. You’d like to see the total number of each event type per day. This is a classic use case for a pivot table, but let’s suppose you are using an SQL engine such as Redshift or Postgres which does not have a built-in pivot function. The quick-and-dirty solution here is to manually build the pivot table yourself, using a series of CASE WHEN expressions.
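The shape of that idea can be sketched in plain Python (the post itself uses Jinja, but simple string formatting shows what gets generated; the event names below are hypothetical examples):

```python
# Generate the repetitive CASE WHEN pivot columns programmatically,
# so the list of event types lives in exactly one place (DRY).
event_types = ["sent", "opened", "clicked"]  # hypothetical event names

case_lines = ",\n  ".join(
    f"SUM(CASE WHEN event_type = '{e}' THEN 1 ELSE 0 END) AS n_{e}"
    for e in event_types
)

query = f"""SELECT
  DATE_TRUNC('day', created_at) AS day,
  {case_lines}
FROM raw_events
GROUP BY 1
ORDER BY 1"""

print(query)
```

Adding a new event type is then a one-line change to the list, rather than copy-pasting another CASE WHEN block.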

# Geotagging Lightroom photos with Google Timeline data

##### Jun 16, 2020
One feature of Lightroom I have not made much use of is the Maps view. While all my smartphone photos are automatically geotagged, I have historically neglected adding geotag info to the 80% of my photos shot on my dedicated camera, which does not have GPS. As a COVID lockdown project, I decided to try to use the location tracking data from Google Timeline to geo-tag my photos “automatically”.

# How to learn mental models with spaced repetition

##### May 01, 2020
As a subscriber to the Farnam Street newsletter, I enjoy reading Shane’s articles about using various mental models from other disciplines to improve our decision-making. Reading about these mental models is fun, but I am cognizant of the fact that reading about them did not equate to learning them. I have been using Anki flashcards in language-learning and technical contexts for a few years now. So the question is: what is the best way to use Anki to facilitate learning mental models?

# Bulk compress videos to H.265 (x265) with ffmpeg

##### Apr 21, 2020
Despite being a relatively modern phone, my OnePlus 6T records video using the H.264 codec rather than the newer H.265 HEVC codec. A minute of 1080p video takes up ~150MB of storage, and double that for 60fps mode or 4K. Even though the phone has a decent amount of storage (64GB) it quickly fills up if you record a lot of video. The storage savings from HEVC are pretty astounding. It typically requires 50% less bitrate (and hence storage space) to achieve the same level of quality as H.264.
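A minimal sketch of the batch re-encode in Python (the ffmpeg flags are standard libx265 options, but the CRF value here is an illustrative assumption, not necessarily the post's setting):

```python
from pathlib import Path

def hevc_command(src: Path, crf: int = 28) -> list[str]:
    """Build an ffmpeg invocation that re-encodes one file to H.265,
    copying the audio stream untouched."""
    dst = src.with_name(src.stem + "_x265.mp4")
    return [
        "ffmpeg", "-i", str(src),
        "-c:v", "libx265",   # video: H.265/HEVC
        "-crf", str(crf),    # constant-quality factor (lower = higher quality)
        "-c:a", "copy",      # audio: passthrough, no re-encode
        str(dst),
    ]

# Bulk: build one command per video in a folder
# (each would be run via subprocess.run in practice).
cmds = [hevc_command(p) for p in sorted(Path(".").glob("*.mp4"))]
```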

# Building an AdaBoost classifier from scratch in Python

##### Mar 20, 2020
A few weeks ago while learning about Naive Bayes, I wrote a post about implementing Naive Bayes from scratch with Python. The exercise proved quite helpful for building intuition around the algorithm. So this is a post in the same spirit on the topic of AdaBoost.

# Building a Naive Bayes classifier from scratch with NumPy

##### Mar 16, 2020
While learning about Naive Bayes classifiers, I decided to implement the algorithm from scratch to help solidify my understanding of the math. So the goal of this notebook is to implement a simplified and easily interpretable version of the sklearn.naive_bayes.MultinomialNB estimator which produces identical results on a sample dataset.
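The core of such an implementation is small: class log-priors plus Laplace-smoothed feature log-likelihoods. Here is a pure-Python sketch of that idea (not the NumPy version from the post):

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Fit class log-priors and Laplace-smoothed feature log-likelihoods.
    X: list of per-document feature-count vectors; y: list of class labels."""
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        totals = [sum(col) for col in zip(*rows)]      # per-feature counts in class c
        denom = sum(totals) + alpha * n_features       # smoothed normalizer
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return log_prior, log_likelihood

def predict(x, log_prior, log_likelihood):
    """Pick the class maximizing log P(c) + sum_i x_i * log P(feature_i | c)."""
    scores = {
        c: log_prior[c] + sum(cnt * ll for cnt, ll in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy demo: two "word count" features, two classes.
log_prior, log_lik = fit_multinomial_nb([[2, 0], [0, 2], [3, 0]], ["a", "b", "a"])
```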

# Render LaTeX math expressions in Hugo with MathJax 3

##### Feb 04, 2020
This blog runs on Hugo, a publishing framework which processes markdown text files into static web assets which can be conveniently hosted on a server without a database. It is great for a number of reasons (speed, simplicity) but one area where I find it lacking is in support for math typesetting.

# 8 Big Ideas from Scott Page's “The Model Thinker”

##### Jan 10, 2020
I recently finished reading Scott Page’s wonderful book The Model Thinker. This book does a great job of spotlighting some more niche and technical models from the social sciences and explaining them in an ELI5 manner. He touches on 50 models in the book, but here is a quick summary of the big ideas that jumped out to me.

# Scraping unlisted stock prices with BeautifulSoup

##### Dec 14, 2019
There are a number of good free sources for market data such as Yahoo Finance or Google Finance. It is easy to pull this data into python using something like the yfinance package. But these sources generally only contain data for currently listed stocks.

# A clean way to share results from a Jupyter Notebook

##### Dec 02, 2019
I love Jupyter notebooks. As a data scientist, notebooks are probably the fundamental tool in my daily workflow. They fulfill multiple roles: documenting what I have tried in a lab notebook for the benefit of my future self, and serving as a self-contained format for the final version of an analysis, which can be committed to our team git repo and then discovered or reproduced later by other members of the team.

# Can you run an A/B test with unequal sample sizes?

##### Nov 25, 2019
I got an interesting question from a PM this week, asking if we could run an experiment with a traffic allocation of 10% to control and 90% to the variation, rather than a traditional 50–50 split. TL;DR: Probably best to keep it 50–50, since your typical A/B test design involves enough factors to consider already.

# When Python’s built-in random module is faster than NumPy

##### Nov 19, 2019
TL;DR: If you need a single random number (or up to ~5 of them), use the built-in random module instead of np.random. An early learning for any aspiring pandas user is to always prefer “vectorized” operations over iteratively looping over individual values in a dataframe. These operations—which include most built-in methods—are implemented in Cython and executed at blazing-fast speeds behind the scenes. It is very often worth the effort of massaging your logic into a slightly less expressive form if you can leverage vectorized functions to avoid the performance hit of for-loops.
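A quick sketch of the comparison (absolute timings vary by machine; the point is the fixed per-call overhead of NumPy when drawing a single scalar):

```python
import random
import timeit

import numpy as np

# Draw ONE random float per call, 100k times each.
t_builtin = timeit.timeit(random.random, number=100_000)
t_numpy = timeit.timeit(np.random.random, number=100_000)

print(f"built-in random: {t_builtin:.4f}s")
print(f"np.random:       {t_numpy:.4f}s")
```

NumPy pays off when you vectorize, e.g. `np.random.random(100_000)` in a single call, rather than drawing one number at a time.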

# Planning A/B tests with a symmetric risk profile (α=β)

##### Nov 11, 2019
Here is a somewhat unconventional recommendation for the design of online experiments: Set your default parameters for alpha (α) and beta (β) to the same value. This implies that you incur equal cost from a false positive as from a false negative. I am not suggesting you necessarily use these parameters for every experiment you run, only that you set them as the default. As humans, we are inescapably influenced by default choices1, so it is worthwhile to pick a set of default risk parameters that most closely match the structure of our decision-making.

# Making beautiful experiment visualizations with Matplotlib

##### Oct 21, 2019
Netflix recently posted an article on their tech blog titled Reimagining Experimentation Analysis at Netflix. Most of the post is about their experimentation infrastructure, but their example of a visualization of an experiment result caught my eye. A/B test results are notoriously difficult to visualize in an intuitive (but still correct) way. I’ve searched for best practices before, and the only reasonable template I could find is built for Excel, which doesn’t fit my python workflow.

# Sampling from an iteratively built array in Python

##### Oct 07, 2019
While coding up a reinforcement learning algorithm in python, I came across a problem I had never considered before… What’s the fastest way to sample from an array while building it? If you’re reading this, you should first question whether you actually need to iteratively build and sample from a python array in the first place. If you can build the array first and then sample a vector from it using np.
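The naive pattern looks like this (a pure-Python sketch of the problem itself, not the post's benchmarked solution):

```python
import random

values = []
samples = []
for step in range(1, 101):
    values.append(step * step)             # iteratively grow the array...
    samples.append(random.choice(values))  # ...and sample from what exists so far
```

Each draw only sees the prefix built up to that point, which is what makes pre-building the full array and sampling once not directly applicable.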

# Building a hurdle regression estimator in scikit-learn

##### Sep 16, 2019
Multimodal distributions are commonly seen when analyzing composite variables such as insurance claims, where some large proportion are zero, but the non-zero values take on a distribution of their own. Breaking down these sorts of distributions into their component parts allows us to more effectively model each piece and then recombine them at a later stage.
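The decomposition behind a hurdle model can be sketched in a few lines (pure Python, with toy claim amounts as hypothetical data):

```python
def hurdle_expectation(y):
    """E[y] = P(y > 0) * E[y | y > 0]: model the 'hurdle' (zero vs non-zero)
    and the positive part separately, then recombine."""
    p_nonzero = sum(1 for v in y if v > 0) / len(y)
    positives = [v for v in y if v > 0]
    mean_positive = sum(positives) / len(positives) if positives else 0.0
    return p_nonzero * mean_positive

claims = [0, 0, 0, 4, 6]  # mostly zero, with a non-zero tail
```

In a full estimator the two factors come from a fitted classifier and regressor respectively, but the recombination step is exactly this product.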

# Creating a monthly + daily DAG pattern in Airflow

##### Aug 15, 2019
With a daily schedule, backfilling data from 5 years ago will take days to complete. Running the job less frequently (monthly?) would make backfills easier, but the data would be less fresh. We want to eat our cake and have it too. We can achieve this by creating two separate DAGs—one daily and one monthly—using the same underlying logic.

# One-hot encoding + linear regression = multi-collinearity

##### Jul 29, 2019
You don’t need to be a dummy to fall for the ‘Dummy Variable Trap’ while fitting a linear model, especially if you are using default parameters for one-hot encoding in scikit-learn. By default, OneHotEncoder sets the parameter drop=None, which causes it to produce $k$ output columns for $k$ categories. When these columns are then used to fit a linear model with an intercept, we end up with perfect multicollinearity, and the model overfits the data using unrealistic coefficients.
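The trap is easy to see in pure Python: with all $k$ dummy columns kept, they always sum to the constant intercept column (toy data below is hypothetical):

```python
categories = ["red", "green", "blue"]
rows = ["red", "blue", "blue", "green", "red"]

# One-hot encode WITHOUT dropping a column (k columns for k categories).
one_hot = [[1 if row == c else 0 for c in categories] for row in rows]

# Every row's dummies sum to 1 — exactly the intercept column — so
# intercept = dummy_1 + ... + dummy_k: perfect multicollinearity.
dummy_sums = [sum(r) for r in one_hot]
```

Dropping one column (e.g. `drop='first'` in OneHotEncoder) breaks this exact linear dependence.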

# Reflections on three years of spaced repetition with Anki

##### Jun 17, 2019
I was looking at my Anki deck stats the other day and realized that I have been using it for just over three years now. During that time I have added 20k cards and reviewed 140k. On average I spent 17 minutes each day to review 130 cards. Since this amounts to over 300 hours of my life at this point, I figured it would be worth reflecting on this habit and deciding whether it is a worthwhile investment of time going forward.

# Embed markdown documentation into your Airflow DAGs

##### May 13, 2019
I recently discovered that Apache Airflow allows you to embed markdown documentation directly into the Web UI. This is a very neat feature, because it enables you to locate your documentation as close as possible to the thing itself, rather than hiding it away in some Google Doc or Confluence wiki. This, in turn, increases the chance it is actually read, rather than being promptly forgotten and going undiscovered by new team members.

# Save entire webpages for reference with SingleFile

##### Apr 15, 2019
I recently came across a neat Chrome extension called SingleFile which saves webpages as HTML files, but first waits for lazy-loading javascript, images and CSS to render. It doesn’t work perfectly—it sometimes includes the blurry version of lazy-loaded photos unless you first scroll to the end of the page—but it works lightyears better than anything else I’ve tried.

# Every good data analysis starts with "Why?"

##### Apr 02, 2019
The modern knowledge worker works in a highly specialized environment. Specialization improves efficiency, but it comes at the cost of reduced reactivity and adaptability to change. As units of work grow beyond the span of a single agent, this imposes a trade-off. But we can hack this trade-off. In an organization with multiple actors, the question shouldn’t be Is collaboration worth it? but rather How can we reduce the cost of collaboration? A natural place to start is in the written form of the request/job/project/task.

# Calculating the bearing between coordinates in Redshift

##### Mar 11, 2019
I fielded an interesting request recently from our PR team, who wanted to generate a creative representation of our data based on the direction and distance of trips booked on our platform. Distance is a key attribute of interest for a travel business, so it is naturally easy to retrieve this data. However, the direction of a trip had not previously been analyzed, and so it was not available off-the-shelf in our data warehouse.
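The underlying calculation is the standard forward-azimuth formula between two lat/lon points, shown here in Python rather than Redshift SQL for readability:

```python
import math

def initial_bearing(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(x, y)) % 360
```

The same expression translates to SQL using ATAN2, SIN, COS, and RADIANS, which are available in Redshift.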

# How to convert a cooler into a DIY insulated sous-vide container

##### Sep 01, 2018
Since acquiring an Anova sous-vide cooker, it has become an essential component of my weekly cooking routine. Their marketing materials show the device being used in any large pot you probably already have. This is fine for occasional use, but since I use the device frequently I started looking for a dedicated vessel. A dedicated vessel also lets you cook a larger quantity of food, or something awkwardly large like a rack of ribs. You can buy a pre-built container but it costs \$70 and is not insulated. So I decided to build a simple dedicated container that was semi-insulated, so that it would be energy efficient when cooking ribs for 48 hours.

# Redshift function of the week: RATIO_TO_REPORT

##### Jun 10, 2018
A very common scenario one comes across while performing data analysis is wanting to compute a basic count of some event—such as visits, searches, or purchases—split by a single dimension—such as country, device, or marketing channel. Amazon Redshift provides an off-the-shelf window function called ratio_to_report which basically solves what we are trying to accomplish. Running this function gives us the exact same output as the previous query, but with half the lines of code, and a more readable result.
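What ratio_to_report computes is simply each value's share of its window total, sketched here in pure Python with hypothetical counts:

```python
# ratio_to_report(x) over a window == x / sum(x) over that window
visits_by_country = {"DE": 50, "FR": 30, "US": 20}  # hypothetical counts

total = sum(visits_by_country.values())
share = {country: n / total for country, n in visits_by_country.items()}
```

In SQL this replaces the usual pattern of joining a grouped count back onto its grand total.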

# The hidden costs of poor data quality

##### Aug 02, 2017
The phrase “data quality” is frequently—and often ambiguously—thrown around many data analytics organizations. It can be used as an object of concern, an excuse for a failure, or a goal for future improvement. We’d all love 100% accuracy, but in the era of moving fast and breaking things, don’t we want to sacrifice a little accuracy in the name of speed? After all, isn’t it often better to make fast decisions with imperfect information and adjust course if necessary at a later point?

# Essential productivity apps for Mac users

##### Jun 05, 2017
Once a year I try to reevaluate my “personal tech stack” to see if I am using fundamental tools as effectively as possible. Not just bigger tools such as todo lists, calendars, and note-taking, but also the smaller utility apps that get used so frequently they blend into our daily work routine. Our fluency with the tools we use every day is the foundation of personal productivity, so it makes sense to optimize even small interactions such as switching between windows. With that in mind, here are three key Mac apps that make me a tiny bit more efficient but do so very frequently.

# Jupyter Notebooks for Interactive SQL Exploration

##### Apr 16, 2017
I’m always hesitant to tell people that I work as a data scientist. Partially because it’s too vague of a job description to mean much, but also partially because it feels hubristic to use the job title “scientist” to describe work which does not necessarily involve the scientific method. Data is a collection of facts. Data, in general, is not the subject of study. Data about something in particular, such as physical phenomena or the human mind, provide the content of study.

# Typesetting math equations with Anki

##### Mar 27, 2017
Anki 2.1+ now has built-in support for MathJax. This is now the best approach to math typesetting, since it removes the dependency on having LaTeX installed on your computer. Besides being a pain in the ass to configure, the LaTeX approach also required a bunch of settings that you had to keep in sync if you regularly use Anki on multiple computers. As a bonus, the MathJax syntax is cleaner, and you can now edit expressions on AnkiDroid and they will render immediately.

##### Jul 17, 2016
A good chunk of the job of being a PM or analyst involves spending time analyzing patterns of user behaviour, often to answer specific questions. Over time though, we build up mental models and heuristics which allow us to use our prior knowledge to answer questions more quickly. More knowledge is good, right? On one hand, past experience calibrates our sense of prior probability, which allows us to make better decisions in noisy contexts.

# Book review: Remote Research (user research)

##### Jun 07, 2016
Remote Research lays out a comprehensive framework for starting to conduct research studies at your company, and is useful for beginners or for filling in the gaps in your mental model. However it seems more targeted towards large companies with established UX practices than towards startups. If you are executing alone—perhaps as a one-man UX team—you may still feel a gap between theory and execution.

# Book review: Web Form Design

##### May 11, 2016
I finished reading Web Form Design recently on the recommendation of a mentor. The author makes a good case about web forms being a high leverage area to invest design efforts. The combination of forms being mandatory, complex, and not particularly sexy, results in an experience that is often the worst part of a user’s interaction with your product. He then breaks down the form into the building blocks of Labels, Input Fields, and Actions, then lays out best practices for each. Here are a few snippets from the book that resonated with me.

# Tracking: Organizational Challenges

##### Feb 12, 2016
There are plenty of technical guides online about tracking user behaviour using GTM. But I haven’t found as much about dealing with the organizational challenges that may arise when making changes to tracking. One of my main projects at Carmudi was improving our tracking. The key challenge was that I was not building tracking entirely from scratch. We already had a buggy tracking implementation that was feeding data into some of the most important reports in the organization.

# The Best of Seth Godin for Product Managers

##### Jul 10, 2015
One of the consistent must-reads that has remained in my RSS feed over the years is Seth Godin’s blog. Seth consistently puts out a stream of incredibly wise thoughts. I have found that some of his posts resonate with me even more when I re-read them at a later point in my life/career. Here are some of my favourite Seth Godin posts, as they relate to the role of Product Manager.