👋 Hi, I’m Geoff.
I work at the intersection of data science and product. I help teams to build narratives around user behaviour at scale using quantitative data. These are some of my notes around work, personal projects, and general learning. I write primarily as a way of clarifying my own thinking, but I hope you’ll find some value in here as well!
After years of using SSDs, I had almost entirely forgotten how annoying the sound of actual, spinning hard disk platters is. That is, until I bought a NAS earlier this year to set up as a home media server. Lockdown projects, yay! Below are some reflections from my noise reduction journey.
My main Black Friday purchase this year was a Tado° system (thermostat + smart radiator valves), which I acquired with the goal in mind of regulating the temperature in my bedroom by:
Turning the heat down early enough to be consistently cold at night.
Turning the heat up in the morning to make it easier to wake up, ideally 1h before.
The first part is easy, but the second part can’t quite be achieved, at least out-of-the-box.
A collection of accidental art that I have created while trying to plot something actually useful with matplotlib or other tools. Available as NFTs upon request.
A use case for templating your SQL queries
Suppose you have a table raw_events which contains events related to an email marketing campaign. You’d like to see the total number of each event type per day. This is a classic use case for a pivot table, but let’s suppose you are using an SQL engine such as Redshift / Postgres which does not have a built-in pivot function.
The quick-and-dirty solution here is to manually build the pivot table yourself, using a series of CASE WHEN expressions.
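As a rough illustration of why templating helps here, the sketch below generates the repetitive CASE WHEN columns from a list of event types. The column names event_type and event_date and the event types themselves are assumptions made for the example, not taken from the actual table.

```python
# Hypothetical event types; in practice these would come from the data itself.
EVENT_TYPES = ["sent", "opened", "clicked"]


def build_pivot_query(event_types, table="raw_events"):
    """Build a pivot query with one CASE WHEN column per event type.

    Assumes the table has event_date and event_type columns (hypothetical),
    and that event_types is a trusted list, since values are inlined as-is.
    """
    columns = ",\n".join(
        f"    SUM(CASE WHEN event_type = '{event}' THEN 1 ELSE 0 END) AS {event}"
        for event in event_types
    )
    return (
        "SELECT\n"
        "    event_date,\n"
        f"{columns}\n"
        f"FROM {table}\n"
        "GROUP BY event_date\n"
        "ORDER BY event_date;"
    )


query = build_pivot_query(EVENT_TYPES)
print(query)
```

Adding a new event type then means adding one entry to the list rather than copying another CASE WHEN line by hand.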
One feature of Lightroom I have not made much use of is the Maps view. While all my smartphone photos are automatically geotagged, I have historically neglected adding geotag info to the 80% of my photos shot on my dedicated camera, which does not have GPS. As a COVID lockdown project, I decided to try to use the location tracking data from Google Timeline to geo-tag my photos “automatically”.
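One plausible way to do this kind of matching is to look up, for each photo, the location record closest in time. The sketch below assumes the location history has already been parsed into sorted (timestamp, lat, lon) tuples; the track data and function name are hypothetical.

```python
from bisect import bisect_left


def nearest_location(photo_ts, track):
    """Return the (lat, lon) of the track point closest in time to photo_ts.

    track is a non-empty list of (timestamp, lat, lon) tuples sorted by timestamp.
    """
    timestamps = [point[0] for point in track]
    i = bisect_left(timestamps, photo_ts)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = track[max(i - 1, 0): i + 1]
    best = min(candidates, key=lambda point: abs(point[0] - photo_ts))
    return best[1], best[2]


# Hypothetical track: Unix timestamps in seconds with made-up coordinates.
track = [(0, 52.50, 13.40), (100, 52.51, 13.41), (200, 52.52, 13.42)]
print(nearest_location(120, track))
```

In practice you would probably also want a maximum time gap, so that a photo taken hours away from any location record is left untagged.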
As a subscriber to the Farnam Street newsletter, I enjoy reading Shane’s articles about using various mental models from other disciplines to improve our decision-making. Reading about these mental models is fun, but I am cognizant of the fact that reading about them does not equate to learning them. I have been using Anki flashcards in language-learning and technical contexts for a few years now. So the question is: what is the best way to use Anki to facilitate learning mental models?
Despite being a relatively modern phone, my OnePlus 6T records video using the H.264 codec rather than the newer H.265 HEVC codec. A minute of 1080p video takes up ~150MB of storage, and double that for 60fps mode or 4K. Even though the phone has a decent amount of storage (64GB) it quickly fills up if you record a lot of video. The storage savings from HEVC are pretty astounding. It typically requires 50% less bitrate (and hence storage space) to achieve the same level of quality as H.264.
A few weeks ago while learning about Naive Bayes, I wrote a post about implementing Naive Bayes from scratch with Python. The exercise proved quite helpful for building intuition around the algorithm. So this is a post in the same spirit on the topic of AdaBoost.
While learning about Naive Bayes classifiers, I decided to implement the algorithm from scratch to help solidify my understanding of the math. So the goal of this notebook is to implement a simplified and easily interpretable version of the sklearn.naive_bayes.MultinomialNB estimator which produces identical results on a sample dataset.
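To give a flavour of what such a from-scratch version can look like, here is a minimal pure-Python sketch of multinomial Naive Bayes with Laplace smoothing. It is not the notebook's code and makes no claim to match sklearn output; the toy documents and function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def fit_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    vocab = sorted({word for doc in docs for word in doc})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior score.

    Words outside the training vocabulary are ignored.
    """
    scores = {
        c: log_prior[c] + sum(log_likelihood[c].get(w, 0.0) for w in doc)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical toy dataset.
docs = [["free", "prize", "now"], ["meeting", "at", "noon"],
        ["free", "meeting"], ["prize", "prize", "now"]]
labels = ["spam", "ham", "ham", "spam"]

log_prior, log_likelihood = fit_multinomial_nb(docs, labels)
print(predict(["free", "prize"], log_prior, log_likelihood))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.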
This blog runs on Hugo, a publishing framework which processes markdown text files into static web assets which can be conveniently hosted on a server without a database. It is great for a number of reasons (speed, simplicity) but one area where I find it lacking is in support for math typesetting.
I recently finished reading Scott Page’s wonderful book The Model Thinker. This book does a great job of spotlighting some more niche and technical models from the social sciences and explaining them in an ELI5 manner. He touches on 50 models in the book, but here is a quick summary of the big ideas that jumped out to me.
There are a number of good free sources for market data such as Yahoo Finance or Google Finance. It is easy to pull this data into python using something like the yfinance package. But these sources generally only contain data for currently listed stocks.
I love jupyter notebooks. As a data scientist, notebooks are probably the fundamental tool in my daily workflow. They fulfill multiple roles: documenting what I have tried in a lab notebook for the benefit of my future self, and also serving as a self-contained format for the final version of an analysis, which can be committed to our team git repo and then discovered or reproduced later by other members of the team.
I got an interesting question from a PM this week, asking if we could run an experiment with a traffic allocation of 10% to control and 90% to the variation, rather than a traditional 50–50 split. TL;DR: Probably best to keep it 50–50, since your typical A/B test design involves enough factors to consider already.
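One way to see the cost of an unbalanced split is to compare the standard error of the difference in means against the balanced case. The sketch below assumes equal variance in both groups and a fixed total sample size, which is a simplification rather than a full test design.

```python
import math


def relative_standard_error(control_share):
    """Standard error of the difference in means, relative to a 50-50 split.

    Assumes equal variance in both groups and a fixed total sample size, so
    the standard error is proportional to sqrt(1/n_control + 1/n_variation).
    """
    balanced = math.sqrt(1 / 0.5 + 1 / 0.5)
    unbalanced = math.sqrt(1 / control_share + 1 / (1 - control_share))
    return unbalanced / balanced


ratio = relative_standard_error(0.10)
print(f"Standard error vs 50-50: {ratio:.2f}x")
print(f"Traffic needed for equal precision: {ratio ** 2:.2f}x")
```

Under these assumptions a 10/90 split has about 1.67x the standard error of a 50–50 split, so it needs roughly 2.78x the total traffic to reach the same precision.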
TL;DR: If you need a single random number (or up to 5) use the built-in random module instead of np.random.
An instinct to vectorize
An early learning for any aspiring pandas user is to always prefer “vectorized” operations over iteratively looping over individual values in some dataframe. These operations, which include most built-in methods, are implemented in compiled code (Cython and C) and executed at blazing-fast speeds behind the scenes. It is very often worth the effort of massaging your logic into a slightly less expressive form if you can leverage vectorized functions to avoid the performance hit of for-loops.
Here is a somewhat unconventional recommendation for the design of online experiments:
Set your default parameters for alpha (α) and beta (β) to the same value.
This implies that you incur equal cost from a false positive as from a false negative. I am not suggesting you necessarily use these parameters for every experiment you run, only that you set them as the default. As humans, we are inescapably influenced by default choices, so it is worthwhile to pick a set of default risk parameters that most closely match the structure of our decision-making.
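To get a feel for what this default costs in traffic, the sketch below compares the required sample size under the conventional α = 0.05, β = 0.20 with a symmetric choice. The symmetric value of 0.10 and the effect size are assumptions picked for illustration, and the formula is the standard normal approximation for a two-sided, two-sample test.

```python
from statistics import NormalDist


def sample_size_per_group(alpha, beta, effect_size):
    """Approximate per-group sample size for a two-sided, two-sample z-test.

    effect_size is the standardized difference in means (Cohen's d).
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # critical value for the false positive rate
    z_beta = z(1 - beta)        # critical value for the false negative rate
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2


# Hypothetical effect size and risk parameters.
conventional = sample_size_per_group(alpha=0.05, beta=0.20, effect_size=0.1)
symmetric = sample_size_per_group(alpha=0.10, beta=0.10, effect_size=0.1)

print(f"alpha=0.05, beta=0.20: {conventional:.0f} per group")
print(f"alpha=0.10, beta=0.10: {symmetric:.0f} per group")
```

With these particular numbers the symmetric default needs somewhat more traffic than the conventional one, which is the price of halving the false negative rate.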
Netflix recently posted an article on their tech blog titled Reimagining Experimentation Analysis at Netflix. Most of the post is about their experimentation infrastructure, but their example of a visualization of an experiment result caught my eye. A/B test results are notoriously difficult to visualize in an intuitive (but still correct) way. I’ve searched for best practices before, and the only reasonable template I could find is built for Excel, which doesn’t fit my python workflow.
While coding up a reinforcement learning algorithm in python, I came across a problem I had never considered before…
What’s the fastest way to sample from an array while building it?
If you’re reading this, you should first question whether you actually need to iteratively build and sample from a python array in the first place. If you can build the array first and then sample a vector from it using np.
Multimodal distributions are commonly seen when analyzing composite variables such as insurance claims, where some large proportion of values are zero, but the non-zero values take on a distribution of their own. Breaking down these sorts of distributions into their component parts allows us to more effectively model each piece and then recombine them at a later stage.
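A minimal sketch of this decomposition, using made-up claim amounts: split the variable into a frequency part (is there a claim at all?) and a severity part (how large is it, given that there is one), then recombine them into an expected value.

```python
from statistics import mean

# Hypothetical claim amounts: most policies have no claim at all.
claims = [0, 0, 0, 0, 0, 0, 0, 250.0, 1200.0, 4000.0]

nonzero = [c for c in claims if c > 0]
p_nonzero = len(nonzero) / len(claims)  # frequency component
mean_severity = mean(nonzero)           # severity component

# Recombine: E[claim] = P(claim > 0) * E[claim | claim > 0]
expected_claim = p_nonzero * mean_severity
print(f"P(claim > 0) = {p_nonzero:.2f}")
print(f"Expected claim = {expected_claim:.2f}")
```

Here each component is summarized by a simple average, but in a real model each piece could get its own fitted distribution, for example a classifier for the zero/non-zero part and a skewed continuous distribution for the severity.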
With a daily schedule, backfilling data from 5 years ago will take days to complete. Running the job less frequently (monthly?) would make backfills easier, but the data would be less fresh. We want to eat our cake and have it too. We can achieve this by creating two separate DAGs—one daily and one monthly—using the same underlying logic.
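The idea can be sketched without any Airflow code: one shared function does the work for an arbitrary interval, and the two schedules differ only in how they slice time. The function names and dates below are hypothetical.

```python
from datetime import date, timedelta


def process_interval(start, end):
    """Placeholder for the shared job logic: handle data in [start, end)."""
    return (start, end)


def daily_intervals(start, end):
    """Yield one-day intervals covering [start, end)."""
    day = start
    while day < end:
        yield day, day + timedelta(days=1)
        day += timedelta(days=1)


def monthly_intervals(start, end):
    """Yield calendar-month intervals covering [start, end).

    Assumes start falls on the first day of a month.
    """
    month_start = start
    while month_start < end:
        year_carry, month_index = divmod(month_start.month, 12)
        next_month = date(month_start.year + year_carry, month_index + 1, 1)
        yield month_start, next_month
        month_start = next_month


# Hypothetical five-year backfill window.
start, end = date(2015, 1, 1), date(2020, 1, 1)
daily_runs = [process_interval(s, e) for s, e in daily_intervals(start, end)]
monthly_runs = [process_interval(s, e) for s, e in monthly_intervals(start, end)]
print(len(daily_runs), len(monthly_runs))
```

Over this window the daily schedule needs 1826 runs while the monthly one needs 60, which is why backfilling with the monthly variant is so much quicker.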
You don’t need to be a dummy to fall for the ‘Dummy Variable Trap’ while fitting a linear model, especially if you are using default parameters for one-hot encoding in scikit-learn. By default, OneHotEncoder sets the parameter drop=None which in turn causes it to output $ k $ output columns. When then used to fit a linear model with intercept, this results in a situation where we have perfect multicollinearity, and so the model overfits the data using unrealistic coefficients.
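The multicollinearity is easy to see with a hand-rolled encoder: with all $ k $ columns, every row sums to 1, which duplicates the intercept column exactly. The sketch below uses a hypothetical one_hot helper in plain Python rather than scikit-learn itself.

```python
def one_hot(values, drop_first=False):
    """One-hot encode a list of categories, optionally dropping the first level."""
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical categorical feature with k = 3 levels.
colours = ["red", "green", "blue", "green", "red"]

full = one_hot(colours)                      # k columns
reduced = one_hot(colours, drop_first=True)  # k - 1 columns

# With k columns every row sums to 1, i.e. the columns add up to the intercept.
print([sum(row) for row in full])
# With k - 1 columns the dropped level becomes the all-zeros baseline.
print([sum(row) for row in reduced])
```

In scikit-learn the equivalent of drop_first here is passing drop='first' to OneHotEncoder.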
I was looking at my Anki deck stats the other day and realized that I have been using it for just over three years now. During that time I have added 20k cards and reviewed 140k. On average I spent 17 minutes each day to review 130 cards. Since this amounts to over 300 hours of my life at this point, I figured it would be worth reflecting on this habit and deciding whether it is a worthwhile investment of time going forward.
I recently discovered that Apache Airflow allows you to embed markdown documentation directly into the Web UI. This is a very neat feature, because it enables you to locate your documentation as close as possible to the thing itself, rather than hiding it away in some google doc or confluence wiki. This, in turn, increases the chance it is actually read, rather than being promptly forgotten about and undiscovered by new team members.
The modern knowledge worker works in a highly specialized environment. Specialization improves efficiency, but it comes at the cost of less reactivity and adaptability to change. As units of work grow beyond the span of a single agent, this imposes a trade-off. But we can hack this trade-off. In an organization with multiple actors, the question shouldn’t be “Is collaboration worth it?” but rather “How can we reduce the cost of collaboration?” A natural place to start is in the written form of the request/job/project/task.
I fielded an interesting request recently from our PR team, who wanted to generate a creative representation of our data based on the direction and distance of trips booked on our platform. Distance is a key attribute of interest for a travel business, so it is naturally easy to retrieve this data. However, the direction of a trip is something that had not been previously analyzed, and so it was not available off-the-shelf in our data warehouse.
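One common way to derive a direction from origin and destination coordinates is the initial great-circle bearing. The sketch below uses that standard formula; whether it matches the approach taken in the full post is an assumption, and the function name is hypothetical.

```python
import math


def initial_bearing(lat1, lon1, lat2, lon2):
    """Initial compass bearing in degrees from point 1 to point 2.

    0 = north, 90 = east, 180 = south, 270 = west. Inputs are in degrees.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_lon = math.radians(lon2 - lon1)
    x = math.sin(delta_lon) * math.cos(phi2)
    y = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(delta_lon))
    return math.degrees(math.atan2(x, y)) % 360


# Sanity checks along the cardinal directions from the origin.
print(initial_bearing(0, 0, 10, 0))   # due north
print(initial_bearing(0, 0, 0, 10))   # due east
```

Note that on long trips the bearing changes along the route, so the initial bearing is only one of several reasonable definitions of "direction".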
Last year I built a DIY insulated sous-vide container using $10 of IKEA parts. It worked pretty well, using 60% less electricity than an uninsulated container. But it was a bit of an eye-sore, and I got tired of leaving a mess of towels out on my kitchen counter. Can we do better?
I have previously written about how to use ExternalTaskSensor in Airflow but have since realized that this is not always the best tool for the job. Depending on your specific decision criteria, one of the other approaches may be more suitable to your problem.
Airflow offers rich options for specifying intra-DAG scheduling and dependencies, but it is not immediately obvious how to do so for inter-DAG dependencies. Airflow provides an out-of-the-box sensor called ExternalTaskSensor that we can use to model this one-way dependency between two DAGs.
One textbook which is frequently recommended on Hacker News threads about self-study math material is Blitzstein and Hwang’s An Introduction to Probability. Having just recently finished the book, I realized that this is the first textbook I have truly worked through end-to-end while studying a topic outside of a school course. Here are some thoughts on what the book does well, and my (minor) grievances.
This is Water is a 22-minute commencement speech given by David Foster Wallace at Kenyon College in 2005 which was later adapted into a short book. It is difficult to overstate how powerful it is. Even after listening to this speech countless times, it never fails to send a shiver down my spine. In the interest of periodically forcing myself out of this default mode of thinking, I created a recurring calendar event to re-listen to the speech every six months. Here is an abridged audio version for that purpose.
Since acquiring an Anova sous-vide cooker, it has become an essential component of my weekly cooking routine. Their marketing materials show the device being used in any large pot you probably already have. This is fine for occasional use, but since I use the device frequently I started looking for a dedicated vessel. A dedicated vessel also lets you cook a larger quantity of food, or something awkwardly large like a rack of ribs. You can buy a pre-built container but it costs $70 and is not insulated. So I decided to build a simple dedicated container that was semi-insulated, so that it would be energy efficient when cooking ribs for 48 hours.
A very common scenario one comes across while performing data analysis is wanting to compute a basic count of some event—such as visits, searches, or purchases—split by a single dimension—such as country, device, or marketing channel. Amazon Redshift provides an off-the-shelf window function called ratio_to_report which basically solves what we are trying to accomplish. Running this function gives us the exact same output as the previous query, but with half the lines of code, and a more readable result.
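For readers less familiar with window functions, what ratio_to_report computes is simply each group's count divided by the total across all groups. Here is the same calculation in plain Python on a made-up list of visits, as a rough equivalent rather than the query from the post.

```python
from collections import Counter

# Hypothetical visit log: one country code per visit.
visits = ["DE", "DE", "FR", "DE", "US", "FR", "DE", "US"]

counts = Counter(visits)
total = sum(counts.values())

# Each group's share of the overall total, as ratio_to_report would return.
share = {country: n / total for country, n in counts.items()}

for country, n in counts.most_common():
    print(f"{country}: {n} visits ({share[country]:.0%})")
```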
The phrase “data quality” is frequently—and often ambiguously—thrown around many data analytics organizations. It can be used as an object of concern, an excuse for a failure, or a goal for future improvement.
We’d all love 100% accuracy, but in the era of moving fast and breaking things, don’t we want to sacrifice a little accuracy in the name of speed? After all, isn’t it often better to make fast decisions with imperfect information and adjust course if necessary at a later point?
Once a year I try to reevaluate my “personal tech stack” to see if I am using fundamental tools as effectively as possible. Not just bigger tools such as todo lists, calendars, and note-taking, but also the smaller utility apps that get used so frequently they blend into our daily work routine. Our fluency with the tools we use every day is the foundation of personal productivity, so it makes sense to optimize even small interactions such as switching between windows. With that in mind, here are three key Mac apps that make me a tiny bit more efficient but do so very frequently.
I’m always hesitant to tell people that I work as a data scientist. Partially because it’s too vague of a job description to mean much, but also partially because it feels hubristic to use the job title “scientist” to describe work which does not necessarily involve the scientific method.
Data is a collection of facts. Data, in general, is not the subject of study. Data about something in particular, such as physical phenomena or the human mind, provide the content of study.
Anki 2.1+ now has built-in support for MathJax. This is now the best approach to math typesetting, since it removes the dependency on LaTeX being installed on your computer. Besides being a pain in the ass to configure, the LaTeX setup also required a bunch of configurations that you had to keep track of if you regularly used multiple computers with Anki. As a bonus, the MathJax syntax is cleaner, and you can now edit expressions on AnkiDroid and they will render immediately.
A good chunk of the job of being a PM or analyst involves spending time analyzing patterns of user behaviour, often to answer specific questions. Over time though, we build up mental models and heuristics which allow us to use our prior knowledge to answer questions more quickly.
More knowledge is good, right? On one hand, past experience calibrates our sense of prior probability, which allows us to make better decisions in noisy contexts.
Remote Research lays out a comprehensive framework for starting to conduct research studies at your company, and is useful for beginners or for filling in the gaps in your mental model. However it seems more targeted towards large companies with established UX practices than towards startups. If you are executing alone—perhaps as a one-man UX team—you may still feel a gap between theory and execution.
I finished reading Web Form Design recently on the recommendation of a mentor. The author makes a good case about web forms being a high leverage area to invest design efforts. The combination of forms being mandatory, complex, and not particularly sexy, results in an experience that is often the worst part of a user’s interaction with your product. He then breaks down the form into the building blocks of Labels, Input Fields, and Actions, then lays out best practices for each. Here are a few snippets from the book that resonated with me.
There are plenty of technical guides online about tracking user behaviour using GTM. But I haven’t found as much about dealing with the organizational challenges that may arise when making changes to tracking.
One of my main projects at Carmudi was improving our tracking. The key challenge was that I was not building tracking entirely from scratch. We already had a buggy tracking implementation that was feeding data into some of the most important reports in the organization.
One of the consistent must-reads that has remained in my RSS feed over the years is Seth Godin’s blog. Seth consistently puts out a stream of incredibly wise thoughts. I have found that some of his posts resonate with me even more when I re-read them at a later point in my life/career. Here are some of my favourite Seth Godin posts, as they relate to the role of Product Manager.
I’m a big fan of Blinkist, which is a subscription service that provides really well-written summaries of popular non-fiction books. These aren’t the SparkNotes you remember from your high school days—each summary is split into thematic bites, and the information is presented in a form that is already partially synthesized.
Each day Blinkist offers free access to one of their new summaries through Blinkist Daily. I find that the curation of books they use for Blinkist Daily is very high-quality, and I can usually find at least 2 summaries per week that I am interested in.
When I started as a Product Manager last year, I knew I had a lot to learn. I scoured the internet, reading everything I could find on Product Management and how to succeed starting out as a non-technical PM. I have compiled a list of some of the most useful things I have read, partially so that I can revisit them myself from time-to-time.
Some of these articles are not strictly product-related—many of them involve design, project management, and elements of software development.
Most people use some combination of a calendar and todo list to organize their lives, whether it be a paper organizer or one of the myriad task list apps that pop up every day in the App Store. Personally I use a combination of Google Calendar and Todoist. Working together, these two do a pretty good job of keeping me organized. That said, the one type of task I have found awkward to manage is the kind that you’d like to complete on a regular basis, but that isn’t particularly time sensitive.