Geoff Ruddock

Tags / analytics


I love jupyter notebooks. As a data scientist, notebooks are probably the most fundamental tool in my daily workflow. They fulfill multiple roles: documenting what I have tried, like a lab notebook, for the benefit of my future self, and also serving as a self-contained format for the final version of an analysis, which can be committed to our team git repo and then discovered or reproduced later by other members of the team.
Here is a somewhat unconventional recommendation for the design of online experiments: Set your default parameters for alpha (α) and beta (β) to the same value. This implies that you incur the same cost from a false positive as from a false negative. I am not suggesting you necessarily use these parameters for every experiment you run, only that you set them as the default. As humans, we are inescapably influenced by default choices, so it is worthwhile to pick a set of default risk parameters that most closely matches the structure of our decision-making.
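A minimal sketch of what that default looks like in a power calculation, assuming a two-proportion z-test via statsmodels; the 0.10 risk level and the 10% → 12% conversion rates are illustrative assumptions, not numbers from the post:

```python
# Sketch: sample size per variant when alpha and beta default to the same value.
# Conversion rates and the 0.10 risk level are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

alpha = beta = 0.10                                # equal cost of false positives and false negatives
effect_size = proportion_effectsize(0.12, 0.10)    # hypothetical 10% baseline vs. 12% target conversion

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=1 - beta,                                # power is defined as 1 - beta
    alternative='two-sided',
)
print(f"~{n_per_variant:.0f} users per variant")
```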
The modern knowledge worker works in a highly specialized environment. Specialization improves efficiency, but it comes at the cost of reduced reactivity and adaptability to change. As units of work grow beyond the span of a single agent, collaboration imposes a trade-off. But we can hack this trade-off. In an organization with multiple actors, the question shouldn’t be Is collaboration worth it? but rather How can we reduce the cost of collaboration? A natural place to start is in the written form of the request/job/project/task.
The phrase “data quality” is thrown around frequently, and often ambiguously, in data analytics organizations. It can be used as an object of concern, an excuse for a failure, or a goal for future improvement. We’d all love 100% accuracy, but in the era of moving fast and breaking things, don’t we want to sacrifice a little accuracy in the name of speed? After all, isn’t it often better to make fast decisions with imperfect information and adjust course if necessary at a later point?
I’m always hesitant to tell people that I work as a data scientist. Partially because it’s too vague of a job description to mean much, but also partially because it feels hubristic to use the job title “scientist” to describe work which does not necessarily involve the scientific method. Data is a collection of facts. Data, in general, is not the subject of study. Data about something in particular, such as physical phenomena or the human mind, provides the content of study.
A good chunk of the job of being a PM or analyst involves spending time analyzing patterns of user behaviour, often to answer specific questions. Over time though, we build up mental models and heuristics which allow us to use our prior knowledge to answer questions more quickly. More knowledge is good, right? On one hand, past experience calibrates our sense of prior probability, which allows us to make better decisions in noisy contexts.
There are plenty of technical guides online about tracking user behaviour using GTM. But I haven’t found as much about dealing with the organizational challenges that may arise when making changes to tracking. One of my main projects at Carmudi was improving our tracking. The key challenge was that I was not building tracking entirely from scratch. We already had a buggy tracking implementation that was feeding data into some of the most important reports in the organization.