# Working notes

This is a subset of my personal markdown notes, shared here in the spirit of learning in public.

# Causal inference

## Why it matters 💡

Probabilistic conditioning gives us the total association $P(Y|X=x)$. But we are really interested in the causal effect $P(Y|do(X=x))$. Confounding occurs when the two are not equal due to the presence of “back door paths” between $X$ and $Y$. If we perform a proper experiment, we can close these back doors by “playing god”. But in observational studies or poorly designed experiments, these paths must be carefully considered.
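To make the gap between conditioning and intervening concrete, here is a minimal simulation sketch. The toy model is an assumption made up for illustration: a binary confounder `z` opens a back door path by influencing both `x` and `y`, and `x` also has a genuine effect on `y`. All probabilities are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def simulate(do_x=None):
    """Toy model (assumed): z -> x, z -> y, and x -> y."""
    z = rng.random(n) < 0.5
    if do_x is None:
        # Observational regime: z influences x, opening a back door path.
        x = rng.random(n) < np.where(z, 0.8, 0.2)
    else:
        # Interventional regime: we set x ourselves, cutting the z -> x arrow.
        x = np.full(n, bool(do_x))
    y = rng.random(n) < (0.2 + 0.4 * z + 0.2 * x)
    return x, y

x_obs, y_obs = simulate()
p_cond = y_obs[x_obs].mean()   # estimate of P(Y=1 | X=1)

_, y_int = simulate(do_x=1)
p_do = y_int.mean()            # estimate of P(Y=1 | do(X=1))

print(f"P(Y=1 | X=1)     ~ {p_cond:.3f}")
print(f"P(Y=1 | do(X=1)) ~ {p_do:.3f}")
```

In this toy model the conditional probability comes out larger than the interventional one, because observing $X=1$ also makes $Z=1$ more likely.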

# Exponential family

A distribution is in the single-parameter exponential family if its PDF can be expressed in a particular form. Formally, it is a set $\{ p(\cdot|\theta): \theta \in \Theta \}$ of PMFs or PDFs on $\mathbb{R}^d$ such that…

$$f(x|\theta) = h(x) \exp \big[ \eta(\theta)^T \cdot T(x) -A(\theta) \big]$$

with parameter $\theta \in \mathbb{R}^k$, data $x \in \mathbb{R}^d$, and four new functions:

| Function | Meaning |
|---|---|
| $\eta_i: \Theta \to \mathbb{R}$ | Natural parameter; if $\eta(\theta) = \theta$, the family is in canonical form |
| $T_i: \mathbb{R}^d \to \mathbb{R}$ | Sufficient statistic |
| $h: \mathbb{R}^d \to [0, \infty)$ | Base measure; affects the support/scaling of the distribution |
| $A: \Theta \to \mathbb{R}$ | Log partition function; acts as the normalizing constant, so that the PDF is valid |

## Alternate parameterizations

There are a couple of other ways to parameterize the same concept. The most popular moves the normalization term out of the exponential, $f(x|\boldsymbol{\eta}) = h(x) \, g(\boldsymbol{\eta}) \exp \big[ \boldsymbol{\eta}^T \cdot T(x) \big]$, where $A(\boldsymbol{\eta}) = - \log g(\boldsymbol{\eta})$.
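As a quick sanity check of the general form, the sketch below writes the Bernoulli distribution in exponential family form and compares it against the usual PMF. The function names are hypothetical; the decomposition used is $\eta(p) = \log\frac{p}{1-p}$, $T(x) = x$, $h(x) = 1$, $A(p) = -\log(1-p)$.

```python
import math

def bernoulli_pmf(x, p):
    """Standard Bernoulli PMF: P(X=1) = p, P(X=0) = 1 - p."""
    return p if x == 1 else 1.0 - p

def bernoulli_expfam(x, p):
    """The same PMF written as h(x) * exp(eta * T(x) - A)."""
    eta = math.log(p / (1.0 - p))  # natural parameter (the log-odds)
    t = x                          # sufficient statistic T(x) = x
    h = 1.0                        # base measure
    a = -math.log(1.0 - p)         # log partition function
    return h * math.exp(eta * t - a)

for p in (0.1, 0.5, 0.9):
    for x in (0, 1):
        assert math.isclose(bernoulli_pmf(x, p), bernoulli_expfam(x, p))
```

The two functions agree for every `x` and `p` tried, which is all the exponential family representation claims: it is a rewriting of the same distribution, not a new one.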

# Hierarchical models

Hierarchical models are predictive models which combine multiple probability distributions into a structure which (hopefully) reflects the true underlying data generating process more precisely. These models typically involve greater conceptual and computational complexity, but are able to make better predictions and parameter estimates if built effectively.

Also referred to as: multilevel models, mixed effects models.

## Key concept: Partial pooling 💡

When some of our predictor variables exhibit a natural hierarchical structure, we can make use of this information to improve our model.
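One way to build intuition for partial pooling is the shrinkage formula from a normal-normal model, sketched below. This is a simplification, not a full hierarchical fit: it assumes the within-group variance `sigma2` and between-group variance `tau2` are known, and it plugs in the size-weighted grand mean for the population mean. The function name and the example numbers are hypothetical.

```python
import numpy as np

def partial_pool(group_means, group_sizes, sigma2, tau2):
    """Shrink each group mean toward the grand mean.

    Assumes known within-group variance sigma2 and between-group variance tau2
    (a real hierarchical model would estimate these from the data).
    """
    group_means = np.asarray(group_means, dtype=float)
    group_sizes = np.asarray(group_sizes, dtype=float)
    grand_mean = np.average(group_means, weights=group_sizes)
    # Weight on the group's own mean: precision of its data vs. total precision.
    weight = (group_sizes / sigma2) / (group_sizes / sigma2 + 1.0 / tau2)
    return weight * group_means + (1.0 - weight) * grand_mean

means = [2.0, 5.0, 8.0]
sizes = [2, 20, 200]
pooled = partial_pool(means, sizes, sigma2=4.0, tau2=1.0)
print(pooled)
```

Groups with few observations are pulled strongly toward the grand mean, while large groups keep estimates close to their own sample mean. That sits between complete pooling (one mean for everyone) and no pooling (each group on its own).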

# Linear regression

Linear regression is a simple approach for modeling the relationship between a continuous and unbounded quantity of interest $y$ and a set of predictor variables $X$.

## Model definition

### Simple linear regression

Remember the equation of a line $y=mx+b$ from grade school? The equation for a simple linear regression is exactly this, just with different symbols.

$$y = \alpha + \beta x + \epsilon$$

where $x$ is our observed predictor variable, and $y$ is the (unobserved) quantity we’d like to predict.
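As a minimal sketch of fitting this model, the snippet below simulates data from $y = \alpha + \beta x + \epsilon$ and recovers the coefficients with the closed-form ordinary least squares estimates. The true values, noise scale and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data with known (assumed) intercept and slope.
alpha_true, beta_true = 1.5, 2.0
x = rng.uniform(0, 10, size=500)
y = alpha_true + beta_true * x + rng.normal(0, 1.0, size=500)

# Closed-form OLS for one predictor:
#   beta_hat  = cov(x, y) / var(x)
#   alpha_hat = mean(y) - beta_hat * mean(x)
beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()

print(f"alpha_hat = {alpha_hat:.2f}, beta_hat = {beta_hat:.2f}")
```

With several predictors the same idea generalizes to solving a least squares problem, for example with `np.linalg.lstsq`.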

# Markov chain Monte Carlo

## What is it

The goal of Markov chain Monte Carlo (MCMC) is to draw samples from some probability distribution without having to know its exact (normalized) height at any point. The way MCMC achieves this is to “wander around” on that distribution in such a way that the amount of time spent in each location is proportional to the height of the distribution. Once the chain converges, its stationary distribution matches the distribution we want to sample from.
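The “wandering” can be sketched with a random-walk Metropolis sampler, one of the simplest MCMC algorithms. The target below is an assumption chosen for illustration: a density proportional to a normal with mean 3 and standard deviation 1, with the normalizing constant deliberately left out. The step size and burn-in length are arbitrary.

```python
import numpy as np

def unnormalized_density(x):
    """Proportional to N(3, 1); the normalizing constant is never needed."""
    return np.exp(-0.5 * (x - 3.0) ** 2)

def metropolis(n_samples, step=1.0, start=0.0, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    current = start
    for i in range(n_samples):
        proposal = current + rng.normal(0.0, step)
        # Only the ratio of heights matters, so constants cancel.
        ratio = unnormalized_density(proposal) / unnormalized_density(current)
        if rng.random() < min(1.0, ratio):
            current = proposal
        samples[i] = current  # a rejected proposal repeats the current value
    return samples

# Discard an initial burn-in period before summarizing.
samples = metropolis(20_000)[2_000:]
print(samples.mean(), samples.std())
```

The sample mean and standard deviation should land close to 3 and 1, even though the sampler only ever compared relative heights.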

# Missing data

## Why you can’t just drop nulls ☠

Suppose you are analyzing a dataset with a non-catastrophic proportion of missing values, say 10%. The naive way to move forward is to simply drop all rows which have any values missing. Think `df.dropna()`. The best case is that the values are missing completely at random (MCAR), so this will not introduce bias into your analysis, just reduce your effective sample size. But that’s a big assumption.
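The small simulation below sketches why the assumption matters. It uses a made-up `income` column and two hypothetical missingness mechanisms, each removing roughly 10% of values: one unrelated to the data, and one where high values are far more likely to be missing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100_000
df = pd.DataFrame({"income": rng.normal(50.0, 10.0, size=n)})

# Mechanism 1: every value has the same 10% chance of being missing.
random_missing = df.copy()
random_missing.loc[rng.random(n) < 0.10, "income"] = np.nan

# Mechanism 2: high incomes are much more likely to be missing.
p_missing = np.where(df["income"] > 60.0, 0.5, 0.02)
biased_missing = df.copy()
biased_missing.loc[rng.random(n) < p_missing, "income"] = np.nan

true_mean = df["income"].mean()
random_mean = random_missing.dropna()["income"].mean()
biased_mean = biased_missing.dropna()["income"].mean()

print(f"true mean:                  {true_mean:.2f}")
print(f"dropna, random missingness: {random_mean:.2f}")
print(f"dropna, value-dependent:    {biased_mean:.2f}")
```

Under the first mechanism, dropping rows barely moves the mean. Under the second, the estimate is pulled noticeably downward, even though a similar fraction of rows was lost.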

# My content consumption strategy

> If information discovery plays such a central role in how we make sense of the world in this new media landscape, then it is a form of creative labor in and of itself.
>
> Maria Popova, [Brainpickings.org]

Recently I have been giving more thought to how I source and consume digital information. Without a bit of deliberate effort here, it is easy to slowly outsource the job of curation to social media feed algorithms, until one day you wake up and realize you are living in a filter bubble.

# Probability distributions

## Summary

Key facts and properties for common probability distributions. Largely copied from this probability cheatsheet, with some omissions and changes of parameterization.

## Discrete distributions

| Distribution | Notation | PMF | Expected value | Variance | MGF |
|---|---|---|---|---|---|
| Bernoulli | $\text{Bern}(p)$ | $P(X=k) = \begin{cases}q = 1-p & \text{ if } k=0 \\ p & \text{ if } k=1\end{cases}$ | $p$ | | $q+pe^t$ |
| Binomial | $\text{Bin}(n, p)$ | $P(X=k) = {n \choose k} p^k q^{n-k}$ | $np$ | | $(q+pe^t)^n$ |
| Geometric | $\text{Geom}(p)$ | $P(X=k) = q^k p$ | $\frac{q}{p}$ | | $\frac{p}{1-qe^t}$ |
| Negative Binomial | $\text{NBin}(r, p)$ | $P(X=n) = {r+n-1 \choose r-1} p^r q^n$ | $r \cdot \frac{q}{p}$ | | $\Big(\frac{p}{1-qe^t}\Big)^r$ |
| Hypergeometric | $\text{HGeom}(w, b, n)$ | $P(X=k) = {w \choose k} {b \choose n-k} / {w+b \choose n}$ | | | |
| Poisson | $\text{Pois}(\lambda)$ | $P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}$ | | | |

# Thoughts on ankification

Anki is an incredibly powerful tool for augmenting human memory. It is easy to make cards, but deceptively difficult to make cards well. The art of “ankification” involves transforming newly learned concepts into a card structure which is both:

- **Effective**: the cards trigger actual recall in some situation outside of SRS practice, such as finding the right word when speaking a foreign language, or recognizing a link between two concepts.