# Probability distributions

## Summary

Key facts and properties for common probability distributions. Largely copied from this probability cheatsheet, with some omissions and changes of parameterization.

### Discrete distributions

| Distribution | Notation | PMF | Expected value | Variance | MGF |
| --- | --- | --- | --- | --- | --- |
| Bernoulli | $\text{Bern}(p)$ | $P(X=k) = \begin{cases}q = 1-p & \text{ if } k=0 \\ p & \text{ if } k=1\end{cases}$ | $p$ | $pq$ | $q+pe^t$ |
| Binomial | $\text{Bin}(n, p)$ | $P(X=k) = {n \choose k} p^k q^{n-k}$ | $np$ | $npq$ | $(q+pe^t)^n$ |
| Geometric | $\text{Geom}(p)$ | $P(X=k) = q^k p$ | $\frac{q}{p}$ | $\frac{q}{p^2}$ | $\frac{p}{1-qe^t}$ |
| Negative Binomial | $\text{NBin}(r, p)$ | $P(X=n) = {r+n-1 \choose r-1} p^r q^n$ | $r \cdot \frac{q}{p}$ | $r \cdot \frac{q}{p^2}$ | $\Big(\frac{p}{1-qe^t}\Big)^r$ |
| Hypergeometric | $\text{HGeom}(w, b, n)$ | $P(X=k) = {w \choose k} {b \choose n-k} / {w+b \choose n}$ | | | |
| Poisson | $\text{Pois}(\lambda)$ | $P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}$ | $\lambda$ | $\lambda$ | $e^{\lambda(e^t-1)}$ |

### Continuous distributions

| Distribution | Notation | PDF | Expected value | Variance | MGF |
| --- | --- | --- | --- | --- | --- |
| Uniform | $\text{Unif}(a,b)$ | $f(x) = \frac{1}{b-a}$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | $\frac{e^{tb} - e^{ta}}{t(b-a)}$ |
| Normal | $\mathcal{N}(\mu, \sigma^2)$ | $f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}$ | $\mu$ | $\sigma^2$ | $\exp(\mu t + \frac{1}{2}\sigma^2 t^2)$ |
| Exponential | $\text{Expo}(\lambda)$ | $f(x) = \lambda e^{-\lambda x}$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$ | $\frac{\lambda}{\lambda-t}$ |
| Gamma | $\text{Ga}(a, \lambda)$ | $f(x) = \frac{\lambda^a}{\Gamma(a)} x^{a-1} e^{-\lambda x}$ | $\frac{a}{\lambda}$ | $\frac{a}{\lambda^2}$ | $\big(\frac{\lambda}{\lambda-t}\big)^a$ |
| Beta | $\text{Be}(a, b)$ | $f(x) = \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} x^{a-1} (1-x)^{b-1}$ | $\frac{a}{a+b}$ | | |
| Log-Normal | $\mathcal{LN}(\mu, \sigma^2)$ | | | | |
| Chi-Square | $\chi_n^2$ | | | | |
| Student-t | $t_n$ | | | | |

## Relationships

The purposes of various distributions really “click” when we consider how they relate to each other.

### Flowchart from Statistical Rethinking

Source: Statistical Rethinking, 2e, ch. 10, p. 327

### Flowchart from Introduction to Probability

Source: Blitzstein, An Introduction to Probability

## Stories

Some further notes on how the distributions above and others relate to each other, and how they arise in natural scenarios.

### Bernoulli trial

A random variable which represents the outcome of a single binary trial (heads/tails, true/false, success/fail, etc.) with probability $p$ of a positive outcome.

Arguably the simplest possible distribution, since it involves only two possible outcomes: zero or one.

#### Binomial distribution

The number of successes in $n$ independent Bernoulli trials, each with the same success probability $p$.

We can represent a Bernoulli trial as a Binomial r.v. with a single trial: $\text{Bern}(p) \sim \text{Bin}(1, p)$.
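
The "sum of Bernoulli trials" story can be checked exactly by convolving the Bernoulli PMF with itself $n$ times and comparing against the closed-form Binomial PMF. This is a minimal sketch using only the Python standard library; the function names are illustrative and not taken from any source.

```python
from math import comb

def binomial_pmf(n, p):
    """Closed-form Bin(n, p) PMF as a list indexed by k = 0..n."""
    q = 1 - p
    return [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

def convolve(f, g):
    """PMF of the sum of two independent r.v.s with PMFs f and g on 0, 1, 2, ..."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out

def sum_of_bernoullis_pmf(n, p):
    """PMF of X_1 + ... + X_n with X_i i.i.d. Bern(p), by repeated convolution."""
    pmf = [1.0]  # the empty sum equals 0 with probability 1
    for _ in range(n):
        pmf = convolve(pmf, [1 - p, p])
    return pmf

n, p = 5, 0.3
for k, (conv, closed) in enumerate(zip(sum_of_bernoullis_pmf(n, p), binomial_pmf(n, p))):
    print(k, round(conv, 6), round(closed, 6))
```

The two columns should agree up to floating-point error for any choice of $n$ and $p$.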

#### Hypergeometric distribution

The Binomial distribution assumes our trials are independent, but what if the success probability of each trial depends on the outcomes of previous trials? This arises when sampling without replacement, e.g. drawing marbles from a bag one at a time.

More abstractly, we have items of a population classified using two sets of tags (e.g. black/white and sampled/unsampled) with at least one set assigned randomly.
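
A small sketch of the marble example, assuming a bag with `w` white and `b` black marbles from which `n` are drawn without replacement. It evaluates the PMF from the summary table and checks that it sums to one and that its mean matches the standard result $\frac{nw}{w+b}$. The helper name `hypergeom_pmf` is illustrative.

```python
from math import comb

def hypergeom_pmf(k, w, b, n):
    """P(X = k) for X ~ HGeom(w, b, n): k white marbles among n drawn without replacement."""
    if k < 0 or n - k < 0:
        return 0.0
    # math.comb returns 0 when asked to choose more items than are available
    return comb(w, k) * comb(b, n - k) / comb(w + b, n)

w, b, n = 6, 4, 3
pmf = [hypergeom_pmf(k, w, b, n) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))

print("PMF:", pmf)
print("total probability:", sum(pmf))
print("mean:", mean, "vs n*w/(w+b):", n * w / (w + b))
```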

### Multinomial

What if we have more than two possible outcomes? The Multinomial distribution is a generalization of the Binomial with $k$ categories, into which each of $n$ objects is independently placed, landing in category $j$ with probability $p_j$.

$$X \sim \text{Mult}_k(n, \vec{p})$$

PMF: $P(X_1=n_1, \ldots, X_k=n_k) = \frac{n!}{n_1! \, n_2! \cdots n_k!} \cdot p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$, where $n_1 + \ldots + n_k = n$.

Useful for keeping track of trials whose outcomes fall into mutually exclusive, collectively exhaustive (MECE) categories (e.g. excellent, adequate, poor).

The lumping property tells us that if $X \sim \text{Mult}_k(n, \vec{p})$ then for any distinct $i, j$, $X_i + X_j \sim \text{Bin}(n, p_i + p_j)$.
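
The lumping property can be verified by brute force for a small case: enumerate every count vector, accumulate the Multinomial probability by the value of $X_i + X_j$, and compare with the Binomial PMF. This is only a sketch, since the enumeration is exponential in $k$; the names `multinomial_pmf` and `lumped_pmf` are illustrative.

```python
from itertools import product
from math import comb, factorial

def multinomial_pmf(counts, probs):
    """P(X_1 = n_1, ..., X_k = n_k) for a Multinomial with n = sum(counts)."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)  # stays an integer at every step
    prob = 1.0
    for c, p in zip(counts, probs):
        prob *= p**c
    return coef * prob

def lumped_pmf(n, probs, i, j):
    """PMF of X_i + X_j, obtained by summing the joint PMF over all count vectors."""
    out = [0.0] * (n + 1)
    for counts in product(range(n + 1), repeat=len(probs)):
        if sum(counts) == n:
            out[counts[i] + counts[j]] += multinomial_pmf(counts, probs)
    return out

n, probs = 4, [0.2, 0.5, 0.3]
lumped = lumped_pmf(n, probs, 0, 2)
s = probs[0] + probs[2]
binom = [comb(n, m) * s**m * (1 - s)**(n - m) for m in range(n + 1)]

for m in range(n + 1):
    print(m, round(lumped[m], 6), round(binom[m], 6))
```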

### Counting trials with fixed outcomes

#### Geometric distribution

The probability distribution of the number of failures of independent Bernoulli trials before the first success.

Closely related to the first success distribution, which counts the number of trials up to and including the first success: if $Y \sim \text{FS}(p)$ then $Y-1 \sim \text{Geom}(p)$.

#### Negative Binomial distribution

A discrete probability distribution of the number of failures in a sequence of i.i.d. Bernoulli trials before a specified number of successes occurs.

A generalization of the Geometric distribution which can be represented as the sum of $r$ i.i.d. Geometric r.v.s.
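
To illustrate the sum-of-Geometrics representation, the sketch below convolves the Geometric PMF with itself $r$ times and compares the result with the Negative Binomial PMF from the summary table. Truncating at `max_failures` keeps the convolution exact for totals up to that bound, because a total of $m$ failures only involves terms with at most $m$ failures each. Function names are illustrative.

```python
from math import comb

def geom_pmf(k, p):
    """P(X = k) for X ~ Geom(p): k failures before the first success."""
    return (1 - p)**k * p

def nbin_pmf(n, r, p):
    """P(X = n) for X ~ NBin(r, p): n failures before the r-th success."""
    return comb(r + n - 1, r - 1) * p**r * (1 - p)**n

def sum_of_geoms_pmf(r, p, max_failures):
    """PMF of the sum of r i.i.d. Geom(p) r.v.s, for totals 0..max_failures."""
    geom = [geom_pmf(k, p) for k in range(max_failures + 1)]
    pmf = [1.0] + [0.0] * max_failures  # the empty sum is 0 with probability 1
    for _ in range(r):
        new = [0.0] * (max_failures + 1)
        for i in range(max_failures + 1):
            for j in range(max_failures + 1 - i):
                new[i + j] += pmf[i] * geom[j]
        pmf = new
    return pmf

r, p, max_failures = 3, 0.4, 8
conv = sum_of_geoms_pmf(r, p, max_failures)
for n in range(max_failures + 1):
    print(n, round(conv[n], 6), round(nbin_pmf(n, r, p), 6))
```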

### Counting events with fixed time

#### Poisson distribution

A discrete probability distribution representing the number of events occurring in a fixed period of time when they occur independently at a known constant rate.

#### Exponential

A continuous probability distribution that describes the time between events of a Poisson process, also known as interarrival times. The continuous counterpart of the Geometric distribution.
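
The link between Exponential interarrival times and Poisson counts can be illustrated by simulation. The sketch below draws i.i.d. $\text{Expo}(\lambda)$ gaps with `random.expovariate`, counts how many arrivals land in a unit interval, and compares the sample mean and variance of the counts with $\lambda$ (both equal $\lambda$ for a Poisson r.v.). The seed, rate and trial count are arbitrary choices for illustration.

```python
import random

def count_events(rate, horizon, rng):
    """Number of arrivals in [0, horizon] when interarrival times are i.i.d. Expo(rate)."""
    t, count = 0.0, 0
    while True:
        t += rng.expovariate(rate)  # next gap between events, mean 1 / rate
        if t > horizon:
            return count
        count += 1

rng = random.Random(0)  # fixed seed so the run is reproducible
rate, horizon, trials = 3.0, 1.0, 20_000
counts = [count_events(rate, horizon, rng) for _ in range(trials)]

mean = sum(counts) / trials
var = sum((c - mean) ** 2 for c in counts) / trials
print("sample mean:", mean, "sample variance:", var, "lambda:", rate)
```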

#### Gamma distribution

A two-parameter family of continuous probability distributions, of which the Exponential distribution is a special case. In Bayesian statistics, this distribution is used as a conjugate prior for the rate parameter of distributions such as the Exponential, the Poisson, or the Gamma itself.

While an Exponential r.v. represents the waiting time for the first success under conditions of memorylessness, a Gamma r.v. represents the total waiting time for multiple successes.

$$X_1, \ldots, X_n \overset{iid}{\sim} \text{Expo}(\lambda) \implies \sum_{i=1}^n X_i \sim \text{Ga}(n, \lambda)$$
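
One way to see why the sum is Gamma is through the MGFs listed in the summary table. As a sketch: the MGF of a sum of independent r.v.s is the product of their MGFs, so for $t < \lambda$

$$\begin{aligned} M_{X_i}(t) &= \frac{\lambda}{\lambda - t} \\ M_{X_1 + \ldots + X_n}(t) &= \prod_{i=1}^n M_{X_i}(t) = \Big(\frac{\lambda}{\lambda - t}\Big)^n \end{aligned}$$

which is the Gamma MGF $\big(\frac{\lambda}{\lambda-t}\big)^a$ with $a = n$.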

### Combining random variables

#### Normal Distribution

A continuous probability distribution to which the standardized sum (or mean) of many i.i.d. r.v.s with finite variance converges, via the central limit theorem (CLT).

#### Log-Normal

If $X \sim \mathcal{N}(0, 1)$ we can create a Log-Normal r.v. as $Y=e^X$. Then we can apply change-of-variables to find its PDF, noting the inverse transformation $x = \log y$.

$$\begin{aligned} f_Y(y) &= f_X(x) \bigg| \frac{dx}{dy} \bigg| \\ &= \varphi(x) \cdot \frac{1}{e^x} \\ &= \frac{1}{y} \varphi(\log y) \;\text{ for } y>0 \end{aligned}$$

Notes

• Arises naturally as the product of many i.i.d. positive random variables (as opposed to their sum, which converges to Normal via the CLT)
• Does not mean “log of a Normal” but rather “log is Normal” ☠
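
As a numerical sanity check on the change-of-variables result for $Y = e^X$ with $X \sim \mathcal{N}(0,1)$, the sketch below integrates the density $\frac{1}{y}\varphi(\log y)$ with a simple midpoint rule and compares the area with $P(Y \le y) = \Phi(\log y)$. It uses `statistics.NormalDist` from the standard library; the helper names and the integration settings are illustrative choices.

```python
from math import log
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)

def lognormal_pdf(y):
    """Density of Y = e^X with X ~ N(0, 1): phi(log y) / y for y > 0, else 0."""
    if y <= 0:
        return 0.0
    return STD_NORMAL.pdf(log(y)) / y

def lognormal_cdf(y):
    """P(Y <= y) = P(X <= log y) = Phi(log y) for y > 0, else 0."""
    return STD_NORMAL.cdf(log(y)) if y > 0 else 0.0

def integrate(f, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

area = integrate(lognormal_pdf, 0.0, 2.0)
print("integral of pdf on (0, 2]:", area)
print("Phi(log 2):               ", lognormal_cdf(2.0))
```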

#### Cauchy distribution

The distribution of a ratio of Normally distributed r.v.s: $\frac{X}{Y}$ with $X, Y \stackrel{i.i.d.}{\sim} \mathcal{N}(0,1)$.

A notable property is that it has no well-defined expectation or variance. In practice, this means that the CLT does not apply: as we collect more Cauchy r.v.s, their sum remains Cauchy rather than converging to the Normal distribution, and their sample mean is no more concentrated than a single draw. This makes it difficult to perform statistical inference.
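
A short sketch of why averaging does not help, assuming the standard fact that a standard Cauchy r.v. has characteristic function $\phi_X(t) = E[e^{itX}] = e^{-|t|}$. For i.i.d. standard Cauchy $X_1, \ldots, X_n$ with sum $S_n$ and sample mean $\bar{X}_n = S_n / n$:

$$\begin{aligned} \phi_{S_n}(t) &= \big(\phi_X(t)\big)^n = e^{-n|t|} \\ \phi_{\bar{X}_n}(t) &= \phi_{S_n}(t/n) = e^{-|t|} \end{aligned}$$

So $S_n$ is Cauchy with scale $n$, and $\bar{X}_n$ has exactly the same distribution as a single observation, no matter how large $n$ is.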

#### Chi-squared

The distribution of a sum of the squares of $n$ independent standard Normal random variables.

$$\begin{aligned} \text{If} &\quad V = Z_1^2 + Z_2^2 + \ldots + Z_n^2 \\ \text{where} &\quad Z_i \overset{iid}{\sim} \mathcal{N}(0, 1) \\ \text{then} &\quad V \sim \chi_n^2 \\ \text{with} &\quad n \text{ degrees of freedom.} \end{aligned}$$

Important in statistics because the sample variance of Normal data, after appropriate scaling, follows a Chi-Square distribution.

Relationships to other distributions

• Normal (see above)
• Gamma: $\chi^2(n) \sim \text{Ga}\big(\tfrac{n}{2}, \tfrac{1}{2}\big)$
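
The Gamma relationship can be checked with MGFs. A sketch, assuming the standard result $E\big[e^{tZ^2}\big] = (1-2t)^{-1/2}$ for $Z \sim \mathcal{N}(0,1)$ and $t < \tfrac{1}{2}$, and using independence of the $Z_i$:

$$\begin{aligned} M_V(t) &= \prod_{i=1}^n E\big[e^{tZ_i^2}\big] = (1-2t)^{-n/2} \\ &= \Big(\frac{1/2}{1/2 - t}\Big)^{n/2} \end{aligned}$$

This matches the Gamma MGF $\big(\frac{\lambda}{\lambda-t}\big)^a$ with $a = \tfrac{n}{2}$ and $\lambda = \tfrac{1}{2}$.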

### Multivariate Normal (MVN)

An extension of the Normal distribution to $m$ dimensions. The mean parameter $\mu$ becomes a vector of length $m$, and the variance $\sigma^2$ becomes an $m \times m$ covariance matrix $\Sigma$ rather than a scalar quantity.

Properties

• $(X_1, \ldots, X_m) \sim \text{MVN}$ if every linear combination of the $X_j$ has a Normal distribution.
• For r.v.s whose joint distribution is MVN, independence and zero correlation are equivalent conditions
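
A minimal two-dimensional sketch of these properties: starting from independent standard Normals $Z_1, Z_2$, the pair $X_1 = Z_1$, $X_2 = \rho Z_1 + \sqrt{1-\rho^2}\, Z_2$ is bivariate Normal with correlation $\rho$, because each component (and every linear combination of them) is a linear combination of independent Normals. The function names, seed and sample size are illustrative.

```python
import random
from math import sqrt

def sample_bvn(rho, rng):
    """One draw (X1, X2) from a standard bivariate Normal with correlation rho."""
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return z1, rho * z1 + sqrt(1 - rho**2) * z2

def sample_corr(pairs):
    """Sample (Pearson) correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the run is reproducible
pairs = [sample_bvn(0.6, rng) for _ in range(20_000)]
print("sample correlation:", sample_corr(pairs))
```

Setting `rho = 0` makes $X_2 = Z_2$, so the components are independent as well as uncorrelated.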

### Other

#### Beta distribution

A continuous probability distribution defined on the interval $[0,1]$ which is frequently used to model our prior belief about an unobserved probability $p$ of some other distribution.

A generalization of the uniform distribution (continuous and bounded)

• $\text{Beta}(1,1) = \text{Unif}(0, 1)$
• Normalizing constant: $\text{B}(a, b) = \frac{\Gamma(a) \Gamma(b)}{\Gamma(a+b)}$

Shape of the distribution w.r.t. the parameters $a$ and $b$:

• Convex (U-shaped) when $a < 1 \text{ and } b < 1$.
• Unimodal (hump-shaped, log-concave) when $a > 1 \text{ and } b > 1$.
• Symmetric about ½ when $a=b$.
• Positive skew (mass leans left, long right tail) when $a<b$.
• Negative skew (mass leans right, long left tail) when $a>b$.
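
To get a feel for these shapes, the sketch below evaluates the Beta density on a few grid points for several parameter pairs, using the normalizing constant $\text{B}(a, b)$ given above. The helper name `beta_pdf` and the chosen parameters are illustrative.

```python
from math import gamma

def beta_pdf(x, a, b):
    """Be(a, b) density for 0 < x < 1, with B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    norm = gamma(a) * gamma(b) / gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / norm

xs = [0.1, 0.3, 0.5, 0.7, 0.9]
for a, b in [(1, 1), (0.5, 0.5), (2, 2), (2, 5), (5, 2)]:
    print((a, b), [round(beta_pdf(x, a, b), 3) for x in xs])
```

The $(1, 1)$ row is flat, the $(0.5, 0.5)$ row is largest near the endpoints, the $(2, 2)$ row peaks at ½, and swapping $a$ and $b$ mirrors the density about ½.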

#### Dirichlet

The Dirichlet distribution is the multivariate extension of the beta distribution. It models a collection of probabilities (between zero and one) which all sum to one.
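
One common way to draw from a Dirichlet, sketched here with the standard library only, is to normalize independent $\text{Ga}(\alpha_j, 1)$ draws so that they sum to one. The concentration parameters `alphas`, the seed and the helper name are illustrative. Each component's sample mean should land near $\alpha_j / \sum_i \alpha_i$.

```python
import random

def sample_dirichlet(alphas, rng):
    """One Dirichlet(alphas) draw: independent Gamma(alpha_j, 1) variables, normalized."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]  # shape a, scale 1
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(42)  # fixed seed so the run is reproducible
alphas = [2.0, 3.0, 5.0]
draws = [sample_dirichlet(alphas, rng) for _ in range(10_000)]
means = [sum(d[j] for d in draws) / len(draws) for j in range(len(alphas))]

print("first draw:", draws[0], "sum:", sum(draws[0]))
print("component means:", means)
```

With only two components, the first coordinate of the draw is a $\text{Be}(\alpha_1, \alpha_2)$ r.v., which is the sense in which the Dirichlet extends the Beta.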

### Discrete uniform distribution

A symmetric probability distribution whereby a finite number of values are equally likely to be observed; every one of $n$ values has equal probability $\frac{1}{n}$.

$$\begin{aligned} X &\sim \text{DUnif}( C ) \\ p(x|C) &= \tfrac{1}{\lvert C \rvert} \end{aligned}$$
