Summary
Key facts and properties for common probability distributions. Largely adapted from a probability cheatsheet, with some omissions and changes of parameterization.
Discrete distributions
| Distribution | Notation | PMF | Expected value | Variance | MGF |
|---|---|---|---|---|---|
| Bernoulli | $\text{Bern}(p)$ | $P(X=1) = p$, $P(X=0) = q = 1-p$ | $p$ | $pq$ | $q+pe^t$ |
| Binomial | $\text{Bin}(n, p)$ | $P(X=k) = {n \choose k} p^k q^{n-k}$ | $np$ | $npq$ | $(q+pe^t)^n$ |
| Geometric | $\text{Geom}(p)$ | $P(X=k) = q^k p$ | $\frac{q}{p}$ | $\frac{q}{p^2}$ | $\frac{p}{1-qe^t}$ |
| Negative Binomial | $\text{NBin}(r, p)$ | $P(X=n) = {r+n-1 \choose r-1} p^r q^n$ | $r \cdot \frac{q}{p}$ | $r \cdot \frac{q}{p^2}$ | $\Big(\frac{p}{1-qe^t}\Big)^r$ |
| Hypergeometric | $\text{HGeom}(w, b, n)$ | $P(X=k) = {w \choose k} {b \choose n-k} / {w+b \choose n}$ | $\frac{nw}{w+b}$ | — | — |
| Poisson | $\text{Pois}(\lambda)$ | $P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}$ | $\lambda$ | $\lambda$ | $e^{\lambda(e^t-1)}$ |
Continuous distributions
| Distribution | Notation | PDF | Expected value | Variance | MGF |
|---|---|---|---|---|---|
| Uniform | $\text{Unif}(a,b)$ | $f(x) = \frac{1}{b-a}$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | $\frac{e^{tb} - e^{ta}}{t(b-a)}$ |
| Normal | $\mathcal{N}(\mu, \sigma^2)$ | $f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}$ | $\mu$ | $\sigma^2$ | $\exp(\mu t + \frac{1}{2}\sigma^2 t^2)$ |
| Exponential | $\text{Expo}(\lambda)$ | $f(x) = \lambda e^{-\lambda x}$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$ | $\frac{\lambda}{\lambda - t}$ |
| Gamma | $\text{Ga}(a, \lambda)$ | $f(x) = \frac{\lambda^a}{\Gamma(a)} x^{a-1} e^{-\lambda x}$ | $\frac{a}{\lambda}$ | $\frac{a}{\lambda^2}$ | $\big(\frac{\lambda}{\lambda - t}\big)^a$ |
| Beta | $\text{Be}(a, b)$ | $f(x) = \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} x^{a-1} (1-x)^{b-1}$ | $\frac{a}{a+b}$ | $\frac{ab}{(a+b)^2(a+b+1)}$ | — |
| Log-Normal | $\mathcal{LN}(\mu, \sigma^2)$ | $f(x) = \frac{1}{x \sigma \sqrt{2\pi}} e^{-\frac{(\log x - \mu)^2}{2\sigma^2}}$ | $e^{\mu + \sigma^2/2}$ | $(e^{\sigma^2} - 1) e^{2\mu + \sigma^2}$ | — |
| Chi-Square | $\chi_n^2$ | $f(x) = \frac{1}{2^{n/2} \Gamma(n/2)} x^{n/2-1} e^{-x/2}$ | $n$ | $2n$ | $(1-2t)^{-n/2}$ |
| Student-t | $t_n$ | $f(x) = \frac{\Gamma(\frac{n+1}{2})}{\sqrt{n\pi}\,\Gamma(\frac{n}{2})} \big(1 + \frac{x^2}{n}\big)^{-\frac{n+1}{2}}$ | $0$ (for $n > 1$) | $\frac{n}{n-2}$ (for $n > 2$) | — |
Relationships
The purposes of various distributions really “click” when we consider how they relate to each other.
Flowchart from Statistical Rethinking
Source: Statistical Rethinking, 2e, ch. 10, p. 327
Flowchart from Introduction to Probability
Source: Blitzstein, An Introduction to Probability
Stories
Some further notes on how the distributions above and others relate to each other, and how they arise in natural scenarios.
Counting outcomes with fixed trials
Bernoulli trial
A random variable representing the outcome of a single binary trial (heads/tails, true/false, success/fail, etc.) with probability $p$ of a positive outcome.
Not really a “distribution” since it only involves two possible outcomes: zero or one.
Binomial distribution
The sum of $n$ independent Bernoulli trials, each with the same success probability $p$.
We can represent a Bernoulli trial as a Binomial r.v. with a single trial: $\text{Bern}(p) \sim \text{Bin}(1, p)$.
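As a quick sanity check of the PMF and of the $\text{Bern}(p) \sim \text{Bin}(1, p)$ relationship, a minimal sketch with arbitrary example values for $n$ and $p$:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # arbitrary example values
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

# The PMF sums to 1 and the mean is np.
assert abs(sum(pmf) - 1) < 1e-12
mean = sum(k * q for k, q in enumerate(pmf))
assert abs(mean - n * p) < 1e-12

# Bern(p) is the single-trial special case Bin(1, p): mass q at 0, p at 1.
assert abs(binom_pmf(0, 1, p) - (1 - p)) < 1e-12
assert abs(binom_pmf(1, 1, p) - p) < 1e-12
```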
Hypergeometric distribution
The Binomial distribution assumes our trials are independent, but what if our probability of each trial depends on previous trials? This arises when sampling without replacement, e.g. drawing marbles from a bag one at a time.
More abstractly, we have items of a population classified using two sets of tags (e.g. black/white and sampled/unsampled) with at least one set assigned randomly.
Multinomial
What if we have more than two possible outcomes? The Multinomial distribution is a generalization of the Binomial with $k$ categories, into which each of $n$ objects is independently placed, landing in category $j$ with probability $p_j$.
$$
X \sim \text{Mult}_k(n, \vec{p})
$$
PMF: $P(X_1=n_1, \ldots, X_k=n_k) = \frac{n!}{n_1! \, n_2! \cdots n_k!} \cdot p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$
Useful for keeping track of trials whose outcomes can fall into MECE categories (e.g. excellent, adequate, poor)
The lumping property tells us that if $X \sim \text{Mult}_k(n, \vec{p})$ then for any distinct $i, j$, $X_i + X_j \sim \text{Bin}(n, p_i + p_j)$.
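The lumping property can be verified exactly for small $n$ by summing the Multinomial PMF over all outcomes where $X_i + X_j = m$; a sketch with assumed example values for $n$ and $\vec{p}$:

```python
from math import comb, factorial
from itertools import product

def mult_pmf(counts, p):
    """Multinomial PMF at a vector of category counts."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)
    prob = float(coef)
    for c, pj in zip(counts, p):
        prob *= pj**c
    return prob

n, p = 5, [0.2, 0.3, 0.5]  # example values: 5 trials, k = 3 categories
i, j = 0, 1

for m in range(n + 1):
    # P(X_i + X_j = m): sum the Multinomial PMF over all valid count vectors.
    lumped = sum(
        mult_pmf(c, p)
        for c in product(range(n + 1), repeat=3)
        if sum(c) == n and c[i] + c[j] == m
    )
    pij = p[i] + p[j]
    binom = comb(n, m) * pij**m * (1 - pij) ** (n - m)
    assert abs(lumped - binom) < 1e-12
```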
Counting trials with fixed outcomes
Geometric distribution
The probability distribution of the number of failures in a sequence of independent Bernoulli trials before the first success.
Closely related to the first success distribution: if $Y \sim \text{FS}(p)$ then $Y - 1 \sim \text{Geom}(p)$.
Negative Binomial distribution
A discrete probability distribution of the number of failures in a sequence of i.i.d. Bernoulli trials before a specified number of successes $r$ occurs.
A generalization of the Geometric distribution, which can be represented as the sum of $r$ i.i.d. Geometric r.v.s.
Counting events with fixed time
Poisson distribution
A discrete probability distribution representing the number of events occurring in a fixed period of time when they occur independently at a known constant rate.
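One way to see where the Poisson comes from: it is the limit of $\text{Bin}(n, \lambda/n)$ as $n \to \infty$ with the expected count $np = \lambda$ held fixed. A numeric sketch with an example rate $\lambda = 2$:

```python
from math import comb, exp, factorial

lam = 2.0  # example rate

def pois_pmf(k):
    """P(X = k) for X ~ Pois(lam)."""
    return exp(-lam) * lam**k / factorial(k)

def binom_pmf(k, n):
    """Bin(n, lam/n): many trials, each with a tiny success probability."""
    p = lam / n
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# With n large and np = lam fixed, the Binomial PMF approaches the Poisson PMF.
for k in range(6):
    assert abs(binom_pmf(k, 100_000) - pois_pmf(k)) < 1e-4
```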
Exponential
A continuous probability distribution that describes the time between events of a Poisson process, also known as interarrival times. The continuous counterpart of the Geometric distribution.
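The Exponential's memorylessness, $P(X > s+t \mid X > s) = P(X > t)$, follows directly from its survival function $P(X > x) = e^{-\lambda x}$; a one-line check with arbitrary example values:

```python
from math import exp

lam, s, t = 1.5, 2.0, 0.7  # arbitrary rate and waiting times

def surv(x):
    """Survival function P(X > x) for X ~ Expo(lam)."""
    return exp(-lam * x)

# Having already waited s does not change the distribution of the remaining wait.
cond = surv(s + t) / surv(s)  # P(X > s + t | X > s)
assert abs(cond - surv(t)) < 1e-12
```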
Gamma distribution
A two-parameter family of continuous probability distributions, of which the Exponential distribution is a special case. In Bayesian statistics, it is used as a conjugate prior for rate-parameterized distributions such as the Exponential, the Poisson, or the Gamma itself.
While an Exponential r.v. represents the waiting time for the first success under conditions of memorylessness, a Gamma r.v. represents the total waiting time for multiple successes.
$$
\text{If } X_1, \ldots, X_n \overset{iid}{\sim} \text{Expo}(\lambda), \;\text{ then }\; \sum_{i=1}^n X_i \sim \text{Ga}(n, \lambda)
$$
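This relationship can be sanity-checked by simulation against the Gamma's mean $n/\lambda$ and variance $n/\lambda^2$ from the table above; a sketch with assumed example values $\lambda = 2$, $n = 5$:

```python
import random

random.seed(0)
lam, n, trials = 2.0, 5, 200_000  # example values

# Each draw is the total waiting time for n arrivals at rate lam.
sums = [sum(random.expovariate(lam) for _ in range(n)) for _ in range(trials)]

mean = sum(sums) / trials
var = sum((x - mean) ** 2 for x in sums) / trials

# Ga(n, lam) has mean n/lam and variance n/lam^2.
assert abs(mean - n / lam) < 0.05
assert abs(var - n / lam**2) < 0.05
```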
Combining random variables
Normal Distribution
A continuous probability distribution to which the sampling distributions of other r.v.s converge via the central limit theorem.
Log-Normal
If $X \sim \mathcal{N}(0, 1)$ we can create a Log-Normal r.v. as $Y = e^X$. Then we can apply change of variables to find its PDF, noting the inverse transformation $x = \log y$.
$$
\begin{aligned}
f_Y(y) &= f_X(x) \bigg| \frac{dx}{dy} \bigg| \\
&= \varphi(x) \cdot \frac{1}{e^x} \\
&= \frac{1}{y} \varphi(\log y) \;\text{ for } y>0
\end{aligned}
$$
Notes
- Arises naturally as the product of many i.i.d. positive random variables (as opposed to their sum, which converges to Normal via the CLT)
- Does not mean “log of a Normal” but rather “log is Normal” ☠
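The derived density $f_Y(y) = \frac{1}{y}\varphi(\log y)$ can be checked numerically against a finite difference of the CDF $F_Y(y) = \Phi(\log y)$; a small sketch using only the standard library:

```python
from math import erf, exp, log, pi, sqrt

def phi(x):
    """Standard Normal PDF."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def Phi(x):
    """Standard Normal CDF, via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def f_Y(y):
    """Derived Log-Normal density: phi(log y) / y."""
    return phi(log(y)) / y

# A central finite difference of the CDF should match the derived density.
h = 1e-6
for y in (0.5, 1.0, 2.0, 5.0):
    deriv = (Phi(log(y + h)) - Phi(log(y - h))) / (2 * h)
    assert abs(deriv - f_Y(y)) < 1e-5
```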
Cauchy distribution
The distribution of a ratio of Normally distributed r.v.s: $\frac{X}{Y}$ with $X, Y \stackrel{i.i.d.}{\sim} \mathcal{N}(0,1)$.
Its notable property is that it has no well-defined expectation or variance. In practice, this means the CLT does not apply: as we collect more Cauchy r.v.s, their sample mean remains Cauchy rather than converging to the Normal distribution. This makes statistical inference difficult.
Chi-squared
The distribution of a sum of the squares of $n$ independent standard Normal random variables.
$$
\begin{aligned}
\text{If} &\quad V = Z_1^2 + Z_2^2 + \ldots + Z_n^2 \\
\text{where} &\quad Z_i \sim \mathcal{N}(0, 1) \\
\text{then} &\quad V \sim \chi_n^2 \\
\text{with} &\quad n \text{ degrees of freedom.}
\end{aligned}
$$
Important in statistics because the sample variance, after appropriate scaling, is Chi-Square distributed.
Relationships to other distributions
- Normal (see above)
- Gamma: $\chi_n^2 \sim \text{Ga}\big(\tfrac{n}{2}, \tfrac{1}{2}\big)$
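The Gamma relationship can be verified pointwise by comparing the two densities, using the rate parameterization $\text{Ga}(a, \lambda)$ as in the table above; a sketch:

```python
from math import exp, gamma

def chi2_pdf(x, n):
    """Chi-square density with n degrees of freedom."""
    return x ** (n / 2 - 1) * exp(-x / 2) / (2 ** (n / 2) * gamma(n / 2))

def gamma_pdf(x, a, lam):
    """Ga(a, lam) density, rate parameterization."""
    return lam**a / gamma(a) * x ** (a - 1) * exp(-lam * x)

# Chi-square with n d.o.f. is exactly Ga(n/2, 1/2).
for n in (1, 2, 5):
    for x in (0.5, 1.0, 3.0, 7.5):
        assert abs(chi2_pdf(x, n) - gamma_pdf(x, n / 2, 0.5)) < 1e-12
```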
Multivariate Normal (MVN)
An extension of the Normal distribution to multiple dimensions $m$. The mean parameter $\mu$ becomes a vector of length $m$, and $\sigma$ becomes an $m \times m$ covariance matrix rather than a scalar quantity.
Properties
$(X_1, \ldots, X_k) \sim \text{MVN}$ if every linear combination of $X_1, \ldots, X_k$ has a Normal distribution. For r.v.s whose joint distribution is MVN, independence and zero correlation are equivalent conditions.
Other
Beta distribution
A continuous probability distribution defined on the interval $[0,1]$, frequently used to model our prior belief about an unobserved probability $p$ of some other distribution.
A generalization of the Uniform distribution (continuous and bounded): $\text{Beta}(1,1) = \text{Unif}(0, 1)$.
- Normalizing constant: $\text{B}(a, b) = \frac{\Gamma(a) \Gamma(b)}{\Gamma(a+b)}$
- Shape of the density w.r.t. parameters $a$ and $b$:
  - Convex (U-shaped) when $a < 1$ and $b < 1$.
  - Concave when $a > 1$ and $b > 1$.
  - Symmetric about ½ when $a = b$.
  - Positive skew (leans left) when $a < b$.
  - Negative skew (leans right) when $a > b$.
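A quick numeric check of the normalizing constant and the mean $\frac{a}{a+b}$, with arbitrary example shape parameters $a = 2$, $b = 5$:

```python
from math import gamma

a, b = 2.0, 5.0                               # arbitrary example shape parameters
const = gamma(a + b) / (gamma(a) * gamma(b))  # 1 / B(a, b)

def beta_pdf(x):
    """Beta(a, b) density via the Gamma-function normalizing constant."""
    return const * x ** (a - 1) * (1 - x) ** (b - 1)

# Midpoint-rule integration over [0, 1] as a rough numeric check.
N = 100_000
total = mean = 0.0
for i in range(N):
    x = (i + 0.5) / N
    w = beta_pdf(x) / N
    total += w
    mean += x * w

assert abs(total - 1) < 1e-6           # density integrates to 1
assert abs(mean - a / (a + b)) < 1e-6  # E[X] = a / (a + b)
```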
Dirichlet
The Dirichlet distribution is the multivariate extension of the beta distribution. It models a collection of probabilities (between zero and one) which all sum to one.
Discrete uniform distribution
A symmetric probability distribution in which each of a finite number of values is equally likely to be observed: every one of $n$ values has equal probability $\frac{1}{n}$.
$$
\begin{aligned}
X &\sim \text{DUnif}(C) \\
p(x \mid C) &= \tfrac{1}{\lvert C \rvert}
\end{aligned}
$$
Backlinks
 Hierarchical models
 Hierarchical models are predictive models which combine multiple Probability distributions into a structure which (hopefully) reflects the true underlying data generating process more precisely.