Markov chain Monte Carlo

What is it

The goal of Markov chain Monte Carlo (MCMC) is to draw samples from some probability distribution without having to know its exact height at any point. The way MCMC achieves this is to “wander around” on that distribution in such a way that the amount of time spent in each location is proportional to the height of the distribution. Once it converges, we get a stationary distribution which matches the distribution from which you are sampling.

Why it matters

Integration is key to Bayesian statistics as optimization is to classical statistics.

— I forgot where I found this quote.

Calculating multi-dimensional integrals is often either computationally expensive or entirely intractable. Computation can be manageable when our choice of likelihood and prior form a conjugate pair. But when this is not the case—or when we are working with hierarchical models—we need to employ clever approaches to approximate the distribution of our variable of interest.


Metropolis–Hastings

The Metropolis–Hastings algorithm is a general recipe that lets us start with any irreducible Markov chain on the state space of interest and then modify it into a new Markov chain that has the desired stationary distribution.

Algorithm

  1. From the current state $i$, propose a new state $j$ based on a proposal distribution. With a discrete state space, this simply corresponds to the $i^{th}$ row of our transition probability matrix $P$.
  2. Compute the acceptance probability $a_{ij}= \min \Big( \frac{s_j}{s_i} \frac{p_{ji}}{p_{ij}}, 1 \Big)$.
  3. Flip a coin that lands Heads with probability $a_{ij}$.
  4. If heads, accept the proposal, setting $X_{n+1}=j$. Otherwise, reject the proposal, setting $X_{n+1}=i$.

The cumulative list of all states visited after many iterations reflects our estimate of the posterior density.

Tuning for performance

Acceptance rate – Intuitively, a proposal distribution with a very low acceptance rate will be slow to converge, while one with a very high acceptance rate may indicate that the chain makes small, timid proposals, in which case it would take a very long time to explore the entire space.

Judging convergence – Run multiple chains from randomized starting points, more widely dispersed than you thinkk necessary, just to be safe.

Drawbacks 💩

As we move into higher dimensional space, the guess-and-check approach simply does not scale, regardless of how well we tune our parameters.


Gibbs sampling

An algorithm for obtaining approximate draws from a joint distribution, based on sampling from full conditional distributions one at a time. Systematic scan sweeps through components in a deterministic order, while random scan updates a randomly chosen compnent at each stage.

Algorithm (systemic scan)

Let $X, Y$ be discrete r.v.s with joint PMF $p_{X,Y}(x,y) = P(X=x, Y=y)$. We wish to construct a two-dimensional Markov chain $(X_n, Y_n)$ whose statiionary distribution is $p_{X,Y}$.

We achieve this by iteratively updating one component while holding the other fixed until converged: 1. Draw a value from full conditional distribution $P(X|Y=y_n)$ and use it to set $X_{n+1} = x_{n+1}$. 2. Draw a value from $P(Y|X=x_{n+1})$ and use it to set $Y_{n+1} = y_{n+1}$.

When converged, the stationary distribution of the chain is $p_{X,Y}$.

Advantages 💪🏼

Gibbs sampling shines when the conditional distributions (e.g. $f(x | x, y, \theta)$) are straightforward (possibly trivial) to sample from. Can be more efficient than random walk metropolis, since no iterations are rejected.

Hazards ☠

If parameters are correlated, a Gibbs sampler will explore the parameter space slowly, since each partial conditional distribution will be quite narrow.


Hamiltonian Monte Carlo (HMC)

HMC is the “new hotness” of MCMC algorithms, and is used to power most modern engines including Stan and PyMC3.

Rather than using “guess and check” approach of Metropolis–Hastings, with proposal distributions centered around the current value, HMC takes the “momentum” of the parameter into account.

Physics analogy 🔗

  • Flip our posterior distribution upside down and think of it as a frictionless plane. This is negative log posterior space.
  • We flick a marble (our parameter value) in a random direction and let it roll around the plane for a fixed amount of time (number of iterations).
  • Ignore friction entirely. Then the marble’s energy can be decomposed into kinetic energy and potential energy.
  • As it moves uphill it trades kinetic energy for potential energy, and vice versa as it moves downhill.
  • The marble tends to end up in valleys, which corresponds to areas of the posterior with high probability.
  • Kinetic energy is essentially momentum, which we can use to guide our sampling process.

Advantages 💪🏼

  • Because (nearly?) all proposals are accepted, HMC is more efficient, in the sense that it does not spend time on samples which are ultimately rejected. This is helpful for complex models, where Metropolis–Hastings may reject the majority of proposals.
  • HMC can be used with models with many more parameters, since its overall efficiency does not fall off a cliff as complexity increases.
  • It is somewhat self-aware. HMC can sometimes notify you when sample sare unreliale, by checking that the total energy of the “marble” at the end of sampling does not equal the total energy a the start. While some energy will be lost due to approximations at each step, they should be approximately equal.

Hazards ☠

  • Each sample generated by HMC is more computationally expensive, as it requires calculating the gradient at each step. This somewhat counteracts the above-mentioned benefit of less rejections.
  • Requires the parameter space to be continuous, since HMC relies on gradients, and it is not possible to calculate the gradient of a discrete variable.
  • We sometimes run into the “U-Turn” problem, in which samples follow cyclical paths. If our sample duration roughly matches the duration of the cycle, we can end up with a bunch of samples in nearly the same place.

No U-Turn Sampler (NUTS)

  • During warm-up, NUTS determines best step size which maximizes acceptance rate.
  • During sampling, we sample both forward and backward in time. When we detect the path is going to loop back, we prematurely terminate that sample.

Diagnostics

Trace plot

We can visually inspect the trace plot to at least check if things went terribly wrong. Richard McElreath calls this the “Hairy caterpillar ocular inspection test”. We are looking for three properties:

  1. Stationarity –The mean is centered at some value across time.
  2. Mixing – The parameter does not get “stuck” in specific areas.
  3. Convergence – Each individual chain looks similar.

Numeric diagnostics

The summary output gives us a few quantities which an be informative.

Effective sample size

  • ESS is a crude estimate of the number of independent samples, based on the autocorrelation between samples.
  • Since HMC draws are sometimes negatively autocorrelated, this number can be larger than the actual number of samples drawn.
  • If it is very low, there is likely a problem with your model specification.

R hat

  • r_hat is an estimate of the convergence of chains to the target distribution.
  • Ideal value is 1.0. If it is much higher than 1.1, your chains are having trouble converging.

Divergent transitions

  • These are detected by the “law of conservation of energy” mentioned above.
  • Divergent transitions indicate there is a structural problem with your model that needs to be addressed.
  • Note samplers have the same problems, but are just silent about it. HMC is able to raise it to your attention.

Frequently occurs in Hierachical models in which one variable is conditional on another.

$$ \begin{aligned} v &\sim N(0, 3) \\ x &\sim N(0, e^v) \end{aligned} $$

Because the topology of the parameter space changes, there will be no single step_size which can efficiently explore the entire space.

We can mitigate this by sampling a standard normal variable, and then applying a location–scale transformation outside of the sampling to transform it into our quantity of interest. Read more here or here.


Further reading


  • Hierarchical models
    • Often requires use of MCMC
  • Statistical rethinking
    • MCMC
  • Markov chains
    • MCMC
  • Model evaluation
    • Although CV is the gold-standard to estimate predictive accuracy, it is not practical for guiding model building and refinement. A single Hierarchical model may take hours to fit via MCMC, and evaluating it with CV would require repeating this many more times.
  • Soliciting priors
    • Best case, this will cause our MCMC model to take longer to sample the chain, since it will spend more time proposing implausible parameter values. Worst case, it may cause our model to not converge at all. Much better in this scenario to use weakly informative priors.

© Geoff Ruddock 2020