Hierarchical models are predictive models that combine multiple probability distributions into a structure which (hopefully) reflects the true underlying data-generating process more closely.

These models typically involve greater conceptual and computational complexity, but can produce better predictions and parameter estimates if built effectively.

Also referred to as: multilevel models, mixed effects models.

## Key concept: Partial pooling 💡

When some of our predictor variables exhibit a natural hierarchical structure, we can make use of this information to improve our model. Typically these hierarchical variables are *discrete* and *exchangeable*.

Examples:

- Individuals in families
- Cities in states in countries
- Employees in departments in companies

Examples below will consider the hierarchical relationship between cities and states. Although members of a sub-level (e.g. cities) are *somewhat* independent of each other, they are not *completely* independent. Each city provides *some* information that can be used to improve the estimates for all of the other cities.

The question, then, is how to effectively “pool” their information together. Partial pooling is a compromise between the two extreme cases of complete pooling (treat all cities the same) and no pooling (treat each city as independent).

### Complete pooling

On one extreme, suppose we treat all data points as members of the same population and take the “grand mean” (collective average).

- This is a good approach if our goal is to draw conclusions about the data in *aggregate*. Complete pooling gives us the model with the best possible in-sample fit on the grand mean.
- This is a bad approach if our goal is to make predictions about specific data points. Since we entirely ignore heterogeneity across groups, our model will underestimate uncertainty, and hence will be “underfit” to the true variation between individual data points.
- Complete pooling is the “default” linear regression scenario, in which parameters are scalar values rather than vectors.

### No pooling

On the other extreme, suppose we treat each value of a categorical predictor as its own population, completely unrelated to the others. We treat each city as its own universe, and the data from one provides zero information about the others.

- Good approach if our goal is to make predictions about a specific city.
- Bad approach if our goal is to make inference about the overall population, since we have only city-level estimates.
- The hazard is that this can easily lead to *overfitting*, since there is zero regularization across groups which are in practice somewhat related. We also miss the opportunity to make inference about the collective properties of cities.
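As a concrete sketch of the two extremes (with made-up wait-time data and city names), complete pooling and no pooling reduce to simple arithmetic: one grand mean for everything versus one independent mean per city:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: wait times (minutes) at cafes in three cities,
# with very unequal sample sizes per city.
cities = {
    "paris": rng.normal(5.0, 1.0, size=30),
    "berlin": rng.normal(6.0, 1.0, size=5),
    "rome": rng.normal(4.0, 1.0, size=3),
}

# Complete pooling: all observations belong to one population.
all_obs = np.concatenate(list(cities.values()))
grand_mean = all_obs.mean()

# No pooling: each city gets its own mean, however little data it has.
city_means = {name: obs.mean() for name, obs in cities.items()}
```

Note that the no-pooling estimate for Rome rests on just three observations, which is exactly where overfitting bites.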

### Partial pooling

Partial pooling attempts to share information across data points in an optimal way, effectively performing *adaptive regularization*. The degree of pooling is determined by the data itself.

- The mean for each particular city is drawn from a distribution of city means:
`$\mu_i \sim \mathcal{N}(\bar{\mu}, \bar{\sigma})$`

- The amount of pooling is automatically determined by the amount of variation in group parameters. If groups are quite similar (lower $\bar{\sigma}$), greater shrinkage/regularization is applied.
- In effect, we are allowing the data to *set its own priors*.
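A minimal sketch of the shrinkage arithmetic, assuming the within-group variance `sigma2` and between-group variance `tau2` are known (in a full hierarchical model these would be estimated from the data): each group's partially pooled mean is a precision-weighted average of its raw mean and the grand mean:

```python
import numpy as np

def partial_pool(group_means, group_sizes, sigma2, tau2):
    """Shrink each group mean toward the grand mean, weighted by precision.

    sigma2: within-group variance; tau2: between-group variance.
    Small groups (low n) and similar groups (low tau2) shrink more.
    """
    group_means = np.asarray(group_means, dtype=float)
    group_sizes = np.asarray(group_sizes, dtype=float)
    grand_mean = np.average(group_means, weights=group_sizes)
    # Weight on the group's own mean: n/sigma2 vs. the prior precision 1/tau2.
    w = (group_sizes / sigma2) / (group_sizes / sigma2 + 1.0 / tau2)
    return w * group_means + (1.0 - w) * grand_mean

# A city with 50 observations keeps nearly its raw mean; a city with
# only 2 observations is pulled strongly toward the grand mean.
pooled = partial_pool([5.0, 8.0], [50, 2], sigma2=1.0, tau2=1.0)
```

The function name and the fixed-variance setup are illustrative assumptions, but the weighting itself is the standard normal-normal shrinkage formula.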

### Identifying each approach from model definition

| Approach | Specification | Notes |
|---|---|---|
| Complete pooling | `$y_i = \alpha + \beta x_i$` | Parameters as scalars. |
| No pooling | `$y_i = \alpha_{\text{idx}[i]} + \beta x_i$` | Parameter(s) as vectors, but with no hyper-prior. |
| Varying intercept | `$\begin{aligned}y_i &= \alpha_{\text{idx}[i]} + \beta x_i \\[8pt] \alpha_{\text{idx}[i]} &\sim \mathcal{N}(\mu_\alpha, \sigma_\alpha^2) \end{aligned}$` | Intercept as a vector, with a hyper-prior. |
| Varying slope | `$\begin{aligned}y_i &= \alpha + \beta_{\text{idx}[i]} x_i \\[8pt] \beta_{\text{idx}[i]} &\sim \mathcal{N}(\mu_\beta, \sigma_\beta^2) \end{aligned}$` | Slope as a vector, with a hyper-prior. |
| Varying intercept and slope | `$\begin{aligned}y_i &= \alpha_{\text{idx}[i]} + \beta_{\text{idx}[i]} x_i \\[8pt] \alpha_{\text{idx}[i]} &\sim \mathcal{N}(\mu_\alpha, \sigma_\alpha^2) \\ \beta_{\text{idx}[i]} &\sim \mathcal{N}(\mu_\beta, \sigma_\beta^2) \end{aligned}$` | Both intercept and slope as vectors, each with its own hyper-prior. |
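The varying-intercept specification can be read as a generative recipe: draw group intercepts from the population distribution, then draw observations using each group's own intercept. A NumPy sketch of that simulation, with assumed hyper-parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed hyper-parameters, for illustration only.
mu_alpha, sigma_alpha = 10.0, 2.0   # population of intercepts
beta, sigma_y = 0.5, 1.0            # shared slope, observation noise
n_groups, n_per_group = 4, 100

# Step 1: one intercept per group, drawn from the hyper-distribution.
alpha = rng.normal(mu_alpha, sigma_alpha, size=n_groups)

# Step 2: observations, each using its group's intercept (alpha[idx[i]]).
idx = np.repeat(np.arange(n_groups), n_per_group)
x = rng.normal(size=idx.size)
y = alpha[idx] + beta * x + rng.normal(0.0, sigma_y, size=idx.size)
```

The `idx` array plays the role of $\text{idx}[i]$ in the table: it maps each observation to its group.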

## Metaphors 🔗

### Cafe waiting times

Suppose you are on a trip through Europe, and you spend each afternoon at a cafe in whatever city you are currently in. How long do you expect you’ll be waiting for your coffee?

If you wait 5 minutes at a cafe in Paris, then arrive at a cafe in Berlin, you might expect it to take somewhere around 5 minutes.

Each cafe—and city—is different, but you have some prior belief about the overall population of cafes that you draw priors from.

## Advantages 💪🏼

### More precise

Because hierarchical models partially pool data across groups, the sample size is effectively greater than the traditional “no pooling” scenario in which we fit an individual model for each group. This greater sample size allows our model to make more precise estimates.

### More robust

Partial pooling acts as a form of regularization—individual parameter estimates tend to “shrink” towards the grand mean. As a result, hierarchical models are naturally more robust, less likely to overfit, and better at prediction.

### More flexible

Crafting a generative model for the data demands some greater initial complexity, but it rewards us with much greater flexibility with our analysis. It is easy to extend our model to handle various edge cases such as:

- Incorporating uncertainty from missing data.
- Performing meta-analysis to detect effects across multiple scientific studies or experiments which may have used different methodologies.
- Implementing regularization via priors, which are more intuitive than abstract parameters (e.g. λ in ridge regression) and allow us to better utilize domain expertise.

## Hazards ☠

### Conceptual complexity

Hierarchical models involve more parameters, which increases complexity in a few ways:

- Requires greater care in model specification and more thoughtful priors than single-level models.
- More difficult to debug incorrectly specified models.
- Potentially more difficult to interpret or explain.

We can mitigate the increased complexity in model specification and debugging by being more careful:

- Sample from your priors to ensure they are weakly informative. The majority of samples should reflect *a priori* plausible values.
- Generate a fake dataset, and see if your model can recover the generating parameters before fitting it on the real dataset.
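A sketch of the first check (a prior predictive check) for the cafe wait-time example, with hypothetical priors; the plausibility window is an assumption you would set from domain knowledge:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical priors for a cafe wait-time model:
# mu ~ Normal(5, 2) minutes, sigma ~ Exponential(1).
mu = rng.normal(5.0, 2.0, size=10_000)
sigma = rng.exponential(1.0, size=10_000)

# Prior predictive draws: simulated wait times before seeing any data.
waits = rng.normal(mu, sigma)

# Most simulated waits should land in an a-priori plausible range
# (say, 0-15 minutes); if many are negative or enormous, tighten the priors.
plausible = np.mean((waits > 0) & (waits < 15))
```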

### Computational complexity

If we are building our model based on beliefs about the true underlying process, its specification will probably not be conjugate. Years ago this would have been a show-stopper, but now we can make use of Markov chain Monte Carlo sampling techniques to achieve approximations of computationally complex model specifications.

## Unsorted

### Priors on variance

- Must be strictly positive, since variance is. `$\text{Expo}(1)$` is typically a good starting point.
- If the chain is not healthy, try HalfNormal, which has thinner tails than `$\text{Expo}(1)$`.
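A quick Monte Carlo comparison of the two priors' tails (the threshold of 3 and the sample size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Draws from the two candidate priors on a standard deviation.
expo = rng.exponential(1.0, size=100_000)
halfn = np.abs(rng.normal(0.0, 1.0, size=100_000))  # HalfNormal(1)

# Probability mass beyond 3: Expo(1) keeps ~e^-3 ≈ 5% of its mass there,
# while HalfNormal(1) keeps well under 1% — its tails are much thinner.
expo_tail = np.mean(expo > 3.0)
halfn_tail = np.mean(halfn > 3.0)
```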

### Varying slopes & intercepts

- If we are varying *both* intercepts and slopes, we should model them jointly using a multivariate Normal distribution. This allows us to leverage the fact that intercepts and slopes are typically negatively correlated.
- Set priors on the covariance matrix by factoring it into a diagonal matrix of standard deviations and a separate correlation matrix `$\mathbf{R}$` (matrix of $\rho$):

```
$$
\begin{aligned}
\mathbf{S} =
\begin{bmatrix}
\sigma_\alpha & 0 \\
0 & \sigma_\beta
\end{bmatrix}
\mathbf{R}
\begin{bmatrix}
\sigma_\alpha & 0 \\
0 & \sigma_\beta
\end{bmatrix}
\end{aligned}
$$
```
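The factorization above is easy to verify numerically. In this sketch the standard deviations and correlation are assumed values for illustration:

```python
import numpy as np

sigma_alpha, sigma_beta = 1.0, 0.5   # assumed standard deviations
rho = -0.7                           # assumed intercept/slope correlation

# S = diag(sigmas) @ R @ diag(sigmas)
D = np.diag([sigma_alpha, sigma_beta])
R = np.array([[1.0, rho],
              [rho, 1.0]])
S = D @ R @ D
```

The off-diagonal entries come out as $\sigma_\alpha \sigma_\beta \rho$, which is exactly the covariance between intercept and slope.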

- Set priors on the correlation matrix `$\mathbf{R}$` using the LKJcorr distribution, which gives us a weakly informative prior that is skeptical of extreme correlations (near -1 or 1).
- Do not use the inverse Wishart distribution, since it does not allow us to set separate priors on standard deviation and correlation. It was historically used since it is a conjugate prior, but that doesn’t matter for MCMC.^{1}

## Further reading

- Richard McElreath’s Statistical Rethinking, 2e — Chapter 12 onwards.
- John K. Kruschke’s Doing Bayesian Data Analysis — Chapter 9.
- A Primer on Bayesian Multilevel Modeling using PyStan — Nice Jupyter notebook which runs through the radon example from Gelman’s *Bayesian Data Analysis*.
- What is Modern Statistical Workflow? — Good overview of the steps to take when building a hierarchical model.

## Backlinks

- Missing data
- If we have a generative model (e.g. Hierarchical models), we can do better than dropping or imputing values. By integrating the missingness into our model’s formulation, we take advantage of the information it provides.

- Statistical rethinking
- Hierarchical models

- Model evaluation
- Once we go beyond linear models with flat priors, we cannot simply add `$k$`, since our effective number of parameters will be smaller due to the shrinkage which occurs in Hierarchical models.

^{1} Why LKJcorr is a good prior for correlation matrix? (CrossValidated)