Causal inference

Why it matters πŸ’‘

  • Probabilistic conditioning gives us the observed association $P(Y|X=x)$.
  • But we are really interested in the causal effect $P(Y|do(X=x))$.
  • Confounding occurs when the two are not equal due to the presence of “back-door paths” between $X$ and $Y$.
  • If we perform a proper experiment, we can close these back doors by “playing god”.
  • But in observational studies or poorly designed experiments, these paths must be carefully considered.
  • It is still possible to perform causal inference if we select the correct set of covariates to condition on.

Key concepts

Directed acyclic graphs (DAGs)

  • We use DAGs to model the causal relationships among variables.
  • Some of these variables are observable/measurable, while others are not.
  • If we can accurately model the true dynamics of the system, DAGs can inform our choice of covariates used in regression models.

Conditional independence

Two variables $X$ and $Y$ are conditionally independent given a third variable $Z$ if they provide no additional information about each other when we already know the value of $Z$.

$$ X \perp \hspace{-1em} \perp Y \mid Z $$

If two variables are conditionally independent, then their joint conditional probability factors into the product of their individual conditional probabilities: $P(X,Y|Z) = P(X|Z)P(Y|Z)$.

Depending on the underlying causal pattern between variables, we may or may not want to condition on $Z$.

| Pattern  | Model                  | $X \perp \hspace{-1em} \perp Y$ | $X \perp \hspace{-1em} \perp Y \mid Z$ |
|----------|------------------------|:---:|:---:|
| Fork     | $X \leftarrow Z \to Y$ | ✘   | βœ”οΈŽ  |
| Chain    | $X \to Z \to Y$        | ✘   | βœ”οΈŽ  |
| Collider | $X \to Z \leftarrow Y$ | βœ”οΈŽ  | ✘   |
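
We can sanity-check this table with a quick simulation. The sketch below is illustrative (the variable names, coefficients, and linear-Gaussian models are arbitrary choices): it regresses $Y$ on $X$ with and without adjusting for $Z$, and a coefficient near zero indicates (conditional) independence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def coef_of_x(y, *predictors):
    """OLS coefficient on the first predictor, with an intercept."""
    X = np.column_stack([np.ones(n), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Fork: X <- Z -> Y
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)
print(f"fork:     raw={coef_of_x(y, x):+.2f}  given z={coef_of_x(y, x, z):+.2f}")

# Chain: X -> Z -> Y
x = rng.normal(size=n)
z = x + rng.normal(size=n)
y = z + rng.normal(size=n)
print(f"chain:    raw={coef_of_x(y, x):+.2f}  given z={coef_of_x(y, x, z):+.2f}")

# Collider: X -> Z <- Y
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + rng.normal(size=n)
print(f"collider: raw={coef_of_x(y, x):+.2f}  given z={coef_of_x(y, x, z):+.2f}")
```

For the fork and chain the raw coefficient is far from zero while the adjusted one vanishes; for the collider the pattern flips, exactly as in the table.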

d-Separation

Let $G$ be a DAG, and let $A, B, C$ be disjoint subsets of its vertices, so that each can represent more than one variable.

A path is any consecutive sequence of edges, regardless of their directionality. A path is blocked with respect to $C$ if either of the following occurs:

  1. The path passes through a vertex $v \in C$ which is either head-to-tail (chain) or tail-to-tail (fork).
  2. The path passes through a vertex $v \not \in C$ which is head-to-head (collider) and none of whose descendants are in $C$.

$A$ and $B$ are d-separated by $C$ if all paths from a vertex of $A$ to a vertex of $B$ are blocked with respect to $C$. If they are d-separated, then they are also conditionally independent: $A \perp \hspace{-1em} \perp B \mid C$.

Examples of d-connected variables

In the chain $X \to Z \to Y$, the variables $X$ and $Y$ are d-connected given the empty set: the path between them is unblocked. Conditioning on a collider also d-connects its parents: given $C = \{Z\}$ in $X \to Z \leftarrow Y$, the path is open.

Examples of d-separated variables

Conditioning on the middle variable of a chain $X \to Z \to Y$ or a fork $X \leftarrow Z \to Y$ blocks the path, so $X$ and $Y$ are d-separated by $C = \{Z\}$. Likewise, $X$ and $Y$ in the collider $X \to Z \leftarrow Y$ are d-separated by the empty set.
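
We can check these examples programmatically. Below is a minimal sketch using networkx (assuming version 2.8+, where `d_separated` is available; newer releases rename it to `is_d_separator`) on a toy DAG containing both a chain and a collider:

```python
import networkx as nx

# Toy DAG: a chain X -> Z -> Y plus a collider X -> W <- Y
G = nx.DiGraph([("X", "Z"), ("Z", "Y"), ("X", "W"), ("Y", "W")])

# X and Y are d-connected via the chain when we condition on nothing...
print(nx.d_separated(G, {"X"}, {"Y"}, set()))       # False

# ...but conditioning on the mediator Z blocks the chain (and the
# collider path through W stays closed on its own).
print(nx.d_separated(G, {"X"}, {"Y"}, {"Z"}))       # True

# Conditioning on the collider W re-opens a path between X and Y.
print(nx.d_separated(G, {"X"}, {"Y"}, {"Z", "W"}))  # False
```
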
Types of confounding variables

Confounding is any context in which the association between an outcome $Y$ and a predictor of interest $X$ is not the same as it would be if we had experimentally determined the values of $X$.

There are four fundamental relationships between variables in a DAG, each of which can lead to a particular type of bias if not handled correctly.

| Name           | Structure                      | Hazard                 | Adjust? |
|----------------|--------------------------------|------------------------|:---:|
| The Fork       | $X \leftarrow Z \to Y$         | Confounding bias       | βœ”οΈŽ  |
| The Pipe       | $X \to Z \to Y$                | Post-treatment bias    | ✘   |
| The Collider   | $X \to Z \leftarrow Y$         | Spurious relationships | ✘   |
| The Descendant | $X \to Z \to Y$ with $Z \to W$ | Multicollinearity      | ?   |

The Fork

The textbook definition of a confounding variable, in which $Z$ is a common cause of $X$ and $Y$.

$$ X \leftarrow Z \to Y $$

If we fail to identify this relationship and condition on $Z$ (so that $X \perp \hspace{-1em} \perp Y \mid Z$), the back-door path through $Z$ remains open and we fall victim to so-called confounder bias.

The Pipe (or Chain)

In this scenario, $Z$ mediates the association between $X$ and $Y$. Conditioning on $Z$ would remove the statistical relationship, which is not desirable here, since in this case there is a true mediating effect.

$$ X \to Z \to Y $$

The risk of thoughtlessly adding all variables to a model is that we may inadvertently “block a pipe”. This introduces post-treatment bias, and hides a true causal relationship between $X$ and $Y$.

The Collider

In this scenario, $Z$ is a common result of both $X$ and $Y$. The variables $X, Y$ are truly independent, but jointly cause $Z$. This path is closed by default: there is no problem as long as $Z$ is not in our model.

$$ X \to Z \leftarrow Y $$

We should not condition on $Z$, as this creates a spurious statistical association between $X$ and $Y$. By conditioning on $Z$, we open a non-causal path between $X$ and $Y$.

The Descendant

In this scenario, our variable $W$ is influenced (to some degree) by variable $Z$. So conditioning on $W$ is like weakly conditioning on $Z$.

$$ \begin{aligned} X \to &Z \to Y \\[0pt] &\downarrow \\[0pt] &W \end{aligned} $$

In our particular example $Z$ is part of a chain, but a descendant variable will mirror whatever pattern its parent variable exhibits.
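
A small simulation illustrates the point (the 0.9 weight is an arbitrary choice): adjusting for the descendant $W$ removes part, but not all, of the association flowing through the pipe $X \to Z \to Y$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.normal(size=n)
z = x + rng.normal(size=n)        # pipe: X -> Z
y = z + rng.normal(size=n)        # Z -> Y
w = 0.9 * z + rng.normal(size=n)  # descendant: Z -> W

def coef_of_x(y, *predictors):
    X = np.column_stack([np.ones(n), *predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print(coef_of_x(y, x))     # ~1.0: the full effect flows through the pipe
print(coef_of_x(y, x, w))  # ~0.55: W partially blocks the pipe
print(coef_of_x(y, x, z))  # ~0.0: Z blocks the pipe entirely
```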


Hazards ☠

Post-treatment bias

Carefully controlled experiments can be ruined just as easily as uncontrolled observational studies. Blindly tossing variables into the causal salad is never a good idea, no matter how the data were collected.ΒΉ

If our model is designed to perform causal inference on some “treatment” (an intervention or something you can control), we must be careful not to include downstream variables as predictors in the model.

It makes sense to control for pre-treatment differences, but including post-treatment variables can actually mask the treatment itself.
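
A minimal simulation in the spirit of McElreath’s plant-growth example makes this concrete (the variable names and coefficients here are illustrative): a randomized treatment prevents fungus, and fungus stunts growth, so the treatment helps only by way of the fungus. Adding the post-treatment variable to the regression masks the effect entirely.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

treatment = rng.integers(0, 2, size=n)           # randomized treatment
fungus = rng.binomial(1, 0.5 - 0.4 * treatment)  # treatment prevents fungus
growth = 5.0 - 3.0 * fungus + rng.normal(size=n) # fungus stunts growth

def coef_of_x(y, *predictors):
    X = np.column_stack([np.ones(n), *predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print(coef_of_x(growth, treatment))          # ~ +1.2: the true causal effect
print(coef_of_x(growth, treatment, fungus))  # ~ 0: masked by the post-treatment variable
```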

Spurious relationship

A spurious relationship is one which appears in your statistical model but does not reflect a true causal relationship.

Examples

  • $\text{Switch} \to \text{Light} \leftarrow \text{Electricity}$ – seeing that the light is on makes the switch position and the electricity supply informative about each other.
  • $\text{Newsworthy} \to \text{Published} \leftarrow \text{Trustworthy}$ – selecting on publication induces a spurious association between newsworthiness and trustworthiness.
  • No correlation between height and scoring among NBA players – selection into the league conditions on a collider of height and skill.
  • Armor on WWII planes: put it where you don’t observe damage, because the planes hit there didn’t come back (survivorship bias is conditioning on a collider).

Multicollinearity

When two predictor variables are very strongly correlated, including both in a model will not harm predictive power, but will make the model more difficult to interpret.

But the problem that arises in real data sets is that we may not anticipate a clash between highly correlated predictors, and so may mistakenly read the posterior distribution as saying that neither predictor is important.

| Property                 | Effect | Reason |
|--------------------------|:---:|------------------------------------------------------------------|
| Coefficient estimates    | β–Ό   | The predictors share information, so each partial effect shrinks. |
| Uncertainty of estimates | β–²   | The model cannot tell which correlated predictor deserves credit. |
| Predictive power         | β€”   | The combined information in the predictors is unchanged.          |
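
The sketch below illustrates all three rows at once (the near-duplicate predictor is an artificial construction): fitting $Y$ on one predictor versus on two nearly identical predictors leaves predictions essentially unchanged, but splits the coefficient between them and inflates the standard errors.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000

def fit(y, *predictors):
    """OLS coefficients and standard errors via the normal equations."""
    X = np.column_stack([np.ones(n), *predictors])
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta[1:], se[1:]

x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)  # x2 is almost a copy of x1
y = x1 + x2 + rng.normal(size=n)

print(fit(y, x1))      # coefficient ~2 with a tight standard error
print(fit(y, x1, x2))  # noisy coefficients summing to ~2, much larger standard errors
```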

Techniques πŸ’ͺ🏼

Back-door criterion

A back-door path is any path from $X$ to $Y$ with an arrow pointing into $X$.

Our goal is to condition on a cleverly selected set of covariates $Z$ which block all indirect paths between $X$ and $Y$ but which leave the direct path “open”. We select this set $Z$ to fulfill the following two criteria:

  1. It closes all back-door paths between $X$ and $Y$.
  2. No variable in $Z$ is a descendant of $X$.

Then we can estimate the direct causal effect $X \to Y$ by conditioning on $Z$.

$$ P(Y|do(X=x)) = \sum_z P(Y|X=x,Z=z) P(Z=z) $$
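
We can verify the adjustment formula on simulated data. In the sketch below (the probabilities are arbitrary), a binary $Z$ confounds a binary treatment $X$; the naive contrast overstates the true effect of $+0.3$, while the back-door formula recovers it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 500_000

z = rng.binomial(1, 0.5, size=n)              # fork: Z -> X and Z -> Y
x = rng.binomial(1, 0.2 + 0.6 * z)            # Z pushes X up
y = rng.binomial(1, 0.1 + 0.3 * x + 0.4 * z)  # true effect of X is +0.3
df = pd.DataFrame({"x": x, "y": y, "z": z})

# Naive contrast P(Y|X=1) - P(Y|X=0): inflated by the open back-door path
naive = df.groupby("x")["y"].mean()
print(naive[1] - naive[0])  # ~0.54

# Back-door adjustment: sum_z P(Y|X=x, Z=z) P(Z=z)
p_z = df["z"].value_counts(normalize=True)
p_y_xz = df.groupby(["x", "z"])["y"].mean()
adjusted = {i: sum(p_y_xz[i, k] * p_z[k] for k in (0, 1)) for i in (0, 1)}
print(adjusted[1] - adjusted[0])  # ~0.30, recovering the true effect
```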

Front-door criterion


If we are able to find a variable (or set of variables) $Z$ which completely mediates the direct causal effect of $X \to Y$, then we can use it to measure this causal effect, even in the presence of some unobserved confounder $U$. For this to work, the following must hold:

  1. All directed paths from $X$ to $Y$ flow through $Z$.
  2. There are no back-door paths from $X$ to $Z$.
  3. There are no back-door paths from $Z$ to $Y$ not blocked by conditioning on $X$.

$$ P(Y|do(X=x)) = \sum_z P(Y|do(Z=z)) P(Z=z|do(X=x)) $$
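
In a linear-Gaussian setting the front-door estimate reduces to the product of two regression coefficients: the effect of $X$ on $Z$ (which criterion 2 says has no back door) and the effect of $Z$ on $Y$ adjusting for $X$ (which handles criterion 3). A minimal sketch with an unobserved confounder $U$ (coefficients arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

u = rng.normal(size=n)                # unobserved confounder
x = u + rng.normal(size=n)            # U -> X
z = 0.7 * x + rng.normal(size=n)      # X -> Z (full mediation)
y = 0.5 * z + u + rng.normal(size=n)  # Z -> Y, plus U -> Y

def coef_of_x(y, *predictors):
    X = np.column_stack([np.ones(n), *predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

naive = coef_of_x(y, x)  # ~0.85: biased, the back door through U is open
a = coef_of_x(z, x)      # ~0.7: X -> Z has no back door
b = coef_of_x(y, z, x)   # ~0.5: Z -> Y, conditioning on X blocks Z <- X <- U -> Y
print(naive, a * b)      # the true effect is 0.7 * 0.5 = 0.35
```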

Instrumental variables

An instrument is a variable $Z$ that influences $X$ but does not affect $Y$ except through $X$.


If we can argue that $Z$ is unrelated to our confounder $U$, we can use it as a “natural experiment” by turning $X$ into a collider of $Z$ and $U$.
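
In the linear case this yields the classic IV (Wald) estimator $\hat\beta = \operatorname{Cov}(Z,Y) / \operatorname{Cov}(Z,X)$. A minimal sketch (coefficients arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

u = rng.normal(size=n)                # unobserved confounder
z = rng.normal(size=n)                # instrument, independent of U
x = z + u + rng.normal(size=n)        # Z -> X <- U
y = 2.0 * x + u + rng.normal(size=n)  # true effect of X on Y is 2.0

ols = np.cov(x, y)[0, 1] / np.cov(x, y)[0, 0]  # ~2.33: biased by U
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]   # ~2.0: the IV estimate
print(ols, iv)
```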

Advantages πŸ’ͺ🏼

Sometimes it is simply not possible to either run an experiment or to quantify all possible confounding variables in an observational study.

Using instrumental variables, it is still possible to perform causal inference in these situations.

Hazards ☠

The key assumption is that $Z$ is independent of our unobserved confounder $U$. Since we don’t have measurements of $U$, this relationship must be inferred logically rather than statistically.


Further reading

Variable Relationships in DAGs (dagitty docs) – Overview of basic terminology in DAGs, including a little interactive quiz to test your knowledge.

Lecture 23, Estimating Causal Models (cmu.edu) – An approachable 20-page overview of basic causal inference.

d-Separation Without Tears (dagitty docs) – Introduction to the concept of d-separation with interactive DAGs that show the effect of conditioning on any variable.

Causality - Inferring Causal Effects from Data (YouTube) – A good explanation of the back-door criterion with easy-to-follow examples.

Statistical Rethinking: Lecture 06 – The entire chapter 6 from Richard McElreath’s textbook Statistical Rethinking, 2e is excellent.


  • Simpson’s paradox – an example of a problem which arises when attempting to perform causal inference.
  1. McElreath, p. 175

© Geoff Ruddock 2020