Missing data

Why you can’t just drop nulls ☠

Suppose you are analyzing a dataset with a non-catastrophic proportion of missing values—say 10%. The naive way to move forward is to simply drop all rows which have any values missing. Think df.dropna().
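To get a feel for how expensive this can be, here is a minimal sketch with synthetic data. The five columns, the 1,000 rows, and the assumption that each cell goes missing independently with probability 10% are all illustrative choices, not something taken from a real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_cols = 1_000, 5  # assumed sizes, purely illustrative

# Synthetic data, then knock out ~10% of cells independently at random
values = rng.normal(size=(n_rows, n_cols))
values[rng.random(values.shape) < 0.10] = np.nan
df = pd.DataFrame(values, columns=list("abcde"))

complete = df.dropna()  # keep only rows with no missing cell

frac_cells_missing = df.isna().to_numpy().mean()
frac_rows_kept = len(complete) / len(df)
print(f"cells missing: {frac_cells_missing:.1%}")
print(f"rows kept after dropna(): {frac_rows_kept:.1%}")
```

Under these assumptions a row survives only if all five of its cells are present, which happens with probability 0.9 ** 5 ≈ 0.59. So roughly 10% missing cells can cost around 40% of the rows.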

The best case is that the values are missing completely at random (MCAR), so dropping them will not introduce bias into your analysis, just reduce your effective sample size.

But that’s a big assumption. If the data is not MCAR, dropping rows can introduce bias into your results. And if you are doing some sort of time-based analysis, such as survival analysis, you simply can’t drop the missing data.

Types of “missingness”

Below I take helpful examples from Richard McElreath’s Statistical Rethinking, 2e to explain types of missingness. In these examples, we are attempting to measure the impact of studying $S$ on homework quality $H$. But a student’s homework may or may not be eaten by their dog $D$, so we only actually get to observe the submitted homework $H^*$.

Missing Completely at Random (MCAR)

Our missing values are MCAR when there is absolutely zero relationship between which values are missing and either our predictor or our response variables.

This is the best possible type of missing data, since it implies that we can ignore the missing data without biasing our analysis. However, it is also often the least plausible explanation for missing data a priori.

Since the missing values are completely random, missingness doesn’t necessarily change the overall distribution of homework scores. It removes data, and that makes estimation less efficient. But missing homework doesn’t necessarily bias our estimate of the causal effect of studying. [1]
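A quick simulation can make this concrete. The sketch below assumes a linear effect of studying on homework quality with a true slope of 0.6, standard normal noise, and a dog that eats homework by coin flip; all of these numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
TRUE_SLOPE = 0.6  # assumed effect of studying on homework quality

S = rng.normal(size=n)                    # studying
H = TRUE_SLOPE * S + rng.normal(size=n)   # true homework quality
D = rng.random(n) < 0.5                   # dog eats homework, unrelated to S or H
H_star = np.where(D, np.nan, H)           # what we actually observe

def slope(x, y):
    """Least-squares slope of y on x, using only rows where y is observed."""
    keep = ~np.isnan(y)
    return np.polyfit(x[keep], y[keep], 1)[0]

slope_full = slope(S, H)        # estimate if no homework were eaten
slope_mcar = slope(S, H_star)   # estimate after dropping eaten homework
print(f"full data: {slope_full:.3f}, complete cases: {slope_mcar:.3f}")
```

Both estimates should land near the assumed slope; the complete-case one is simply based on about half as many rows, so it is noisier.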

Missing at Random (MAR)

Our missing values are MAR if the probability that a value is missing depends only on data we have observed, such as our predictor variable $X$, and not on the missing values themselves, such as our response variable $Y$.

Example: if men are more likely than women to tell you their weight, but within each sex the chance of answering does not depend on the weight itself, then weight is MAR given sex.
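The same idea can be sketched with the homework example. Here the assumption is that students who study more neglect their dog, so the dog eats their homework more often: missingness depends on $S$, which we observe, but not on $H$ itself. The probabilities (80% vs. 10%) and the true slope of 0.6 are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
TRUE_SLOPE = 0.6  # assumed effect of studying on homework quality

S = rng.normal(size=n)                    # studying (always observed)
H = TRUE_SLOPE * S + rng.normal(size=n)   # true homework quality
# Dog eats homework far more often for students who study a lot (assumed rates)
D = rng.random(n) < np.where(S > 0, 0.8, 0.1)
H_star = np.where(D, np.nan, H)

keep = ~np.isnan(H_star)
mean_true = H.mean()                      # average quality of all homework
mean_observed = H_star[keep].mean()       # average quality of submitted homework
slope_mar = np.polyfit(S[keep], H_star[keep], 1)[0]  # regression conditions on S

print(f"mean H: true {mean_true:.3f}, observed {mean_observed:.3f}")
print(f"slope of H on S from complete cases: {slope_mar:.3f}")
```

In this particular linear setup, the raw average of submitted homework is pulled well below the true average, because diligent students are under-represented. The regression slope stays close to the assumed value, since the model conditions on the variable that drives the missingness. That convenient result depends on the model being correctly specified.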

Missing Not at Random (MNAR)

The worst-case scenario is that our missing values are MNAR, which means that the missingness mechanism depends on the missing observations themselves.

  • e.g. a scale which cannot measure objects too light or too heavy

This is a problem that cannot be overcome with off-the-shelf statistical approaches. Your only hope is to attempt to model the underlying missingness mechanism itself. If you can do this reasonably well, you can add it to the model and condition on it.
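To see the damage in the homework setting, the sketch below assumes an extreme mechanism: the dog eats every piece of below-average homework, so whether $H$ is missing depends on $H$ itself. As before, the true slope of 0.6 and the noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
TRUE_SLOPE = 0.6  # assumed effect of studying on homework quality

S = rng.normal(size=n)                    # studying
H = TRUE_SLOPE * S + rng.normal(size=n)   # true homework quality
D = H < 0                                 # dog eats all below-average homework
H_star = np.where(D, np.nan, H)

keep = ~np.isnan(H_star)
slope_full = np.polyfit(S, H, 1)[0]                   # if nothing were eaten
slope_mnar = np.polyfit(S[keep], H_star[keep], 1)[0]  # complete cases only

print(f"full data: {slope_full:.3f}, complete cases: {slope_mnar:.3f}")
```

Under this mechanism the complete-case slope comes out far smaller than the assumed 0.6: selecting on the outcome flattens the apparent relationship, and collecting more rows of the same kind does not help.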

Missing in action (MIA)

If your data is MIA, you have bigger problems—even Chuck Norris cannot save your analysis.

What to do instead ⚡

If you have a generative model (e.g. Hierarchical models), you can do better than dropping or imputing values. By integrating the missingness into the model’s formulation, you take advantage of the information it provides.

  • every missing value becomes a parameter

The fact that a variable has an unobserved value is still an observation. It is data, just with a very special value. The meaning of this value depends upon the context. Consider for example a questionnaire on personal income. If some people refuse to fill in their income, this may be associated with low (or high) income. Therefore a model that tries to predict the missing values can be enlightening. [1]
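As a rough sketch of what "every missing value becomes a parameter" can look like, assume a simple linear model in which a predictor $x$ is sometimes unobserved. The symbols $\alpha$, $\beta$, $\sigma$, $\nu$, $\tau$ and the choice of Normal distributions are illustrative assumptions, not a prescription:

$$
\begin{aligned}
y_i &\sim \mathrm{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta x_i \\
x_i &\sim \mathrm{Normal}(\nu, \tau)
\end{aligned}
$$

When $x_i$ is observed, the last line acts as a likelihood that informs $\nu$ and $\tau$. When $x_i$ is missing, the same line acts as a prior for the unknown $x_i$, which is then estimated jointly with $\alpha$, $\beta$ and $\sigma$. Its posterior is informed both by the other $x$ values and by the observed $y_i$, so the uncertainty about the missing value carries through to the estimate of $\beta$.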

Further reading

  • Statistical Rethinking 2e, ch. 15.2
  • Hierarchical models
    • Incorporating uncertainty from Missing data
  • Statistical rethinking
    • Missing data

  1. Statistical Rethinking 2e, p. 509

© Geoff Ruddock 2020