Geoff Ruddock

Building a hurdle regression estimator in scikit-learn

What are hurdle models?

Google explains best,

The hurdle model is a two-part model that specifies one process for zero counts and another process for positive counts. The idea is that positive counts occur once a threshold is crossed, or put another way, a hurdle is cleared.

Getting started with hurdle models [University of Virginia Library]

What are hurdle models useful for?

Many statistical learning models—particularly linear models—assume some level of normality in the response variable being predicted. If we have a dataset with a heavily skewed response or one which contains extreme outliers, it is a common practice to apply something like a Box-Cox power transformation before fitting.

But what do you do if you come across a clearly multi-modal distribution like the one below? Applying a power transform here will just change the scale of the variable, it won’t help with the fact that there is a huge spike of values at zero. The fact that it is multi-modal is a good indicator that we are over-aggregating data which belong to two or more distinct underlying data generation processes.

Example of a multi-modal distribution

Distributions like this are commonly seen when analyzing composite variables such as insurance claims, where some large proportion are zero, but then the proportion of non-zero values take on a distribution of their own. Breaking down these sorts of distributions into their component parts allows us to more effetively model each piece and then recombine them at a later stage.

In the toy example above we have two underlying processes: Does a customer come back? If so, how many purchases does he or she make? The first is modeled as a binomial random variable (coin flip) and the second as a $ \text{Pois}(\lambda=4) $ random variable, which represents discrete event counts.

Example of a multi-modal distribution

How can I implement a hurdle model?

So we want to fit and predict two sub-models, and then multiply their predictions together:

  1. A classifier, trained and tested on all of our data.
  2. A regressor, trained only on true positive samples, but used to make predictions on all test data.

The most straightforward way to achieve this would be to just train two separate models, make predictions on the same test dataset, and multiply their predictions together before evaluating. However with this approach we lose the ability to interface our model with the rest of the scikit-learn ecosystem, including passing it into GridSearchCV or any of the evaluation functions such as cross_val_predict.

A better approach is to implement our hurdle model as a valid scikit-learn estimator object by extending from the provided BaseEstimator class.