# Building a hurdle regression estimator in scikit-learn

##### Sep 16, 2019
Multimodal distributions are common when analyzing composite variables such as insurance claims, where a large proportion of the values are zero and the non-zero values follow a distribution of their own. Breaking such distributions down into their component parts allows us to model each piece more effectively and then recombine them at a later stage.
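
As a rough illustration of the split-and-recombine idea, here is a minimal sketch, not necessarily the estimator built in the post. The class name `HurdleRegressor`, its `clf` and `reg` parameters, and the synthetic data are all assumptions made for this example. A classifier decides whether a value is non-zero, a regressor is fit only on the non-zero rows, and the two predictions are multiplied together.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.linear_model import LinearRegression, LogisticRegression


class HurdleRegressor(BaseEstimator, RegressorMixin):
    """Hypothetical two-part model: classify zero vs. non-zero, then regress."""

    def __init__(self, clf=None, reg=None):
        self.clf = clf  # models whether the target is non-zero
        self.reg = reg  # models the size of the non-zero targets

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        nonzero = y != 0
        # Fall back to simple defaults (an assumption of this sketch).
        self.clf_ = LogisticRegression() if self.clf is None else clone(self.clf)
        self.reg_ = LinearRegression() if self.reg is None else clone(self.reg)
        self.clf_.fit(X, nonzero.astype(int))
        # The regressor only ever sees the non-zero part of the data.
        self.reg_.fit(X[nonzero], y[nonzero])
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Recombine: predicted class (0 or 1) times predicted magnitude.
        return self.clf_.predict(X) * self.reg_.predict(X)


# Synthetic data: roughly half the targets are exactly zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] < 0, 0.0, 5.0 + 2.0 * X[:, 1])

model = HurdleRegressor().fit(X, y)
preds = model.predict(X)
```

Multiplying by the hard class prediction keeps predicted zeros exactly zero. Weighting by `predict_proba` instead would give an expected value, which is a different design choice.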

# One-hot encoding + linear regression = multi-collinearity

##### Jul 29, 2019
You don’t need to be a dummy to fall for the ‘Dummy Variable Trap’ while fitting a linear model, especially if you are using the default parameters for one-hot encoding in scikit-learn. By default, `OneHotEncoder` sets `drop=None`, which causes it to output $k$ columns for a feature with $k$ categories. When these columns are used to fit a linear model with an intercept, they sum to the intercept column, which gives perfect multicollinearity: the coefficients are no longer uniquely determined, and the model can fit the data using unrealistic values.
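
The trap can be seen directly in the rank of the design matrix. The snippet below is a small sketch with a made-up `colors` feature (the data and variable names are assumptions for illustration). It compares the default `drop=None` encoding with `drop="first"` once an explicit intercept column is added.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical feature with k = 3 categories (illustrative data only).
colors = np.array([["red"], ["green"], ["blue"], ["green"], ["red"], ["blue"]])

# Default behaviour: one output column per category.
full = OneHotEncoder(drop=None).fit_transform(colors).toarray()
# Dropping one category leaves k - 1 columns.
reduced = OneHotEncoder(drop="first").fit_transform(colors).toarray()

# Add the intercept column that a linear model with an intercept implies.
intercept = np.ones((len(colors), 1))
X_full = np.hstack([intercept, full])
X_reduced = np.hstack([intercept, reduced])

# The k dummy columns sum to the intercept column, so X_full is rank deficient.
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)
print(X_reduced.shape[1], rank_reduced)
```

With `drop=None` the matrix has more columns than its rank, so one column is redundant. With `drop="first"` the column count and the rank agree.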