# One-hot encoding + linear regression = multi-collinearity

##### Jul 29, 2019

## My coefficients are bigger than your coefficients

I was attempting to fit a simple linear regression model the other day with `sklearn.linear_model.LinearRegression`, but the model was making terribly inaccurate predictions on the test dataset. Upon inspecting the estimated coefficients, I noticed that they were of a *crazy* magnitude, on the order of *billions*. For reference, I was predicting a response which was approximately normally distributed with a mean value of 100.

```
feature_A_1 4060461707040.634
feature_A_2 4060461707005.303
feature_A_3 4060461706988.173
feature_B_1 -2529776773226.519
feature_B_2 -2529776773214.394
feature_B_3 -2529776773206.096
feature_B_4 -2529776773204.950
feature_B_5 -2529776773203.577
feature_B_6 -2529776773201.271
feature_B_7 -2529776773195.004
Name: coef, dtype: float64
```

What is going on here? It turns out it was related to my use of `OneHotEncoder` in my preprocessing pipeline to convert categorical features into a numeric format suitable for linear models. The best practice when converting a categorical feature containing $ k $ values is to output only $ k-1 $ one-hot encoded features, leaving one of them as the “default” value that applies when all $ k-1 $ booleans are zero. Unfortunately, I overlooked the fact that by default `OneHotEncoder` sets the parameter `drop=None`, which causes it to output all $ k $ columns. When these are then used to fit a linear model with an intercept, we end up with perfect multicollinearity, and so the model overfits the data using unrealistic coefficients. This is known as the dummy variable trap.
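To see the trap concretely, here is a minimal sketch using plain NumPy and a hypothetical three-value feature: the $ k $ one-hot columns always sum to the intercept column, so the design matrix is rank-deficient.

```python
import numpy as np

# Hypothetical categorical feature with k = 3 values, one-hot encoded
# into all k columns (what OneHotEncoder(drop=None) produces).
categories = np.array([0, 1, 2, 0, 1, 2])
one_hot = np.eye(3)[categories]                   # shape (6, 3)

# Prepend an intercept column, as a linear model with fit_intercept=True
# implicitly does.
X_design = np.hstack([np.ones((6, 1)), one_hot])  # shape (6, 4)

# The three one-hot columns sum to 1 in every row, i.e. exactly the
# intercept column, so the matrix has rank 3 despite having 4 columns.
print(np.linalg.matrix_rank(X_design))  # 3
```

Because the columns are linearly dependent, infinitely many coefficient vectors fit the training data equally well, and the solver is free to pick wildly offsetting values like those above.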

## An easy fix…

Since we do not want to remove the intercept, the solution is to encode our categorical features with the parameter `drop='first'`, which produces only $ k-1 $ columns for each categorical feature.

```
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cat_cols = X.select_dtypes('category').columns.tolist()
pipeline = Pipeline([
    ('one_hot', ColumnTransformer(
        [('cat', OneHotEncoder(drop='first'), cat_cols)],
        remainder='passthrough')),
    ('lin_reg', LinearRegression())
])
pipeline.fit(X, y)
```

## …but it doesn’t play nicely with CV pipelines

An additional challenge I faced was that my `OneHotEncoder` was part of a pipeline which was ultimately fed into the `cross_val_predict` function. This function splits the dataset into a number of folds and runs the preprocessing pipeline separately for each fold. It is possible that the training dataset used in one or more of the CV folds may not include every possible value for every categorical feature. When the pipeline is subsequently applied to the test dataset in that fold, it will throw an error about an unknown value, unless you use the parameter `OneHotEncoder(handle_unknown='ignore')`.
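A minimal sketch of the scenario, using a toy dataset in which a rare category appears only once, so some CV folds never see it during training:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: the category 'c' appears only in the last row, so the CV fold
# whose training split excludes that row never sees it.
X = pd.DataFrame({'feature_A': ['a', 'a', 'b', 'b', 'a', 'c']})
y = np.array([1.0, 1.1, 2.0, 2.1, 0.9, 3.0])

pipeline = Pipeline([
    ('one_hot', ColumnTransformer(
        [('cat', OneHotEncoder(handle_unknown='ignore'), ['feature_A'])])),
    ('lin_reg', LinearRegression()),
])

# With handle_unknown='ignore', the unseen 'c' is encoded as all zeros
# instead of raising "Found unknown categories" on that fold.
preds = cross_val_predict(pipeline, X, y, cv=3)
print(preds.shape)  # (6,)
```

Note that `handle_unknown='ignore'` maps an unseen category to an all-zero row, which the model treats the same as the dropped “default” category.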

Unfortunately, it is not possible to simultaneously set `drop='first'` and `handle_unknown='ignore'` on `OneHotEncoder`, else you get the error below.

```
ValueError: `handle_unknown` must be 'error' when the drop parameter is specified, as both would create categories that are all zero.
```

I have not found an elegant solution to this problem. If you know one, please let me know. For now, I fell back to a non-pipeline solution in which I fit `OneHotEncoder` against the entire dataset, and then make predictions against a manually-split test set.

```
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, PowerTransformer

numeric_cols = X.select_dtypes(np.number).columns.tolist()
cat_cols = X.select_dtypes('category').columns.tolist()
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Train the transformer on the full dataset (causes some leakage for PowerTransformer)
col_tx = ColumnTransformer(transformers=[
    ('num', PowerTransformer(), numeric_cols),
    ('cat', OneHotEncoder(drop='first', handle_unknown='error'), cat_cols)
]).fit(X)
# Transform training data and fit model
X_train_tx = col_tx.transform(X_train)
model = LinearRegression()
model.fit(X_train_tx, y_train)
# Transform test data and make predictions
X_test_tx = col_tx.transform(X_test)
preds = model.predict(X_test_tx)
```