One-hot encoding + linear regression = multicollinearity

My coefficients are bigger than your coefficients

The other day I was fitting a simple linear regression model with sklearn.linear_model.LinearRegression, but the model was making terribly inaccurate predictions on the test dataset. When I inspected the estimated coefficients, I noticed that they were of a crazy magnitude, on the order of trillions. For reference, the response I was predicting was approximately normally distributed with a mean of about 100.

feature_A_1    4060461707040.634
feature_A_2    4060461707005.303
feature_A_3    4060461706988.173
feature_B_1   -2529776773226.519
feature_B_2   -2529776773214.394
feature_B_3   -2529776773206.096
feature_B_4   -2529776773204.950
feature_B_5   -2529776773203.577
feature_B_6   -2529776773201.271
feature_B_7   -2529776773195.004
Name: coef, dtype: float64

What is going on here? It turns out it was related to my use of OneHotEncoder in my preprocessing pipeline to convert categorical features into a numeric format suitable for linear models. The best practice when converting a categorical feature containing $ k $ values is to output only $ k-1 $ one-hot encoded features, leaving one value as the “default” that is implied when all $ k-1 $ booleans are zero. Unfortunately I overlooked the fact that, by default, OneHotEncoder sets the parameter drop=None, so it outputs all $ k $ columns. In every row the $ k $ columns of a feature sum to one, which duplicates the intercept column of a linear model fitted with an intercept. The result is perfect multicollinearity: the coefficients are no longer uniquely determined, so the fit can land on huge, mutually offsetting values like the ones above. This is known as the dummy variable trap.
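To make the trap concrete, here is a minimal pure-Python sketch on made-up data (the names levels, rows, predict and shift are illustrative, not from my real dataset). It shows that with all $ k $ dummy columns plus an intercept, a sensible set of coefficients and a wildly inflated one produce exactly the same predictions, so the data alone cannot tell them apart.

```python
# One categorical feature with k = 3 levels, fully one-hot encoded (k columns).
levels = ["a", "b", "c"]
rows = ["a", "b", "c", "a"]
one_hot = [[1.0 if value == level else 0.0 for level in levels] for value in rows]

# In every row the dummies sum to 1, which is exactly the intercept column.
row_sums = [sum(row) for row in one_hot]


def predict(intercept, coefs, encoded_rows):
    """Linear model prediction: intercept + sum of coefficient * feature."""
    return [
        intercept + sum(c * x for c, x in zip(coefs, row))
        for row in encoded_rows
    ]


# A sensible set of coefficients...
sensible = predict(100.0, [5.0, -3.0, 0.0], one_hot)

# ...and an inflated set: add a huge constant to every dummy coefficient
# and subtract the same constant from the intercept.
shift = 4e12
inflated = predict(100.0 - shift, [5.0 + shift, -3.0 + shift, 0.0 + shift], one_hot)

print(sensible)              # predictions from the sensible coefficients
print(inflated == sensible)  # the inflated coefficients predict the same values
```

Because any value of shift gives the same fit on the training data, nothing stops the solver from returning coefficients in the trillions.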

An easy fix…

Since we do not want to remove the intercept, the solution is to encode our categorical features with the parameter drop='first', which produces only $ k-1 $ columns for each categorical feature.

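from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Names of the categorical columns in the feature DataFrame X
cat_cols = X.select_dtypes('category').columns.tolist()

pipeline = Pipeline([
  # Encode each categorical column as k-1 dummies; pass other columns through
  ('one_hot', ColumnTransformer(
      [('cat', OneHotEncoder(drop='first'), cat_cols)],
      remainder='passthrough')),
  ('lin_reg', LinearRegression())
])

# LinearRegression has no fit_predict, so fit first and then predict
pipeline.fit(X, y)
preds = pipeline.predict(X)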
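As a quick sanity check of what drop='first' does, the sketch below encodes a made-up single-column array (X_demo is an illustrative name) with and without the parameter and compares the number of output columns. It calls .toarray() on the result because OneHotEncoder returns a sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with k = 3 distinct values.
X_demo = np.array([["a"], ["b"], ["c"], ["a"]], dtype=object)

# Default behaviour (drop=None): one column per category, k columns in total.
full = OneHotEncoder().fit_transform(X_demo).toarray()

# drop='first': the first category ("a") becomes the all-zeros baseline.
reduced = OneHotEncoder(drop="first").fit_transform(X_demo).toarray()

print(full.shape)     # (4, 3)
print(reduced.shape)  # (4, 2)
print(reduced)        # rows for "a" are all zeros
```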

…but it doesn’t play nicely with CV pipelines

An additional challenge I faced was that my OneHotEncoder was part of a pipeline which was ultimately fed into the cross_val_predict function. This function splits the dataset into a number of folds and fits the preprocessing pipeline separately on the training portion of each fold. The training data in one or more of the CV folds may not include every possible value of every categorical feature. When the fitted pipeline is subsequently applied to the test portion of that fold, it throws an error about an unknown value, unless you use OneHotEncoder(handle_unknown='ignore').
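The failure is easy to reproduce outside of cross-validation. The following sketch uses made-up arrays (train_fold and test_fold are illustrative names) in which the value "c" appears only in the test portion. It shows the default behaviour raising a ValueError and handle_unknown='ignore' encoding the unseen value as an all-zeros row instead.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The training portion of the fold never sees the category "c".
train_fold = np.array([["a"], ["b"], ["a"]], dtype=object)
test_fold = np.array([["c"]], dtype=object)

# Default handle_unknown='error': transforming an unseen category fails.
strict = OneHotEncoder().fit(train_fold)
try:
    strict.transform(test_fold)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# handle_unknown='ignore': the unseen category becomes a row of zeros.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train_fold)
encoded_unknown = lenient.transform(test_fold).toarray()
print(encoded_unknown)  # one row with a zero in both the "a" and "b" columns
```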

Unfortunately, it is not possible to set drop='first' and handle_unknown='ignore' on OneHotEncoder at the same time (at least in the scikit-learn version I was using); if you try, you get the error below.

ValueError: `handle_unknown` must be 'error' when the drop parameter is specified, as both would create categories that are all zero.

I have not found an elegant solution to this problem. If you know one, please let me know. For now, I have fallen back to a non-pipeline solution in which I fit OneHotEncoder on the entire dataset and then make predictions on a manually split test set.

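import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, PowerTransformer

# X_train, X_test and y_train come from a manual split of X and y
numeric_cols = X.select_dtypes(np.number).columns.tolist()
cat_cols = X.select_dtypes('category').columns.tolist()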

# Train the transformer on the full dataset (causes some leakage for PowerTransformer)
col_tx = ColumnTransformer(transformers=[
    ('num', PowerTransformer(), numeric_cols),
    ('cat', OneHotEncoder(drop='first', handle_unknown='error'), cat_cols)
]).fit(X)

# Transform training data and fit model
X_train_tx = col_tx.transform(X_train)
model = LinearRegression()
model.fit(X_train_tx, y_train)

# Transform test data and make predictions
X_test_tx = col_tx.transform(X_test)
preds = model.predict(X_test_tx)
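A possible alternative that keeps everything inside the pipeline is to tell OneHotEncoder up front which values each feature can take, through its categories parameter, so that no fold ever meets an "unknown" value. The sketch below is a toy example under that assumption: X_toy, y_toy and all_categories are made-up names and data, and the category list is read from the full dataset (labels only, no target values). Treat it as something to try rather than a verified fix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: "c" appears only once, so the training portion of the fold
# that holds it out will not contain that category at all.
X_toy = np.array([["a"], ["b"], ["a"], ["b"], ["a"], ["c"]], dtype=object)
y_toy = np.array([1.0, 2.0, 1.0, 2.0, 1.0, 3.0])

# One list of allowed values per categorical column, taken from the full data.
all_categories = [sorted(set(X_toy[:, 0]))]

# With explicit categories, drop='first' can stay and handle_unknown can
# remain at its default of 'error'.
encoder = OneHotEncoder(categories=all_categories, drop="first")
pipeline_cv = make_pipeline(encoder, LinearRegression())

# Unshuffled 3-fold split: the last fold tests on rows 4 and 5 ("a" and "c").
preds_cv = cross_val_predict(pipeline_cv, X_toy, y_toy, cv=KFold(n_splits=3))
print(preds_cv.shape)  # (6,)
```

One caveat: in a fold where a category never appears in the training portion, its dummy column is all zeros, so the model cannot learn a coefficient for it and the prediction for that row is not informed by the category.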