One-hot encoding + linear regression = multi-collinearity
Jul 29, 2019
My coefficients are bigger than your coefficients
I was attempting to fit a simple linear regression model the other day with `sklearn.linear_model.LinearRegression`, but the model was making terribly inaccurate predictions on the test dataset. Upon inspecting the estimated coefficients, I noticed that they were of a crazy magnitude, on the order of trillions. For reference, I was predicting a response which was approximately normally distributed with a mean value of 100.
```
feature_A_1    4060461707040.634
feature_A_2    4060461707005.303
feature_A_3    4060461706988.173
feature_B_1   -2529776773226.519
feature_B_2   -2529776773214.394
feature_B_3   -2529776773206.096
feature_B_4   -2529776773204.950
feature_B_5   -2529776773203.577
feature_B_6   -2529776773201.271
feature_B_7   -2529776773195.004
Name: coef, dtype: float64
```
What is going on here? It turns out it was related to my use of `OneHotEncoder` in my preprocessing pipeline to convert categorical features into a numeric format suitable for linear models. The best practice when converting a categorical feature containing $ k $ values is to output only $ k-1 $ one-hot encoded features, leaving one value as the "default" that applies when all $ k-1 $ booleans are zero. Unfortunately, I overlooked the fact that by default `OneHotEncoder` sets the parameter `drop=None`, which causes it to output all $ k $ columns. When these are then used to fit a linear model with an intercept, we have perfect multicollinearity, and the model overfits the data using unrealistic coefficients. This is known as the dummy variable trap.
An easy fix…
Since we do not want to remove the intercept, the solution is to encode our categorical features with the parameter `drop='first'`, which produces only $ k-1 $ columns for each categorical feature.
```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cat_cols = X.select_dtypes('category').columns.tolist()

pipeline = Pipeline([
    ('one_hot', ColumnTransformer([
        ('cat', OneHotEncoder(drop='first'), cat_cols)
    ], remainder='passthrough')),
    ('lin_reg', LinearRegression()),
])
preds = pipeline.fit(X, y).predict(X)
```
…but it doesn’t play nicely with CV pipelines
An additional challenge I faced was that my `OneHotEncoder` was part of a pipeline which was ultimately fed into the `cross_val_predict` function. This function splits the dataset into a number of folds and runs the preprocessing pipeline separately for each fold. The training dataset used in one or more of the CV folds may not include every possible value for every categorical feature. When the pipeline is subsequently applied to the test dataset in that fold, it will throw an error about an unknown value, unless you use the parameter `handle_unknown='ignore'`.
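As a quick illustration of the two behaviours (with made-up colour data), fitting the encoder on a subset of the categories and then transforming an unseen one looks like this:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([['red'], ['blue']])  # only two categories seen at fit time
X_test = np.array([['green']])           # unseen at transform time

# Default handle_unknown='error': transform raises on the unseen value.
enc = OneHotEncoder(handle_unknown='error').fit(X_train)
try:
    enc.transform(X_test)
except ValueError as exc:
    print('raised:', exc)

# handle_unknown='ignore': the unseen value is encoded as all zeros.
enc = OneHotEncoder(handle_unknown='ignore').fit(X_train)
print(enc.transform(X_test).toarray())  # [[0. 0.]]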
Unfortunately, it is not possible to simultaneously set `handle_unknown='ignore'` and `drop='first'` on `OneHotEncoder`, else you get the error below.
```
ValueError: `handle_unknown` must be 'error' when the drop parameter is specified, as both would create categories that are all zero.
```
I have not found an elegant solution to this problem. If you know one, please let me know. For now, I fell back to a non-pipeline solution in which I fit `OneHotEncoder` against the entire dataset, and then make predictions against a manually split test set.
```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, PowerTransformer

numeric_cols = X.select_dtypes(np.number).columns.tolist()
cat_cols = X.select_dtypes('category').columns.tolist()

# Train the transformer on the full dataset
# (causes some leakage for PowerTransformer)
col_tx = ColumnTransformer(transformers=[
    ('num', PowerTransformer(), numeric_cols),
    ('cat', OneHotEncoder(drop='first', handle_unknown='error'), cat_cols),
]).fit(X)

# Manually split, then transform training data and fit the model
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_tx = col_tx.transform(X_train)
model = LinearRegression()
model.fit(X_train_tx, y_train)

# Transform test data and make predictions
X_test_tx = col_tx.transform(X_test)
preds = model.predict(X_test_tx)
```