Linear regression

Linear regression is a simple approach for modeling the relationship between a continuous and unbounded quantity of interest $y$ and a set of predictor variables $X$.

Model definition

Simple linear regression

Remember the equation of a line $y=mx+b$ from grade school? The equation for a simple linear regression is exactly this, just with different symbols.

$$ y = \alpha + \beta x + \epsilon $$

where $x$ is our observed predictor variable, and $y$ is the (unobserved) quantity we’d like to predict. The intercept $b$ is denoted by $\alpha$, the slope is denoted by $\beta$, and we recognize that our line may not perfectly fit the data, so there will be some error $\epsilon$ between the fitted line and each data point.

Multiple linear regression

In practice, we are almost always trying to fit a line with multiple predictor variables, not just a single variable. So we generalize the equation for simple linear regression to handle $p$ predictors, assigning each one its own slope $\beta_j$. Usually we see the intercept denoted as $\beta_0$.

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon $$

Vector notation

If we prepend a constant $1$ to our predictor variables to form a vector $\mathbf{x} = (1, x_1, x_2, \ldots, x_p)^T$ and assemble our coefficients $\beta_0, \beta_1, \ldots, \beta_p$ into a vector of “weights” $\mathbf{w}$, then we can express the model more compactly with vector notation.

$$ y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + \epsilon \text{ with } \epsilon \sim \mathcal{N}(0, \sigma^2) $$
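To make the notation concrete, here is a minimal sketch (all names and values are made up for illustration): with a constant $1$ prepended to the predictors, `w[0]` plays the role of the intercept and a noiseless prediction is just a dot product.

```python
import numpy as np

# Hypothetical weights: intercept 1.0, slopes 2.0 and 3.0
w = np.array([1.0, 2.0, 3.0])

# Predictors with a constant 1 prepended, so w[0] acts as the intercept
x = np.array([1.0, 0.5, 0.25])

y_hat = w @ x  # noiseless prediction: w^T x = 1.0 + 2.0*0.5 + 3.0*0.25
```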

Advantages 💪🏼

Linear regression is typically the first algorithm taught in ML courses. In real-world problems with large datasets—such as challenges found on Kaggle—the winning solution is almost always some sort of non-linear algorithm (tree-based models, neural nets). Yet it still makes sense to understand linear models, because they serve as the basis for more complex techniques.

A few reasons why linear models are so pervasive:


If you have truly linear relationships between variables, and a large number of observations $n$ relative to your number of predictors $p$, then linear regression is the fastest and most accurate way to model the relationship.


Linear regression model outputs are easy to interpret and explain to stakeholders. If you perform some sort of feature selection, you can entirely remove unnecessary variables from the model, and then clearly show how the remaining variables relate to predictions. Have fun trying to explain the SHAP values from your gradient-boosted decision tree to your CEO.


Linear regression serves as a jumping-off point into more complex models or approaches. We can:

  • Capture non-linear relationships using Polynomial regression.
  • Perform classification using Logistic regression.
  • Perform feature selection using Regularized regression such as Lasso.
  • Model non-continuous outcomes using Generalized linear models.
  • Replace OLS with more complex fitting procedures to mitigate some of its weaknesses.

Fitting a linear model

Fitting the model involves using the data to determine estimates for the model’s parameters.

Simple linear regression

With a simple linear regression on a single predictor variable, we can calculate the parameters by hand.

The slope is calculated as follows:

$$ \hat{\beta} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)} = \frac{ \sum_{i=1}^{n}{(x_i - \bar{x})(y_i - \bar{y})} }{ \sum_{i=1}^{n}{(x_i - \bar{x})^2} } $$

# python code to calculate the slope
import numpy as np
slope = np.dot(xs - xs.mean(), ys - ys.mean()) / len(xs) / np.var(xs)

The intercept is calculated as the difference between the average observed value $\bar{y}$ and the slope times the average predictor value $\bar{x}$.

$$ \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x} $$
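Putting both formulas together on a tiny made-up dataset (the arrays here are arbitrary, chosen so the true line is $y = 1 + 2x$):

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([3.0, 5.0, 7.0, 9.0])  # exactly y = 1 + 2x

# Slope: covariance of x and y divided by variance of x
slope = np.cov(xs, ys, bias=True)[0, 1] / np.var(xs)

# Intercept: mean of y minus slope times mean of x
intercept = ys.mean() - slope * xs.mean()
```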

Multiple linear regression

When we have multiple predictor variables, we use Ordinary Least Squares (OLS) to fit the model. OLS minimizes the sum of squared residuals, just as in the simple case, but generalized to handle more than one explanatory variable.

# R code to fit and evaluate a multiple linear regression model
model = lm(y ~ x1 + x2 + x3, data=data)
summary(model)
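The same fit can be sketched in Python with `np.linalg.lstsq` on a design matrix that includes a column of ones for the intercept. The data below is synthetic, generated from known coefficients so we can check the fit recovers them:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))                   # three predictors
design = np.column_stack([np.ones(n), X])     # column of ones for the intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = design @ beta_true + rng.normal(0, 0.1, size=n)

# Least-squares solution to the OLS problem
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
```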

Why do we square the residuals? 🤔

It is common practice to use Mean Square Error (MSE) as our objective function for the following reasons:

  1. We need to use something which removes the sign of residuals, so that positive and negative residuals do not cancel out.
  2. Squaring the residuals removes the sign and is easy to calculate. Taking the absolute value would also remove the sign, but it makes the loss function non-differentiable at zero, and hence more difficult to optimize.
  3. Squared error places a heavier penalty on a few large mistakes than on many small mistakes. While this can lead to problems with outliers (see section below) it is intuitive for many prediction tasks.

MSE may not be optimal if it does not reflect the true underlying “cost function” of the problem you are trying to solve, particularly if the cost of an error is asymmetric.


Coefficient estimates

In the context of a multiple linear regression model, $\hat{\beta}_i$ represents the expected change in the response variable for a one-unit change in the predictor variable, while holding all other predictors fixed.

  • In a simple linear regression there are no other predictor variables, so the conditional interpretation is unnecessary.
  • In reality, predictor variables are often correlated, and it is not possible to adjust one while holding others constant.

Log transformations

When one or both of our $X, y$ variables are log-transformed, then the interpretation of their coefficients changes.

| Name | Model | Interpretation |
|---|---|---|
| Linear | $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$ | A one-unit increase in $x_i$ produces an expected increase of $\hat{\beta}$ units in the response variable. |
| Linear–Log | $\hat{y}_i = \hat{\alpha} + \hat{\beta} \log x_i$ | A 1% increase in $x_i$ produces an expected increase of $\hat{\beta}/100$ units in the response variable. |
| Log–Linear | $\log \hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$ | A one-unit increase in $x_i$ produces an expected relative increase of $\hat{\beta} \times 100\%$ in the response variable. |
| Log–Log | $\log \hat{y}_i = \hat{\alpha} + \hat{\beta} \log x_i$ | A 1% increase in $x_i$ produces an expected $\hat{\beta}\%$ increase in the response variable. |
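A quick numeric sanity check on the log-linear row (the value of `beta` here is hypothetical): the exact multiplicative effect of a one-unit increase in $x$ is $e^{\hat{\beta}} - 1$, and for small $\hat{\beta}$ this is close to the "$\hat{\beta} \times 100\%$" rule of thumb.

```python
import numpy as np

beta = 0.05               # hypothetical log-linear slope
exact = np.exp(beta) - 1  # exact relative change from a one-unit increase in x
approx = beta             # the "beta * 100 %" rule of thumb
```

For larger slopes the approximation degrades, so it is safest to report $e^{\hat{\beta}} - 1$ directly.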

Making predictions

Estimation vs. Prediction

It is important to keep in mind the difference between parameter estimation and prediction.

Our coefficient estimates involve some uncertainty around the true population parameters. This error is “reducible” in the sense that it will decrease if we use a larger training sample to fit our model.

Our predictions include both the parameter uncertainty described above and some irreducible error $\epsilon$.

So our prediction intervals will always be wider than our confidence intervals on coefficients, because they contain two sources of error instead of one.

# R code for generating prediction intervals
X_pred = data.frame(colA=1, colB=2, colC=3)
predict(model, newdata=X_pred, type='response', interval='prediction', level=0.95) 


The model should not be used to make predictions based on values of $X$ which are far outside the range of values used to fit the model.

  • Values at the tails can be entirely nonsensical, e.g. a regression of age vs. income may have a non-zero intercept, which implies that newborn babies make $2000 per year. But of course your sample probably doesn’t include any observations with an age of zero.
  • If we use polynomial regression to fit a non-linear relationship, the values outside our observed range will be even more unreliable.



R-squared

The coefficient of determination, denoted $R^2$, reflects the proportion of variance in our response variable $Y$ which is explained by the model.

  • In simple linear regression, equal to the squared value of the Pearson correlation coefficient $\rho$, which itself takes a value in the range $[-1, 1]$ indicating both the direction and magnitude of the relationship between $X$ and $Y$.
  • By squaring $\rho$, we get something in the range $[0, 1]$ (usually) which reflects only the magnitude.

$$ R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac {\sum_i (y_i - \hat{y}_i)^2} {\sum_i (y_i - \bar{y})^2 } $$

# python code to manually calculate R^2
r2 = 1 - np.var(resid) / np.var(ys)

Hazards ☠

  • While generally in the range $[0, 1]$, it is possible to get a negative $R^2$ value if the model’s predictions are worse than a simple arithmetic mean.
  • $R^2$ increases monotonically as $p$ increases. We can’t use it to compare models of different complexity, because it will always favour the more complex model.

Adjusted R-squared

$R^2_{adj}$ is a tweak to $R^2$ which places a penalty on model complexity, based on the number of predictors $k$. So if we add an additional variable which is not useful to the model, $R^2_{adj}$ will decrease.

  • It will always be lower than plain R-squared: $R^2_{adj} \leq R^2$
  • 👍🏼 Advantage of being very popular and intuitive to explain
  • 👎🏼 Not as well rooted in statistical theory as $C_p$, AIC or BIC.

$$ R_{adj}^2 = 1 - \frac{RSS}{TSS} \times \frac{n-1}{n-k-1} $$

# python code to manually calculate adjusted R^2
r2_adj = 1 - ((1 - r2_test) * (n - 1) / (n - k - 1))

Model assumptions ☠

In practice, linear regression models require us to make a relatively large number of assumptions.

| Name | Formula | Explanation |
|---|---|---|
| Linearity | $E(\epsilon_i) = 0$ | The residuals should be evenly distributed above/below zero at all predicted values. |
| Homoscedasticity | $\text{Var}(\epsilon_i) = \sigma^2$ | The spread of the residuals should be equal across all fitted values. |
| Independence | $\{ \epsilon_1, \ldots, \epsilon_n \} \sim \text{i.i.d.}$ | Since we cannot verify independence without understanding how the data was collected, we check for uncorrelated errors instead. |
| Normality | $\epsilon_i \sim N(0, \sigma^2)$ | Check the QQ-plot, or rough-check using a histogram of residuals. |
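Two of these checks can be sketched numerically with plain numpy: the residual mean should be near zero (linearity), and the residual spread should be similar across the range of fitted values (homoscedasticity). The `fitted` and `resid` arrays below are simulated stand-ins for the output of a real fitted model:

```python
import numpy as np

rng = np.random.default_rng(1)
fitted = np.linspace(0, 10, 500)
resid = rng.normal(0, 1.0, size=500)  # well-behaved residuals, for illustration

# Linearity: residuals should average out to roughly zero
mean_ok = abs(resid.mean()) < 0.2

# Homoscedasticity: similar spread in the lower vs upper half of fitted values
lo, hi = resid[fitted < 5], resid[fitted >= 5]
spread_ratio = lo.std() / hi.std()  # close to 1 under equal variance
```

In practice you would eyeball a residuals-vs-fitted plot rather than rely on a single ratio, but the idea is the same.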

What to do if the assumptions do not hold?


Non-constant variance

What can you do if the constant variance (homoscedasticity) assumption does not hold when attempting to fit a linear model?

  1. Transform the response variable (Box-Cox)
  2. If variance is proportional to expectation, try using a Poisson regression
  3. Look into Weighted Least Squares (WLS) or Generalized Least Squares (GLS) models


Outliers

What can you do in the presence of outliers when attempting to fit a linear model?

Use a robust regression approach, in which we minimize absolute residuals rather than squared residuals, estimating the conditional median rather than the mean.
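One way to minimize absolute residuals is iteratively reweighted least squares (IRLS), sketched below in plain numpy on synthetic data with one planted outlier. The weighting $w_i = 1/\sqrt{|r_i|}$ makes each weighted least-squares step minimize $\sum w_i^2 r_i^2 = \sum |r_i|$:

```python
import numpy as np

rng = np.random.default_rng(2)
xs = np.linspace(0, 10, 50)
ys = 1.0 + 2.0 * xs + rng.normal(0, 0.3, size=50)
ys[10] += 50.0  # one gross outlier

A = np.column_stack([np.ones_like(xs), xs])
beta_ols = np.linalg.lstsq(A, ys, rcond=None)[0]  # plain OLS, pulled by the outlier

# IRLS for least absolute deviations, starting from the OLS fit
beta = beta_ols.copy()
for _ in range(50):
    r = np.abs(ys - A @ beta)
    w = 1.0 / np.sqrt(np.maximum(r, 1e-8))  # guard against division by zero
    beta = np.linalg.lstsq(A * w[:, None], ys * w, rcond=None)[0]
```

With the outlier present, the robust slope stays much closer to the true value of 2 than the OLS slope does.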

Linearity of residuals

What can you do if the linearity assumption does not hold when attempting to fit a linear model?

  1. If the data suggests a known functional form → fit that non-linear relationship
  2. If unknown relationship → try to transform some variables
  3. Still no luck → try Generalized Additive Models


Outliers & leverage points

An outlier is a data point that is not well fit by the model, as evidenced by an exceptionally large residual.

A high-leverage data point is one whose value $x_i$ is far away from the mean $\bar{x}$, and so has the potential to “pull” the regression line towards it. Since OLS regression fits a line that will necessarily pass through the center of the data $(\bar{x}, \bar{y})$ , changing the slope pivots the regression line around that point like a lever on a fulcrum.

An influential data point is one which unduly influences any part of a regression analysis, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results. A data point’s net influence is a combination of how unexpected it is (outlier) and how much leverage it has.

☠ An outlier is not necessarily influential if it has low leverage. An outlier right on $\bar{x}$ will not change the fitted model parameters as much as one farther away.

Cook’s distance

A method of detecting influential observations by measuring the change in model parameters caused by leaving out the $i^{th}$ observation. This is helpful, because it is cumbersome to visually check each of the $p$ plots between predictor and outcome variables when we have many variables.

# R code to find potential outliers based on Cook's Distance
cooks_distances = cooks.distance(model)
cooks_distances[cooks_distances > 1]
data = data[-578,]  # drop a flagged row (here row 578) after inspecting it

Data points with a value of $D_i > 1$ (or, by a more sensitive rule of thumb, $D_i > 4/n$) warrant closer inspection to determine whether they are outliers and whether they should be removed from the dataset.
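Cook's distance can also be computed by hand from the hat matrix, $D_i = \frac{e_i^2}{k \, s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$ where $h_{ii}$ is the leverage of observation $i$ and $k$ the number of estimated parameters. A sketch on synthetic data with one planted influential point:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.5, size=n)
y[0] += 10.0  # plant one influential outlier

H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix
h = np.diag(H)                        # leverages
resid = y - H @ y
k = X.shape[1]                        # number of estimated parameters
s2 = resid @ resid / (n - k)          # residual variance estimate
D = resid**2 / (k * s2) * h / (1 - h)**2

flagged = np.where(D > 4 / n)[0]      # indices exceeding the 4/n threshold
```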


Multicollinearity

Multicollinearity refers to a situation where a sub-group of predictor variables is highly correlated.

  • Visual inspection of pair plots can help find pairs of correlated predictors, but groups of 3+ may not be obvious.
  • A robust approach is to calculate VIF.

Variance Inflation Factor (VIF)

Expresses how much the variance of a coefficient increases when fit together with other predictors compared to when fit alone.

$$ VIF(\hat{\beta}_j) = \frac{1}{1-R^2_{X_j \mid X_{-j}}} $$

where $R^2_{X_j \mid X_{-j}}$ is the R-squared value from a separate “sub-model” regression of $X_j$ onto all other predictors $X_i, i \neq j$.

  • A VIF > 10 is considered a strong indicator of multicollinearity.
  • Resolve by either removing one of the related predictors or combining them together before fitting.
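The sub-model recipe above translates directly into numpy: regress each column on the rest and convert the resulting $R^2$ into a VIF. In the synthetic example below, two deliberately near-identical predictors produce large VIFs while the independent one stays near 1:

```python
import numpy as np

def vif(X):
    """VIF for each column of X, via a regression of that column on the others."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - others @ coef
        r2 = 1 - resid.var() / X[:, j].var()  # R^2 of the sub-model
        out[j] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(4)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(0, 0.1, size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)
vifs = vif(np.column_stack([x1, x2, x3]))
```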


Extending linear regression

Interaction effects

Basic linear regression requires the additive assumption, which states that the effect of any given predictor $x_i$ on $y$ is entirely independent of the other predictors: the value of one predictor cannot impact the effect of another.

We can relax this assumption by adding an interaction term: a new predictor $x_1 x_2$ with its own coefficient $\beta_3$.

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon $$

Hierarchical principle – If we include an interaction term in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
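In code, the interaction is just another column of the design matrix, built by multiplying the two predictors elementwise. The sketch below uses synthetic data generated from known coefficients (including both main effects, per the hierarchical principle) and checks that OLS recovers them:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x1, x2 = rng.normal(size=(2, n))
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + rng.normal(0, 0.1, size=n)

# Design matrix: intercept, both main effects, and the interaction column
A = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta = np.linalg.lstsq(A, y, rcond=None)[0]
```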

KNN Regression

  • A non-parametric approach to regression.
  • Predicts each point as the average response of its $K$ nearest neighbours, rather than fitting a global line.
  • Underperforms linear regression when the relationship is truly linear.
  • In general, non-parametric approaches underperform relative to linear ones when the ratio of observations to predictors is low.
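A minimal 1-D KNN regressor fits in a few lines of numpy (the function name and data here are illustrative, not from any library):

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k=5):
    """Predict by averaging y over the k training points nearest to x_new."""
    idx = np.argsort(np.abs(x_train - x_new))[:k]
    return y_train[idx].mean()

rng = np.random.default_rng(6)
x_train = rng.uniform(0, 10, size=200)
y_train = 1.0 + 2.0 * x_train + rng.normal(0, 0.2, size=200)

pred = knn_predict(x_train, y_train, x_new=5.0, k=5)  # true mean at x=5 is 11
```

Since this data really is linear, a fitted line would beat KNN here, which is exactly the trade-off noted above.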

Partial Least Squares (PLS)

  • A supervised alternative to PCR
  • Identifies features which not only approximate the original features well, but are also related to the response

© Geoff Ruddock 2020