Linear regression is a simple approach for modeling the relationship between a continuous and unbounded quantity of interest `$y$` and a set of predictor variables `$X$`.

## Model definition

### Simple linear regression

Remember the equation of a line `$y=mx+b$` from grade school? The equation for a simple linear regression is exactly this, just with different symbols.

```
$$
y = \alpha + \beta x + \epsilon
$$
```

where `$x$` is our observed predictor variable, and `$y$` is the (unobserved) quantity we’d like to predict. The intercept `$b$` is denoted by `$\alpha$`, the slope is denoted by `$\beta$`, and we recognize that our line may not perfectly fit the data, so there will be some error `$\epsilon$` between the fitted line and each data point.

### Multiple linear regression

In practice, we are almost always trying to fit a line with multiple predictor variables, not just a single one. So we generalize the equation for simple linear regression to handle `$p$` predictors, assigning each its own slope coefficient `$\beta_j$`. Usually we see the intercept denoted as `$\beta_0$`.

```
$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon
$$
```

### Vector notation

If we assemble our predictor variables `$x_1, x_2, \ldots x_p$` into a vector `$\mathbf{x}$` (with a constant 1 prepended so the intercept is absorbed into the dot product) and our coefficients `$\beta_0, \beta_1, \ldots \beta_p$` into a vector of “weights” `$\mathbf{w}$`, then we can express the model more compactly with vector notation.

```
$$
y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + \epsilon \text{ with } \epsilon \sim \mathcal{N}(0, \sigma^2)
$$
```
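A minimal numpy sketch of the vector form (the weight and feature values here are made up for illustration): prepending a constant 1 to `$\mathbf{x}$` lets the intercept `$\beta_0$` ride along inside the weight vector.

```python
import numpy as np

# Illustrative weights: intercept beta_0 = 2.0, slopes beta_1 = 0.5, beta_2 = -1.0
w = np.array([2.0, 0.5, -1.0])

# Prepend a constant 1 to the feature vector so the intercept is handled by the dot product
x = np.array([3.0, 4.0])
x_aug = np.concatenate(([1.0], x))

y_hat = w @ x_aug  # 2.0 + 0.5*3.0 - 1.0*4.0 = -0.5
print(y_hat)
```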

## Advantages 💪🏼

Linear regression is typically the first algorithm taught in ML courses. In real-world problems with large datasets—such as challenges found on Kaggle—the winning solution is almost always some sort of non-linear algorithm (tree-based models, neural nets). Yet it still makes sense to understand linear models, because they serve as the basis for more complex techniques.

A few reasons why linear models are so pervasive:

### Simple

If you have truly linear relationships between variables, and a large number of observations `$n$` relative to your number of predictors `$p$`, then linear regression is the fastest and most accurate way to model the relationship.

### Interpretable

Linear regression model outputs are easy to interpret and explain to stakeholders. If you perform some sort of feature selection, you can entirely remove unnecessary variables from the model, and then clearly show how the remaining variables relate to predictions. Have fun trying to explain the SHAP values from your gradient-boosted decision tree to your CEO.

### Extensible

Linear regression serves as a jumping-off point into more complex models or approaches. We can:

- Capture non-linear relationships using polynomial regression.
- Perform classification using logistic regression.
- Perform feature selection using regularized regression such as the lasso.
- Model non-continuous outcomes using generalized linear models.
- Replace OLS with more complex fitting procedures to mitigate some of its weaknesses.

## Fitting a linear model

Fitting the model involves using the data to determine estimates for the model’s parameters.

### Simple linear regression

With a simple linear regression on a single predictor variable, we can calculate the parameters by hand.

The slope is calculated as follows:

```
$$
\hat{\beta}
= \frac{\text{Cov}(X,Y)}{\text{Var}(X)}
= \frac{
\sum_{i=1}^{n}{(x_i - \bar{x})(y_i - \bar{y})}
}{
\sum_{i=1}^{n}{(x_i - \bar{x})^2}
}
$$
```

```
# python code (assumes numpy imported as np; xs, ys are data arrays)
meanx, meany = np.mean(xs), np.mean(ys)
slope = np.dot(xs - meanx, ys - meany) / len(xs) / np.var(xs)
```

The intercept is calculated as the difference between the average observed value `$\bar{y}$` and the slope times the average predictor value `$\bar{x}$`.

```
$$
\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}
$$
```
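A quick numpy check of both closed-form estimates, on made-up data, against `np.polyfit` (which fits the same least-squares line for degree 1):

```python
import numpy as np

# Made-up data with a roughly linear trend
xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

meanx, meany = xs.mean(), ys.mean()
slope = np.dot(xs - meanx, ys - meany) / np.sum((xs - meanx) ** 2)
intercept = meany - slope * meanx

# np.polyfit returns [slope, intercept] for degree 1
np_slope, np_intercept = np.polyfit(xs, ys, 1)
print(slope, intercept)  # matches np_slope, np_intercept
```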

### Multiple linear regression

When we have multiple predictor variables, we use Ordinary Least Squares (OLS) to fit the model. OLS minimizes the sum of squared residuals, generalizing the simple-regression solution to more than one explanatory variable.

```
# R code to fit and evaluate a multiple linear regression model
model = lm(y ~ x1 + x2 + x3, data = data)
summary(model)
summary(model)$r.squared
```
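A hedged numpy sketch of what `lm` is doing under the hood (with simulated data and made-up coefficients): OLS solves `$\hat{\beta} = (X^T X)^{-1} X^T y$`, here via the numerically safer `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1, x2, x3 = rng.normal(size=(3, n))

# Simulated response with known coefficients plus small noise
y = 1.0 + 2.0 * x1 - 0.5 * x2 + 0.25 * x3 + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), x1, x2, x3])

# Least-squares solution (equivalent to solving the normal equations)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [1.0, 2.0, -0.5, 0.25]
```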

#### Why do we square the residuals? 🤔

It is common practice to use *Mean Square Error* (MSE) as our objective function for the following reasons:

- We need to use *something* which removes the sign of the residuals, so that positive and negative residuals do not cancel out.
- Squaring the residuals removes the sign and is easy to calculate. Taking the absolute value would also remove the sign, but it makes the loss function non-differentiable at zero, and hence more difficult to optimize.
- Squared error places a heavier penalty on a few large mistakes than on many small mistakes. While this can lead to problems with outliers (see the section below), it is intuitive for many prediction tasks.

MSE may not be optimal if it does not reflect the true underlying “cost function” of the problem you are trying to solve, particularly if the cost of an error is *asymmetric*.
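To see the heavier penalty concretely, a toy comparison of two error patterns with the same total absolute error: many small mistakes versus one large one.

```python
import numpy as np

small = np.full(10, 1.0)   # ten residuals of 1
large = np.array([10.0])   # one residual of 10

# Same total absolute error...
print(np.abs(small).sum(), np.abs(large).sum())  # 10.0 10.0

# ...but squared error penalizes the single large mistake 10x harder
print((small ** 2).sum(), (large ** 2).sum())    # 10.0 100.0
```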

## Interpretation

### Coefficient estimates

In the context of a multiple linear regression model, `$\hat{\beta}_i$` represents the expected change in the response variable for a one-unit change in the predictor variable, *while holding all other predictors fixed*.

- In a simple linear regression there are no other predictor variables, so the conditional interpretation is unnecessary.
- In reality, predictor variables are often correlated, and it is not possible to adjust one while holding others constant.

### Log transformations

When one or both of our `$X, y$` variables are log-transformed, the interpretation of their coefficients changes.

Name | Model | Interpretation |
---|---|---|
Linear | `$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$` | A one-unit increase in `$x_i$` produces an expected increase of `$\hat{\beta}$` units in the response variable. |
Linear–Log | `$$\hat{y}_i = \hat{\alpha} + \hat{\beta} \log x_i$$` | A 1% increase in `$x_i$` produces an expected increase of about `$\hat{\beta}/100$` units in the response variable. |
Log–Linear | `$$\log (\hat{y}_i) = \hat{\alpha} + \hat{\beta} x_i$$` | A one-unit increase in `$x_i$` produces an expected relative increase of about `$\hat{\beta} \times 100\%$` in the response variable. |
Log–Log | `$$\log \hat{y}_i = \hat{\alpha} + \hat{\beta} \log x_i$$` | A 1% increase in `$x_i$` produces an expected `$\hat{\beta}\%$` increase in the response variable. |
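These rules of thumb follow from the first-order approximation `$e^{\hat{\beta}} \approx 1 + \hat{\beta}$` for small `$\hat{\beta}$`. A toy numpy check of the log–linear row (the coefficient value is made up):

```python
import numpy as np

beta = 0.05  # made-up log-linear coefficient

# Exact multiplicative effect on y of a one-unit increase in x
exact = np.exp(beta)   # y is multiplied by ~1.0513

# Rule of thumb: a one-unit increase in x => ~ beta * 100 % increase in y
approx = 1 + beta      # 1.05
print(exact, approx)
```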

## Making predictions

### Estimation vs. Prediction

It is important to keep in mind the difference between *parameter estimation* and *prediction*.

Our *coefficient estimates* involve some uncertainty around the true population parameters. This error is “reducible” in the sense that it will decrease if we use a larger training sample to fit our model.

Our *predictions* involve not only the above-mentioned parameter uncertainty, but also some irreducible error `$\epsilon$`.

So our *prediction intervals* will always be wider than our *confidence intervals* on coefficients, because they contain two sources of error instead of one.

```
# R code for generating prediction intervals
X_pred = data.frame(colA=1, colB=2, colC=3)
predict(model, newdata=X_pred, type='response', interval='prediction', level=0.95)
```

### Extrapolation

The model should not be used to make predictions based on values of `$X$` which are far outside the range of values used to fit the model.

- Values at the tails can be entirely nonsensical, e.g. a regression of age vs. income may have a non-zero intercept, which implies that newborn babies make $2000 per year. But of course your sample probably doesn’t include any observations with an age of zero.
- If we use polynomial regression to fit a non-linear relationship, values outside our observed range will be *even more* unreliable.

## Evaluation

### R-squared

The “coefficient of determination”, denoted `$R^2$`, reflects the proportion of variance in our response variable `$Y$` which is explained by the model.

- In simple linear regression, equal to the squared value of the Pearson correlation coefficient `$\rho$`, which itself takes a value in the range `$[-1, 1]$` indicating both the direction and magnitude of the relationship between `$X$` and `$Y$`.
- By squaring `$\rho$`, we get something in the range `$[0, 1]$` (usually) which reflects only the magnitude.

```
$$
R^2 =
\frac{TSS - RSS}{TSS} =
1 - \frac{RSS}{TSS} =
1 - \frac {\sum(y_i - \hat{y}_i)^2} {\sum (y_i - \bar{y})^2 }
$$
```

```
# python code to manually calculate R^2 (assumes numpy imported as np)
r2 = 1 - np.var(resid) / np.var(ys)
```

#### Hazards ☠

- While generally in the range `$[0, 1]$`, it is possible to get a negative `$R^2$` value if the model’s predictions are *worse* than a simple arithmetic mean.
- `$R^2$` increases monotonically as `$p$` increases. We can’t use it to compare models of different complexity, because it will always favour the more complex model.
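A toy numpy demonstration of the negative-`$R^2$` hazard, using deliberately terrible predictions:

```python
import numpy as np

ys = np.array([1.0, 2.0, 3.0, 4.0])
preds = np.array([4.0, 3.0, 2.0, 1.0])  # worse than just predicting the mean

rss = np.sum((ys - preds) ** 2)      # 9 + 1 + 1 + 9 = 20
tss = np.sum((ys - ys.mean()) ** 2)  # 2.25 + 0.25 + 0.25 + 2.25 = 5
r2 = 1 - rss / tss
print(r2)  # -3.0
```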

### Adjusted R-squared

`$R^2_{adj}$` is a tweak to `$R^2$` which places a penalty on model complexity, based on the number of predictors `$k$`. So if we add an additional variable which is not useful to the model, `$R^2_{adj}$` will *decrease*.

- It will always be lower than plain R-squared: `$R^2_{adj} \leq R^2$`
- 👍🏼 Advantage of being very popular and intuitive to explain
- 👎🏼 Not as well rooted in statistical theory as `$C_p$`, AIC, or BIC.

```
$$
R_{adj}^2 = 1 - \frac{RSS}{TSS} \times \frac{n-1}{n-k-1}
$$
```

```
# python code to manually calculate adjusted R^2 (n observations, k predictors)
r2_adj = 1 - ((1 - r2) * (n - 1) / (n - k - 1))
```

## Model assumptions ☠

In practice, linear regression models require us to make a relatively large number of assumptions.

Name | Formula | Explanation |
---|---|---|
Linearity | `$E(\epsilon_i) = 0$` | The residuals should be evenly distributed above/below zero at all predicted values. |
Homoscedasticity | `$\text{Var}(\epsilon_i) = \sigma^2$` | The spread of the residuals should be equal across all fitted values. |
Independence | `$\{ \epsilon_1, \ldots, \epsilon_n \} \sim \text{i.i.d.}$` | Since we cannot verify independence without understanding how the data was collected, we check for uncorrelated errors instead. |
Normality | `$\epsilon_i \sim N(0, \sigma^2)$` | Check the QQ-plot, or do a rough check using a histogram of residuals. |

### What to do if the assumptions do not hold?

#### Homoscedasticity

What can you do if the non-constant variance assumption does not hold when attempting to fit a linear model?

- Transform the response variable (Box-Cox)
- If variance is proportional to expectation, try using a Poisson regression
- Look into Weighted Least Squares (WLS) or Generalized Least Squares (GLS) models

#### Outliers

What can you do in the presence of outliers when attempting to fit a linear model?

Use a robust regression approach, in which we minimize absolute residuals rather than squared residuals, estimating the median rather than the mean.
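An illustration on toy data of why targeting the median helps: a single outlier drags the mean far more than the median.

```python
import numpy as np

ys = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one extreme outlier
print(np.mean(ys))    # 22.0 -- pulled strongly toward the outlier
print(np.median(ys))  # 3.0  -- barely affected
```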

#### Linearity of residuals

What can you do if the linearity assumption does not hold when attempting to fit a linear model?

- If the data suggests a known non-linear relationship → fit that relationship directly
- If the relationship is unknown → try transforming some variables
- Still no luck → try Generalized Additive Models

## Troubleshooting

### Outliers & leverage points

An **outlier** is a data point that is not well fit by the model, as evidenced by an exceptionally large residual.

A **high-leverage** data point is one whose value `$x_i$` is far away from the mean `$\bar{x}$`, and so has the *potential* to “pull” the regression line towards it. Since OLS regression fits a line that will necessarily pass through the center of the data `$(\bar{x}, \bar{y})$`, changing the slope pivots the regression line around that point like a lever on a fulcrum.

An **influential** data point is one which unduly influences any part of a regression analysis, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results. A data point’s net influence is a combination of how unexpected it is (outlier) and how much leverage it has.

☠ An outlier is not necessarily influential if it has low leverage. An outlier right on `$\bar{x}$` will not change the fitted model parameters as much as one farther away.

#### Cook’s distance

A method of detecting influential observations by measuring the change in model parameters caused by leaving out the `$i^{th}$` observation. This is helpful because it is cumbersome to visually check each of the `$p$` plots between predictor and outcome variables when we have many variables.

```
# R code to find potential outliers based on Cook's Distance
cooks_distances = cooks.distance(model)
cooks_distances[cooks_distances > 1]
data = data[-578,]  # drop a specific influential row after inspection
```

Data points with a value of `$D_i > 1$` (or, by a stricter rule of thumb, `$D_i > 4/n$`) warrant closer inspection to determine whether they are outliers and whether they should be removed from the dataset.
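A numpy sketch of the underlying computation on simulated data, using the standard formula `$D_i = \frac{e_i^2}{p\,s^2}\frac{h_{ii}}{(1-h_{ii})^2}$` where `$h_{ii}$` is the leverage (diagonal of the hat matrix); the injected outlier and all variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[0] += 15.0  # inject one gross outlier at index 0

X = np.column_stack([np.ones(n), x])
p = X.shape[1]

# Leverages: diagonal of the hat matrix H = X (X'X)^-1 X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# OLS residuals and estimated error variance
e = y - H @ y
s2 = e @ e / (n - p)

# Cook's distance for each observation
D = (e ** 2 / (p * s2)) * (h / (1 - h) ** 2)
print(np.argmax(D))  # the injected outlier should dominate
```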

### Multicollinearity

Multicollinearity refers to a situation where a sub-group of predictor variables are highly correlated.

- Visual inspection of pair plots can help find *pairs* of correlated predictors, but groups of 3+ may not be obvious.
- A more robust approach is to calculate the VIF.

#### Variance Inflation Factor (VIF)

Expresses how much the variance of a coefficient increases when fit together with other predictors compared to when fit alone.

```
$$
VIF(\hat{\beta}_j) = \frac{1}{1-R^2_{X_j \mid X_{-j}}}
$$
```

where `$R^2_{X_j \mid X_{-j}}$` is the R-squared value from a separate “sub-model” regression of `$X_j$` onto all other predictors `$X_i, i \neq j$`.

- A VIF > 10 is considered a strong indicator of multicollinearity.
- Resolve by either removing one of the related predictors or combining them together before fitting.
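A numpy sketch of the VIF computation described above (variable names and data are made up): regress each predictor on all the others and apply the formula.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Sub-model: regress X_j on all other predictors (plus an intercept)
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out[j] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)
print(vif(np.column_stack([x1, x2, x3])))  # x1 and x2 get large VIFs, x3 stays near 1
```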

### Troubleshooting checklist

## Extending linear regression

### Interaction effects

Basic linear regression relies on the *additive assumption*, which states that the effect of any given predictor `$x_i$` on `$y$` is entirely independent of the other predictors. The values of one predictor cannot impact the coefficients of another.

We can relax this assumption by creating an *interaction term*: a new term whose coefficient `$\beta_3$` is multiplied by the product of `$x_1$` and `$x_2$`.

```
$$
\hat{y} = β_0 +β_1 x_1 +β_2 x_2 +β_3 x_1 x_2 + \epsilon
$$
```

**Hierarchical principle** – If we include an interaction term in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.

### KNN Regression

- A non-parametric approach: predicts the response at a point as the average response of its K nearest training observations.
- Underperforms linear regression when the relationship is truly linear.
- In general, non-parametric approaches underperform relative to linear ones when the ratio of observations per predictor is low.
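A minimal numpy sketch of KNN regression in one dimension (the data and choice of K are made up): predict `$y$` at a query point as the mean response of the K nearest training points.

```python
import numpy as np

def knn_predict(x_train, y_train, x0, k=3):
    """Predict y at x0 as the mean response of the k nearest training points."""
    dist = np.abs(x_train - x0)
    nearest = np.argsort(dist)[:k]
    return y_train[nearest].mean()

x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.1, 1.1, 1.9, 3.2, 3.9])

print(knn_predict(x_train, y_train, 2.1, k=3))  # mean of the 3 nearest responses
```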

### Partial Least Squares (PLS)

- A *supervised* alternative to Principal Components Regression (PCR)
- Identifies features which not only approximate the original features well, but are also related to the response