
16.3 Ridge Regression: Concept and Implementation


Written by the Fiveable Content Team • Last updated August 2025

Ridge Regression for Multicollinearity

Concept and Motivation

Ridge regression is a regularized version of ordinary least squares (OLS) that adds a penalty term to the objective function, specifically to handle multicollinearity. When predictor variables are highly correlated (think age and years of experience in a salary model), OLS coefficient estimates become wildly unstable. Small changes in the data can cause huge swings in the estimated coefficients, even flipping their signs.

Ridge regression fixes this by deliberately introducing a small amount of bias into the coefficient estimates. This shrinks the coefficients toward zero, which dramatically reduces their variance. The result is a model that's more stable and typically predicts better on new data, even though it fits the training data slightly worse.

The core motivation: accept a little bias in exchange for a lot of stability.

Addressing Multicollinearity

Under multicollinearity, the OLS solution involves inverting a matrix ($X^TX$) that is nearly singular. This near-singularity is what causes the coefficient estimates to blow up or become erratic. Ridge regression adds a positive constant to the diagonal of $X^TX$ before inverting, which makes the matrix well-conditioned and the solution stable (a numerical sketch follows the list below).

  • Ridge is especially useful in high-dimensional settings where the number of predictors $p$ is large relative to the number of observations $n$
  • Unlike variable selection methods, ridge keeps all predictors in the model but dampens their influence proportionally
  • The coefficients won't be driven to exactly zero (that's what lasso does), but they'll be pulled toward zero enough to stabilize the estimates
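
To see the conditioning problem numerically, here is a minimal NumPy sketch (the simulated predictors and the choice of penalty are purely illustrative, not from the text). It compares the condition number of $X^TX$ with and without the ridge constant on the diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear predictors (think age and years of experience)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # almost an exact copy of x1
X = np.column_stack([x1, x2])

XtX = X.T @ X
lam = 1.0  # illustrative ridge constant

# A huge condition number means X^T X is nearly singular and OLS is unstable;
# adding lambda to the diagonal brings it down dramatically.
print(np.linalg.cond(XtX))
print(np.linalg.cond(XtX + lam * np.eye(2)))
```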

Ridge Regression Objective Function


Formulation

The ridge objective function extends the OLS objective by appending an L2 penalty on the coefficients:

$$\hat{\beta}^{ridge} = \arg\min_{\beta} \left( \sum_{i=1}^{n}(y_i - x_i^T\beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right)$$

Breaking this apart:

  • The first term is the residual sum of squares (RSS), the standard OLS loss that measures how well the model fits the data
  • The second term is the ridge penalty, $\lambda \sum_{j=1}^{p} \beta_j^2$, which penalizes large coefficient values
  • $\lambda \geq 0$ is the regularization parameter that controls how much shrinkage you apply
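
To make the two pieces concrete, here is a small sketch of the objective written directly in NumPy (a hypothetical helper, assuming the intercept has already been handled by centering so that only slope coefficients appear in `beta`):

```python
import numpy as np

def ridge_objective(beta, X, y, lam):
    """Ridge loss: residual sum of squares plus the L2 penalty."""
    rss = np.sum((y - X @ beta) ** 2)      # first term: the ordinary OLS loss
    penalty = lam * np.sum(beta ** 2)      # second term: lambda times the sum of squared slopes
    return rss + penalty
```

Setting `lam = 0` recovers the OLS objective exactly.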

The closed-form solution is:

$$\hat{\beta}^{ridge} = (X^TX + \lambda I)^{-1} X^Ty$$

Notice how $\lambda I$ gets added to $X^TX$ before inversion. That's exactly what stabilizes the matrix when multicollinearity is present.
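
A direct NumPy translation of the closed form might look like the sketch below (illustrative only; it assumes the predictors have been standardized and the response centered, so no intercept column is needed):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lambda * I) beta = X^T y for the ridge coefficients."""
    p = X.shape[1]
    A = X.T @ X + lam * np.eye(p)          # the lambda*I term that stabilizes the matrix
    return np.linalg.solve(A, X.T @ y)     # solve() avoids forming an explicit inverse
```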

Regularization Term

The L2 penalty (also called Tikhonov regularization) has a few properties worth understanding:

  • It's quadratic in the coefficients, so large coefficients get penalized much more heavily than small ones. A coefficient of 4 incurs 16 units of penalty, while a coefficient of 2 incurs only 4.
  • It penalizes all coefficients simultaneously, shrinking them proportionally rather than eliminating any of them entirely
  • The penalty applies only to the slope coefficients $\beta_1, \dots, \beta_p$. The intercept $\beta_0$ is typically not penalized, since shrinking it would just shift all predictions and doesn't help with multicollinearity.

Because the penalty depends on the numeric scale of the predictors, you should standardize your predictor variables before fitting ridge regression. Otherwise, a predictor measured on a small numeric scale (which needs a numerically large coefficient) gets shrunk far more aggressively than one measured on a large scale, purely as an artifact of units.
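
The sketch below (illustrative data, not from the text) shows why scaling matters: the same predictor expressed in a "larger" unit takes tiny numeric values, needs a numerically huge coefficient, and is therefore shrunk far more by the same penalty. In scikit-learn the penalty weight is the `alpha` argument:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
x = rng.normal(size=200)                 # predictor in "meters" (illustrative)
y = 2 * x + rng.normal(scale=0.5, size=200)

X_meters = x.reshape(-1, 1)
X_km = (x / 1000).reshape(-1, 1)         # same predictor rescaled to "kilometers"

print(Ridge(alpha=10.0).fit(X_meters, y).coef_)  # close to the true slope of 2
print(Ridge(alpha=10.0).fit(X_km, y).coef_)      # true slope is ~2000, but the same penalty crushes it toward zero
```

Standardizing the predictors first (for example with `StandardScaler`, as in the workflow sketch later in this section) removes this artifact.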

Regularization Parameter and Bias-Variance Trade-off


Role of the Regularization Parameter

The parameter $\lambda$ is the single dial that controls everything in ridge regression (a short sketch after the list illustrates the extremes):

  • $\lambda = 0$: No penalty at all. Ridge reduces to OLS, and you get the same unstable estimates you started with.
  • Small $\lambda$: Gentle shrinkage. Coefficients are pulled slightly toward zero, stabilizing the estimates without much bias.
  • Large $\lambda$: Aggressive shrinkage. All coefficients are pushed close to zero, producing a very simple (nearly flat) model.
  • $\lambda \to \infty$: All slope coefficients converge to zero. The model predicts the mean of $y$ for every observation.
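
A minimal sketch of this behavior (illustrative simulated data; in scikit-learn the parameter named `alpha` plays the role of $\lambda$):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# alpha near 0 essentially reproduces OLS; a huge alpha drives all slopes toward zero.
for alpha in [1e-8, 1.0, 100.0, 1e6]:
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:g}: {np.round(coefs, 3)}")
```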

Bias-Variance Trade-off

The bias-variance trade-off describes the tension between two sources of prediction error:

  • Bias is the error from oversimplifying the model. A high-bias model systematically misses the true relationship.
  • Variance is the error from sensitivity to the training data. A high-variance model fits noise and changes drastically across different samples.

In ridge regression, increasing $\lambda$ shifts you along this trade-off, summarized in the table below and illustrated by the simulation that follows it:

| $\lambda$ | Bias | Variance | Risk |
|-----------|------|----------|------|
| Too small | Low | High | Overfitting / unstable coefficients |
| Optimal | Moderate | Moderate | Best prediction on new data |
| Too large | High | Low | Underfitting / coefficients near zero |
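
A small simulation makes the trade-off visible (entirely illustrative: the collinear design, sample size, and penalty value are assumptions, not from the text). Across repeated samples, the ridge estimates are biased away from the true coefficients but vary far less than the OLS estimates:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(2)
true_beta = np.array([1.0, 1.0])
ols_coefs, ridge_coefs = [], []

for _ in range(500):
    x1 = rng.normal(size=50)
    x2 = x1 + rng.normal(scale=0.05, size=50)        # strong collinearity
    X = np.column_stack([x1, x2])
    y = X @ true_beta + rng.normal(size=50)
    ols_coefs.append(LinearRegression().fit(X, y).coef_)
    ridge_coefs.append(Ridge(alpha=10.0).fit(X, y).coef_)

ols_coefs, ridge_coefs = np.array(ols_coefs), np.array(ridge_coefs)
print("OLS   mean:", ols_coefs.mean(axis=0), " std:", ols_coefs.std(axis=0))      # roughly unbiased, large spread
print("Ridge mean:", ridge_coefs.mean(axis=0), " std:", ridge_coefs.std(axis=0))  # biased toward zero, small spread
```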

The optimal $\lambda$ minimizes total prediction error (bias² + variance). The standard approach for finding it is k-fold cross-validation, sketched in code after the list:

  1. Choose a grid of candidate $\lambda$ values (often on a log scale, e.g., $10^{-4}$ to $10^{4}$)
  2. For each $\lambda$, fit ridge regression on $k-1$ folds and evaluate prediction error on the held-out fold
  3. Average the prediction error across all $k$ folds for each $\lambda$
  4. Select the $\lambda$ that yields the lowest average cross-validated error
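
A sketch of this procedure with scikit-learn tools (illustrative; `X` and `y` are assumed to be an already standardized predictor matrix and response vector):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# X, y: standardized predictors and response (assumed to exist)
alphas = np.logspace(-4, 4, 50)                 # step 1: candidate lambdas on a log scale
cv = KFold(n_splits=10, shuffle=True, random_state=0)

mean_mse = []
for alpha in alphas:                            # steps 2-3: fit on k-1 folds, score on the held-out fold
    scores = cross_val_score(Ridge(alpha=alpha), X, y,
                             scoring="neg_mean_squared_error", cv=cv)
    mean_mse.append(-scores.mean())

best_alpha = alphas[np.argmin(mean_mse)]        # step 4: lowest average cross-validated error
print(best_alpha)
```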

Ridge Regression Implementation and Interpretation

Implementation in Statistical Software

Ridge regression is available in most major platforms:

  • Python: sklearn.linear_model.Ridge or sklearn.linear_model.RidgeCV (with built-in cross-validation)
  • R: glmnet(x, y, alpha = 0) from the glmnet package (setting alpha = 0 specifies ridge)
  • MATLAB: the ridge function in the Statistics Toolbox

A typical implementation workflow:

  1. Standardize all predictor variables (mean 0, standard deviation 1)
  2. Define a grid of $\lambda$ values to search over
  3. Run cross-validation to identify the optimal $\lambda$
  4. Fit the final model using the selected $\lambda$ on the full training set
  5. Evaluate using metrics like mean squared error (MSE) or $R^2$ on a test set

The optimization is typically solved via the closed-form solution or iterative methods like coordinate descent, depending on the software.
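
Putting the workflow together, a minimal end-to-end sketch with scikit-learn might look like this (illustrative: `X` and `y` stand for your predictor matrix and response vector, and the grid of alphas is an assumption):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# X, y: predictor matrix and response vector (assumed to exist)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = make_pipeline(
    StandardScaler(),                                  # step 1: standardize predictors
    RidgeCV(alphas=np.logspace(-4, 4, 50), cv=10),     # steps 2-4: grid search by CV, refit at the best alpha
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("selected lambda:", model.named_steps["ridgecv"].alpha_)
print("test MSE:", mean_squared_error(y_test, y_pred))
print("test R^2:", r2_score(y_test, y_pred))
```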

Interpreting the Results

Interpreting ridge coefficients requires some care because of the shrinkage effect:

  • Magnitude still reflects the strength of each predictor's relationship with the response, but all coefficients are biased toward zero. A ridge coefficient of 0.3 doesn't mean the same thing as an OLS coefficient of 0.3.
  • Sign (positive or negative) still indicates the direction of the relationship and is generally reliable unless multicollinearity is extreme.
  • Comparing coefficients across predictors is valid only if you standardized the predictors first. With standardized inputs, larger absolute coefficients indicate more important predictors.
  • Ridge coefficients will always be smaller in absolute value than OLS coefficients. This is by design, not a flaw.

When reporting results, it's useful to compare ridge against OLS on both training and test performance. If ridge substantially outperforms OLS on test data while showing smaller coefficients, that's a clear sign multicollinearity was inflating the OLS estimates. You can also compare against lasso regression (L1 penalty), which performs variable selection by driving some coefficients to exactly zero, unlike ridge which retains all predictors.