Ridge Regression for Multicollinearity
Concept and Motivation
Ridge regression is a regularized version of ordinary least squares (OLS) that adds a penalty term to the objective function, specifically to handle multicollinearity. When predictor variables are highly correlated (think age and years of experience in a salary model), OLS coefficient estimates become wildly unstable. Small changes in the data can cause huge swings in the estimated coefficients, even flipping their signs.
Ridge regression fixes this by deliberately introducing a small amount of bias into the coefficient estimates. This shrinks the coefficients toward zero, which dramatically reduces their variance. The result is a model that's more stable and typically predicts better on new data, even though it fits the training data slightly worse.
The core motivation: accept a little bias in exchange for a lot of stability.
Addressing Multicollinearity
Under multicollinearity, the OLS solution involves inverting a matrix ($X^\top X$) that is nearly singular. This near-singularity is what causes the coefficient estimates to blow up or become erratic. Ridge regression adds a positive constant $\lambda$ to the diagonal of $X^\top X$ before inverting, which makes the matrix well-conditioned and the solution stable.
- Ridge is especially useful in high-dimensional settings where the number of predictors is large relative to the number of observations
- Unlike variable selection methods, ridge keeps all predictors in the model but dampens their influence
- The coefficients won't be driven to exactly zero (that's what lasso does), but they'll be pulled toward zero enough to stabilize the estimates
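To make the instability concrete, here is a minimal sketch with synthetic (illustrative) data: two nearly identical predictors make the OLS coefficients erratic, while a small ridge penalty splits the weight stably between them. The data-generating values (noise scale, true coefficient of 3) are assumptions for the demo, not from the text above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)  # only x1 truly matters

# OLS: near-singular X'X lets the two coefficients swing wildly
ols = LinearRegression().fit(X, y)
# Ridge: the penalty forces an almost equal, stable split of the shared signal
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```

Rerunning with a different random seed typically swings the OLS coefficients by several units (sometimes flipping signs), while the ridge pair stays close to (1.5, 1.5).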
Ridge Regression Objective Function

Formulation
The ridge objective function extends the OLS objective by appending an L2 penalty on the coefficients:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$
Breaking this apart:
- The first term is the residual sum of squares (RSS), the standard OLS loss that measures how well the model fits the data
- The second term is the ridge penalty, $\lambda \sum_{j=1}^{p} \beta_j^2$, which penalizes large coefficient values
- $\lambda \ge 0$ is the regularization parameter that controls how much shrinkage you apply
The closed-form solution is:

$$\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$

Notice how $\lambda I$ gets added to $X^\top X$ before inversion. That's exactly what stabilizes the matrix when multicollinearity is present.
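The closed-form solution is short enough to implement directly. This sketch computes $(X^\top X + \lambda I)^{-1} X^\top y$ with NumPy (using `solve` rather than an explicit inverse, which is numerically preferable) and checks it against scikit-learn's `Ridge` with `fit_intercept=False`, which solves the same penalized problem on this data:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

lam = 10.0
p = X.shape[1]
# Closed form: beta = (X'X + lambda*I)^{-1} X'y, via a linear solve
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# sklearn's Ridge without an intercept minimizes the same objective
sk = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print("Closed form matches sklearn:", np.allclose(beta, sk.coef_))
```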
Regularization Term
The L2 penalty (also called Tikhonov regularization) has a few properties worth understanding:
- It's quadratic in the coefficients, so large coefficients get penalized much more heavily than small ones. A coefficient of 4 incurs 16 units of penalty, while a coefficient of 2 incurs only 4.
- It penalizes all coefficients simultaneously, shrinking all of them toward zero rather than eliminating any of them entirely
- The penalty applies only to the slope coefficients $\beta_1, \dots, \beta_p$. The intercept $\beta_0$ is typically not penalized, since shrinking it would just shift all predictions and doesn't help with multicollinearity.
Because the penalty depends on the scale of the predictors, you should standardize your predictor variables before fitting ridge regression. Otherwise, coefficients measured in larger units will be penalized more heavily simply due to their scale.
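The scale issue is easy to demonstrate. In this sketch (synthetic salary-style data, with variable names and scales chosen only for illustration), one predictor lives in the tens of thousands and another in single digits; a `StandardScaler` step before `Ridge` puts both on the same penalty footing:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n = 200
income_usd = rng.normal(50_000, 15_000, n)   # large-scale predictor
tenure_years = rng.normal(5, 2, n)           # small-scale predictor
X = np.column_stack([income_usd, tenure_years])
y = 0.0001 * income_usd + 2.0 * tenure_years + rng.normal(scale=1.0, size=n)

# Standardize first, then penalize: both predictors now have unit variance
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
coefs = model.named_steps["ridge"].coef_
print("Coefficients on the standardized scale:", coefs)
```

On the standardized scale the coefficients are directly comparable: here tenure contributes more per standard deviation than income, which the raw-scale coefficients (0.0001 vs. 2.0) obscure.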
Regularization Parameter and Bias-Variance Trade-off

Role of the Regularization Parameter
The parameter $\lambda$ is the single dial that controls everything in ridge regression:
- $\lambda = 0$: No penalty at all. Ridge reduces to OLS, and you get the same unstable estimates you started with.
- Small $\lambda$: Gentle shrinkage. Coefficients are pulled slightly toward zero, stabilizing the estimates without much bias.
- Large $\lambda$: Aggressive shrinkage. All coefficients are pushed close to zero, producing a very simple (nearly flat) model.
- $\lambda \to \infty$: All slope coefficients converge to zero. The model predicts the mean of $y$ for every observation.
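A quick sketch makes the dial visible: sweeping $\lambda$ over several orders of magnitude on synthetic data, the L2 norm of the coefficient vector shrinks monotonically toward zero (the grid values and data here are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(7)
X = rng.normal(size=(80, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=80)

norms = {}
for lam in [0.01, 1.0, 100.0, 1e6]:
    coef = Ridge(alpha=lam).fit(X, y).coef_
    norms[lam] = float(np.linalg.norm(coef))
    print(f"lambda={lam:>9}: ||beta|| = {norms[lam]:.4f}")
```

At $\lambda = 0.01$ the fit is essentially OLS; at $\lambda = 10^6$ the slopes are all but gone and the model predicts roughly the mean of $y$.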
Bias-Variance Trade-off
The bias-variance trade-off describes the tension between two sources of prediction error:
- Bias is the error from oversimplifying the model. A high-bias model systematically misses the true relationship.
- Variance is the error from sensitivity to the training data. A high-variance model fits noise and changes drastically across different samples.
In ridge regression, increasing $\lambda$ shifts you along this trade-off:

| $\lambda$ | Bias | Variance | Risk |
|---|---|---|---|
| Too small | Low | High | Overfitting / unstable coefficients |
| Optimal | Moderate | Moderate | Best prediction on new data |
| Too large | High | Low | Underfitting / coefficients near zero |
The optimal $\lambda$ minimizes total prediction error (bias² + variance). The standard approach for finding it is k-fold cross-validation:
- Choose a grid of candidate $\lambda$ values (often on a log scale, e.g., $10^{-4}$ to $10^{4}$)
- For each $\lambda$, fit ridge regression on $k-1$ folds and evaluate prediction error on the held-out fold
- Average the prediction error across all folds for each $\lambda$
- Select the $\lambda$ that yields the lowest average cross-validated error
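The steps above map almost one-to-one onto scikit-learn's `RidgeCV`, which runs the grid search and cross-validation internally. This is a minimal sketch on synthetic data; the grid bounds and fold count are common choices, not prescriptions:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=120)

# Step 1: grid of candidate lambdas on a log scale
alphas = np.logspace(-4, 4, 50)

# Steps 2-4: 5-fold CV over the grid, then refit at the best lambda
model = RidgeCV(alphas=alphas, cv=KFold(n_splits=5, shuffle=True, random_state=0))
model.fit(X, y)
print("Selected lambda:", model.alpha_)
```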
Ridge Regression Implementation and Interpretation
Implementation in Statistical Software
Ridge regression is available in most major platforms:
- Python: `sklearn.linear_model.Ridge` or `sklearn.linear_model.RidgeCV` (with built-in cross-validation)
- R: `glmnet(x, y, alpha = 0)` from the `glmnet` package (setting `alpha = 0` specifies ridge)
- MATLAB: the `ridge` function in the Statistics Toolbox
A typical implementation workflow:
- Standardize all predictor variables (mean 0, standard deviation 1)
- Define a grid of $\lambda$ values to search over
- Run cross-validation to identify the optimal $\lambda$
- Fit the final model using the selected $\lambda$ on the full training set
- Evaluate using metrics like mean squared error (MSE) or $R^2$ on a test set
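The whole workflow fits in a few lines with a scikit-learn pipeline. This sketch uses synthetic data and an assumed $\lambda$ grid; the pipeline guarantees the scaler is fit only on the training folds, avoiding leakage into the cross-validation:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 6))
y = X @ np.array([1.5, -0.5, 0.0, 2.0, 0.0, -1.0]) + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(
    StandardScaler(),                        # step 1: standardize
    RidgeCV(alphas=np.logspace(-4, 4, 30)),  # steps 2-4: grid, CV, refit
).fit(X_train, y_train)

mse = mean_squared_error(y_test, pipe.predict(X_test))  # step 5: evaluate
print(f"Test MSE: {mse:.3f}")
```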
The optimization is typically solved via the closed-form solution or iterative methods like coordinate descent, depending on the software.
Interpreting the Results
Interpreting ridge coefficients requires some care because of the shrinkage effect:
- Magnitude still reflects the strength of each predictor's relationship with the response, but all coefficients are biased toward zero. A ridge coefficient of 0.3 doesn't mean the same thing as an OLS coefficient of 0.3.
- Sign (positive or negative) still indicates the direction of the relationship and is generally reliable unless multicollinearity is extreme.
- Comparing coefficients across predictors is valid only if you standardized the predictors first. With standardized inputs, larger absolute coefficients indicate more important predictors.
- The ridge coefficient vector is always smaller in norm than the OLS coefficient vector (an individual coefficient can occasionally grow, but the overall shrinkage is guaranteed). This is by design, not a flaw.
When reporting results, it's useful to compare ridge against OLS on both training and test performance. If ridge substantially outperforms OLS on test data while showing smaller coefficients, that's a clear sign multicollinearity was inflating the OLS estimates. You can also compare against lasso regression (L1 penalty), which performs variable selection by driving some coefficients to exactly zero, unlike ridge which retains all predictors.
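The ridge-versus-lasso contrast is easy to see directly. In this sketch (synthetic data where only 3 of 10 predictors matter; the penalty strengths are illustrative assumptions), lasso zeroes out most irrelevant coefficients while ridge keeps every predictor with a small nonzero weight:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(9)
X = rng.normal(size=(100, 10))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * 7)  # 7 predictors are pure noise
y = X @ beta_true + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks, never zeroes
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: performs selection

print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
```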