Fiveable

🤖Statistical Prediction Unit 7 Review


7.1 Ridge Regression: L2 Regularization

Written by the Fiveable Content Team • Last updated August 2025

Ridge regression adds a penalty term to linear regression, shrinking coefficients towards zero. This L2 regularization technique helps prevent overfitting and handles multicollinearity, striking a balance between model complexity and performance.

The regularization parameter λ controls the strength of shrinkage. As λ increases, coefficients are pulled closer to zero. Cross-validation helps find the optimal λ, balancing bias and variance for better generalization.

Ridge Regression Fundamentals

Overview and Key Concepts

  • Ridge regression extends linear regression by adding a penalty term to the ordinary least squares (OLS) objective function
  • L2 regularization refers to the specific type of penalty used in ridge regression, which is the sum of squared coefficients multiplied by the regularization parameter
  • The penalty term in ridge regression is $\lambda \sum_{j=1}^{p} \beta_j^2$, where $\lambda$ is the regularization parameter and $\beta_j$ are the regression coefficients
    • This penalty term is added to the OLS objective function, resulting in the ridge regression objective: $\sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$
  • The regularization parameter $\lambda$ controls the strength of the penalty
    • When $\lambda = 0$, ridge regression reduces to OLS
    • As $\lambda \to \infty$, the coefficients are shrunk towards zero
  • Shrinkage refers to the effect of the penalty term, which shrinks the regression coefficients towards zero compared to OLS
    • This can help prevent overfitting and improve the model's generalization performance
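To see this shrinkage concretely, the sketch below fits ridge regression on synthetic data with scikit-learn, whose `alpha` parameter plays the role of $\lambda$. The data, true coefficients, and grid of $\lambda$ values are illustrative assumptions, not part of the guide:

```python
# Sketch (synthetic data): ridge coefficients shrink toward zero as lambda grows.
# scikit-learn's Ridge calls the regularization parameter `alpha`.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([3.0, -2.0, 1.5, 0.0, 0.5])   # assumed true coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=100)

for lam in [0.0, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=lam).fit(X, y)
    # With lam = 0 this matches OLS; larger lam pulls every coefficient inward
    print(lam, np.round(model.coef_, 3))
```

The L2 norm of the fitted coefficient vector decreases monotonically as `alpha` increases, which is the shrinkage effect described above.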

Geometric Interpretation

  • Ridge regression can be interpreted as a constrained optimization problem
    • The objective is to minimize the residual sum of squares (RSS) subject to a constraint on the L2 norm of the coefficients: $\sum_{j=1}^{p} \beta_j^2 \leq t$, where $t$ is a tuning parameter related to $\lambda$
  • Geometrically, this constraint corresponds to a disk (with two coefficients) or a hypersphere (in higher dimensions) centered at the origin in parameter space
    • The ridge regression solution is the point where the RSS contour lines first touch this constraint region
  • As the constraint becomes tighter (smaller $t$, larger $\lambda$), the solution is pulled further towards the origin, resulting in greater shrinkage of the coefficients

Benefits and Tradeoffs

Bias-Variance Tradeoff

  • Ridge regression can improve a model's performance by reducing its variance at the cost of slightly increasing its bias
    • The penalty term constrains the coefficients, limiting the model's flexibility and thus reducing variance
    • However, this constraint also introduces some bias, as the coefficients are shrunk towards zero and may not match the true underlying values
  • The bias-variance tradeoff is controlled by the regularization parameter $\lambda$
    • Larger $\lambda$ values result in greater shrinkage, lower variance, and higher bias
    • Smaller $\lambda$ values result in less shrinkage, higher variance, and lower bias
  • The optimal $\lambda$ value can be selected using techniques like cross-validation to balance bias and variance and minimize the model's expected test error
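One way to see the tradeoff numerically is to compare held-out error across a grid of $\lambda$ values. The setup below is an assumed example: many predictors relative to observations, so the near-OLS fit (tiny `alpha`) is high-variance, while a very large `alpha` is heavily biased:

```python
# Sketch (synthetic data): held-out test error across a grid of lambda values.
# p = 30 predictors with only 45 training rows makes low-alpha fits high-variance.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 30))
beta = rng.normal(size=30)                 # assumed true coefficients
y = X @ beta + rng.normal(scale=3.0, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

for lam in [0.001, 0.1, 10.0, 1000.0]:
    m = Ridge(alpha=lam).fit(X_tr, y_tr)
    print(lam, round(mean_squared_error(y_te, m.predict(X_te)), 2))
```

Typically the test error traces a U-shape over the grid, with the minimum at an intermediate $\lambda$; the exact location depends on the noise level and the number of predictors.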

Handling Multicollinearity

  • Multicollinearity occurs when predictor variables in a regression model are highly correlated with each other
    • This can lead to unstable and unreliable coefficient estimates in OLS
  • Ridge regression can effectively handle multicollinearity by shrinking the coefficients of correlated predictors towards each other
    • This results in a more stable and interpretable model, as the impact of multicollinearity on the coefficient estimates is reduced
  • When predictors are highly correlated, ridge regression tends to assign similar coefficients to them, reflecting their shared contribution to the response variable
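The tendency to split shared signal between correlated predictors can be sketched with two nearly collinear synthetic features (the data, noise levels, and `alpha` value here are assumed for illustration):

```python
# Sketch (synthetic data): ridge assigns similar coefficients to near-duplicate
# predictors, while OLS coefficients for them can be unstable.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # almost perfectly collinear with x1
X = np.column_stack([x1, x2])
y = 2 * x1 + rng.normal(scale=0.1, size=200)

# OLS coefficients can swing to large offsetting values under collinearity
print(LinearRegression().fit(X, y).coef_)
# Ridge splits the shared effect roughly evenly, near (1, 1)
print(Ridge(alpha=1.0).fit(X, y).coef_)
```

The ridge penalty makes large offsetting coefficient pairs expensive (their squared sum is big), so the stable, evenly split solution wins.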

Model Selection via Cross-Validation

  • Cross-validation is commonly used to select the optimal value of the regularization parameter $\lambda$ in ridge regression
  • The procedure involves:
    1. Splitting the data into $k$ folds
    2. For each $\lambda$ value in a predefined grid:
      • Train ridge regression models on $k-1$ folds and evaluate their performance on the held-out fold
      • Repeat this process $k$ times, using each fold as the validation set once
      • Compute the average performance across the $k$ folds
    3. Select the $\lambda$ value that yields the best average performance
  • This process helps identify the $\lambda$ value that strikes the best balance between bias and variance, optimizing the model's expected performance on new, unseen data
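The steps above are what scikit-learn's `RidgeCV` automates. In this sketch the grid of $\lambda$ values, the fold count, and the synthetic data are all illustrative choices:

```python
# Sketch: selecting lambda by k-fold cross-validation with RidgeCV.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=1.0, size=120)

alphas = np.logspace(-3, 3, 13)            # predefined grid of lambda values
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)  # 5-fold CV over the grid
print(model.alpha_)                        # lambda with best average CV score
```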

Solving Ridge Regression

Closed-Form Solution

  • Ridge regression has a closed-form solution, which can be derived analytically by solving the normal equations with the addition of the penalty term
  • The closed-form solution for ridge regression is given by: $\hat{\beta}^{ridge} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$, where:
    • $\mathbf{X}$ is the $n \times p$ matrix of predictor variables
    • $\mathbf{y}$ is the $n \times 1$ vector of response values
    • $\lambda$ is the regularization parameter
    • $\mathbf{I}$ is the $p \times p$ identity matrix
  • Compared to the OLS solution $\hat{\beta}^{OLS} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$, ridge regression adds the term $\lambda \mathbf{I}$ to the matrix $\mathbf{X}^T\mathbf{X}$ before inversion
    • This addition makes the matrix $\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I}$ invertible even when $\mathbf{X}^T\mathbf{X}$ is not (e.g., in the presence of perfect multicollinearity)
    • The closed-form solution for ridge regression is computationally efficient and numerically stable, even when dealing with high-dimensional data or correlated predictors
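The formula can be checked directly against a library fit. The sketch below computes $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ with NumPy and compares it to scikit-learn's `Ridge` with the intercept disabled, so both solve the same penalized problem; the data are synthetic:

```python
# Sketch: the closed-form ridge solution versus a library fit.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.normal(scale=0.3, size=50)
lam = 2.0

# Solve (X^T X + lam*I) beta = X^T y; np.linalg.solve is more
# numerically stable than forming the inverse explicitly.
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# fit_intercept=False so sklearn minimizes exactly ||y - Xb||^2 + lam*||b||^2
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(beta_closed, beta_sklearn))   # True
```

Note that scikit-learn fits an unpenalized intercept by default, so the comparison is exact only with the intercept disabled (or with centered data).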