7.1 Ridge Regression: L2 Regularization

Written by the Fiveable Content Team • Last updated August 2025

Ridge regression adds a penalty term to linear regression, shrinking coefficients towards zero. This L2 regularization technique helps prevent overfitting and handles multicollinearity, striking a balance between model complexity and performance.

The regularization parameter λ controls the strength of shrinkage. As λ increases, coefficients are pulled closer to zero. Cross-validation helps find the optimal λ, balancing bias and variance for better generalization.

Ridge Regression Fundamentals

Overview and Key Concepts

  • Ridge regression extends linear regression by adding a penalty term to the ordinary least squares (OLS) objective function
  • L2 regularization refers to the specific type of penalty used in ridge regression, which is the sum of squared coefficients multiplied by the regularization parameter
  • The penalty term in ridge regression is $\lambda \sum_{j=1}^{p} \beta_j^2$, where $\lambda$ is the regularization parameter and $\beta_j$ are the regression coefficients
    • This penalty term is added to the OLS objective function, resulting in the ridge regression objective: $\sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$
  • The regularization parameter $\lambda$ controls the strength of the penalty
    • When $\lambda = 0$, ridge regression reduces to OLS
    • As $\lambda \to \infty$, the coefficients are shrunk towards zero
  • Shrinkage refers to the effect of the penalty term, which shrinks the regression coefficients towards zero compared to OLS
    • This can help prevent overfitting and improve the model's generalization performance
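
As a minimal sketch (not code from this guide), the objective above can be written directly in Python with NumPy; the names `X`, `y`, `beta0`, `beta`, and `lam` are illustrative placeholders:

```python
import numpy as np

def ridge_objective(beta0, beta, X, y, lam):
    """Ridge objective: RSS plus the L2 penalty lam * sum(beta_j ** 2).

    The intercept beta0 is left out of the penalty, matching the formula above.
    """
    residuals = y - (beta0 + X @ beta)    # y_i - beta_0 - sum_j beta_j * x_ij
    rss = np.sum(residuals ** 2)          # ordinary least squares loss
    penalty = lam * np.sum(beta ** 2)     # L2 (ridge) penalty
    return rss + penalty
```

With `lam = 0` the function returns the plain OLS residual sum of squares, which is why ridge regression reduces to OLS in that case.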

Geometric Interpretation

  • Ridge regression can be interpreted as a constrained optimization problem
    • The objective is to minimize the RSS (residual sum of squares) subject to a constraint on the L2 norm of the coefficients: $\sum_{j=1}^{p} \beta_j^2 \leq t$, where $t$ is a tuning parameter related to $\lambda$
  • Geometrically, this constraint corresponds to a circular (disk-shaped) region in two dimensions, and more generally a spherical region in the $p$-dimensional parameter space
    • The ridge regression solution is the point where the RSS contour lines first touch this constraint region
  • As the constraint becomes tighter (smaller $t$, larger $\lambda$), the solution is pulled further towards the origin, resulting in greater shrinkage of the coefficients, as the sketch below illustrates
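
One quick way to see this shrinkage numerically is to fit ridge models over a grid of $\lambda$ values and track the size of the coefficient vector. This is a minimal sketch using scikit-learn's `Ridge` (whose `alpha` argument plays the role of $\lambda$) on simulated data; the data and values are illustrative only:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                                      # simulated predictors
y = X @ np.array([3.0, -2.0, 1.5, 0.0, 0.5]) + rng.normal(size=100)

for lam in [0.01, 1.0, 10.0, 100.0, 1000.0]:
    model = Ridge(alpha=lam).fit(X, y)                             # alpha = lambda
    print(f"lambda = {lam:>7}: ||beta||_2 = {np.linalg.norm(model.coef_):.3f}")
```

The printed L2 norm shrinks as $\lambda$ grows, mirroring the tightening constraint region.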

Benefits and Tradeoffs

Bias-Variance Tradeoff

  • Ridge regression can improve a model's performance by reducing its variance at the cost of slightly increasing its bias
    • The penalty term constrains the coefficients, limiting the model's flexibility and thus reducing variance
    • However, this constraint also introduces some bias, as the coefficients are shrunk towards zero and may not match the true underlying values
  • The bias-variance tradeoff is controlled by the regularization parameter $\lambda$
    • Larger $\lambda$ values result in greater shrinkage, lower variance, and higher bias
    • Smaller $\lambda$ values result in less shrinkage, higher variance, and lower bias
  • The optimal $\lambda$ value can be selected using techniques like cross-validation to balance bias and variance and minimize the model's expected test error

Handling Multicollinearity

  • Multicollinearity occurs when predictor variables in a regression model are highly correlated with each other
    • This can lead to unstable and unreliable coefficient estimates in OLS
  • Ridge regression can effectively handle multicollinearity by shrinking the coefficients of correlated predictors towards each other
    • This results in a more stable and interpretable model, as the impact of multicollinearity on the coefficient estimates is reduced
  • When predictors are highly correlated, ridge regression tends to assign similar coefficients to them, reflecting their shared contribution to the response variable
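
As a small illustration (not from the guide), the sketch below simulates two nearly identical predictors and compares OLS with ridge; the variable names and the choice `alpha=10.0` are arbitrary assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)          # x2 is nearly identical to x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 2.0 * x2 + rng.normal(size=n) # both predictors contribute equally

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)      # can differ wildly and flip signs (unstable)
print("Ridge coefficients:", ridge.coef_)    # roughly equal, near (2, 2)
```

The ridge estimates split the shared signal between the two correlated predictors instead of letting one coefficient blow up while the other compensates.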

Model Selection via Cross-Validation

  • Cross-validation is commonly used to select the optimal value of the regularization parameter $\lambda$ in ridge regression
  • The procedure involves:
    1. Splitting the data into $k$ folds
    2. For each $\lambda$ value in a predefined grid:
      • Train ridge regression models on $k-1$ folds and evaluate their performance on the held-out fold
      • Repeat this process $k$ times, using each fold as the validation set once
      • Compute the average performance across the $k$ folds
    3. Select the $\lambda$ value that yields the best average performance
  • This process helps identify the $\lambda$ value that strikes the best balance between bias and variance, optimizing the model's expected performance on new, unseen data
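
One way to carry out this procedure is scikit-learn's `RidgeCV`, which handles the k-fold loop over a $\lambda$ grid internally; the simulated data and the grid below are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 10))                        # simulated predictors
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=150)      # only two predictors matter

lambdas = np.logspace(-3, 3, 25)                      # candidate lambda grid
model = RidgeCV(alphas=lambdas, cv=5).fit(X, y)       # 5-fold cross-validation

print("Selected lambda:", model.alpha_)               # best average CV performance
```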

Solving Ridge Regression

Closed-Form Solution

  • Ridge regression has a closed-form solution, which can be derived analytically by solving the normal equations with the addition of the penalty term
  • The closed-form solution for ridge regression is given by: $\hat{\beta}^{ridge} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$, where:
    • $\mathbf{X}$ is the $n \times p$ matrix of predictor variables
    • $\mathbf{y}$ is the $n \times 1$ vector of response values
    • $\lambda$ is the regularization parameter
    • $\mathbf{I}$ is the $p \times p$ identity matrix
  • Compared to the OLS solution $\hat{\beta}^{OLS} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$, ridge regression adds the term $\lambda \mathbf{I}$ to the matrix $\mathbf{X}^T\mathbf{X}$ before inversion
    • This addition makes the matrix $\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I}$ invertible even when $\mathbf{X}^T\mathbf{X}$ is not (e.g., in the presence of perfect multicollinearity)
    • The closed-form solution for ridge regression is computationally efficient and numerically stable, even when dealing with high-dimensional data or correlated predictors
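
The formula can be checked directly in a few lines. This sketch (with made-up data, and the intercept deliberately omitted for simplicity) solves the ridge normal equations with NumPy and compares the result against scikit-learn:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(size=50)
lam = 5.0

# Closed-form ridge solution: beta_hat = (X^T X + lambda * I)^(-1) X^T y
p = X.shape[1]
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Cross-check against scikit-learn with the intercept disabled
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(beta_closed, beta_sklearn))  # True (up to solver tolerance)
```

Using `np.linalg.solve` rather than explicitly inverting the matrix is the more numerically stable way to evaluate the closed-form expression.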