🥖Linear Modeling Theory Unit 16 Review

16.4 Lasso and Elastic Net Regularization

Written by the Fiveable Content Team • Last updated August 2025

Lasso Regularization for Variable Selection

Lasso Penalty and Coefficient Shrinkage

Lasso stands for Least Absolute Shrinkage and Selection Operator. It's a regularization technique that does two things at once: it shrinks coefficients and performs variable selection. That dual capability is what sets it apart from ridge regression.

Lasso works by adding a penalty term to the ordinary least squares (OLS) objective function. The full Lasso objective looks like this:

\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}

The penalty is the sum of the absolute values of the coefficients, scaled by a tuning parameter λ.

  • As λ increases, more coefficients get shrunk toward zero. At high enough values, some coefficients hit exactly zero, which removes those variables from the model entirely.
  • When λ = 0, you get the standard OLS solution with no regularization.
  • The optimal λ is typically chosen through cross-validation (k-fold or leave-one-out).

One practical detail that's easy to overlook: Lasso is not invariant to the scale of your predictors. You need to standardize your variables before fitting a Lasso model. Otherwise, the penalty depends on each predictor's units: a variable measured on a large scale has a numerically small coefficient that the penalty barely touches, while a variable on a small scale has a large coefficient that gets shrunk heavily, which distorts the results.
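Here's a minimal sketch of that step in Python with scikit-learn (the data, seed, and penalty strength are invented for illustration): a Pipeline applies StandardScaler before the Lasso fit, so every predictor is penalized on a common scale.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 0] *= 1000.0                 # one predictor on a much larger scale
y = 2.0 * (X[:, 0] / 1000.0) + X[:, 1] + rng.normal(scale=0.1, size=100)

# StandardScaler rescales each column to mean 0, sd 1 before the L1
# penalty is applied, so units no longer influence the shrinkage.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X, y)
print(model.named_steps["lasso"].coef_)
```

Bundling the scaler into the pipeline also keeps cross-validation honest: the scaling is re-estimated inside each training fold rather than leaking information from held-out data.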

Sparse Models and Variable Selection

The defining feature of Lasso is that it produces sparse models, meaning some coefficients are set to exactly zero. This effectively removes those variables from the model.

  • This is especially useful in high-dimensional settings where the number of predictors exceeds the number of observations (p > n), or when you want a simpler, more interpretable model.
  • By dropping irrelevant or redundant variables, Lasso reduces overfitting and helps the model generalize better to new data. You're less likely to make predictions based on noise or spurious correlations.
  • Sparsity also serves as a form of feature selection and dimensionality reduction. If the true data-generating process only involves a handful of variables, Lasso is well-suited to recover that structure.
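As a hedged illustration of that last point, the sketch below fits LassoCV to synthetic data with more predictors than observations and only three truly active coefficients (the positions and values are made up for the example):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(42)
n, p = 50, 100                        # more predictors than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[[3, 17, 42]] = [2.0, -1.5, 1.0]  # only three truly active predictors
y = X @ beta + rng.normal(scale=0.5, size=n)

fit = LassoCV(cv=5).fit(X, y)         # lambda chosen by 5-fold CV
selected = np.flatnonzero(fit.coef_)
print(f"{selected.size} of {p} coefficients are non-zero")
```

On data like this, the coefficients for the truly active predictors survive the shrinkage while the vast majority of the noise coefficients are set exactly to zero.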

Lasso vs Ridge Regression

Regularization Penalties

Both Lasso and ridge regression address multicollinearity and improve model stability, but they do so with different penalty terms added to the OLS objective:

  • Lasso (L1 penalty): \lambda \sum_{j=1}^{p} |\beta_j|
  • Ridge (L2 penalty): \lambda \sum_{j=1}^{p} \beta_j^2

The geometric intuition helps here. The L1 penalty creates a diamond-shaped constraint region in coefficient space, which has corners along the axes. The optimization solution is more likely to land on one of those corners, where one or more coefficients equal zero. The L2 penalty creates a circular constraint region with no corners, so coefficients get shrunk toward zero but rarely reach it exactly.

Variable Selection and Coefficient Shrinkage

This difference in geometry leads to different behavior:

  • Lasso sets some coefficients exactly to zero, producing sparse models with only a subset of predictors retained.
  • Ridge shrinks all coefficients toward zero but keeps every variable in the model. No coefficient is eliminated entirely.

There's an important limitation to know about. When predictors are highly correlated with each other, Lasso tends to pick just one variable from the correlated group and zero out the rest. Which variable it picks can be somewhat arbitrary. Ridge regression, by contrast, shrinks the coefficients of correlated predictors toward each other, distributing the effect more evenly across the group.

This behavior matters for your choice of method: if you care about identifying which variables matter and your predictors aren't too correlated, Lasso is a strong choice. If you have groups of correlated predictors and want stable coefficient estimates across all of them, ridge may be more appropriate.
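This contrast can be sketched on synthetic data with two nearly identical predictors (the penalty strengths here are arbitrary choices for illustration, not recommendations):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("Lasso coefficients:", lasso.coef_)  # weight tends to pile onto one predictor
print("Ridge coefficients:", ridge.coef_)  # weight is spread across both
```

The Lasso fit concentrates almost all of the weight on one of the twins, while ridge splits the effect roughly evenly between them.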

Elastic Net Regularization

Combining Lasso and Ridge Penalties

Elastic net addresses the limitations of both Lasso and ridge by combining their penalties into a single objective:

\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \lambda \left[ \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right] \right\}

Two tuning parameters control the penalty:

  • α (mixing proportion): determines the balance between L1 and L2 penalties. When α = 1, you get pure Lasso. When α = 0, you get pure ridge. Values between 0 and 1 blend both.
  • λ (regularization strength): controls the overall magnitude of the penalty, just as in Lasso and ridge.

Both α and λ are typically selected through cross-validation, often by searching over a grid of candidate values.
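A minimal sketch of that grid search in scikit-learn, on synthetic data: ElasticNetCV takes a list of candidate mixing proportions (its l1_ratio parameter plays the role of α above, and its alpha attribute plays the role of λ).

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=120)

# Grid of candidate mixing proportions; for each one, a path of
# regularization strengths is generated and scored by 5-fold CV.
fit = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)
print("best mixing proportion (alpha in the text):", fit.l1_ratio_)
print("best regularization strength (lambda):     ", fit.alpha_)
```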

Handling Correlated Predictors

The key advantage of elastic net is the grouping effect: when predictors are strongly correlated, elastic net tends to include or exclude them together rather than arbitrarily picking one from the group.

  • This directly solves Lasso's weakness with correlated predictors. Instead of selecting a single variable from a correlated cluster, elastic net can retain the whole group.
  • At the same time, the L1 component still allows elastic net to set some coefficients to exactly zero, preserving the variable selection capability that ridge regression lacks.
  • By adjusting α, you can tune the trade-off between Lasso's sparsity and ridge's stability. A value like α = 0.5 gives equal weight to both penalties, but the best choice depends on your data.
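The grouping effect can be sketched on synthetic data with a cluster of three nearly identical predictors (the penalty settings are illustrative choices, not recommendations):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(3)
n = 300
z = rng.normal(size=n)                                   # shared latent factor
group = np.column_stack(
    [z + rng.normal(scale=0.001, size=n) for _ in range(3)]
)
noise_cols = rng.normal(size=(n, 3))                     # unrelated predictors
X = np.hstack([group, noise_cols])
y = group.sum(axis=1) + rng.normal(scale=0.1, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.2).fit(X, y)
print("Lasso non-zeros in the group:      ", np.count_nonzero(lasso.coef_[:3]))
print("Elastic net non-zeros in the group:", np.count_nonzero(enet.coef_[:3]))
```

On data like this, Lasso tends to keep only one member of the correlated cluster, while the L2 component of elastic net keeps all three in the model with similar coefficients.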

Applying Lasso and Elastic Net Techniques

Using Statistical Software

In practice, you'll fit these models using software libraries that handle the optimization efficiently.

In R, the glmnet package is the standard tool:

  1. Call glmnet(x, y, family = "gaussian", alpha = 1) for Lasso, alpha = 0 for ridge, or any value between 0 and 1 for elastic net.
  2. Use cv.glmnet(x, y, alpha = ...) to perform cross-validation and find the optimal λ.
  3. Extract the best λ with $lambda.min (lowest CV error) or $lambda.1se (most regularized model within one standard error of the minimum).

In Python (scikit-learn), the classes are Lasso, Ridge, and ElasticNet:

  • The alpha parameter in scikit-learn controls regularization strength (this corresponds to λ, not the mixing proportion).
  • For ElasticNet, the l1_ratio parameter controls the mixing proportion (this corresponds to α in the formulation above). Be careful not to confuse the two naming conventions.
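A short sketch of the mapping on synthetic data: scikit-learn's alpha is held fixed (playing the role of λ above) while l1_ratio (playing the role of α above) moves between the pure penalties.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 4))
y = X[:, 0] + rng.normal(scale=0.2, size=80)

# Overall strength (lambda in the formulation above) is fixed at 0.05;
# only the mixing proportion (alpha in the formulation above) changes.
pure_lasso = ElasticNet(alpha=0.05, l1_ratio=1.0).fit(X, y)  # pure L1
blend = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X, y)       # half L1, half L2
# (For a pure L2 penalty, l1_ratio=0.0 runs but scikit-learn recommends Ridge.)

# l1_ratio=1.0 reproduces a standalone Lasso at the same strength.
standalone = Lasso(alpha=0.05).fit(X, y)
print(np.allclose(pure_lasso.coef_, standalone.coef_))
```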

Interpreting the Results

Once you've fit a regularized model, interpretation involves a few steps:

  1. Examine the regularization path. This plot shows how each coefficient changes as λ varies. Variables whose coefficients remain non-zero at the chosen λ are the ones selected by the model.
  2. Identify the optimal λ. Cross-validation metrics like mean squared error (MSE) or mean absolute error (MAE) guide this choice. The value of λ that minimizes the CV error is a common default.
  3. Inspect the selected variables and their coefficients. These tell you which predictors matter most and the direction and magnitude of their effects on the response.
  4. Evaluate generalization performance. Always assess your model on a held-out test set or through cross-validation. A model that looks good on training data may still be overfit.
  5. Compare against OLS. Fit an unregularized model as a baseline. If the regularized model achieves similar or better predictive accuracy with fewer variables, that's a clear win for interpretability and robustness.
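The OLS comparison in step 5 can be sketched with cross-validated MSE on synthetic data that contains many irrelevant predictors (the dimensions and noise level are invented for the example):

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(10)
n, p = 100, 40
X = rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=n)  # only 2 real signals

# Scores are negated because scikit-learn treats higher as better.
ols_mse = -cross_val_score(LinearRegression(), X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
lasso_mse = -cross_val_score(LassoCV(cv=5), X, y, cv=5,
                             scoring="neg_mean_squared_error").mean()
print(f"OLS CV MSE:   {ols_mse:.3f}")
print(f"Lasso CV MSE: {lasso_mse:.3f}")
```

With 38 of the 40 predictors contributing nothing but noise, the unregularized fit overfits and the regularized model typically achieves a lower cross-validated error while using far fewer variables.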