Why This Matters
Variable selection sits at the heart of building effective linear models. It's the difference between a model that captures true relationships and one that's memorizing noise. When you're tested on this material, you're being evaluated on your understanding of bias-variance tradeoffs, regularization principles, and model parsimony, not just your ability to name techniques.
The techniques below fall into distinct conceptual families: some search through predictor combinations sequentially, others penalize coefficient magnitudes mathematically, and still others transform the predictor space entirely. Focus on why you'd choose one approach over another and what tradeoffs each involves. That comparative understanding is exactly what FRQ prompts target.
Sequential Search Methods
These techniques build or trim models one predictor at a time, using statistical significance as the decision rule. The core mechanism is greedy optimization: making the locally optimal choice at each step without guaranteeing a globally optimal solution.
Forward Selection
- Starts with an empty model and adds predictors sequentially based on which one most improves model fit
- Stops when no remaining predictor provides a statistically significant improvement
- Works well when you expect only a few predictors to matter, but it can miss important variable combinations that only help in tandem
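The greedy mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production routine: the function name is my own, and it uses RSS reduction as the decision rule with a fixed variable budget, whereas textbook forward selection typically stops via a significance test or an information criterion.

```python
import numpy as np

def forward_selection(X, y, max_vars=None):
    """Greedy forward selection: at each step, add the predictor that
    most reduces the residual sum of squares (RSS)."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    max_vars = p if max_vars is None else max_vars
    while remaining and len(selected) < max_vars:
        best_j, best_rss = None, np.inf
        for j in remaining:                      # try each candidate predictor
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)                  # lock in the greedy choice
        remaining.remove(best_j)
    return selected

# Toy data: only columns 0 and 2 actually drive y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=100)
print(forward_selection(X, y, max_vars=2))  # the informative columns, 0 and 2
```

Note the greedy property: once a predictor enters, earlier choices are never revisited, which is exactly why a pair of predictors that only help in tandem can be missed.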
Backward Elimination
- Begins with all predictors included and removes the least significant one at each iteration
- Continues until all remaining predictors meet the significance threshold you've specified
- Works well when most predictors are relevant, but it requires enough observations to fit the full model initially (you need n>p)
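For contrast, here is the mirror-image sketch of backward elimination. As a simplification I use AIC improvement as the stopping rule rather than a per-coefficient significance test; the structure (start full, drop the least useful predictor, repeat) is the same either way.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

def backward_elimination(X, y):
    """Start from the full model and repeatedly drop the predictor whose
    removal hurts fit least, as long as doing so lowers AIC.
    Requires n > p to fit the initial full model."""
    n, p = X.shape
    cols = list(range(p))
    aic = lambda c: n * np.log(rss(X, y, c) / n) + 2 * (len(c) + 1)
    current = aic(cols)
    while len(cols) > 1:
        score, j = min((aic([c for c in cols if c != d]), d) for d in cols)
        if score >= current:   # no single removal improves AIC: stop
            break
        cols.remove(j)
        current = score
    return cols

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 6))
y = 2 * X[:, 1] + X[:, 4] + rng.normal(scale=0.5, size=120)
print(backward_elimination(X, y))  # columns 1 and 4 should survive
```

The `n > p` requirement shows up on the very first line of work: the initial full-model fit is ill-posed when predictors outnumber observations.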
Stepwise Regression
- Combines forward and backward logic, allowing predictors to enter or exit the model at any step
- Offers flexibility for exploratory analysis when you're uncertain which predictors matter most
- Prone to overfitting because of the multiple testing problem: each step involves a significance test, and running many tests inflates the chance of spurious results. Significance thresholds may need adjustment to compensate.
Compare: Forward Selection vs. Backward Elimination: both use significance testing to build models sequentially, but they start from opposite ends. Forward works better with limited data; backward requires n>p to fit the initial full model. If an FRQ asks about computational constraints, this distinction matters.
Regularization-Based Methods
Rather than testing predictors in or out, these methods keep all predictors but penalize coefficient magnitudes to prevent overfitting. The key mechanism is adding a penalty term to the ordinary least squares loss function, controlled by a tuning parameter λ. As λ increases, the penalty gets stronger and coefficients shrink more aggressively.
Lasso (Least Absolute Shrinkage and Selection Operator)
- Uses L1 regularization by adding λ∑|βⱼ| (summing over j = 1, …, p) to the loss function
- Shrinks some coefficients exactly to zero, performing automatic variable selection alongside estimation. This is a direct consequence of the L1 penalty's geometry: the diamond-shaped constraint region has corners that sit on the axes, making zero-valued solutions likely.
- Ideal for high-dimensional data where you suspect many predictors are irrelevant, since the resulting sparse solutions are highly interpretable
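The "exactly zero" behavior falls out of the soft-thresholding update used by coordinate descent, the standard way to fit the Lasso. The sketch below (names and tuning value are my own, no intercept, columns assumed standardized) makes the sparsity visible:

```python
import numpy as np

def soft_threshold(z, t):
    """One-dimensional lasso solution: shrink toward 0, clip at 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Lasso via cyclic coordinate descent (no intercept; assumes the
    columns of X are standardized)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            beta[j] = soft_threshold(X[:, j] @ r, lam) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize columns
y = 4 * X[:, 0] - 3 * X[:, 3] + rng.normal(scale=0.5, size=200)
beta = lasso_cd(X, y, lam=50.0)
print(np.round(beta, 2))  # the six irrelevant coefficients are exactly 0.0
```

The clipping step in `soft_threshold` is the algebraic counterpart of the diamond-shaped constraint region's corners: small correlations with the residual get snapped to exactly zero, not merely shrunk.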
Ridge Regression
- Uses L2 regularization by adding λ∑βⱼ² to the loss function
- Shrinks coefficients toward zero but never exactly to zero: all predictors remain in the model. The circular constraint region of the L2 penalty has no corners, so solutions rarely land exactly at zero.
- Excels when predictors are highly correlated. Multicollinearity inflates coefficient variance, and Ridge counteracts this by stabilizing the estimates through shrinkage, even though no variables are dropped.
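Unlike the Lasso, Ridge has a closed-form solution, which makes the stabilization easy to demonstrate. A minimal sketch (no intercept, assuming standardized predictors; the function name is mine):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam*I)^(-1) X'y
    (no intercept; a sketch assuming X is centered/standardized)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Two nearly collinear predictors: OLS coefficients can explode in
# opposite directions, while ridge keeps both estimates stable.
rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=100)
ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("OLS:  ", np.round(ols, 2))                    # unstable under collinearity
print("Ridge:", np.round(ridge(X, y, lam=10.0), 2))  # both estimates near 1
```

Adding λ to the diagonal of XᵀX is the whole trick: it makes the matrix well conditioned even when the predictors are nearly linearly dependent.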
Elastic Net
- Combines L1 and L2 penalties with two tuning parameters controlling the mix: λ₁∑|βⱼ| + λ₂∑βⱼ²
- Handles correlated predictor groups better than Lasso alone. Lasso tends to arbitrarily select just one predictor from a correlated set and zero out the rest; the Ridge component in Elastic Net encourages correlated predictors to share similar coefficients instead.
- Particularly powerful when p>n. The Ridge component stabilizes estimation (Lasso alone can select at most n variables), while the Lasso component still provides sparsity.
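Both behaviors, sparsity from L1 and grouping from L2, appear in the coordinate-descent update, which only needs one extra term relative to the Lasso. A sketch under my own naming and tuning choices (no intercept):

```python
import numpy as np

def enet_cd(X, y, lam1, lam2, n_sweeps=200):
    """Elastic net via coordinate descent (a sketch, no intercept).
    The L1 part soft-thresholds coefficients (sparsity); the L2 part
    inflates the denominator (shrinkage that spreads weight across
    correlated predictors)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]     # partial residual
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam1, 0.0) / (X[:, j] @ X[:, j] + lam2)
    return beta

rng = np.random.default_rng(4)
z = rng.normal(size=150)
signal = [z + rng.normal(scale=0.1, size=150) for _ in range(3)]  # correlated group
noise = [rng.normal(size=150) for _ in range(3)]
X = np.column_stack(signal + noise)
y = X[:, 0] + X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=150)
print(np.round(enet_cd(X, y, lam1=40.0, lam2=100.0), 2))
# the three correlated signal columns share similar nonzero weights;
# the noise columns drop to zero
```

A pure Lasso on the same data would tend to put all the weight on one member of the correlated trio; the `+ lam2` in the denominator is what encourages the group to share.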
Compare: Lasso vs. Ridge: both shrink coefficients, but only Lasso performs variable selection by zeroing out coefficients. Choose Ridge when all predictors are theoretically important; choose Lasso when you need a sparse, interpretable model. Elastic Net hedges between them.
Information Criteria
These aren't selection algorithms themselves but scoring metrics that guide model comparison. The underlying principle is penalized likelihood: rewarding model fit while punishing complexity.
Akaike Information Criterion (AIC)
- Balances fit and complexity using the formula AIC = 2k − 2ln(L̂), where k is the number of parameters and L̂ is the maximized likelihood
- Lower values indicate better models, but AIC is a relative measure, only meaningful when comparing models fit to the same data
- Tends toward larger models than BIC, making it preferable when prediction accuracy is the primary goal
Bayesian Information Criterion (BIC)
- Imposes a heavier complexity penalty using BIC = k·ln(n) − 2ln(L̂), where n is sample size
- Favors simpler models than AIC, especially as sample size grows. The penalty per parameter is ln(n), which exceeds AIC's fixed penalty of 2 once n > 7 (since ln(8) ≈ 2.08). So for virtually any real dataset, BIC penalizes complexity more heavily.
- Better suited for identifying the "true" generating model when one exists. Use BIC when parsimony matters more than pure predictive performance.
Compare: AIC vs. BIC: both penalize complexity, but BIC's penalty grows with sample size while AIC's stays fixed. For large n, BIC selects simpler models. If an FRQ asks which criterion to use, consider whether the goal is prediction (AIC) or identifying true structure (BIC).
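Both criteria are easy to compute directly for a Gaussian linear model, where the maximized log-likelihood satisfies −2ln(L̂) = n·ln(2π·RSS/n) + n. The sketch below (my own function name; k counts slopes, intercept, and the variance parameter) compares two nested models:

```python
import numpy as np

def gaussian_ic(X, y, cols):
    """AIC and BIC for an OLS fit. For a Gaussian linear model,
    -2*ln(L-hat) = n*ln(2*pi*RSS/n) + n; k counts slopes + intercept
    + the variance parameter."""
    n = len(y)
    A = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss_val = np.sum((y - A @ beta) ** 2)
    neg2ll = n * np.log(2 * np.pi * rss_val / n) + n
    k = len(cols) + 2
    return 2 * k + neg2ll, k * np.log(n) + neg2ll   # (AIC, BIC)

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 3))
y = 1.5 * X[:, 0] + rng.normal(size=500)   # only column 0 matters
aic_s, bic_s = gaussian_ic(X, y, [0])
aic_b, bic_b = gaussian_ic(X, y, [0, 1, 2])
# BIC's per-parameter penalty ln(500) ~ 6.2 should favor the small model;
# AIC's fixed penalty of 2 usually agrees but is less decisive.
print(aic_s < aic_b, bic_s < bic_b)
```

This mirrors the comparison above: the only difference between the two returned values is the penalty multiplier, 2 versus ln(n).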
Exhaustive Search and Dimensionality Reduction
These approaches either search comprehensively through model space or transform predictors to reduce dimensionality directly.
Best Subset Selection
- Evaluates all 2ᵖ possible predictor combinations to find the optimal subset according to a chosen criterion (such as AIC, BIC, or adjusted R²)
- Guarantees the best model for any given subset size, with no greedy shortcuts that might miss optimal combinations
- Computationally prohibitive for large p. With 20 predictors you're fitting about 1 million models; at 30 predictors it's over 1 billion. Practically feasible only up to roughly p = 20–25.
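The exhaustive search is a one-liner conceptually: enumerate every subset and keep the best-scoring one. A sketch (my own names, BIC as the illustrative criterion) that stays feasible only because p = 6 here:

```python
import numpy as np
from itertools import combinations

def best_subset(X, y):
    """Exhaustive search over all 2^p subsets, scored here by BIC.
    Only feasible for small p (p = 6 means just 64 model fits)."""
    n, p = X.shape
    def bic(cols):
        A = np.column_stack([np.ones(n)] + ([X[:, list(cols)]] if cols else []))
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        rss = np.sum((y - A @ beta) ** 2)
        return n * np.log(rss / n) + (len(cols) + 1) * np.log(n)
    all_subsets = (c for k in range(p + 1) for c in combinations(range(p), k))
    return min(all_subsets, key=bic)

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 6))
y = 2 * X[:, 1] - X[:, 5] + rng.normal(scale=0.5, size=200)
print(best_subset(X, y))  # columns 1 and 5 should appear in the winning subset
```

Swap p = 6 for p = 30 and the generator would have to yield over a billion subsets, which is the scaling wall described above.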
Principal Component Analysis (PCA)
PCA doesn't select original predictors. Instead, it transforms correlated predictors into a new set of uncorrelated principal components, each a linear combination of the originals.
- Reduces dimensionality by retaining only the components that explain the most variance. Typically the first few components capture the bulk of the information.
- Addresses multicollinearity effectively since the components are uncorrelated by construction.
- The tradeoff is interpretability: components don't correspond to original variables, so you can't easily say "predictor X matters."
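PCA's two selling points here, variance concentration and uncorrelated components, can both be checked directly with an SVD of the centered data. A sketch with my own function name and a synthetic two-factor dataset:

```python
import numpy as np

def pca(X, n_components):
    """PCA via SVD of the centered data: each component is a linear
    combination of the original predictors, and component scores are
    uncorrelated by construction."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T            # projected data
    var_share = (s ** 2) / np.sum(s ** 2)        # variance explained per component
    return scores, var_share[:n_components]

# Five predictors secretly driven by two latent factors (heavy collinearity)
rng = np.random.default_rng(7)
z = rng.normal(size=(300, 2))
X = z @ rng.normal(size=(2, 5)) + rng.normal(scale=0.05, size=(300, 5))
scores, var_share = pca(X, n_components=2)
print(np.round(var_share.sum(), 3))             # first two components capture nearly all variance
print(abs(np.corrcoef(scores.T)[0, 1]) < 1e-8)  # scores are uncorrelated
```

The interpretability cost is visible in `Vt`: each row mixes all five original predictors, so no component answers "does predictor X matter?" on its own.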
Compare: Best Subset vs. Stepwise Regression: both search for optimal predictor combinations, but best subset is exhaustive while stepwise is greedy. Best subset finds the true optimum but becomes infeasible around p=25; stepwise scales better but may miss the best model.
Quick Reference Table
| Conceptual family | Techniques |
|---|---|
| Greedy sequential search | Forward Selection, Backward Elimination, Stepwise Regression |
| L1 regularization (sparsity) | Lasso, Elastic Net |
| L2 regularization (shrinkage) | Ridge Regression, Elastic Net |
| Penalized model comparison | AIC, BIC |
| Exhaustive search | Best Subset Selection |
| Dimensionality reduction | PCA |
| High-dimensional data (p>n) | Lasso, Elastic Net, PCA |
| Multicollinearity handling | Ridge Regression, PCA, Elastic Net |
Self-Check Questions
- Which two regularization methods can shrink coefficients exactly to zero, and what penalty term enables this behavior?
- Compare Forward Selection and Best Subset Selection: what advantage does each have over the other, and when would computational constraints favor one approach?
- A dataset has 50 observations and 200 predictors. Which variable selection techniques would be appropriate, and which would fail? Explain why.
- You're comparing two nested models and need to choose between AIC and BIC. If your sample size is very large and you prioritize identifying the simplest adequate model, which criterion should you use and why?
- Explain why Ridge Regression handles multicollinearity effectively even though it doesn't remove any predictors from the model. How does this differ from Lasso's approach to correlated predictors?