Variable selection sits at the heart of building effective linear models—it's the difference between a model that captures true relationships and one that's memorizing noise. When you're tested on this material, you're not just being asked to name techniques; you're being evaluated on your understanding of bias-variance tradeoffs, regularization principles, and model parsimony. Every selection method makes different assumptions about which predictors matter and how to balance model complexity against predictive power.
The techniques below fall into distinct conceptual families: some search through predictor combinations sequentially, others penalize coefficient magnitudes mathematically, and still others transform the predictor space entirely. Don't just memorize which method does what—know why you'd choose one approach over another and what tradeoffs each involves. That comparative understanding is exactly what FRQ prompts target.
These techniques build or trim models one predictor at a time, using statistical significance as the decision rule. The core mechanism is greedy optimization—making locally optimal choices at each step without guaranteeing a globally optimal solution.
Compare: Forward Selection vs. Backward Elimination—both use significance testing to build models sequentially, but they start from opposite ends. Forward works better with limited data; backward requires enough observations (n > p) to fit the initial full model. If an FRQ asks about computational constraints, this distinction matters.
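Here is a minimal sketch of the forward variant, assuming statsmodels, a pandas DataFrame `X` of candidate predictors, and a response `y`. The function name `forward_select` and the 0.05 entry threshold are illustrative choices, not fixed conventions.

```python
import pandas as pd
import statsmodels.api as sm

def forward_select(X: pd.DataFrame, y, alpha_enter: float = 0.05):
    """Greedy forward selection using p-values as the entry rule."""
    selected = []                    # predictors currently in the model
    remaining = list(X.columns)      # candidates not yet added
    while remaining:
        pvals = {}
        for candidate in remaining:
            design = sm.add_constant(X[selected + [candidate]])
            fit = sm.OLS(y, design).fit()
            pvals[candidate] = fit.pvalues[candidate]
        best = min(pvals, key=pvals.get)   # locally optimal (greedy) choice
        if pvals[best] < alpha_enter:
            selected.append(best)
            remaining.remove(best)
        else:
            break                          # no remaining predictor is significant
    return selected
```

Backward elimination is the mirror image: start from the full model and repeatedly drop the least significant predictor until everything remaining clears the threshold.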
Rather than testing predictors in or out, these methods keep all predictors but penalize coefficient magnitudes to prevent overfitting. The key mechanism is adding a penalty term to the loss function, controlled by a tuning parameter λ.
Compare: Lasso vs. Ridge—both shrink coefficients, but only Lasso performs variable selection by zeroing out coefficients. Choose Ridge when all predictors are theoretically important; choose Lasso when you need a sparse, interpretable model. Elastic Net hedges between them.
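A minimal sketch of the contrast, assuming scikit-learn; the simulated data and the penalty strengths (`alpha`, scikit-learn's name for the tuning parameter λ) are illustrative, not tuned.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler

# Simulated data: 20 predictors, only 5 of which truly matter.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)   # penalties are scale-sensitive

ridge = Ridge(alpha=1.0).fit(X, y)                     # L2: shrinks, never zeroes
lasso = Lasso(alpha=1.0).fit(X, y)                     # L1: can zero out weak predictors
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # blend of L1 and L2

for name, model in [("ridge", ridge), ("lasso", lasso), ("elastic net", enet)]:
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"{name}: {n_zero} of {model.coef_.size} coefficients are exactly zero")
```

The printout shows how many coefficients each fit sets exactly to zero; only the L1-penalized fits can produce any, which is precisely why Lasso and Elastic Net count as variable selection while Ridge does not.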
These aren't selection algorithms themselves but scoring metrics that guide model comparison. The underlying principle is penalized likelihood—rewarding model fit while punishing complexity.
Compare: AIC vs. BIC—both penalize complexity, but BIC's penalty grows with sample size while AIC's stays fixed. For large n, BIC selects simpler models. If an FRQ asks which criterion to use, consider whether the goal is prediction (AIC) or identifying true structure (BIC).
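As a concrete anchor: AIC = 2k − 2 ln L̂ and BIC = k ln(n) − 2 ln L̂, where k is the number of estimated parameters, n the sample size, and L̂ the maximized likelihood. The sketch below, assuming statsmodels and a reader-supplied nested pair of design matrices (`X_small` and `X_large` are hypothetical names), simply reads both criteria off the fitted models.

```python
import statsmodels.api as sm

def compare_ic(X_small, X_large, y):
    """Fit a reduced and a fuller nested model; report AIC and BIC for each."""
    fit_small = sm.OLS(y, sm.add_constant(X_small)).fit()
    fit_large = sm.OLS(y, sm.add_constant(X_large)).fit()
    # Lower is better for both criteria. Because BIC charges ln(n) per
    # parameter (vs. a flat 2 for AIC), large samples push BIC toward
    # the reduced model.
    print(f"reduced: AIC={fit_small.aic:.1f}  BIC={fit_small.bic:.1f}")
    print(f"full:    AIC={fit_large.aic:.1f}  BIC={fit_large.bic:.1f}")
```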
These approaches either search comprehensively through model space or transform predictors to reduce dimensionality directly.
Compare: Best Subset vs. Stepwise Regression—both search for optimal predictor combinations, but best subset is exhaustive while stepwise is greedy. Best subset finds the true optimum but becomes computationally infeasible as the predictor count grows, since it must fit all 2^p candidate models; stepwise scales better but may miss the best model.
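A minimal sketch of the exhaustive search, scoring candidates by BIC (any criterion could be swapped in), assuming statsmodels and a pandas DataFrame `X`; the function name `best_subset` is illustrative.

```python
from itertools import combinations

import pandas as pd
import statsmodels.api as sm

def best_subset(X: pd.DataFrame, y):
    """Fit every non-empty subset of columns and keep the lowest-BIC model."""
    best_bic, best_cols = float("inf"), None
    cols = list(X.columns)
    for k in range(1, len(cols) + 1):
        for subset in combinations(cols, k):            # exhaustive, not greedy
            fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
            if fit.bic < best_bic:
                best_bic, best_cols = fit.bic, list(subset)
    return best_cols, best_bic
```

The nested loop is the whole story: with p predictors there are 2^p − 1 non-empty subsets to fit, which is why the method stops being practical once p reaches even a few dozen.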
| Concept | Best Examples |
|---|---|
| Greedy sequential search | Forward Selection, Backward Elimination, Stepwise Regression |
| L1 regularization (sparsity) | Lasso, Elastic Net |
| L2 regularization (shrinkage) | Ridge Regression, Elastic Net |
| Penalized model comparison | AIC, BIC |
| Exhaustive search | Best Subset Selection |
| Dimensionality reduction | PCA |
| High-dimensional data (p > n) | Lasso, Elastic Net, PCA |
| Multicollinearity handling | Ridge Regression, PCA, Elastic Net |
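The last three rows of the table all point to PCA, which earns its place by replacing correlated predictors with a few orthogonal components. Here is a minimal sketch of principal component regression, assuming scikit-learn; the simulated 50 × 200 dataset (echoing the p > n review question below) and the choice of 5 components are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A p > n setting: 200 predictors but only 50 observations.
X, y = make_regression(n_samples=50, n_features=200, n_informative=10,
                       noise=5.0, random_state=0)

# Regress on a handful of orthogonal principal components rather than the
# raw predictors; 5 components is an illustrative choice, not a tuned one.
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
pcr.fit(X, y)
print("in-sample R^2 using 5 components:", round(pcr.score(X, y), 3))
```

Because the components are orthogonal by construction, multicollinearity disappears, and because only a handful are kept, the regression is estimable even with far more predictors than observations.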
Which two regularization methods can shrink coefficients exactly to zero, and what penalty term enables this behavior?
Compare Forward Selection and Best Subset Selection: what advantage does each have over the other, and when would computational constraints favor one approach?
A dataset has 50 observations and 200 predictors. Which variable selection techniques would be appropriate, and which would fail? Explain why.
You're comparing two nested models and need to choose between AIC and BIC. If your sample size is very large and you prioritize identifying the simplest adequate model, which criterion should you use and why?
Explain why Ridge Regression handles multicollinearity effectively even though it doesn't remove any predictors from the model. How does this differ from Lasso's approach to correlated predictors?