
🥖Linear Modeling Theory

Essential Variable Selection Techniques


Why This Matters

Variable selection sits at the heart of building effective linear models—it's the difference between a model that captures true relationships and one that's memorizing noise. When you're tested on this material, you're not just being asked to name techniques; you're being evaluated on your understanding of bias-variance tradeoffs, regularization principles, and model parsimony. Every selection method makes different assumptions about which predictors matter and how to balance model complexity against predictive power.

The techniques below fall into distinct conceptual families: some search through predictor combinations sequentially, others penalize coefficient magnitudes mathematically, and still others transform the predictor space entirely. Don't just memorize which method does what—know why you'd choose one approach over another and what tradeoffs each involves. That comparative understanding is exactly what FRQ prompts target.


Sequential Search Methods

These techniques build or trim models one predictor at a time, using statistical significance as the decision rule. The core mechanism is greedy optimization—making locally optimal choices at each step without guaranteeing a globally optimal solution.

Forward Selection

  • Starts with an empty model and adds predictors sequentially based on which one most improves model fit
  • Stops when no remaining predictor provides a statistically significant improvement to the model
  • Best for building parsimonious models when you expect only a few predictors to matter—but can miss important variable combinations
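
A minimal sketch of significance-based forward selection, assuming synthetic data and using statsmodels p-values as the entry criterion; real implementations vary in how they score candidates (p-value, AIC, or cross-validated error).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)   # only two predictors matter

alpha = 0.05                       # entry threshold
selected, remaining = [], list(range(p))

while remaining:
    # p-value of each candidate when added to the current model
    pvals = {j: sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().pvalues[-1]
             for j in remaining}
    best = min(pvals, key=pvals.get)
    if pvals[best] >= alpha:
        break                      # no candidate adds a significant improvement
    selected.append(best)
    remaining.remove(best)

print("selected predictors:", selected)
```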

Backward Elimination

  • Begins with all predictors included and removes the least significant one at each iteration
  • Continues until all remaining predictors meet the significance threshold you've specified
  • Works well when most predictors are relevant—but requires enough observations to fit the full model initially
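
The same greedy idea run in reverse: start from the full model and drop the weakest term until everything left clears the threshold. A minimal sketch, again with synthetic data and statsmodels:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)   # only two predictors matter

alpha = 0.05                       # removal threshold
selected = list(range(p))          # start from the full model (requires n > p)

while selected:
    fit = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
    pvals = fit.pvalues[1:]        # skip the intercept
    worst = int(np.argmax(pvals))
    if pvals[worst] <= alpha:
        break                      # every remaining predictor is significant
    selected.pop(worst)            # drop the least significant predictor

print("retained predictors:", selected)
```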

Stepwise Regression

  • Combines forward and backward logic, allowing predictors to enter or exit the model at any step
  • Offers flexibility for exploratory analysis when you're uncertain which predictors matter most
  • Computationally intensive and prone to overfitting—the multiple testing problem means significance thresholds need adjustment

Compare: Forward Selection vs. Backward Elimination—both use significance testing to build models sequentially, but they start from opposite ends. Forward works better with limited data; backward requires $n > p$ to fit the initial full model. If an FRQ asks about computational constraints, this distinction matters.
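
If you'd rather not hand-roll the loops, scikit-learn's SequentialFeatureSelector performs the same greedy forward or backward search, though it scores candidates by cross-validated fit rather than p-values; a sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# synthetic data: 10 candidate predictors, 4 of them informative
X, y = make_regression(n_samples=300, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

# greedy search in either direction, scored by cross-validated fit
forward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                    direction="forward", cv=5).fit(X, y)
backward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                     direction="backward", cv=5).fit(X, y)

print("forward keeps: ", np.flatnonzero(forward.get_support()))
print("backward keeps:", np.flatnonzero(backward.get_support()))
```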


Regularization-Based Methods

Rather than testing predictors in or out, these methods keep all predictors but penalize coefficient magnitudes to prevent overfitting. The key mechanism is adding a penalty term to the loss function, controlled by a tuning parameter $\lambda$.
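
One way to see the shared structure (a sketch, not a formal definition): each estimator minimizes the usual residual sum of squares plus a penalty $P(\beta)$ scaled by $\lambda$,

$$\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - x_i^\top \beta \right)^2 + \lambda \, P(\beta) \right\}$$

where $P(\beta) = \sum_j |\beta_j|$ gives the Lasso and $P(\beta) = \sum_j \beta_j^2$ gives Ridge; larger $\lambda$ means heavier shrinkage.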

Lasso (Least Absolute Shrinkage and Selection Operator)

  • Uses L1 regularization by adding $\lambda \sum_{j=1}^{p} |\beta_j|$ to the loss function
  • Shrinks some coefficients exactly to zero, performing automatic variable selection alongside estimation
  • Ideal for high-dimensional data where you suspect many predictors are irrelevant—the sparse solutions are highly interpretable
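
A minimal Lasso sketch using scikit-learn's LassoCV on synthetic data (the sizes and cross-validation settings are illustrative assumptions); standardizing predictors first matters because the penalty treats all coefficients on the same scale:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# synthetic data: 50 candidate predictors, only 5 truly informative
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)   # put predictors on a common scale

# cross-validation chooses the penalty strength (called alpha in scikit-learn)
lasso = LassoCV(cv=5).fit(X, y)
print("chosen penalty:", lasso.alpha_)
print("nonzero coefficients:", np.count_nonzero(lasso.coef_), "of", X.shape[1])
```

Most of the 50 coefficients come back exactly zero, which is the automatic variable selection described above.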

Ridge Regression

  • Uses L2 regularization by adding $\lambda \sum_{j=1}^{p} \beta_j^2$ to the loss function
  • Shrinks coefficients toward zero but never exactly to zero—all predictors remain in the model
  • Excels when predictors are highly correlated, reducing multicollinearity effects without eliminating variables
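
A minimal Ridge sketch under the correlated-predictor scenario described above (the construction of three near-duplicate predictors is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
# three nearly identical predictors built from the same latent signal
X = np.column_stack([z + 0.05 * rng.normal(size=n) for _ in range(3)])
y = 2 * z + rng.normal(size=n)

X = StandardScaler().fit_transform(X)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
print("chosen penalty:", ridge.alpha_)
print("coefficients:", ridge.coef_)   # shrunk and shared across the trio, none exactly zero
```

Ridge spreads the signal roughly evenly across the correlated trio instead of betting everything on one of them.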

Elastic Net

  • Combines L1 and L2 penalties with two tuning parameters controlling the mix: $\lambda_1 \sum |\beta_j| + \lambda_2 \sum \beta_j^2$
  • Handles correlated predictor groups better than Lasso alone, which tends to arbitrarily select one from a correlated set
  • Particularly powerful when $p > n$—the ridge component stabilizes estimation while lasso provides sparsity
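
A sketch of Elastic Net in the $p > n$ case; note that scikit-learn parameterizes the penalty with a single strength (alpha) and a mixing weight (l1_ratio) rather than two separate $\lambda$'s, so the mapping to $\lambda_1, \lambda_2$ above is a reparameterization:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

# p > n setting: 50 observations, 200 candidate predictors
X, y = make_regression(n_samples=50, n_features=200, n_informative=10,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# cross-validate both the overall penalty strength and the L1/L2 mix
enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 0.95], cv=5).fit(X, y)
print("chosen l1_ratio:", enet.l1_ratio_, "and penalty:", enet.alpha_)
print("nonzero coefficients:", np.count_nonzero(enet.coef_), "of", X.shape[1])
```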

Compare: Lasso vs. Ridge—both shrink coefficients, but only Lasso performs variable selection by zeroing out coefficients. Choose Ridge when all predictors are theoretically important; choose Lasso when you need a sparse, interpretable model. Elastic Net hedges between them.


Information Criteria

These aren't selection algorithms themselves but scoring metrics that guide model comparison. The underlying principle is penalized likelihood—rewarding model fit while punishing complexity.

Akaike Information Criterion (AIC)

  • Balances fit and complexity using the formula $AIC = 2k - 2\ln(\hat{L})$, where $k$ is the parameter count and $\hat{L}$ is the maximized likelihood
  • Lower values indicate better models—but AIC is relative, only meaningful when comparing models on the same data
  • Tends toward larger models than BIC, making it preferable when prediction accuracy is the primary goal

Bayesian Information Criterion (BIC)

  • Imposes a heavier complexity penalty using $BIC = k\ln(n) - 2\ln(\hat{L})$, where $n$ is the sample size
  • Favors simpler models than AIC, especially as sample size grows—the $\ln(n)$ term dominates
  • Better for identifying the "true" model when one exists—use this when parsimony matters more than pure prediction

Compare: AIC vs. BIC—both penalize complexity, but BIC's penalty grows with sample size while AIC's stays fixed. For large $n$, BIC selects simpler models. If an FRQ asks which criterion to use, consider whether the goal is prediction (AIC) or identifying true structure (BIC).
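
A quick illustration of how the two criteria weigh a larger model, using statsmodels (which exposes .aic and .bic on a fitted OLS result); the data-generating setup is an assumption for the example:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)   # only two real effects

small = sm.OLS(y, sm.add_constant(X[:, :2])).fit()   # the two relevant predictors
large = sm.OLS(y, sm.add_constant(X)).fit()          # all five predictors

print(f"small model: AIC={small.aic:.1f}  BIC={small.bic:.1f}")
print(f"large model: AIC={large.aic:.1f}  BIC={large.bic:.1f}")
# lower is better for both; BIC's ln(n) penalty punishes the three noise terms harder
```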


Exhaustive and Transformation Methods

These approaches either search comprehensively through model space or transform predictors to reduce dimensionality directly.

Best Subset Selection

  • Evaluates all $2^p$ possible predictor combinations to find the optimal subset according to a chosen criterion
  • Guarantees the best model for any given subset size—no greedy shortcuts that might miss optimal combinations
  • Computationally prohibitive for large $p$—practical only when you have roughly 20 or fewer candidate predictors
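
A brute-force sketch of best subset selection scored by AIC, assuming a small synthetic problem where the $2^p$ enumeration is still cheap:

```python
from itertools import combinations

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 100, 6                          # 2^6 = 64 candidate models
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] + X[:, 3] + rng.normal(size=n)

best_aic, best_subset = np.inf, ()
for k in range(p + 1):
    for subset in combinations(range(p), k):
        # intercept-only model when the subset is empty
        exog = sm.add_constant(X[:, list(subset)]) if subset else np.ones((n, 1))
        aic = sm.OLS(y, exog).fit().aic
        if aic < best_aic:
            best_aic, best_subset = aic, subset

print("best subset by AIC:", best_subset, f"(AIC = {best_aic:.1f})")
```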

Principal Component Analysis (PCA)

  • Transforms correlated predictors into uncorrelated principal components that are linear combinations of originals
  • Reduces dimensionality by retaining only components explaining the most variance—typically the first few capture most information
  • Addresses multicollinearity effectively but sacrifices interpretability since components don't correspond to original variables
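
A minimal principal component regression sketch with scikit-learn, assuming ten predictors generated from two latent factors; standardizing first matters because PCA is scale-sensitive:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 2))
# ten highly correlated predictors: noisy mixtures of two latent factors
X = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(n, 10))
y = latent[:, 0] - latent[:, 1] + rng.normal(size=n)

X_std = StandardScaler().fit_transform(X)   # PCA is scale-sensitive
pca = PCA(n_components=2).fit(X_std)
print("variance explained by 2 components:", pca.explained_variance_ratio_)

# principal component regression: regress y on the retained components
Z = pca.transform(X_std)
pcr = LinearRegression().fit(Z, y)
print("R^2 on the components:", pcr.score(Z, y))
```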

Compare: Best Subset vs. Stepwise Regression—both search for optimal predictor combinations, but best subset is exhaustive while stepwise is greedy. Best subset finds the true optimum but becomes infeasible around $p = 25$; stepwise scales better but may miss the best model.


Quick Reference Table

Concept | Best Examples
Greedy sequential search | Forward Selection, Backward Elimination, Stepwise Regression
L1 regularization (sparsity) | Lasso, Elastic Net
L2 regularization (shrinkage) | Ridge Regression, Elastic Net
Penalized model comparison | AIC, BIC
Exhaustive search | Best Subset Selection
Dimensionality reduction | PCA
High-dimensional data ($p > n$) | Lasso, Elastic Net, PCA
Multicollinearity handling | Ridge Regression, PCA, Elastic Net

Self-Check Questions

  1. Which two regularization methods can shrink coefficients exactly to zero, and what penalty term enables this behavior?

  2. Compare Forward Selection and Best Subset Selection: what advantage does each have over the other, and when would computational constraints favor one approach?

  3. A dataset has 50 observations and 200 predictors. Which variable selection techniques would be appropriate, and which would fail? Explain why.

  4. You're comparing two nested models and need to choose between AIC and BIC. If your sample size is very large and you prioritize identifying the simplest adequate model, which criterion should you use and why?

  5. Explain why Ridge Regression handles multicollinearity effectively even though it doesn't remove any predictors from the model. How does this differ from Lasso's approach to correlated predictors?