Fiveable

🥖Linear Modeling Theory Unit 8 Review


8.4 Best Subset Selection


Written by the Fiveable Content Team • Last updated August 2025

Best Subset Selection for Models

Concept and Motivation

Best subset selection works by evaluating every possible combination of predictors and picking the one that performs best according to some criterion. The core idea is straightforward: if you have $p$ candidate predictors, you fit models for every possible subset and then compare them.

Why not just include all predictors? Because adding unnecessary variables degrades a model's predictive performance on new data (overfitting) and makes interpretation harder. Best subset selection directly addresses this by searching for the smallest, most effective combination of predictors.

This approach is most practical when you have a moderate number of candidate predictors. With too many, the number of subsets explodes and the method becomes computationally infeasible.

Criteria for Model Evaluation and Comparison

You can't compare subset models using ordinary $R^2$ because it never decreases as you add predictors, no matter how useless they are. Instead, best subset selection relies on criteria that penalize model complexity:

  • Adjusted $R^2$ penalizes the inclusion of unnecessary predictors, so it can decrease when a variable adds noise rather than signal. Higher values are better.
  • Mallows' $C_p$ estimates the mean squared prediction error for a given subset model. Lower values are better, and a model where $C_p \approx p$ (the number of parameters) suggests minimal bias.
  • AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) both penalize model complexity through the likelihood function, but BIC applies a heavier penalty as sample size grows, so it tends to favor simpler models than AIC does.

Models with higher adjusted $R^2$, lower $C_p$, or lower AIC/BIC values are generally preferred. These criteria don't always agree with each other, so it's worth checking multiple criteria and understanding why they might point to different "best" models.

Exhaustive Search and Branch-and-Bound Algorithms

Exhaustive Search (All-Subsets Approach)

The brute-force version of best subset selection simply fits every possible subset and compares them. With $p$ predictors, there are $2^p$ possible subsets (including the empty model). That number grows fast:

  • 10 predictors → $2^{10} = 1{,}024$ subsets
  • 20 predictors → $2^{20} = 1{,}048{,}576$ subsets
  • 30 predictors → $2^{30}$, over 1 billion subsets

For small $p$, exhaustive search is perfectly fine. Beyond about 20–30 predictors, it becomes computationally infeasible.
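
For small problems, the brute-force search is easy to write down. The sketch below is a hypothetical illustration (the function names are our own, not from any package): it fits every subset with an intercept via NumPy's least-squares solver and keeps the one with the highest adjusted $R^2$.

```python
from itertools import combinations
import numpy as np

def fit_sse(X, y, cols):
    """SSE of an OLS fit using the predictor columns in `cols` plus an intercept."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid)

def best_subset(X, y):
    """Exhaustive search: return the subset with the highest adjusted R^2."""
    n, p = X.shape
    tss = float(((y - y.mean()) ** 2).sum())
    best_cols, best_adj = (), -np.inf
    for k in range(p + 1):                      # every subset size, 0..p
        for cols in combinations(range(p), k):  # every subset of that size
            sse = fit_sse(X, y, cols)
            adj = 1 - (sse / tss) * (n - 1) / (n - k - 1)
            if adj > best_adj:
                best_cols, best_adj = cols, adj
    return best_cols, best_adj
```

The double loop makes the $2^p$ cost explicit: the outer loop runs over subset sizes and the inner loop over all subsets of each size, so the work doubles with every added candidate predictor.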

Branch-and-Bound Algorithms

Branch-and-bound algorithms find the same optimal subset as exhaustive search but skip large portions of the search space. Here's how they work:

  1. Start with an empty subset and define the search space as a tree, where each branch represents adding or excluding a predictor.
  2. At each node, compute a bound on the best possible criterion value achievable by any subset extending from that node.
  3. If that bound is worse than the best complete model found so far, prune the entire branch. No subset along that branch can beat the current best.
  4. Continue exploring unpruned branches, updating the best-known solution as better models are found.
  5. Terminate when all branches have been either explored or pruned.

This pruning relies on the monotonicity property of certain criteria: for residual-based measures like $SSE$, adding a predictor can only decrease (or maintain) the error. So if a subset already performs poorly, all its supersets can be bounded and potentially discarded.

Branch-and-bound can dramatically reduce computation, but for very large $p$, even this approach becomes too slow.
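
The pruning idea can be sketched in a few lines. This is a hypothetical implementation with our own function names, not a particular package's algorithm: it finds the best subset of a fixed size $k$ by SSE, using the monotonicity bound described above. The SSE of a model containing all candidates still available in a branch is a lower bound on the SSE of every subset reachable from that node.

```python
from itertools import combinations
import numpy as np

def sse(X, y, cols):
    """SSE of an OLS fit with an intercept and the predictors in `cols`."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

def branch_and_bound(X, y, k):
    """Best size-k subset by SSE, pruning branches that cannot beat the incumbent."""
    p = X.shape[1]
    best = {"cols": None, "sse": np.inf}

    def recurse(included, next_idx):
        if len(included) == k:
            s = sse(X, y, included)
            if s < best["sse"]:
                best["cols"], best["sse"] = tuple(included), s
            return
        avail = list(range(next_idx, p))
        if len(included) + len(avail) < k:
            return  # not enough predictors left to reach size k
        # Monotonicity bound: the model using ALL remaining candidates has
        # SSE no larger than any size-k subset below this node.
        bound = sse(X, y, included + avail)
        if bound >= best["sse"]:
            return  # prune: nothing in this branch can beat the incumbent
        for j in avail:
            recurse(included + [j], j + 1)

    recurse([], 0)
    return best["cols"], best["sse"]
```

Because the bound is a true lower bound, this returns exactly the same subset as brute-force enumeration; pruning only skips work, never answers.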

Model Evaluation and Comparison


Adjusted $R^2$

Adjusted $R^2$ modifies ordinary $R^2$ to account for the number of predictors:

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$

where $n$ is the sample size and $p$ is the number of predictors. The denominator $n - p - 1$ shrinks as $p$ increases, so adding a weak predictor can actually lower adjusted $R^2$. That's the penalty at work. You select the model with the highest adjusted $R^2$.
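
The penalty is easy to see numerically. In this minimal sketch (the SSE and TSS values are made-up for illustration), a third predictor that barely reduces the SSE raises ordinary $R^2$ but lowers the adjusted version:

```python
def adjusted_r2(sse, tss, n, p):
    """Adjusted R^2 from the residual (SSE) and total (TSS) sums of squares."""
    r2 = 1.0 - sse / tss
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical numbers: the third predictor trims SSE from 40.0 to only 39.8.
two_pred = adjusted_r2(sse=40.0, tss=100.0, n=30, p=2)    # ≈ 0.570
three_pred = adjusted_r2(sse=39.8, tss=100.0, n=30, p=3)  # ≈ 0.556
```

Ordinary $R^2$ rose from 0.600 to 0.602, yet adjusted $R^2$ fell, flagging the extra variable as not worth its cost.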

Mallows' $C_p$

$C_p$ estimates the total mean squared prediction error for a subset model with $p$ predictors:

$$C_p = \frac{SSE_p}{MSE_{full}} - n + 2p$$

where $SSE_p$ is the sum of squared errors for the subset model, $MSE_{full}$ is the mean squared error from the full model (all predictors), and $n$ is the sample size. A well-fitting model without substantial bias will have $C_p$ close to $p$. You select the model with the lowest $C_p$, and values much larger than $p$ suggest the model is missing important predictors.
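
The formula is a one-liner; the numbers below are hypothetical, chosen so the arithmetic is easy to follow:

```python
def mallows_cp(sse_p, mse_full, n, p):
    """Mallows' C_p for a subset model with p predictors."""
    return sse_p / mse_full - n + 2 * p

# Hypothetical example: 47/1 - 50 + 8 = 5, close to p = 4,
# which is consistent with a subset model that has little bias.
cp = mallows_cp(sse_p=47.0, mse_full=1.0, n=50, p=4)
```

Had the subset omitted an important predictor, $SSE_p$ would be inflated and $C_p$ would land well above $p$.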

Information Criteria (AIC and BIC)

Both criteria penalize the log-likelihood by a term related to model size:

  • AIC: $-2\log(L) + 2k$
  • BIC: $-2\log(L) + k\log(n)$

where $L$ is the maximized likelihood, $k$ is the number of estimated parameters, and $n$ is the sample size. The key difference is that BIC's penalty grows with $\log(n)$, so for any reasonably sized dataset ($n \geq 8$, since $\log(n) > 2$ once $n > e^2 \approx 7.4$), BIC penalizes complexity more heavily than AIC. In practice, BIC tends to select sparser models. You select the model with the lowest value for either criterion.
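
For a linear model with Gaussian errors, $-2\log(L)$ reduces to $n\log(SSE/n)$ plus a constant that is identical across models, so both criteria can be computed straight from the SSE. A minimal sketch under that convention (constants dropped, function names our own):

```python
import math

def aic_gaussian(sse, n, k):
    """AIC for a Gaussian linear model, dropping additive constants."""
    return n * math.log(sse / n) + 2 * k

def bic_gaussian(sse, n, k):
    """BIC under the same convention; each parameter costs log(n)."""
    return n * math.log(sse / n) + k * math.log(n)

# At n = 100, an extra parameter costs log(100) ≈ 4.6 under BIC
# versus a flat 2 under AIC, which is why BIC leans sparser.
```

Because the $n\log(SSE/n)$ term is shared, model rankings depend only on how each criterion trades SSE reduction against its penalty.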

Computational Challenges of Best Subset Selection

Exponential Growth of Possible Subsets

The fundamental limitation is combinatorial: $2^p$ subsets means the problem scales exponentially. With 40 predictors, you'd need to evaluate over 1 trillion subsets. With 50, it's over $10^{15}$. No amount of computing power makes that practical for routine analysis.

Limitations for High-Dimensional Data

Even with branch-and-bound, best subset selection is typically limited to problems with at most a few tens of predictors. Beyond that, two problems emerge:

  • Computational cost becomes prohibitive despite pruning, because the search space is simply too vast.
  • Statistical instability increases because with many candidate subsets, the method is more likely to find a subset that fits the training data well by chance. This is sometimes called "selection bias" or a consequence of multiple comparisons across so many models.

When $p$ is large relative to $n$, the selected model can be highly sensitive to small changes in the data.

Alternatives for Large pp

When best subset selection isn't feasible, several alternatives are commonly used:

  • Stepwise selection (forward, backward, or both) explores a much smaller portion of the model space by adding or removing one predictor at a time. It's fast but not guaranteed to find the globally optimal subset.
  • LASSO ($L_1$ regularization) performs variable selection continuously by shrinking some coefficients exactly to zero. It scales well to high-dimensional settings.
  • Ridge regression ($L_2$ regularization) doesn't select variables but handles multicollinearity and high dimensionality through shrinkage.
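
To illustrate the first alternative, a forward pass of stepwise selection can be sketched in a few lines (hypothetical helper names, not a library API). Instead of $2^p$ fits, it greedily adds the single predictor that most reduces the SSE at each step, fitting on the order of $p \times k$ models:

```python
import numpy as np

def sse(X, y, cols):
    """SSE of an OLS fit with an intercept and the predictors in `cols`."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

def forward_stepwise(X, y, k_max):
    """Greedy forward selection: grow the model one best predictor at a time."""
    p = X.shape[1]
    chosen, remaining = [], set(range(p))
    path = []  # (subset, SSE) after each step
    for _ in range(min(k_max, p)):
        best_j = min(remaining, key=lambda j: sse(X, y, chosen + [j]))
        chosen.append(best_j)
        remaining.remove(best_j)
        path.append((tuple(chosen), sse(X, y, chosen)))
    return path
```

The trade-off named above is visible in the structure: each step conditions on earlier choices, so the greedy path can miss a jointly optimal subset that exhaustive search or branch-and-bound would find.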

The choice among these depends on the balance you need between computational cost, interpretability, and predictive accuracy for your specific problem.