Best Subset Selection for Models
Concept and Motivation
Best subset selection works by evaluating every possible combination of predictors and picking the one that performs best according to some criterion. The core idea is straightforward: if you have p candidate predictors, you fit models for every possible subset (2^p of them) and then compare them.
Why not just include all predictors? Because adding unnecessary variables degrades a model's predictive performance on new data (overfitting) and makes interpretation harder. Best subset selection directly addresses this by searching for the smallest, most effective combination of predictors.
This approach is most practical when you have a moderate number of candidate predictors. With too many, the number of subsets explodes and the method becomes computationally infeasible.
Criteria for Model Evaluation and Comparison
You can't compare subset models using ordinary R², because it never decreases as you add predictors. Instead, best subset selection relies on criteria that penalize model complexity:
- Adjusted R² penalizes the inclusion of unnecessary predictors, so it can decrease when a variable adds noise rather than signal. Higher values are better.
- Mallows' C_p estimates the mean squared prediction error for a given subset model. Lower values are better, and a model where C_p ≈ p (the number of parameters) suggests minimal bias.
- AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) both penalize model complexity through the likelihood function, but BIC applies a heavier penalty as sample size grows, so it tends to favor simpler models than AIC does.
Models with higher adjusted R², lower C_p, or lower AIC/BIC values are generally preferred. These criteria don't always agree with each other, so it's worth checking multiple criteria and understanding why they might point to different "best" models.
Exhaustive Search and Branch-and-Bound Algorithms
Exhaustive Search (All-Subsets Approach)
The brute-force version of best subset selection simply fits every possible subset and compares them. With p predictors, there are 2^p possible subsets (including the empty model). That number grows fast:
- 10 predictors → 1,024 subsets
- 20 predictors → about 1 million subsets (1,048,576)
- 30 predictors → over 1 billion subsets
For small p, exhaustive search is perfectly fine. Beyond about 20–30 predictors, it becomes computationally infeasible.
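For small p, the brute-force loop is easy to write down. Here is a minimal sketch using NumPy least squares; the function name `best_subset` and the choice of adjusted R² as the selection criterion are illustrative, not a reference implementation:

```python
from itertools import combinations
import numpy as np

def best_subset(X, y):
    """Fit every non-empty subset of columns of X and return the
    subset (with intercept) that has the highest adjusted R^2."""
    n, p = X.shape
    tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
    best_cols, best_score = None, -np.inf
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            Xs = np.column_stack([np.ones(n), X[:, cols]])  # add intercept
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            sse = np.sum((y - Xs @ beta) ** 2)
            adj_r2 = 1 - (sse / (n - k - 1)) / (tss / (n - 1))
            if adj_r2 > best_score:
                best_cols, best_score = cols, adj_r2
    return best_cols, best_score

# Toy data: only columns 0 and 2 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(size=100)
cols, score = best_subset(X, y)
print("selected:", cols)
```

With 6 predictors this loop fits 63 models; at 30 predictors it would need over a billion, which is exactly the scaling problem described above.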
Branch-and-Bound Algorithms
Branch-and-bound algorithms find the same optimal subset as exhaustive search but skip large portions of the search space. Here's how they work:
- Start with an empty subset and define the search space as a tree, where each branch represents adding or excluding a predictor.
- At each node, compute a bound on the best possible criterion value achievable by any subset extending from that node.
- If that bound is worse than the best complete model found so far, prune the entire branch. No subset along that branch can beat the current best.
- Continue exploring unpruned branches, updating the best-known solution as better models are found.
- Terminate when all branches have been either explored or pruned.
This pruning relies on the monotonicity property of certain criteria: for residual-based measures like SSE, adding a predictor can only decrease (or maintain) the error. So if a subset already performs poorly, all its supersets can be bounded and potentially discarded.
Branch-and-bound can dramatically reduce computation, but for very large p, even this approach becomes too slow.
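The pruning rule can be made concrete. The sketch below (my own illustrative code, framed as finding the size-k subset with minimal SSE) uses the monotonicity bound directly: the SSE of a node's chosen variables plus all still-available variables lower-bounds the SSE of every subset reachable from that node, so the node can be pruned if that bound already loses to the incumbent:

```python
import numpy as np

def sse(X, y, cols):
    """Sum of squared errors for an OLS fit on the given columns."""
    n = len(y)
    Xs = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return float(np.sum((y - Xs @ beta) ** 2))

def bnb_best_k(X, y, k):
    """Branch-and-bound search for the size-k subset with minimal SSE."""
    p = X.shape[1]
    best = {"cols": None, "sse": np.inf}

    def recurse(chosen, next_idx):
        remaining = list(range(next_idx, p))
        if len(chosen) == k:
            s = sse(X, y, chosen)
            if s < best["sse"]:
                best.update(cols=tuple(chosen), sse=s)
            return
        if len(chosen) + len(remaining) < k:
            return  # not enough variables left to reach size k
        # Bound: SSE can only fall as predictors are added, so fitting
        # chosen + all remaining gives a lower bound for this branch.
        if sse(X, y, chosen + remaining) >= best["sse"]:
            return  # prune: no completion of this node can beat the incumbent
        for i in remaining:
            recurse(chosen + [i], i + 1)

    recurse([], 0)
    return best["cols"], best["sse"]

# Toy data: columns 1 and 4 carry the signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 8))
y = 3 * X[:, 1] - 2 * X[:, 4] + rng.normal(size=80)
cols, best_sse = bnb_best_k(X, y, 2)
print("selected:", cols)
```

This returns the same answer as exhaustively checking every size-k subset, but skips branches whose bound already loses to the best model found so far.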
Model Evaluation and Comparison

Adjusted R²
Adjusted R² modifies ordinary R² to account for the number of predictors:

adjusted R² = 1 - (1 - R²) (n - 1) / (n - p - 1)

where n is the sample size and p is the number of predictors. The denominator n - p - 1 shrinks as p increases, so adding a weak predictor can actually lower adjusted R². That's the penalty at work. You select the model with the highest adjusted R².
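Translated directly into code (the function name and example numbers are illustrative):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# A fourth predictor that nudges R^2 from 0.750 to only 0.752 is not
# worth its degree of freedom here: adjusted R^2 goes down, not up.
print(adjusted_r2(0.750, 50, 3))
print(adjusted_r2(0.752, 50, 4))
```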
Mallows' C_p
C_p estimates the total mean squared prediction error for a subset model with p parameters (including the intercept):

C_p = SSE_p / MSE_full - n + 2p

where SSE_p is the sum of squared errors for the subset model, MSE_full is the mean squared error from the full model (all predictors), and n is the sample size. A well-fitting model without substantial bias will have C_p close to p. You select the model with the lowest C_p, and values much larger than p suggest the model is missing important predictors.
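In code (illustrative names; p here counts the subset model's parameters, matching the formula above):

```python
def mallows_cp(sse_subset, mse_full, n, p):
    """Mallows' C_p = SSE_p / MSE_full - n + 2p."""
    return sse_subset / mse_full - n + 2 * p

# Sanity check: for an unbiased subset model, E[SSE_p] is roughly
# (n - p) * sigma^2 and MSE_full is roughly sigma^2, so C_p lands near p.
print(mallows_cp(sse_subset=(100 - 4) * 2.0, mse_full=2.0, n=100, p=4))  # prints 4.0
```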
Information Criteria (AIC and BIC)
Both criteria penalize the log-likelihood by a term related to model size:
- AIC = 2k - 2 ln(L)
- BIC = k ln(n) - 2 ln(L)
where L is the maximized likelihood, k is the number of estimated parameters, and n is the sample size. The key difference is that BIC's penalty grows with n: its per-parameter penalty is ln(n) versus AIC's constant 2, so for any reasonably sized dataset (n ≥ 8), BIC penalizes complexity more heavily than AIC. In practice, BIC tends to select sparser models. You select the model with the lowest value for either criterion.
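Both formulas are one-liners given a model's log-likelihood (function names and numbers below are illustrative):

```python
import math

def aic(loglik, k):
    """AIC = 2k - 2 ln(L), with loglik = ln(L)."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """BIC = k ln(n) - 2 ln(L), with loglik = ln(L)."""
    return k * math.log(n) - 2 * loglik

# Same fit, same size: BIC's ln(n) per-parameter penalty exceeds
# AIC's constant 2 once n >= 8, so BIC is the stricter criterion here.
print(aic(-120.0, 5))
print(bic(-120.0, 5, 200))
```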
Computational Challenges of Best Subset Selection
Exponential Growth of Possible Subsets
The fundamental limitation is combinatorial: 2^p subsets means the problem scales exponentially with p. With 40 predictors, you'd need to evaluate over 1 trillion subsets. With 50, it's over 10^15. No amount of computing power makes that practical for routine analysis.
Limitations for High-Dimensional Data
Even with branch-and-bound, best subset selection is typically limited to problems with at most a few tens of predictors. Beyond that, two problems emerge:
- Computational cost becomes prohibitive despite pruning, because the search space is simply too vast.
- Statistical instability increases because with many candidate subsets, the method is more likely to find a subset that fits the training data well by chance. This is sometimes called "selection bias" or a consequence of multiple comparisons across so many models.
When p is large relative to n, the selected model can be highly sensitive to small changes in the data.
Alternatives for Large p
When best subset selection isn't feasible, several alternatives are commonly used:
- Stepwise selection (forward, backward, or both) explores a much smaller portion of the model space by adding or removing one predictor at a time. It's fast but not guaranteed to find the globally optimal subset.
- LASSO (L1 regularization) performs variable selection continuously by shrinking some coefficients exactly to zero. It scales well to high-dimensional settings.
- Ridge regression (L2 regularization) doesn't select variables but handles multicollinearity and high dimensionality through shrinkage.
The choice among these depends on the balance you need between computational cost, interpretability, and predictive accuracy for your specific problem.
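As a point of contrast with the exhaustive approach, forward stepwise selection is easy to sketch. The version below (my own illustration, using adjusted R² as the stopping rule; `forward_select` is not a library function) adds one predictor at a time and fits only O(p) models per step instead of 2^p in total:

```python
import numpy as np

def forward_select(X, y):
    """Greedy forward selection: repeatedly add the predictor that most
    reduces SSE, stopping when no addition improves adjusted R^2."""
    n, p = X.shape
    tss = np.sum((y - y.mean()) ** 2)

    def fit_sse(cols):
        Xs = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        return float(np.sum((y - Xs @ beta) ** 2))

    def adj_r2(cols):
        k = len(cols)
        return 1 - (fit_sse(cols) / (n - k - 1)) / (tss / (n - 1))

    chosen, current = [], -np.inf
    while len(chosen) < p:
        candidates = [c for c in range(p) if c not in chosen]
        best_c = min(candidates, key=lambda c: fit_sse(chosen + [c]))
        if adj_r2(chosen + [best_c]) <= current:
            break  # no remaining candidate improves adjusted R^2
        chosen.append(best_c)
        current = adj_r2(chosen)
    return chosen

# Toy data: columns 3 and 7 carry the signal.
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 10))
y = 4 * X[:, 3] - 2 * X[:, 7] + rng.normal(size=120)
sel = forward_select(X, y)
print("selected:", sel)
```

The greedy path is not guaranteed to reach the globally best subset, which is exactly the trade-off against best subset selection described above.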