Why This Matters
In linear modeling, building a model that fits your training data perfectly is actually easy: just throw in every predictor you can find. The real challenge is building a model that generalizes to new data without overfitting. This is where model selection criteria become essential.
These criteria all deal with the fundamental tradeoff between model complexity and predictive accuracy. They embody core statistical principles: parsimony, bias-variance tradeoff, out-of-sample performance, and hypothesis testing. Don't just memorize that "lower AIC is better." Understand why each criterion penalizes complexity and when one criterion outperforms another.
Information-Theoretic Criteria
These criteria use information theory to quantify model quality, balancing goodness of fit against model complexity through explicit penalty terms. The core idea: a model that explains the data well but uses fewer parameters is preferred because it's less likely to be capturing noise.
Akaike Information Criterion (AIC)
- Formula: AIC = 2k − 2 ln(L̂), where k is the number of estimated parameters and L̂ is the maximized likelihood
- The 2k penalty term discourages overfitting. Theoretically, AIC is derived from minimizing the expected Kullback-Leibler divergence between the fitted model and the true data-generating process.
- Best use case: comparing multiple non-nested models when prediction accuracy is the primary goal. Lower AIC is better.
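For an OLS model with Gaussian errors, the maximized log-likelihood is a function of RSS, so AIC can be computed (up to an additive constant) as n·ln(RSS/n) + 2k. A minimal sketch with synthetic data; the data-generating setup and function name are illustrative assumptions, not a standard API:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)        # y truly depends on x only

def aic_ols(X, y):
    """AIC for OLS with Gaussian errors, up to an additive constant:
    n*ln(RSS/n) + 2k, where k counts coefficients plus the variance parameter."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = X.shape[1] + 1                   # +1 for the error variance
    return n * np.log(rss / n) + 2 * k

X1 = np.column_stack([np.ones(n), x])                  # intercept + x
X2 = np.column_stack([X1, rng.normal(size=(n, 3))])    # + 3 pure-noise predictors
print(aic_ols(X1, y), aic_ols(X2, y))
```

Because the noise columns reduce RSS only slightly, the 2k penalty typically leaves the smaller model with the lower AIC.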
Bayesian Information Criterion (BIC)
- Formula: BIC = k ln(n) − 2 ln(L̂), where n is the sample size
- The ln(n) term creates a penalty that grows with sample size, making BIC more conservative than AIC. For any n ≥ 8, BIC penalizes each additional parameter more heavily than AIC does.
- Derived from Bayesian model selection, BIC approximates the log of the marginal likelihood under certain regularity conditions. It tends to favor simpler models, especially in large datasets.
Mallows' Cp
- Formula: Cp = RSS_p/σ̂² − n + 2p, where p is the number of parameters (predictors plus intercept) and σ̂² is estimated from a full model
- Target value: Cp ≈ p. When Cp is much larger than p, the model is likely missing important predictors and suffering from bias. When Cp is close to p, the model's bias is small relative to its variance.
- Because it requires an estimate of σ² from a larger reference model, Mallows' Cp is most useful for comparing subsets of predictors drawn from a known full model.
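The mechanics can be sketched as follows. A useful sanity check: for the full reference model itself, Cp equals p exactly by construction. The data and model sizes here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 150
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
beta_true = np.array([1.0, 2.0, -1.0, 0.0, 0.0])   # last two predictors irrelevant
y = X_full @ beta_true + rng.normal(size=n)

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

p_full = X_full.shape[1]
sigma2_hat = rss(X_full, y) / (n - p_full)   # sigma^2 from the full reference model

def mallows_cp(X, y):
    p = X.shape[1]                           # parameters, intercept included
    return rss(X, y) / sigma2_hat - n + 2 * p

print(mallows_cp(X_full[:, :3], y))          # subset with the real predictors
print(mallows_cp(X_full, y))                 # full model: exactly p_full
```

An intercept-only fit here yields a Cp far above p, the signature of a badly underspecified model.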
Compare: AIC vs. BIC: both penalize complexity, but BIC's per-parameter penalty of ln(n) grows with sample size while AIC's stays fixed at 2. For large n, BIC selects simpler models. If you're working with a massive dataset and care about identifying the "true" model (assuming it's among your candidates), BIC is the stronger choice. If your goal is purely predictive accuracy, AIC tends to perform better because it's more willing to retain useful predictors.
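The per-parameter penalties can be compared directly; the crossover near n = 8 follows from ln(8) ≈ 2.08:

```python
import math

for n in (5, 8, 100, 10_000):
    # per-parameter penalty: AIC is always 2, BIC grows with ln(n)
    print(f"n={n:>6}: AIC penalty = 2.00, BIC penalty = {math.log(n):.2f}")
```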
Variance-Based Measures
These criteria focus on how well the model explains variation in the response variable, directly measuring the gap between predicted and observed values. They quantify fit quality but require careful interpretation when comparing models of different complexity.
Adjusted R-squared
- Formula: R²_adj = 1 − (1 − R²)(n − 1)/(n − k − 1), where k is the number of predictors (not counting the intercept)
- Unlike regular R², which can only increase or stay the same as you add predictors, adjusted R² can decrease when you add a predictor that doesn't improve fit enough to justify the lost degree of freedom.
- Interpretation stays intuitive: it estimates the proportion of variance explained, adjusted for model complexity. Use it instead of raw R² whenever you're comparing models with different numbers of predictors.
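A sketch of the computation on synthetic data, following the formula above (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)

def adj_r2(X, y):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    k = X.shape[1] - 1                  # predictors, not counting the intercept
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

X1 = np.column_stack([np.ones(n), x])
print(adj_r2(X1, y))
```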
Residual Sum of Squares (RSS)
- Formula: RSS = Σ (yᵢ − ŷᵢ)², summed over i = 1 to n
- RSS is the foundation for most other fit measures, but it always decreases (or stays the same) as you add predictors. This means raw RSS alone cannot guide model selection, since it will always favor the most complex model.
- Think of RSS as a building block: it feeds into the calculation of R², MSE, AIC, F-tests, and Mallows' Cp.
Mean Squared Error (MSE)
- Formula: MSE = RSS/n for the simple version, or RSS/(n − p) for the unbiased estimator of the error variance (sometimes written σ̂²), where p is the total number of estimated coefficients, intercept included
- Because squaring amplifies large residuals, MSE is sensitive to outliers. Keep this in mind when your data contains extreme values.
- The distinction between training MSE and test MSE is critical. Low training MSE paired with high test MSE is the classic signature of overfitting.
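The overfitting signature is easy to reproduce: fit polynomials of increasing degree to a small noisy sample and compare training MSE with MSE on fresh data. The sine target, noise level, and degrees below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(6)
x_tr = rng.uniform(-1, 1, 30)           # small training sample
x_te = rng.uniform(-1, 1, 1000)         # large held-out test sample
f = lambda x: np.sin(3 * x)
y_tr = f(x_tr) + 0.3 * rng.normal(size=30)
y_te = f(x_te) + 0.3 * rng.normal(size=1000)

def poly_mse(degree):
    """Training and test MSE for a polynomial fit of the given degree."""
    coefs = np.polyfit(x_tr, y_tr, degree)
    train_mse = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    test_mse = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    return train_mse, test_mse

for d in (3, 10):
    print(d, poly_mse(d))               # higher degree: training MSE always falls
```

Training MSE is guaranteed not to rise as the degree grows (the models are nested), while test MSE eventually climbs, which is exactly the gap described above.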
Compare: RSS vs. Adjusted R²: RSS always improves with more predictors, making it useless for model comparison on its own. Adjusted R² corrects this by penalizing unnecessary complexity. Always report adjusted R², not raw R², when comparing models with different numbers of predictors.
Hypothesis Testing Approaches
These methods use formal statistical tests to determine whether additional model complexity is justified. The logic: if adding parameters doesn't significantly improve fit beyond what you'd expect by chance, keep the simpler model.
F-test for Nested Models
The F-test checks whether the reduction in RSS from adding predictors is statistically significant. It requires models to be nested, meaning the simpler (reduced) model is a special case of the more complex (full) model.
- Formula: F = [(RSS_reduced − RSS_full)/(df_reduced − df_full)] / (RSS_full/df_full)
- Compare this statistic to the F-distribution with (df_reduced − df_full) and df_full degrees of freedom.
- A significant result (small p-value) means the additional predictors in the full model provide a meaningful improvement in fit that's unlikely due to chance alone.
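A sketch of the computation, testing whether adding a genuinely relevant predictor improves on an intercept-only model. The data are synthetic, and SciPy is assumed available for the F-distribution tail probability:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)

def fit_rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

X_reduced = np.ones((n, 1))                     # intercept only (nested in full)
X_full = np.column_stack([X_reduced, x])        # intercept + x

rss_r, rss_f = fit_rss(X_reduced, y), fit_rss(X_full, y)
df_r, df_f = n - X_reduced.shape[1], n - X_full.shape[1]

F = ((rss_r - rss_f) / (df_r - df_f)) / (rss_f / df_f)
p_value = f_dist.sf(F, df_r - df_f, df_f)       # upper-tail probability
print(F, p_value)
```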
Likelihood Ratio Test
- Test statistic: Λ = −2 ln(L_reduced/L_full), which follows a χ² distribution under the null hypothesis, with degrees of freedom equal to the difference in the number of parameters
- The LRT is more general than the F-test because it applies to any models estimated via maximum likelihood, not just OLS regression. This makes it the go-to choice for generalized linear models.
- Like the F-test, it requires nested models: the simpler model must be a restricted version of the complex one.
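For OLS with Gaussian errors, the statistic reduces to n·ln(RSS_reduced/RSS_full), so a quick sketch is possible without computing likelihoods explicitly. SciPy is assumed available for the χ² tail probability; the data are synthetic:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
n = 120
x = rng.normal(size=n)
y = 0.5 + 1.2 * x + rng.normal(size=n)

def fit_rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

rss_reduced = fit_rss(np.ones((n, 1)), y)                   # intercept only
rss_full = fit_rss(np.column_stack([np.ones(n), x]), y)     # intercept + x

lam = n * np.log(rss_reduced / rss_full)   # -2 ln(L_reduced / L_full) under OLS
p_value = chi2.sf(lam, df=1)               # one extra parameter in the full model
print(lam, p_value)
```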
Compare: F-test vs. Likelihood Ratio Test: both compare nested models, but the F-test is specific to linear regression while the LRT works for any likelihood-based model. For standard OLS problems, they give equivalent results. For generalized linear models (logistic regression, Poisson regression, etc.), use the LRT.
Predictive Validation Methods
These approaches directly estimate out-of-sample performance by testing the model on data it hasn't seen. This is the most honest assessment of how your model will perform in practice, avoiding the optimism bias that plagues in-sample measures.
Cross-Validation
- K-fold method: split the data into k subsets (folds), train on kโ1 folds, test on the held-out fold, rotate through all folds, and average the prediction errors.
- Leave-one-out cross-validation (LOOCV) is the extreme case where k = n. It has low bias (each training set is nearly the full dataset), but its n fits overlap almost completely, so the resulting error estimate can have high variance, and it's computationally expensive for large n.
- Cross-validation is often called the gold standard for prediction because it directly estimates generalization error rather than relying on penalty-based approximations.
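The k-fold procedure described above can be sketched with plain NumPy (the fold count, seed, and data are illustrative; in practice a library routine such as scikit-learn's KFold handles the splitting):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

def kfold_mse(X, y, k=5, seed=0):
    """Average held-out MSE of an OLS fit over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    fold_errors = []
    for test_idx in np.array_split(idx, k):         # rotate through the folds
        train_idx = np.setdiff1d(idx, test_idx)
        beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        pred = X[test_idx] @ beta
        fold_errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(fold_errors))

print(kfold_mse(X, y))   # should land near the noise variance of 1
```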
Prediction Error
- Definition: ŷᵢ − yᵢ, the raw difference between the model's prediction and the observed value
- Expected prediction error is what all model selection criteria are ultimately trying to minimize. It decomposes into three components: bias, variance, and irreducible error. This decomposition is the theoretical backbone of the bias-variance tradeoff.
- Test set prediction error is the most reliable metric for model comparison when you have sufficient data to hold out a separate test set.
Compare: Cross-validation vs. AIC/BIC: information criteria use mathematical penalties to approximate out-of-sample error, while cross-validation measures it directly. Cross-validation is more computationally intensive but makes fewer distributional assumptions. When you have enough data, cross-validation is generally preferred. When data is limited or you need a quick comparison across many candidate models, AIC/BIC are practical alternatives.
Quick Reference Table
| Use case / property | Criteria |
| --- | --- |
| Information-theoretic selection | AIC, BIC, Mallows' Cp |
| Penalizes complexity more heavily | BIC (vs. AIC) |
| Variance explained | Adjusted R², RSS |
| Nested model comparison | F-test, Likelihood Ratio Test |
| Out-of-sample validation | Cross-validation, Prediction Error |
| Requires full-model σ̂² estimate | Mallows' Cp |
| Sensitive to outliers | MSE, RSS |
| Works for non-linear models | AIC, BIC, Cross-validation, LRT |
Self-Check Questions
- You're comparing three non-nested regression models on a dataset with n = 500. Which criterion, AIC or BIC, will tend to select the simpler model, and why?
- A colleague adds five new predictors to a model and reports that R² increased. Why is this insufficient evidence that the new model is better, and what metric should they report instead?
- Compare and contrast the F-test and cross-validation as model selection tools. Under what circumstances would you prefer each?
- If Mallows' Cp for a model with 6 predictors equals 15, what does this suggest about the model's fit?
- You're given training MSE and test MSE for two models: Model A (training: 2.1, test: 8.7) and Model B (training: 3.4, test: 4.2). Which model should you recommend, and what phenomenon explains Model A's performance gap?