In linear modeling, building a model that fits your training data perfectly is actually easy: just throw in every predictor you can find. The real challenge? Building a model that generalizes to new data without overfitting. This is where model selection criteria become essential. You're being tested on your ability to understand the fundamental tradeoff between model complexity and predictive accuracy, and knowing when to use each criterion can make or break an FRQ response.
These criteria embody core statistical principles: parsimony, bias-variance tradeoff, out-of-sample performance, and hypothesis testing. Don't just memorize that "lower AIC is better"; understand why each criterion penalizes complexity and when one criterion outperforms another. The best exam answers demonstrate that you grasp the underlying logic, not just the formulas.
These criteria use information theory to quantify model quality, balancing goodness of fit against model complexity through explicit penalty terms. The core idea: a model that explains the data well but uses fewer parameters is preferred because it's less likely to be capturing noise.
Compare: AIC vs. BIC. Both penalize complexity, but BIC's per-parameter penalty grows with sample size (ln n) while AIC's stays constant at 2 per parameter. For large n, BIC selects simpler models. If an FRQ asks which criterion to use with a massive dataset, BIC is your answer.
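To make the penalty terms concrete, here is a minimal NumPy sketch that computes Gaussian AIC and BIC from the residual sum of squares, dropping additive constants shared by all models. The helper name `aic_bic_from_rss` and the synthetic data are hypothetical, chosen only to illustrate how the two penalties diverge.

```python
import numpy as np

def aic_bic_from_rss(rss, n, k):
    """Gaussian AIC and BIC from the residual sum of squares, dropping
    additive constants shared by all models. k counts every estimated
    coefficient, including the intercept (any shared offset in k shifts
    all models equally and leaves the ranking unchanged)."""
    aic = n * np.log(rss / n) + 2 * k          # flat penalty: 2 per parameter
    bic = n * np.log(rss / n) + k * np.log(n)  # penalty grows with ln(n)
    return aic, bic

# Hypothetical comparison: one real predictor vs. adding a pure-noise predictor.
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=(2, n))
y = 3.0 + 1.5 * x1 + rng.normal(size=n)  # x2 is unrelated to y

for cols, label in [((x1,), "x1 only"), ((x1, x2), "x1 + x2")]:
    X = np.column_stack([np.ones(n), *cols])
    _, rss, _, _ = np.linalg.lstsq(X, y, rcond=None)
    aic, bic = aic_bic_from_rss(float(rss[0]), n, X.shape[1])
    print(f"{label}: AIC = {aic:.1f}, BIC = {bic:.1f}")
```

With n = 200, the BIC penalty per extra parameter is ln(200) ≈ 5.3 versus AIC's flat 2, which is exactly why BIC leans toward the smaller model as the sample grows.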
These criteria focus on how well the model explains variation in the response variable, directly measuring the gap between predicted and observed values. They quantify fit quality but require careful interpretation when comparing models of different complexity.
Compare: RSS vs. Adjusted R². RSS never gets worse when you add predictors, making it useless for model comparison on its own. Adjusted R² corrects this by penalizing unnecessary complexity. Always use adjusted R², not raw R², when comparing models with different numbers of predictors.
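As a quick reference, here is a short sketch of how the adjustment works; the function names are hypothetical, and `p` counts predictors excluding the intercept.

```python
import numpy as np

def r_squared(y, y_hat):
    """Plain R^2 = 1 - RSS/TSS; it never decreases as predictors are added."""
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - rss / tss

def adjusted_r_squared(y, y_hat, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    where p is the number of predictors (intercept excluded).
    The ratio grows with p, so useless predictors pull the value down."""
    n = len(y)
    return 1.0 - (1.0 - r_squared(y, y_hat)) * (n - 1) / (n - p - 1)
```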
These methods use formal statistical tests to determine whether additional model complexity is justified. The logic: if adding parameters doesn't significantly improve fit beyond what we'd expect by chance, keep the simpler model.
Compare: F-test vs. Likelihood Ratio Test. Both compare nested models, but the F-test is specific to linear regression while the LRT works for any likelihood-based model. For standard OLS problems they give equivalent results; for generalized linear models, use the LRT.
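Below is a small sketch of the partial (nested-model) F-test computed from the two models' residual sums of squares, assuming SciPy is available for the F distribution; the RSS values in the usage line are made up purely for illustration.

```python
from scipy import stats

def nested_f_test(rss_reduced, rss_full, n, p_full, q):
    """Partial F-test for a reduced model nested inside a full model.
    q       -- number of extra parameters in the full model
    p_full  -- total parameters estimated by the full model (incl. intercept)
    Returns the F statistic and its upper-tail p-value."""
    df_full = n - p_full
    f_stat = ((rss_reduced - rss_full) / q) / (rss_full / df_full)
    p_value = stats.f.sf(f_stat, q, df_full)
    return f_stat, p_value

# Made-up RSS values for illustration:
f_stat, p = nested_f_test(rss_reduced=520.0, rss_full=480.0, n=100, p_full=5, q=2)
print(f"F = {f_stat:.2f}, p = {p:.4f}")  # small p suggests the extra predictors help
```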
These approaches directly estimate out-of-sample performance by testing the model on data it hasn't seen. This is the most honest assessment of how your model will perform in practice, avoiding the optimism bias of in-sample measures.
Compare: Cross-validation vs. AIC/BICโinformation criteria use mathematical penalties to approximate out-of-sample error, while cross-validation measures it directly. Cross-validation is more computationally intensive but makes fewer assumptions. When you have enough data, cross-validation is preferred.
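For intuition, here is a brief sketch comparing two candidate models by 5-fold cross-validated MSE, assuming scikit-learn is available; the synthetic dataset and the particular predictor subsets are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data; in practice X_full and y come from your own dataset.
rng = np.random.default_rng(1)
n = 150
X_full = rng.normal(size=(n, 5))                                  # 5 candidate predictors
y = 2.0 + X_full[:, 0] - 0.5 * X_full[:, 1] + rng.normal(size=n)  # only 2 matter

# Compare a 2-predictor model against the full 5-predictor model by
# 5-fold cross-validated MSE (lower is better).
for cols, label in [([0, 1], "2 predictors"), ([0, 1, 2, 3, 4], "5 predictors")]:
    scores = cross_val_score(LinearRegression(), X_full[:, cols], y,
                             cv=5, scoring="neg_mean_squared_error")
    print(f"{label}: CV MSE = {-scores.mean():.3f}")
```

Each fold's model never sees its own test observations, which is what keeps the resulting MSE estimate honest about out-of-sample performance.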
| Concept | Best Examples |
|---|---|
| Information-theoretic selection | AIC, BIC, Mallow's Cp |
| Penalizes complexity more heavily | BIC (vs. AIC) |
| Variance explained | Adjusted R², RSS |
| Nested model comparison | F-test, Likelihood Ratio Test |
| Out-of-sample validation | Cross-validation, Prediction Error |
| Requires full-model estimate | Mallow's Cp |
| Sensitive to outliers | MSE, RSS |
| Works for non-linear models | AIC, BIC, Cross-validation, LRT |
You're comparing three non-nested regression models fit to a large dataset. Which criterion, AIC or BIC, will tend to select the simpler model, and why?
A colleague adds five new predictors to a model and reports that R² increased. Why is this insufficient evidence that the new model is better, and what metric should they report instead?
Compare and contrast the F-test and cross-validation as model selection tools. Under what circumstances would you prefer each?
If Mallow's Cp for a model with 6 predictors equals 15, what does this suggest about the model's fit?
An FRQ presents training MSE and test MSE for two models: Model A (training: 2.1, test: 8.7) and Model B (training: 3.4, test: 4.2). Which model should you recommend, and what phenomenon explains Model A's performance gap?