
Linear Modeling Theory

Key Model Selection Criteria


Why This Matters

In linear modeling, building a model that fits your training data perfectly is actually easy: just throw in every predictor you can find. The real challenge? Building a model that generalizes to new data without overfitting. This is where model selection criteria become essential. You're being tested on your ability to understand the fundamental tradeoff between model complexity and predictive accuracy, and knowing when to use each criterion can make or break an FRQ response.

These criteria embody core statistical principles: parsimony, bias-variance tradeoff, out-of-sample performance, and hypothesis testing. Don't just memorize that "lower AIC is better"; understand why each criterion penalizes complexity and when one criterion outperforms another. The best exam answers demonstrate that you grasp the underlying logic, not just the formulas.


Information-Theoretic Criteria

These criteria use information theory to quantify model quality, balancing goodness of fit against model complexity through explicit penalty terms. The core idea: a model that explains the data well but uses fewer parameters is preferred because it's less likely to be capturing noise.

Akaike Information Criterion (AIC)

  • Formula: $AIC = 2k - 2\ln(\hat{L})$, where $k$ is the number of parameters and $\hat{L}$ is the maximized likelihood (see the sketch after this list)
  • Penalty structure adds $2k$ to discourage overfitting; derived from minimizing expected Kullback-Leibler divergence
  • Best use case is comparing multiple non-nested models when prediction accuracy is the primary goal
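
To make the penalty concrete, here is a minimal sketch, assuming a Gaussian linear model fit with statsmodels (the simulated data and variable names are invented for illustration): it computes AIC directly from the maximized log-likelihood and compares it with the value statsmodels reports.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: y depends on x1; x2 is pure noise (illustrative assumption).
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

# AIC = 2k - 2 ln(L_hat); here k counts the regression coefficients (incl. intercept),
# matching statsmodels' convention. Some texts also count sigma^2, which shifts every
# model's AIC by the same constant and so does not change the ranking.
k = X.shape[1]
aic_manual = 2 * k - 2 * fit.llf
print(aic_manual, fit.aic)   # the two values should agree
```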

Bayesian Information Criterion (BIC)

  • Formula: $BIC = k\ln(n) - 2\ln(\hat{L})$; the $\ln(n)$ term creates a stronger penalty as sample size grows (see the sketch below)
  • More conservative than AIC because the penalty scales with sample size, favoring simpler models in large datasets
  • Derived from Bayesian model selection, approximating the log of the marginal likelihood under certain conditions
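
A small sketch of how the two penalties diverge in practice (simulated data, statsmodels assumed; the setup is hypothetical): with a large $n$, the BIC gap between a correct simple model and an over-specified one is noticeably wider than the AIC gap.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000                                   # large n, so ln(n) ~ 7.6 per extra parameter
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)    # x2 carries no signal

small = sm.OLS(y, sm.add_constant(x1)).fit()
big = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

for label, fit in [("x1 only", small), ("x1 + x2", big)]:
    print(f"{label}: AIC = {fit.aic:.1f}, BIC = {fit.bic:.1f}")
# With n = 2000, both criteria should prefer the x1-only model, and the BIC gap is
# wider: its penalty per parameter is ln(2000) ~ 7.6 versus AIC's flat 2.
```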

Mallows' Cp

  • Formula: $C_p = \frac{RSS_p}{\hat{\sigma}^2} - n + 2p$, where $p$ is the number of predictors including the intercept (see the sketch below)
  • Target value is $C_p \approx p$; values much larger than $p$ suggest important predictors are missing
  • Requires an estimate of $\sigma^2$ from a full model, making it useful for comparing subsets of a larger predictor set
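
Here is a minimal sketch (simulated data; `mallows_cp` is a helper written for this example, not a library function) of computing $C_p$ for candidate subsets using $\hat{\sigma}^2$ from the full model.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
X_full = rng.normal(size=(n, 4))
y = 1.0 + 2.0 * X_full[:, 0] - 1.5 * X_full[:, 1] + rng.normal(size=n)

# sigma^2 estimated from the full model: RSS_full / (n - p_full)
full = sm.OLS(y, sm.add_constant(X_full)).fit()
sigma2_hat = full.mse_resid

def mallows_cp(cols):
    """C_p for the subset of columns `cols`, with p counting the intercept."""
    sub = sm.OLS(y, sm.add_constant(X_full[:, cols])).fit()
    p = len(cols) + 1
    return sub.ssr / sigma2_hat - n + 2 * p

print(mallows_cp([0]))      # drops a real predictor: expect C_p >> p = 2
print(mallows_cp([0, 1]))   # correct subset: expect C_p close to p = 3
```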

Compare: AIC vs. BIC. Both penalize complexity, but BIC's $\ln(n)$ penalty grows with sample size while AIC's penalty stays constant at $2k$. For large $n$, BIC selects simpler models. If an FRQ asks which criterion to use with a massive dataset, BIC is your answer.


Variance-Based Measures

These criteria focus on how well the model explains variation in the response variable, directly measuring the gap between predicted and observed values. They quantify fit quality but require careful interpretation when comparing models of different complexity.

Adjusted R-squared

  • Formula: $R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-k-1}$; penalizes adding predictors that don't improve fit proportionally (see the sketch below)
  • Can decrease when you add a weak predictor, unlike regular $R^2$, which only increases or stays the same
  • Interpretation remains intuitive: proportion of variance explained, adjusted for model complexity
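
As a sanity check, the sketch below (simulated data; `adj_r2` is a throwaway helper, and statsmodels already exposes `rsquared_adj`) computes adjusted $R^2$ by hand and shows how it can fall when a pure-noise predictor is added even though raw $R^2$ creeps up.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100
x1, noise = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

def adj_r2(fit, n, k):
    """Adjusted R^2 from raw R^2; k = number of predictors, excluding the intercept."""
    return 1 - (1 - fit.rsquared) * (n - 1) / (n - k - 1)

m1 = sm.OLS(y, sm.add_constant(x1)).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, noise]))).fit()

print(m1.rsquared, m2.rsquared)              # raw R^2 never decreases when a predictor is added
print(adj_r2(m1, n, 1), adj_r2(m2, n, 2))    # adjusted R^2 can drop when the predictor is pure noise
# The hand-rolled values should match m1.rsquared_adj and m2.rsquared_adj.
```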

Residual Sum of Squares (RSS)

  • Formula: $RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, the foundation for most other fit measures (see the sketch below)
  • Always decreases as you add predictors, which is why raw RSS alone cannot guide model selection
  • Building block for calculating $R^2$, MSE, AIC, and $F$-tests; know this formula cold
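
The sketch below (simulated data, where only the first column carries signal) computes RSS directly from the residuals and shows why it cannot guide selection on its own: it never increases as predictors are added.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 150
X = rng.normal(size=(n, 3))            # only the first column matters
y = 2.0 * X[:, 0] + rng.normal(size=n)

def rss(fit, y):
    """Residual sum of squares between observed and fitted values."""
    return np.sum((y - fit.fittedvalues) ** 2)

for k in range(1, 4):
    fit = sm.OLS(y, sm.add_constant(X[:, :k])).fit()
    print(k, rss(fit, y))              # monotonically non-increasing in k
```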

Mean Squared Error (MSE)

  • Formula: $MSE = \frac{RSS}{n}$, or $\frac{RSS}{n-k}$ (with $k$ counting all estimated coefficients, including the intercept) for the unbiased estimator of the error variance (see the sketch below)
  • Sensitive to outliers because squaring amplifies large residuals; consider this when data contains extreme values
  • Training vs. test MSE distinction is criticalโ€”low training MSE with high test MSE signals overfitting
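
The training-versus-test distinction is easiest to see by simulation. The sketch below (scikit-learn assumed; the polynomial degrees are arbitrary choices for illustration) fits a sensible model and a deliberately overfit one, then compares their training and test MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, size=(80, 1))
y = 1.0 + 2.0 * x[:, 0] + rng.normal(scale=2.0, size=80)

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

for degree in (1, 10):                 # degree 10 is deliberately overfit
    feats = PolynomialFeatures(degree, include_bias=False)
    model = LinearRegression().fit(feats.fit_transform(x_tr), y_tr)
    mse_tr = mean_squared_error(y_tr, model.predict(feats.transform(x_tr)))
    mse_te = mean_squared_error(y_te, model.predict(feats.transform(x_te)))
    print(degree, round(mse_tr, 2), round(mse_te, 2))
# Expect the degree-10 fit to show the lower training MSE but the higher test MSE:
# that gap is the signature of overfitting.
```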

Compare: RSS vs. Adjusted $R^2$. RSS always improves with more predictors, making it useless for model comparison alone. Adjusted $R^2$ corrects this by penalizing unnecessary complexity. Always use adjusted $R^2$, not raw $R^2$, when comparing models with different numbers of predictors.


Hypothesis Testing Approaches

These methods use formal statistical tests to determine whether additional model complexity is justified. The logic: if adding parameters doesn't significantly improve fit beyond what we'd expect by chance, keep the simpler model.

F-test for Nested Models

  • Tests whether the reduction in RSS from adding predictors is statistically significant; requires models to be nested (one is a special case of the other)
  • Formula: $F = \frac{(RSS_{reduced} - RSS_{full})/(df_{reduced} - df_{full})}{RSS_{full}/df_{full}}$; compare to an $F$-distribution with $(df_{reduced} - df_{full},\, df_{full})$ degrees of freedom (see the sketch below)
  • Significant result (small p-value) means the additional predictors in the full model provide meaningful improvement
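
The sketch below (simulated data; statsmodels and SciPy assumed) builds the nested-model $F$ statistic from the RSS and degrees-of-freedom pieces, then checks it against statsmodels' built-in comparison.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(6)
n = 120
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

reduced = sm.OLS(y, sm.add_constant(x1)).fit()
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# F = [(RSS_reduced - RSS_full) / (df_reduced - df_full)] / [RSS_full / df_full]
df_diff = reduced.df_resid - full.df_resid
F = ((reduced.ssr - full.ssr) / df_diff) / (full.ssr / full.df_resid)
p_value = stats.f.sf(F, df_diff, full.df_resid)
print(F, p_value)

print(full.compare_f_test(reduced))    # (F statistic, p-value, df difference)
```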

Likelihood Ratio Test

  • Test statistic: $\Lambda = -2\ln\left(\frac{L_{reduced}}{L_{full}}\right)$, which follows a chi-squared distribution under the null hypothesis, with degrees of freedom equal to the number of restrictions (see the sketch below)
  • More general than F-test because it applies to any models estimated via maximum likelihood, not just OLS regression
  • Requires nested models just like the F-test; the simpler model must be a restricted version of the complex one
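
A parallel sketch for the likelihood ratio test (same simulated setup; statsmodels and SciPy assumed): the statistic comes straight from the two maximized log-likelihoods and is referred to a chi-squared distribution.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(7)
n = 120
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

reduced = sm.OLS(y, sm.add_constant(x1)).fit()
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Lambda = -2 ln(L_reduced / L_full) = -2 (llf_reduced - llf_full)
lam = -2 * (reduced.llf - full.llf)
df = int(full.df_model - reduced.df_model)   # number of restrictions tested
print(lam, stats.chi2.sf(lam, df))

print(full.compare_lr_test(reduced))         # (LR statistic, p-value, df difference)
```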

Compare: F-test vs. Likelihood Ratio Test. Both compare nested models, but the F-test is specific to linear regression while the LRT works for any likelihood-based model. For standard OLS problems, they give equivalent results; for generalized linear models, use the LRT.


Predictive Validation Methods

These approaches directly estimate out-of-sample performance by testing the model on data it hasn't seen. This is the most honest assessment of how your model will perform in practice, avoiding the optimism bias of in-sample measures.

Cross-Validation

  • K-fold method splits data into $k$ subsets, trains on $k-1$ folds, tests on the held-out fold, and averages results (see the sketch below)
  • Leave-one-out (LOOCV) is the extreme case where $k = n$; low bias but high variance and computationally expensive
  • Gold standard for prediction because it directly estimates generalization error rather than relying on penalty approximations
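
A minimal k-fold sketch with scikit-learn (simulated data; only the first predictor carries signal): the cross-validated MSE is the direct estimate of out-of-sample error that the information criteria only approximate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(8)
n = 200
X = rng.normal(size=(n, 3))             # only the first column is real signal
y = 2.0 * X[:, 0] + rng.normal(size=n)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for k in range(1, 4):                   # candidate models: first k predictors
    scores = cross_val_score(LinearRegression(), X[:, :k], y,
                             cv=cv, scoring="neg_mean_squared_error")
    print(k, -scores.mean())            # estimated out-of-sample MSE; lower is better
```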

Prediction Error

  • Definition: $\hat{y}_i - y_i$, the raw difference between what the model predicts and what actually occurs
  • Expected prediction error is what we're ultimately trying to minimize; it decomposes into squared bias, variance, and irreducible error (simulated in the sketch after this list)
  • Test set prediction error is the most reliable metric for model comparison when sufficient data exists for a holdout set
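
The decomposition can be checked by simulation. The sketch below is a toy setup (a known true function, a fixed test point, and a cubic polynomial fit, all chosen for illustration): it refits the model on many training sets and splits the expected squared prediction error at that point into bias squared, variance, and irreducible noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(9)
sigma = 0.3                                  # irreducible noise level (known in this toy setup)
true_f = np.sin                              # true regression function (known in this toy setup)
x0 = np.array([[1.5]])                       # test point under study

preds = []
for _ in range(500):                         # many training sets from the same process
    x = rng.uniform(0, 3, size=(30, 1))
    y = true_f(x[:, 0]) + rng.normal(scale=sigma, size=30)
    feats = PolynomialFeatures(3, include_bias=False)
    model = LinearRegression().fit(feats.fit_transform(x), y)
    preds.append(model.predict(feats.transform(x0))[0])

preds = np.array(preds)
bias2 = (preds.mean() - true_f(x0[0, 0])) ** 2
variance = preds.var()
print(bias2, variance, sigma ** 2)           # the three pieces of the decomposition
print(bias2 + variance + sigma ** 2)         # ~ expected squared prediction error at x0
```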

Compare: Cross-validation vs. AIC/BIC. Information criteria use mathematical penalties to approximate out-of-sample error, while cross-validation measures it directly. Cross-validation is more computationally intensive but makes fewer assumptions. When you have enough data, cross-validation is preferred.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Information-theoretic selection | AIC, BIC, Mallows' $C_p$ |
| Penalizes complexity more heavily | BIC (vs. AIC) |
| Variance explained | Adjusted $R^2$, RSS |
| Nested model comparison | F-test, Likelihood Ratio Test |
| Out-of-sample validation | Cross-validation, Prediction Error |
| Requires full-model $\sigma^2$ estimate | Mallows' $C_p$ |
| Sensitive to outliers | MSE, RSS |
| Works for non-linear models | AIC, BIC, Cross-validation, LRT |

Self-Check Questions

  1. You're comparing three non-nested regression models on a dataset with $n = 500$. Which criterion, AIC or BIC, will tend to select the simpler model, and why?

  2. A colleague adds five new predictors to a model and reports that $R^2$ increased. Why is this insufficient evidence that the new model is better, and what metric should they report instead?

  3. Compare and contrast the F-test and cross-validation as model selection tools. Under what circumstances would you prefer each?

  4. If Mallows' $C_p$ for a model with 6 predictors equals 15, what does this suggest about the model's fit?

  5. An FRQ presents training MSE and test MSE for two models: Model A (training: 2.1, test: 8.7) and Model B (training: 3.4, test: 4.2). Which model should you recommend, and what phenomenon explains Model A's performance gap?