Linear Modeling Theory

Key Model Selection Criteria

Why This Matters

In linear modeling, building a model that fits your training data perfectly is actually easy: just throw in every predictor you can find. The real challenge is building a model that generalizes to new data without overfitting. This is where model selection criteria become essential.

These criteria all deal with the fundamental tradeoff between model complexity and predictive accuracy. They embody core statistical principles: parsimony, bias-variance tradeoff, out-of-sample performance, and hypothesis testing. Don't just memorize that "lower AIC is better." Understand why each criterion penalizes complexity and when one criterion outperforms another.


Information-Theoretic Criteria

These criteria use information theory to quantify model quality, balancing goodness of fit against model complexity through explicit penalty terms. The core idea: a model that explains the data well but uses fewer parameters is preferred because it's less likely to be capturing noise.

Akaike Information Criterion (AIC)

  • Formula: AIC = 2k - 2\ln(\hat{L}), where k is the number of estimated parameters and \hat{L} is the maximized likelihood
  • The 2k penalty term discourages overfitting. Theoretically, AIC is derived from minimizing the expected Kullback-Leibler divergence between the fitted model and the true data-generating process.
  • Best use case: comparing multiple non-nested models when prediction accuracy is the primary goal. Lower AIC is better.
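A minimal sketch of the formula in Python (the helper names `aic` and `aic_ols` are illustrative, not from any particular library):

```python
import math

def aic(log_likelihood, k):
    """AIC = 2k - 2*ln(L-hat), where k counts estimated parameters."""
    return 2 * k - 2 * log_likelihood

# For OLS with Gaussian errors, the maximized log-likelihood depends on
# the data only through RSS:  ln(L-hat) = -(n/2) * (ln(2*pi*RSS/n) + 1),
# so AIC can be computed from RSS, n, and k alone.
def aic_ols(rss, n, k):
    log_lik = -(n / 2) * (math.log(2 * math.pi * rss / n) + 1)
    return aic(log_lik, k)
```

Only differences in AIC between models fit to the same data are meaningful, not the absolute value.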

Bayesian Information Criterion (BIC)

  • Formula: BIC = k\ln(n) - 2\ln(\hat{L}), where n is the sample size
  • The \ln(n) term creates a penalty that grows with sample size, making BIC more conservative than AIC. For any n \geq 8, BIC penalizes each additional parameter more heavily than AIC does.
  • Derived from Bayesian model selection, BIC approximates the log of the marginal likelihood under certain regularity conditions. It tends to favor simpler models, especially in large datasets.
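A quick sketch of the formula, alongside AIC to make the penalty comparison concrete (function names are illustrative):

```python
import math

def bic(log_likelihood, k, n):
    """BIC = k*ln(n) - 2*ln(L-hat)."""
    return k * math.log(n) - 2 * log_likelihood

def aic(log_likelihood, k):
    """AIC for comparison: 2k - 2*ln(L-hat)."""
    return 2 * k - 2 * log_likelihood

# ln(n) exceeds 2 once n >= 8 (since e^2 is about 7.39), so from that
# sample size onward BIC charges more per parameter than AIC.
```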

Mallows' Cp

  • Formula: C_p = \frac{RSS_p}{\hat{\sigma}^2} - n + 2p, where p is the number of parameters (predictors plus intercept) and \hat{\sigma}^2 is estimated from a full model
  • Target value: C_p \approx p. When C_p is much larger than p, the model is likely missing important predictors and suffering from bias. When C_p is close to p, the model's bias is small relative to its variance.
  • Because it requires an estimate of \sigma^2 from a larger reference model, Mallows' C_p is most useful for comparing subsets of predictors drawn from a known full model.
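The formula can be sketched directly (function name is illustrative; \hat{\sigma}^2 here is taken as RSS_full / (n - p_full), one common choice):

```python
def mallows_cp(rss_p, sigma2_full, n, p):
    """Mallows' Cp = RSS_p / sigma^2_full - n + 2p.

    sigma2_full: error-variance estimate from the full reference model,
    e.g. RSS_full / (n - p_full).  p counts predictors plus intercept.
    """
    return rss_p / sigma2_full - n + 2 * p
```

One sanity check: plugging the full model into its own formula gives C_p = p_full exactly, since RSS_full / (RSS_full / (n - p_full)) = n - p_full.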

Compare: AIC vs. BIC: both penalize complexity, but BIC's \ln(n) penalty grows with sample size while AIC's penalty stays constant at 2k. For large n, BIC selects simpler models. If you're working with a massive dataset and care about identifying the "true" model (assuming it's among your candidates), BIC is the stronger choice. If your goal is purely predictive accuracy, AIC tends to perform better because it's more willing to retain useful predictors.


Variance-Based Measures

These criteria focus on how well the model explains variation in the response variable, directly measuring the gap between predicted and observed values. They quantify fit quality but require careful interpretation when comparing models of different complexity.

Adjusted R-squared

  • Formula: R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-k-1}, where k is the number of predictors (not counting the intercept)
  • Unlike regular R^2, which can only increase or stay the same as you add predictors, adjusted R^2 can decrease when you add a predictor that doesn't improve fit enough to justify the lost degree of freedom.
  • Interpretation stays intuitive: it estimates the proportion of variance explained, adjusted for model complexity. Use it instead of raw R^2 whenever you're comparing models with different numbers of predictors.
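A one-line sketch of the adjustment (function name is illustrative):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1).

    k counts predictors, excluding the intercept.
    """
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

For example, with n = 30, going from R^2 = 0.500 with 3 predictors to R^2 = 0.505 with 4 predictors actually lowers adjusted R^2: the tiny gain in fit doesn't pay for the lost degree of freedom.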

Residual Sum of Squares (RSS)

  • Formula: RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2
  • RSS is the foundation for most other fit measures, but it always decreases (or stays the same) as you add predictors. This means raw RSS alone cannot guide model selection, since it will always favor the most complex model.
  • Think of RSS as a building block: it feeds into the calculation of R^2, MSE, AIC, F-tests, and Mallows' C_p.

Mean Squared Error (MSE)

  • Formula: MSE = \frac{RSS}{n} for the simple version, or \frac{RSS}{n-k} for the unbiased estimator of the error variance (sometimes written as \hat{\sigma}^2), where k here counts all estimated coefficients, including the intercept
  • Because squaring amplifies large residuals, MSE is sensitive to outliers. Keep this in mind when your data contains extreme values.
  • The distinction between training MSE and test MSE is critical. Low training MSE paired with high test MSE is the classic signature of overfitting.
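The two measures above can be sketched in a few lines (function names are illustrative):

```python
def rss(y, y_hat):
    """Residual sum of squares: sum of squared prediction errors."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

def mse(y, y_hat):
    """Mean squared error: RSS / n."""
    return rss(y, y_hat) / len(y)
```

The outlier sensitivity is easy to see here: a single residual of 10 adds 100 to RSS, as much as one hundred residuals of 1.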

Compare: RSS vs. Adjusted R^2: RSS always improves with more predictors, making it useless for model comparison on its own. Adjusted R^2 corrects this by penalizing unnecessary complexity. Always report adjusted R^2, not raw R^2, when comparing models with different numbers of predictors.


Hypothesis Testing Approaches

These methods use formal statistical tests to determine whether additional model complexity is justified. The logic: if adding parameters doesn't significantly improve fit beyond what you'd expect by chance, keep the simpler model.

F-test for Nested Models

The F-test checks whether the reduction in RSS from adding predictors is statistically significant. It requires models to be nested, meaning the simpler (reduced) model is a special case of the more complex (full) model.

  • Formula: F = \frac{(RSS_{reduced} - RSS_{full})/(df_{reduced} - df_{full})}{RSS_{full}/df_{full}}
  • Compare this statistic to the F-distribution with (df_{reduced} - df_{full}) and df_{full} degrees of freedom.
  • A significant result (small p-value) means the additional predictors in the full model provide a meaningful improvement in fit that's unlikely due to chance alone.
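The statistic itself is a short computation (function name is illustrative; df is n minus the number of estimated parameters for each model). Turning it into a p-value requires an F-distribution CDF from a stats library, which is omitted here:

```python
def f_statistic(rss_reduced, rss_full, df_reduced, df_full):
    """F = ((RSS_r - RSS_f) / (df_r - df_f)) / (RSS_f / df_f)."""
    # Numerator: improvement in RSS per extra parameter spent.
    numerator = (rss_reduced - rss_full) / (df_reduced - df_full)
    # Denominator: the full model's estimate of the error variance.
    denominator = rss_full / df_full
    return numerator / denominator
```

For instance, dropping two predictors from a model with df_full = 94 and seeing RSS rise from 100 to 120 gives F = (20/2)/(100/94) = 9.4, which would then be compared to an F(2, 94) distribution.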

Likelihood Ratio Test

  • Test statistic: \Lambda = -2\ln\left(\frac{L_{reduced}}{L_{full}}\right), which follows a \chi^2 distribution under the null hypothesis, with degrees of freedom equal to the difference in the number of parameters
  • The LRT is more general than the F-test because it applies to any models estimated via maximum likelihood, not just OLS regression. This makes it the go-to choice for generalized linear models.
  • Like the F-test, it requires nested models: the simpler model must be a restricted version of the complex one.
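Since software reports log-likelihoods rather than likelihoods, the statistic is usually computed as a difference (function name is illustrative; the \chi^2 p-value lookup again needs a stats library and is omitted):

```python
def lr_statistic(loglik_reduced, loglik_full):
    """Lambda = -2 ln(L_reduced / L_full) = 2 * (ll_full - ll_reduced).

    Compare to a chi-squared distribution with df equal to the
    difference in the number of estimated parameters.
    """
    return 2 * (loglik_full - loglik_reduced)
```

Note that the full model's log-likelihood can never be lower than the reduced model's (the reduced model is a restricted special case), so the statistic is always nonnegative.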

Compare: F-test vs. Likelihood Ratio Test: both compare nested models, but the F-test is specific to linear regression while the LRT works for any likelihood-based model. For standard OLS problems, they give equivalent results. For generalized linear models (logistic regression, Poisson regression, etc.), use the LRT.


Predictive Validation Methods

These approaches directly estimate out-of-sample performance by testing the model on data it hasn't seen. This is the most honest assessment of how your model will perform in practice, avoiding the optimism bias that plagues in-sample measures.

Cross-Validation

  • K-fold method: split the data into k subsets (folds), train on k-1 folds, test on the held-out fold, rotate through all folds, and average the prediction errors.
  • Leave-one-out cross-validation (LOOCV) is the extreme case where k = n. It has low bias (each training set is nearly the full dataset) but high variance across folds, and it's computationally expensive for large n.
  • Cross-validation is often called the gold standard for prediction because it directly estimates generalization error rather than relying on penalty-based approximations.
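The k-fold procedure described above can be sketched with a toy mean-only "model" standing in for a real fit step (function names and the toy model are illustrative):

```python
def kfold_splits(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold CV."""
    # Distribute n samples across k folds as evenly as possible.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def cv_error(y, k):
    """CV estimate of squared error for a mean-only toy model."""
    fold_errors = []
    for train, test in kfold_splits(len(y), k):
        pred = sum(y[i] for i in train) / len(train)  # "fit" on train
        fold_errors.append(sum((y[i] - pred) ** 2 for i in test) / len(test))
    return sum(fold_errors) / len(fold_errors)  # average over folds
```

Setting k = len(y) in this sketch gives LOOCV: every fold holds out exactly one observation.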

Prediction Error

  • Definition: \hat{y}_i - y_i, the raw difference between the model's prediction and the observed value
  • Expected prediction error is what all model selection criteria are ultimately trying to minimize. It decomposes into three components: bias, variance, and irreducible error. This decomposition is the theoretical backbone of the bias-variance tradeoff.
  • Test set prediction error is the most reliable metric for model comparison when you have sufficient data to hold out a separate test set.
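The three-part decomposition mentioned above can be written out explicitly for squared-error loss at a fixed input x_0, with f the true regression function and \hat{f} the fitted model:

```latex
\mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\big)^2}_{\text{bias}^2}
  + \underbrace{\operatorname{Var}\big(\hat{f}(x_0)\big)}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Simple models tend to have high bias and low variance; complex models the reverse. Model selection criteria are all attempts to find the sweet spot between the first two terms, since the third cannot be reduced by any model.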

Compare: Cross-validation vs. AIC/BIC: information criteria use mathematical penalties to approximate out-of-sample error, while cross-validation measures it directly. Cross-validation is more computationally intensive but makes fewer distributional assumptions. When you have enough data, cross-validation is generally preferred. When data is limited or you need a quick comparison across many candidate models, AIC/BIC are practical alternatives.


Quick Reference Table

  • Information-theoretic selection: AIC, BIC, Mallows' C_p
  • Penalizes complexity more heavily: BIC (vs. AIC)
  • Variance explained: Adjusted R^2, RSS
  • Nested model comparison: F-test, Likelihood Ratio Test
  • Out-of-sample validation: Cross-validation, Prediction Error
  • Requires full-model \hat{\sigma}^2 estimate: Mallows' C_p
  • Sensitive to outliers: MSE, RSS
  • Works for non-linear models: AIC, BIC, Cross-validation, LRT

Self-Check Questions

  1. You're comparing three non-nested regression models on a dataset with n = 500. Which criterion, AIC or BIC, will tend to select the simpler model, and why?

  2. A colleague adds five new predictors to a model and reports that R^2 increased. Why is this insufficient evidence that the new model is better, and what metric should they report instead?

  3. Compare and contrast the F-test and cross-validation as model selection tools. Under what circumstances would you prefer each?

  4. If Mallows' C_p for a model with 6 predictors equals 15, what does this suggest about the model's fit?

  5. You're given training MSE and test MSE for two models: Model A (training: 2.1, test: 8.7) and Model B (training: 3.4, test: 4.2). Which model should you recommend, and what phenomenon explains Model A's performance gap?