Akaike vs Bayesian Information Criteria
Information Criteria for Model Selection
AIC and BIC are scoring tools that let you rank competing models by balancing how well each model fits the data against how many parameters it uses. Lower scores are better for both criteria.
AIC (Akaike Information Criterion) estimates the relative information lost by a given model. It accounts for goodness of fit and the number of parameters:

AIC = 2k - 2 ln(L̂)

where k is the number of estimated parameters (including the intercept and error variance) and L̂ is the maximized value of the likelihood function.
BIC (Bayesian Information Criterion) uses the same logic but introduces a stronger penalty that scales with sample size:

BIC = k ln(n) - 2 ln(L̂)

where n is the sample size. Because ln(n) > 2 whenever n ≥ 8, BIC penalizes additional parameters more heavily than AIC for virtually any real dataset.
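The two formulas translate directly into code. This is a minimal sketch (the function names `aic` and `bic` are my own, not from any particular library):

```python
import math

def aic(log_likelihood: float, k: int) -> float:
    """AIC = 2k - 2 ln(L-hat)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood: float, k: int, n: int) -> float:
    """BIC = k ln(n) - 2 ln(L-hat)."""
    return k * math.log(n) - 2 * log_likelihood

# Example: a model with maximized log-likelihood -120.5, 4 parameters,
# fit to 50 observations. Since ln(50) > 2, BIC exceeds AIC here.
print(aic(-120.5, k=4))        # 249.0
print(bic(-120.5, k=4, n=50))  # ≈ 256.65
```

Note that both functions take the *log*-likelihood directly, which is what fitting routines typically report; passing the raw likelihood by mistake is a common source of wildly wrong scores.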
Both criteria follow the principle of parsimony (Occam's razor): prefer the simplest model that adequately explains the data.
Assumptions and Considerations
- AIC and BIC do not require that candidate models be nested. You can compare non-nested models as long as they are fit to the exact same dataset and the likelihoods are computed on the same scale (same response variable, same observations).
- Information criteria give you a quantitative ranking, but they shouldn't be the only factor in your decision. Interpretability, computational cost, and the goal of the analysis (prediction vs. inference) all matter.
- Neither criterion tells you whether your best-scoring model is actually good in an absolute sense. They only tell you which candidate is best relative to the others.
Calculating AIC and BIC
Steps to Calculate AIC
- Count the number of estimated parameters k. In a linear regression, this includes each coefficient (including the intercept) plus the error variance, so a model with p predictors has k = p + 2.
- Fit the model and obtain the maximized log-likelihood ln(L̂).
- Plug into the formula: AIC = 2k - 2 ln(L̂).

Steps to Calculate BIC
- Count the number of estimated parameters k (same as above).
- Record the sample size n.
- Fit the model and obtain the maximized log-likelihood ln(L̂).
- Plug into the formula: BIC = k ln(n) - 2 ln(L̂).
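The steps above can be worked end to end for an ordinary least-squares regression. This sketch uses NumPy on simulated data (the data and seed are made up for illustration) and the Gaussian log-likelihood with the maximum-likelihood variance estimate σ̂² = RSS/n:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(scale=0.5, size=n)

# Fit y = b0 + b1*x by least squares (the ML fit under Gaussian errors)
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta) ** 2)

# Step 1: count parameters -- intercept, slope, and error variance
k = 3

# Step 2: maximized Gaussian log-likelihood with sigma^2 = RSS / n
sigma2 = rss / n
loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(sigma2) + 1)

# Step 3: plug into the formulas
aic = 2 * k - 2 * loglik
bic = k * np.log(n) - 2 * loglik
print(f"AIC = {aic:.2f}, BIC = {bic:.2f}")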
Model Comparison Using AIC and BIC
Compare the AIC (or BIC) values across your candidate models. The model with the lowest value wins.
A few rules to keep the comparison valid:
- All models must be fit to the same dataset (same observations, same response variable).
- The likelihoods must be computed using the same method (e.g., don't mix maximum likelihood and restricted maximum likelihood across candidates).
The raw difference in scores between two models, often written ΔAIC or ΔBIC, tells you how strong the evidence is:
| Difference | Strength of Evidence |
|---|---|
| 0–2 | Weak (models are roughly equivalent) |
| 2–6 | Moderate |
| 6–10 | Strong |
| > 10 | Very strong |
These guidelines come from Burnham & Anderson for AIC and Kass & Raftery for BIC. They're rules of thumb, not hard cutoffs.
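One common workflow is to compute the difference of each candidate from the best score and attach the rule-of-thumb label. A minimal sketch (the function name and the example scores are hypothetical):

```python
def evidence_strength(delta: float) -> str:
    """Map a difference in AIC (or BIC) to the rule-of-thumb evidence label."""
    if delta <= 2:
        return "weak"
    elif delta <= 6:
        return "moderate"
    elif delta <= 10:
        return "strong"
    return "very strong"

# Hypothetical AIC scores for three candidate models
scores = {"model_a": 412.3, "model_b": 415.1, "model_c": 431.8}
best = min(scores.values())
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    delta = score - best
    print(f"{name}: delta = {delta:.1f} ({evidence_strength(delta)})")
```

Because the guidelines are rules of thumb, treat the labels as a reading aid, not a decision procedure: a delta of 2.1 and a delta of 1.9 are not meaningfully different.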
Model Complexity vs. Goodness of Fit
Balancing Fit and Complexity
Every time you add a parameter to a model, the fit to the training data improves (or at least doesn't get worse). But that improvement might just be capturing noise rather than real signal. AIC and BIC guard against this by attaching a cost to each additional parameter. A more complex model only "wins" if the gain in log-likelihood is large enough to offset the penalty.

Underfitting and Overfitting
Underfitting happens when a model is too simple to capture the real patterns in the data. These models have high bias and low variance. Residual plots will often show systematic structure that the model missed.
Overfitting happens when a model is so complex that it starts fitting the noise. These models have low bias but high variance, and they tend to perform poorly on new data even though they look great on the training set.
Information criteria sit between these extremes. The penalty term pushes you away from overfitting, while the log-likelihood term pushes you away from underfitting. The minimum-scoring model represents the criterion's best compromise.
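This tradeoff is easy to see by scoring polynomial fits of increasing degree on data with a known quadratic signal. A sketch under made-up simulation settings: the underfit linear model loses badly on log-likelihood, while high-degree models pay the parameter penalty without a matching gain in fit.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x = rng.uniform(-3, 3, size=n)
y = 2.0 - x + 0.5 * x**2 + rng.normal(size=n)  # true model is quadratic

def gaussian_aic_bic(degree):
    """Fit a polynomial of the given degree; return (AIC, BIC)."""
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    k = degree + 2  # coefficients (incl. intercept) plus error variance
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

for d in range(1, 7):
    a, b = gaussian_aic_bic(d)
    print(f"degree {d}: AIC = {a:.1f}, BIC = {b:.1f}")
```

Running this, the degree-1 fit scores far worse than the quadratic, while degrees above 2 improve the raw fit only marginally and so are penalized back.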
Comparing AIC and BIC
Differences in Penalty Terms
The core difference is in the penalty multiplier on k:
- AIC uses a fixed multiplier of 2.
- BIC uses ln(n), which grows with sample size.
For a dataset with n = 100, ln(100) ≈ 4.6, so BIC's penalty per parameter is more than double AIC's. At n = 10,000, ln(10,000) ≈ 9.2. This means BIC becomes increasingly conservative as your dataset grows, favoring simpler models.
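The growth of the BIC penalty is easy to tabulate directly:

```python
import math

# BIC's per-parameter penalty is ln(n); AIC's is a constant 2.
for n in (8, 100, 10_000, 1_000_000):
    print(f"n = {n:>9}: BIC penalty per parameter = {math.log(n):.2f} (AIC: 2)")
```

The crossover at n ≥ 8 is why, outside of toy datasets, BIC effectively always penalizes harder than AIC.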
Asymptotic Properties
These two criteria optimize for different things, and this distinction matters:
- AIC is asymptotically efficient. As n → ∞, it selects the model that minimizes prediction error (mean squared error). This makes AIC a natural choice when your primary goal is forecasting. The tradeoff is that AIC is not consistent: even with infinite data, it may select a model that's slightly more complex than the true data-generating process.
- BIC is consistent. As n → ∞, it selects the true model with probability approaching 1, assuming the true model is among your candidates. This makes BIC a better fit when your goal is identifying which variables truly belong in the model. The tradeoff is that with small samples, BIC can be too aggressive in dropping variables, leading to underfitting.
When to Use Which
| Criterion | Best suited for | Watch out for |
|---|---|---|
| AIC | Prediction-focused tasks; smaller samples | May retain extra variables that don't reflect the true model |
| BIC | Inference and variable identification; larger samples | May drop variables prematurely with small n; assumes the true model is a candidate |
In practice, if AIC and BIC agree on the same model, you can be fairly confident in that choice. When they disagree, let your analysis goal guide you: lean toward AIC for prediction, BIC for inference. And always check that the selected model makes substantive sense in the context of your problem.