Akaike vs Bayesian Information Criteria
Information Criteria for Model Selection
AIC and BIC are scoring tools that let you rank competing models by balancing how well each model fits the data against how many parameters it uses. Lower scores are better for both criteria.
AIC (Akaike Information Criterion) estimates the relative information lost by a given model. It accounts for goodness of fit and the number of parameters:

AIC = 2k - 2 ln(L̂)

where k is the number of estimated parameters (including the intercept and error variance) and L̂ is the maximized value of the likelihood function.
BIC (Bayesian Information Criterion) uses the same logic but introduces a stronger penalty that scales with sample size:

BIC = k ln(n) - 2 ln(L̂)

where n is the sample size. Because ln(n) > 2 whenever n ≥ 8, BIC penalizes additional parameters more heavily than AIC for virtually any real dataset.
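The two formulas translate directly into code. This is a minimal sketch (the function names `aic` and `bic` are my own, not from any particular library):

```python
import math

def aic(log_likelihood: float, k: int) -> float:
    """AIC = 2k - 2 ln(L-hat)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood: float, k: int, n: int) -> float:
    """BIC = k ln(n) - 2 ln(L-hat)."""
    return k * math.log(n) - 2 * log_likelihood

# Example: a model with maximized log-likelihood -120.5, 4 parameters,
# fit to 50 observations. Since ln(50) > 2, BIC exceeds AIC here.
print(aic(-120.5, k=4))        # 249.0
print(bic(-120.5, k=4, n=50))  # ≈ 256.65
```

Note that both functions take the *log*-likelihood directly, which is what fitting routines typically report; passing the raw likelihood by mistake is a common source of wildly wrong scores.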
Both criteria follow the principle of parsimony (Occam's razor): prefer the simplest model that adequately explains the data.
Assumptions and Considerations
- AIC and BIC do not require that candidate models be nested. You can compare non-nested models as long as they are fit to the exact same dataset and the likelihoods are computed on the same scale (same response variable, same observations).
- Information criteria give you a quantitative ranking, but they shouldn't be the only factor in your decision. Interpretability, computational cost, and the goal of the analysis (prediction vs. inference) all matter.
- Neither criterion tells you whether your best-scoring model is actually good in an absolute sense. They only tell you which candidate is best relative to the others.
Calculating AIC and BIC
Steps to Calculate AIC
- Count the number of estimated parameters k. In a linear regression, this includes each coefficient (including the intercept) plus the error variance, so a model with p predictors has k = p + 2.
- Fit the model and obtain the maximized log-likelihood ln(L̂).
- Plug into the formula: AIC = 2k - 2 ln(L̂).

Steps to Calculate BIC
- Count the number of estimated parameters k (same as above).
- Record the sample size n.
- Fit the model and obtain the maximized log-likelihood ln(L̂).
- Plug into the formula: BIC = k ln(n) - 2 ln(L̂).
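The steps above can be worked end to end for an ordinary least-squares regression. This sketch uses NumPy on simulated data (the data and seed are made up for illustration) and the Gaussian log-likelihood with the maximum-likelihood variance estimate σ̂² = RSS/n:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(scale=0.5, size=n)

# Fit y = b0 + b1*x by least squares (the ML fit under Gaussian errors)
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta) ** 2)

# Step 1: count parameters -- intercept, slope, and error variance
k = 3

# Step 2: maximized Gaussian log-likelihood with sigma^2 = RSS / n
sigma2 = rss / n
loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(sigma2) + 1)

# Step 3: plug into the formulas
aic = 2 * k - 2 * loglik
bic = k * np.log(n) - 2 * loglik
print(f"AIC = {aic:.2f}, BIC = {bic:.2f}")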
Model Comparison Using AIC and BIC
Compare the AIC (or BIC) values across your candidate models. The model with the lowest value wins.
A few rules to keep the comparison valid:
- All models must be fit to the same dataset (same observations, same response variable).
- The likelihoods must be computed using the same method (e.g., don't mix maximum likelihood and restricted maximum likelihood across candidates).
The raw difference in scores between two models, often written ΔAIC or ΔBIC, tells you how strong the evidence is:
| Difference | Strength of Evidence |
|---|---|
| 0–2 | Weak (models are roughly equivalent) |
| 2–6 | Moderate |
| 6–10 | Strong |
| > 10 | Very strong |
These guidelines come from Burnham & Anderson for AIC and Kass & Raftery for BIC. They're rules of thumb, not hard cutoffs.
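One common workflow is to compute the difference of each candidate from the best score and attach the rule-of-thumb label. A minimal sketch (the function name and the example scores are hypothetical):

```python
def evidence_strength(delta: float) -> str:
    """Map a difference in AIC (or BIC) to the rule-of-thumb evidence label."""
    if delta <= 2:
        return "weak"
    elif delta <= 6:
        return "moderate"
    elif delta <= 10:
        return "strong"
    return "very strong"

# Hypothetical AIC scores for three candidate models
scores = {"model_a": 412.3, "model_b": 415.1, "model_c": 431.8}
best = min(scores.values())
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    delta = score - best
    print(f"{name}: delta = {delta:.1f} ({evidence_strength(delta)})")
```

Because the guidelines are rules of thumb, treat the labels as a reading aid, not a decision procedure: a delta of 2.1 and a delta of 1.9 are not meaningfully different.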
Model Complexity vs. Goodness of Fit
Balancing Fit and Complexity
Every time you add a parameter to a model, the fit to the training data improves (or at least doesn't get worse). But that improvement might just be capturing noise rather than real signal. AIC and BIC guard against this by attaching a cost to each additional parameter. A more complex model only "wins" if the gain in log-likelihood is large enough to offset the penalty.

Underfitting and Overfitting
Underfitting happens when a model is too simple to capture the real patterns in the data. These models have high bias and low variance. Residual plots will often show systematic structure that the model missed.
Overfitting happens when a model is so complex that it starts fitting the noise. These models have low bias but high variance, and they tend to perform poorly on new data even though they look great on the training set.
Information criteria sit between these extremes. The penalty term pushes you away from overfitting, while the log-likelihood term pushes you away from underfitting. The minimum-scoring model represents the criterion's best compromise.
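This tradeoff is easy to see by scoring polynomial fits of increasing degree on data with a known quadratic signal. A sketch under made-up simulation settings: the underfit linear model loses badly on log-likelihood, while high-degree models pay the parameter penalty without a matching gain in fit.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x = rng.uniform(-3, 3, size=n)
y = 2.0 - x + 0.5 * x**2 + rng.normal(size=n)  # true model is quadratic

def gaussian_aic_bic(degree):
    """Fit a polynomial of the given degree; return (AIC, BIC)."""
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    k = degree + 2  # coefficients (incl. intercept) plus error variance
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

for d in range(1, 7):
    a, b = gaussian_aic_bic(d)
    print(f"degree {d}: AIC = {a:.1f}, BIC = {b:.1f}")
```

Running this, the degree-1 fit scores far worse than the quadratic, while degrees above 2 improve the raw fit only marginally and so are penalized back.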
Comparing AIC and BIC
Differences in Penalty Terms
The core difference is in the penalty multiplier on k:
- AIC uses a fixed multiplier of 2.
- BIC uses ln(n), which grows with sample size.
For a dataset with n = 100, ln(100) ≈ 4.6, so BIC's penalty per parameter is more than double AIC's. At n = 10,000, ln(10,000) ≈ 9.2. This means BIC becomes increasingly conservative as your dataset grows, favoring simpler models.
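The growth of the BIC penalty is easy to tabulate directly:

```python
import math

# BIC's per-parameter penalty is ln(n); AIC's is a constant 2.
for n in (8, 100, 10_000, 1_000_000):
    print(f"n = {n:>9}: BIC penalty per parameter = {math.log(n):.2f} (AIC: 2)")
```

The crossover at n ≥ 8 is why, outside of toy datasets, BIC effectively always penalizes harder than AIC.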
Asymptotic Properties
These two criteria optimize for different things, and this distinction matters:
- AIC is asymptotically efficient. As n → ∞, it selects the model that minimizes prediction error (mean squared error). This makes AIC a natural choice when your primary goal is forecasting. The tradeoff is that AIC is not consistent: even with infinite data, it may select a model that's slightly more complex than the true data-generating process.
- BIC is consistent. As n → ∞, it selects the true model with probability approaching 1, assuming the true model is among your candidates. This makes BIC a better fit when your goal is identifying which variables truly belong in the model. The tradeoff is that with small samples, BIC can be too aggressive in dropping variables, leading to underfitting.
When to Use Which
| Criterion | Best suited for | Watch out for |
|---|---|---|
| AIC | Prediction-focused tasks; smaller samples | May retain extra variables that don't reflect the true model |
| BIC | Inference and variable identification; larger samples | May drop variables prematurely with small n; assumes the true model is a candidate |
In practice, if AIC and BIC agree on the same model, you can be fairly confident in that choice. When they disagree, let your analysis goal guide you: lean toward AIC for prediction, BIC for inference. And always check that the selected model makes substantive sense in the context of your problem.