Overview of Model Selection
Model selection helps you figure out which statistical model, from a set of candidates, best explains your data without being unnecessarily complex. In Bayesian statistics, this process naturally incorporates prior knowledge and uncertainty, giving you a principled framework for comparing models rather than just picking one that "looks good."
The core tension is always the same: a more complex model will fit your training data better, but it might just be fitting noise. Model selection criteria formalize the trade-off between fit and simplicity.
Purpose of Model Selection
- Identify the most parsimonious model (simplest model that still adequately explains the data)
- Balance complexity against goodness-of-fit to avoid overfitting
- Improve predictive accuracy by favoring models that generalize to new data
- Highlight which variables and relationships actually matter for scientific understanding
Challenges in Model Comparison
- Complexity trade-offs: More parameters always improve in-sample fit, so you need principled ways to penalize unnecessary complexity.
- Non-nested models: Traditional likelihood ratio tests only work for nested models. When comparing structurally different models, you need alternatives like Bayes factors or information criteria.
- Large model spaces: With many candidate models (especially in high-dimensional settings), exhaustive comparison becomes computationally prohibitive.
- Model uncertainty: Often several models perform similarly, and committing to just one ignores useful information from the others.
Likelihood-Based Criteria
These criteria start from the likelihood function, which measures how probable the observed data are under a given model and its parameter estimates. The raw likelihood always improves with more parameters, so each criterion adds a penalty for complexity.
Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) finds the parameter values that make the observed data most probable. It gives you a baseline measure of model fit: the maximized likelihood $\hat{L}$. MLE serves as the foundation for both AIC and BIC, but on its own it doesn't penalize complexity, so it can't be used directly for model selection.
Akaike Information Criterion (AIC)
AIC penalizes model complexity by adding a term proportional to the number of estimated parameters:

$$\mathrm{AIC} = 2k - 2\ln(\hat{L})$$

where $k$ is the number of parameters and $\hat{L}$ is the maximized likelihood. Lower AIC values indicate a better balance of fit and parsimony.
AIC is rooted in Kullback-Leibler divergence: it estimates the relative information lost when a model approximates the true data-generating process. Compared to BIC, AIC tends to favor more complex models, particularly with large sample sizes, because its complexity penalty doesn't grow with the sample size $n$.
Bayesian Information Criterion (BIC)
BIC uses a stronger, sample-size-dependent penalty:

$$\mathrm{BIC} = k\ln(n) - 2\ln(\hat{L})$$

where $n$ is the number of observations. Because the penalty scales with $\ln(n)$, BIC penalizes additional parameters more heavily as your dataset grows.
BIC is consistent, meaning that as $n \to \infty$, it will select the true model (assuming it's among the candidates). It also has a direct connection to Bayes factors: the difference in BIC between two models approximates $-2\ln(\mathrm{BF}_{12})$, which is why BIC is often preferred in Bayesian settings.
AIC vs. BIC in practice: AIC optimizes for predictive accuracy and may keep extra parameters that help prediction. BIC optimizes for identifying the true model and tends to select simpler models. Your choice depends on whether your goal is prediction or explanation.
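A minimal sketch of both criteria, using hypothetical log-likelihoods and parameter counts for two candidate models (all numbers invented for illustration):

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2*ln(L-hat)."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k*ln(n) - 2*ln(L-hat)."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical fits to n = 50 observations.
n = 50
simple = {"log_lik": -110.0, "k": 2}    # fewer parameters, slightly worse fit
complex_ = {"log_lik": -106.5, "k": 5}  # better fit, more parameters

for name, m in [("simple", simple), ("complex", complex_)]:
    print(name,
          round(aic(m["log_lik"], m["k"]), 2),
          round(bic(m["log_lik"], m["k"], n), 2))
```

With these numbers AIC prefers the complex model (223.0 vs. 224.0) while BIC's $\ln(50) \approx 3.9$ penalty per parameter flips the ranking toward the simple one, illustrating the prediction-vs-explanation tension described above.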
Bayesian Model Selection
Unlike AIC and BIC, fully Bayesian methods integrate over the entire parameter space rather than relying on point estimates. This means they naturally account for parameter uncertainty.
Bayes Factors
A Bayes factor quantifies the relative evidence the data provide for one model over another:

$$\mathrm{BF}_{12} = \frac{p(y \mid M_1)}{p(y \mid M_2)}$$

Each term in the ratio is a marginal likelihood (see below). A $\mathrm{BF}_{12} = 10$ means the data are 10 times more probable under $M_1$ than under $M_2$.
Common interpretation guidelines (Kass and Raftery, 1995):
- $\mathrm{BF}_{12} < 1$: Evidence favors $M_2$
- $1$ to $3$: Weak evidence for $M_1$
- $3$ to $20$: Positive evidence for $M_1$
- $20$ to $150$: Strong evidence for $M_1$
- $\mathrm{BF}_{12} > 150$: Very strong evidence for $M_1$
One important caveat: Bayes factors can be sensitive to prior specifications, especially for parameters that appear in one model but not the other. Diffuse priors on those parameters tend to penalize the more complex model heavily, sometimes regardless of the data.
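For simple conjugate models the marginal likelihoods, and hence the Bayes factor, have closed forms. A sketch for a coin-flip comparison, with hypothetical data: $M_0$ fixes $p = 0.5$, while $M_1$ puts a uniform Beta(1, 1) prior on $p$, whose marginal likelihood integrates to $1/(n+1)$:

```python
from math import comb

def marginal_lik_point_null(y, n, p0=0.5):
    """p(y | M0): binomial likelihood at the fixed value p0."""
    return comb(n, y) * p0**y * (1 - p0)**(n - y)

def marginal_lik_uniform_prior(y, n):
    """p(y | M1): binomial likelihood integrated over a uniform
    Beta(1, 1) prior on p -- the integral has the closed form 1/(n + 1)."""
    return 1 / (n + 1)

# Hypothetical data: 8 heads in 10 flips.
y, n = 8, 10
bf_10 = marginal_lik_uniform_prior(y, n) / marginal_lik_point_null(y, n)
print(f"BF_10 = {bf_10:.2f}")
```

Here the Bayes factor comes out near 2, i.e. only weak evidence for the free-$p$ model on the scale above, despite 8/10 heads, because the uniform prior spreads $M_1$'s probability over many values of $p$ the data don't support.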
Posterior Model Probabilities
Once you have marginal likelihoods for all candidate models, you can compute the posterior probability of each model using Bayes' theorem:

$$p(M_k \mid y) = \frac{p(y \mid M_k)\, p(M_k)}{\sum_j p(y \mid M_j)\, p(M_j)}$$

Here $p(M_k)$ is your prior probability for model $M_k$. If you have no reason to prefer one model over another, equal priors are a common default. Posterior model probabilities let you rank all candidates on a common scale and feed directly into model averaging.
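The normalization is a one-liner once the marginal likelihoods are in hand. A sketch with hypothetical marginal likelihoods for three candidate models and equal priors:

```python
def posterior_model_probs(marginal_liks, priors=None):
    """Bayes' theorem over a discrete set of candidate models."""
    if priors is None:  # default: equal prior probability for each model
        priors = [1 / len(marginal_liks)] * len(marginal_liks)
    unnorm = [m * p for m, p in zip(marginal_liks, priors)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical marginal likelihoods p(y | M_k) for three models.
probs = posterior_model_probs([2.0e-5, 5.0e-5, 1.0e-5])
print([round(p, 3) for p in probs])  # → [0.25, 0.625, 0.125]
```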
Marginal Likelihood Estimation
The marginal likelihood (also called the model evidence) is the probability of the data under a model, integrated over all possible parameter values:

$$p(y \mid M) = \int p(y \mid \theta, M)\, p(\theta \mid M)\, d\theta$$
This integral is what makes Bayesian model selection powerful and difficult. It automatically penalizes complexity because models that spread their prior probability over a large parameter space will have lower marginal likelihoods unless the data strongly support that complexity.
Computing this integral analytically is rarely possible for realistic models. Common approximation methods include:
- Laplace approximation: Approximates the posterior as a Gaussian centered at the mode. Fast but can be inaccurate for non-normal posteriors.
- Bridge sampling: Uses samples from the posterior to estimate the marginal likelihood. More accurate but requires careful implementation.
- Harmonic mean estimator: Easy to compute from MCMC output but notoriously unstable (avoid this in practice).
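A sketch of the Laplace approximation in one dimension: $\ln p(y) \approx \ln\big[p(y \mid \hat\theta)\,p(\hat\theta)\big] + \tfrac{1}{2}\ln(2\pi / H)$, where $H$ is the negative second derivative of the log joint at the mode. The toy model below (simulated data, $y_i \sim N(\mu, 1)$ with prior $\mu \sim N(0, 1)$) is conjugate, so the exact marginal likelihood is available for comparison; because its posterior is exactly Gaussian, the approximation is exact here:

```python
import math
import random

random.seed(1)
n = 20
y = [random.gauss(0.5, 1.0) for _ in range(n)]
S, SS = sum(y), sum(v * v for v in y)

def log_joint(mu):
    """log p(y | mu) + log p(mu) for y_i ~ N(mu, 1), prior mu ~ N(0, 1)."""
    ll = -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((v - mu) ** 2 for v in y)
    lp = -0.5 * math.log(2 * math.pi) - 0.5 * mu ** 2
    return ll + lp

mu_hat = S / (n + 1)  # posterior mode (closed form for this conjugate model)
hessian = n + 1       # negative second derivative of log_joint (constant here)

# Laplace approximation: Gaussian centered at the mode.
laplace = log_joint(mu_hat) + 0.5 * math.log(2 * math.pi / hessian)

# Exact log marginal likelihood (available because the model is conjugate).
exact = (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(n + 1)
         - 0.5 * (SS - S ** 2 / (n + 1)))

print(round(laplace, 6), round(exact, 6))
```

For non-Gaussian posteriors the two values would diverge, which is exactly the inaccuracy the bullet above warns about.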
Cross-Validation Methods
Cross-validation assesses how well a model predicts data it hasn't seen. Rather than relying on theoretical penalties for complexity, it directly measures out-of-sample performance.

K-Fold Cross-Validation
- Split your data into $K$ roughly equal subsets (folds).
- For each fold, fit the model on the remaining $K - 1$ folds.
- Evaluate predictive performance on the held-out fold.
- Average the performance across all folds.
Common choices are $K = 5$ or $K = 10$. Larger $K$ gives less biased estimates but costs more computation, since you refit the model $K$ times.
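The steps above can be sketched in a few lines. This toy version uses a mean-only model (predict the training-fold average) and squared-error loss; the data are hypothetical:

```python
import random

def k_fold_mse(data, k=5, seed=0):
    """Estimate out-of-sample MSE of a mean-only model via K-fold CV."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # K roughly equal folds
    fold_mses = []
    for fold in folds:
        held = set(fold)
        train = [data[i] for i in idx if i not in held]
        pred = sum(train) / len(train)             # fit on the other K-1 folds
        errs = [(data[i] - pred) ** 2 for i in fold]
        fold_mses.append(sum(errs) / len(errs))    # score on the held-out fold
    return sum(fold_mses) / k                      # average across folds

data = [2.1, 1.9, 2.5, 2.0, 1.8, 2.2, 2.4, 1.7, 2.3, 2.0]
print(round(k_fold_mse(data, k=5), 4))
```

Swapping in a real model only changes the fit and predict lines; the fold bookkeeping stays the same.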
Leave-One-Out Cross-Validation (LOO-CV)
LOO-CV is the special case where $K = n$ (each observation is its own fold). You train on $n - 1$ points and predict the single held-out point, repeating for every observation. This gives a nearly unbiased estimate of predictive performance but is computationally expensive for large datasets.
In Bayesian practice, Pareto smoothed importance sampling (PSIS-LOO) lets you approximate LOO-CV from a single model fit, avoiding the need to refit the model $n$ times. This is implemented in packages like loo in R.
Bayesian Cross-Validation
Standard cross-validation uses point predictions, but Bayesian cross-validation evaluates the full posterior predictive distribution on held-out data. This means you're assessing not just whether the model's predictions are close, but whether the model's uncertainty calibration is appropriate. PSIS-LOO is the most common implementation of this idea.
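For conjugate models, Bayesian LOO can even be computed exactly without refitting. A sketch using the toy model $y_i \sim N(\mu, 1)$ with prior $\mu \sim N(0, 1)$ and simulated data: leaving out $y_i$, the posterior for $\mu$ is $N(S_{-i}/n,\, 1/n)$, so the posterior predictive for the held-out point is $N(S_{-i}/n,\, 1/n + 1)$, and we sum its log density over all points:

```python
import math
import random

random.seed(2)
n = 15
y = [random.gauss(1.0, 1.0) for _ in range(n)]

def lpd_normal(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

# Exact leave-one-out expected log predictive density (elpd_loo):
# for each y_i, evaluate the full posterior predictive built from the
# other n - 1 observations at the held-out point.
S = sum(y)
elpd_loo = sum(
    lpd_normal(yi, (S - yi) / n, 1 / n + 1)  # predictive density for y_i
    for yi in y
)
print(round(elpd_loo, 3))  # higher (less negative) elpd = better prediction
```

Note the predictive variance $1/n + 1$ combines parameter uncertainty with observation noise, which is what "evaluating the full posterior predictive" buys you over a point prediction.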
Information Theoretic Approaches
These methods frame model selection in terms of information theory, measuring how much information a model captures about the true data-generating process.
Kullback-Leibler Divergence
The KL divergence measures how much information is lost when you approximate a true distribution $p$ with a model distribution $q$:

$$D_{\mathrm{KL}}(p \,\|\, q) = \int p(y) \ln \frac{p(y)}{q(y)}\, dy$$
You can't compute this directly (since you don't know the true distribution), but it provides the theoretical motivation for AIC and DIC. A model with lower KL divergence from the truth is a better model.
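For discrete distributions the integral becomes a sum, which makes the idea easy to verify numerically. A sketch with invented probability vectors:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

truth = [0.5, 0.3, 0.2]
model_a = [0.4, 0.4, 0.2]  # close to the truth
model_b = [0.1, 0.1, 0.8]  # far from the truth
print(round(kl_divergence(truth, model_a), 4))
print(round(kl_divergence(truth, model_b), 4))
```

The divergence from the truth to `model_a` is an order of magnitude smaller than to `model_b`, matching the intuition that less information is lost by the closer approximation. Note KL is not symmetric: `kl_divergence(p, q)` generally differs from `kl_divergence(q, p)`.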
Deviance Information Criterion (DIC)
DIC was designed specifically for Bayesian models fit with MCMC:

$$\mathrm{DIC} = \bar{D} + p_D$$

- $\bar{D}$ is the posterior mean deviance (the average of $D(\theta) = -2\ln p(y \mid \theta)$ over posterior samples)
- $p_D$ is the effective number of parameters, calculated as $p_D = \bar{D} - D(\bar{\theta})$, where $D(\bar{\theta})$ is the deviance evaluated at the posterior mean $\bar{\theta}$
DIC works well for hierarchical models where counting parameters isn't straightforward. However, it relies on the posterior mean as a point summary, which can be problematic when posteriors are skewed or multimodal.
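A sketch of the DIC calculation from posterior draws, using a toy conjugate model ($y_i \sim N(\mu, 1)$, prior $\mu \sim N(0, 1)$, simulated data) so we can sample the posterior directly instead of running MCMC:

```python
import math
import random

rng = random.Random(4)
n = 12
y = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Posterior draws of mu (conjugate posterior: N(S / (n + 1), 1 / (n + 1))).
S = sum(y)
post_mean, post_sd = S / (n + 1), math.sqrt(1 / (n + 1))
draws = [rng.gauss(post_mean, post_sd) for _ in range(4000)]

def deviance(mu):
    """D(mu) = -2 * log p(y | mu)."""
    ll = -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((v - mu) ** 2 for v in y)
    return -2 * ll

d_bar = sum(deviance(mu) for mu in draws) / len(draws)  # posterior mean deviance
d_at_mean = deviance(sum(draws) / len(draws))           # deviance at posterior mean
p_d = d_bar - d_at_mean                                 # effective parameter count
dic = d_bar + p_d
print(round(p_d, 2), round(dic, 2))
```

With one well-identified parameter, `p_d` lands near 1, which is the sanity check that makes DIC useful when raw parameter counting fails (as in hierarchical models).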
Watanabe-Akaike Information Criterion (WAIC)
WAIC improves on DIC by using the full posterior distribution rather than a point estimate:

$$\mathrm{WAIC} = -2\,(\mathrm{lppd} - p_{\mathrm{WAIC}})$$

where lppd is the log pointwise predictive density (summing the log of the average predictive density for each observation) and $p_{\mathrm{WAIC}}$ is a correction term for the effective number of parameters, computed as the sum over observations of the posterior variance of the log predictive density.
WAIC is asymptotically equivalent to LOO-CV and is more robust than DIC for models with non-normal posteriors. It's computed from posterior samples, making it straightforward to calculate after running MCMC.
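A sketch of that computation, again on a toy conjugate model ($y_i \sim N(\mu, 1)$, prior $\mu \sim N(0, 1)$, simulated data) so the posterior draws come from a known distribution rather than MCMC:

```python
import math
import random

rng = random.Random(3)
n = 12
y = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Posterior draws of mu (conjugate posterior: N(S / (n + 1), 1 / (n + 1))).
S = sum(y)
post_mean, post_sd = S / (n + 1), math.sqrt(1 / (n + 1))
draws = [rng.gauss(post_mean, post_sd) for _ in range(4000)]

def log_lik(yi, mu):
    """Pointwise log likelihood of one observation."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (yi - mu) ** 2

lppd, p_waic = 0.0, 0.0
for yi in y:
    lls = [log_lik(yi, mu) for mu in draws]
    mean_lik = sum(math.exp(l) for l in lls) / len(lls)
    lppd += math.log(mean_lik)  # log of the average predictive density
    mean_ll = sum(lls) / len(lls)
    # Posterior variance of the pointwise log predictive density.
    p_waic += sum((l - mean_ll) ** 2 for l in lls) / (len(lls) - 1)

waic = -2 * (lppd - p_waic)
print(round(p_waic, 2), round(waic, 2))
```

Everything here is a function of the pointwise log-likelihood matrix (observations × posterior draws), which is why WAIC is so cheap to compute after MCMC.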
Predictive Performance Measures
These are general-purpose metrics for evaluating how close a model's predictions are to observed values. They're used across both Bayesian and frequentist settings.
Mean Squared Error (MSE)
MSE penalizes large errors disproportionately because of the squaring. It's the standard loss function for regression, but it can be dominated by outliers.
Mean Absolute Error (MAE)
MAE treats all errors proportionally and is more robust to outliers than MSE. It's also easier to interpret since it's in the same units as the outcome variable.
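Both metrics in a short sketch, with invented predictions whose last entry is an outlier error, to show MSE's sensitivity to it:

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 3.0, 10.0]  # last prediction misses by 3 units
print(mse(y_true, y_pred))  # → 2.375 (dominated by the squared error of 9)
print(mae(y_true, y_pred))  # → 1.0  (treats the large error proportionally)
```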
R-Squared and Adjusted R-Squared
$R^2$ tells you the proportion of variance in the outcome explained by the model. It always increases (or stays the same) when you add predictors, so adjusted $R^2$ corrects for this by penalizing additional predictors that don't meaningfully improve fit. In Bayesian contexts, analogous measures like Bayesian $R^2$ use the posterior predictive distribution rather than point estimates.
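A sketch of both quantities from residuals, with invented data ($n = 6$ observations, $k = 2$ predictors assumed for the adjustment):

```python
def r_squared(y_true, y_pred):
    """Proportion of outcome variance explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Penalize R^2 for the number of predictors k, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_pred = [1.1, 2.2, 2.8, 4.1, 4.9, 6.2]
r2 = r_squared(y_true, y_pred)
print(round(r2, 4), round(adjusted_r_squared(r2, n=6, k=2), 4))
```

The adjusted value is always at most the raw value, and the gap widens as $k$ grows relative to $n$.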
Model Averaging
Sometimes no single model is clearly best, and committing to one throws away useful information. Model averaging combines multiple models to produce more robust predictions and honest uncertainty estimates.
Bayesian Model Averaging (BMA)
BMA weights each model's contribution by its posterior probability:

$$p(\Delta \mid y) = \sum_k p(\Delta \mid M_k, y)\, p(M_k \mid y)$$

where $\Delta$ is the quantity of interest (a parameter or a prediction).
This means models with stronger data support contribute more to your final inference. BMA naturally propagates model uncertainty into your parameter estimates and predictions, giving you wider (and more honest) credible intervals than any single model would.
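For point predictions the averaging is just a probability-weighted sum. A sketch with hypothetical posterior model probabilities and per-model predictions for one new observation:

```python
# Hypothetical posterior model probabilities p(M_k | y) and each model's
# point prediction for a single new observation.
post_probs = {"M1": 0.6, "M2": 0.3, "M3": 0.1}
predictions = {"M1": 4.2, "M2": 3.8, "M3": 5.0}

# BMA prediction: sum of each model's prediction weighted by its
# posterior probability.
bma_prediction = sum(post_probs[m] * predictions[m] for m in post_probs)
print(round(bma_prediction, 3))  # → 4.16
```

Averaging full predictive distributions (rather than points) works the same way and is what widens the credible intervals mentioned above: between-model disagreement adds variance on top of each model's own uncertainty.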
Frequentist Model Averaging
Frequentist model averaging uses weights derived from information criteria rather than posterior probabilities. For example, AIC weights for model $i$ are:

$$w_i = \frac{\exp(-\Delta_i / 2)}{\sum_j \exp(-\Delta_j / 2)}$$

where $\Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min}$. This approach is computationally lighter than full BMA and doesn't require computing marginal likelihoods, but it lacks the coherent probabilistic interpretation of the Bayesian version.
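The weight formula in a short sketch, applied to three hypothetical AIC values:

```python
import math

def akaike_weights(aics):
    """Convert a list of AIC values into normalized model weights."""
    best = min(aics)
    deltas = [a - best for a in aics]          # differences from the best AIC
    raw = [math.exp(-d / 2) for d in deltas]   # relative likelihoods
    total = sum(raw)
    return [r / total for r in raw]

weights = akaike_weights([100.0, 102.0, 110.0])
print([round(w, 3) for w in weights])
```

A delta of 2 costs a factor of $e^{-1} \approx 0.37$ in weight, and a delta of 10 effectively removes the model from the average, which matches the common rule of thumb that models more than about 10 AIC units behind the best carry negligible support.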
Practical Considerations
Computational Complexity
Full Bayesian methods (Bayes factors, marginal likelihoods) can be computationally demanding, especially for complex models or large datasets. Information criteria like WAIC and approximations like PSIS-LOO offer good trade-offs between accuracy and computational cost. When choosing a method, consider whether the added precision of a more expensive approach actually changes your conclusions.
Sample Size Effects
With small samples, most criteria become unreliable, and simpler models are generally safer. BIC's asymptotic consistency only kicks in with reasonably large $n$. AIC has a small-sample correction (AICc) that's worth using when $n/k$ is small (a common rule of thumb is $n/k < 40$). Cross-validation estimates also become noisy with small datasets because each fold contains very few observations.
Model Interpretability vs. Complexity
A model that predicts well but can't be explained may not be useful for scientific understanding. In many applied settings, a slightly worse-fitting but interpretable model is preferred. This is a judgment call that depends on your goals: if you need to understand mechanisms, lean toward simpler models; if you need accurate forecasts, let the criteria guide you toward complexity.
Limitations and Criticisms
Overfitting Concerns
Even with complexity penalties, model selection criteria can sometimes favor models that fit noise. This is especially true with small samples or when the candidate model set is very large (the "garden of forking paths" problem). Cross-validation and posterior predictive checks provide additional safeguards beyond relying on a single criterion.
Model Misspecification
All model selection criteria assume the true model is among your candidates, or at least that one candidate is a reasonable approximation. If every candidate model is wrong in important ways, the "best" model according to any criterion may still give misleading inferences. Posterior predictive checks help you detect misspecification by comparing simulated data from your model to the actual data.
Sensitivity to Prior Choices
Bayes factors and posterior model probabilities can shift substantially with different priors, particularly for parameters unique to one model. If you place a very diffuse prior on a parameter, the marginal likelihood for that model drops because the prior probability is spread thinly over values the data don't support. Running sensitivity analyses with different reasonable priors is essential. Methods like WAIC and LOO-CV, which condition on the posterior rather than integrating over the prior, are less affected by this issue.