Basics of Model Comparison
In Bayesian statistics, you rarely have just one model for your data. Model comparison gives you a principled way to evaluate competing hypotheses by asking: which model is best supported by the evidence, and by how much?
This matters because simply picking the model with the best fit leads to overfitting. Bayesian model comparison methods balance fit against complexity, letting you choose models that generalize well to new data.
Purpose of Model Comparison
- Identify which model best explains the observed data
- Quantify the relative support for different models (not just pick a winner)
- Guard against overfitting by penalizing unnecessary complexity
- Facilitate scientific inference by formally comparing alternative hypotheses
Types of Models Compared
- Nested models: one model is a special case of another (e.g., a regression with 3 predictors vs. the same regression with 2 of those predictors)
- Non-nested models: models with fundamentally different structures or predictor sets
- Linear vs. nonlinear models
- Parametric vs. nonparametric models
- Models that share the same likelihood but differ in their prior distributions
Bayes Factors
Bayes factors are the most direct Bayesian tool for model comparison. They tell you how much the observed data shift the relative plausibility of two models, without requiring the models to be nested.
Definition
The Bayes factor comparing model $M_1$ to model $M_2$ is the ratio of their marginal likelihoods:

$$BF_{12} = \frac{p(y \mid M_1)}{p(y \mid M_2)}, \qquad p(y \mid M_k) = \int p(y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k$$
Each marginal likelihood is obtained by integrating the likelihood over the entire prior distribution of that model's parameters. This integration is what gives Bayes factors their built-in complexity penalty: a model with many parameters spreads its prior probability across a large parameter space, so it must fit the data well across that space to earn a high marginal likelihood.
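To make this concrete, here is a minimal sketch in Python (NumPy/SciPy) for a case where the marginal likelihoods are available in closed form: two beta-binomial models for the same coin-flip data that differ only in their priors. The data, priors, and model labels are all illustrative choices, not a prescription.

```python
# A minimal sketch of a Bayes factor computed from closed-form marginal
# likelihoods in a beta-binomial setting. Data and priors are illustrative.
import numpy as np
from scipy.special import betaln, gammaln

def log_marginal_likelihood(y, n, a, b):
    """log p(y | M) for y successes in n Bernoulli trials under a Beta(a, b) prior on theta."""
    log_binom = gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)
    return log_binom + betaln(a + y, b + n - y) - betaln(a, b)

y, n = 62, 100                                        # illustrative data: 62 successes in 100 trials
log_m1 = log_marginal_likelihood(y, n, a=1, b=1)      # M1: uniform Beta(1, 1) prior on theta
log_m2 = log_marginal_likelihood(y, n, a=20, b=20)    # M2: prior concentrated near theta = 0.5

bf_12 = np.exp(log_m1 - log_m2)
print(f"log p(y | M1) = {log_m1:.3f}")
print(f"log p(y | M2) = {log_m2:.3f}")
print(f"BF_12 = {bf_12:.2f}")                         # values > 1 favor M1, < 1 favor M2
```

Because the two models share a likelihood and differ only in their priors, the Bayes factor here is driven entirely by the prior choice, which previews the prior sensitivity discussed under the limitations below.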
Interpretation
A $BF_{12} = 8$ means the data are 8 times more likely under $M_1$ than under $M_2$. Jeffreys' scale provides rough interpretive guidelines:
| $BF_{12}$ | Strength of evidence for $M_1$ |
|---|---|
| 1 to 3 | Weak (barely worth mentioning) |
| 3 to 20 | Positive |
| 20 to 150 | Strong |
| > 150 | Very strong |
Values below 1 favor $M_2$. Log Bayes factors are often reported because they're symmetric around zero and easier to work with.
Advantages and Limitations
- Advantages:
- Naturally implement Occam's razor through the marginal likelihood integral
- Work for non-nested models
- Provide a continuous, interpretable measure of evidence
- Limitations:
- Sensitive to prior choice, especially for vague priors; with improper priors the marginal likelihood is defined only up to an arbitrary constant, so the Bayes factor itself becomes ill-defined. Two analysts with different priors can get very different Bayes factors.
- Computationally demanding for complex models because the marginal likelihood integral is often high-dimensional
- Can be numerically unstable in high-dimensional parameter spaces
Information Criteria
Information criteria offer computationally cheaper alternatives to Bayes factors. They estimate out-of-sample predictive performance by combining a measure of in-sample fit with a penalty for complexity.
Akaike Information Criterion (AIC)
AIC estimates the expected out-of-sample prediction error (specifically, the Kullback-Leibler divergence):

$$\mathrm{AIC} = -2 \log \hat{L} + 2k$$

where $\hat{L}$ is the maximized likelihood and $k$ is the number of estimated parameters. Lower AIC is better. The $2k$ term penalizes complexity, but the penalty is relatively mild. AIC is derived from frequentist assumptions and works best with large samples. Strictly speaking, it's not a Bayesian criterion, but it's commonly used alongside Bayesian methods.
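As a quick illustration, the sketch below computes AIC for two ordinary least squares fits with Gaussian errors. The data are simulated and the predictor names are arbitrary, so treat it as a template rather than a recipe.

```python
# A minimal sketch of AIC = -2 log L_hat + 2k for two OLS fits; all data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(scale=1.5, size=n)     # x2 is irrelevant by construction

def gaussian_aic(X, y):
    """AIC for an OLS fit, where OLS coincides with the maximum likelihood estimate."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = np.mean(resid**2)                          # MLE of the error variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = X.shape[1] + 1                                  # regression coefficients plus the variance
    return -2 * log_lik + 2 * k

X_small = np.column_stack([np.ones(n), x1])             # intercept + x1
X_large = np.column_stack([np.ones(n), x1, x2])         # intercept + x1 + x2
print("AIC, smaller model:", round(gaussian_aic(X_small, y), 1))
print("AIC, larger model: ", round(gaussian_aic(X_large, y), 1))   # lower is better
```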
Bayesian Information Criterion (BIC)
BIC applies a stronger complexity penalty that grows with sample size:

$$\mathrm{BIC} = -2 \log \hat{L} + k \log n$$

where $n$ is the number of observations. For any sample with $n \geq 8$ (so that $\log n > 2$), the BIC penalty exceeds the AIC penalty, so BIC tends to favor simpler models. A key property: for large samples, $\mathrm{BIC}_1 - \mathrm{BIC}_2 \approx -2 \log BF_{12}$, giving BIC a direct connection to Bayes factors. BIC is also consistent, meaning it selects the true model (if it's among the candidates) as $n \to \infty$.
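The sketch below contrasts the two penalties and uses the BIC difference as a rough stand-in for $2 \log BF_{12}$. The deviances and parameter counts are placeholder numbers for two hypothetical fitted models.

```python
# A minimal sketch comparing AIC and BIC penalties; deviances are made-up placeholders.
import numpy as np

n = 500                                  # sample size
dev_1, k_1 = 1402.6, 4                   # model 1: deviance (-2 log L_hat) and parameter count
dev_2, k_2 = 1393.0, 7                   # model 2: better fit, more parameters (both illustrative)

aic_1, aic_2 = dev_1 + 2 * k_1, dev_2 + 2 * k_2
bic_1, bic_2 = dev_1 + k_1 * np.log(n), dev_2 + k_2 * np.log(n)
print("AIC:", round(aic_1, 1), round(aic_2, 1))   # mild penalty: the larger model wins here
print("BIC:", round(bic_1, 1), round(bic_2, 1))   # log(500) ~ 6.2 per parameter: the simpler model wins

# Large-sample link to Bayes factors: BIC_2 - BIC_1 ~ 2 log BF_12 (here, evidence for model 1).
print("Approximate 2 log BF_12:", round(bic_2 - bic_1, 1))
```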
Deviance Information Criterion (DIC)
DIC was designed specifically for Bayesian hierarchical models:

$$\mathrm{DIC} = D(\bar{\theta}) + 2 p_D$$

Here $D(\bar{\theta})$ is the deviance evaluated at the posterior mean of the parameters, and $p_D$ is the effective number of parameters, which accounts for the fact that hierarchical priors shrink parameters and reduce effective complexity. DIC works well when the posterior is roughly normal, but it can give misleading results for mixture models or posteriors with multiple modes.
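Here is a minimal sketch of the DIC calculation for a toy normal-mean model, using conjugate posterior draws in place of MCMC output; the prior, data, and number of draws are all illustrative.

```python
# A minimal DIC sketch: D(theta_bar) + 2 * p_D, with p_D = mean deviance - deviance at the mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, scale=2.0, size=50)            # simulated "observed" data
sigma = 2.0                                            # error sd assumed known

# Conjugate posterior for the mean under a Normal(0, 10^2) prior.
prior_mu, prior_sd = 0.0, 10.0
post_prec = 1 / prior_sd**2 + len(y) / sigma**2
post_mean = (prior_mu / prior_sd**2 + y.sum() / sigma**2) / post_prec
post_sd = np.sqrt(1 / post_prec)
mu_draws = rng.normal(post_mean, post_sd, size=4000)   # stand-in for MCMC draws

def deviance(mu):
    """D(mu) = -2 * log p(y | mu)."""
    return -2 * stats.norm.logpdf(y, loc=mu, scale=sigma).sum()

mean_deviance = np.mean([deviance(m) for m in mu_draws])
dev_at_mean = deviance(mu_draws.mean())
p_d = mean_deviance - dev_at_mean                      # effective number of parameters
dic = dev_at_mean + 2 * p_d                            # equivalently mean_deviance + p_d
print(f"p_D ~ {p_d:.2f}, DIC ~ {dic:.1f}")
```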
Cross-Validation Methods
Cross-validation directly measures how well a model predicts data it hasn't seen. This makes it less dependent on specific modeling assumptions than information criteria.

Leave-One-Out Cross-Validation (LOO-CV)
- Remove one data point from the dataset
- Fit the model to the remaining points
- Calculate the predictive density for the held-out point
- Repeat for every data point
- Sum the log predictive densities to get the total LOO score
This is computationally expensive since it requires $n$ separate model fits, one per observation. In practice, Pareto-smoothed importance sampling (PSIS-LOO) approximates LOO-CV from a single model fit by reweighting posterior samples. The loo package in R implements this efficiently.
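The following sketch walks through the brute-force version of these steps for a conjugate normal-mean model, where each refit is cheap; in real workflows you would reach for PSIS-LOO rather than refitting $n$ times. The data and priors are simulated placeholders.

```python
# A minimal brute-force LOO-CV sketch for a conjugate normal-mean model with known variance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.normal(loc=1.0, scale=1.0, size=40)    # simulated "observed" data
sigma = 1.0                                    # known error sd
prior_mu, prior_sd = 0.0, 5.0

def posterior(y_subset):
    """Conjugate Normal posterior (mean, sd) for the unknown mean."""
    prec = 1 / prior_sd**2 + len(y_subset) / sigma**2
    mean = (prior_mu / prior_sd**2 + y_subset.sum() / sigma**2) / prec
    return mean, np.sqrt(1 / prec)

loo_log_densities = []
for i in range(len(y)):
    y_minus_i = np.delete(y, i)                # remove one point, refit on the rest
    m, s = posterior(y_minus_i)
    pred_sd = np.sqrt(sigma**2 + s**2)         # posterior predictive sd for the held-out point
    loo_log_densities.append(stats.norm.logpdf(y[i], loc=m, scale=pred_sd))

elpd_loo = np.sum(loo_log_densities)           # total LOO score (higher is better)
print(f"ELPD (LOO) = {elpd_loo:.2f}")
```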
K-Fold Cross-Validation
- Divide the data into $K$ roughly equal subsets (folds)
- Hold out one fold, fit the model on the remaining folds
- Evaluate predictive performance on the held-out fold
- Rotate through all folds and aggregate results
Common choices are $K = 5$ or $K = 10$. K-fold is more computationally tractable than full LOO-CV for large datasets, though it introduces some variance depending on how the folds are split.
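Here is a minimal sketch of the fold rotation, reusing the same toy conjugate normal-mean model as the LOO example above; $K$, the data, and the priors are illustrative.

```python
# A minimal K-fold cross-validation sketch for a conjugate normal-mean model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.normal(loc=1.0, scale=1.0, size=40)
sigma, prior_mu, prior_sd = 1.0, 0.0, 5.0
K = 5

indices = rng.permutation(len(y))              # shuffle before assigning folds
folds = np.array_split(indices, K)             # K roughly equal folds

fold_scores = []
for k in range(K):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    y_train, y_test = y[train_idx], y[test_idx]
    prec = 1 / prior_sd**2 + len(y_train) / sigma**2
    m = (prior_mu / prior_sd**2 + y_train.sum() / sigma**2) / prec
    pred_sd = np.sqrt(sigma**2 + 1 / prec)     # posterior predictive sd for held-out points
    fold_scores.append(stats.norm.logpdf(y_test, loc=m, scale=pred_sd).sum())

print(f"{K}-fold ELPD estimate = {sum(fold_scores):.2f}")
```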
Bayesian Cross-Validation
Standard cross-validation uses point estimates, but Bayesian cross-validation uses the full posterior predictive distribution for evaluation. This means uncertainty in parameter estimates propagates into the predictive assessment. The key output is the expected log pointwise predictive density (ELPD), which measures how well the model's predictive distribution covers the observed data. Higher ELPD is better.
Posterior Predictive Checks
While the methods above compare models to each other, posterior predictive checks ask a different question: does this model adequately describe the data at all? A model can be the "best" among candidates and still be a poor fit.
How They Work
- Draw parameter values from the posterior distribution
- Simulate new datasets (called replicated data, $y^{\mathrm{rep}}$) from the model using those parameter values
- Compare features of the simulated data to the same features of the observed data
- Systematic discrepancies reveal where the model fails
Visual vs. Quantitative Checks
Visual checks:
- Overlay density plots of observed data and multiple simulated datasets
- Plot residual distributions from replicated data
- Use Q-Q plots to check distributional assumptions
Quantitative checks:
- Compute summary statistics (mean, variance, skewness, min, max) for both observed and replicated data
- Calculate discrepancy measures that target specific model assumptions
- Compare test statistics across many replicated datasets
Posterior Predictive p-Values
A posterior predictive p-value is the proportion of replicated datasets where a chosen test statistic is more extreme than the observed value. Values near 0.5 suggest the model captures that feature well. Values near 0 or 1 signal a problem: the model consistently over- or under-predicts that aspect of the data. These are not the same as classical p-values and should not be interpreted with the same thresholds.
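The sketch below runs the replicate-and-compare steps described above for a deliberately misspecified example (a normal model fit to heavier-tailed data) and computes a posterior predictive p-value for the sample variance. The model, data, and test statistic are all illustrative choices.

```python
# A minimal posterior predictive check with a posterior predictive p-value.
import numpy as np

rng = np.random.default_rng(4)
y_obs = rng.standard_t(df=3, size=100)         # heavier-tailed than the model below assumes
sigma = 1.0                                    # the model (wrongly) fixes the error sd at 1

# Posterior draws of the mean under a normal model with known sd and a flat prior.
mu_draws = rng.normal(y_obs.mean(), sigma / np.sqrt(len(y_obs)), size=2000)

# One replicated dataset per posterior draw, scored by a test statistic (the variance).
t_obs = y_obs.var()
t_rep = np.empty(len(mu_draws))
for i, mu in enumerate(mu_draws):
    y_rep = rng.normal(loc=mu, scale=sigma, size=len(y_obs))   # simulate y_rep from the model
    t_rep[i] = y_rep.var()

ppp = np.mean(t_rep >= t_obs)                  # posterior predictive p-value
print(f"T(y_obs) = {t_obs:.2f}, posterior predictive p-value = {ppp:.3f}")
# A value near 0 or 1 flags that the model cannot reproduce this feature of the data.
```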
Model Averaging
Sometimes no single model clearly dominates, or you want predictions that account for uncertainty about which model is correct. Model averaging addresses this.
Bayesian Model Averaging (BMA)
Instead of committing to one model, BMA weights each model's predictions by its posterior probability:

$$p(\Delta \mid y) = \sum_{k} p(\Delta \mid M_k, y)\, p(M_k \mid y)$$

Here $\Delta$ is whatever quantity you're trying to estimate or predict, and $p(M_k \mid y)$ is the posterior model probability (derived from the marginal likelihoods and prior model probabilities). BMA often produces better-calibrated predictions than selecting a single model, especially when several models have similar posterior support.
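As a small numerical sketch, the snippet below averages two models' predictive means and variances using assumed posterior model probabilities; in a real analysis those weights would come from the marginal likelihoods, and every number here is illustrative.

```python
# A minimal BMA sketch: mix two models' predictive distributions by posterior model probability.
import numpy as np

post_prob = np.array([0.7, 0.3])     # assumed posterior model probabilities p(M_k | y)
pred_mean = np.array([2.1, 2.8])     # each model's posterior predictive mean for the same quantity
pred_sd = np.array([0.5, 0.9])       # each model's posterior predictive sd

# BMA predictive mean is the probability-weighted mixture mean.
bma_mean = np.sum(post_prob * pred_mean)

# BMA predictive variance: within-model variance plus between-model spread.
bma_var = np.sum(post_prob * (pred_sd**2 + (pred_mean - bma_mean)**2))

print(f"BMA mean = {bma_mean:.2f}, BMA sd = {np.sqrt(bma_var):.2f}")
```

Note that the between-model term makes the BMA predictive distribution wider than any single model's, which is how model uncertainty shows up in the predictions.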
Occam's Window
With many candidate models, most will have negligible posterior probability. Occam's window trims the model set to reduce computation:
- Symmetric Occam's window: exclude any model whose Bayes factor against the best model falls below a threshold (e.g., $1/20$)
- Asymmetric Occam's window: also exclude complex models that are outperformed by simpler nested alternatives
This keeps the model average focused on models with meaningful support.
Reversible Jump MCMC
When models differ in the number of parameters (different dimensionality), standard MCMC can't move between them. Reversible jump MCMC (RJMCMC) solves this by designing proposals that jump between model spaces of different dimensions. The sampler simultaneously explores parameter values and model identity, producing posterior model probabilities as a byproduct. RJMCMC is powerful for variable selection problems (e.g., which predictors belong in a regression), but designing good between-model proposals requires care.

Practical Considerations
Computational Complexity
Model comparison costs scale with both model complexity and the number of candidates. Some strategies for managing this:
- Use PSIS-LOO instead of brute-force LOO-CV
- Consider approximation methods like variational inference or Laplace approximation for marginal likelihood estimation
- Leverage parallel computing for cross-validation and simulation-based methods
- Be deliberate about which models to compare rather than exhaustively searching a huge model space
Sensitivity Analysis
Because Bayes factors are sensitive to prior choice, you should check whether your conclusions change under reasonable alternative priors. This involves:
- Varying prior distributions and hyperparameters across a plausible range
- Checking whether model rankings remain stable
- Identifying cases where results depend heavily on a specific prior assumption
If your model comparison results flip under modest prior changes, that's a sign the data alone aren't strongly informative about model choice.
Handling Model Uncertainty
No single model is likely to be "true." Responsible Bayesian practice acknowledges this by:
- Reporting results from multiple plausible models, not just the winner
- Using model averaging for predictions and parameter estimates when appropriate
- Presenting sensitivity analyses so readers can judge robustness
- Being transparent about the set of models considered and why they were chosen
Advanced Techniques
Approximate Bayesian Computation (ABC)
Some models have likelihoods that are impossible or impractical to evaluate (common in population genetics, epidemiology, and agent-based models). ABC sidesteps this by:
- Proposing parameter values from the prior
- Simulating data from the model
- Comparing simulated data to observed data using summary statistics
- Accepting proposals where the summary statistics are "close enough"
ABC can be combined with model selection by running the procedure across competing models and estimating posterior model probabilities from the accepted samples. Extensions like ABC-SMC improve efficiency.
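Here is a minimal rejection-ABC sketch for choosing between two count models (Poisson vs. negative binomial). The priors, summary statistics, tolerance, and number of simulations are arbitrary choices made for illustration; real applications would tune all of them.

```python
# A minimal rejection-ABC sketch for model choice between two simulators.
import numpy as np

rng = np.random.default_rng(5)
y_obs = rng.negative_binomial(5, 0.4, size=200)        # "observed" overdispersed counts
s_obs = np.array([y_obs.mean(), y_obs.var()])          # summary statistics

def simulate(model, rng):
    """Draw parameters from the prior of one model and simulate a dataset."""
    if model == 0:                                     # M0: Poisson(lam), lam ~ Exponential(mean 10)
        lam = rng.exponential(10.0)
        return rng.poisson(lam, size=200)
    r = rng.uniform(1, 20)                             # M1: NegBin(r, p) with weak uniform priors
    p = rng.uniform(0.1, 0.9)
    return rng.negative_binomial(r, p, size=200)

accepted = []
for _ in range(50_000):
    m = int(rng.integers(2))                           # equal prior model probabilities
    x = simulate(m, rng)
    s_sim = np.array([x.mean(), x.var()])
    if np.linalg.norm((s_sim - s_obs) / s_obs) < 0.15: # accept if summaries are "close enough"
        accepted.append(m)

accepted = np.array(accepted)
print("Accepted draws:", len(accepted))
print("Estimated P(M1 | y):", accepted.mean() if len(accepted) else float("nan"))
```

Because the observed counts are overdispersed, nearly all accepted draws come from the negative binomial model, so the estimated posterior model probability concentrates on it.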
Variational Bayes Methods
Variational inference approximates the posterior by finding the closest distribution within a tractable family, turning inference into an optimization problem. For model comparison, the evidence lower bound (ELBO) from variational inference provides a lower bound on the log marginal likelihood, which can be used to approximate Bayes factors. The trade-off is speed for exactness: variational methods are much faster than MCMC but may underestimate posterior uncertainty.
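The sketch below illustrates the lower-bound property on a toy normal-mean model where the log evidence can be computed by one-dimensional numerical integration; the deliberately too-narrow variational family and all numbers are illustrative.

```python
# A minimal sketch: a Monte Carlo ELBO estimate never exceeds the log marginal likelihood.
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(6)
y = rng.normal(loc=2.0, scale=1.0, size=30)
sigma, prior_mu, prior_sd = 1.0, 0.0, 3.0

def log_joint(mu):
    """log p(y | mu) + log p(mu), vectorized over an array of mu values."""
    log_lik = stats.norm.logpdf(y[:, None], loc=mu, scale=sigma).sum(axis=0)
    return log_lik + stats.norm.logpdf(mu, loc=prior_mu, scale=prior_sd)

# "Exact" log evidence by numerical integration over a fine grid of mu.
grid = np.linspace(-10, 10, 20001)
log_evidence = logsumexp(log_joint(grid)) + np.log(grid[1] - grid[0])

# Variational approximation q(mu): a Gaussian that is too narrow on purpose.
q_mean, q_sd = y.mean(), 0.05
mu_samples = rng.normal(q_mean, q_sd, size=20_000)
elbo = np.mean(log_joint(mu_samples) - stats.norm.logpdf(mu_samples, q_mean, q_sd))

print(f"log evidence ~ {log_evidence:.3f}")
print(f"ELBO         ~ {elbo:.3f}   (below the log evidence; the gap is the KL divergence)")
```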
Bayesian Nonparametrics
Bayesian nonparametric models (e.g., Dirichlet process mixtures, Gaussian processes) let the model complexity grow with the data rather than being fixed in advance. This shifts the model comparison question from "which fixed model?" to "how complex should the model be?" These methods require specialized inference techniques but are valuable when you don't want to commit to a specific number of clusters, basis functions, or other structural choices.
Applications in Research
Psychology
Bayesian model comparison is widely used to evaluate competing cognitive models of decision-making, learning, and memory. Hierarchical Bayesian models account for individual differences across participants, while Bayes factors test specific experimental hypotheses. Posterior predictive checks verify that a cognitive model can reproduce key behavioral patterns.
Ecology
Ecologists compare species distribution models under different climate scenarios, evaluate hypotheses about population dynamics, and use BMA to produce robust predictions of ecosystem change. Information criteria help select among competing food web models, and model uncertainty is explicitly incorporated into conservation planning.
Finance
In finance, model comparison methods evaluate competing asset pricing models and risk models for portfolio optimization. Cross-validation assesses predictive performance of time series forecasts, and model averaging helps produce more reliable predictions by acknowledging that no single financial model captures all market dynamics.