Basics of Model Comparison
In Bayesian statistics, you rarely have just one model for your data. Model comparison gives you a principled way to evaluate competing hypotheses by asking: which model is best supported by the evidence, and by how much?
This matters because simply picking the model with the best fit leads to overfitting. Bayesian model comparison methods balance fit against complexity, letting you choose models that generalize well to new data.
Purpose of Model Comparison
- Identify which model best explains the observed data
- Quantify the relative support for different models (not just pick a winner)
- Guard against overfitting by penalizing unnecessary complexity
- Facilitate scientific inference by formally comparing alternative hypotheses
Types of Models Compared
- Nested models: one model is a special case of another (e.g., a regression with 3 predictors vs. the same regression with 2 of those predictors)
- Non-nested models: models with fundamentally different structures or predictor sets
- Linear vs. nonlinear models
- Parametric vs. nonparametric models
- Models that share the same likelihood but differ in their prior distributions
Bayes Factors
Bayes factors are the most direct Bayesian tool for model comparison. They tell you how much the observed data shift the relative plausibility of two models, without requiring the models to be nested.
Definition
The Bayes factor comparing model $M_1$ to model $M_2$ is the ratio of their marginal likelihoods:

$$BF_{12} = \frac{p(y \mid M_1)}{p(y \mid M_2)}, \qquad p(y \mid M_k) = \int p(y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k$$
Each marginal likelihood is obtained by integrating the likelihood over the entire prior distribution of that model's parameters. This integration is what gives Bayes factors their built-in complexity penalty: a model with many parameters spreads its prior probability across a large parameter space, so it must fit the data well across that space to earn a high marginal likelihood.
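To make this concrete, here is a minimal sketch in Python (NumPy/SciPy) for a case where the marginal likelihoods are available in closed form: two beta-binomial models for the same coin-flip data that differ only in their priors. The data, priors, and model labels are all illustrative choices, not a prescription.

```python
# A minimal sketch of a Bayes factor computed from closed-form marginal
# likelihoods in a beta-binomial setting. Data and priors are illustrative.
import numpy as np
from scipy.special import betaln, gammaln

def log_marginal_likelihood(y, n, a, b):
    """log p(y | M) for y successes in n Bernoulli trials under a Beta(a, b) prior on theta."""
    log_binom = gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)
    return log_binom + betaln(a + y, b + n - y) - betaln(a, b)

y, n = 62, 100                                        # illustrative data: 62 successes in 100 trials
log_m1 = log_marginal_likelihood(y, n, a=1, b=1)      # M1: uniform Beta(1, 1) prior on theta
log_m2 = log_marginal_likelihood(y, n, a=20, b=20)    # M2: prior concentrated near theta = 0.5

bf_12 = np.exp(log_m1 - log_m2)
print(f"log p(y | M1) = {log_m1:.3f}")
print(f"log p(y | M2) = {log_m2:.3f}")
print(f"BF_12 = {bf_12:.2f}")                         # values > 1 favor M1, < 1 favor M2
```

Because the two models share a likelihood and differ only in their priors, the Bayes factor here is driven entirely by the prior choice, which previews the prior sensitivity discussed under the limitations below.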
Interpretation
A $BF_{12} = 8$ means the data are 8 times more likely under $M_1$ than under $M_2$. Jeffreys' scale provides rough interpretive guidelines:
| $BF_{12}$ | Strength of evidence for $M_1$ |
|---|---|
| 1 to 3 | Weak (barely worth mentioning) |
| 3 to 20 | Positive |
| 20 to 150 | Strong |
| > 150 | Very strong |
Values below 1 favor $M_2$. Log Bayes factors are often reported because they're symmetric around zero and easier to work with.
Advantages and Limitations
- Advantages:
- Naturally implement Occam's razor through the marginal likelihood integral
- Work for non-nested models
- Provide a continuous, interpretable measure of evidence
- Limitations:
- Sensitive to prior choice, especially for vague priors; with improper priors the marginal likelihood is defined only up to an arbitrary constant, so the Bayes factor itself becomes ill-defined. Two analysts with different priors can get very different Bayes factors.
- Computationally demanding for complex models because the marginal likelihood integral is often high-dimensional
- Can be numerically unstable in high-dimensional parameter spaces
Information Criteria
Information criteria offer computationally cheaper alternatives to Bayes factors. They estimate out-of-sample predictive performance by combining a measure of in-sample fit with a penalty for complexity.
Akaike Information Criterion (AIC)
AIC estimates the expected out-of-sample prediction error (specifically, the Kullback-Leibler divergence):

$$\mathrm{AIC} = -2 \log \hat{L} + 2k$$

where $\hat{L}$ is the maximized likelihood and $k$ is the number of estimated parameters. Lower AIC is better. The $2k$ term penalizes complexity, but the penalty is relatively mild. AIC is derived from frequentist assumptions and works best with large samples. Strictly speaking, it's not a Bayesian criterion, but it's commonly used alongside Bayesian methods.
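As a quick illustration, the sketch below computes AIC for two ordinary least squares fits with Gaussian errors. The data are simulated and the predictor names are arbitrary, so treat it as a template rather than a recipe.

```python
# A minimal sketch of AIC = -2 log L_hat + 2k for two OLS fits; all data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(scale=1.5, size=n)     # x2 is irrelevant by construction

def gaussian_aic(X, y):
    """AIC for an OLS fit, where OLS coincides with the maximum likelihood estimate."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = np.mean(resid**2)                          # MLE of the error variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = X.shape[1] + 1                                  # regression coefficients plus the variance
    return -2 * log_lik + 2 * k

X_small = np.column_stack([np.ones(n), x1])             # intercept + x1
X_large = np.column_stack([np.ones(n), x1, x2])         # intercept + x1 + x2
print("AIC, smaller model:", round(gaussian_aic(X_small, y), 1))
print("AIC, larger model: ", round(gaussian_aic(X_large, y), 1))   # lower is better
```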
Bayesian Information Criterion (BIC)
BIC applies a stronger complexity penalty that grows with sample size:

$$\mathrm{BIC} = -2 \log \hat{L} + k \log n$$

where $n$ is the number of observations. For any sample with $n \geq 8$ (so that $\log n > 2$), the BIC penalty exceeds the AIC penalty, so BIC tends to favor simpler models. A key property: for large samples, $\mathrm{BIC}_1 - \mathrm{BIC}_2 \approx -2 \log BF_{12}$, giving BIC a direct connection to Bayes factors. BIC is also consistent, meaning it selects the true model (if it's among the candidates) as $n \to \infty$.
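The sketch below contrasts the two penalties and uses the BIC difference as a rough stand-in for $2 \log BF_{12}$. The deviances and parameter counts are placeholder numbers for two hypothetical fitted models.

```python
# A minimal sketch comparing AIC and BIC penalties; deviances are made-up placeholders.
import numpy as np

n = 500                                  # sample size
dev_1, k_1 = 1402.6, 4                   # model 1: deviance (-2 log L_hat) and parameter count
dev_2, k_2 = 1393.0, 7                   # model 2: better fit, more parameters (both illustrative)

aic_1, aic_2 = dev_1 + 2 * k_1, dev_2 + 2 * k_2
bic_1, bic_2 = dev_1 + k_1 * np.log(n), dev_2 + k_2 * np.log(n)
print("AIC:", round(aic_1, 1), round(aic_2, 1))   # mild penalty: the larger model wins here
print("BIC:", round(bic_1, 1), round(bic_2, 1))   # log(500) ~ 6.2 per parameter: the simpler model wins

# Large-sample link to Bayes factors: BIC_2 - BIC_1 ~ 2 log BF_12 (here, evidence for model 1).
print("Approximate 2 log BF_12:", round(bic_2 - bic_1, 1))
```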
Deviance Information Criterion (DIC)
DIC was designed specifically for Bayesian hierarchical models:

$$\mathrm{DIC} = D(\bar{\theta}) + 2 p_D$$

Here $D(\bar{\theta})$ is the deviance evaluated at the posterior mean of the parameters, and $p_D$ is the effective number of parameters, which accounts for the fact that hierarchical priors shrink parameters and reduce effective complexity. DIC works well when the posterior is roughly normal, but it can give misleading results for mixture models or posteriors with multiple modes.
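Here is a minimal sketch of the DIC calculation for a toy normal-mean model, using conjugate posterior draws in place of MCMC output; the prior, data, and number of draws are all illustrative.

```python
# A minimal DIC sketch: D(theta_bar) + 2 * p_D, with p_D = mean deviance - deviance at the mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, scale=2.0, size=50)            # simulated "observed" data
sigma = 2.0                                            # error sd assumed known

# Conjugate posterior for the mean under a Normal(0, 10^2) prior.
prior_mu, prior_sd = 0.0, 10.0
post_prec = 1 / prior_sd**2 + len(y) / sigma**2
post_mean = (prior_mu / prior_sd**2 + y.sum() / sigma**2) / post_prec
post_sd = np.sqrt(1 / post_prec)
mu_draws = rng.normal(post_mean, post_sd, size=4000)   # stand-in for MCMC draws

def deviance(mu):
    """D(mu) = -2 * log p(y | mu)."""
    return -2 * stats.norm.logpdf(y, loc=mu, scale=sigma).sum()

mean_deviance = np.mean([deviance(m) for m in mu_draws])
dev_at_mean = deviance(mu_draws.mean())
p_d = mean_deviance - dev_at_mean                      # effective number of parameters
dic = dev_at_mean + 2 * p_d                            # equivalently mean_deviance + p_d
print(f"p_D ~ {p_d:.2f}, DIC ~ {dic:.1f}")
```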
Cross-Validation Methods
Cross-validation directly measures how well a model predicts data it hasn't seen. This makes it less dependent on specific modeling assumptions than information criteria.

Leave-One-Out Cross-Validation (LOO-CV)
- Remove one data point from the dataset
- Fit the model to the remaining points
- Calculate the predictive density for the held-out point
- Repeat for every data point
- Sum the log predictive densities to get the total LOO score
This is computationally expensive since it requires $n$ separate model fits, one per observation. In practice, Pareto-smoothed importance sampling (PSIS-LOO) approximates LOO-CV from a single model fit by reweighting posterior samples. The loo package in R implements this efficiently.
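The following sketch walks through the brute-force version of these steps for a conjugate normal-mean model, where each refit is cheap; in real workflows you would reach for PSIS-LOO rather than refitting $n$ times. The data and priors are simulated placeholders.

```python
# A minimal brute-force LOO-CV sketch for a conjugate normal-mean model with known variance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.normal(loc=1.0, scale=1.0, size=40)    # simulated "observed" data
sigma = 1.0                                    # known error sd
prior_mu, prior_sd = 0.0, 5.0

def posterior(y_subset):
    """Conjugate Normal posterior (mean, sd) for the unknown mean."""
    prec = 1 / prior_sd**2 + len(y_subset) / sigma**2
    mean = (prior_mu / prior_sd**2 + y_subset.sum() / sigma**2) / prec
    return mean, np.sqrt(1 / prec)

loo_log_densities = []
for i in range(len(y)):
    y_minus_i = np.delete(y, i)                # remove one point, refit on the rest
    m, s = posterior(y_minus_i)
    pred_sd = np.sqrt(sigma**2 + s**2)         # posterior predictive sd for the held-out point
    loo_log_densities.append(stats.norm.logpdf(y[i], loc=m, scale=pred_sd))

elpd_loo = np.sum(loo_log_densities)           # total LOO score (higher is better)
print(f"ELPD (LOO) = {elpd_loo:.2f}")
```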
K-Fold Cross-Validation
- Divide the data into $K$ roughly equal subsets (folds)
- Hold out one fold, fit the model on the remaining folds
- Evaluate predictive performance on the held-out fold
- Rotate through all folds and aggregate results
Common choices are $K = 5$ or $K = 10$. K-fold is more computationally tractable than full LOO-CV for large datasets, though it introduces some variance depending on how the folds are split.
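Here is a minimal sketch of the fold rotation, reusing the same toy conjugate normal-mean model as the LOO example above; $K$, the data, and the priors are illustrative.

```python
# A minimal K-fold cross-validation sketch for a conjugate normal-mean model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.normal(loc=1.0, scale=1.0, size=40)
sigma, prior_mu, prior_sd = 1.0, 0.0, 5.0
K = 5

indices = rng.permutation(len(y))              # shuffle before assigning folds
folds = np.array_split(indices, K)             # K roughly equal folds

fold_scores = []
for k in range(K):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    y_train, y_test = y[train_idx], y[test_idx]
    prec = 1 / prior_sd**2 + len(y_train) / sigma**2
    m = (prior_mu / prior_sd**2 + y_train.sum() / sigma**2) / prec
    pred_sd = np.sqrt(sigma**2 + 1 / prec)     # posterior predictive sd for held-out points
    fold_scores.append(stats.norm.logpdf(y_test, loc=m, scale=pred_sd).sum())

print(f"{K}-fold ELPD estimate = {sum(fold_scores):.2f}")
```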
Bayesian Cross-Validation
Standard cross-validation uses point estimates, but Bayesian cross-validation uses the full posterior predictive distribution for evaluation. This means uncertainty in parameter estimates propagates into the predictive assessment. The key output is the expected log pointwise predictive density (ELPD), which measures how well the model's predictive distribution covers the observed data. Higher ELPD is better.
Posterior Predictive Checks
While the methods above compare models to each other, posterior predictive checks ask a different question: does this model adequately describe the data at all? A model can be the "best" among candidates and still be a poor fit.
How They Work
- Draw parameter values from the posterior distribution
- Simulate new datasets (called replicated data, $y^{\mathrm{rep}}$) from the model using those parameter values
- Compare features of the simulated data to the same features of the observed data
- Systematic discrepancies reveal where the model fails
Visual vs. Quantitative Checks
Visual checks:
- Overlay density plots of observed data and multiple simulated datasets
- Plot residual distributions from replicated data
- Use Q-Q plots to check distributional assumptions
Quantitative checks:
- Compute summary statistics (mean, variance, skewness, min, max) for both observed and replicated data
- Calculate discrepancy measures that target specific model assumptions
- Compare test statistics across many replicated datasets
Posterior Predictive p-Values
A posterior predictive p-value is the proportion of replicated datasets where a chosen test statistic is more extreme than the observed value. Values near 0.5 suggest the model captures that feature well. Values near 0 or 1 signal a problem: the model consistently over- or under-predicts that aspect of the data. These are not the same as classical p-values and should not be interpreted with the same thresholds.
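The sketch below runs the replicate-and-compare steps described above for a deliberately misspecified example (a normal model fit to heavier-tailed data) and computes a posterior predictive p-value for the sample variance. The model, data, and test statistic are all illustrative choices.

```python
# A minimal posterior predictive check with a posterior predictive p-value.
import numpy as np

rng = np.random.default_rng(4)
y_obs = rng.standard_t(df=3, size=100)         # heavier-tailed than the model below assumes
sigma = 1.0                                    # the model (wrongly) fixes the error sd at 1

# Posterior draws of the mean under a normal model with known sd and a flat prior.
mu_draws = rng.normal(y_obs.mean(), sigma / np.sqrt(len(y_obs)), size=2000)

# One replicated dataset per posterior draw, scored by a test statistic (the variance).
t_obs = y_obs.var()
t_rep = np.empty(len(mu_draws))
for i, mu in enumerate(mu_draws):
    y_rep = rng.normal(loc=mu, scale=sigma, size=len(y_obs))   # simulate y_rep from the model
    t_rep[i] = y_rep.var()

ppp = np.mean(t_rep >= t_obs)                  # posterior predictive p-value
print(f"T(y_obs) = {t_obs:.2f}, posterior predictive p-value = {ppp:.3f}")
# A value near 0 or 1 flags that the model cannot reproduce this feature of the data.
```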
Model Averaging
Sometimes no single model clearly dominates, or you want predictions that account for uncertainty about which model is correct. Model averaging addresses this.
Bayesian Model Averaging (BMA)
Instead of committing to one model, BMA weights each model's predictions by its posterior probability:

$$p(\Delta \mid y) = \sum_{k} p(\Delta \mid M_k, y)\, p(M_k \mid y)$$

Here $\Delta$ is whatever quantity you're trying to estimate or predict, and $p(M_k \mid y)$ is the posterior model probability (derived from the marginal likelihoods and prior model probabilities). BMA often produces better-calibrated predictions than selecting a single model, especially when several models have similar posterior support.
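As a small numerical sketch, the snippet below averages two models' predictive means and variances using assumed posterior model probabilities; in a real analysis those weights would come from the marginal likelihoods, and every number here is illustrative.

```python
# A minimal BMA sketch: mix two models' predictive distributions by posterior model probability.
import numpy as np

post_prob = np.array([0.7, 0.3])     # assumed posterior model probabilities p(M_k | y)
pred_mean = np.array([2.1, 2.8])     # each model's posterior predictive mean for the same quantity
pred_sd = np.array([0.5, 0.9])       # each model's posterior predictive sd

# BMA predictive mean is the probability-weighted mixture mean.
bma_mean = np.sum(post_prob * pred_mean)

# BMA predictive variance: within-model variance plus between-model spread.
bma_var = np.sum(post_prob * (pred_sd**2 + (pred_mean - bma_mean)**2))

print(f"BMA mean = {bma_mean:.2f}, BMA sd = {np.sqrt(bma_var):.2f}")
```

Note that the between-model term makes the BMA predictive distribution wider than any single model's, which is how model uncertainty shows up in the predictions.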
Occam's Window
With many candidate models, most will have negligible posterior probability. Occam's window trims the model set to reduce computation:
- Symmetric Occam's window: exclude any model whose Bayes factor against the best model falls below a threshold (e.g., $1/20$)
- Asymmetric Occam's window: also exclude complex models that are outperformed by simpler nested alternatives
This keeps the model average focused on models with meaningful support.
Reversible Jump MCMC
When models differ in the number of parameters (different dimensionality), standard MCMC can't move between them. Reversible jump MCMC (RJMCMC) solves this by designing proposals that jump between model spaces of different dimensions. The sampler simultaneously explores parameter values and model identity, producing posterior model probabilities as a byproduct. RJMCMC is powerful for variable selection problems (e.g., which predictors belong in a regression), but designing good between-model proposals requires care.

Practical Considerations
Computational Complexity
Model comparison costs scale with both model complexity and the number of candidates. Some strategies for managing this:
- Use PSIS-LOO instead of brute-force LOO-CV
- Consider approximation methods like variational inference or Laplace approximation for marginal likelihood estimation
- Leverage parallel computing for cross-validation and simulation-based methods
- Be deliberate about which models to compare rather than exhaustively searching a huge model space
Sensitivity Analysis
Because Bayes factors are sensitive to prior choice, you should check whether your conclusions change under reasonable alternative priors. This involves:
- Varying prior distributions and hyperparameters across a plausible range
- Checking whether model rankings remain stable
- Identifying cases where results depend heavily on a specific prior assumption
If your model comparison results flip under modest prior changes, that's a sign the data alone aren't strongly informative about model choice.
Handling Model Uncertainty
No single model is likely to be "true." Responsible Bayesian practice acknowledges this by:
- Reporting results from multiple plausible models, not just the winner
- Using model averaging for predictions and parameter estimates when appropriate
- Presenting sensitivity analyses so readers can judge robustness
- Being transparent about the set of models considered and why they were chosen
Advanced Techniques
Approximate Bayesian Computation (ABC)
Some models have likelihoods that are impossible or impractical to evaluate (common in population genetics, epidemiology, and agent-based models). ABC sidesteps this by:
- Proposing parameter values from the prior
- Simulating data from the model
- Comparing simulated data to observed data using summary statistics
- Accepting proposals where the summary statistics are "close enough"
ABC can be combined with model selection by running the procedure across competing models and estimating posterior model probabilities from the accepted samples. Extensions like ABC-SMC improve efficiency.
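Here is a minimal rejection-ABC sketch for choosing between two count models (Poisson vs. negative binomial). The priors, summary statistics, tolerance, and number of simulations are arbitrary choices made for illustration; real applications would tune all of them.

```python
# A minimal rejection-ABC sketch for model choice between two simulators.
import numpy as np

rng = np.random.default_rng(5)
y_obs = rng.negative_binomial(5, 0.4, size=200)        # "observed" overdispersed counts
s_obs = np.array([y_obs.mean(), y_obs.var()])          # summary statistics

def simulate(model, rng):
    """Draw parameters from the prior of one model and simulate a dataset."""
    if model == 0:                                     # M0: Poisson(lam), lam ~ Exponential(mean 10)
        lam = rng.exponential(10.0)
        return rng.poisson(lam, size=200)
    r = rng.uniform(1, 20)                             # M1: NegBin(r, p) with weak uniform priors
    p = rng.uniform(0.1, 0.9)
    return rng.negative_binomial(r, p, size=200)

accepted = []
for _ in range(50_000):
    m = int(rng.integers(2))                           # equal prior model probabilities
    x = simulate(m, rng)
    s_sim = np.array([x.mean(), x.var()])
    if np.linalg.norm((s_sim - s_obs) / s_obs) < 0.15: # accept if summaries are "close enough"
        accepted.append(m)

accepted = np.array(accepted)
print("Accepted draws:", len(accepted))
print("Estimated P(M1 | y):", accepted.mean() if len(accepted) else float("nan"))
```

Because the observed counts are overdispersed, nearly all accepted draws come from the negative binomial model, so the estimated posterior model probability concentrates on it.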
Variational Bayes Methods
Variational inference approximates the posterior by finding the closest distribution within a tractable family, turning inference into an optimization problem. For model comparison, the evidence lower bound (ELBO) from variational inference provides a lower bound on the log marginal likelihood, which can be used to approximate Bayes factors. The trade-off is speed for exactness: variational methods are much faster than MCMC but may underestimate posterior uncertainty.
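The sketch below illustrates the lower-bound property on a toy normal-mean model where the log evidence can be computed by one-dimensional numerical integration; the deliberately too-narrow variational family and all numbers are illustrative.

```python
# A minimal sketch: a Monte Carlo ELBO estimate never exceeds the log marginal likelihood.
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(6)
y = rng.normal(loc=2.0, scale=1.0, size=30)
sigma, prior_mu, prior_sd = 1.0, 0.0, 3.0

def log_joint(mu):
    """log p(y | mu) + log p(mu), vectorized over an array of mu values."""
    log_lik = stats.norm.logpdf(y[:, None], loc=mu, scale=sigma).sum(axis=0)
    return log_lik + stats.norm.logpdf(mu, loc=prior_mu, scale=prior_sd)

# "Exact" log evidence by numerical integration over a fine grid of mu.
grid = np.linspace(-10, 10, 20001)
log_evidence = logsumexp(log_joint(grid)) + np.log(grid[1] - grid[0])

# Variational approximation q(mu): a Gaussian that is too narrow on purpose.
q_mean, q_sd = y.mean(), 0.05
mu_samples = rng.normal(q_mean, q_sd, size=20_000)
elbo = np.mean(log_joint(mu_samples) - stats.norm.logpdf(mu_samples, q_mean, q_sd))

print(f"log evidence ~ {log_evidence:.3f}")
print(f"ELBO         ~ {elbo:.3f}   (below the log evidence; the gap is the KL divergence)")
```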
Bayesian Nonparametrics
Bayesian nonparametric models (e.g., Dirichlet process mixtures, Gaussian processes) let the model complexity grow with the data rather than being fixed in advance. This shifts the model comparison question from "which fixed model?" to "how complex should the model be?" These methods require specialized inference techniques but are valuable when you don't want to commit to a specific number of clusters, basis functions, or other structural choices.
Applications in Research
Psychology
Bayesian model comparison is widely used to evaluate competing cognitive models of decision-making, learning, and memory. Hierarchical Bayesian models account for individual differences across participants, while Bayes factors test specific experimental hypotheses. Posterior predictive checks verify that a cognitive model can reproduce key behavioral patterns.
Ecology
Ecologists compare species distribution models under different climate scenarios, evaluate hypotheses about population dynamics, and use BMA to produce robust predictions of ecosystem change. Information criteria help select among competing food web models, and model uncertainty is explicitly incorporated into conservation planning.
Finance
In finance, model comparison methods evaluate competing asset pricing models and risk models for portfolio optimization. Cross-validation assesses predictive performance of time series forecasts, and model averaging helps produce more reliable predictions by acknowledging that no single financial model captures all market dynamics.