Foundations of Bayesian inference
Bayesian inference is a statistical framework that updates the probability of a hypothesis as new evidence becomes available. Rather than treating parameters as fixed unknowns (the frequentist view), Bayesian methods treat them as random variables with probability distributions that reflect our uncertainty. This distinction matters in actuarial work because you often have prior knowledge from historical portfolios or expert judgment that should inform your estimates.
Bayes' theorem
Bayes' theorem is the mathematical engine behind all Bayesian inference. It describes how to revise the probability of a hypothesis given new data:

p(θ | D) = p(D | θ) p(θ) / p(D)

where:
- p(θ | D) is the posterior distribution: your updated belief about the parameter θ after observing data D
- p(θ) is the prior distribution: what you believed about θ before seeing the data
- p(D | θ) is the likelihood: the probability of observing the data given a particular value of θ
- p(D) is the marginal likelihood (or evidence): a normalizing constant that ensures the posterior integrates to 1

In practice, since p(D) doesn't depend on θ, you'll often work with the proportionality:

p(θ | D) ∝ p(D | θ) p(θ)
This proportional form is usually all you need, because MCMC methods (covered below) can sample from the posterior without computing the normalizing constant.
Prior and posterior distributions
The prior distribution encodes what you know (or assume) about a parameter before collecting data. Priors can be:
- Informative: reflecting genuine prior knowledge (e.g., historical loss ratios from similar portfolios)
- Weakly informative: constraining parameters to reasonable ranges without strong assumptions
- Non-informative (or diffuse): attempting to let the data dominate (e.g., flat priors, Jeffreys priors)
The posterior distribution combines the prior with the likelihood. It represents your complete updated knowledge about the parameter. All Bayesian inference flows from the posterior: point estimates (posterior mean or mode), interval estimates (credible intervals), and predictions.
The choice of prior can substantially affect the posterior, especially with small sample sizes. As the amount of data grows, the likelihood dominates and the posterior becomes less sensitive to the prior. This property is called asymptotic consistency.
Conjugate priors
A prior is conjugate to a given likelihood if the posterior belongs to the same distributional family as the prior. This is valuable because it gives you a closed-form posterior, avoiding the need for numerical methods entirely.
Common conjugate pairs in actuarial work:
| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Binomial | Beta | Beta |
| Poisson | Gamma | Gamma |
| Normal (known variance) | Normal | Normal |
| Exponential | Gamma | Gamma |
For example, if claim counts follow a Poisson distribution with unknown rate λ, and you place a Gamma(α, β) prior on λ, then after observing n periods with claims totaling s, the posterior is Gamma(α + s, β + n). The prior parameters α and β act like "pseudo-data," representing prior claim experience.
Conjugate priors are computationally convenient, but they constrain the functional form of the prior. For more complex models, you'll typically need MCMC.
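The Poisson-Gamma update above can be sketched in a few lines. The prior parameters and claim counts here are illustrative, not from the text:

```python
import numpy as np

# Illustrative prior: Gamma(alpha=3, beta=2), i.e. prior mean 3/2 = 1.5 claims/period.
alpha_prior, beta_prior = 3.0, 2.0

# Illustrative observed claim counts over 5 periods.
claims = np.array([2, 0, 1, 3, 1])
n, s = len(claims), claims.sum()        # n periods, s total claims

# Conjugate update: the posterior is Gamma(alpha + s, beta + n) in closed form.
alpha_post = alpha_prior + s
beta_post = beta_prior + n

posterior_mean = alpha_post / beta_post  # pulled between prior mean 1.5 and sample mean 1.4
print(alpha_post, beta_post, posterior_mean)
```

Note how the posterior mean lands between the prior mean and the observed average, with the data's weight growing as n increases.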
Bayesian vs. frequentist approaches
These two paradigms differ in how they interpret probability and handle uncertainty:
- Frequentist: probability is the long-run frequency of events. Parameters are fixed but unknown. Inference relies on sampling distributions, maximum likelihood estimation, and confidence intervals.
- Bayesian: probability measures degree of belief. Parameters are random variables with distributions. Inference produces posterior distributions and credible intervals.
A key practical difference: a 95% credible interval means there's a 0.95 probability the parameter lies in that interval (given the model and prior). A 95% confidence interval means that if you repeated the experiment many times, 95% of such intervals would contain the true parameter. The Bayesian interpretation is often more intuitive for decision-making.
Bayesian methods also handle nuisance parameters naturally through marginalization, and they produce full predictive distributions rather than point forecasts.
Markov chain Monte Carlo (MCMC) methods
MCMC methods are algorithms for sampling from probability distributions that can't be computed in closed form. In Bayesian inference, the posterior distribution often involves intractable integrals (especially the normalizing constant p(D)). MCMC sidesteps this by constructing a Markov chain whose stationary distribution is the target posterior. After the chain runs long enough, its samples approximate draws from the posterior.
Monte Carlo integration
The core idea behind Monte Carlo methods is simple: approximate an expectation by averaging over random samples. If you want to compute:

E[g(θ)] = ∫ g(θ) p(θ) dθ

you draw samples θ⁽¹⁾, …, θ⁽ᴺ⁾ from p(θ) and estimate:

E[g(θ)] ≈ (1/N) Σᵢ g(θ⁽ⁱ⁾)

By the law of large numbers, this estimate converges to the true value as N → ∞. The estimation error decreases at rate O(1/√N) regardless of the dimension of θ, which is why Monte Carlo methods scale better to high dimensions than deterministic numerical integration (quadrature).
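A minimal sketch of Monte Carlo integration, using an example where the answer is known: estimating E[θ²] for θ ~ N(0, 1), whose true value is 1 (the variance of a standard normal):

```python
import numpy as np

rng = np.random.default_rng(42)

# Estimate E[g(theta)] with g(theta) = theta**2 and theta ~ N(0, 1).
# The exact answer is 1, so we can see the Monte Carlo error directly.
N = 200_000
samples = rng.standard_normal(N)
estimate = np.mean(samples ** 2)

print(estimate)  # close to 1; the error shrinks like 1/sqrt(N)
```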
Markov chains
A Markov chain is a sequence of random variables θ⁽¹⁾, θ⁽²⁾, … where the distribution of each state depends only on the immediately preceding state, not on the full history. This is the Markov property:

p(θ⁽ᵗ⁺¹⁾ | θ⁽ᵗ⁾, θ⁽ᵗ⁻¹⁾, …, θ⁽¹⁾) = p(θ⁽ᵗ⁺¹⁾ | θ⁽ᵗ⁾)

The chain is fully characterized by its transition kernel K(θ' | θ). Under regularity conditions (irreducibility and aperiodicity), the chain converges to a unique stationary distribution π such that:

π(θ') = ∫ K(θ' | θ) π(θ) dθ
The goal of MCMC is to design a transition kernel whose stationary distribution equals the target posterior.
Metropolis-Hastings algorithm
The Metropolis-Hastings (MH) algorithm is the most general MCMC method. It works for virtually any target distribution, provided you can evaluate the posterior up to a proportionality constant.
Steps at each iteration t:
- Given the current state θ⁽ᵗ⁾, draw a candidate θ* from a proposal distribution q(θ* | θ⁽ᵗ⁾)
- Compute the acceptance ratio:

  α = min(1, [p(θ* | D) q(θ⁽ᵗ⁾ | θ*)] / [p(θ⁽ᵗ⁾ | D) q(θ* | θ⁽ᵗ⁾)])

- Draw u ~ Uniform(0, 1)
- If u ≤ α, accept the candidate: set θ⁽ᵗ⁺¹⁾ = θ*. Otherwise, reject: set θ⁽ᵗ⁺¹⁾ = θ⁽ᵗ⁾

The normalizing constant p(D) cancels in the ratio, so you only need the unnormalized posterior. The proposal distribution q is your choice. A common option is a symmetric random walk: θ* = θ⁽ᵗ⁾ + ε with ε ~ N(0, σ²), in which case the ratio simplifies (the q terms cancel) and you get the original Metropolis algorithm.
Tuning the proposal variance matters: too small and the chain moves slowly (high acceptance, poor exploration); too large and most proposals are rejected. A common target is an acceptance rate around 20-40% for multivariate problems.
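The steps above can be sketched as a random-walk Metropolis sampler. The target here is an assumed N(3, 2²) "posterior" (known in closed form so the output is checkable); the step size of 2.5 is an illustrative tuning choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(theta):
    # Unnormalized log posterior: a N(3, 2^2) density, assumed for illustration.
    return -0.5 * ((theta - 3.0) / 2.0) ** 2

def metropolis(log_target, theta0, n_iter, step):
    chain = np.empty(n_iter)
    theta = theta0
    lp = log_target(theta)
    accepted = 0
    for t in range(n_iter):
        # Symmetric random-walk proposal: the q terms cancel in the ratio.
        cand = theta + step * rng.standard_normal()
        lp_cand = log_target(cand)
        # Accept with probability min(1, target(cand) / target(theta)),
        # computed on the log scale for numerical stability.
        if np.log(rng.uniform()) < lp_cand - lp:
            theta, lp = cand, lp_cand
            accepted += 1
        chain[t] = theta
    return chain, accepted / n_iter

chain, acc_rate = metropolis(log_target, theta0=0.0, n_iter=50_000, step=2.5)
post = chain[10_000:]                       # discard burn-in
print(post.mean(), post.std(), acc_rate)    # mean ≈ 3, sd ≈ 2, moderate acceptance
```

Try rerunning with step=0.05 or step=50 to see the slow-exploration and high-rejection failure modes described above.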
Gibbs sampling
Gibbs sampling is a special case of Metropolis-Hastings where every proposal is accepted. It applies when you can sample from the full conditional distributions of each parameter.
For a parameter vector θ = (θ₁, θ₂, …, θ_d), one iteration cycles through:
- Draw θ₁ from p(θ₁ | θ₂, …, θ_d, D)
- Draw θ₂ from p(θ₂ | θ₁, θ₃, …, θ_d, D)
- Continue until all d components are updated
Each draw conditions on the most recent values of all other parameters. Gibbs sampling is especially efficient when conjugate priors make the full conditionals recognizable distributions. It's widely used in hierarchical models where the conditional structure is naturally layered.
The limitation: if parameters are highly correlated, Gibbs sampling can mix slowly because it updates one component at a time along coordinate axes.
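A classic sketch of Gibbs sampling (and of its slow-mixing failure mode) targets a bivariate normal with correlation ρ, where both full conditionals are available in closed form. The ρ = 0.8 here is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target: bivariate standard normal with correlation rho.
# Full conditionals: x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y | x.
rho = 0.8
sd = np.sqrt(1 - rho ** 2)
n_iter = 50_000
x, y = 0.0, 0.0
draws = np.empty((n_iter, 2))
for t in range(n_iter):
    x = rho * y + sd * rng.standard_normal()  # draw x from p(x | y)
    y = rho * x + sd * rng.standard_normal()  # draw y from p(y | x), using the new x
    draws[t] = (x, y)

post = draws[5_000:]  # discard burn-in
print(post.mean(axis=0), np.corrcoef(post.T)[0, 1])  # means ≈ 0, correlation ≈ 0.8
```

Pushing ρ toward 1 makes the coordinate-wise updates take tiny steps along the diagonal ridge, which is exactly the slow mixing noted above.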

Convergence diagnostics for MCMC
MCMC only produces valid posterior samples after the chain has converged to its stationary distribution. In practice, you never know exactly when convergence has occurred, so you rely on diagnostic tools to check for problems. Think of these as necessary sanity checks, not guarantees.
Burn-in period
The burn-in is the initial stretch of MCMC iterations that you discard. The chain's starting point is typically arbitrary, and early samples reflect that starting point rather than the posterior.
To choose an appropriate burn-in length:
- Run the chain and examine trace plots (see below)
- Look for where the chain appears to "settle" into a stable region
- A conservative approach is to discard the first 50% of samples, though shorter burn-ins are often sufficient
Thinning
Thinning means keeping only every k-th sample from the chain (e.g., every 10th or 50th). Successive MCMC samples are autocorrelated because each state depends on the previous one, and thinning reduces this autocorrelation.
That said, thinning is somewhat controversial. It discards information, and for most inferential purposes, using all post-burn-in samples (even correlated ones) gives lower-variance estimates than using a thinned subset. The main practical reason to thin is storage: if you're running millions of iterations with many parameters, keeping every sample may be impractical.
Trace plots
A trace plot shows the sampled values of a parameter across iterations. You're looking for:
- Good mixing: the chain moves freely across the parameter space, with rapid, random-looking fluctuations around a stable level. This resembles a "fuzzy caterpillar."
- Poor mixing: the chain gets stuck in regions for long stretches, shows visible trends, or drifts slowly. This suggests the chain hasn't converged or the sampler needs tuning.
Run multiple chains from dispersed starting points. If they all converge to the same region and look similar, that's encouraging.
Gelman-Rubin statistic
The Gelman-Rubin diagnostic (also called R̂) formalizes the multi-chain comparison. It compares the variance between chains to the variance within chains.

The potential scale reduction factor (PSRF) is computed as:

R̂ = √(V̂ / W)

where V̂ is the pooled estimate of the posterior variance (combining between-chain and within-chain variance) and W is the average within-chain variance.

- R̂ ≈ 1 suggests convergence (the chains are exploring the same distribution)
- R̂ > 1.1 (or R̂ > 1.01 for stricter criteria) indicates the chains haven't converged and you need more iterations
Always use this alongside trace plots, not as a standalone check.
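A minimal sketch of the PSRF computation, tested on two synthetic cases: four well-mixed chains drawn from the same distribution, and four chains stuck at different levels. The chain values here are fabricated for illustration, not real MCMC output:

```python
import numpy as np

def gelman_rubin(chains):
    """PSRF from an (m, n) array of m chains, each of length n."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # average within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    V_hat = (n - 1) / n * W + B / n         # pooled posterior variance estimate
    return np.sqrt(V_hat / W)

rng = np.random.default_rng(2)
# Four well-mixed "chains" sampling the same N(0, 1) target: R-hat near 1.
good = rng.standard_normal((4, 2_000))
# Chains stuck at different levels: R-hat well above 1.1 signals non-convergence.
bad = good + np.array([[0.0], [1.0], [2.0], [3.0]])
print(gelman_rubin(good), gelman_rubin(bad))
```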
Bayesian model selection
When you have multiple candidate models, Bayesian model selection provides a principled way to compare them. The key idea is balancing fit against complexity: a more complex model will always fit the data better, but it may overfit. Bayesian methods handle this naturally because the marginal likelihood automatically penalizes unnecessary complexity (a property sometimes called the "Bayesian Occam's razor").
Bayes factors
The Bayes factor compares two models by the ratio of their marginal likelihoods:

BF₁₂ = p(D | M₁) / p(D | M₂)

- BF₁₂ > 1: evidence favors model M₁
- BF₁₂ < 1: evidence favors model M₂
Jeffreys' scale provides rough interpretation guidelines:
| Bayes Factor | Evidence |
|---|---|
| 1 to 3 | Barely worth mentioning |
| 3 to 10 | Substantial |
| 10 to 30 | Strong |
| 30 to 100 | Very strong |
| > 100 | Decisive |
Computing Bayes factors can be challenging because the marginal likelihoods involve high-dimensional integrals. Methods like bridge sampling or thermodynamic integration are often used in practice.
Bayesian information criterion (BIC)
The BIC provides a computationally simpler approximation to the log Bayes factor. It's defined as:

BIC = −2 ln L̂ + k ln n

where L̂ is the maximized likelihood, k is the number of parameters, and n is the sample size. Lower BIC values indicate a better balance of fit and parsimony.
BIC is derived from a Laplace approximation to the marginal likelihood and becomes more accurate as grows. It penalizes complexity more heavily than AIC for moderate-to-large sample sizes.
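As a sketch, here is BIC used to compare two normal models on synthetic data (the data-generating values and sample size are illustrative assumptions):

```python
import numpy as np

def bic(log_lik, k, n):
    # BIC = -2 * maximized log-likelihood + k * ln(n); lower is better.
    return -2.0 * log_lik + k * np.log(n)

def normal_loglik(x, mu, sigma):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (x - mu) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(3)
x = rng.normal(loc=0.5, scale=1.0, size=200)  # synthetic data with a real mean shift
n = len(x)

# Model 1: mean fixed at 0, only sigma estimated (k = 1).
sigma1 = np.sqrt(np.mean(x ** 2))             # MLE of sigma when mu = 0
bic1 = bic(normal_loglik(x, 0.0, sigma1), k=1, n=n)

# Model 2: both mu and sigma estimated (k = 2).
mu2, sigma2 = x.mean(), x.std()               # MLEs (ddof=0)
bic2 = bic(normal_loglik(x, mu2, sigma2), k=2, n=n)

print(bic1, bic2)  # model 2 should win: the better fit outweighs the ln(n) penalty
```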
Deviance information criterion (DIC)
The DIC is designed specifically for Bayesian models and is computed from MCMC output:
DIC = D̄ + p_D

where D̄ is the posterior mean of the deviance (D(θ) = −2 ln p(D | θ)) and p_D is the effective number of parameters, defined as p_D = D̄ − D(θ̄), the mean deviance minus the deviance at the posterior mean θ̄. The p_D term captures model complexity in a way that accounts for prior constraints and hierarchical structure.
Lower DIC values are preferred. DIC is particularly useful for hierarchical models where counting "the number of parameters" isn't straightforward. However, DIC has known limitations: it can be unreliable for models with non-normal posteriors, and it doesn't have the same theoretical backing as Bayes factors.
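A sketch of the DIC computation for the Poisson-Gamma model used earlier. Because that model is conjugate, exact posterior draws stand in for MCMC output; the prior and claim counts are illustrative assumptions:

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(4)

# Illustrative claim counts with a conjugate Gamma(3, 2) prior on a Poisson rate.
claims = np.array([2, 0, 1, 3, 1])
alpha_post = 3.0 + claims.sum()      # Gamma posterior shape
beta_post = 2.0 + len(claims)        # Gamma posterior rate
lam_draws = rng.gamma(shape=alpha_post, scale=1.0 / beta_post, size=20_000)

const = sum(lgamma(int(c) + 1) for c in claims)  # log(c!) terms of the likelihood

def deviance(lam):
    # D(lambda) = -2 * log p(claims | lambda) under the Poisson likelihood.
    return -2.0 * (claims.sum() * np.log(lam) - len(claims) * lam - const)

D_bar = deviance(lam_draws).mean()   # posterior mean deviance
D_at_mean = deviance(lam_draws.mean())  # deviance at the posterior mean
p_D = D_bar - D_at_mean              # effective number of parameters
dic = D_bar + p_D
print(dic, p_D)  # p_D falls below 1: the informative prior constrains the single rate
```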
Applications in actuarial science
Bayesian methods are well-suited to actuarial problems because actuaries routinely combine data with expert judgment, work with limited data for specific risk segments, and need to quantify uncertainty in their estimates.
Bayesian credibility theory
Classical credibility theory asks: how much weight should you give to an individual risk's experience versus the collective portfolio? Bayesian credibility provides a formal answer.
The credibility premium is the posterior mean, which turns out to be a weighted average:

Premium = Z x̄ + (1 − Z) μ

where x̄ is the individual's observed mean, μ is the prior (collective) mean, and Z ∈ [0, 1] is the credibility factor. Under the Bühlmann model with conjugate priors, Z = n / (n + k), where n is the number of observations and k is the ratio of the expected process variance to the prior variance of the hypothetical means.
The Bayesian framework makes the assumptions explicit: the prior represents the portfolio distribution, and the posterior gives the credibility-weighted estimate.
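For the Poisson-Gamma case, the credibility weighting is exact: the posterior mean equals Z x̄ + (1 − Z) μ with Z = n / (n + β). A sketch with illustrative numbers:

```python
import numpy as np

# Poisson claim counts with a Gamma(alpha, beta) prior on the rate
# (illustrative values, matching the conjugate example earlier).
alpha, beta = 3.0, 2.0
claims = np.array([2, 0, 1, 3, 1])
n, xbar = len(claims), claims.mean()

Z = n / (n + beta)                   # credibility factor
mu = alpha / beta                    # collective (prior) mean
credibility_premium = Z * xbar + (1 - Z) * mu

# Identical to the conjugate posterior mean (alpha + s) / (beta + n):
posterior_mean = (alpha + claims.sum()) / (beta + n)
print(credibility_premium, posterior_mean, Z)
```

As n grows, Z → 1 and the individual's own experience dominates, which is exactly the borrowing-of-strength behavior credibility theory describes.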
Bayesian reserving methods
Loss reserving estimates the outstanding liabilities for claims that have occurred but aren't fully settled. Bayesian approaches bring several advantages:
- They incorporate expert judgment as informative priors on development factors
- They produce full posterior distributions of reserves, giving a natural measure of reserve uncertainty
- The Bayesian chain-ladder method treats the development factors as random variables with prior distributions, rather than fixed point estimates
- More flexible Bayesian models can capture correlations between accident years and development periods that traditional methods miss
Bayesian GLMs for insurance pricing
Generalized linear models (GLMs) are standard tools for modeling claim frequency and severity as functions of risk factors (age, vehicle type, region, etc.). Bayesian GLMs extend this by:
- Placing priors on regression coefficients, which regularizes estimates and prevents overfitting, especially for sparse rating factors
- Providing posterior distributions for all parameters, enabling full uncertainty quantification in premiums
- Supporting Bayesian model averaging, where predictions are weighted across multiple models rather than relying on a single "best" model
- Connecting naturally to credibility theory when hierarchical priors are used across risk groups
Bayesian forecasting in finance
Actuaries working in pensions, life insurance, and enterprise risk management use Bayesian forecasting for:
- Asset return modeling: incorporating market views alongside historical data (related to the Black-Litterman framework)
- Volatility estimation: Bayesian stochastic volatility models capture time-varying risk more flexibly than GARCH
- Bayesian vector autoregressive (BVAR) models: for multivariate time series of economic variables, where Minnesota-type priors shrink coefficients toward reasonable defaults and prevent overfitting
- Probabilistic forecasting: producing full predictive distributions rather than point estimates, which feeds directly into risk measures like Value-at-Risk and Conditional Tail Expectation
Advanced topics in Bayesian inference
These methods extend the basic Bayesian framework to handle more complex modeling challenges. They appear increasingly in actuarial research and practice.
Hierarchical Bayesian models
Hierarchical models (also called multilevel models) introduce parameters at multiple levels. For example, in a portfolio of insurance companies:
- Level 1: claim counts Nᵢ for each company i follow a Poisson distribution with company-specific rate λᵢ
- Level 2: the rates λᵢ are drawn from a common Gamma(α, β) distribution
- Level 3: hyperpriors are placed on α and β
This structure allows borrowing of strength: companies with little data are pulled toward the group average, while companies with abundant data are driven more by their own experience. This is the Bayesian formalization of credibility theory and is especially valuable for small or heterogeneous risk groups.
Bayesian nonparametrics
Standard Bayesian models assume a fixed parametric form (e.g., claims follow a lognormal distribution). Bayesian nonparametric methods let the data determine the complexity of the model.
Key tools include:
- Dirichlet process priors: for clustering and density estimation with an unknown number of components
- Gaussian process priors: for flexible regression and spatial modeling
- Pólya tree priors: for nonparametric density estimation
These methods are useful when you suspect the true distribution doesn't fit neatly into a standard parametric family, such as multimodal claim severity distributions.
Variational Bayes methods
Variational Bayes (VB) is an alternative to MCMC for approximate posterior inference. Instead of sampling, VB finds the closest tractable distribution to the true posterior by minimizing the Kullback-Leibler (KL) divergence:

q* = argmin over q ∈ Q of KL(q(θ) ‖ p(θ | D))

where Q is a family of tractable distributions (often factorized: q(θ) = ∏ᵢ qᵢ(θᵢ)).
VB is much faster than MCMC and scales better to large datasets, making it attractive for high-volume actuarial applications. The trade-off is that VB tends to underestimate posterior variance because the factorized approximation can't capture correlations between parameters. Use VB for quick exploration and MCMC for final, publication-quality inference.
Hamiltonian Monte Carlo (HMC)
Hamiltonian Monte Carlo improves on basic Metropolis-Hastings by using gradient information to make smarter proposals. It borrows ideas from classical mechanics: the parameter vector is treated as the "position" of a particle, and an auxiliary "momentum" variable is introduced.
The algorithm at each iteration:
- Sample a random momentum r from a standard normal distribution
- Simulate Hamiltonian dynamics for L leapfrog steps of size ε, using the gradient ∇_θ log p(θ | D)
- Apply a Metropolis accept/reject step to correct for numerical integration error

HMC explores high-dimensional parameter spaces much more efficiently than random-walk Metropolis because proposals follow the geometry of the posterior rather than wandering randomly. The No-U-Turn Sampler (NUTS), implemented in software like Stan, automatically tunes L and ε, making HMC practical without extensive manual tuning.
HMC does require the posterior to be differentiable with respect to all parameters, which rules out discrete parameters (though workarounds exist via marginalization).