📊Actuarial Mathematics Unit 11 Review

11.4 Bayesian inference and Markov chain Monte Carlo

Written by the Fiveable Content Team • Last updated August 2025

Foundations of Bayesian inference

Bayesian inference is a statistical framework that updates the probability of a hypothesis as new evidence becomes available. Rather than treating parameters as fixed unknowns (the frequentist view), Bayesian methods treat them as random variables with probability distributions that reflect our uncertainty. This distinction matters in actuarial work because you often have prior knowledge from historical portfolios or expert judgment that should inform your estimates.

Bayes' theorem

Bayes' theorem is the mathematical engine behind all Bayesian inference. It describes how to revise the probability of a hypothesis given new data:

$$P(\theta|X) = \frac{P(X|\theta)\,P(\theta)}{P(X)}$$

where:

  • $P(\theta|X)$ is the posterior distribution: your updated belief about parameter $\theta$ after observing data $X$
  • $P(\theta)$ is the prior distribution: what you believed about $\theta$ before seeing the data
  • $P(X|\theta)$ is the likelihood: the probability of observing the data given a particular value of $\theta$
  • $P(X)$ is the marginal likelihood (or evidence): a normalizing constant that ensures the posterior integrates to 1

In practice, since $P(X)$ doesn't depend on $\theta$, you'll often work with the proportionality:

$$P(\theta|X) \propto P(X|\theta) \cdot P(\theta)$$

This proportional form is usually all you need, because MCMC methods (covered below) can sample from the posterior without computing the normalizing constant.
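
The proportional form can be made concrete with a simple grid approximation: evaluate prior times likelihood on a grid of parameter values, then normalize numerically. A sketch with illustrative numbers (a Beta(2, 2) prior on a claim probability and 7 claims observed in 20 policies, neither taken from this section):

```python
# Grid approximation of a posterior using only P(theta|X) ∝ P(X|theta) P(theta).
# Illustrative numbers: Beta(2, 2) prior on a claim probability theta,
# binomial likelihood with 7 claims out of 20 policies.
n, k = 20, 7
grid = [i / 1000 for i in range(1, 1000)]             # theta values in (0, 1)

prior = [t * (1 - t) for t in grid]                   # Beta(2, 2), up to a constant
likelihood = [t**k * (1 - t)**(n - k) for t in grid]  # binomial, up to a constant

unnorm = [l * p for l, p in zip(likelihood, prior)]   # unnormalized posterior
z = sum(unnorm)                                       # numeric stand-in for P(X)
posterior = [u / z for u in unnorm]

# The posterior mean matches the conjugate Beta(2+7, 2+13) answer, 9/24 = 0.375
post_mean = sum(t * p for t, p in zip(grid, posterior))
print(round(post_mean, 3))
```

Normalizing by the grid sum plays the role of $P(X)$; for one or two parameters this is often all you need before reaching for MCMC.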

Prior and posterior distributions

The prior distribution encodes what you know (or assume) about a parameter before collecting data. Priors can be:

  • Informative: reflecting genuine prior knowledge (e.g., historical loss ratios from similar portfolios)
  • Weakly informative: constraining parameters to reasonable ranges without strong assumptions
  • Non-informative (or diffuse): attempting to let the data dominate (e.g., flat priors, Jeffreys priors)

The posterior distribution combines the prior with the likelihood. It represents your complete updated knowledge about the parameter. All Bayesian inference flows from the posterior: point estimates (posterior mean or mode), interval estimates (credible intervals), and predictions.

The choice of prior can substantially affect the posterior, especially with small sample sizes. As the amount of data grows, the likelihood dominates and the posterior becomes less sensitive to the prior. This property is known as posterior consistency.

Conjugate priors

A prior is conjugate to a given likelihood if the posterior belongs to the same distributional family as the prior. This is valuable because it gives you a closed-form posterior, avoiding the need for numerical methods entirely.

Common conjugate pairs in actuarial work:

| Likelihood              | Conjugate Prior | Posterior |
| ----------------------- | --------------- | --------- |
| Binomial                | Beta            | Beta      |
| Poisson                 | Gamma           | Gamma     |
| Normal (known variance) | Normal          | Normal    |
| Exponential             | Gamma           | Gamma     |

For example, if claim counts follow a Poisson distribution with unknown rate $\lambda$, and you place a $\text{Gamma}(\alpha, \beta)$ prior on $\lambda$, then after observing $n$ periods with claim counts summing to $\sum x_i$, the posterior is $\text{Gamma}(\alpha + \sum x_i, \beta + n)$. The prior parameters $\alpha$ and $\beta$ act like "pseudo-data," representing prior claim experience.

Conjugate priors are computationally convenient, but they constrain the functional form of the prior. For more complex models, you'll typically need MCMC.
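
The Poisson-Gamma update above reduces to two additions. A minimal sketch; the prior parameters and claim counts are illustrative, not from the text:

```python
# Conjugate Poisson-Gamma update (rate parameterization of the Gamma):
# prior Gamma(alpha, beta) + data -> posterior Gamma(alpha + sum_x, beta + n).

def poisson_gamma_update(alpha, beta, claim_counts):
    """Return posterior (alpha, beta) after observing claim_counts."""
    return alpha + sum(claim_counts), beta + len(claim_counts)

alpha0, beta0 = 3.0, 2.0             # prior mean 3/2 = 1.5 claims per period
claims = [1, 0, 2, 1, 3]             # 5 periods, 7 claims in total
a_post, b_post = poisson_gamma_update(alpha0, beta0, claims)
print(a_post, b_post)                # 10.0 7.0
print(round(a_post / b_post, 3))     # posterior mean 10/7 ≈ 1.429
```

The "pseudo-data" reading is visible in the code: $\alpha$ behaves like prior claims, $\beta$ like prior exposure periods.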

Bayesian vs. frequentist approaches

These two paradigms differ in how they interpret probability and handle uncertainty:

  • Frequentist: probability is the long-run frequency of events. Parameters are fixed but unknown. Inference relies on sampling distributions, maximum likelihood estimation, and confidence intervals.
  • Bayesian: probability measures degree of belief. Parameters are random variables with distributions. Inference produces posterior distributions and credible intervals.

A key practical difference: a 95% credible interval means there's a 0.95 probability the parameter lies in that interval (given the model and prior). A 95% confidence interval means that if you repeated the experiment many times, 95% of such intervals would contain the true parameter. The Bayesian interpretation is often more intuitive for decision-making.

Bayesian methods also handle nuisance parameters naturally through marginalization, and they produce full predictive distributions rather than point forecasts.

Markov chain Monte Carlo (MCMC) methods

MCMC methods are algorithms for sampling from probability distributions that can't be computed in closed form. In Bayesian inference, the posterior distribution often involves intractable integrals (especially the normalizing constant $P(X)$). MCMC sidesteps this by constructing a Markov chain whose stationary distribution is the target posterior. After the chain runs long enough, its samples approximate draws from the posterior.

Monte Carlo integration

The core idea behind Monte Carlo methods is simple: approximate an expectation by averaging over random samples. If you want to compute:

$$E[g(\theta)] = \int g(\theta) \, p(\theta|X) \, d\theta$$

you draw $N$ samples $\theta_1, \theta_2, \ldots, \theta_N$ from $p(\theta|X)$ and estimate:

$$E[g(\theta)] \approx \frac{1}{N} \sum_{i=1}^{N} g(\theta_i)$$

By the law of large numbers, this estimate converges to the true value as $N \to \infty$. The estimation error decreases at rate $O(1/\sqrt{N})$ regardless of the dimension of $\theta$, which is why Monte Carlo methods scale better to high dimensions than deterministic numerical integration (quadrature).
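
The estimator can be sketched in a few lines. Here a standard normal stands in for the posterior, so the true value of $E[\theta^2] = 1$ is known in advance:

```python
import random

# Monte Carlo estimate of E[g(theta)] with g(theta) = theta^2 and
# theta ~ N(0, 1), whose true value is 1 (the variance). Illustrative only.
random.seed(42)
N = 200_000
samples = [random.gauss(0.0, 1.0) for _ in range(N)]
estimate = sum(t * t for t in samples) / N       # (1/N) * sum of g(theta_i)
print(round(estimate, 3))
```

With $N = 200{,}000$ the standard error is about $\sqrt{2/N} \approx 0.003$, consistent with the $O(1/\sqrt{N})$ rate.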

Markov chains

A Markov chain is a sequence of random variables where the distribution of each state depends only on the immediately preceding state, not on the full history. This is the Markov property:

$$P(\theta_{t+1} | \theta_t, \theta_{t-1}, \ldots, \theta_0) = P(\theta_{t+1} | \theta_t)$$

The chain is fully characterized by its transition kernel $T(\theta_{t+1} | \theta_t)$. Under regularity conditions (irreducibility and aperiodicity), the chain converges to a unique stationary distribution $\pi(\theta)$ such that:

$$\pi(\theta') = \int T(\theta' | \theta) \, \pi(\theta) \, d\theta$$

The goal of MCMC is to design a transition kernel whose stationary distribution equals the target posterior.

Metropolis-Hastings algorithm

The Metropolis-Hastings (MH) algorithm is the most general MCMC method. It works for virtually any target distribution, provided you can evaluate the posterior up to a proportionality constant.

Steps at each iteration $t$:

  1. Given current state $\theta_t$, draw a candidate $\theta^*$ from a proposal distribution $q(\theta^* | \theta_t)$
  2. Compute the acceptance ratio: $\alpha = \min\left(1, \frac{p(\theta^*|X) \, q(\theta_t | \theta^*)}{p(\theta_t|X) \, q(\theta^* | \theta_t)}\right)$
  3. Draw $u \sim \text{Uniform}(0, 1)$
  4. If $u \leq \alpha$, accept the candidate: set $\theta_{t+1} = \theta^*$. Otherwise, reject: set $\theta_{t+1} = \theta_t$

The normalizing constant cancels in the ratio, so you only need the unnormalized posterior. The proposal distribution $q$ is your choice. A common option is a symmetric random walk: $q(\theta^* | \theta_t) = N(\theta_t, \sigma^2)$, in which case the ratio simplifies (the $q$ terms cancel) and you get the original Metropolis algorithm.

Tuning the proposal variance $\sigma^2$ matters: too small and the chain moves slowly (high acceptance, poor exploration); too large and most proposals are rejected. A common target is an acceptance rate around 20-40% for multivariate problems.
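
The four steps above can be sketched with a symmetric random-walk proposal, using a standard normal as a stand-in target known only through its unnormalized log density; the target, seed, and tuning values are illustrative:

```python
import math
import random

# Random-walk Metropolis: the symmetric proposal makes the q terms cancel,
# so only the unnormalized log posterior is needed. Work in logs for stability.
random.seed(0)

def log_unnorm_posterior(theta):
    return -0.5 * theta * theta      # stand-in for log P(X|theta) + log P(theta)

def metropolis(n_iter, sigma, theta0):
    theta, samples, accepted = theta0, [], 0
    for _ in range(n_iter):
        candidate = random.gauss(theta, sigma)                       # step 1
        log_alpha = (log_unnorm_posterior(candidate)
                     - log_unnorm_posterior(theta))                  # step 2
        if log_alpha >= 0 or random.random() < math.exp(log_alpha):  # steps 3-4
            theta, accepted = candidate, accepted + 1
        samples.append(theta)        # on rejection, theta_t is repeated
    return samples, accepted / n_iter

samples, acc_rate = metropolis(50_000, sigma=2.4, theta0=5.0)
post = samples[5_000:]               # discard burn-in
post_mean = sum(post) / len(post)
print(round(post_mean, 2), round(acc_rate, 2))
```

Note that a rejected proposal still contributes a sample: the chain repeats its current state, which is what makes the stationary distribution correct.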

Gibbs sampling

Gibbs sampling is a special case of Metropolis-Hastings where every proposal is accepted. It applies when you can sample from the full conditional distributions of each parameter.

For a parameter vector $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$, one iteration cycles through:

  1. Draw $\theta_1^{(t+1)}$ from $p(\theta_1 | \theta_2^{(t)}, \theta_3^{(t)}, \ldots, \theta_p^{(t)}, X)$
  2. Draw $\theta_2^{(t+1)}$ from $p(\theta_2 | \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_p^{(t)}, X)$
  3. Continue until all components are updated

Each draw conditions on the most recent values of all other parameters. Gibbs sampling is especially efficient when conjugate priors make the full conditionals recognizable distributions. It's widely used in hierarchical models where the conditional structure is naturally layered.

The limitation: if parameters are highly correlated, Gibbs sampling can mix slowly because it updates one component at a time along coordinate axes.
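A minimal Gibbs sketch for a bivariate normal target with correlation $\rho = 0.8$, chosen because both full conditionals are themselves normal, $\theta_1 | \theta_2 \sim N(\rho\,\theta_2, 1 - \rho^2)$; the target and starting values are illustrative:

```python
import math
import random

# Gibbs sampling for a standard bivariate normal with correlation rho.
# Each component is drawn from its full conditional given the OTHER
# component's most recent value.
random.seed(1)
rho = 0.8
cond_sd = math.sqrt(1 - rho * rho)   # sd of each full conditional

t1, t2 = 3.0, -3.0                   # deliberately poor starting point
draws = []
for _ in range(30_000):
    t1 = random.gauss(rho * t2, cond_sd)  # update theta1 given current theta2
    t2 = random.gauss(rho * t1, cond_sd)  # update theta2 given the NEW theta1
    draws.append((t1, t2))

post = draws[2_000:]                 # discard burn-in
m1 = sum(d[0] for d in post) / len(post)
corr_hat = sum(d[0] * d[1] for d in post) / len(post)  # ≈ rho (unit variances)
print(round(m1, 2), round(corr_hat, 2))
```

With $\rho = 0.8$ the coordinate-wise updates still mix acceptably; pushing $\rho$ toward 1 in this sketch is an easy way to see the slow mixing described above.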

Convergence diagnostics for MCMC

MCMC only produces valid posterior samples after the chain has converged to its stationary distribution. In practice, you never know exactly when convergence has occurred, so you rely on diagnostic tools to check for problems. Think of these as necessary sanity checks, not guarantees.

Burn-in period

The burn-in is the initial stretch of MCMC iterations that you discard. The chain's starting point is typically arbitrary, and early samples reflect that starting point rather than the posterior.

To choose an appropriate burn-in length:

  • Run the chain and examine trace plots (see below)
  • Look for where the chain appears to "settle" into a stable region
  • A conservative approach is to discard the first 50% of samples, though shorter burn-ins are often sufficient

Thinning

Thinning means keeping only every $k$-th sample from the chain (e.g., every 10th or 50th). Successive MCMC samples are autocorrelated because each state depends on the previous one, and thinning reduces this autocorrelation.

That said, thinning is somewhat controversial. It discards information, and for most inferential purposes, using all post-burn-in samples (even correlated ones) gives lower-variance estimates than using a thinned subset. The main practical reason to thin is storage: if you're running millions of iterations with many parameters, keeping every sample may be impractical.

Trace plots

A trace plot shows the sampled values of a parameter across iterations. You're looking for:

  • Good mixing: the chain moves freely across the parameter space, with rapid, random-looking fluctuations around a stable level. This resembles a "fuzzy caterpillar."
  • Poor mixing: the chain gets stuck in regions for long stretches, shows visible trends, or drifts slowly. This suggests the chain hasn't converged or the sampler needs tuning.

Run multiple chains from dispersed starting points. If they all converge to the same region and look similar, that's encouraging.

Gelman-Rubin statistic

The Gelman-Rubin diagnostic (also called $\hat{R}$) formalizes the multi-chain comparison. It compares the variance between chains to the variance within chains.

The potential scale reduction factor (PSRF) is computed as:

$$\hat{R} = \sqrt{\frac{\hat{V}}{W}}$$

where $\hat{V}$ is the pooled estimate of the posterior variance (combining between-chain and within-chain variance) and $W$ is the average within-chain variance.

  • $\hat{R} \approx 1.0$ suggests convergence (the chains are exploring the same distribution)
  • $\hat{R} > 1.1$ (or $> 1.01$ under stricter criteria) indicates the chains haven't converged and you need more iterations

Always use this alongside trace plots, not as a standalone check.
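
The PSRF can be computed directly from the chain output. This sketch uses the common pooled estimate $\hat{V} = \frac{n-1}{n} W + \frac{B}{n}$ and illustrative simulated chains rather than real MCMC output:

```python
import random
import statistics

# Gelman-Rubin potential scale reduction factor for m chains of length n:
# W = average within-chain variance, B = between-chain variance,
# Vhat = (n-1)/n * W + B/n, Rhat = sqrt(Vhat / W).

def gelman_rubin(chains):
    n = len(chains[0])                                            # iterations per chain
    chain_means = [statistics.fmean(c) for c in chains]
    W = statistics.fmean(statistics.variance(c) for c in chains)  # within-chain
    B = n * statistics.variance(chain_means)                      # between-chain
    V_hat = (n - 1) / n * W + B / n                               # pooled estimate
    return (V_hat / W) ** 0.5

random.seed(2)
# Two chains exploring the same N(0, 1) target (converged) versus two
# chains stuck around different modes (not converged).
good = [[random.gauss(0.0, 1.0) for _ in range(2_000)] for _ in range(2)]
bad = [[random.gauss(mu, 1.0) for _ in range(2_000)] for mu in (0.0, 3.0)]
print(round(gelman_rubin(good), 3))   # close to 1.0
print(round(gelman_rubin(bad), 3))    # well above the 1.1 threshold
```

Because $B$ is scaled by $n$, even a modest gap between chain means inflates $\hat{R}$ quickly, which is exactly the behavior you want from the diagnostic.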

Bayesian model selection

When you have multiple candidate models, Bayesian model selection provides a principled way to compare them. The key idea is balancing fit against complexity: a more complex model will always fit the data better, but it may overfit. Bayesian methods handle this naturally because the marginal likelihood automatically penalizes unnecessary complexity (a property sometimes called the "Bayesian Occam's razor").

Bayes factors

The Bayes factor compares two models by the ratio of their marginal likelihoods:

$$BF_{12} = \frac{P(X|M_1)}{P(X|M_2)} = \frac{\int P(X|\theta_1, M_1) \, P(\theta_1|M_1) \, d\theta_1}{\int P(X|\theta_2, M_2) \, P(\theta_2|M_2) \, d\theta_2}$$

  • $BF_{12} > 1$: evidence favors model $M_1$
  • $BF_{12} < 1$: evidence favors model $M_2$

Jeffreys' scale provides rough interpretation guidelines:

| Bayes Factor | Evidence                |
| ------------ | ----------------------- |
| 1 to 3       | Barely worth mentioning |
| 3 to 10      | Substantial             |
| 10 to 30     | Strong                  |
| 30 to 100    | Very strong             |
| > 100        | Decisive                |

Computing Bayes factors can be challenging because the marginal likelihoods involve high-dimensional integrals. Methods like bridge sampling or thermodynamic integration are often used in practice.

Bayesian information criterion (BIC)

The BIC provides a computationally simpler approximation to the log Bayes factor. It's defined as:

$$BIC = -2 \ln \hat{L} + k \ln n$$

where $\hat{L}$ is the maximized likelihood, $k$ is the number of parameters, and $n$ is the sample size. Lower BIC values indicate a better balance of fit and parsimony.

BIC is derived from a Laplace approximation to the marginal likelihood and becomes more accurate as $n$ grows. Its $k \ln n$ penalty exceeds AIC's $2k$ penalty whenever $\ln n > 2$ (i.e., $n \geq 8$), so BIC favors simpler models for all but the smallest samples.
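
A quick sketch of comparing two candidate models by BIC; the maximized log-likelihoods and parameter counts below are made-up illustrative values:

```python
import math

# BIC = -2 ln L_hat + k ln n; lower is better.

def bic(log_lik_hat, k, n):
    return -2.0 * log_lik_hat + k * math.log(n)

n = 500
bic_simple = bic(log_lik_hat=-1210.4, k=3, n=n)    # 3-parameter model
bic_complex = bic(log_lik_hat=-1207.9, k=8, n=n)   # 8-parameter model
# The complex model fits slightly better (higher log-likelihood), but the
# k ln n penalty more than offsets the improvement here.
print(round(bic_simple, 1), round(bic_complex, 1))
```

Here the extra 5 parameters cost $5 \ln 500 \approx 31.1$ BIC points while buying only $2 \times 2.5 = 5$ points of fit, so the simpler model wins.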

Deviance information criterion (DIC)

The DIC is designed specifically for Bayesian models and is computed from MCMC output:

$$DIC = \bar{D} + p_D$$

where $\bar{D}$ is the posterior mean of the deviance ($D(\theta) = -2 \ln P(X|\theta)$) and $p_D$ is the effective number of parameters, defined as $p_D = \bar{D} - D(\bar{\theta})$. The term $p_D$ captures model complexity in a way that accounts for prior constraints and hierarchical structure.

Lower DIC values are preferred. DIC is particularly useful for hierarchical models where counting "the number of parameters" isn't straightforward. However, DIC has known limitations: it can be unreliable for models with non-normal posteriors, and it doesn't have the same theoretical backing as Bayes factors.
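
A sketch of the DIC computation for a normal-mean model with known unit variance. Since this snippet has no real sampler attached, draws from the known analytic posterior stand in for MCMC output, and the data values are illustrative:

```python
import math
import random

# DIC from posterior samples: D(theta) = -2 ln P(X|theta),
# p_D = Dbar - D(theta_bar), DIC = Dbar + p_D.
random.seed(3)
data = [1.2, 0.8, 1.5, 0.9, 1.1, 1.6, 0.7, 1.3]
n = len(data)

def deviance(theta):
    log_lik = sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2
                  for x in data)
    return -2.0 * log_lik

xbar = sum(data) / n                 # flat-prior posterior is N(xbar, 1/n)
thetas = [random.gauss(xbar, (1 / n) ** 0.5) for _ in range(20_000)]

D_bar = sum(deviance(t) for t in thetas) / len(thetas)   # posterior mean deviance
D_at_mean = deviance(sum(thetas) / len(thetas))          # deviance at posterior mean
p_D = D_bar - D_at_mean              # effective number of parameters, ≈ 1 here
DIC = D_bar + p_D
print(round(p_D, 2), round(DIC, 1))
```

With one free parameter and a flat prior, $p_D$ lands near 1; informative priors or hierarchical shrinkage would pull it below the raw parameter count.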

Applications in actuarial science

Bayesian methods are well-suited to actuarial problems because actuaries routinely combine data with expert judgment, work with limited data for specific risk segments, and need to quantify uncertainty in their estimates.

Bayesian credibility theory

Classical credibility theory asks: how much weight should you give to an individual risk's experience versus the collective portfolio? Bayesian credibility provides a formal answer.

The credibility premium is the posterior mean, which turns out to be a weighted average:

$$\hat{\mu} = Z \cdot \bar{X} + (1 - Z) \cdot \mu_0$$

where $\bar{X}$ is the individual's observed mean, $\mu_0$ is the prior (collective) mean, and $Z$ is the credibility factor. Under the Bühlmann model, $Z = \frac{n}{n + k}$, where $n$ is the number of observations and $k$ is the ratio of the expected process variance to the variance of the hypothetical means; $Z \to 1$ as individual experience accumulates.

The Bayesian framework makes the assumptions explicit: the prior represents the portfolio distribution, and the posterior gives the credibility-weighted estimate.
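
The weighting can be sketched with the standard Bühlmann form $Z = n/(n+k)$, where $k$ is the ratio of process variance to the variance of hypothetical means; all numbers below are illustrative:

```python
# Buhlmann-style credibility premium: a weighted average of individual
# experience and the collective mean, with Z = n / (n + k).

def credibility_premium(x_bar, mu0, n, process_var, prior_var):
    k = process_var / prior_var           # Buhlmann credibility constant
    Z = n / (n + k)                       # credibility factor
    return Z * x_bar + (1 - Z) * mu0, Z

# An individual risk with 5 years of experience averaging 1200 per year,
# against a collective (prior) mean of 1000.
premium, Z = credibility_premium(x_bar=1200, mu0=1000, n=5,
                                 process_var=40_000, prior_var=8_000)
print(round(Z, 3), round(premium, 1))     # Z = 0.5, premium = 1100.0
```

With $k = 40{,}000 / 8{,}000 = 5$ and $n = 5$, the individual and collective means get equal weight; more years of experience would push $Z$ toward 1.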

Bayesian reserving methods

Loss reserving estimates the outstanding liabilities for claims that have occurred but aren't fully settled. Bayesian approaches bring several advantages:

  • They incorporate expert judgment as informative priors on development factors
  • They produce full posterior distributions of reserves, giving a natural measure of reserve uncertainty
  • The Bayesian chain-ladder method treats the development factors as random variables with prior distributions, rather than fixed point estimates
  • More flexible Bayesian models can capture correlations between accident years and development periods that traditional methods miss

Bayesian GLMs for insurance pricing

Generalized linear models (GLMs) are standard tools for modeling claim frequency and severity as functions of risk factors (age, vehicle type, region, etc.). Bayesian GLMs extend this by:

  • Placing priors on regression coefficients, which regularizes estimates and prevents overfitting, especially for sparse rating factors
  • Providing posterior distributions for all parameters, enabling full uncertainty quantification in premiums
  • Supporting Bayesian model averaging, where predictions are weighted across multiple models rather than relying on a single "best" model
  • Connecting naturally to credibility theory when hierarchical priors are used across risk groups

Bayesian forecasting in finance

Actuaries working in pensions, life insurance, and enterprise risk management use Bayesian forecasting for:

  • Asset return modeling: incorporating market views alongside historical data (related to the Black-Litterman framework)
  • Volatility estimation: Bayesian stochastic volatility models capture time-varying risk more flexibly than GARCH
  • Bayesian vector autoregressive (BVAR) models: for multivariate time series of economic variables, where Minnesota-type priors shrink coefficients toward reasonable defaults and prevent overfitting
  • Probabilistic forecasting: producing full predictive distributions rather than point estimates, which feeds directly into risk measures like Value-at-Risk and Conditional Tail Expectation

Advanced topics in Bayesian inference

These methods extend the basic Bayesian framework to handle more complex modeling challenges. They appear increasingly in actuarial research and practice.

Hierarchical Bayesian models

Hierarchical models (also called multilevel models) introduce parameters at multiple levels. For example, in a portfolio of insurance companies:

  • Level 1: claim counts for each company follow a Poisson distribution with company-specific rate $\lambda_i$
  • Level 2: the rates $\lambda_i$ are drawn from a common distribution $\text{Gamma}(\alpha, \beta)$
  • Level 3: hyperpriors are placed on $\alpha$ and $\beta$

This structure allows borrowing of strength: companies with little data are pulled toward the group average, while companies with abundant data are driven more by their own experience. This is the Bayesian formalization of credibility theory and is especially valuable for small or heterogeneous risk groups.

Bayesian nonparametrics

Standard Bayesian models assume a fixed parametric form (e.g., claims follow a lognormal distribution). Bayesian nonparametric methods let the data determine the complexity of the model.

Key tools include:

  • Dirichlet process priors: for clustering and density estimation with an unknown number of components
  • Gaussian process priors: for flexible regression and spatial modeling
  • Pólya tree priors: for nonparametric density estimation

These methods are useful when you suspect the true distribution doesn't fit neatly into a standard parametric family, such as multimodal claim severity distributions.

Variational Bayes methods

Variational Bayes (VB) is an alternative to MCMC for approximate posterior inference. Instead of sampling, VB finds the closest tractable distribution to the true posterior by minimizing the Kullback-Leibler (KL) divergence:

$$q^*(\theta) = \arg\min_{q \in \mathcal{Q}} \, KL(q(\theta) \,\|\, p(\theta|X))$$

where $\mathcal{Q}$ is a family of tractable distributions (often factorized: $q(\theta) = \prod_j q_j(\theta_j)$).

VB is much faster than MCMC and scales better to large datasets, making it attractive for high-volume actuarial applications. The trade-off is that VB tends to underestimate posterior variance because the factorized approximation can't capture correlations between parameters. Use VB for quick exploration and MCMC for final, publication-quality inference.

Hamiltonian Monte Carlo (HMC)

Hamiltonian Monte Carlo improves on basic Metropolis-Hastings by using gradient information to make smarter proposals. It borrows ideas from classical mechanics: the parameter vector is treated as the "position" of a particle, and an auxiliary "momentum" variable is introduced.

The algorithm at each iteration:

  1. Sample a random momentum $p$ from a standard normal distribution
  2. Simulate Hamiltonian dynamics for $L$ leapfrog steps of size $\epsilon$, using the gradient $\nabla \log p(\theta|X)$
  3. Apply a Metropolis accept/reject step to correct for numerical integration error

HMC explores high-dimensional parameter spaces much more efficiently than random-walk Metropolis because proposals follow the geometry of the posterior rather than wandering randomly. The No-U-Turn Sampler (NUTS), implemented in software like Stan, automatically tunes $L$ and $\epsilon$, making HMC practical without extensive manual tuning.

HMC does require the posterior to be differentiable with respect to all parameters, which rules out discrete parameters (though workarounds exist via marginalization).
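
The three steps above can be sketched for a one-dimensional standard normal target, where $U(\theta) = -\log p(\theta|X) = \theta^2/2$ and so $\nabla U(\theta) = \theta$; the step size and path length are hand-picked rather than NUTS-tuned, and everything here is illustrative:

```python
import math
import random

# Minimal HMC for a 1-D standard normal target.
random.seed(4)

def leapfrog(theta, p, eps, L):
    """Leapfrog integrator for H = U + K with U = theta^2/2 (grad U = theta)."""
    p -= 0.5 * eps * theta           # initial half step in momentum
    for _ in range(L - 1):
        theta += eps * p             # full step in position
        p -= eps * theta             # full step in momentum
    theta += eps * p                 # final full position step
    p -= 0.5 * eps * theta           # final half step in momentum
    return theta, p

def hmc(n_iter, eps=0.2, L=10):
    theta, samples = 0.0, []
    for _ in range(n_iter):
        p0 = random.gauss(0.0, 1.0)                      # step 1: fresh momentum
        theta_new, p_new = leapfrog(theta, p0, eps, L)   # step 2: dynamics
        # step 3: Metropolis correction on H = U + kinetic energy p^2/2
        h_old = 0.5 * theta ** 2 + 0.5 * p0 ** 2
        h_new = 0.5 * theta_new ** 2 + 0.5 * p_new ** 2
        if random.random() < math.exp(min(0.0, h_old - h_new)):
            theta = theta_new
        samples.append(theta)
    return samples

samples = hmc(20_000)
mean = sum(samples) / len(samples)
var = sum((t - mean) ** 2 for t in samples) / len(samples)
print(round(mean, 2), round(var, 2))
```

Because the leapfrog integrator nearly conserves the Hamiltonian, almost every proposal is accepted even though each one travels far from the current state; that is the efficiency gain over random-walk Metropolis.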