Mixture models
Mixture models let you represent a population as a combination of distinct subpopulations, each with its own probability distribution. This is critical in actuarial work because real insurance portfolios are rarely homogeneous: your book of business might contain low-risk drivers, moderate-risk drivers, and high-risk drivers, each generating claims from a different underlying distribution. A single distribution can't capture that kind of structure, but a mixture model can.
Definition of mixture models
A mixture model expresses the overall density of a random variable as a weighted sum of component distributions. Each component represents a subpopulation, and the weights reflect how prevalent each subpopulation is. The result is a single model flexible enough to capture multimodal behavior, heavy tails, or other patterns that no single standard distribution handles well.
Components of mixture models
A mixture model has two building blocks:
- Mixing proportions $\pi_1, \dots, \pi_K$: $\pi_k$ is the probability that a randomly chosen observation comes from component $k$. These must satisfy $\sum_{k=1}^{K} \pi_k = 1$ and each $\pi_k \ge 0$.
- Component distributions $f_k(x \mid \theta_k)$: the PDF (or PMF) for each subpopulation, parameterized by $\theta_k$. Common choices include exponential, gamma, lognormal, Pareto, and normal distributions.
The overall density is:

$$f(x) = \sum_{k=1}^{K} \pi_k \, f_k(x \mid \theta_k)$$
For example, a two-component mixture of exponentials with means 1,000 and 10,000 and mixing weights 0.7 and 0.3 would produce a distribution with a bulk of small claims and a meaningful tail of large claims.
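This example can be sketched numerically; a minimal pure-Python version, with exponentials parameterized by their means (the helper names are illustrative, not from any library):

```python
import math

def exp_pdf(x, mean):
    """Exponential PDF parameterized by its mean."""
    return math.exp(-x / mean) / mean

def mixture_pdf(x, weights, means):
    """Mixture density: weighted sum of the component densities."""
    return sum(w * exp_pdf(x, m) for w, m in zip(weights, means))

weights, means = [0.7, 0.3], [1_000.0, 10_000.0]

# The mixture mean is the weighted average of the component means:
mixture_mean = sum(w * m for w, m in zip(weights, means))  # 0.7*1000 + 0.3*10000 = 3,700
```

Near zero the 0.7-weighted component dominates the density (the bulk of small claims), while far out in the tail the mean-10,000 component takes over.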
Finite vs. infinite mixture models
- Finite mixture models fix the number of components in advance. You either choose from domain knowledge (e.g., "we know there are three risk classes") or select it using model selection criteria like AIC or BIC.
- Infinite mixture models (Dirichlet process mixtures) allow the number of components to grow with the data. These are useful when you have no strong prior belief about how many subpopulations exist, but they require Bayesian computational methods and are more complex to implement.
For most exam-level and practical actuarial work, you'll deal with finite mixtures.
Applications in actuarial science
- Insurance pricing: Segment policyholders into risk classes, each with its own claim distribution, to set appropriate premiums.
- Claims modeling: Represent small, medium, and large claims as separate components. A two-component mixture (e.g., lognormal + Pareto) is a classic approach for capturing both the body and the tail of a severity distribution.
- Reserving: Different claim types may develop at different speeds; mixture models can account for these distinct development patterns.
- Fraud detection and classification: Unusual claims may form their own component, making them identifiable through posterior component probabilities.
Deductibles
A deductible is the amount of loss a policyholder must pay out of pocket before the insurer begins to pay. Deductibles are a fundamental risk-sharing mechanism: they reduce insurer exposure, discourage frivolous claims, and lower premiums.
Purpose of deductibles
- Reduce moral hazard: When policyholders bear part of the loss, they have an incentive to take preventive measures.
- Eliminate small claims: Insurers avoid the administrative cost of processing many low-dollar claims.
- Lower premiums: Shifting some risk to the policyholder means the insurer charges less.
Types of deductibles
Ordinary (fixed) deductible: The policyholder pays the first $d$ dollars of any loss. The insurer pays $\max(X - d, 0)$, where $X$ is the loss amount. A $500 deductible on an $1,800 loss means the insurer pays $1,300.
Percentage deductible: Expressed as a percentage of the insured value. A 2% deductible on a $100,000 property means the policyholder absorbs the first $2,000.
Franchise deductible: If the loss is below $d$, the policyholder pays everything. If the loss meets or exceeds $d$, the insurer pays the entire loss. With a $1,000 franchise deductible, a $900 claim is fully on the policyholder, but a $1,500 claim is fully covered by the insurer. This differs sharply from an ordinary deductible, where the insurer would only pay $500 on that $1,500 claim.
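The ordinary and franchise payment rules can be captured as small helper functions (hypothetical names, a sketch rather than any standard library API); the asserts reproduce the worked numbers above:

```python
def insurer_pays_ordinary(loss, d):
    """Ordinary deductible: insurer pays max(loss - d, 0)."""
    return max(loss - d, 0)

def insurer_pays_franchise(loss, d):
    """Franchise deductible: insurer pays the full loss once it reaches d, else nothing."""
    return loss if loss >= d else 0

# Worked numbers from the text:
assert insurer_pays_ordinary(1_800, 500) == 1_300
assert insurer_pays_franchise(900, 1_000) == 0
assert insurer_pays_franchise(1_500, 1_000) == 1_500
assert insurer_pays_ordinary(1_500, 1_000) == 500
```

A percentage deductible reduces to the ordinary rule once you compute $d$ as the percentage of the insured value.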
Impact on insurance premiums
Higher deductibles transfer more risk to the policyholder, which reduces the insurer's expected claim payments and therefore lowers the premium. The relationship isn't linear: going from a $0 to $500 deductible typically produces a larger premium reduction than going from $500 to $1,000, because most of the claim frequency sits in the lower severity range.
Actuaries quantify this by computing the loss elimination ratio (LER):

$$\mathrm{LER}(d) = \frac{E[\min(X, d)]}{E[X]}$$

This gives the proportion of expected losses eliminated by a deductible of $d$.
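As a worked case: for an exponential loss with mean $\theta$, $E[\min(X, d)] = \theta(1 - e^{-d/\theta})$ and $E[X] = \theta$, so the LER has the closed form $1 - e^{-d/\theta}$. A short sketch:

```python
import math

def ler_exponential(d, theta):
    """LER(d) = E[min(X, d)] / E[X] = 1 - exp(-d / theta) for exponential X."""
    return 1.0 - math.exp(-d / theta)

theta = 2_000.0  # illustrative mean loss
first_step = ler_exponential(500, theta) - ler_exponential(0, theta)
second_step = ler_exponential(1_000, theta) - ler_exponential(500, theta)
# The first $500 of deductible eliminates more expected loss than the next
# $500, which is exactly the nonlinear premium effect described above.
```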
Deductibles vs. coinsurance
Both are risk-sharing tools, but they operate differently:
- A deductible applies first: the policyholder absorbs losses up to $d$.
- Coinsurance applies after the deductible: the policyholder pays a percentage of the remaining loss. With 80/20 coinsurance and a $500 deductible, on a $2,500 loss the insurer pays 80% of the $2,000 excess, i.e. $1,600, and the policyholder pays $900 (the $500 deductible plus 20% of the excess).
These mechanisms are often combined in a single policy.
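A minimal sketch of the combined mechanism (the 80/20 split and $500 deductible are the figures from the example above; the function name is illustrative):

```python
def split_payment(loss, d, insurer_share=0.80):
    """Apply the deductible first, then coinsurance on the remainder.
    Returns (insurer_pays, policyholder_pays)."""
    excess = max(loss - d, 0.0)
    insurer = insurer_share * excess
    return insurer, loss - insurer

insurer, policyholder = split_payment(2_500, 500)
# Matches the worked example: insurer pays 0.8 * 2,000 = 1,600; policyholder pays 900.
```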

Modeling claims with deductibles
When a deductible exists, the insurer only observes claims that exceed the deductible, and the amount paid differs from the ground-up loss. You need to adjust the severity distribution accordingly. There are two key perspectives, and mixing them up is a common source of errors.
Claim size distribution
The ground-up loss $X$ follows some severity distribution (exponential, gamma, Pareto, lognormal, etc.). The choice depends on the data's characteristics: Pareto or lognormal for heavy-tailed data, exponential or gamma for lighter tails.
Per-loss variable (left-truncated distribution)
This models the original loss given that it exceeds the deductible. You're conditioning on $X > d$, so you use a left-truncated distribution:

$$f_{X \mid X > d}(x) = \frac{f(x)}{1 - F(d)}, \qquad x > d$$

Here $f(x)$ is the original PDF and $F(d)$ is the CDF at $d$. The denominator rescales the density so it integrates to 1 over $(d, \infty)$.
This is the distribution of losses given that a claim is reported. (Exam notation also defines a per-loss payment variable $Y^L = \max(X - d, 0)$ over all losses, including the zero payments on losses below the deductible; don't confuse it with the conditional distribution here.)
Per-payment variable (shifted/excess loss distribution)
The insurer doesn't pay $X$; it pays $Y = X - d$ for losses exceeding $d$. This is the per-payment or excess loss variable. Its PDF is:

$$f_Y(y) = \frac{f(y + d)}{1 - F(d)}, \qquad y > 0$$

Notice this combines both the shift (replacing $x$ with $y + d$) and the truncation (dividing by $1 - F(d)$). The variable $Y$ represents the actual dollar amount the insurer pays on each claim.
Be careful: the shifted distribution alone (without the denominator) does not integrate to 1 over $(0, \infty)$ unless $F(d) = 0$. The correct per-payment density always includes the survival function $S(d) = 1 - F(d)$ in the denominator.
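Both densities are easy to check numerically. The sketch below uses an exponential severity, a convenient case because memorylessness makes the per-payment distribution exponential again (the means and deductible are arbitrary illustration values):

```python
import math

def exp_pdf(x, theta):
    return math.exp(-x / theta) / theta

def exp_cdf(x, theta):
    return 1.0 - math.exp(-x / theta)

def truncated_pdf(x, d, theta):
    """Left-truncated density f(x) / (1 - F(d)), valid for x > d."""
    return exp_pdf(x, theta) / (1.0 - exp_cdf(d, theta))

def per_payment_pdf(y, d, theta):
    """Per-payment density f(y + d) / (1 - F(d)), valid for y > 0."""
    return exp_pdf(y + d, theta) / (1.0 - exp_cdf(d, theta))

theta, d = 2_000.0, 500.0
# Memoryless check: for exponential severity, the per-payment density
# coincides with the original exponential density.
for y in (100.0, 1_000.0, 5_000.0):
    assert abs(per_payment_pdf(y, d, theta) - exp_pdf(y, theta)) < 1e-12
```

The two perspectives are linked by a shift of $d$: evaluating the truncated density at $x$ gives the per-payment density at $y = x - d$.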
Mixture of truncated and shifted distributions
In some actuarial models, you combine these perspectives using a mixture. For instance, if a portfolio contains both claims that exceed the deductible and claims that are handled differently (e.g., partial payments or different coverage layers), you might write:

$$f(y) = p \, f_1(y) + (1 - p) \, f_2(y)$$

where $f_1$ is a truncated density and $f_2$ a shifted one. The mixing proportion $p$ represents the probability that a claim falls into the first category. This structure appears when modeling portfolios with heterogeneous deductible arrangements.
Parameter estimation
With mixture models, you're estimating both the mixing proportions and the parameters of each component distribution. This is harder than fitting a single distribution because you don't know which component generated each observation.
Maximum likelihood estimation
The log-likelihood for a $K$-component mixture with $n$ observations $x_1, \dots, x_n$ is:

$$\ell(\Theta) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k \, f_k(x_i \mid \theta_k) \right)$$

where $\Theta = (\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K)$.
The sum inside the logarithm makes direct optimization difficult. You can't simply take the derivative and solve in closed form. That's why the EM algorithm is the standard approach.
Expectation-maximization algorithm
The EM algorithm handles the "missing data" problem (you don't know which component each observation belongs to) by iterating between two steps:
- E-step: Compute the posterior probability $\gamma_{ik}$ that observation $i$ belongs to component $k$, using current parameter estimates:

$$\gamma_{ik} = \frac{\pi_k \, f_k(x_i \mid \theta_k)}{\sum_{j=1}^{K} \pi_j \, f_j(x_i \mid \theta_j)}$$
These values are called responsibilities: they tell you how much each component "claims" each data point.
- M-step: Update the parameters using the responsibilities as weights:

$$\pi_k^{\text{new}} = \frac{1}{n} \sum_{i=1}^{n} \gamma_{ik}, \qquad \theta_k^{\text{new}} = \arg\max_{\theta_k} \sum_{i=1}^{n} \gamma_{ik} \log f_k(x_i \mid \theta_k)$$
- Repeat until the log-likelihood converges (i.e., changes by less than some tolerance between iterations).
A practical warning: EM can converge to a local maximum, so it's standard practice to run it from multiple random starting points and keep the solution with the highest log-likelihood.
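The loop can be sketched for a two-component exponential mixture, where the weighted M-step update for each mean has a closed form ($\theta_k = \sum_i \gamma_{ik} x_i / \sum_i \gamma_{ik}$). This is an illustrative implementation, not production code (no convergence check or multiple restarts):

```python
import math
import random

def em_exp_mixture(data, n_iter=200, seed=0):
    """EM for a two-component exponential mixture; returns (weights, means)."""
    rng = random.Random(seed)
    n = len(data)
    overall_mean = sum(data) / n
    pi = [0.5, 0.5]
    theta = [rng.uniform(0.5, 2.0) * overall_mean for _ in range(2)]
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i][k] under current parameters
        gamma = []
        for x in data:
            dens = [pi[k] * math.exp(-x / theta[k]) / theta[k] for k in range(2)]
            total = sum(dens)
            gamma.append([v / total for v in dens])
        # M-step: weighted updates (closed form for exponential means)
        for k in range(2):
            nk = sum(g[k] for g in gamma)
            pi[k] = nk / n
            theta[k] = sum(g[k] * x for g, x in zip(gamma, data)) / nk
    return pi, theta

# Simulate from a known 0.7/0.3 mixture of exponentials and fit it back.
rng = random.Random(42)
data = [rng.expovariate(1 / 1_000) if rng.random() < 0.7 else rng.expovariate(1 / 10_000)
        for _ in range(5_000)]
pi_hat, theta_hat = em_exp_mixture(data)
```

One useful invariant of this update: after every M-step the fitted mixture mean $\sum_k \pi_k \theta_k$ equals the sample mean exactly, which makes a handy sanity check.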
Bayesian inference for mixture models
Bayesian methods treat the parameters as random variables with prior distributions $p(\Theta)$. After observing data $x = (x_1, \dots, x_n)$, you update to the posterior:

$$p(\Theta \mid x) \propto p(x \mid \Theta) \, p(\Theta)$$
Because the posterior for mixture models is analytically intractable, you sample from it using MCMC methods:
- Gibbs sampling works well when you can derive the full conditional distributions for each parameter block (common with conjugate priors).
- Metropolis-Hastings is more general but requires tuning proposal distributions.
Bayesian inference naturally quantifies parameter uncertainty through the posterior distribution and supports model comparison via Bayes factors.

Model selection
Choosing the right number of components is one of the most important decisions when fitting a mixture model. Too few components underfit; too many overfit.
Criteria for model selection
All standard criteria balance fit against complexity. The key trade-off: adding a component always improves the likelihood, but it also adds parameters that may just be fitting noise.
Akaike information criterion (AIC)
$$\mathrm{AIC} = -2\ell(\hat{\Theta}) + 2p$$

where $p$ is the total number of estimated parameters. Lower AIC is better. AIC tends to favor slightly more complex models and is useful when the goal is prediction rather than identifying the "true" model.
Bayesian information criterion (BIC)
$$\mathrm{BIC} = -2\ell(\hat{\Theta}) + p \ln n$$

where $n$ is the sample size. The penalty term $p \ln n$ grows with sample size, so BIC penalizes complexity more heavily than AIC for any $n \ge 8$ (once $\ln n > 2$). BIC is consistent, meaning it selects the true number of components as $n \to \infty$ (assuming the true model is among the candidates). In practice, BIC tends to choose simpler models than AIC.
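Both criteria are one-liners once you have the maximized log-likelihood. (For a $K$-component mixture of one-parameter severities, $p = 2K - 1$: $K$ component parameters plus $K - 1$ free weights.)

```python
import math

def aic(loglik, p):
    """Akaike information criterion: -2*loglik + 2*p (lower is better)."""
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    """Bayesian information criterion: -2*loglik + p*ln(n) (lower is better)."""
    return -2.0 * loglik + p * math.log(n)

# BIC's penalty exceeds AIC's once ln(n) > 2, i.e. from n = 8 onward:
penalty_gap = bic(0.0, 1, 8) - aic(0.0, 1)
```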
Likelihood ratio tests
For comparing nested models (e.g., $K$ vs. $K + 1$ components), the likelihood ratio statistic is:

$$\Lambda = 2\left(\ell_{K+1} - \ell_K\right)$$

Under the null hypothesis, $\Lambda$ approximately follows a chi-squared distribution with degrees of freedom equal to the difference in parameter counts.
There's a subtlety here: standard regularity conditions for the chi-squared approximation can fail for mixture models because under the null, some parameters are on the boundary of the parameter space (e.g., $\pi_{K+1} = 0$). Adjusted tests or bootstrap methods are sometimes needed.
Simulation of mixture models
Simulation lets you generate synthetic data from a fitted mixture, which is useful for validating estimation methods, stress testing, and computing quantities that are hard to derive analytically.
Generating random variates
To simulate one observation from a -component mixture:
- Draw a component label $Z$ from a categorical distribution with probabilities $\pi_1, \dots, \pi_K$.
- Given $Z = k$, draw $X$ from the component distribution $f_k(x \mid \theta_k)$ using inverse transform sampling, acceptance-rejection, or a built-in generator.
- Repeat for as many observations as needed.
This two-step process directly mirrors the generative interpretation of the mixture model.
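The two-step recipe, sketched in pure Python for a mixture of exponentials (the function name is illustrative):

```python
import random

def simulate_mixture(n, weights, means, seed=1):
    """Two-step simulation: (1) draw a component label, (2) draw from that
    component. Components here are exponentials parameterized by their means."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]  # step 1: label
        out.append(rng.expovariate(1.0 / means[k]))               # step 2: draw
    return out

sample = simulate_mixture(10_000, [0.7, 0.3], [1_000.0, 10_000.0])
```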
Monte Carlo simulation
Once you can generate data from the mixture, you can estimate any quantity by simulation:
- Moments: Compute the sample mean and variance of a large simulated sample to approximate $E[X]$ and $\mathrm{Var}(X)$.
- Tail probabilities: Estimate $P(X > t)$ as the fraction of simulated values exceeding $t$.
- Quantiles: Sort the simulated values and read off the desired percentile (e.g., VaR at the 99th percentile).
The accuracy of Monte Carlo estimates improves with the number of simulations, roughly at a rate of $O(1/\sqrt{N})$, where $N$ is the number of replications.
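A quick Monte Carlo sketch using the exponential-mixture example from earlier (the $20,000 threshold is an arbitrary illustration value):

```python
import random

def exp_mixture_sample(n, weights, means, seed=7):
    """Draw n variates from a mixture of exponentials via two-step sampling."""
    rng = random.Random(seed)
    idx = range(len(weights))
    return [rng.expovariate(1.0 / means[rng.choices(idx, weights=weights)[0]])
            for _ in range(n)]

xs = sorted(exp_mixture_sample(100_000, [0.7, 0.3], [1_000.0, 10_000.0]))
n = len(xs)

mean_hat = sum(xs) / n                       # approximates E[X] = 3,700
tail_hat = sum(x > 20_000 for x in xs) / n   # approximates P(X > 20,000); true value ~0.0406
var_99 = xs[int(0.99 * n)]                   # empirical 99th percentile (VaR 99%)
```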
Assessing model fit
After fitting a mixture model, check whether it actually describes the data well:
- Histogram overlay: Plot the fitted mixture density on top of a histogram of the observed data. Gaps or systematic deviations signal a poor fit.
- Q-Q plot: Compare the quantiles of the observed data against the theoretical quantiles of the fitted mixture. Points should fall near the 45-degree line.
- Formal tests: The Kolmogorov-Smirnov test compares the empirical CDF to the fitted CDF. The chi-squared goodness-of-fit test bins the data and compares observed vs. expected frequencies. Both can flag significant departures from the model, though with large samples even minor deviations become "significant."
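The Kolmogorov-Smirnov statistic itself is simple to compute by hand; a sketch that checks a simulated sample against the mixture it was drawn from:

```python
import math
import random

def mixture_cdf(x, weights, means):
    """CDF of a mixture of exponentials: weighted sum of component CDFs."""
    return sum(w * (1.0 - math.exp(-x / m)) for w, m in zip(weights, means))

def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov statistic: largest gap between the empirical CDF
    and the fitted model CDF, checked on both sides of each jump."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        fx = cdf(x)
        d = max(d, abs((i + 1) / n - fx), abs(fx - i / n))
    return d

weights, means = [0.7, 0.3], [1_000.0, 10_000.0]
rng = random.Random(3)
sample = [rng.expovariate(1.0 / means[0 if rng.random() < weights[0] else 1])
          for _ in range(2_000)]
d_stat = ks_statistic(sample, lambda x: mixture_cdf(x, weights, means))
# Data drawn from the model itself: d_stat should be small, roughly below
# the 5% critical value 1.36 / sqrt(n) ~ 0.030 for n = 2,000.
```

In practice you would compare `d_stat` against a critical value (or a bootstrap distribution, since the parameters were estimated from the same data).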