📊 Actuarial Mathematics Unit 11 Review

11.1 Generalized linear models and regression analysis

Written by the Fiveable Content Team • Last updated August 2025

Generalized linear models expand on traditional linear regression by allowing you to analyze non-normal data, which is extremely common in insurance and finance. They tie together a response variable, a linear predictor, a link function, and a variance function to model complex relationships between variables.

GLMs are central to actuarial work in risk assessment and pricing. Because they accommodate various data distributions, they give you flexible tools for modeling claim frequency, claim severity, and other key metrics.

Fundamentals of generalized linear models

Generalized linear models (GLMs) extend linear regression to handle response variables that don't follow a normal distribution. This makes them essential in actuarial modeling, where you regularly encounter count data (number of claims), binary outcomes (claim filed or not), and continuous positive data (claim amounts). Standard linear regression would give poor results for these types of data because it assumes normally distributed errors with constant variance.

Components of GLMs

Every GLM has four building blocks:

  • Response variable: The dependent variable you're trying to model. It must follow a distribution from the exponential family.
  • Linear predictor: A linear combination of your explanatory variables: $\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$
  • Link function: A function that connects the expected value of the response to the linear predictor. This is what allows GLMs to capture non-linear relationships while keeping the model structure linear in the parameters.
  • Variance function: Describes how the variance of the response relates to its mean. This relationship depends on which distribution you choose.
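
A minimal numeric sketch of how the four components fit together for a Poisson model with a log link (all coefficient and covariate values below are hypothetical):

```python
import math

# Hypothetical fitted coefficients and one policyholder's covariates.
beta = [0.5, 0.2, -0.1]              # intercept, beta_1, beta_2
x = [1.0, 3.0, 2.0]                  # [1, x1, x2]

# Linear predictor: eta = beta_0 + beta_1*x1 + beta_2*x2
eta = sum(b * xi for b, xi in zip(beta, x))

# Link function (log): g(mu) = log(mu) = eta, so mu = exp(eta)
mu = math.exp(eta)

# Variance function (Poisson): the variance equals the mean
variance = mu
```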

Exponential family of distributions

The exponential family is the set of distributions that GLMs can work with. It includes the Normal, Poisson, Binomial, Gamma, and Inverse Gaussian distributions.

All distributions in this family share a common form for their density (or mass) function:

$$f(y; \theta, \phi) = \exp\left(\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right)$$

  • $\theta$ is the natural parameter (related to the mean)
  • $\phi$ is the dispersion parameter (controls the spread)
  • $a(\cdot)$, $b(\cdot)$, and $c(\cdot)$ are specific functions that differ for each distribution

The key property: the mean is $b'(\theta)$ and the variance is $b''(\theta) \cdot a(\phi)$. This is how the variance function connects back to the distribution choice.
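
As a concrete instance, writing the Poisson mass function in this form shows how the pieces line up:

$$P(Y = y) = \frac{e^{-\lambda}\lambda^y}{y!} = \exp\left(y\log\lambda - \lambda - \log y!\right)$$

Matching terms against the general density gives $\theta = \log\lambda$, $b(\theta) = e^\theta$, $a(\phi) = 1$, and $c(y, \phi) = -\log y!$. Then $b'(\theta) = e^\theta = \lambda$ is the mean and $b''(\theta) \cdot a(\phi) = e^\theta = \lambda$ is the variance, recovering the Poisson property that the mean equals the variance.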

Link functions

The link function $g(\cdot)$ maps the expected value $\mu = \mathbb{E}(Y)$ to the linear predictor $\eta$, so that $g(\mu) = \eta$.

Common link functions:

  • Identity link: $g(\mu) = \mu$ — used in ordinary linear regression (Normal response)
  • Log link: $g(\mu) = \log(\mu)$ — used in Poisson regression for count data
  • Logit link: $g(\mu) = \log\left(\frac{\mu}{1-\mu}\right)$ — used in logistic regression for binary outcomes
  • Inverse link: $g(\mu) = \frac{1}{\mu}$ — used in Gamma regression for positive continuous data

Each distribution has a canonical link (the natural pairing), but you're not required to use it. The choice depends on the distribution of your response and how you want to interpret the coefficients.
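
The four links above can be written as plain functions together with their inverses (a small self-contained sketch):

```python
import math

# Common GLM link functions g(mu) paired with their inverses g_inv(eta).
links = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
    "inverse": (lambda mu: 1 / mu, lambda eta: 1 / eta),
}

# Each inverse link maps the linear predictor back to the mean scale:
# g_inv(g(mu)) recovers mu for any valid mu.
for name, (g, g_inv) in links.items():
    mu = 0.3   # valid for all four links (logit requires 0 < mu < 1)
    assert abs(g_inv(g(mu)) - mu) < 1e-12, name
```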

Maximum likelihood estimation in GLMs

Maximum likelihood estimation (MLE) finds the parameter values that make the observed data most probable under the model. For GLMs, this means finding the regression coefficients $\boldsymbol{\beta}$ that maximize the likelihood function. MLE is foundational in actuarial work because it provides a principled way to estimate risk parameters and assess how well a model fits.

Log-likelihood function

The log-likelihood is the natural logarithm of the likelihood function:

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^n \log f(y_i; \theta_i, \phi)$$

where $\boldsymbol{\beta}$ is the vector of regression coefficients. You maximize the log-likelihood instead of the likelihood itself because the logarithm turns products into sums, which is much easier to work with both analytically and computationally.
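
For example, the Poisson log-likelihood can be computed directly (the counts and fitted means below are hypothetical):

```python
import math

def poisson_loglik(y, mu):
    """Poisson log-likelihood: sum over i of y_i*log(mu_i) - mu_i - log(y_i!)."""
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1)
               for yi, mi in zip(y, mu))

# Hypothetical claim counts and two candidate sets of fitted means.
y = [0, 1, 3, 2]
fit_a = [0.5, 1.5, 2.0, 2.0]
fit_b = [0.3, 1.1, 2.8, 1.9]     # closer to y, so higher log-likelihood

assert poisson_loglik(y, fit_b) > poisson_loglik(y, fit_a)
```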

Fisher scoring algorithm

The Fisher scoring algorithm is an iterative method for finding the MLE of the regression coefficients. Here's how it works:

  1. Start with initial estimates of $\boldsymbol{\beta}$ (often from a simpler model or set to zero).
  2. Compute the score vector (the gradient of the log-likelihood) at the current estimates.
  3. Compute the Fisher information matrix, which is the expected value of the negative Hessian of the log-likelihood. This captures the curvature of the likelihood surface.
  4. Update the estimates: $\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + \mathcal{I}^{-1}(\boldsymbol{\beta}^{(t)}) \cdot S(\boldsymbol{\beta}^{(t)})$
  5. Repeat until the estimates converge (i.e., changes between iterations become negligibly small).

Fisher scoring uses the expected information rather than the observed information (as Newton-Raphson does), which tends to make it more numerically stable.

Iteratively reweighted least squares

Iteratively reweighted least squares (IRLS) is an equivalent reformulation of Fisher scoring that recasts GLM estimation as a sequence of weighted least squares problems. At each iteration:

  1. Compute a "working response" based on current estimates.
  2. Compute weights derived from the variance function and current fitted values.
  3. Solve a weighted least squares regression using these working responses and weights.
  4. Update the estimates and repeat.

IRLS converges to the same MLE as Fisher scoring. Most statistical software uses IRLS internally because it leverages efficient least squares algorithms and makes the connection between GLMs and weighted regression transparent.
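
A compact sketch of the IRLS loop for a Poisson GLM with a log link and a single predictor, fitted to hypothetical claim-count data. Real software handles arbitrary design matrices and convergence checks, but the working response and weights are exactly as described above:

```python
import math

def irls_poisson(x, y, iters=25):
    """IRLS for log(mu) = b0 + b1*x with Poisson response.
    Each step uses weights w_i = mu_i and working response
    z_i = eta_i + (y_i - mu_i)/mu_i, then solves the 2x2
    weighted least squares normal equations in closed form."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        w = mu                                        # Poisson: Var(Y) = mu
        z = [e + (yi - mi) / mi for e, yi, mi in zip(eta, y, mu)]
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1

x = [0, 1, 2, 3, 4]      # hypothetical rating factor
y = [1, 1, 2, 4, 6]      # hypothetical claim counts
b0, b1 = irls_poisson(x, y)
```

With the canonical log link, the converged fit reproduces the observed totals: the fitted means satisfy $\sum_i \hat{\mu}_i = \sum_i y_i$.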

Model selection and validation

Choosing the right model structure and verifying its performance are critical steps. Actuaries use model selection criteria and validation techniques to balance complexity against predictive accuracy, ensuring the model is suitable for pricing, reserving, and risk management.

Deviance and likelihood ratio tests

Deviance measures goodness of fit by comparing your fitted model to the saturated model (a model with one parameter per observation, which fits the data perfectly):

$$D = 2\left[\ell(\text{saturated}) - \ell(\text{fitted})\right]$$

A smaller deviance indicates a better fit. To compare two nested models (where the simpler model is a special case of the more complex one), you use a likelihood ratio test:

  1. Compute the deviance for each model.
  2. Take the difference in deviances.
  3. Under the null hypothesis that the simpler model is adequate, this difference follows a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters.

If the test statistic exceeds the critical value, you reject the simpler model in favor of the more complex one.
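
The three steps can be sketched numerically. The deviances below are hypothetical; 3.841 is the 5% chi-squared critical value for 1 degree of freedom:

```python
# Likelihood ratio test between two nested GLMs (illustrative deviances).
dev_simple, dev_complex = 112.4, 107.1   # hypothetical model deviances
lrt_stat = dev_simple - dev_complex      # difference in deviances
df = 1                                   # one extra parameter in the richer model

# 5% critical value of the chi-squared distribution with 1 df.
CHI2_CRIT_5PCT_1DF = 3.841
reject_simple = lrt_stat > CHI2_CRIT_5PCT_1DF
```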


Akaike information criterion (AIC)

AIC balances goodness of fit against model complexity:

$$\text{AIC} = -2\ell(\hat{\boldsymbol{\beta}}) + 2p$$

where $p$ is the number of estimated parameters. Lower AIC values are better. AIC penalizes each additional parameter by 2, so adding a predictor only improves AIC if it increases the log-likelihood by more than 1.

AIC is useful for comparing non-nested models (unlike the likelihood ratio test, which requires nesting).

Bayesian information criterion (BIC)

BIC applies a heavier penalty for model complexity than AIC:

$$\text{BIC} = -2\ell(\hat{\boldsymbol{\beta}}) + p\log(n)$$

where $n$ is the sample size. Because $\log(n) > 2$ for any $n \geq 8$, BIC penalizes additional parameters more aggressively. This means BIC tends to favor simpler models than AIC, especially with large datasets. Lower BIC values are preferred.
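
A small sketch comparing the two criteria on hypothetical fits shows how the penalties diverge:

```python
import math

def aic(loglik, p):
    return -2 * loglik + 2 * p

def bic(loglik, p, n):
    return -2 * loglik + p * math.log(n)

# Hypothetical fits on n = 1000 observations: model B adds one parameter
# and improves the log-likelihood by 1.5.
n = 1000
ll_a, p_a = -520.0, 4
ll_b, p_b = -518.5, 5

# AIC prefers B (the gain of 1.5 exceeds the AIC threshold of 1 per parameter);
# BIC prefers A (the BIC threshold is log(1000)/2, about 3.45 per parameter).
prefer_b_by_aic = aic(ll_b, p_b) < aic(ll_a, p_a)
prefer_b_by_bic = bic(ll_b, p_b, n) < bic(ll_a, p_a, n)
```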

Residual analysis and diagnostics

Residual analysis examines the differences between observed and fitted values to check model assumptions and spot problems. Key diagnostic plots for GLMs:

  • Residuals vs. fitted values: Look for patterns. Random scatter is good; systematic curves suggest missing non-linearity, and fanning patterns suggest heteroscedasticity.
  • Normal Q-Q plot: Points should fall along the diagonal line if the residuals are approximately normal. Relevant mainly for Gaussian GLMs.
  • Scale-location plot: Plots $\sqrt{|\text{standardized residuals}|}$ against fitted values to detect non-constant variance.
  • Cook's distance plot: Flags influential observations that disproportionately affect the fitted model. High Cook's distance values warrant investigation.

These diagnostics help you refine the model, detect assumption violations, and improve prediction reliability.

Poisson regression for count data

Poisson regression models count data, where the response variable represents the number of events in a fixed interval. Actuaries commonly use it to model claim frequency, accident counts, or policy renewal counts.

Poisson distribution and assumptions

The Poisson distribution gives the probability of observing $k$ events when the average rate is $\lambda$:

$$P(Y = k) = \frac{e^{-\lambda}\lambda^k}{k!}$$

A defining property of the Poisson distribution is equidispersion: the mean equals the variance ($\mathbb{E}(Y) = \text{Var}(Y) = \lambda$).

Poisson regression assumes:

  • The response variable follows a Poisson distribution
  • The log of the expected count is linearly related to the predictors
  • Events occur independently of each other

Log-linear models and interpretation

The Poisson regression model uses the log link:

$$\log(\mathbb{E}(Y)) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

Coefficient interpretation works on two scales:

  • Log scale: $\beta_j$ is the change in the log expected count for a one-unit increase in $x_j$, holding other predictors constant.
  • Count scale: $\exp(\beta_j)$ is the multiplicative factor on the expected count. For example, if $\beta_1 = 0.15$, then $\exp(0.15) \approx 1.16$, meaning a one-unit increase in $x_1$ is associated with a 16% increase in the expected count.
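
The count-scale arithmetic from the example above, as a quick check:

```python
import math

# Rate-ratio interpretation of the Poisson coefficient from the example.
beta_1 = 0.15
rate_ratio = math.exp(beta_1)         # multiplicative effect on the expected count
pct_increase = (rate_ratio - 1) * 100 # percent change per one-unit increase

assert abs(rate_ratio - 1.1618) < 1e-3   # roughly a 16% increase
```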

Overdispersion and quasi-Poisson models

Overdispersion occurs when the observed variance exceeds the mean, violating the Poisson equidispersion assumption. This is very common in real insurance data. The consequence: standard errors become too small, leading you to declare predictors significant when they may not be.

Quasi-Poisson models address this by introducing a dispersion parameter $\phi$:

$$\text{Var}(Y) = \phi \cdot \mathbb{E}(Y)$$

When $\phi > 1$, you have overdispersion. The quasi-Poisson model keeps the same mean structure as the standard Poisson model but inflates the standard errors by $\sqrt{\phi}$, producing more honest inference. An alternative approach is the negative binomial model, which explicitly models the extra-Poisson variation.
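
A common way to estimate $\phi$ is the Pearson statistic divided by the residual degrees of freedom. A sketch with hypothetical counts and fitted means:

```python
# Estimating the dispersion parameter from a fitted Poisson model via the
# Pearson statistic: phi_hat = sum((y_i - mu_i)^2 / mu_i) / (n - p).
# Observed counts y and fitted means mu below are hypothetical.
y = [0, 2, 5, 1, 9, 0, 4, 7]
mu = [1.0, 1.5, 3.0, 1.2, 4.0, 0.8, 2.5, 3.5]
p = 2                              # number of estimated coefficients
n = len(y)

pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
phi_hat = pearson / (n - p)        # > 1 indicates overdispersion
se_inflation = phi_hat ** 0.5      # quasi-Poisson standard error factor
```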

Logistic regression for binary outcomes

Logistic regression models binary responses where the outcome is the probability of an event occurring. In actuarial work, typical applications include modeling the probability of a claim being filed, policyholder lapse, or loan default.

The logistic regression model uses the logit link:

$$\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

where $p$ is the probability of the event. The quantity $\frac{p}{1-p}$ is the odds of the event.

Exponentiating a coefficient gives the odds ratio: $\exp(\beta_j)$ represents how the odds change for a one-unit increase in $x_j$, holding other predictors constant.

  • $\exp(\beta_j) > 1$: the predictor increases the odds of the event
  • $\exp(\beta_j) < 1$: the predictor decreases the odds of the event
  • $\exp(\beta_j) = 1$: the predictor has no effect on the odds

For example, if $\beta_1 = 0.30$ for a "prior claim" indicator, then $\exp(0.30) \approx 1.35$, meaning policyholders with a prior claim have 35% higher odds of filing a new claim.
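
The arithmetic behind this example, including a reminder that a 35% increase in odds is not a 35% increase in probability (the baseline probability is hypothetical):

```python
import math

# Odds-ratio arithmetic for the prior-claim example.
beta_1 = 0.30
odds_ratio = math.exp(beta_1)     # about 1.35, i.e. 35% higher odds

# Odds are not probabilities: from a hypothetical 20% baseline probability,
# the odds move from 0.25 to about 0.337, and the implied probability rises
# to about 25%, not to 20% * 1.35 = 27%.
p0 = 0.20
odds0 = p0 / (1 - p0)
odds1 = odds0 * odds_ratio
p1 = odds1 / (1 + odds1)
```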


Interpretation of coefficients

Because the logit function is non-linear, the effect of a predictor on the probability scale depends on the values of all other predictors. To convert from log-odds to probability, use the inverse logit:

$$p = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}$$

To assess the practical impact of a predictor on probability, you can:

  • Compute average marginal effects (the average change in probability across all observations for a one-unit change in the predictor)
  • Predict probabilities at representative covariate values and compare them
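
Both approaches can be sketched with a small hypothetical model (the coefficients and covariate rows below are illustrative):

```python
import math

def inv_logit(eta):
    """Inverse logit: maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical fitted model: logit(p) = -1.0 + 0.3*x1 + 0.5*x2
beta = [-1.0, 0.3, 0.5]
rows = [(0, 0), (1, 0), (2, 1), (3, 1)]   # illustrative covariate rows

def prob(x1, x2):
    return inv_logit(beta[0] + beta[1] * x1 + beta[2] * x2)

# Average marginal effect of a one-unit increase in x1, by finite
# differences averaged over the observed rows.
ame_x1 = sum(prob(x1 + 1, x2) - prob(x1, x2) for x1, x2 in rows) / len(rows)

# Comparison at representative values: x2 = 1 vs x2 = 0 with x1 fixed at 1.
delta_x2 = prob(1, 1) - prob(1, 0)
```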

Receiver operating characteristic (ROC) curves

ROC curves evaluate how well a logistic regression model discriminates between the two outcome classes. The curve plots the true positive rate (sensitivity) on the y-axis against the false positive rate (1 - specificity) on the x-axis across all possible classification thresholds.

The area under the ROC curve (AUC) summarizes discriminatory power:

  • AUC = 0.5: no better than random guessing
  • AUC = 0.7–0.8: acceptable discrimination
  • AUC = 0.8–0.9: good discrimination
  • AUC > 0.9: excellent discrimination

Actuaries use ROC curves to compare competing models, evaluate the sensitivity-specificity trade-off, and select classification thresholds aligned with business objectives (e.g., minimizing false negatives in fraud detection).
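
AUC can also be computed directly from its rank interpretation without tracing the full curve. A small sketch with hypothetical scores:

```python
# AUC as the probability that a randomly chosen positive case receives a
# higher predicted score than a randomly chosen negative case (ties count
# one half). Scores and labels below are illustrative.
def auc(scores, labels):
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]   # hypothetical predicted probabilities
labels = [1,   1,   0,   1,   0,   0]     # true outcomes
model_auc = auc(scores, labels)           # 8 of 9 positive-negative pairs ranked correctly
```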

Gamma regression for continuous positive data

Gamma regression models continuous, strictly positive response variables with right-skewed distributions. Actuaries use it extensively for claim severity, loss amounts, and insurance premiums.

Gamma distribution and assumptions

The Gamma distribution is a continuous distribution for positive values. Its probability density function is:

$$f(y; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} y^{\alpha-1} e^{-\beta y}$$

where $\alpha$ is the shape parameter and $\beta$ is the rate parameter. The mean is $\alpha/\beta$ and the variance is $\alpha/\beta^2$.

Gamma regression assumes:

  • The response variable follows a Gamma distribution
  • The reciprocal of the expected value is linearly related to the predictors (when using the canonical inverse link)
  • The variance is proportional to the square of the mean: $\text{Var}(Y) \propto [\mathbb{E}(Y)]^2$

This last property makes Gamma regression well-suited for data where larger values tend to have more variability, which is typical of insurance claim amounts.

With the canonical inverse link, the Gamma regression model is:

$$\frac{1}{\mathbb{E}(Y)} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

The coefficients represent changes in the reciprocal of the expected value for a one-unit increase in the predictor. This can be unintuitive, which is why many actuaries use the log link instead for Gamma regression. With a log link, interpretation mirrors Poisson regression: $\exp(\beta_j)$ gives the multiplicative effect on the expected claim amount.
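
A quick check of the log-link interpretation, with a hypothetical coefficient chosen so the multiplier is about 1.4:

```python
import math

# Multiplicative interpretation of a Gamma regression coefficient under a
# log link. The coefficient value is hypothetical: exp(0.336) is about 1.4.
beta_under_25 = 0.336
severity_multiplier = math.exp(beta_under_25)   # expected claim cost multiplier
```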

Applications in insurance modeling

Gamma regression is widely used in insurance for pricing and reserving:

  • Auto insurance severity: Modeling the average cost per claim using predictors like driver age, vehicle type, and accident history. For instance, a model might show that drivers under 25 have expected claim costs 1.4 times higher than the baseline group.
  • Health insurance costs: Estimating average cost per claim based on policyholder demographics and medical conditions.
  • Property insurance losses: Predicting loss amounts using property value, geographic location, and construction type.

Accurate severity modeling feeds directly into premium calculations (frequency × severity) and reserve adequacy assessments.

Tweedie regression for compound Poisson-Gamma data

Tweedie regression combines properties of the Poisson and Gamma distributions to model continuous, non-negative data that includes exact zeros. This is particularly valuable in actuarial work because aggregate loss data naturally contains policyholders with zero losses alongside those with positive loss amounts.

Tweedie distribution and properties

The Tweedie distribution is a family within the exponential dispersion models, characterized by a power variance function:

$$\text{Var}(Y) = \phi \cdot [\mathbb{E}(Y)]^p$$

The power parameter $p$ determines which distribution you get:

  • $p = 0$: Normal distribution
  • $p = 1$: Poisson distribution
  • $1 < p < 2$: Compound Poisson-Gamma distribution (the actuarially relevant case)
  • $p = 2$: Gamma distribution
  • $p = 3$: Inverse Gaussian distribution

For $1 < p < 2$, the Tweedie distribution can be interpreted as a Poisson-distributed number of claims, each with a Gamma-distributed severity. This directly mirrors the frequency-severity framework actuaries use.
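
This interpretation can be simulated directly (all parameter values below are illustrative):

```python
import math
import random

# Simulation sketch of the compound Poisson-Gamma interpretation: a Poisson
# number of claims per policy, each with Gamma-distributed severity.
random.seed(0)
lam, shape, scale = 0.5, 2.0, 500.0   # claim rate; severity shape and scale

def poisson_draw(rate):
    # Knuth's multiplication method for a Poisson variate.
    limit, k, prod = math.exp(-rate), 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

def annual_loss():
    n_claims = poisson_draw(lam)
    return sum(random.gammavariate(shape, scale) for _ in range(n_claims))

losses = [annual_loss() for _ in range(1000)]
zero_share = sum(1 for loss in losses if loss == 0) / len(losses)
# Roughly exp(-0.5), about 61%, of simulated policies have zero total loss,
# while the positive losses are continuous: the Tweedie point mass at zero.
```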

Power variance function and parameter estimation

Tweedie regression typically uses the log link:

$$\log(\mathbb{E}(Y)) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

with the variance function $\text{Var}(Y) = \phi \cdot [\mathbb{E}(Y)]^p$.

The power parameter $p$, regression coefficients $\boldsymbol{\beta}$, and dispersion parameter $\phi$ are all estimated from the data. Profile likelihood methods are commonly used to estimate $p$: you fit the model across a grid of $p$ values and select the one that maximizes the likelihood. Statistical software packages (such as R's tweedie and statmod packages) provide tools for this estimation.

The value of $p$ affects how the model handles zero observations and the relative weight given to small versus large claims.

Applications in actuarial science

Tweedie regression is especially useful for modeling pure premium (expected loss per policy), which naturally combines frequency and severity into a single model:

  • Aggregate loss modeling: Rather than building separate frequency and severity models and combining them, Tweedie regression models the total claim amount directly. For a portfolio where 85% of policyholders have zero claims, the Tweedie model handles both the point mass at zero and the continuous positive losses in one framework.
  • Ratemaking: Tweedie GLMs are used in personal lines pricing (auto, home) to predict expected losses per exposure unit, with rating factors such as territory, coverage level, and insured characteristics.
  • Reserving: Tweedie models can be applied to loss triangles to estimate outstanding claim liabilities, accommodating the mix of zero and positive incremental payments.

The advantage over separate frequency-severity models is simplicity and the ability to capture interactions between frequency and severity effects through a single set of covariates.