📊 Actuarial Mathematics Unit 11 Review

11.2 Survival analysis and Cox proportional hazards

Written by the Fiveable Content Team • Last updated August 2025
Survival analysis gives actuaries a rigorous framework for modeling time-to-event data: when does a policyholder die, become disabled, lapse a policy, or default on a loan? These questions sit at the heart of pricing, reserving, and risk management. The Cox proportional hazards model is the workhorse semi-parametric tool for this work, but it fits within a broader ecosystem of nonparametric estimators, parametric distributions, and extensions for competing risks and recurrent events.

Survival analysis fundamentals

Survival analysis studies the time until an event of interest occurs. In actuarial work, that event might be death, disability onset, policy lapse, or loan default. What makes survival analysis distinct from ordinary regression is its ability to handle incomplete observations through censoring and truncation.

Three building blocks define any survival analysis problem: the time origin and scale you choose, how you handle incomplete data, and the mathematical functions (survival and hazard) that describe the event process.

Time origin and scale

Time origin is the clock's starting point. Common choices include birth (for mortality studies), policy inception (for lapse analysis), or diagnosis date (for disability duration). Time scale is the unit you measure in: years, months, or days.

The choice matters more than it might seem. Modeling mortality from birth versus from policy inception can lead to different baseline hazard shapes and different covariate effects. Always match the origin and scale to the research question.

Censoring and truncation

Censoring means you know the individual is in the study, but you don't observe their exact event time:

  • Right-censoring: The study ends (or the person drops out) before the event occurs. This is the most common type.
  • Left-censoring: You know the event happened before a certain time, but not exactly when.
  • Interval-censoring: The event occurred between two observation times, but you don't know exactly when within that interval.

Truncation means certain individuals are systematically excluded from your data:

  • Left-truncation (delayed entry): Individuals enter observation only after surviving to a certain age or time. Pension studies often have this, since you only observe people who survived long enough to enter the plan.
  • Right-truncation: Only individuals who have already experienced the event are included.

Ignoring censoring or truncation introduces bias. Every survival method you'll encounter has been designed specifically to handle these features correctly.

Survival and hazard functions

The survival function S(t) gives the probability of surviving beyond time t:

S(t) = P(T > t)

It starts at S(0) = 1 and decreases toward 0. A steeper drop means events are happening faster.

The hazard function h(t) gives the instantaneous rate of the event at time t, conditional on having survived to t:

h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t}

The hazard is not a probability (it can exceed 1). Think of it as the "intensity" of the event at each moment.

These two functions are linked through the cumulative hazard function H(t):

H(t) = \int_0^t h(u) \, du = -\log S(t)

which gives the equivalent relationship:

S(t) = \exp(-H(t))

and, stated more compactly, the derivative relationship:

h(t) = -\frac{d}{dt} \log S(t)

If you know any one of S(t), h(t), or H(t), you can derive the other two.
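These relationships are easy to verify numerically. The sketch below uses an exponential example with rate lam = 0.2 (an assumed illustrative value, not from the text) and checks that H(t) = −log S(t) and that differentiating −log S(t) recovers the hazard:

```python
import math

# Illustrative exponential example with rate lam = 0.2 (assumed value).
lam = 0.2

def S(t):
    return math.exp(-lam * t)          # survival function

def H(t):
    return -math.log(S(t))             # cumulative hazard: H(t) = -log S(t)

def h(t, eps=1e-6):
    # numerical derivative of -log S(t) recovers the hazard
    return (H(t + eps) - H(t - eps)) / (2 * eps)

print(H(5.0))   # lam * 5 = 1.0 for the exponential
print(h(5.0))   # constant hazard, approximately lam = 0.2
```

For the exponential, all three functions reduce to closed forms (H(t) = λt, h(t) = λ), so every identity can be checked exactly.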

Nonparametric estimation

Nonparametric methods estimate survival and hazard functions directly from the data without assuming any particular distributional form. They're your first step in any analysis: look at the data before imposing structure on it.

Kaplan-Meier estimator

The Kaplan-Meier (KM) estimator (also called the product-limit estimator) estimates S(t) from censored data. At each observed event time t_i:

\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)

where d_i is the number of events at time t_i and n_i is the number at risk just before t_i.

The result is a step function that drops at each event time and stays flat between events. Censored observations reduce the risk set but don't cause a drop in the curve. This is what makes the KM estimator so useful: it extracts survival information even when many observations are censored.
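The product-limit calculation can be written in a few lines. This is a minimal sketch on made-up data, where each observation is a (time, event) pair with event = 1 for an observed event and 0 for right-censoring:

```python
# Toy sample (illustrative values): event=1 is an observed event,
# event=0 is right-censoring.
data = [(2, 1), (3, 0), (4, 1), (5, 1), (6, 0)]

def kaplan_meier(data):
    """Return [(t_i, S_hat at t_i)] at each observed event time."""
    s, curve = 1.0, []
    at_risk = len(data)
    for t in sorted({t for t, _ in data}):
        d = sum(1 for tt, e in data if tt == t and e == 1)  # events at t
        c = sum(1 for tt, e in data if tt == t and e == 0)  # censored at t
        if d:
            s *= 1 - d / at_risk     # product-limit step
            curve.append((t, s))
        at_risk -= d + c             # censoring shrinks the risk set
    return curve

print(kaplan_meier(data))
# steps at t=2 (S=0.8), t=4 (S=0.8*2/3), t=5 (S=0.8*2/3*1/2);
# the censored times 3 and 6 reduce the risk set but cause no drop
```

Notice that the censored observation at t = 3 shrinks the risk set from 4 to 3 before the event at t = 4, which is exactly how censoring contributes information without causing a drop in the curve.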

Nelson-Aalen estimator

The Nelson-Aalen estimator targets the cumulative hazard function H(t) rather than S(t) directly:

\hat{H}(t) = \sum_{t_i \leq t} \frac{d_i}{n_i}

You can convert this to a survival estimate via \hat{S}(t) = \exp(-\hat{H}(t)), which tends to perform slightly better than the KM estimator in small samples. The Nelson-Aalen estimator is also the foundation for several residual diagnostics used with the Cox model.
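The Nelson-Aalen estimator differs from Kaplan-Meier only in accumulating a sum of d_i/n_i instead of a product of (1 − d_i/n_i). A minimal sketch on the same kind of made-up (time, event) data:

```python
import math

# Same illustrative data layout as the KM example: (time, event) pairs.
data = [(2, 1), (3, 0), (4, 1), (5, 1), (6, 0)]

def nelson_aalen(data):
    """Return [(t_i, H_hat at t_i)] at each observed event time."""
    H, curve = 0.0, []
    at_risk = len(data)
    for t in sorted({t for t, _ in data}):
        d = sum(1 for tt, e in data if tt == t and e == 1)
        c = sum(1 for tt, e in data if tt == t and e == 0)
        if d:
            H += d / at_risk         # sum instead of product
            curve.append((t, H))
        at_risk -= d + c
    return curve

curve = nelson_aalen(data)
print(curve)                          # H jumps: 1/5, then +1/3, then +1/2
print(math.exp(-curve[-1][1]))        # survival estimate exp(-H_hat)
```

On the same data, exp(−Ĥ(t)) is always at least as large as the KM estimate, which is one way to see the two estimators' small-sample difference.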

Confidence intervals and bands

Pointwise confidence intervals quantify uncertainty at a single time point. For the KM estimator, Greenwood's formula gives the variance, and you typically compute intervals on the log or complementary log-log scale to keep them within [0, 1].

Simultaneous confidence bands (such as the Hall-Wellner or EP bands) cover the entire survival curve with a specified probability. They're wider than pointwise intervals because they must hold at all time points simultaneously.

Both tools help you assess whether apparent differences between survival curves are real or just noise.

Parametric survival models

Parametric models assume a specific functional form for S(t) or h(t), governed by a small number of parameters. The tradeoff: you gain efficiency and the ability to extrapolate, but you risk bias if the assumed form is wrong. Always check parametric fits against the nonparametric KM curve.

Exponential distribution

The simplest model. It assumes a constant hazard rate λ:

  • Hazard: h(t) = λ
  • Survival: S(t) = exp(−λt)
  • Mean survival time: 1/λ

The exponential distribution has the memoryless property: the probability of surviving an additional time s doesn't depend on how long you've already survived. This is rarely realistic for human mortality but can be reasonable for certain mechanical failures or short observation windows.

Weibull distribution

The Weibull adds a shape parameter α to the exponential's scale parameter λ:

  • Hazard: h(t) = αλ^α t^(α−1)
  • Survival: S(t) = exp(−(λt)^α)

The shape parameter controls the hazard trajectory:

  • α < 1: decreasing hazard (e.g., infant mortality, early policy lapse)
  • α = 1: constant hazard (reduces to exponential)
  • α > 1: increasing hazard (e.g., wear-out failures, aging)

This flexibility makes the Weibull one of the most commonly used parametric models in actuarial work.
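The three hazard regimes can be checked directly from the hazard formula. A minimal sketch with an assumed scale λ = 0.1:

```python
def weibull_hazard(t, alpha, lam):
    """Weibull hazard: h(t) = alpha * lam**alpha * t**(alpha - 1)."""
    return alpha * lam**alpha * t ** (alpha - 1)

lam = 0.1  # illustrative scale parameter

# alpha < 1: hazard decreases over time
print(weibull_hazard(1, 0.5, lam), weibull_hazard(4, 0.5, lam))
# alpha = 1: constant hazard, reduces to the exponential with rate lam
print(weibull_hazard(1, 1.0, lam), weibull_hazard(4, 1.0, lam))
# alpha > 1: hazard increases over time
print(weibull_hazard(1, 2.0, lam), weibull_hazard(4, 2.0, lam))
```

At α = 1 the hazard is exactly λ at every t, confirming the reduction to the exponential model.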

Gompertz distribution

The Gompertz model assumes an exponentially increasing hazard, which fits adult human mortality remarkably well:

  • Hazard: h(t) = α exp(βt)
  • Survival: S(t) = exp((α/β)(1 − exp(βt)))

Here α > 0 sets the initial level of mortality and β > 0 controls how fast it increases with age.

The Gompertz-Makeham extension adds a constant term c to capture age-independent mortality (accidents, etc.):

h(t) = c + α exp(βt)

This is the standard model behind many actuarial life tables for ages roughly 30 to 90.
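The Gompertz-Makeham survival function follows from S(t) = exp(−H(t)) with H(t) = ct + (α/β)(exp(βt) − 1). The sketch below uses illustrative parameter values (not calibrated to any real table) and cross-checks the closed form against numerical integration of the hazard:

```python
import math

# Illustrative Gompertz-Makeham parameters (not from any actual life table)
c, alpha, beta = 0.001, 0.0001, 0.09

def hazard(t):
    return c + alpha * math.exp(beta * t)

def survival(t):
    # closed form: H(t) = c*t + (alpha/beta)*(exp(beta*t) - 1)
    H = c * t + (alpha / beta) * (math.exp(beta * t) - 1)
    return math.exp(-H)

def survival_numeric(t, steps=200_000):
    # midpoint-rule integration of the hazard as an independent check
    dt = t / steps
    H = sum(hazard((i + 0.5) * dt) for i in range(steps)) * dt
    return math.exp(-H)

print(survival(60.0))
print(survival_numeric(60.0))  # should agree closely with the closed form
```

Agreement between the two confirms the closed-form cumulative hazard; with c = 0 the model reduces to pure Gompertz.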

Accelerated failure time models

Accelerated failure time (AFT) models offer a different regression framework from the Cox model. Instead of modeling the hazard multiplicatively, they model the log of survival time as a linear function of covariates:

\log T = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \sigma \epsilon

where ε follows a specified distribution (e.g., extreme value for Weibull, normal for log-normal, logistic for log-logistic).

Coefficients have an intuitive interpretation: exp(β_j) is an acceleration factor. A value of 2 means the covariate doubles the expected survival time; a value of 0.5 means it halves it. AFT models are a natural alternative when the proportional hazards assumption fails.
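The acceleration-factor interpretation drops straight out of the model equation. In this minimal sketch the coefficients b0, b1, sigma and the error draw eps are all assumed illustrative values; holding the error term fixed, moving x from 0 to 1 multiplies the survival time by exp(b1):

```python
import math

# Illustrative AFT coefficients (assumed values, not fitted to data)
b0, b1, sigma, eps = 2.0, 0.693, 0.5, 0.3

def survival_time(x):
    # log T = b0 + b1*x + sigma*eps  =>  T = exp(...)
    return math.exp(b0 + b1 * x + sigma * eps)

ratio = survival_time(1) / survival_time(0)
print(ratio, math.exp(b1))  # the ratio is exactly exp(b1), about 2 here
```

Because b1 ≈ log 2, the covariate roughly doubles survival time, matching the "acceleration factor of 2" reading in the text.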


Cox proportional hazards model

The Cox PH model is the most widely used regression model in survival analysis. It's semi-parametric: it models the effect of covariates parametrically while leaving the baseline hazard completely unspecified.

Model specification and assumptions

The Cox model specifies the hazard for an individual with covariate vector X as:

h(t \mid X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)

where h_0(t) is the baseline hazard (the hazard when all covariates equal zero) and β is the vector of regression coefficients.

The critical assumption is proportional hazards: the hazard ratio between any two individuals is constant over time. Mathematically, for individuals with covariates X_a and X_b:

\frac{h(t \mid X_a)}{h(t \mid X_b)} = \exp(\beta^T(X_a - X_b))

The right side has no t in it. If this ratio actually changes over time, the PH assumption is violated and you need to consider alternatives (stratification, time-dependent coefficients, or AFT models).

Partial likelihood estimation

Because h_0(t) is unspecified, you can't write down a full likelihood. Cox's key insight was the partial likelihood, which eliminates h_0(t) entirely.

At each event time t_i, the partial likelihood contribution is the probability that the individual who actually failed is the one who failed, out of everyone still at risk:

L(\beta) = \prod_{i: \text{event}} \frac{\exp(\beta^T X_i)}{\sum_{j \in R(t_i)} \exp(\beta^T X_j)}

where R(t_i) is the risk set at time t_i. Maximizing this with respect to β yields consistent, asymptotically normal estimates, and the resulting coefficients are interpreted as log-hazard ratios.
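The partial likelihood is small enough to compute by hand on a toy dataset. The sketch below uses three made-up subjects with a single covariate, no censoring and no ties; at β = 0 every member of each risk set is equally likely to fail, so the partial likelihood is (1/3)(1/2)(1/1) = 1/6:

```python
import math

# Toy data, sorted by event time: (time, covariate x); all events observed,
# no ties. Values are illustrative.
subjects = [(1.0, 1.0), (2.0, 0.0), (3.0, 1.0)]

def log_partial_likelihood(beta):
    """Sum over event times of beta*x_i - log(sum of exp(beta*x_j) over
    the risk set R(t_i) = everyone with time >= t_i."""
    ll = 0.0
    for i, (t_i, x_i) in enumerate(subjects):
        risk_set = subjects[i:]          # data is sorted by time
        denom = sum(math.exp(beta * x) for _, x in risk_set)
        ll += beta * x_i - math.log(denom)
    return ll

print(math.exp(log_partial_likelihood(0.0)))  # 1/6 at beta = 0

# Crude grid search for the maximizer; real software uses Newton-Raphson.
beta_hat = max((b / 100 for b in range(-300, 301)),
               key=log_partial_likelihood)
print(beta_hat)
```

The log partial likelihood is concave in β, so the grid search lands near the unique maximizer; exp(beta_hat) is the estimated hazard ratio for a one-unit increase in x.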

Hazard ratio interpretation

The hazard ratio (HR) for covariate X_k is exp(β̂_k):

  • HR > 1: higher hazard (shorter expected survival) for a one-unit increase in X_k
  • HR < 1: lower hazard (longer expected survival)
  • HR = 1: no effect

For example, if smoking status has β̂ = 0.47, then HR = exp(0.47) ≈ 1.60. Smokers face a 60% higher hazard of the event at any given time compared to non-smokers, holding other covariates constant.

For continuous covariates, the HR applies per unit increase. Rescaling the covariate (e.g., age in decades vs. years) changes the coefficient but not the model.

Tied event times

When multiple events share the same recorded time, the partial likelihood needs adjustment. Three common approaches, in order of increasing accuracy and computational cost:

  1. Breslow approximation: Simplest; treats the risk set as unchanged across tied events. Works well when ties are few.
  2. Efron approximation: Adjusts the risk set to account for ties more carefully. Generally preferred as a good balance of accuracy and speed.
  3. Exact partial likelihood: Considers all possible orderings of the tied events. Most accurate but computationally expensive with many ties.

In actuarial data with discrete time units (e.g., monthly records), ties can be frequent, making the choice of method consequential.

Time-dependent covariates

Some covariates change over time: a policyholder's age, cumulative claim amount, or current health status. The Cox model accommodates these by allowing the linear predictor to vary:

h(t \mid X(t)) = h_0(t) \exp(\beta^T X(t))

Including time-dependent covariates relaxes the proportional hazards assumption for those variables, since the hazard ratio now changes as X(t) changes. Implementation requires restructuring the data so that each individual contributes multiple rows, one for each interval during which their covariates are constant.
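The required restructuring is the counting-process (start, stop, event) layout used by most survival software. A minimal sketch for one hypothetical subject whose covariate jumps from 0 to 1 partway through follow-up (all values and field names are illustrative):

```python
# One subject observed on [0, 10): covariate x jumps from 0 to 1 at t = 4,
# and the event occurs at t = 10. Values are illustrative.
change_times = [0.0, 4.0]   # times at which x takes a new value
x_values = [0.0, 1.0]
event_time, event = 10.0, 1

rows = []
for i, start in enumerate(change_times):
    stop = change_times[i + 1] if i + 1 < len(change_times) else event_time
    rows.append({
        "start": start,
        "stop": stop,
        "x": x_values[i],
        # the event flag belongs only to the interval that ends at the event
        "event": event if stop == event_time else 0,
    })

for r in rows:
    print(r)   # two contiguous intervals, event flagged only on the last
```

The intervals must partition the follow-up time with no gaps or overlaps, and only the final interval carries the event indicator; earlier intervals are treated as censored at their stop times.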

Model diagnostics and assessment

Fitting a model is only half the work. You need to verify that the model's assumptions hold and that it fits the data adequately.

Proportional hazards assumption

Several tools exist to check whether the PH assumption holds:

  • Log-minus-log plot: Plot log(−log Ŝ(t)) for different groups. Under PH, these curves should be roughly parallel. Crossing or diverging curves suggest violation.
  • Scaled Schoenfeld residuals: Plot these against time for each covariate. A non-zero slope indicates the covariate's effect changes over time.
  • Grambsch-Therneau test: A formal statistical test based on the correlation between Schoenfeld residuals and time. A significant p-value for a covariate means its hazard ratio is not constant.

If PH is violated, common remedies include stratifying on the offending variable, adding an interaction with time, or switching to an AFT model.

Residual analysis

Three types of residuals are commonly used:

  • Cox-Snell residuals: If the model fits well, these should follow a unit exponential distribution. Plot the Nelson-Aalen estimate of their cumulative hazard against a 45-degree line; departures indicate poor fit.
  • Martingale residuals: Range from −∞ to 1. Useful for detecting the correct functional form of covariates. Plot them against each covariate and look for nonlinear patterns.
  • Deviance residuals: A transformation of martingale residuals that is more symmetrically distributed around zero. Better for identifying outliers.

Goodness-of-fit tests

Three standard tests assess model fit and covariate significance:

  • Likelihood ratio test: Compares nested models by examining twice the difference in log-likelihoods. Most reliable for testing whether a set of covariates improves the model.
  • Wald test: Tests individual coefficients using β̂/SE(β̂). Quick but can be unreliable for large coefficients.
  • Score test: Tests the effect of adding covariates without refitting the model. Useful for screening candidate variables.

All three are asymptotically equivalent but can differ in finite samples. The likelihood ratio test is generally preferred.

Model selection criteria

When comparing non-nested models, use information criteria:

  • AIC (Akaike Information Criterion): AIC = −2ℓ + 2p, where ℓ is the maximized log-likelihood and p is the number of parameters.
  • BIC (Bayesian Information Criterion): BIC = −2ℓ + p log(n), where n is the sample size. BIC penalizes complexity more heavily than AIC.

Lower values indicate a better balance of fit and parsimony. Cross-validation (k-fold or leave-one-out) provides a more direct assessment of predictive performance on unseen data, though it's computationally heavier.
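Both criteria are one-liners once you have the maximized log-likelihood. A minimal sketch with assumed illustrative values (ll, p, n are not from any fitted model):

```python
import math

def aic(loglik, p):
    """AIC = -2*loglik + 2*p"""
    return -2 * loglik + 2 * p

def bic(loglik, p, n):
    """BIC = -2*loglik + p*log(n)"""
    return -2 * loglik + p * math.log(n)

# Illustrative values: a model with log-likelihood -120.5, 4 parameters,
# fitted to n = 500 observations.
ll, p, n = -120.5, 4, 500
print(aic(ll, p), bic(ll, p, n))
```

Since log(500) > 2, the BIC penalty per parameter exceeds the AIC penalty here; in general BIC penalizes complexity more heavily whenever n ≥ 8.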

Competing risks and multistate models

Real actuarial problems rarely involve a single event type. A policyholder might die, lapse, or surrender. A borrower might default, prepay, or reach maturity. Competing risks and multistate models handle these richer event structures.


Cause-specific hazards

The cause-specific hazard for event type k is:

h_k(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t, \text{cause} = k \mid T \geq t)}{\Delta t}

To estimate it, treat all other event types as censored and apply standard methods (KM, Cox) to each cause separately. The overall hazard is the sum of all cause-specific hazards:

h(t) = \sum_k h_k(t)

Be careful with interpretation: a cause-specific hazard ratio from a Cox model tells you about the rate of that specific event, but it doesn't directly tell you about the probability of that event occurring, because competing events also affect that probability.

Cumulative incidence functions

The cumulative incidence function (CIF) for cause k gives the probability of experiencing event k by time t, accounting for competing risks:

F_k(t) = \int_0^t S(u^-) h_k(u) \, du

where S(u⁻) is the overall survival function (probability of no event of any type). The CIFs for all causes sum to 1 − S(t), not to 1 individually.

The Aalen-Johansen estimator generalizes the KM estimator to this setting. Using 1 − KM for each cause separately (treating other causes as censored) overestimates the cumulative incidence, which is a common and consequential mistake.
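The discrete-time version of this construction makes both claims easy to verify numerically. In the sketch below (made-up event counts, two causes, no censoring), each cause's CIF increments by S(t⁻)·d_k/n at every event time, and a naive 1 − KM computed for cause 1 alone comes out larger than the true CIF:

```python
# Illustrative competing-risks data: at each event time, d1 and d2 events
# of causes 1 and 2 out of n at risk. No censoring in this toy example.
events = [(1, 1, 0, 10), (2, 0, 1, 9), (3, 1, 1, 8)]  # (t, d1, d2, n)

S = 1.0              # overall survival (any cause)
cif1 = cif2 = 0.0    # cumulative incidence per cause
km1 = 1.0            # naive per-cause KM, treating cause 2 as censoring

for t, d1, d2, n in events:
    cif1 += S * d1 / n           # F_k increments by S(t-) * d_k / n
    cif2 += S * d2 / n
    S *= 1 - (d1 + d2) / n       # overall survival drops for any event
    km1 *= 1 - d1 / n            # ignores that cause-2 deaths remove risk

print(cif1, cif2, 1 - S)         # the two CIFs sum to 1 - S
print(1 - km1, cif1)             # 1 - KM exceeds the true CIF for cause 1
```

The gap between 1 − km1 and cif1 is exactly the probability mass the naive method wrongly reallocates from competing events back to cause 1.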

Multistate Markov models

Multistate models generalize competing risks by allowing transitions between multiple states, not just from "alive" to various "dead" states. A disability insurance model might have states: Active → Disabled → Recovered → Dead, with transitions possible between several pairs.

Key features:

  • Transition intensities q_ij(t) describe the instantaneous rate of moving from state i to state j.
  • The Markov property assumes future transitions depend only on the current state, not on history.
  • Outputs include state occupation probabilities (probability of being in each state at time t), transition probabilities, and expected sojourn times (how long you stay in a state).

These models are central to disability insurance, long-term care, and pension valuation where individuals move between active, disabled, and deceased states.
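State occupation probabilities follow from the transition intensities via the Kolmogorov forward equations. A minimal sketch for an illness-death model with constant, illustrative intensities (sigma, mu, nu are assumed values), solved by simple Euler stepping:

```python
import math

# Illness-death model with constant intensities (illustrative values):
# Active -> Disabled (sigma), Active -> Dead (mu), Disabled -> Dead (nu).
sigma, mu, nu = 0.05, 0.01, 0.08

def occupation_probs(t, steps=100_000):
    """Euler-step the Kolmogorov forward equations from state Active."""
    dt = t / steps
    pA, pD, pX = 1.0, 0.0, 0.0       # Active, Disabled, Dead
    for _ in range(steps):
        flow_AD = pA * sigma * dt    # Active  -> Disabled
        flow_AX = pA * mu * dt       # Active  -> Dead
        flow_DX = pD * nu * dt       # Disabled -> Dead
        pA -= flow_AD + flow_AX
        pD += flow_AD - flow_DX
        pX += flow_AX + flow_DX
    return pA, pD, pX

pA, pD, pX = occupation_probs(10.0)
print(pA, pD, pX)
# staying Active has the closed form exp(-(sigma + mu) * t)
print(math.exp(-(sigma + mu) * 10.0))
```

Because every outflow from one state is an inflow to another, the three probabilities always sum to 1, and the numerical pA matches the closed-form survival of the Active state.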

Recurrent event analysis

Some events can happen more than once to the same individual: repeated hospitalizations, multiple insurance claims, or recurring disability episodes. Standard survival analysis assumes one event per person, so recurrent event methods are needed.

The core challenge is that events within the same individual are correlated. Someone who files one claim quickly is likely to file the next one quickly too. Three main approaches handle this dependence.

Intensity functions

The intensity function gives the instantaneous event rate at time tt, conditional on the full event history up to that point.

Two extensions of the Cox model are commonly used:

  • Andersen-Gill (AG) model: Treats each event interval as a separate observation in a single Cox model. Assumes a common baseline hazard across all events. Simple but assumes events within an individual are independent given covariates.
  • Prentice-Williams-Peterson (PWP) model: Stratifies the baseline hazard by event number (first event, second event, etc.), allowing the baseline risk to differ for each recurrence. More realistic when the risk profile changes after each event.

Variance-corrected models

Rather than modeling the dependence structure explicitly, variance-corrected models use the standard Cox framework but fix the standard errors.

The robust (sandwich) variance estimator clusters observations by individual and produces standard errors that are valid even when within-individual events are correlated. The point estimates (hazard ratios) stay the same as in a naive analysis, but the confidence intervals and p-values are corrected.

This approach is practical when you care primarily about covariate effects and less about the dependence structure itself.

Frailty models

Frailty models add a random effect Z_i for each individual to capture unobserved heterogeneity:

h(t \mid X_i, Z_i) = Z_i \cdot h_0(t) \exp(\beta^T X_i)

The frailty Z_i is typically assumed to follow a gamma or log-normal distribution with mean 1. Individuals with Z_i > 1 are more "frail" (higher event rates); those with Z_i < 1 are more robust.

Frailty models simultaneously account for within-individual correlation and provide individual-level risk estimates. The variance of the frailty distribution quantifies the degree of unobserved heterogeneity in the population.
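For gamma frailty with mean 1 and variance θ, integrating out Z gives a closed-form marginal survival, S(t) = (1 + θH(t))^(−1/θ), where H(t) is the conditional cumulative hazard. The sketch below uses assumed illustrative values of θ and H and cross-checks the closed form against numerical integration over the gamma density; as θ → 0 the marginal collapses back to exp(−H):

```python
import math

theta = 0.5   # frailty variance (illustrative)
Hval = 0.8    # conditional cumulative hazard at some fixed t (illustrative)

def marginal_survival(H, theta):
    # Laplace transform of a mean-1, variance-theta gamma at s = H
    return (1 + theta * H) ** (-1 / theta)

def marginal_numeric(H, theta, upper=50.0, steps=200_000):
    """E[exp(-Z*H)] by midpoint integration over the gamma(1/theta, theta)
    density, as an independent check of the closed form."""
    k, scale = 1 / theta, theta
    dz = upper / steps
    total = 0.0
    for i in range(steps):
        z = (i + 0.5) * dz
        pdf = z ** (k - 1) * math.exp(-z / scale) / (math.gamma(k) * scale**k)
        total += math.exp(-z * H) * pdf * dz
    return total

print(marginal_survival(Hval, theta), marginal_numeric(Hval, theta))
print(marginal_survival(Hval, 1e-9), math.exp(-Hval))  # theta -> 0 limit
```

The heavier the frailty variance θ, the more the marginal survival exceeds exp(−H) at late times: the frailest individuals fail early, leaving a more robust surviving population.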

Applications in actuarial science

Life insurance pricing and reserving

Mortality modeling is the foundation of life insurance. Actuaries fit parametric models (typically Gompertz-Makeham) to population or insured-lives data to construct life tables. The Cox model helps quantify how underwriting factors (smoking status, BMI, occupation) shift mortality risk. These survival estimates feed directly into premium calculations and reserve requirements.

Pension plan valuation

Pension liabilities depend heavily on how long retirees live. Multistate models capture the transitions between active employment, retirement, disability, and death. Longevity risk (the risk that people live longer than expected) is a major concern, and survival models that incorporate improving mortality trends help actuaries set appropriate funding levels.

Health insurance claims modeling

Health insurers use survival analysis to model both the incidence (when does a claim start?) and duration (how long does it last?) of claims. Recurrent event models handle policyholders with multiple hospitalizations or disability episodes. These models inform premium setting, claims reserving, and the design of benefit structures like elimination periods and maximum benefit durations.

Credit risk modeling

Survival analysis applies naturally to credit risk, where the "event" is default. The Cox PH model assesses how borrower characteristics (credit score, debt-to-income ratio) and macroeconomic conditions (unemployment rate, interest rates) affect default timing. Competing risks models distinguish between default, prepayment, and maturity. These survival-based estimates feed into expected loss calculations, credit scoring systems, and regulatory capital requirements.