Fiveable

📊 Actuarial Mathematics Unit 1 Review

1.3 Random variables and probability distributions

Written by the Fiveable Content Team • Last updated August 2025
Random variables and probability distributions are the foundation of actuarial mathematics. They give you the tools to assign numbers to uncertain outcomes and then describe how likely those outcomes are, which is exactly what actuaries need to quantify risk in insurance and finance.

This section covers the types of random variables, how probability distributions work (including joint, marginal, and conditional distributions), the most common discrete and continuous distributions, moments, transformations, and limit theorems.

Types of random variables

A random variable assigns a numerical value to each outcome of a random experiment. The type of random variable you're working with determines which mathematical tools you'll use, so getting this distinction right matters.

Discrete vs continuous

Discrete random variables take on countable values, like integers or elements of a finite set. Think of things you can count: the number of claims filed in a month, the number of policies sold, or the number of defaults in a portfolio.

Continuous random variables can take any value within an interval. These describe things you measure rather than count: the dollar amount of a claim, the time until a policy lapses, or an insured's age at death.

The distinction matters because discrete and continuous variables use different mathematical machinery (summation vs. integration), and mixing them up will lead to errors.

Probability mass functions

A probability mass function (PMF) describes the probability distribution of a discrete random variable. It tells you the probability that the variable equals each possible value.

  • Notation: $P(X = x)$, where $X$ is the random variable and $x$ is a specific value
  • Every probability must be non-negative, and the sum across all possible values must equal 1: $\sum_x P(X = x) = 1$
  • The binomial and Poisson distributions are common examples of distributions defined by PMFs

Probability density functions

A probability density function (PDF) describes the distribution of a continuous random variable. Unlike a PMF, a PDF does not give you the probability of a single point. Instead, it gives the relative likelihood at each value.

  • Notation: $f(x)$
  • To find the probability that $X$ falls in an interval, you integrate: $P(a \leq X \leq b) = \int_a^b f(x)\, dx$
  • The total area under the PDF must equal 1: $\int_{-\infty}^{\infty} f(x)\, dx = 1$
  • The normal and exponential distributions are common examples

A key point that trips people up: for a continuous random variable, $P(X = x) = 0$ for any single value $x$. Probabilities only make sense over intervals.
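Both normalization rules can be checked numerically. A minimal sketch in Python using only the standard library, with an illustrative $\text{Bin}(10, 0.3)$ PMF and an $\text{Exp}(2)$ interval probability computed from the closed-form CDF (the parameter values here are arbitrary, chosen just for the check):

```python
import math

# Discrete: a Binomial(n=10, p=0.3) PMF must sum to 1 over its support.
n, p = 10, 0.3
pmf_total = sum(math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1))

# Continuous: for Exp(rate=2), P(X = x) is 0 for any single x, but
# P(a <= X <= b) = F(b) - F(a), with F(x) = 1 - exp(-rate * x).
rate = 2.0
F = lambda x: 1 - math.exp(-rate * x)
interval_prob = F(1.0) - F(0.5)   # P(0.5 <= X <= 1)

print(pmf_total)      # ~1.0
print(interval_prob)  # positive, even though each single point has probability 0
```

Note the asymmetry: the discrete check is a finite sum, while the continuous one needs the CDF (or an integral), which is exactly the summation-vs-integration distinction above.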

Probability distributions

Probability distributions are mathematical functions that describe how likely different outcomes are for a random variable. Actuaries rely on them to model everything from claim frequencies to loss severities.

Cumulative distribution functions

The cumulative distribution function (CDF) gives the probability that a random variable is less than or equal to a specific value:

$$F(x) = P(X \leq x)$$

CDFs work for both discrete and continuous random variables:

  • Discrete: $F(x) = \sum_{t \leq x} P(X = t)$
  • Continuous: $F(x) = \int_{-\infty}^{x} f(t)\, dt$

Every CDF has three properties: it's non-decreasing, $\lim_{x \to -\infty} F(x) = 0$, and $\lim_{x \to \infty} F(x) = 1$. CDFs are especially useful for computing probabilities over intervals and for finding quantiles (percentiles) of a distribution.
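A discrete CDF and a quantile lookup can be sketched directly from a PMF. The claim-count PMF below is hypothetical, chosen only to illustrate the definitions:

```python
# Hypothetical PMF for a claim count X.
pmf = {0: 0.50, 1: 0.30, 2: 0.15, 3: 0.05}

def cdf(x):
    # F(x) = sum of P(X = t) over all t <= x
    return sum(p for t, p in pmf.items() if t <= x)

def quantile(q):
    # Smallest x with F(x) >= q -- the usual quantile definition
    # for a discrete random variable.
    for x in sorted(pmf):
        if cdf(x) >= q:
            return x

print(cdf(1))          # 0.8
print(quantile(0.95))  # 2
```

The quantile function inverts the CDF in the "smallest value reaching the target probability" sense, which is how percentiles of discrete distributions are usually defined.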

Joint probability distributions

Joint probability distributions describe the probability of two or more random variables taking on values simultaneously.

  • For discrete variables, the joint PMF is $P(X = x, Y = y)$
  • For continuous variables, the joint PDF is $f(x, y)$, and probabilities are found by integrating over a region

Joint distributions let you model dependent risks. For example, you might use a joint distribution to capture the relationship between claim frequency and claim severity for an insurance policy. The joint CDF is $F(x, y) = P(X \leq x, Y \leq y)$.

Marginal probability distributions

Marginal distributions are obtained from a joint distribution by "summing out" or "integrating out" the other variables:

  • Discrete: $P(X = x) = \sum_y P(X = x, Y = y)$
  • Continuous: $f_X(x) = \int_{-\infty}^{\infty} f(x, y)\, dy$

The result is the distribution of a single variable on its own, ignoring the others. For instance, if you have the joint distribution of claim frequency and severity, the marginal distribution of frequency tells you about frequency alone.

Conditional probability distributions

Conditional distributions describe the probability of one variable given a known value of another:

  • Discrete: $P(X = x \mid Y = y) = \dfrac{P(X = x, Y = y)}{P(Y = y)}$
  • Continuous: $f(x \mid y) = \dfrac{f(x, y)}{f_Y(y)}$

These are essential for updating your model when new information arrives. For example, the distribution of claim severity given that a claim has occurred is a conditional distribution. Conditional distributions connect to joint and marginal distributions through Bayes' theorem.
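The joint → marginal → conditional chain can be traced on a tiny discrete example. The joint PMF below is made up for illustration, pairing a claim indicator $X$ with a severity band $Y$:

```python
# Hypothetical joint PMF of a claim indicator X (0 or 1)
# and a severity band Y ("low" / "high").
joint = {(0, "low"): 0.60, (0, "high"): 0.10,
         (1, "low"): 0.20, (1, "high"): 0.10}

# Marginal of X: sum out Y.
marginal_x = {}
for (x, y), prob in joint.items():
    marginal_x[x] = marginal_x.get(x, 0.0) + prob

# Conditional of Y given X = 1: joint divided by the marginal.
cond_y_given_1 = {y: joint[(1, y)] / marginal_x[1] for y in ("low", "high")}

print(marginal_x)      # P(X=0) = 0.7, P(X=1) = 0.3
print(cond_y_given_1)  # P(Y=low | X=1) = 2/3, P(Y=high | X=1) = 1/3
```

Notice that the conditional distribution sums to 1 on its own: dividing by the marginal renormalizes the slice of the joint distribution where $X = 1$.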

Common discrete distributions

Discrete distributions model counted quantities. Each distribution below has specific assumptions, so choosing the right one depends on the structure of the problem.

Bernoulli distribution

Models a single trial with two outcomes: success (1) or failure (0).

  • Notation: $X \sim \text{Bern}(p)$, where $p$ is the probability of success
  • PMF: $P(X = x) = p^x (1-p)^{1-x}$ for $x \in \{0, 1\}$
  • Mean: $p$; Variance: $p(1-p)$

Use this for binary events, such as whether a policyholder files a claim or not.

Binomial distribution

Models the number of successes in $n$ independent Bernoulli trials, each with the same success probability $p$.

  • Notation: $X \sim \text{Bin}(n, p)$
  • PMF: $P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$ for $x \in \{0, 1, \ldots, n\}$
  • Mean: $np$; Variance: $np(1-p)$

Use this when you have a fixed number of independent trials. For example, out of 100 policies, how many will result in a claim this year?


Poisson distribution

Models the count of events occurring in a fixed interval of time or space, where events happen independently at a constant average rate.

  • Notation: $X \sim \text{Pois}(\lambda)$, where $\lambda$ is the average number of events per interval
  • PMF: $P(X = x) = \dfrac{e^{-\lambda}\lambda^x}{x!}$ for $x \in \{0, 1, 2, \ldots\}$
  • Mean: $\lambda$; Variance: $\lambda$

The Poisson is the workhorse distribution for claim frequency modeling. A notable property: the mean equals the variance. If your data shows the variance significantly exceeding the mean, the Poisson may not be the right fit (consider the negative binomial instead).

Geometric distribution

Models the number of trials until the first success in a sequence of independent Bernoulli trials.

  • Notation: $X \sim \text{Geom}(p)$
  • PMF: $P(X = x) = (1-p)^{x-1}p$ for $x \in \{1, 2, 3, \ldots\}$
  • Mean: $1/p$; Variance: $(1-p)/p^2$

Watch out: some textbooks define the geometric as the number of failures before the first success, which shifts the support to $\{0, 1, 2, \ldots\}$ and changes the PMF to $P(X = x) = (1-p)^x p$. Always check which convention is being used.

Negative binomial distribution

Generalizes the geometric distribution. Models the number of failures before achieving $r$ successes.

  • Notation: $X \sim \text{NB}(r, p)$
  • PMF: $P(X = x) = \binom{x+r-1}{x}(1-p)^x p^r$ for $x \in \{0, 1, 2, \ldots\}$
  • Mean: $r(1-p)/p$; Variance: $r(1-p)/p^2$

In actuarial work, the negative binomial is often used as an alternative to the Poisson for claim counts when overdispersion is present (i.e., the variance exceeds the mean).

Hypergeometric distribution

Models the number of successes in $n$ draws, without replacement, from a finite population of size $N$ containing $K$ successes.

  • Notation: $X \sim \text{HGeom}(N, K, n)$
  • PMF: $P(X = x) = \dfrac{\binom{K}{x}\binom{N-K}{n-x}}{\binom{N}{n}}$ for $x \in \{\max(0, n+K-N), \ldots, \min(n, K)\}$

The key difference from the binomial: sampling is done without replacement, so trials are not independent. Use this when the population is small enough that removing items changes the probabilities meaningfully. As $N$ grows large relative to $n$, the hypergeometric approaches the binomial.
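The convergence to the binomial is easy to see numerically. A sketch with illustrative parameters, holding the success fraction fixed at $p = K/N = 0.2$ while the population grows:

```python
import math

def hypergeom_pmf(N, K, n, x):
    return math.comb(K, x) * math.comb(N - K, n - x) / math.comb(N, n)

def binom_pmf(n, p, x):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, x = 10, 3
for N in (50, 500, 5000):
    K = N // 5   # keep the success fraction fixed at p = 0.2
    print(N, hypergeom_pmf(N, K, n, x), binom_pmf(n, 0.2, x))
```

The gap between the two columns shrinks as $N$ grows, because each draw removes an ever-smaller fraction of the population and the trials become nearly independent.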

Common continuous distributions

Continuous distributions model measured quantities like claim amounts, durations, and financial returns.

Uniform distribution

Every value in the interval $[a, b]$ is equally likely.

  • Notation: $X \sim \text{Unif}(a, b)$
  • PDF: $f(x) = \dfrac{1}{b-a}$ for $x \in [a, b]$
  • Mean: $(a+b)/2$; Variance: $(b-a)^2/12$

Often used when you have no prior information about which values are more likely, or as a building block in simulation (since many random number generators produce uniform variates that get transformed into other distributions).

Normal distribution

The symmetric, bell-shaped distribution that appears throughout statistics and actuarial science, largely because of the central limit theorem.

  • Notation: $X \sim \mathcal{N}(\mu, \sigma^2)$, where $\mu$ is the mean and $\sigma^2$ is the variance
  • PDF: $f(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ for $x \in \mathbb{R}$

The standard normal has $\mu = 0$ and $\sigma^2 = 1$. Any normal variable can be standardized: $Z = (X - \mu)/\sigma$. The normal is used to model aggregate losses (via the CLT), financial returns, and as an approximation for many other distributions when sample sizes are large.

Exponential distribution

Models the waiting time until the next event in a Poisson process.

  • Notation: $X \sim \text{Exp}(\lambda)$, where $\lambda$ is the rate parameter
  • PDF: $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$
  • Mean: $1/\lambda$; Variance: $1/\lambda^2$

The exponential distribution has the memoryless property: $P(X > s + t \mid X > s) = P(X > t)$. This means the probability of waiting an additional $t$ units doesn't depend on how long you've already waited. It's the only continuous distribution with this property.
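The memoryless property follows from the survival function $P(X > x) = e^{-\lambda x}$, and a one-liner confirms it numerically (the rate and waiting times below are arbitrary):

```python
import math

rate = 0.5
survival = lambda x: math.exp(-rate * x)   # P(X > x) for Exp(rate)

s, t = 2.0, 3.0
lhs = survival(s + t) / survival(s)   # P(X > s+t | X > s)
rhs = survival(t)                     # P(X > t)
print(lhs, rhs)  # equal: the elapsed time s cancels out
```

Algebraically, $e^{-\lambda(s+t)}/e^{-\lambda s} = e^{-\lambda t}$, which is exactly why the elapsed time drops out.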

Gamma distribution

Generalizes the exponential. When $\alpha$ is a positive integer, it models the waiting time until the $\alpha$-th event in a Poisson process.

  • Notation: $X \sim \text{Gamma}(\alpha, \beta)$, where $\alpha$ is the shape and $\beta$ is the rate
  • PDF: $f(x) = \dfrac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1}e^{-\beta x}$ for $x > 0$
  • Mean: $\alpha/\beta$; Variance: $\alpha/\beta^2$

Note that $\Gamma(\alpha)$ is the gamma function, which generalizes the factorial: $\Gamma(n) = (n-1)!$ for positive integers. The gamma distribution is widely used for modeling claim amounts and in credibility theory. When $\alpha = 1$, it reduces to the exponential.

Beta distribution

Models a random variable constrained to the interval $(0, 1)$, making it natural for proportions and probabilities.

  • Notation: $X \sim \text{Beta}(\alpha, \beta)$
  • PDF: $f(x) = \dfrac{1}{B(\alpha, \beta)}\, x^{\alpha-1}(1-x)^{\beta-1}$ for $x \in (0, 1)$
  • Mean: $\alpha/(\alpha + \beta)$; Variance: $\alpha\beta/[(\alpha+\beta)^2(\alpha+\beta+1)]$

Here $B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ is the beta function. The beta distribution is very flexible: depending on the parameters, it can be symmetric, left-skewed, right-skewed, U-shaped, or uniform (when $\alpha = \beta = 1$). Actuaries use it for loss ratios and as a prior distribution in Bayesian analysis.

Lognormal distribution

If $\ln(X)$ follows a normal distribution, then $X$ follows a lognormal distribution. This means $X$ is always positive and right-skewed.

  • Notation: $X \sim \text{Lognormal}(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are the mean and variance of $\ln(X)$
  • PDF: $f(x) = \dfrac{1}{x\sqrt{2\pi\sigma^2}}\, e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}$ for $x > 0$
  • Mean: $e^{\mu + \sigma^2/2}$; Variance: $(e^{\sigma^2} - 1)e^{2\mu + \sigma^2}$

The lognormal is one of the most common distributions for modeling individual claim amounts because claims are positive and their distribution is typically right-skewed (many small claims, few very large ones).
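The definition ($X = e^Z$ with $Z$ normal) and the mean formula can be tied together by simulation. A Monte Carlo sketch with arbitrary parameters $\mu = 0$, $\sigma = 0.5$ and a fixed seed for reproducibility:

```python
import math
import random

mu, sigma = 0.0, 0.5
random.seed(42)

# Simulate X = exp(Z) with Z ~ N(mu, sigma^2) and compare the sample mean
# to the closed-form mean exp(mu + sigma^2 / 2).
samples = [math.exp(random.gauss(mu, sigma)) for _ in range(200_000)]
sample_mean = sum(samples) / len(samples)
theoretical = math.exp(mu + sigma**2 / 2)

print(sample_mean, theoretical)  # both ~1.133
```

Note that the mean exceeds $e^{\mu} = 1$: the right skew pulls the mean above the median, which is typical of claim-size data.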


Moments of distributions

Moments are numerical summaries that capture the shape and behavior of a distribution. The first few moments tell you about the center, spread, asymmetry, and tail behavior.

Expected value

The expected value (mean) is the probability-weighted average of all possible values.

  • Notation: $\mathbb{E}[X]$ or $\mu$
  • Discrete: $\mathbb{E}[X] = \sum_{x} x\, P(X = x)$
  • Continuous: $\mathbb{E}[X] = \int_{-\infty}^{\infty} x\, f(x)\, dx$

The expected value is linear: $\mathbb{E}[aX + b] = a\mathbb{E}[X] + b$. Actuaries use it to calculate fair premiums, expected claim costs, and expected profits.

Variance and standard deviation

Variance measures how spread out the distribution is around the mean.

  • Notation: $\text{Var}(X)$ or $\sigma^2$
  • Computed as: $\text{Var}(X) = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$

The second form, $\mathbb{E}[X^2] - (\mathbb{E}[X])^2$, is usually easier to compute. The standard deviation $\sigma = \sqrt{\text{Var}(X)}$ has the same units as $X$, making it more interpretable.

Variance is used to assess volatility of claim amounts, set risk margins, and determine capital requirements.
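The equivalence of the two variance formulas is easy to confirm on a small discrete example, here a fair six-sided die:

```python
# A fair six-sided die: both variance formulas agree.
pmf = {x: 1/6 for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())                 # E[X] = 3.5
var_def = sum((x - mean)**2 * p for x, p in pmf.items())  # E[(X - mu)^2]
ex2 = sum(x**2 * p for x, p in pmf.items())               # E[X^2]
var_short = ex2 - mean**2                                 # E[X^2] - (E[X])^2

print(var_def, var_short)  # both 35/12
```

The shortcut form needs only two sums ($\mathbb{E}[X]$ and $\mathbb{E}[X^2]$) and no second pass over the data, which is why it's the usual computational choice.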

Skewness and kurtosis

Skewness measures asymmetry:

$$\gamma_1 = \mathbb{E}\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]$$

  • $\gamma_1 > 0$: right-skewed (long right tail, common for claim amounts)
  • $\gamma_1 < 0$: left-skewed
  • $\gamma_1 = 0$: symmetric

Excess kurtosis measures tail heaviness relative to the normal distribution:

$$\gamma_2 = \mathbb{E}\left[\left(\frac{X - \mu}{\sigma}\right)^4\right] - 3$$

Subtracting 3 makes the normal distribution the baseline ($\gamma_2 = 0$). Positive excess kurtosis means heavier tails than normal, which signals a higher probability of extreme outcomes. This is critical for actuaries assessing catastrophic risk.

Moment-generating functions

The moment-generating function (MGF) encodes all the moments of a distribution in a single function:

$$M_X(t) = \mathbb{E}[e^{tX}]$$

To extract the $n$-th moment, take the $n$-th derivative and evaluate at $t = 0$:

$$\mathbb{E}[X^n] = M_X^{(n)}(0)$$

MGFs are powerful for three reasons:

  1. If two random variables have the same MGF (and it exists in a neighborhood of 0), they have the same distribution
  2. For independent random variables, $M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$, which simplifies working with sums
  3. They provide a clean way to prove limit theorems

Not every distribution has an MGF (the Cauchy distribution, for example, does not). When the MGF doesn't exist, the characteristic function serves as an alternative.
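The derivative-at-zero rule can be illustrated with the exponential distribution, whose MGF is $M(t) = \lambda/(\lambda - t)$ for $t < \lambda$. A sketch using numerical (central-difference) derivatives, with an arbitrary rate of 2:

```python
# MGF of Exp(rate): M(t) = rate / (rate - t), valid for t < rate.
rate = 2.0
M = lambda t: rate / (rate - t)

h = 1e-5
# Central differences approximate the first and second derivatives at t = 0:
first = (M(h) - M(-h)) / (2 * h)              # ~E[X]   = 1/rate  = 0.5
second = (M(h) - 2 * M(0.0) + M(-h)) / h**2   # ~E[X^2] = 2/rate^2 = 0.5

print(first, second)
```

From these two moments, $\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = 2/\lambda^2 - 1/\lambda^2 = 1/\lambda^2$, matching the exponential's variance above.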

Transformations of random variables

Transformations create new random variables by applying functions to existing ones. This is how you move between distributions and model more complex relationships.

Linear transformations

A linear transformation takes the form $Y = aX + b$, where $a$ and $b$ are constants.

  • Mean: $\mathbb{E}[Y] = a\mathbb{E}[X] + b$
  • Variance: $\text{Var}(Y) = a^2\text{Var}(X)$

Notice that adding a constant $b$ shifts the mean but doesn't affect the variance. Multiplying by $a$ scales both the mean and the standard deviation (by $|a|$), and scales the variance by $a^2$.

Common uses: standardizing a variable ($Z = (X - \mu)/\sigma$), converting units, or adjusting claim amounts for inflation.
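An inflation-adjustment example makes both rules concrete. The claim-size PMF and the adjustment constants below are made up for illustration (3% inflation plus a fixed expense load of 50):

```python
# Hypothetical claim sizes and probabilities.
pmf = {100: 0.5, 200: 0.3, 500: 0.2}
a, b = 1.03, 50.0   # 3% inflation plus a fixed 50 expense load

mean_x = sum(x * p for x, p in pmf.items())
var_x = sum(x**2 * p for x, p in pmf.items()) - mean_x**2

# Transformed variable Y = a*X + b, computed from its own PMF...
pmf_y = {a * x + b: p for x, p in pmf.items()}
mean_y = sum(y * p for y, p in pmf_y.items())
var_y = sum(y**2 * p for y, p in pmf_y.items()) - mean_y**2

# ...matches the linear-transformation rules.
print(mean_y, a * mean_x + b)   # equal
print(var_y, a**2 * var_x)      # equal; b has no effect on the variance
```

The fixed load $b$ shifts every outcome by the same amount, so it moves the mean without changing the spread, exactly as the rules state.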

Functions of random variables

For a general transformation $Y = g(X)$, you need to find the distribution of $Y$.

The CDF method works as follows:

  1. Start with $F_Y(y) = P(g(X) \leq y)$

  2. Rewrite the event $\{g(X) \leq y\}$ in terms of $X$

  3. If $g$ is monotonically increasing: $F_Y(y) = F_X(g^{-1}(y))$

  4. If $g$ is monotonically decreasing: $F_Y(y) = 1 - F_X(g^{-1}(y))$

  5. Differentiate the CDF to get the PDF of $Y$

For a strictly monotonic, differentiable $g$, the PDF can be found directly:

$$f_Y(y) = f_X(g^{-1}(y)) \cdot \left|\frac{d}{dy}g^{-1}(y)\right|$$

This technique is how you derive, for example, that exponentiating a normal variable gives a lognormal variable.
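That normal-to-lognormal derivation can be checked numerically. For $Y = e^X$ with $X \sim \mathcal{N}(0, 1)$, the inverse is $g^{-1}(y) = \ln y$ with derivative $1/y$, and the change-of-variables formula should reproduce the Lognormal(0, 1) PDF from the section above:

```python
import math

# Standard normal PDF.
phi = lambda x: math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

def f_Y_change_of_vars(y):
    # f_X(g^{-1}(y)) * |d/dy g^{-1}(y)| with g(x) = e^x, g^{-1}(y) = ln(y)
    return phi(math.log(y)) * (1.0 / y)

def f_Y_lognormal(y):
    # Lognormal(mu=0, sigma^2=1) PDF, written out directly
    return math.exp(-(math.log(y))**2 / 2) / (y * math.sqrt(2 * math.pi))

for y in (0.5, 1.0, 2.0):
    print(y, f_Y_change_of_vars(y), f_Y_lognormal(y))  # identical columns
```

The Jacobian factor $1/y$ is what stretches the density at small $y$ and compresses it at large $y$, producing the lognormal's right skew.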

Convolutions and sums of random variables

When $X$ and $Y$ are independent, the distribution of $Z = X + Y$ is found through convolution:

  • Discrete: $P(Z = z) = \sum_{x} P(X = x)\,P(Y = z - x)$
  • Continuous: $f_Z(z) = \int_{-\infty}^{\infty} f_X(x)\,f_Y(z - x)\, dx$

In practice, convolution integrals can be difficult to compute directly. The MGF approach is often easier: since $M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$ for independent variables, you can multiply the MGFs and then identify the resulting distribution.

For example, the sum of independent Poisson random variables with parameters $\lambda_1$ and $\lambda_2$ is Poisson with parameter $\lambda_1 + \lambda_2$. You can verify this quickly using MGFs.
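The same Poisson additivity can also be checked by brute-force discrete convolution (the rates below are arbitrary):

```python
import math

def pois_pmf(lam, x):
    return math.exp(-lam) * lam**x / math.factorial(x)

lam1, lam2 = 1.5, 2.5

# Discrete convolution: P(Z = z) = sum_x P(X = x) * P(Y = z - x)
def conv_pmf(z):
    return sum(pois_pmf(lam1, x) * pois_pmf(lam2, z - x) for x in range(z + 1))

for z in range(5):
    print(z, conv_pmf(z), pois_pmf(lam1 + lam2, z))  # columns match
```

The convolution sum runs only over $x \in \{0, \ldots, z\}$ because both counts are non-negative, so those are the only ways to split a total of $z$ events between the two sources.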

Convolutions are central to modeling aggregate losses, where total claims equal the sum of individual claim amounts.

Limit theorems

Limit theorems describe what happens as you work with larger and larger samples. They justify many of the approximations actuaries use in practice.

Law of large numbers

The law of large numbers (LLN) says that the sample mean converges to the population mean as the sample size grows.

For i.i.d. random variables $X_1, X_2, \ldots$ with mean $\mu$:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \to \mu \quad \text{as } n \to \infty$$

There are two versions:

  • Weak LLN: $\bar{X}_n$ converges to $\mu$ in probability
  • Strong LLN: $\bar{X}_n$ converges to $\mu$ almost surely (a stronger guarantee)

The LLN is the theoretical foundation of insurance: with a large enough pool of policyholders, the average claim per policy becomes predictable. This is why insurers can set stable premiums despite individual claims being random.
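The pooling effect can be seen in a quick simulation. Each policy below independently produces a claim with a hypothetical probability $p = 0.1$, and the average claim rate across growing pools settles toward $p$ (the seed fixes the result for reproducibility):

```python
import random

random.seed(0)
p = 0.1   # each policy independently produces a claim with probability 0.1

means = {}
for n in (100, 10_000, 1_000_000):
    # Fraction of n simulated policies that produced a claim.
    means[n] = sum(random.random() < p for _ in range(n)) / n
    print(n, means[n])   # the average claim rate settles toward 0.1
```

At 100 policies the observed rate can be noticeably off; at a million it is pinned near 0.1. This shrinking uncertainty in the average is exactly what lets an insurer price stably.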

Central limit theorem

The central limit theorem (CLT) states that the standardized sum (or average) of a large number of i.i.d. random variables is approximately normally distributed, regardless of the original distribution.

For i.i.d. random variables with mean $\mu$ and variance $\sigma^2$:

$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \to \infty$$

Equivalently, for the sum $S_n = \sum_{i=1}^n X_i$:

$$\frac{S_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1)$$

The CLT is why the normal distribution appears so frequently in practice. Actuaries use it to approximate the distribution of aggregate claims, construct confidence intervals, and perform hypothesis tests. As a rough guideline, the approximation tends to work well for $n \geq 30$, though highly skewed distributions may require larger samples.