
📊Actuarial Mathematics Unit 1 Review


1.4 Expectation, variance, and moments


Written by the Fiveable Content Team • Last updated August 2025

Definition of expectation

Expectation represents the average value a random variable takes over many trials. It collapses an entire probability distribution into a single number, which makes comparing random variables much easier. In actuarial work, expectation is the starting point for pricing insurance products, estimating reserves, and quantifying risk.

Discrete random variables

For a discrete random variable X with probability mass function p(x), the expectation is:

E(X) = \sum_{x} x \cdot p(x)

You multiply each possible outcome by its probability, then sum everything up.

Example: For a fair six-sided die, each face has probability \frac{1}{6}, so:

E(X) = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = 3.5

Notice the expected value doesn't have to be a value the die can actually land on. It's the long-run average.
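The die calculation above takes only a few lines of Python to reproduce; `fractions.Fraction` keeps the arithmetic exact (an illustrative sketch, not part of the original guide):

```python
from fractions import Fraction

# Probability mass function of a fair six-sided die: each face has p = 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E(X) = sum over x of x * p(x)
expectation = sum(x * p for x, p in pmf.items())

print(expectation)  # 7/2, i.e. 3.5
```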

Continuous random variables

For a continuous random variable X with probability density function f(x), the sum becomes an integral:

E(X) = \int_{-\infty}^{\infty} x \cdot f(x) \, dx

The logic is the same: weight each value by how likely it is, but now you integrate over a continuous range instead of summing over discrete outcomes.

Example: For a standard normal distribution (\mu = 0, \sigma = 1), the density is symmetric about zero, so E(X) = 0.

Linearity of expectation

This is one of the most useful properties in all of probability:

E(X + Y) = E(X) + E(Y)

This holds regardless of whether X and Y are independent, dependent, or correlated. More generally, for constants a and b:

E(aX + b) = aE(X) + b

Linearity lets you break complex random variables into simpler pieces, compute each expectation separately, and add the results.
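Both identities are easy to verify by direct enumeration. This sketch checks E(aX + b) = aE(X) + b for a die, and E(X + Y) = E(X) + E(Y) for an independent pair (the property holds for dependent pairs too; independence just makes the joint pmf easy to write down):

```python
from fractions import Fraction
from itertools import product

# Fair six-sided die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
E = lambda f: sum(f(x) * p for x, p in pmf.items())

# E(aX + b) = a*E(X) + b, checked for a = 2, b = 5
a, b = 2, 5
assert E(lambda x: a * x + b) == a * E(lambda x: x) + b

# E(X + Y) = E(X) + E(Y) for two independent dice, via the joint pmf
joint = {(x, y): px * py for (x, px), (y, py) in product(pmf.items(), pmf.items())}
E_sum = sum((x + y) * p for (x, y), p in joint.items())
assert E_sum == 2 * E(lambda x: x)  # 3.5 + 3.5 = 7
```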

Moments of distributions

Moments characterize the shape and properties of a distribution. Think of them as a sequence of numerical summaries: the first moment tells you where the distribution is centered, the second tells you how spread out it is, and higher moments capture asymmetry and tail behavior.

Raw moments

The k-th raw moment (or moment about the origin) of a random variable X is:

\mu_k' = E(X^k) = \int_{-\infty}^{\infty} x^k \cdot f(x) \, dx

For discrete variables, replace the integral with a sum. The first raw moment (k = 1) is just the expectation E(X).

Central moments

The k-th central moment measures deviation from the mean:

\mu_k = E\left((X - \mu)^k\right), \quad \text{where } \mu = E(X)

  • The first central moment is always zero (deviations above and below the mean cancel).
  • The second central moment is the variance: \text{Var}(X) = E\left((X - \mu)^2\right).
  • Higher central moments (k = 3, 4) give skewness and kurtosis information.
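Central moments are simple to compute for a discrete distribution. This sketch does it for the fair die, confirming that the first central moment vanishes and the second is the variance:

```python
from fractions import Fraction

# Fair six-sided die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
mu = sum(x * p for x, p in pmf.items())  # first raw moment = 7/2

def central_moment(k):
    """k-th central moment E((X - mu)^k) of the die, computed exactly."""
    return sum((x - mu) ** k * p for x, p in pmf.items())

print(central_moment(1))  # 0: deviations above and below the mean cancel
print(central_moment(2))  # 35/12: the variance of a fair die
```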

Moment generating functions

The moment generating function (MGF) of X is:

M_X(t) = E(e^{tX})

Why "moment generating"? Because the k-th derivative evaluated at t = 0 gives the k-th raw moment:

M_X^{(k)}(0) = E(X^k)

Two key facts about MGFs:

  • Uniqueness: If two random variables have the same MGF (where it exists in a neighborhood of zero), they have the same distribution.
  • Sums of independent variables: If X and Y are independent, M_{X+Y}(t) = M_X(t) \cdot M_Y(t). This makes MGFs especially powerful for finding the distribution of sums.

Note that the MGF doesn't exist for all distributions (the integral may diverge). When it doesn't, the characteristic function E(e^{itX}) serves a similar role.
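The derivative property can be checked numerically. For an exponential distribution with rate \lambda, the MGF is M(t) = \lambda / (\lambda - t) for t < \lambda, so M'(0) = 1/\lambda and M''(0) = 2/\lambda^2. This sketch differentiates the MGF with finite differences (the rate \lambda = 2 and step size are arbitrary choices):

```python
# MGF of an Exponential(rate=2) variable; known raw moments are
# E(X) = 1/lam = 0.5 and E(X^2) = 2/lam^2 = 0.5.
lam = 2.0
M = lambda t: lam / (lam - t)  # valid for t < lam

h = 1e-4
first = (M(h) - M(-h)) / (2 * h)           # central difference ~ M'(0)
second = (M(h) - 2 * M(0) + M(-h)) / h**2  # second difference ~ M''(0)

print(round(first, 4))   # 0.5  -> E(X)
print(round(second, 4))  # 0.5  -> E(X^2)
```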

Variance and standard deviation

Variance and standard deviation measure how spread out a distribution is around its mean. A small variance means outcomes cluster tightly; a large variance means they're more dispersed.

Definition of variance

\text{Var}(X) = E\left((X - \mu)^2\right) = E(X^2) - \left(E(X)\right)^2

The second form (often called the "computational formula") is usually easier to work with. To use it:

  1. Compute E(X^2) (the second raw moment).
  2. Compute E(X) and square it.
  3. Subtract: \text{Var}(X) = E(X^2) - [E(X)]^2.

Because variance involves squaring, its units are the square of the original variable's units, which can make direct interpretation awkward.
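The three steps above look like this for the fair die (exact arithmetic again, via `Fraction`):

```python
from fractions import Fraction

# Fair six-sided die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

EX  = sum(x * p for x, p in pmf.items())     # step 1 input: E(X) = 7/2
EX2 = sum(x**2 * p for x, p in pmf.items())  # step 1: E(X^2) = 91/6

var = EX2 - EX**2                            # steps 2-3: subtract the squared mean
print(var)  # 35/12, about 2.9167
```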


Variance of linear combinations

For random variables X and Y with constants a and b:

\text{Var}(aX + bY) = a^2\text{Var}(X) + b^2\text{Var}(Y) + 2ab\,\text{Cov}(X,Y)

If X and Y are independent, then \text{Cov}(X,Y) = 0 and this simplifies to:

\text{Var}(aX + bY) = a^2\text{Var}(X) + b^2\text{Var}(Y)

Watch the constants carefully: they get squared when pulled out of variance, unlike expectation where they come out linearly.
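The squared constants can be verified by brute force. This sketch builds the exact distribution of aX + bY for two independent dice and compares its variance with a^2 Var(X) + b^2 Var(Y) (the choice a = 3, b = -2 is arbitrary):

```python
from fractions import Fraction
from itertools import product

# Fair six-sided die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def var_from(dist):
    """Exact variance of a discrete distribution given as {value: probability}."""
    mean = sum(v * p for v, p in dist.items())
    return sum((v - mean) ** 2 * p for v, p in dist.items())

a, b = 3, -2
vX = var_from(pmf)  # 35/12 for one die

# Distribution of aX + bY for two independent dice
combo = {}
for (x, px), (y, py) in product(pmf.items(), pmf.items()):
    v = a * x + b * y
    combo[v] = combo.get(v, Fraction(0)) + px * py

# Constants get squared: Var(3X - 2Y) = (9 + 4) * Var(die)
assert var_from(combo) == (a**2 + b**2) * vX
```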

Standard deviation vs variance

\sigma_X = \sqrt{\text{Var}(X)}

Standard deviation is in the same units as XX, so it's more interpretable. If claim sizes have a mean of $500 and a standard deviation of $120, you can immediately see that typical deviations from the mean are on the order of $120. Variance would give you $14,400, which is harder to contextualize.

Skewness and kurtosis

These higher-order moments describe the shape of a distribution beyond just its center and spread.

Measuring asymmetry with skewness

Skewness quantifies how asymmetric a distribution is:

\text{Skew}(X) = \frac{E\left((X - \mu)^3\right)}{\sigma^3}

  • Positive skewness: The right tail is longer. Most values cluster to the left of the mean. Common in insurance loss distributions where most claims are small but a few are very large.
  • Negative skewness: The left tail is longer.
  • Zero skewness: The distribution is symmetric (e.g., the normal distribution).

Dividing by \sigma^3 makes skewness dimensionless, so you can compare skewness across distributions with different scales.

Measuring tail behavior with kurtosis

Kurtosis captures how heavy the tails are:

\text{Kurt}(X) = \frac{E\left((X - \mu)^4\right)}{\sigma^4}

The normal distribution has a kurtosis of 3. Excess kurtosis subtracts 3 so that the normal serves as the baseline:

\text{Excess Kurt}(X) = \text{Kurt}(X) - 3

  • Excess kurtosis > 0 (leptokurtic): Heavier tails than normal, more prone to extreme values.
  • Excess kurtosis < 0 (platykurtic): Lighter tails than normal.

For actuaries, high kurtosis is a red flag: it signals that extreme losses are more likely than a normal model would predict.
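Both formulas are direct to compute for a discrete distribution. This sketch uses a hypothetical two-point "loss" distribution chosen to mimic the insurance pattern above (most claims small, a few large), and shows it has strong positive skew and positive excess kurtosis:

```python
# Toy loss distribution: 90% chance of a claim of 1, 10% chance of a claim
# of 10. The numbers are hypothetical, picked to produce a long right tail.
pmf = {1: 0.9, 10: 0.1}

mu  = sum(x * p for x, p in pmf.items())              # 1.9
var = sum((x - mu) ** 2 * p for x, p in pmf.items())  # 7.29
sd  = var ** 0.5                                      # 2.7

skew = sum((x - mu) ** 3 * p for x, p in pmf.items()) / sd ** 3
kurt = sum((x - mu) ** 4 * p for x, p in pmf.items()) / sd ** 4

print(round(skew, 3))      # positive: the right tail is longer
print(round(kurt - 3, 3))  # positive excess kurtosis: heavier tails than normal
```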

Comparing distributions using moments

Two distributions can share the same mean and variance but differ in higher moments. For example, the normal and Laplace distributions can be parameterized to have identical means and variances, but the Laplace distribution has excess kurtosis of 3 (kurtosis of 6), meaning it produces more extreme values. Moments give you a systematic way to detect these differences.

Covariance and correlation

These measure the linear relationship between two random variables. They're essential for understanding how risks interact in a portfolio.

Definition of covariance

\text{Cov}(X,Y) = E\left((X - \mu_X)(Y - \mu_Y)\right) = E(XY) - E(X)E(Y)

The computational form E(XY) - E(X)E(Y) is typically easier to calculate. Covariance can be:

  • Positive: X and Y tend to increase together.
  • Negative: When one increases, the other tends to decrease.
  • Zero: No linear association (but there could still be a nonlinear relationship).

Correlation coefficient

Correlation normalizes covariance to a [-1, 1] scale:

\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \, \sigma_Y}

  • \rho = 1: Perfect positive linear relationship.
  • \rho = -1: Perfect negative linear relationship.
  • \rho = 0: No linear relationship.

A common pitfall: zero correlation does not imply independence. Two variables can be uncorrelated yet strongly dependent in a nonlinear way (e.g., Y = X^2 where X is symmetric about zero).
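The Y = X^2 pitfall can be verified exactly. Take X uniform on {-1, 0, 1}: Y is a deterministic function of X (maximal dependence), yet Cov(X, Y) = E(X^3) - E(X)E(X^2) = 0:

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}; Y = X^2 is completely determined by X.
pmf_x = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}

EX  = sum(x * p for x, p in pmf_x.items())         # E(X)   = 0 by symmetry
EY  = sum(x**2 * p for x, p in pmf_x.items())      # E(Y)   = E(X^2) = 2/3
EXY = sum(x * x**2 * p for x, p in pmf_x.items())  # E(XY)  = E(X^3) = 0

cov = EXY - EX * EY
print(cov)  # 0: uncorrelated, despite total dependence
```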


Covariance matrices

For a random vector \mathbf{X} = (X_1, \ldots, X_n), the covariance matrix is:

\Sigma = E\left((\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T\right)

  • Diagonal entries: \Sigma_{ii} = \text{Var}(X_i)
  • Off-diagonal entries: \Sigma_{ij} = \text{Cov}(X_i, X_j)

The matrix is always symmetric and positive semi-definite. In actuarial work, covariance matrices are used to model the joint behavior of multiple lines of business or risk factors simultaneously.
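For a small data set, the matrix can be assembled entry by entry. This sketch builds the 2x2 population covariance matrix of a hypothetical paired sample (dividing by n, not n - 1) and shows the structure described above:

```python
# Hypothetical paired observations, e.g. claims on two lines of business.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0]
n = len(xs)

mx, my = sum(xs) / n, sum(ys) / n

def cov(a, ma, b, mb):
    """Population covariance of two equal-length samples."""
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n

# Diagonal: variances. Off-diagonal: Cov(X, Y). Note the symmetry.
sigma = [[cov(xs, mx, xs, mx), cov(xs, mx, ys, my)],
         [cov(ys, my, xs, mx), cov(ys, my, ys, my)]]

print(sigma)  # [[1.25, 0.75], [0.75, 1.25]]
```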

Conditional expectation

Conditional expectation is the expected value of a random variable given that you know something about another variable. It refines your estimate by incorporating additional information.

Definition of conditional expectation

For continuous random variables, the conditional expectation of Y given X = x is:

E(Y \mid X = x) = \int_{-\infty}^{\infty} y \cdot f_{Y|X}(y \mid x) \, dy

Here f_{Y|X}(y \mid x) is the conditional density of Y given X = x. You can think of this as: "If I fix X at a particular value, what's the average of Y?"

Viewed as a function of X, E(Y \mid X) is itself a random variable.

Law of total expectation

E(Y) = E\left(E(Y \mid X)\right)

This says you can compute E(Y) in two stages:

  1. First compute E(Y \mid X) for each possible value of X.
  2. Then average those conditional expectations over the distribution of X.

This is sometimes called the "tower property" or "iterated expectation." It's extremely useful when direct computation of E(Y) is hard but conditioning on X simplifies things.
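The two-stage recipe is concrete in a discrete setting. This sketch uses a hypothetical two-class risk model (all numbers invented for illustration): condition on risk class, then average the conditional means over the class distribution:

```python
from fractions import Fraction

# Hypothetical model: a policyholder is "low" risk with probability 0.7
# (mean claim 100) or "high" risk with probability 0.3 (mean claim 400).
p_class = {"low": Fraction(7, 10), "high": Fraction(3, 10)}
mean_given_class = {"low": 100, "high": 400}

# E(Y) = E(E(Y | X)): weight each conditional mean by its class probability
EY = sum(p_class[c] * mean_given_class[c] for c in p_class)
print(EY)  # 70 + 120 = 190
```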

Conditional variance

\text{Var}(Y \mid X = x) = E\left((Y - E(Y \mid X = x))^2 \mid X = x\right)

The law of total variance (also called "Eve's law") decomposes unconditional variance into two components:

\text{Var}(Y) = E\left(\text{Var}(Y \mid X)\right) + \text{Var}\left(E(Y \mid X)\right)

  • E\left(\text{Var}(Y \mid X)\right): Average variability within each group defined by X.
  • \text{Var}\left(E(Y \mid X)\right): Variability between group means.

In actuarial contexts, this decomposition appears frequently. For instance, if X represents a policyholder's risk class, the first term captures randomness within each class, and the second captures how much average claims differ across classes.
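Eve's law can be computed directly for a hypothetical two-class model like the one just described (class probabilities, conditional means, and conditional variances are all invented for illustration):

```python
from fractions import Fraction

# Hypothetical risk classes with conditional means and variances.
p = {"low": Fraction(7, 10), "high": Fraction(3, 10)}
m = {"low": 100, "high": 400}        # E(Y | class)
v = {"low": 50**2, "high": 200**2}   # Var(Y | class)

EY = sum(p[c] * m[c] for c in p)  # overall mean, 190

within  = sum(p[c] * v[c] for c in p)             # E(Var(Y | X)): within-class noise
between = sum(p[c] * (m[c] - EY) ** 2 for c in p) # Var(E(Y | X)): spread of class means

print(within + between)  # total Var(Y) by the law of total variance
```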

Applications in actuarial science

Pricing insurance contracts

The expected payout E(L) of a loss random variable L is the foundation of any premium calculation. A simple premium principle is:

\text{Premium} = E(L) + \theta \cdot \text{Var}(L)

where \theta is a risk loading factor. Higher variance (more uncertain losses) leads to a higher premium. Skewness and kurtosis further inform how much extra loading is needed for distributions with heavy tails or asymmetry.
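This premium principle is a one-liner once the moments are in hand. The sketch below prices a hypothetical loss distribution (most policies have no claim); both the pmf and the loading factor \theta = 0.001 are invented for illustration:

```python
# Toy loss distribution: 90% no claim, 8% small claim, 2% large claim.
pmf = {0: 0.90, 1000: 0.08, 10000: 0.02}

EL  = sum(x * p for x, p in pmf.items())     # expected loss = 280
EL2 = sum(x**2 * p for x, p in pmf.items())  # second raw moment
var = EL2 - EL**2                            # computational formula for Var(L)

theta = 0.001                                # hypothetical risk loading factor
premium = EL + theta * var                   # Premium = E(L) + theta * Var(L)
print(round(premium, 2))  # 2281.6: loading dominates because losses are volatile
```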

Calculating reserves

Reserves estimate future liabilities. Actuaries use E(L) to set the central estimate, then examine the variance and higher moments to determine how much additional margin is needed. The law of total expectation is especially useful here: you can condition on claim type, policy year, or development period to build up the overall reserve estimate in stages.

Risk management using moments

  • Variance and standard deviation quantify overall portfolio risk.
  • Skewness flags portfolios where large losses are more likely than large gains.
  • Kurtosis identifies exposure to extreme events that standard deviation alone would understate.
  • Covariance and correlation reveal how different lines of business or risk factors move together, which is critical for diversification. A portfolio of negatively correlated risks has lower total variance than the sum of individual variances.