Definition of expectation
Expectation quantifies the "average" value a random variable takes, weighted by probability. It captures the central tendency, or long-run average, of a random variable and is denoted $E[X]$.
Discrete random variables
For a discrete random variable $X$ with probability mass function $p_X(x)$, the expectation is:
$$E[X] = \sum_{x} x \, p_X(x)$$
You multiply each possible value of $X$ by its probability, then sum everything up.
Example: For a fair six-sided die, each face has probability $\frac{1}{6}$, so:
$$E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5$$
Notice the expected value doesn't have to be a value $X$ can actually take.
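As a quick check, the die calculation can be reproduced with exact rational arithmetic:

```python
from fractions import Fraction

# E[X] for a fair die: weight each face by its probability 1/6 and sum.
outcomes = range(1, 7)
p = Fraction(1, 6)
expected_value = sum(x * p for x in outcomes)
print(expected_value)  # 7/2, i.e., 3.5
```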
Continuous random variables
For a continuous random variable $X$ with probability density function $f_X(x)$, the sum becomes an integral:
$$E[X] = \int_{-\infty}^{\infty} x \, f_X(x) \, dx$$
The logic is the same: weight each value by how likely it is (its density), then integrate over the entire domain.
Example: For a standard normal distribution ($X \sim N(0, 1)$), the density is symmetric about zero, so $E[X] = 0$.
Linearity of expectation
For any constants $a$ and $b$ and random variables $X$ and $Y$:
$$E[aX + bY] = a\,E[X] + b\,E[Y]$$
This holds regardless of whether $X$ and $Y$ are independent. That's what makes linearity so powerful: you can break apart complicated sums without worrying about dependence structure.
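A small sanity check with deliberately dependent variables (here Y is a function of X) shows the identity still holds:

```python
from fractions import Fraction

# X is a fair die roll; Y = 1 if X is even, else 0, so Y is fully determined by X.
p = Fraction(1, 6)
faces = range(1, 7)

E_X = sum(x * p for x in faces)             # 7/2
E_Y = sum((x % 2 == 0) * p for x in faces)  # 1/2

# E[2X + 3Y] computed directly from the joint behavior...
E_sum = sum((2 * x + 3 * (x % 2 == 0)) * p for x in faces)

# ...matches a*E[X] + b*E[Y], despite the dependence.
assert E_sum == 2 * E_X + 3 * E_Y
```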
Law of the unconscious statistician (LOTUS)
LOTUS lets you compute $E[g(X)]$ directly from the distribution of $X$, without first finding the distribution of $g(X)$:
- Discrete: $E[g(X)] = \sum_{x} g(x) \, p_X(x)$
- Continuous: $E[g(X)] = \int_{-\infty}^{\infty} g(x) \, f_X(x) \, dx$
Example: For $g(X) = X^2$, you can find $E[X^2]$ without deriving the distribution of $X^2$:
$$E[X^2] = \sum_{x} x^2 \, p_X(x)$$
This result is used constantly when computing variances.
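For instance, here is LOTUS applied to a fair die to get $E[X^2]$ without touching the distribution of $X^2$:

```python
from fractions import Fraction

# LOTUS for a fair die: E[g(X)] = sum of g(x) * p(x), with g(x) = x**2.
p = Fraction(1, 6)
E_X2 = sum(x**2 * p for x in range(1, 7))
print(E_X2)  # 91/6
```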
Properties of expectation
These properties follow from the definition and are used repeatedly in proofs throughout stochastic processes.
Non-negativity
If $X \ge 0$ (with probability 1), then $E[X] \ge 0$. This follows directly because you're summing or integrating non-negative quantities.
Monotonicity
If $X \le Y$ almost surely (i.e., $P(X \le Y) = 1$), then:
$$E[X] \le E[Y]$$
You can think of this as: if one random variable is always at most as large as another, its average can't exceed the other's average.
Bounds on expectation
The expectation is bounded by the extreme values of the random variable: if $m \le X \le M$, then
$$m \le E[X] \le M$$
Example: If $X$ counts the number of heads in 3 fair coin tosses, then $0 \le X \le 3$, so $0 \le E[X] \le 3$. (The actual value is $1.5$.)
Conditional expectation
Conditional expectation extends expectation to settings where you have partial information. It's one of the most important tools in stochastic processes because it formalizes how predictions update as new information arrives.
Definition and properties
The conditional expectation of $X$ given an event $A$ with $P(A) > 0$ is:
$$E[X \mid A] = \frac{E[X \, \mathbf{1}_A]}{P(A)}$$
where $\mathbf{1}_A$ is the indicator function of event $A$ (equals 1 when $A$ occurs, 0 otherwise). Conditional expectation inherits linearity, non-negativity, and monotonicity from regular expectation.
Example: In a standard deck of 52 cards, assign values Jack = 11, Queen = 12, King = 13. The conditional expected value given that the card is a face card:
$$E[X \mid \text{face card}] = \frac{11 + 12 + 13}{3} = 12$$

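The face-card example can be checked by enumerating the deck and applying the indicator formula directly. Rank values for the non-face cards are an assumption here (Ace = 1); they don't affect the conditional expectation.

```python
from fractions import Fraction

# 52-card deck: ranks 1..13 (Ace = 1 assumed), four suits of each rank.
ranks = [v for v in range(1, 14) for _ in range(4)]
n = len(ranks)

def is_face(v):
    return v >= 11  # Jack = 11, Queen = 12, King = 13

# E[X | A] = E[X * 1_A] / P(A), with A = "card is a face card".
E_X_times_indicator = Fraction(sum(v for v in ranks if is_face(v)), n)
P_A = Fraction(sum(1 for v in ranks if is_face(v)), n)
E_given_face = E_X_times_indicator / P_A
print(E_given_face)  # 12
```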
Tower property (law of iterated expectations)
For random variables $X$ and $Y$:
$$E[E[X \mid Y]] = E[X]$$
In words: if you first compute the expectation of $X$ conditional on $Y$, then average that over all possible values of $Y$, you recover the unconditional expectation of $X$. This is especially useful when the conditional expectation is easier to compute than the unconditional one.
Law of total expectation
For a partition $A_1, A_2, \dots, A_n$ of the sample space:
$$E[X] = \sum_{i=1}^{n} E[X \mid A_i] \, P(A_i)$$
This is the "discrete version" of the tower property. You break the sample space into cases, compute the expectation in each case, and take a weighted average.
Example: A factory produces items that are defective with probability 0.1. Non-defective items cost $10, defective items cost $50. The expected cost per item:
$$E[\text{cost}] = 10 \times 0.9 + 50 \times 0.1 = 9 + 5 = 14$$
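The same case split, as a one-line computation:

```python
# Law of total expectation over the partition {defective, not defective}.
p_defective = 0.1
cost_defective, cost_ok = 50, 10
expected_cost = cost_defective * p_defective + cost_ok * (1 - p_defective)
print(expected_cost)  # 14.0
```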
Variance and standard deviation
While expectation tells you the center, variance and standard deviation tell you how spread out the distribution is around that center.
Definition of variance
Variance is the average squared deviation from the mean:
$$\mathrm{Var}(X) = E\big[(X - E[X])^2\big]$$
A useful computational shortcut:
$$\mathrm{Var}(X) = E[X^2] - (E[X])^2$$
This second form is almost always easier to work with. To use it:
- Compute $E[X]$ (often via LOTUS or linearity).
- Compute $E[X^2]$ (using LOTUS).
- Subtract: $\mathrm{Var}(X) = E[X^2] - (E[X])^2$.
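Applying the shortcut to the fair die, with exact arithmetic:

```python
from fractions import Fraction

# Var(X) = E[X^2] - (E[X])^2 for a fair six-sided die.
p = Fraction(1, 6)
faces = range(1, 7)
E_X = sum(x * p for x in faces)       # 7/2
E_X2 = sum(x**2 * p for x in faces)   # 91/6, via LOTUS
var_X = E_X2 - E_X**2
print(var_X)  # 35/12
```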
Properties of variance
- Non-negativity: $\mathrm{Var}(X) \ge 0$ for any random variable $X$. Variance equals zero only if $X$ is constant with probability 1.
- Scaling: $\mathrm{Var}(aX + b) = a^2 \, \mathrm{Var}(X)$. Note the square on $a$; adding a constant shifts the distribution but doesn't change spread, so $\mathrm{Var}(X + b) = \mathrm{Var}(X)$.
- Additivity for independent variables: If $X$ and $Y$ are independent, $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$. Unlike linearity of expectation, this requires independence (or at least zero covariance).
Standard deviation
Standard deviation is the square root of variance, $\sigma_X = \sqrt{\mathrm{Var}(X)}$. It has the same units as $X$, making it more interpretable than variance. For a normal distribution with mean $\mu$ and standard deviation $\sigma$, the 68-95-99.7 rule applies: approximately 68% of values fall within $\mu \pm \sigma$, 95% within $\mu \pm 2\sigma$, and 99.7% within $\mu \pm 3\sigma$.
Coefficient of variation
The coefficient of variation is defined as $\mathrm{CV} = \sigma / \mu$ (for $\mu \neq 0$). The CV is dimensionless, so it lets you compare variability across random variables with different scales.
Example: Stock A has mean return 10% and standard deviation 5% ($\mathrm{CV} = 0.5$). Stock B has mean return 5% and standard deviation 5% ($\mathrm{CV} = 1$). Even though both have the same absolute spread, Stock B is twice as variable relative to its mean.
Covariance and correlation
These measure the linear relationship between two random variables. They're central to understanding how random variables interact, which matters a great deal in stochastic processes.
Definition of covariance
Covariance measures how two random variables vary together:
$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$$
A computational shortcut analogous to the variance formula:
$$\mathrm{Cov}(X, Y) = E[XY] - E[X]\,E[Y]$$
- Positive covariance: $X$ and $Y$ tend to move in the same direction.
- Negative covariance: they tend to move in opposite directions.
- Zero covariance: no linear association (but they could still be dependent).
Properties of covariance
- Symmetry: $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$
- Linearity in each argument: $\mathrm{Cov}(aX + b, cY + d) = ac \, \mathrm{Cov}(X, Y)$. Constants added to a variable don't affect covariance.
- Self-covariance: $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$
- General variance of a sum: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2 \, \mathrm{Cov}(X, Y)$. This shows why independence (which implies zero covariance) simplifies things.
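The variance-of-a-sum identity can be verified exactly for a pair of dependent variables (here X is a die roll and Y is its square):

```python
from fractions import Fraction

# Exact enumeration over a fair die. X = roll, Y = roll**2 (dependent on X).
p = Fraction(1, 6)
faces = range(1, 7)

def E(g):
    """Expectation of g(roll) under the uniform die distribution."""
    return sum(g(x) * p for x in faces)

E_X, E_Y = E(lambda x: x), E(lambda x: x**2)
var_X = E(lambda x: x**2) - E_X**2
var_Y = E(lambda x: x**4) - E_Y**2
cov_XY = E(lambda x: x**3) - E_X * E_Y

# Var(X + Y) computed directly matches Var(X) + Var(Y) + 2 Cov(X, Y).
var_sum = E(lambda x: (x + x**2) ** 2) - (E_X + E_Y) ** 2
assert var_sum == var_X + var_Y + 2 * cov_XY
```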

Correlation coefficient
Correlation is a normalized version of covariance:
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
always satisfying $-1 \le \rho_{X,Y} \le 1$.
- $\rho = 1$: perfect positive linear relationship
- $\rho = -1$: perfect negative linear relationship
- $\rho = 0$: no linear relationship (uncorrelated)
Keep in mind that uncorrelated does not imply independent, except in special cases (e.g., jointly normal random variables).
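A standard counterexample, checked exactly: X uniform on {-1, 0, 1} and Y = X² are uncorrelated but clearly dependent.

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}, Y = X**2. Y is a deterministic function of X,
# yet their covariance is exactly zero.
p = Fraction(1, 3)
support = [-1, 0, 1]

E_X = sum(x * p for x in support)           # 0
E_Y = sum(x**2 * p for x in support)        # 2/3
E_XY = sum(x * x**2 * p for x in support)   # 0

cov = E_XY - E_X * E_Y
assert cov == 0  # uncorrelated, despite full dependence
```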
Cauchy-Schwarz inequality
This is the inequality that guarantees $|\rho_{X,Y}| \le 1$:
$$|\mathrm{Cov}(X, Y)| \le \sigma_X \, \sigma_Y$$
Equality holds if and only if $Y = aX + b$ for some constants $a$ and $b$ (i.e., $X$ and $Y$ are linearly dependent). The Cauchy-Schwarz inequality appears frequently in proofs throughout probability and statistics.
Moment-generating functions
Moment-generating functions (MGFs) encode all the moments of a distribution into a single function. They're a key tool for identifying distributions and working with sums of independent random variables.
Definition and properties
The MGF of a random variable $X$ is:
$$M_X(t) = E[e^{tX}]$$
provided this expectation exists in a neighborhood of $t = 0$.
Key properties:
- Uniqueness: If two random variables have the same MGF (in a neighborhood of 0), they have the same distribution.
- Affine transformation: $M_{aX + b}(t) = e^{bt} \, M_X(at)$
- Independence and sums: If $X$ and $Y$ are independent, $M_{X+Y}(t) = M_X(t) \, M_Y(t)$. This makes MGFs extremely useful for finding the distribution of sums.
Relationship to expectation and variance
Moments are extracted by differentiating the MGF and evaluating at $t = 0$:
$$E[X] = M_X'(0), \qquad E[X^2] = M_X''(0), \qquad \mathrm{Var}(X) = M_X''(0) - \big(M_X'(0)\big)^2$$
More generally, the $n$-th moment is $E[X^n] = M_X^{(n)}(0)$.
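This can be illustrated numerically: build the MGF of a fair die and approximate its derivatives at $t = 0$ by finite differences (the step size $h$ is an arbitrary choice, assumed small enough for roughly 6-digit accuracy):

```python
import math

# MGF of a fair die: M(t) = (1/6) * sum of exp(t*x) over faces 1..6.
def M(t):
    return sum(math.exp(t * x) for x in range(1, 7)) / 6

h = 1e-4  # finite-difference step
first_moment = (M(h) - M(-h)) / (2 * h)           # ≈ E[X]   = 3.5
second_moment = (M(h) - 2 * M(0) + M(-h)) / h**2  # ≈ E[X^2] = 91/6
variance = second_moment - first_moment**2        # ≈ 35/12
```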
Uniqueness and existence
Not every random variable has an MGF. The MGF exists when $E[e^{tX}] < \infty$ for all $t$ in some open interval around 0. Distributions with light tails (e.g., normal, Poisson, exponential) have MGFs. Heavy-tailed distributions like the Cauchy distribution do not.
When the MGF does exist, it uniquely determines the distribution. This is why "matching MGFs" is a valid proof technique for showing two random variables have the same distribution.
Applications in probability calculations
- Sums of independent random variables: Multiply MGFs, then identify the resulting function. For example, the MGF of a standard normal is $M(t) = e^{t^2/2}$. The sum of $n$ independent standard normals has MGF $\big(e^{t^2/2}\big)^n = e^{nt^2/2}$, which is the MGF of $N(0, n)$.
- Proving limit theorems: MGFs provide one route to proving the central limit theorem and the law of large numbers.
- Identifying distributions: If you compute an MGF and recognize it as belonging to a known family, you've identified the distribution without inverting a transform.
Inequalities involving expectation and variance
These inequalities give you probability bounds using only moments, without needing the full distribution. They get progressively tighter as you use more information.
Markov's inequality
For a non-negative random variable $X$ and any $a > 0$:
$$P(X \ge a) \le \frac{E[X]}{a}$$
This is the weakest of the three inequalities here, but it only requires $X \ge 0$ and a finite mean. It's often used as a stepping stone to derive stronger bounds.
Example: If $E[X] = 5$, then $P(X \ge 10) \le \frac{5}{10} = 0.5$.
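A quick simulation makes the gap visible: for an exponential with mean 5, the true tail probability $P(X \ge 10) = e^{-2} \approx 0.135$ sits well under Markov's bound of 0.5.

```python
import math
import random

random.seed(0)

# X ~ Exponential with mean 5, so E[X] = 5 and Markov gives P(X >= 10) <= 0.5.
n = 100_000
samples = [random.expovariate(1 / 5) for _ in range(n)]

tail_freq = sum(s >= 10 for s in samples) / n
markov_bound = 5 / 10

assert tail_freq <= markov_bound
print(tail_freq, "vs bound", markov_bound, "vs exact", math.exp(-2))
```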
Chebyshev's inequality
For any random variable $X$ with finite mean $\mu$ and variance $\sigma^2$, and any $k > 0$:
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$$
Chebyshev's is stronger than Markov's because it uses both the mean and the variance. It applies to any distribution with finite variance.
Example: For any random variable with finite variance, at least 75% of the probability mass lies within 2 standard deviations of the mean ($P(|X - \mu| \ge 2\sigma) \le \frac{1}{4}$, so at most 25% lies outside).
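An empirical check for one concrete distribution, $X \sim$ Exponential(1), where $\mu = \sigma = 1$:

```python
import random

random.seed(1)

# Chebyshev with k = 2: P(|X - mu| >= 2*sigma) <= 1/4 for any finite-variance X.
# For Exponential(1), mu = sigma = 1 and the true tail is exp(-3), about 0.05.
n = 100_000
samples = [random.expovariate(1.0) for _ in range(n)]

outside = sum(abs(x - 1.0) >= 2.0 for x in samples) / n
assert outside <= 0.25  # well inside Chebyshev's bound
```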
Chernoff bounds
Chernoff bounds provide exponentially decaying tail bounds for sums of independent random variables, making them much tighter than Markov or Chebyshev for large deviations.
The general technique:
- For any $t > 0$, apply Markov's inequality to $e^{tX}$:
$$P(X \ge a) = P\big(e^{tX} \ge e^{ta}\big) \le e^{-ta} \, E[e^{tX}] = e^{-ta} \, M_X(t)$$
- Optimize over $t > 0$ to get the tightest bound.
A common special case (Hoeffding's inequality): for a sum $S_n = X_1 + \dots + X_n$ of independent bounded random variables with $X_i \in [a_i, b_i]$, and any $t > 0$:
$$P\big(S_n - E[S_n] \ge t\big) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right)$$
The exponential decay makes Chernoff bounds far more useful than Chebyshev's inequality when dealing with sums of many independent random variables, which is a common setting in stochastic processes.
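To see the difference concretely, compare the Hoeffding bound with Chebyshev's for 100 fair coin flips and a deviation of $t = 10$ (two standard deviations); the constants come from the formulas above, and the simulation is just a sanity check.

```python
import math
import random

random.seed(2)

# S = sum of n fair coin flips, each X_i in [0, 1]; deviation t above the mean n/2.
n, t = 100, 10

hoeffding = math.exp(-2 * t**2 / n)  # exp(-2), about 0.135
chebyshev = (n / 4) / t**2           # Var(S)/t^2 = 0.25

# Empirical tail frequency for P(S - n/2 >= t).
trials = 20_000
hits = sum(
    sum(random.random() < 0.5 for _ in range(n)) - n / 2 >= t
    for _ in range(trials)
) / trials

assert hits <= hoeffding <= chebyshev  # the exponential bound is the tighter one
```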