🔀Stochastic Processes Unit 1 Review

1.5 Expectation and variance

Written by the Fiveable Content Team • Last updated August 2025

Definition of expectation

Expectation quantifies the "average" value a random variable takes, weighted by probability. It captures the central tendency, or long-run average, of a random variable and is denoted \mathbb{E}[X].

Discrete random variables

For a discrete random variable X with probability mass function p(x), the expectation is:

\mathbb{E}[X] = \sum_{x} x \cdot p(x)

You multiply each possible value of X by its probability, then sum everything up.

Example: For a fair six-sided die, each face has probability 1/6, so:

\mathbb{E}[X] = 1 \cdot \tfrac{1}{6} + 2 \cdot \tfrac{1}{6} + 3 \cdot \tfrac{1}{6} + 4 \cdot \tfrac{1}{6} + 5 \cdot \tfrac{1}{6} + 6 \cdot \tfrac{1}{6} = 3.5

Notice the expected value doesn't have to be a value X can actually take.
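The die calculation can be reproduced in a few lines of Python (a minimal sketch, not from the guide; the pmf dictionary is just this example's data):

```python
from fractions import Fraction

# Fair six-sided die: each face has probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E[X] = sum over values of x * p(x)
expectation = sum(x * p for x, p in pmf.items())
print(expectation)  # 7/2, i.e. 3.5
```

Using `Fraction` keeps the arithmetic exact, so the result is literally 7/2 rather than a float approximation.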

Continuous random variables

For a continuous random variable X with probability density function f(x), the sum becomes an integral:

\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx

The logic is the same: weight each value x by how likely it is (its density), then integrate over the entire domain.

Example: For a standard normal distribution (\mu = 0, \sigma = 1), the density is symmetric about zero, so \mathbb{E}[X] = 0.

Linearity of expectation

For any constants a and b and random variables X and Y:

\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]

This holds regardless of whether X and Y are independent. That's what makes linearity so powerful: you can break apart complicated sums without worrying about dependence structure.
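To see the "no independence required" point concretely, here is a sketch with an invented example: Y = 7 - X (the opposite face of a die) is completely determined by X, yet linearity still holds exactly:

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}  # fair die
a, b = 2, 3

# Y = 7 - X is completely determined by X: maximal dependence.
E_X = sum(x * p for x, p in pmf.items())
E_Y = sum((7 - x) * p for x, p in pmf.items())
E_combo = sum((a * x + b * (7 - x)) * p for x, p in pmf.items())

assert E_combo == a * E_X + b * E_Y  # linearity holds despite dependence
print(E_combo)  # 35/2
```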

Law of the unconscious statistician (LOTUS)

LOTUS lets you compute \mathbb{E}[g(X)] directly from the distribution of X, without first finding the distribution of g(X):

  • Discrete: \mathbb{E}[g(X)] = \sum_{x} g(x) \cdot p(x)
  • Continuous: \mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) \cdot f(x) \, dx

Example: For X \sim N(0,1), you can find \mathbb{E}[X^2] without deriving the distribution of X^2:

\mathbb{E}[X^2] = \int_{-\infty}^{\infty} x^2 \cdot \frac{1}{\sqrt{2\pi}}e^{-x^2/2} \, dx = 1
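A Monte Carlo sketch of the same idea (my own illustration, with an arbitrary seed): average g(x) = x² over draws of X, never touching the distribution of X² itself.

```python
import random

random.seed(42)

# LOTUS in action: estimate E[X^2] for X ~ N(0, 1) by averaging x^2
# over samples of X.
n = 200_000
estimate = sum(random.gauss(0, 1) ** 2 for _ in range(n)) / n
# estimate is close to the exact value 1
```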

This result is used constantly when computing variances.

Properties of expectation

These properties follow from the definition and are used repeatedly in proofs throughout stochastic processes.

Non-negativity

If X \geq 0 (with probability 1), then \mathbb{E}[X] \geq 0. This follows directly because you're summing or integrating non-negative quantities.

Monotonicity

If X \leq Y almost surely (i.e., P(X \leq Y) = 1), then:

\mathbb{E}[X] \leq \mathbb{E}[Y]

You can think of this as: if one random variable is always at most as large as another, its average can't exceed the other's average.

Bounds on expectation

For a bounded random variable, the expectation lies between its smallest and largest possible values:

\min(X) \leq \mathbb{E}[X] \leq \max(X)

Example: If X counts the number of heads in 3 fair coin tosses, then X \in \{0, 1, 2, 3\}, so 0 \leq \mathbb{E}[X] \leq 3. (The actual value is \mathbb{E}[X] = 1.5.)
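The coin-toss example can be checked directly from the binomial pmf (a short sketch, not from the guide):

```python
from math import comb

# X = number of heads in 3 fair coin tosses: Binomial(3, 1/2).
n, p = 3, 0.5
pmf = {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

mean = sum(k * q for k, q in pmf.items())
print(mean)  # 1.5
assert min(pmf) <= mean <= max(pmf)  # 0 <= E[X] <= 3
```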

Conditional expectation

Conditional expectation extends expectation to settings where you have partial information. It's one of the most important tools in stochastic processes because it formalizes how predictions update as new information arrives.

Definition and properties

The conditional expectation of X given an event A with P(A) > 0 is:

\mathbb{E}[X \mid A] = \frac{\mathbb{E}[X \cdot \mathbf{1}_A]}{P(A)}

where \mathbf{1}_A is the indicator function of event A (equals 1 when A occurs, 0 otherwise). Conditional expectation inherits linearity, non-negativity, and monotonicity from regular expectation.

Example: In a standard deck of 52 cards, assign values Jack = 11, Queen = 12, King = 13. The conditional expected value given that the card is a face card:

\mathbb{E}[X \mid \text{face card}] = \frac{11 + 12 + 13}{3} = 12
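The same answer falls out of the indicator formula E[X · 1_A]/P(A) applied to a uniform draw from the deck (a sketch; card values 1–10 at face value are my assumption for the non-face cards, which the conditioning makes irrelevant anyway):

```python
from fractions import Fraction

# One value per card: 1..10 at face value, Jack = 11, Queen = 12, King = 13,
# four suits of each; uniform draw from 52 cards.
values = [v for v in range(1, 14) for _ in range(4)]
p = Fraction(1, 52)

face = lambda v: v >= 11  # indicator of the event A = "face card"

numerator = sum(v * p for v in values if face(v))  # E[X * 1_A]
P_A = sum(p for v in values if face(v))            # P(A) = 12/52
print(numerator / P_A)  # 12
```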


Tower property (law of iterated expectations)

For random variables X and Y:

\mathbb{E}[\mathbb{E}[X \mid Y]] = \mathbb{E}[X]

In words: if you first compute the expectation of X conditional on Y, then average that over all possible values of Y, you recover the unconditional expectation of X. This is especially useful when the conditional expectation is easier to compute than the unconditional one.

Law of total expectation

For a partition \{A_1, A_2, \ldots, A_n\} of the sample space:

\mathbb{E}[X] = \sum_{i=1}^{n} \mathbb{E}[X \mid A_i] \cdot P(A_i)

This is the "discrete version" of the tower property. You break the sample space into cases, compute the expectation in each case, and take a weighted average.

Example: A factory produces items that are defective with probability 0.1. Non-defective items cost $10, defective items cost $50. The expected cost per item:

\mathbb{E}[X] = 10 \cdot 0.9 + 50 \cdot 0.1 = 14
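As code, the case-by-case weighted average looks like this (a sketch of the factory example; exact arithmetic via `Fraction` avoids float noise):

```python
from fractions import Fraction

# Partition by {defective, non-defective} and take a weighted average.
probs = {"defective": Fraction(1, 10), "ok": Fraction(9, 10)}
cost = {"defective": 50, "ok": 10}

expected_cost = sum(cost[a] * probs[a] for a in probs)
print(expected_cost)  # 14
```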

Variance and standard deviation

While expectation tells you the center, variance and standard deviation tell you how spread out the distribution is around that center.

Definition of variance

\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]

Variance is the average squared deviation from the mean. A useful computational shortcut:

\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2

This second form is almost always easier to work with. To use it:

  1. Compute \mathbb{E}[X] (often via LOTUS or linearity).
  2. Compute \mathbb{E}[X^2] (using LOTUS).
  3. Subtract: \text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2.
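The three steps, run on the fair-die example from earlier (a sketch; the die is my choice of illustration):

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}  # fair die

E_X = sum(x * p for x, p in pmf.items())      # step 1: E[X] = 7/2
E_X2 = sum(x**2 * p for x, p in pmf.items())  # step 2: E[X^2] = 91/6 (LOTUS)
var = E_X2 - E_X**2                           # step 3: the shortcut
print(var)  # 35/12
```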

Properties of variance

  • Non-negativity: \text{Var}(X) \geq 0 for any random variable X. Variance equals zero only if X is constant with probability 1.
  • Scaling: \text{Var}(aX) = a^2 \text{Var}(X). Note the square on a; adding a constant shifts the distribution but doesn't change spread, so \text{Var}(aX + b) = a^2 \text{Var}(X).
  • Additivity for independent variables: If X and Y are independent, \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y). Unlike linearity of expectation, this requires independence (or at least zero covariance).

Standard deviation

\sigma_X = \sqrt{\text{Var}(X)}

Standard deviation has the same units as X, making it more interpretable than variance. For a normal distribution with mean \mu and standard deviation \sigma, the 68-95-99.7 rule applies: approximately 68% of values fall within \mu \pm \sigma, 95% within \mu \pm 2\sigma, and 99.7% within \mu \pm 3\sigma.

Coefficient of variation

CV = \frac{\sigma_X}{\mathbb{E}[X]}

The CV is dimensionless (and defined only when \mathbb{E}[X] \neq 0), so it lets you compare variability across random variables with different scales.

Example: Stock A has mean return 10% and standard deviation 5% (CV = 0.5). Stock B has mean return 5% and standard deviation 5% (CV = 1.0). Even though both have the same absolute spread, Stock B is twice as variable relative to its mean.

Covariance and correlation

These measure the linear relationship between two random variables. They're central to understanding how random variables interact, which matters a great deal in stochastic processes.

Definition of covariance

\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]

A computational shortcut analogous to the variance formula:

\text{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]

  • Positive covariance: X and Y tend to move in the same direction.
  • Negative covariance: they tend to move in opposite directions.
  • Zero covariance: no linear association (but they could still be dependent).

Properties of covariance

  • Symmetry: \text{Cov}(X, Y) = \text{Cov}(Y, X)
  • Linearity in each argument: \text{Cov}(aX + b, Y) = a \cdot \text{Cov}(X, Y). Constants added to a variable don't affect covariance.
  • Self-covariance: \text{Cov}(X, X) = \text{Var}(X)
  • General variance of a sum: \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y). This shows why independence (which implies zero covariance) simplifies things.
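These identities can be verified exactly on any small joint pmf (a sketch with an invented dependent pair: X a fair die, Y its parity):

```python
from fractions import Fraction

# Joint pmf of a dependent pair: X is a fair die, Y = X % 2 (parity of X).
joint = {(x, x % 2): Fraction(1, 6) for x in range(1, 7)}

def E(g):
    """Expectation of g(X, Y) under the joint pmf."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

E_X, E_Y = E(lambda x, y: x), E(lambda x, y: y)
cov = E(lambda x, y: x * y) - E_X * E_Y
var_X = E(lambda x, y: x**2) - E_X**2
var_Y = E(lambda x, y: y**2) - E_Y**2
var_sum = E(lambda x, y: (x + y) ** 2) - (E_X + E_Y) ** 2

assert var_sum == var_X + var_Y + 2 * cov  # general variance of a sum
print(cov)  # -1/4
```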

Correlation coefficient

\rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}

Correlation is a normalized version of covariance, always satisfying -1 \leq \rho_{X,Y} \leq 1.

  • \rho = 1: perfect positive linear relationship
  • \rho = -1: perfect negative linear relationship
  • \rho = 0: no linear relationship (uncorrelated)

Keep in mind that uncorrelated does not imply independent, except in special cases (e.g., jointly normal random variables).

Cauchy-Schwarz inequality

(\mathbb{E}[XY])^2 \leq \mathbb{E}[X^2] \cdot \mathbb{E}[Y^2]

This is the inequality that guarantees |\rho_{X,Y}| \leq 1. Equality holds if and only if Y = cX almost surely for some constant c (i.e., X and Y are linearly dependent). The Cauchy-Schwarz inequality appears frequently in proofs throughout probability and statistics.

Moment-generating functions

Moment-generating functions (MGFs) encode all the moments of a distribution into a single function. They're a key tool for identifying distributions and working with sums of independent random variables.

Definition and properties

The MGF of a random variable X is:

M_X(t) = \mathbb{E}[e^{tX}]

provided this expectation exists in a neighborhood of t = 0.

Key properties:

  • Uniqueness: If two random variables have the same MGF (in a neighborhood of 0), they have the same distribution.
  • Affine transformation: M_{aX+b}(t) = e^{bt} M_X(at)
  • Independence and sums: If X and Y are independent, M_{X+Y}(t) = M_X(t) \cdot M_Y(t). This makes MGFs extremely useful for finding the distribution of sums.

Relationship to expectation and variance

Moments are extracted by differentiating the MGF and evaluating at t = 0:

  1. \mathbb{E}[X] = M'_X(0)
  2. \mathbb{E}[X^2] = M''_X(0)
  3. \text{Var}(X) = M''_X(0) - (M'_X(0))^2

More generally, the n-th moment is \mathbb{E}[X^n] = M_X^{(n)}(0).
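A numerical sketch of this recipe (my own example: the Bernoulli(p) MGF is M(t) = 1 - p + p e^t, and finite central differences stand in for the derivatives at 0):

```python
import math

# Bernoulli(p) has MGF M(t) = 1 - p + p*e^t, so M'(0) = p and M''(0) = p.
p = 0.3
M = lambda t: 1 - p + p * math.exp(t)

h = 1e-5
M1 = (M(h) - M(-h)) / (2 * h)          # approximates M'(0)  = E[X]   = p
M2 = (M(h) - 2 * M(0) + M(-h)) / h**2  # approximates M''(0) = E[X^2] = p
var = M2 - M1**2                       # approximates p(1 - p) = 0.21
```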

Uniqueness and existence

Not every random variable has an MGF. The MGF exists when \mathbb{E}[e^{tX}] < \infty for all t in some open interval around 0. Distributions with light tails (e.g., normal, Poisson, exponential) have MGFs. Heavy-tailed distributions like the Cauchy distribution do not.

When the MGF does exist, it uniquely determines the distribution. This is why "matching MGFs" is a valid proof technique for showing two random variables have the same distribution.

Applications in probability calculations

  • Sums of independent random variables: Multiply MGFs, then identify the resulting function. For example, the MGF of a standard normal is M_X(t) = e^{t^2/2}. The sum of n independent standard normals has MGF e^{nt^2/2}, which is the MGF of N(0, n).
  • Proving limit theorems: MGFs provide one route to proving the central limit theorem and the law of large numbers.
  • Identifying distributions: If you compute an MGF and recognize it as belonging to a known family, you've identified the distribution without inverting a transform.

Inequalities involving expectation and variance

These inequalities give you probability bounds using only moments, without needing the full distribution. They get progressively tighter as you use more information.

Markov's inequality

For a non-negative random variable X and any a > 0:

P(X \geq a) \leq \frac{\mathbb{E}[X]}{a}

This is the weakest of the three inequalities here, but it only requires X \geq 0 and a finite mean. It's often used as a stepping stone to derive stronger bounds.

Example: If \mathbb{E}[X] = 10, then P(X \geq 50) \leq 10/50 = 0.2.
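A simulation makes the looseness visible (a sketch; the exponential distribution with mean 10 is my choice, and its true tail e^{-5} ≈ 0.0067 sits far below the Markov bound):

```python
import random

random.seed(1)

# X ~ Exponential with mean 10, so Markov gives P(X >= 50) <= 10/50 = 0.2.
n = 100_000
samples = [random.expovariate(1 / 10) for _ in range(n)]

empirical_tail = sum(x >= 50 for x in samples) / n
markov_bound = 10 / 50
assert empirical_tail <= markov_bound  # the bound holds, with room to spare
```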

Chebyshev's inequality

For any random variable X with finite mean \mu and variance \sigma^2, and any k > 0:

P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}

Chebyshev's is stronger than Markov's because it uses both the mean and the variance. It applies to any distribution with finite variance.

Example: For any random variable with finite variance, at least 75% of the probability mass lies within 2 standard deviations of the mean (1/2^2 = 0.25, so at most 25% lies outside).
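The 75% guarantee can be checked on a deliberately skewed distribution (a sketch; Exponential(1) is my choice precisely because it is far from normal):

```python
import random
import statistics

random.seed(7)

# Exponential(1): mean 1, standard deviation 1, heavily skewed.
samples = [random.expovariate(1.0) for _ in range(100_000)]
mu = statistics.fmean(samples)
sigma = statistics.pstdev(samples)

within_2sd = sum(abs(x - mu) < 2 * sigma for x in samples) / len(samples)
assert within_2sd >= 0.75  # Chebyshev's guarantee, regardless of shape
```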

Chernoff bounds

Chernoff bounds provide exponentially decaying tail bounds for sums of independent random variables, making them much tighter than Markov or Chebyshev for large deviations.

The general technique:

  1. For any t > 0, apply Markov's inequality to e^{tX}: P(X \geq a) = P(e^{tX} \geq e^{ta}) \leq \frac{\mathbb{E}[e^{tX}]}{e^{ta}} = \frac{M_X(t)}{e^{ta}}
  2. Optimize over tt to get the tightest bound.

A common special case (Hoeffding's inequality): for a sum S_n = \sum_{i=1}^n X_i of independent random variables with 0 \leq X_i \leq 1, and any \varepsilon > 0:

P(S_n - \mathbb{E}[S_n] \geq \varepsilon) \leq e^{-2\varepsilon^2/n}

The exponential decay makes Chernoff bounds far more useful than Chebyshev's inequality when dealing with sums of many independent random variables, which is a common setting in stochastic processes.
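A simulation comparing the two bounds (a sketch with my own parameters: n = 1000 fair coin flips per trial, deviation ε = 50, each flip in [0, 1]):

```python
import math
import random

random.seed(3)
n, eps, trials = 1000, 50, 20_000

def S_n():
    # number of heads in n fair coin flips: each X_i is Bernoulli(1/2) in [0, 1]
    return bin(random.getrandbits(n)).count("1")

empirical = sum(S_n() - n / 2 >= eps for _ in range(trials)) / trials

hoeffding = math.exp(-2 * eps**2 / n)  # e^{-5} ≈ 0.0067
chebyshev = (n / 4) / eps**2           # Var(S_n)/eps^2 = 0.1 (two-sided)

assert empirical <= hoeffding < chebyshev
```

Here the Hoeffding bound is already an order of magnitude tighter than Chebyshev's, and the gap widens rapidly as ε grows.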