Stochastic Processes Unit 2 Review

2.6 Limit theorems

Written by the Fiveable Content Team • Last updated August 2025
Limit theorems describe how random variables behave as sample sizes grow. They formalize different types of convergence and provide the theoretical backbone for statistical inference, connecting probability theory to practical data analysis.

This topic covers three modes of convergence (in probability, almost sure, and in distribution), the major theorems associated with each, rates of convergence, and applications to statistical practice.

Convergence in probability

Convergence in probability describes a sequence of random variables getting arbitrarily close to some target value as $n$ grows, in the sense that the probability of a large deviation shrinks to zero. It's weaker than almost sure convergence but still powerful enough to justify many asymptotic results.

Weak law of large numbers

The weak law of large numbers (WLLN) says that the sample mean converges in probability to the population mean. If $X_1, X_2, \ldots$ are i.i.d. with finite mean $\mu$, then for any $\varepsilon > 0$:

$$\lim_{n \to \infty} P(|\bar{X}_n - \mu| > \varepsilon) = 0$$

where $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$.

This is the formal justification for using sample averages to estimate population means. Roll a fair die many times: the average of your rolls will be close to 3.5 with high probability for large $n$.
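
A quick simulation sketch of the WLLN using only the Python standard library (the sample sizes and seed are arbitrary choices for illustration):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def dice_average(n):
    """Average of n fair six-sided die rolls."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# As n grows, the sample mean concentrates around the population mean 3.5.
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}: average = {dice_average(n):.4f}")
```

The deviation from 3.5 shrinks on the order of $1/\sqrt{n}$, which anticipates the rates of convergence discussed later in this guide.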

Convergence of random variables

More generally, a sequence $\{X_n\}$ converges in probability to a random variable $X$ if, for every $\varepsilon > 0$:

$$\lim_{n \to \infty} P(|X_n - X| > \varepsilon) = 0$$

This is denoted $X_n \xrightarrow{p} X$. The key intuition: as $n$ increases, the chance that $X_n$ deviates from $X$ by more than any fixed amount vanishes. For example, the proportion of heads in $n$ fair coin flips converges in probability to 0.5, and the sample variance converges in probability to the population variance.

Continuous mapping theorem

If $X_n \xrightarrow{p} X$ and $g$ is a function that is continuous at $X$, then:

$$g(X_n) \xrightarrow{p} g(X)$$

This theorem lets you "push" convergence through continuous functions. A practical example: if the sample variance converges in probability to $\sigma^2$, then the sample standard deviation (which is just $g(s^2) = \sqrt{s^2}$, a continuous function) converges in probability to $\sigma$.
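
One way to see the theorem in action (a minimal sketch; Exponential(1) is an arbitrary choice with population variance 1):

```python
import math
import random

random.seed(1)

def sample_variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Exponential(1) has population variance 1, so s^2 -> 1 in probability,
# and the continuous map g(v) = sqrt(v) carries this over to s -> 1.
xs = [random.expovariate(1.0) for _ in range(200_000)]
s2 = sample_variance(xs)
s = math.sqrt(s2)
print(f"sample variance ≈ {s2:.3f}, sample standard deviation ≈ {s:.3f}")
```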

Almost sure convergence

Almost sure (a.s.) convergence is a stronger requirement than convergence in probability. A sequence $\{X_n\}$ converges almost surely to $X$ if the set of outcomes where $X_n \to X$ has probability one. Almost sure convergence implies convergence in probability, but the reverse does not hold in general.

Strong law of large numbers

The strong law of large numbers (SLLN) upgrades the WLLN: the sample mean converges to the population mean almost surely, not just in probability. If $X_1, X_2, \ldots$ are i.i.d. with finite mean $\mu$:

$$P\!\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1$$

The distinction matters. The WLLN says large deviations become unlikely; the SLLN says that on almost every realization of the sequence, the sample mean eventually stays close to $\mu$ forever. Think of flipping a coin indefinitely: with probability 1, the running proportion of heads converges to 0.5.
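
The coin-flip picture is easy to check along a single simulated realization (a sketch with arbitrary checkpoint sizes):

```python
import random

random.seed(2)

# Follow ONE coin-flip sequence (truncated at 500,000 flips) and record
# the running proportion of heads at a few checkpoints along the path.
heads = 0
snapshots = {}
for n in range(1, 500_001):
    heads += random.random() < 0.5
    if n in (100, 10_000, 500_000):
        snapshots[n] = heads / n

for n, p in snapshots.items():
    print(f"after {n:>7,} flips: proportion of heads = {p:.4f}")
```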

Kolmogorov's three-series theorem

This theorem gives necessary and sufficient conditions for the almost sure convergence of a series $\sum_{n=1}^\infty X_n$ of independent random variables. The series converges a.s. if and only if, for some $c > 0$, all three of the following hold:

  1. $\sum_{n=1}^\infty P(|X_n| > c) < \infty$
  2. $\sum_{n=1}^\infty E[X_n \mathbf{1}_{\{|X_n| \leq c\}}]$ converges
  3. $\sum_{n=1}^\infty \operatorname{Var}(X_n \mathbf{1}_{\{|X_n| \leq c\}}) < \infty$

The three conditions control, respectively, the probability of large jumps, the drift of the truncated terms, and the variability of the truncated terms. Applications include determining convergence of random harmonic series and random power series.
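
For the random harmonic series $\sum_n \pm 1/n$ with independent fair signs, the three conditions hold with $c = 1$: the first sum is zero, each truncated mean is zero, and $\sum 1/n^2 < \infty$. A simulation sketch of one realization (truncation point arbitrary):

```python
import random

random.seed(3)

# Partial sums of sum_n eps_n / n with eps_n = +/-1 equally likely.
# The three-series theorem predicts almost sure convergence, so along
# one realization the partial sums should settle down.
total = 0.0
checkpoints = {}
for n in range(1, 1_000_001):
    total += (1 if random.random() < 0.5 else -1) / n
    if n in (1_000, 100_000, 1_000_000):
        checkpoints[n] = total

print(checkpoints)
tail_move = abs(checkpoints[1_000_000] - checkpoints[1_000])
print(f"movement after n = 1,000: {tail_move:.4f}")
```

The tail fluctuation beyond $n = 1000$ has standard deviation $\sqrt{\sum_{n > 1000} 1/n^2} \approx 0.03$, so the late partial sums barely move.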

Borel-Cantelli lemmas

These two lemmas connect the summability of event probabilities to whether events occur infinitely often (abbreviated "i.o.").

  • First Borel-Cantelli lemma: If $\sum_{n=1}^\infty P(A_n) < \infty$, then $P(\limsup_{n \to \infty} A_n) = 0$. The events $A_n$ occur only finitely many times, a.s.
  • Second Borel-Cantelli lemma: If the events $A_n$ are independent and $\sum_{n=1}^\infty P(A_n) = \infty$, then $P(\limsup_{n \to \infty} A_n) = 1$. The events $A_n$ occur infinitely often, a.s.

The independence condition in the second lemma is critical. Together, these lemmas are a workhorse for proving almost sure results, such as the almost sure recurrence of the symmetric random walk on $\mathbb{Z}$.
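
Both lemmas can be illustrated with independent uniforms (a sketch; the thresholds $1/n^2$ and $1/n$ are chosen so that the two probability sums converge and diverge, respectively):

```python
import random

random.seed(10)

# A_n = {U_n < 1/n^2}: sum P(A_n) < infinity, so by the first lemma
# only finitely many A_n occur, a.s.
# B_n = {U_n < 1/n}: independent events with sum P(B_n) = infinity,
# so by the second lemma B_n occurs infinitely often, a.s.
count_a = count_b = 0
for n in range(1, 1_000_001):
    u = random.random()
    count_a += u < 1 / n ** 2
    count_b += u < 1 / n

print(f"A_n occurred {count_a} times, B_n occurred {count_b} times "
      f"in 1,000,000 trials")
```

In a typical run `count_a` stays tiny (its expectation is $\sum 1/n^2 \approx 1.64$), while `count_b` keeps growing like $\log n$.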

Convergence in distribution

Convergence in distribution (also called weak convergence) is the weakest of the three modes. A sequence $X_n$ converges in distribution to $X$ if the CDF of $X_n$ converges to the CDF of $X$ at every continuity point of the latter. Convergence in probability implies convergence in distribution, but not vice versa.


Central limit theorem

The CLT is arguably the most important result in probability. If $X_1, X_2, \ldots$ are i.i.d. with mean $\mu$ and finite variance $\sigma^2$, then:

$$\frac{\sum_{i=1}^n X_i - n\mu}{\sigma \sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty$$

No matter the shape of the underlying distribution, the standardized sample mean is approximately normal for large $n$. This is why the normal distribution appears so often in practice: the sum of many small, independent effects tends toward a Gaussian. It underpins confidence intervals, hypothesis tests, and much of statistical inference.
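
A simulation sketch (Exponential(1) summands are an arbitrary skewed choice with $\mu = \sigma = 1$):

```python
import math
import random

random.seed(4)

# Standardized sums of n i.i.d. Exponential(1) variables. Despite the
# skewed base distribution, the standardized sum is close to N(0, 1).
def standardized_sum(n):
    s = sum(random.expovariate(1.0) for _ in range(n))
    return (s - n) / math.sqrt(n)          # (S_n - n*mu) / (sigma*sqrt(n))

reps = 20_000
zs = [standardized_sum(200) for _ in range(reps)]

mean_z = sum(zs) / reps
var_z = sum(z * z for z in zs) / reps
frac_below = sum(z <= 0 for z in zs) / reps   # compare with Phi(0) = 0.5
print(f"mean ≈ {mean_z:.3f}, variance ≈ {var_z:.3f}, P(Z <= 0) ≈ {frac_below:.3f}")
```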

Characteristic functions

The characteristic function of a random variable $X$ is:

$$\varphi_X(t) = E[e^{itX}]$$

where $i$ is the imaginary unit. Characteristic functions always exist (unlike moment generating functions) and uniquely determine the distribution.

Their main role in limit theory: pointwise convergence of characteristic functions implies convergence in distribution (Lévy's continuity theorem). That is, if $\varphi_{X_n}(t) \to \varphi_X(t)$ for all $t \in \mathbb{R}$, then $X_n \xrightarrow{d} X$. This is often the cleanest way to prove CLT-type results. Classic applications include showing that $\text{Binomial}(n, \lambda/n)$ converges to $\text{Poisson}(\lambda)$, or proving the CLT itself.
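
A Monte Carlo sketch of a characteristic function (standard library only; the exact answer $e^{-t^2/2}$ for $N(0,1)$ is a classical fact):

```python
import cmath
import math
import random

random.seed(5)

# Estimate phi_X(t) = E[exp(itX)] for X ~ N(0, 1) by sample averaging;
# the exact characteristic function is exp(-t^2 / 2).
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]

def empirical_cf(t):
    return sum(cmath.exp(1j * t * x) for x in xs) / len(xs)

for t in (0.5, 1.0, 2.0):
    est, exact = empirical_cf(t), math.exp(-t * t / 2)
    print(f"t = {t}: estimate {est.real:+.4f}{est.imag:+.4f}i, exact {exact:.4f}")
```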

Lindeberg-Feller theorem

This theorem generalizes the CLT to independent but not necessarily identically distributed random variables. Let $X_1, \ldots, X_n$ be independent with means $\mu_i$ and variances $\sigma_i^2$, and let $s_n^2 = \sum_{i=1}^n \sigma_i^2$.

The Lindeberg condition requires that for every $\varepsilon > 0$:

$$\frac{1}{s_n^2} \sum_{i=1}^n E\!\left[(X_i - \mu_i)^2 \mathbf{1}_{\{|X_i - \mu_i| > \varepsilon s_n\}}\right] \to 0$$

Intuitively, this says no single summand dominates the total variance. If the Lindeberg condition holds, then:

$$\frac{\sum_{i=1}^n (X_i - \mu_i)}{s_n} \xrightarrow{d} N(0, 1)$$

The Feller converse states that if the CLT conclusion holds and the individual variances are uniformly negligible ($\max_i \sigma_i^2 / s_n^2 \to 0$), then the Lindeberg condition must hold.
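
A simulation sketch of a non-identically-distributed CLT (the growing scale $\sqrt{i}$ is an arbitrary choice that keeps the Lindeberg condition easy to verify):

```python
import math
import random

random.seed(9)

# X_i ~ Uniform(-sqrt(i), sqrt(i)), independent, so sigma_i^2 = i/3 and
# s_n^2 = n(n+1)/6. Each summand is bounded by sqrt(n) while s_n grows
# like n, so the indicator in the Lindeberg condition is eventually zero:
# the condition holds and the normalized sum should be close to N(0, 1).
def normalized_sum(n):
    s = sum(random.uniform(-math.sqrt(i), math.sqrt(i)) for i in range(1, n + 1))
    s_n = math.sqrt(n * (n + 1) / 6)
    return s / s_n

reps = 5_000
zs = [normalized_sum(300) for _ in range(reps)]
mean_z = sum(zs) / reps
var_z = sum(z * z for z in zs) / reps
print(f"mean ≈ {mean_z:.3f}, variance ≈ {var_z:.3f} (theory: 0 and 1)")
```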

Delta method

The delta method translates the asymptotic normality of an estimator into the asymptotic normality of a smooth function of that estimator. If:

$$\sqrt{n}(X_n - \theta) \xrightarrow{d} N(0, \sigma^2)$$

and $g$ is differentiable at $\theta$ with $g'(\theta) \neq 0$, then:

$$\sqrt{n}\left(g(X_n) - g(\theta)\right) \xrightarrow{d} N(0, \sigma^2 [g'(\theta)]^2)$$

This is extremely useful in practice. For instance, if you know the sample mean is asymptotically normal, the delta method immediately gives you the asymptotic distribution of $\log(\bar{X}_n)$ or $1/\bar{X}_n$, which you need for constructing confidence intervals for transformed parameters.
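
A sketch checking the delta method numerically (Exponential(1) and $g = \log$ are arbitrary illustrative choices; here $\theta = \sigma^2 = 1$ and $g'(1) = 1$, so the predicted limit is again $N(0, 1)$):

```python
import math
import random

random.seed(6)

# X_i ~ Exponential(1): sqrt(n)(Xbar - 1) -> N(0, 1). With g(x) = log(x),
# g'(1) = 1, so the delta method predicts sqrt(n) log(Xbar) -> N(0, 1) too.
def scaled_log_mean(n):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    return math.sqrt(n) * math.log(xbar)

reps = 10_000
vals = [scaled_log_mean(500) for _ in range(reps)]
mean_v = sum(vals) / reps
var_v = sum((v - mean_v) ** 2 for v in vals) / reps
print(f"mean ≈ {mean_v:.3f}, variance ≈ {var_v:.3f} (delta method predicts 0 and 1)")
```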

Functional limit theorems

Functional limit theorems extend convergence in distribution from random variables to entire stochastic processes. Instead of a sequence of numbers converging to a number, you have a sequence of random functions converging to a random function, with convergence defined in an appropriate function space.

Donsker's theorem

Donsker's theorem is the functional analogue of the CLT. Let $X_1, X_2, \ldots$ be i.i.d. with distribution function $F$, and let $F_n$ be the empirical distribution function. Then:

$$\sqrt{n}(F_n - F) \xrightarrow{d} B \circ F$$

in the space $D[0,1]$ of càdlàg functions, where $B$ is a standard Brownian bridge. The entire rescaled empirical process converges as a process, not just at a single point.

This result is the foundation for distribution-free goodness-of-fit tests and confidence bands for CDFs.

Empirical process theory

Empirical process theory studies the asymptotic behavior of processes built from random samples, such as the empirical distribution function, empirical characteristic function, and empirical moment functions. The goal is to derive limiting distributions for functionals of these processes (suprema, integrals, etc.).

Key applications include:

  • The Kolmogorov-Smirnov test, which uses $\sup_x |F_n(x) - F_0(x)|$ and relies on the limiting distribution from Donsker's theorem
  • The Cramér-von Mises test, which uses an integrated squared difference $\int (F_n - F_0)^2 \, dF_0$
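
The KS statistic is easy to compute directly; a sketch for a sample that truly comes from $F_0 = \mathrm{Uniform}(0,1)$, where Donsker's theorem says $\sqrt{n}\,D_n$ follows the Kolmogorov limit distribution:

```python
import math
import random

random.seed(7)

def ks_statistic_uniform(xs):
    """D_n = sup_x |F_n(x) - x| against Uniform(0,1); the sup is attained
    at (or just below) a jump of the empirical CDF."""
    xs = sorted(xs)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n)
               for i, x in enumerate(xs))

n = 10_000
d_n = ks_statistic_uniform([random.random() for _ in range(n)])
print(f"D_n = {d_n:.4f}, sqrt(n) * D_n = {math.sqrt(n) * d_n:.3f}")
```

Typical values of $\sqrt{n}\,D_n$ are around 0.5–1.5; values above about 1.36 are rejected at the 5% level.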

Brownian motion approximation

Many stochastic processes can be approximated by Brownian motion or related Gaussian processes in the large-sample limit. The key idea is the invariance principle (also called the functional CLT): under suitable conditions, the rescaled partial sum process of a sequence of random variables converges in distribution to Brownian motion.

This extends beyond i.i.d. settings. Versions exist for weakly dependent sequences, martingales (the functional CLT for martingales), and even queueing processes (approximated by reflected Brownian motion). These approximations are central to the study of random walks, diffusion limits, and stochastic simulation.


Rates of convergence

Knowing that a sequence converges is useful, but knowing how fast it converges is often more important in practice. Rates of convergence tell you how good your normal approximation actually is for a given $n$, and they guide the construction of refined approximations.

Berry-Esseen theorem

The Berry-Esseen theorem quantifies the rate of convergence in the CLT. If $X_1, X_2, \ldots$ are i.i.d. with mean $\mu$, variance $\sigma^2$, and finite third absolute moment $\rho = E[|X_1 - \mu|^3]$, then:

$$\sup_x \left| P\!\left(\frac{\sum_{i=1}^n X_i - n\mu}{\sigma \sqrt{n}} \leq x\right) - \Phi(x) \right| \leq \frac{C \rho}{\sigma^3 \sqrt{n}}$$

where $\Phi$ is the standard normal CDF and $C$ is a universal constant (the best known value is $C < 0.4748$).

The bound is $O(1/\sqrt{n})$, so the normal approximation improves at rate $1/\sqrt{n}$. For skewed distributions (large $\rho/\sigma^3$), the approximation is worse at any given $n$.
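
The bound can be checked exactly for coin flips, where the binomial CDF is computable in closed form (a sketch; $n = 100$ and $p = 0.5$ are arbitrary, and $C = 0.4748$ is the constant quoted above):

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def max_cdf_error(n, p):
    """Exact sup_x |P(standardized Bernoulli sum <= x) - Phi(x)|.
    The sup is attained at (or just below) a jump of the binomial CDF."""
    sigma = math.sqrt(p * (1 - p))
    pmf, cdf, worst = (1 - p) ** n, 0.0, 0.0
    for k in range(n + 1):
        x = (k - n * p) / (sigma * math.sqrt(n))
        worst = max(worst, abs(cdf - normal_cdf(x)))   # just below the jump at k
        cdf += pmf
        worst = max(worst, abs(cdf - normal_cdf(x)))   # at the jump
        if k < n:
            pmf *= (n - k) / (k + 1) * p / (1 - p)
    return worst

n, p = 100, 0.5
sigma = math.sqrt(p * (1 - p))
rho = p * (1 - p) * ((1 - p) ** 2 + p ** 2)    # E|X_1 - p|^3 for Bernoulli(p)
bound = 0.4748 * rho / (sigma ** 3 * math.sqrt(n))
err = max_cdf_error(n, p)
print(f"worst CDF error = {err:.4f}, Berry-Esseen bound = {bound:.4f}")
```

For these values the actual worst-case error sits below the bound, as the theorem guarantees.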

Edgeworth expansions

Edgeworth expansions refine the normal approximation by adding correction terms based on the cumulants of the distribution. The first-order expansion is:

$$P\!\left(\frac{\sum_{i=1}^n X_i - n\mu}{\sigma \sqrt{n}} \leq x\right) = \Phi(x) + \frac{\kappa_3}{6\sigma^3\sqrt{n}} (1 - x^2) \phi(x) + o\!\left(\frac{1}{\sqrt{n}}\right)$$

where $\kappa_3$ is the third cumulant (related to skewness) and $\phi$ is the standard normal density.

The correction term captures the leading-order effect of skewness. Higher-order expansions incorporate kurtosis and beyond. These expansions are used to improve the coverage probability of confidence intervals and to justify bootstrap refinements, which achieve $O(1/n)$ accuracy instead of $O(1/\sqrt{n})$.
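
For sums of exponentials the exact distribution is Gamma, so the first-order correction can be compared with the truth (a sketch; $n = 20$ and $x = 0.5$ are arbitrary choices; for Exponential(1), $\mu = \sigma = 1$ and $\kappa_3 = 2$):

```python
import math

def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def exact_cdf(n, x):
    """P((S_n - n)/sqrt(n) <= x) for S_n ~ Gamma(n, 1), via the closed
    form P(S_n <= t) = 1 - exp(-t) * sum_{k<n} t^k / k! for integer n."""
    t = n + x * math.sqrt(n)
    term, total = 1.0, 1.0               # k = 0 term
    for k in range(1, n):
        term *= t / k
        total += term
    return 1.0 - math.exp(-t) * total

n, x = 20, 0.5
kappa3 = 2.0                              # third cumulant of Exponential(1)
exact = exact_cdf(n, x)
normal = Phi(x)
edgeworth = Phi(x) + kappa3 / (6 * math.sqrt(n)) * (1 - x * x) * phi(x)
print(f"exact {exact:.4f}, normal approx {normal:.4f}, Edgeworth {edgeworth:.4f}")
```

The skewness correction removes most of the plain normal approximation's error at this point.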

Sanov's theorem

Sanov's theorem belongs to large deviations theory and describes the exponential rate at which the empirical distribution of i.i.d. random variables deviates from the true distribution. If $X_1, \ldots, X_n$ are i.i.d. with distribution $P$, and $\hat{P}_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$ is the empirical measure, then for a set $\Gamma$ of probability measures:

$$P(\hat{P}_n \in \Gamma) \approx e^{-n \inf_{Q \in \Gamma} D(Q \| P)}$$

where $D(Q \| P) = \int \log\frac{dQ}{dP} \, dQ$ is the Kullback-Leibler divergence of $Q$ relative to $P$.

The probability of the empirical distribution landing in a "wrong" region decays exponentially in $n$, with the rate governed by the KL divergence to the nearest distribution in that region. This has direct applications to error exponents in hypothesis testing and to information-theoretic problems.
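
The exponential decay can be checked directly for a fair coin (a sketch; the threshold $q = 0.7$ is an arbitrary choice):

```python
import math

def kl_bernoulli(q, p):
    """Binary KL divergence D(Bernoulli(q) || Bernoulli(p))."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def tail_prob(n, q):
    """Exact P(Binomial(n, 1/2) >= ceil(n q)), summed from the pmf."""
    k0 = math.ceil(n * q)
    pmf, total = 0.5 ** n, 0.0           # pmf starts at P(X = 0)
    for k in range(n + 1):
        if k >= k0:
            total += pmf
        if k < n:
            pmf *= (n - k) / (k + 1)     # p/(1-p) = 1 for a fair coin
    return total

q = 0.7
for n in (50, 200, 800):
    rate = -math.log(tail_prob(n, q)) / n
    print(f"n = {n:>3}: -log P / n = {rate:.4f}   (Sanov rate D(q||1/2) = "
          f"{kl_bernoulli(q, 0.5):.4f})")
```

The empirical exponent approaches $D(0.7 \| 0.5) \approx 0.0823$ as $n$ grows, exactly as the theorem predicts.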

Applications of limit theorems

The theorems above are not just theoretical results. They provide the asymptotic machinery that makes most of classical statistics work.

Statistical inference

Limit theorems give the asymptotic distributions of estimators. Maximum likelihood estimators, method of moments estimators, and least squares estimators are all asymptotically normal under regularity conditions, a fact established through the CLT and its extensions. This asymptotic normality is what lets you compute standard errors and build confidence intervals without knowing the exact finite-sample distribution.

Hypothesis testing

Major test procedures rely on limit theorems for their justification:

  • The likelihood ratio test statistic is asymptotically $\chi^2$ under the null (Wilks' theorem)
  • The Wald test uses the asymptotic normality of the MLE
  • The score test uses the asymptotic distribution of the score function

In each case, the CLT or delta method provides the asymptotic null distribution, which determines critical values and p-values.

Confidence intervals

Asymptotic confidence intervals follow directly from the CLT. If $\hat{\theta}_n$ is asymptotically normal with known asymptotic variance $v_n$, then an approximate $(1-\alpha)$ confidence interval is:

$$\hat{\theta}_n \pm z_{\alpha/2} \sqrt{v_n}$$

The delta method extends this to functions of parameters. For example, if you have a confidence interval for a mean $\mu$, the delta method lets you construct one for $e^\mu$ or $\mu^2$.
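
A sketch of a delta-method interval for $e^\mu$ (simulated data; the true $\mu = 2$, $\sigma = 1$, and the 1.96 normal quantile are the usual illustrative choices):

```python
import math
import random

random.seed(8)

# Simulated sample with true mean mu = 2, so the target is exp(2).
xs = [random.gauss(2.0, 1.0) for _ in range(5_000)]
n = len(xs)
xbar = sum(xs) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))

# g(mu) = exp(mu) has g'(mu) = exp(mu), so by the delta method
# se(exp(Xbar)) ~= exp(Xbar) * s / sqrt(n).
est = math.exp(xbar)
se = est * s / math.sqrt(n)
lo, hi = est - 1.96 * se, est + 1.96 * se
print(f"95% CI for exp(mu): ({lo:.2f}, {hi:.2f}); true value exp(2) ≈ {math.exp(2):.2f}")
```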

Asymptotic normality

Many statistical methods are built on the assumption that estimators or test statistics are approximately normal for large $n$. The CLT, Lindeberg-Feller theorem, and delta method are the primary tools for establishing this. Once asymptotic normality is proven, you can use standard normal quantiles for inference, which greatly simplifies both computation and interpretation. This applies to MLEs under regularity conditions, to likelihood ratio statistics, and to a wide range of M-estimators and U-statistics.