Stochastic Processes Unit 2 Review

2.6 Limit theorems

Written by the Fiveable Content Team • Last updated August 2025
Limit theorems describe how random variables behave as sample sizes grow. They formalize different types of convergence and provide the theoretical backbone for statistical inference, connecting probability theory to practical data analysis.

This topic covers three modes of convergence (in probability, almost sure, and in distribution), the major theorems associated with each, rates of convergence, and applications to statistical practice.

Convergence in probability

Convergence in probability describes a sequence of random variables getting arbitrarily close to some target value as $n$ grows, in the sense that the probability of a large deviation shrinks to zero. It's weaker than almost sure convergence but still powerful enough to justify many asymptotic results.

Weak law of large numbers

The weak law of large numbers (WLLN) says that the sample mean converges in probability to the population mean. If $X_1, X_2, \ldots$ are i.i.d. with finite mean $\mu$, then for any $\varepsilon > 0$:

$$\lim_{n \to \infty} P(|\bar{X}_n - \mu| > \varepsilon) = 0$$

where $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$.

This is the formal justification for using sample averages to estimate population means. Roll a fair die many times: the average of your rolls will be close to 3.5 with high probability for large $n$.
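
A quick simulation sketch of the WLLN using only the Python standard library (the sample sizes and seed are arbitrary choices for illustration):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def dice_average(n):
    """Average of n fair six-sided die rolls."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# As n grows, the sample mean concentrates around the population mean 3.5.
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}: average = {dice_average(n):.4f}")
```

The deviation from 3.5 shrinks on the order of $1/\sqrt{n}$, which anticipates the rates of convergence discussed later in this guide.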

Convergence of random variables

More generally, a sequence $\{X_n\}$ converges in probability to a random variable $X$ if, for every $\varepsilon > 0$:

$$\lim_{n \to \infty} P(|X_n - X| > \varepsilon) = 0$$

This is denoted $X_n \xrightarrow{p} X$. The key intuition: as $n$ increases, the chance that $X_n$ deviates from $X$ by more than any fixed amount vanishes. For example, the proportion of heads in $n$ fair coin flips converges in probability to 0.5, and the sample variance converges in probability to the population variance.

Continuous mapping theorem

If $X_n \xrightarrow{p} X$ and $g$ is a function that is continuous at $X$, then:

$$g(X_n) \xrightarrow{p} g(X)$$

This theorem lets you "push" convergence through continuous functions. A practical example: if the sample variance converges in probability to $\sigma^2$, then the sample standard deviation (which is just $g(s^2) = \sqrt{s^2}$, a continuous function) converges in probability to $\sigma$.
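
One way to see the theorem in action (a minimal sketch; Exponential(1) is an arbitrary choice with population variance 1):

```python
import math
import random

random.seed(1)

def sample_variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Exponential(1) has population variance 1, so s^2 -> 1 in probability,
# and the continuous map g(v) = sqrt(v) carries this over to s -> 1.
xs = [random.expovariate(1.0) for _ in range(200_000)]
s2 = sample_variance(xs)
s = math.sqrt(s2)
print(f"sample variance ≈ {s2:.3f}, sample standard deviation ≈ {s:.3f}")
```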

Almost sure convergence

Almost sure (a.s.) convergence is a stronger requirement than convergence in probability. A sequence $\{X_n\}$ converges almost surely to $X$ if the set of outcomes where $X_n \to X$ has probability one. Almost sure convergence implies convergence in probability, but the reverse does not hold in general.

Strong law of large numbers

The strong law of large numbers (SLLN) upgrades the WLLN: the sample mean converges to the population mean almost surely, not just in probability. If $X_1, X_2, \ldots$ are i.i.d. with finite mean $\mu$:

$$P\!\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1$$

The distinction matters. The WLLN says large deviations become unlikely; the SLLN says that on almost every realization of the sequence, the sample mean eventually stays close to $\mu$ forever. Think of flipping a coin indefinitely: with probability 1, the running proportion of heads converges to 0.5.
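
The coin-flip picture is easy to check along a single simulated realization (a sketch with arbitrary checkpoint sizes):

```python
import random

random.seed(2)

# Follow ONE coin-flip sequence (truncated at 500,000 flips) and record
# the running proportion of heads at a few checkpoints along the path.
heads = 0
snapshots = {}
for n in range(1, 500_001):
    heads += random.random() < 0.5
    if n in (100, 10_000, 500_000):
        snapshots[n] = heads / n

for n, p in snapshots.items():
    print(f"after {n:>7,} flips: proportion of heads = {p:.4f}")
```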

Kolmogorov's three-series theorem

This theorem gives necessary and sufficient conditions for the almost sure convergence of a series $\sum_{n=1}^\infty X_n$ of independent random variables. The series converges a.s. if and only if, for some $c > 0$, all three of the following hold:

  1. $\sum_{n=1}^\infty P(|X_n| > c) < \infty$
  2. $\sum_{n=1}^\infty E[X_n \mathbf{1}_{\{|X_n| \leq c\}}]$ converges
  3. $\sum_{n=1}^\infty \operatorname{Var}(X_n \mathbf{1}_{\{|X_n| \leq c\}}) < \infty$

The three conditions control, respectively, the probability of large jumps, the drift of the truncated terms, and the variability of the truncated terms. Applications include determining convergence of random harmonic series and random power series.
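
For the random harmonic series $\sum_n \pm 1/n$ with independent fair signs, the three conditions hold with $c = 1$: the first sum is zero, each truncated mean is zero, and $\sum 1/n^2 < \infty$. A simulation sketch of one realization (truncation point arbitrary):

```python
import random

random.seed(3)

# Partial sums of sum_n eps_n / n with eps_n = +/-1 equally likely.
# The three-series theorem predicts almost sure convergence, so along
# one realization the partial sums should settle down.
total = 0.0
checkpoints = {}
for n in range(1, 1_000_001):
    total += (1 if random.random() < 0.5 else -1) / n
    if n in (1_000, 100_000, 1_000_000):
        checkpoints[n] = total

print(checkpoints)
tail_move = abs(checkpoints[1_000_000] - checkpoints[1_000])
print(f"movement after n = 1,000: {tail_move:.4f}")
```

The tail fluctuation beyond $n = 1000$ has standard deviation $\sqrt{\sum_{n > 1000} 1/n^2} \approx 0.03$, so the late partial sums barely move.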

Borel-Cantelli lemmas

These two lemmas connect the summability of event probabilities to whether events occur infinitely often (abbreviated "i.o.").

  • First Borel-Cantelli lemma: If $\sum_{n=1}^\infty P(A_n) < \infty$, then $P(\limsup_{n \to \infty} A_n) = 0$. The events $A_n$ occur only finitely many times, a.s.
  • Second Borel-Cantelli lemma: If the events $A_n$ are independent and $\sum_{n=1}^\infty P(A_n) = \infty$, then $P(\limsup_{n \to \infty} A_n) = 1$. The events $A_n$ occur infinitely often, a.s.

The independence condition in the second lemma is critical. Together, these lemmas are a workhorse for proving almost sure results, such as the almost sure recurrence of the symmetric random walk on $\mathbb{Z}$.
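
Both lemmas can be illustrated with independent uniforms (a sketch; the thresholds $1/n^2$ and $1/n$ are chosen so that the two probability sums converge and diverge, respectively):

```python
import random

random.seed(10)

# A_n = {U_n < 1/n^2}: sum P(A_n) < infinity, so by the first lemma
# only finitely many A_n occur, a.s.
# B_n = {U_n < 1/n}: independent events with sum P(B_n) = infinity,
# so by the second lemma B_n occurs infinitely often, a.s.
count_a = count_b = 0
for n in range(1, 1_000_001):
    u = random.random()
    count_a += u < 1 / n ** 2
    count_b += u < 1 / n

print(f"A_n occurred {count_a} times, B_n occurred {count_b} times "
      f"in 1,000,000 trials")
```

In a typical run `count_a` stays tiny (its expectation is $\sum 1/n^2 \approx 1.64$), while `count_b` keeps growing like $\log n$.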

Convergence in distribution

Convergence in distribution (also called weak convergence) is the weakest of the three modes. A sequence $X_n$ converges in distribution to $X$ if the CDF of $X_n$ converges to the CDF of $X$ at every continuity point of the latter. Convergence in probability implies convergence in distribution, but not vice versa.


Central limit theorem

The CLT is arguably the most important result in probability. If $X_1, X_2, \ldots$ are i.i.d. with mean $\mu$ and finite variance $\sigma^2$, then:

$$\frac{\sum_{i=1}^n X_i - n\mu}{\sigma \sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty$$

No matter the shape of the underlying distribution, the standardized sample mean is approximately normal for large $n$. This is why the normal distribution appears so often in practice: the sum of many small, independent effects tends toward a Gaussian. It underpins confidence intervals, hypothesis tests, and much of statistical inference.
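
A simulation sketch (Exponential(1) summands are an arbitrary skewed choice with $\mu = \sigma = 1$):

```python
import math
import random

random.seed(4)

# Standardized sums of n i.i.d. Exponential(1) variables. Despite the
# skewed base distribution, the standardized sum is close to N(0, 1).
def standardized_sum(n):
    s = sum(random.expovariate(1.0) for _ in range(n))
    return (s - n) / math.sqrt(n)          # (S_n - n*mu) / (sigma*sqrt(n))

reps = 20_000
zs = [standardized_sum(200) for _ in range(reps)]

mean_z = sum(zs) / reps
var_z = sum(z * z for z in zs) / reps
frac_below = sum(z <= 0 for z in zs) / reps   # compare with Phi(0) = 0.5
print(f"mean ≈ {mean_z:.3f}, variance ≈ {var_z:.3f}, P(Z <= 0) ≈ {frac_below:.3f}")
```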

Characteristic functions

The characteristic function of a random variable $X$ is:

$$\varphi_X(t) = E[e^{itX}]$$

where $i$ is the imaginary unit. Characteristic functions always exist (unlike moment generating functions) and uniquely determine the distribution.

Their main role in limit theory: pointwise convergence of characteristic functions implies convergence in distribution (Lévy's continuity theorem). That is, if $\varphi_{X_n}(t) \to \varphi_X(t)$ for all $t \in \mathbb{R}$, then $X_n \xrightarrow{d} X$. This is often the cleanest way to prove CLT-type results. Classic applications include showing that $\text{Binomial}(n, \lambda/n)$ converges to $\text{Poisson}(\lambda)$, or proving the CLT itself.
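
A Monte Carlo sketch of a characteristic function (standard library only; the exact answer $e^{-t^2/2}$ for $N(0,1)$ is a classical fact):

```python
import cmath
import math
import random

random.seed(5)

# Estimate phi_X(t) = E[exp(itX)] for X ~ N(0, 1) by sample averaging;
# the exact characteristic function is exp(-t^2 / 2).
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]

def empirical_cf(t):
    return sum(cmath.exp(1j * t * x) for x in xs) / len(xs)

for t in (0.5, 1.0, 2.0):
    est, exact = empirical_cf(t), math.exp(-t * t / 2)
    print(f"t = {t}: estimate {est.real:+.4f}{est.imag:+.4f}i, exact {exact:.4f}")
```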

Lindeberg-Feller theorem

This theorem generalizes the CLT to independent but not necessarily identically distributed random variables. Let $X_1, \ldots, X_n$ be independent with means $\mu_i$ and variances $\sigma_i^2$, and let $s_n^2 = \sum_{i=1}^n \sigma_i^2$.

The Lindeberg condition requires that for every $\varepsilon > 0$:

$$\frac{1}{s_n^2} \sum_{i=1}^n E\!\left[(X_i - \mu_i)^2 \mathbf{1}_{\{|X_i - \mu_i| > \varepsilon s_n\}}\right] \to 0$$

Intuitively, this says no single summand dominates the total variance. If the Lindeberg condition holds, then:

$$\frac{\sum_{i=1}^n (X_i - \mu_i)}{s_n} \xrightarrow{d} N(0, 1)$$

The Feller converse states that if the CLT conclusion holds and the individual variances are uniformly negligible ($\max_i \sigma_i^2 / s_n^2 \to 0$), then the Lindeberg condition must hold.
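
A simulation sketch of a non-identically-distributed CLT (the growing scale $\sqrt{i}$ is an arbitrary choice that keeps the Lindeberg condition easy to verify):

```python
import math
import random

random.seed(9)

# X_i ~ Uniform(-sqrt(i), sqrt(i)), independent, so sigma_i^2 = i/3 and
# s_n^2 = n(n+1)/6. Each summand is bounded by sqrt(n) while s_n grows
# like n, so the indicator in the Lindeberg condition is eventually zero:
# the condition holds and the normalized sum should be close to N(0, 1).
def normalized_sum(n):
    s = sum(random.uniform(-math.sqrt(i), math.sqrt(i)) for i in range(1, n + 1))
    s_n = math.sqrt(n * (n + 1) / 6)
    return s / s_n

reps = 5_000
zs = [normalized_sum(300) for _ in range(reps)]
mean_z = sum(zs) / reps
var_z = sum(z * z for z in zs) / reps
print(f"mean ≈ {mean_z:.3f}, variance ≈ {var_z:.3f} (theory: 0 and 1)")
```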

Delta method

The delta method translates the asymptotic normality of an estimator into the asymptotic normality of a smooth function of that estimator. If:

$$\sqrt{n}(X_n - \theta) \xrightarrow{d} N(0, \sigma^2)$$

and $g$ is differentiable at $\theta$ with $g'(\theta) \neq 0$, then:

$$\sqrt{n}\left(g(X_n) - g(\theta)\right) \xrightarrow{d} N(0, \sigma^2 [g'(\theta)]^2)$$

This is extremely useful in practice. For instance, if you know the sample mean is asymptotically normal, the delta method immediately gives you the asymptotic distribution of $\log(\bar{X}_n)$ or $1/\bar{X}_n$, which you need for constructing confidence intervals for transformed parameters.
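
A sketch checking the delta method numerically (Exponential(1) and $g = \log$ are arbitrary illustrative choices; here $\theta = \sigma^2 = 1$ and $g'(1) = 1$, so the predicted limit is again $N(0, 1)$):

```python
import math
import random

random.seed(6)

# X_i ~ Exponential(1): sqrt(n)(Xbar - 1) -> N(0, 1). With g(x) = log(x),
# g'(1) = 1, so the delta method predicts sqrt(n) log(Xbar) -> N(0, 1) too.
def scaled_log_mean(n):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    return math.sqrt(n) * math.log(xbar)

reps = 10_000
vals = [scaled_log_mean(500) for _ in range(reps)]
mean_v = sum(vals) / reps
var_v = sum((v - mean_v) ** 2 for v in vals) / reps
print(f"mean ≈ {mean_v:.3f}, variance ≈ {var_v:.3f} (delta method predicts 0 and 1)")
```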

Functional limit theorems

Functional limit theorems extend convergence in distribution from random variables to entire stochastic processes. Instead of a sequence of numbers converging to a number, you have a sequence of random functions converging to a random function, with convergence defined in an appropriate function space.

Donsker's theorem

Donsker's theorem is the functional analogue of the CLT. Let $X_1, X_2, \ldots$ be i.i.d. with distribution function $F$, and let $F_n$ be the empirical distribution function. Then:

$$\sqrt{n}(F_n - F) \xrightarrow{d} B \circ F$$

in the space $D[0,1]$ of càdlàg functions, where $B$ is a standard Brownian bridge. The entire rescaled empirical process converges as a process, not just at a single point.

This result is the foundation for distribution-free goodness-of-fit tests and confidence bands for CDFs.

Empirical process theory

Empirical process theory studies the asymptotic behavior of processes built from random samples, such as the empirical distribution function, empirical characteristic function, and empirical moment functions. The goal is to derive limiting distributions for functionals of these processes (suprema, integrals, etc.).

Key applications include:

  • The Kolmogorov-Smirnov test, which uses $\sup_x |F_n(x) - F_0(x)|$ and relies on the limiting distribution from Donsker's theorem
  • The Cramér-von Mises test, which uses an integrated squared difference $\int (F_n - F_0)^2 \, dF_0$
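
The KS statistic is easy to compute directly; a sketch for a sample that truly comes from $F_0 = \mathrm{Uniform}(0,1)$, where Donsker's theorem says $\sqrt{n}\,D_n$ follows the Kolmogorov limit distribution:

```python
import math
import random

random.seed(7)

def ks_statistic_uniform(xs):
    """D_n = sup_x |F_n(x) - x| against Uniform(0,1); the sup is attained
    at (or just below) a jump of the empirical CDF."""
    xs = sorted(xs)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n)
               for i, x in enumerate(xs))

n = 10_000
d_n = ks_statistic_uniform([random.random() for _ in range(n)])
print(f"D_n = {d_n:.4f}, sqrt(n) * D_n = {math.sqrt(n) * d_n:.3f}")
```

Typical values of $\sqrt{n}\,D_n$ are around 0.5–1.5; values above about 1.36 are rejected at the 5% level.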

Brownian motion approximation

Many stochastic processes can be approximated by Brownian motion or related Gaussian processes in the large-sample limit. The key idea is the invariance principle (also called the functional CLT): under suitable conditions, the rescaled partial sum process of a sequence of random variables converges in distribution to Brownian motion.

This extends beyond i.i.d. settings. Versions exist for weakly dependent sequences, martingales (the functional CLT for martingales), and even queueing processes (approximated by reflected Brownian motion). These approximations are central to the study of random walks, diffusion limits, and stochastic simulation.


Rates of convergence

Knowing that a sequence converges is useful, but knowing how fast it converges is often more important in practice. Rates of convergence tell you how good your normal approximation actually is for a given $n$, and they guide the construction of refined approximations.

Berry-Esseen theorem

The Berry-Esseen theorem quantifies the rate of convergence in the CLT. If $X_1, X_2, \ldots$ are i.i.d. with mean $\mu$, variance $\sigma^2$, and finite third absolute moment $\rho = E[|X_1 - \mu|^3]$, then:

$$\sup_x \left| P\!\left(\frac{\sum_{i=1}^n X_i - n\mu}{\sigma \sqrt{n}} \leq x\right) - \Phi(x) \right| \leq \frac{C \rho}{\sigma^3 \sqrt{n}}$$

where $\Phi$ is the standard normal CDF and $C$ is a universal constant (the best known value is $C < 0.4748$).

The bound is $O(1/\sqrt{n})$, so the normal approximation improves at rate $1/\sqrt{n}$. For skewed distributions (large $\rho/\sigma^3$), the approximation is worse at any given $n$.
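
The bound can be checked exactly for coin flips, where the binomial CDF is computable in closed form (a sketch; $n = 100$ and $p = 0.5$ are arbitrary, and $C = 0.4748$ is the constant quoted above):

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def max_cdf_error(n, p):
    """Exact sup_x |P(standardized Bernoulli sum <= x) - Phi(x)|.
    The sup is attained at (or just below) a jump of the binomial CDF."""
    sigma = math.sqrt(p * (1 - p))
    pmf, cdf, worst = (1 - p) ** n, 0.0, 0.0
    for k in range(n + 1):
        x = (k - n * p) / (sigma * math.sqrt(n))
        worst = max(worst, abs(cdf - normal_cdf(x)))   # just below the jump at k
        cdf += pmf
        worst = max(worst, abs(cdf - normal_cdf(x)))   # at the jump
        if k < n:
            pmf *= (n - k) / (k + 1) * p / (1 - p)
    return worst

n, p = 100, 0.5
sigma = math.sqrt(p * (1 - p))
rho = p * (1 - p) * ((1 - p) ** 2 + p ** 2)    # E|X_1 - p|^3 for Bernoulli(p)
bound = 0.4748 * rho / (sigma ** 3 * math.sqrt(n))
err = max_cdf_error(n, p)
print(f"worst CDF error = {err:.4f}, Berry-Esseen bound = {bound:.4f}")
```

For these values the actual worst-case error sits below the bound, as the theorem guarantees.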

Edgeworth expansions

Edgeworth expansions refine the normal approximation by adding correction terms based on the cumulants of the distribution. The first-order expansion is:

$$P\!\left(\frac{\sum_{i=1}^n X_i - n\mu}{\sigma \sqrt{n}} \leq x\right) = \Phi(x) + \frac{\kappa_3}{6\sigma^3\sqrt{n}} (1 - x^2) \phi(x) + o\!\left(\frac{1}{\sqrt{n}}\right)$$

where $\kappa_3$ is the third cumulant (related to skewness) and $\phi$ is the standard normal density.

The correction term captures the leading-order effect of skewness. Higher-order expansions incorporate kurtosis and beyond. These expansions are used to improve the coverage probability of confidence intervals and to justify bootstrap refinements, which achieve $O(1/n)$ accuracy instead of $O(1/\sqrt{n})$.
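
For sums of exponentials the exact distribution is Gamma, so the first-order correction can be compared with the truth (a sketch; $n = 20$ and $x = 0.5$ are arbitrary choices; for Exponential(1), $\mu = \sigma = 1$ and $\kappa_3 = 2$):

```python
import math

def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def exact_cdf(n, x):
    """P((S_n - n)/sqrt(n) <= x) for S_n ~ Gamma(n, 1), via the closed
    form P(S_n <= t) = 1 - exp(-t) * sum_{k<n} t^k / k! for integer n."""
    t = n + x * math.sqrt(n)
    term, total = 1.0, 1.0               # k = 0 term
    for k in range(1, n):
        term *= t / k
        total += term
    return 1.0 - math.exp(-t) * total

n, x = 20, 0.5
kappa3 = 2.0                              # third cumulant of Exponential(1)
exact = exact_cdf(n, x)
normal = Phi(x)
edgeworth = Phi(x) + kappa3 / (6 * math.sqrt(n)) * (1 - x * x) * phi(x)
print(f"exact {exact:.4f}, normal approx {normal:.4f}, Edgeworth {edgeworth:.4f}")
```

The skewness correction removes most of the plain normal approximation's error at this point.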

Sanov's theorem

Sanov's theorem belongs to large deviations theory and describes the exponential rate at which the empirical distribution of i.i.d. random variables deviates from the true distribution. If $X_1, \ldots, X_n$ are i.i.d. with distribution $P$, and $\hat{P}_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$ is the empirical measure, then for a set $\Gamma$ of probability measures:

$$P(\hat{P}_n \in \Gamma) \approx e^{-n \inf_{Q \in \Gamma} D(Q \| P)}$$

where $D(Q \| P) = \int \log\frac{dQ}{dP} \, dQ$ is the Kullback-Leibler divergence of $Q$ relative to $P$.

The probability of the empirical distribution landing in a "wrong" region decays exponentially in $n$, with the rate governed by the KL divergence to the nearest distribution in that region. This has direct applications to error exponents in hypothesis testing and to information-theoretic problems.
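
The exponential decay can be checked directly for a fair coin (a sketch; the threshold $q = 0.7$ is an arbitrary choice):

```python
import math

def kl_bernoulli(q, p):
    """Binary KL divergence D(Bernoulli(q) || Bernoulli(p))."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def tail_prob(n, q):
    """Exact P(Binomial(n, 1/2) >= ceil(n q)), summed from the pmf."""
    k0 = math.ceil(n * q)
    pmf, total = 0.5 ** n, 0.0           # pmf starts at P(X = 0)
    for k in range(n + 1):
        if k >= k0:
            total += pmf
        if k < n:
            pmf *= (n - k) / (k + 1)     # p/(1-p) = 1 for a fair coin
    return total

q = 0.7
for n in (50, 200, 800):
    rate = -math.log(tail_prob(n, q)) / n
    print(f"n = {n:>3}: -log P / n = {rate:.4f}   (Sanov rate D(q||1/2) = "
          f"{kl_bernoulli(q, 0.5):.4f})")
```

The empirical exponent approaches $D(0.7 \| 0.5) \approx 0.0823$ as $n$ grows, exactly as the theorem predicts.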

Applications of limit theorems

The theorems above are not just theoretical results. They provide the asymptotic machinery that makes most of classical statistics work.

Statistical inference

Limit theorems give the asymptotic distributions of estimators. Maximum likelihood estimators, method of moments estimators, and least squares estimators are all asymptotically normal under regularity conditions, a fact established through the CLT and its extensions. This asymptotic normality is what lets you compute standard errors and build confidence intervals without knowing the exact finite-sample distribution.

Hypothesis testing

Major test procedures rely on limit theorems for their justification:

  • The likelihood ratio test statistic is asymptotically $\chi^2$ under the null (Wilks' theorem)
  • The Wald test uses the asymptotic normality of the MLE
  • The score test uses the asymptotic distribution of the score function

In each case, the CLT or delta method provides the asymptotic null distribution, which determines critical values and p-values.

Confidence intervals

Asymptotic confidence intervals follow directly from the CLT. If $\hat{\theta}_n$ is asymptotically normal with known asymptotic variance $v_n$, then an approximate $(1-\alpha)$ confidence interval is:

$$\hat{\theta}_n \pm z_{\alpha/2} \sqrt{v_n}$$

The delta method extends this to functions of parameters. For example, if you have a confidence interval for a mean $\mu$, the delta method lets you construct one for $e^\mu$ or $\mu^2$.
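
A sketch of a delta-method interval for $e^\mu$ (simulated data; the true $\mu = 2$, $\sigma = 1$, and the 1.96 normal quantile are the usual illustrative choices):

```python
import math
import random

random.seed(8)

# Simulated sample with true mean mu = 2, so the target is exp(2).
xs = [random.gauss(2.0, 1.0) for _ in range(5_000)]
n = len(xs)
xbar = sum(xs) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))

# g(mu) = exp(mu) has g'(mu) = exp(mu), so by the delta method
# se(exp(Xbar)) ~= exp(Xbar) * s / sqrt(n).
est = math.exp(xbar)
se = est * s / math.sqrt(n)
lo, hi = est - 1.96 * se, est + 1.96 * se
print(f"95% CI for exp(mu): ({lo:.2f}, {hi:.2f}); true value exp(2) ≈ {math.exp(2):.2f}")
```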

Asymptotic normality

Many statistical methods are built on the assumption that estimators or test statistics are approximately normal for large $n$. The CLT, Lindeberg-Feller theorem, and delta method are the primary tools for establishing this. Once asymptotic normality is proven, you can use standard normal quantiles for inference, which greatly simplifies both computation and interpretation. This applies to MLEs under regularity conditions, to likelihood ratio statistics, and to a wide range of M-estimators and U-statistics.