Fiveable

🔀Stochastic Processes Unit 2 Review


2.2 Continuous probability distributions


Written by the Fiveable Content Team • Last updated August 2025

Continuous probability distributions model random variables that can take any value within a range, rather than just isolated points. They form the backbone of stochastic process modeling, where quantities like waiting times, signal noise, and particle positions vary continuously. This guide covers the key properties, common distributions, transformations, joint distributions, order statistics, and the limit theorems that tie everything together.

Properties of continuous distributions

A continuous random variable can take on uncountably many values across an interval (or the entire real line). Because of this, you can't assign nonzero probability to individual points. Instead, probabilities come from integrating over intervals, and the machinery for doing that involves PDFs, CDFs, and their associated summary statistics.

Probability density functions

A probability density function (PDF) $f_X(x)$ describes the relative likelihood of a continuous random variable near a particular value. Two properties must hold:

  • $f_X(x) \geq 0$ for all $x$
  • $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$

The probability that $X$ falls in an interval $[a, b]$ is the area under the curve:

$$P(a \leq X \leq b) = \int_a^b f_X(x)\,dx$$

Note that $f_X(x)$ itself is not a probability and can exceed 1 at specific points. Only the integral over an interval gives a probability.

Cumulative distribution functions

The cumulative distribution function (CDF) gives the probability that $X$ is at most some value $x$:

$$F_X(x) = P(X \leq x) = \int_{-\infty}^{x} f_X(t)\,dt$$

Key properties:

  • $F_X$ is non-decreasing and right-continuous, with $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$
  • The PDF is recovered by differentiation: $f_X(x) = \frac{d}{dx}F_X(x)$ wherever the derivative exists
  • Interval probabilities follow directly: $P(a < X \leq b) = F_X(b) - F_X(a)$
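As a quick sanity check in Python (the Exp(0.5) example and helper names here are ours, not part of the guide), the CDF-difference rule and direct integration of the PDF give the same interval probability:

```python
import math

# Exponential(rate) CDF: F(x) = 1 - e^(-rate*x) for x >= 0
def exp_cdf(x, rate):
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0

# Interval probability via the CDF: P(a < X <= b) = F(b) - F(a)
def interval_prob(a, b, rate):
    return exp_cdf(b, rate) - exp_cdf(a, rate)

# Cross-check by numerically integrating the PDF over [a, b]
def interval_prob_numeric(a, b, rate, n=100_000):
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * h                  # midpoint rule
        total += rate * math.exp(-rate * x) * h
    return total

p_cdf = interval_prob(1.0, 2.0, rate=0.5)
p_num = interval_prob_numeric(1.0, 2.0, rate=0.5)
```

Both approaches land on the same number, which is the whole point: the CDF packages the integral once so you never have to redo it.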

Expected value and variance

The expected value (mean) of a continuous random variable weights each value by its density:

$$E[X] = \int_{-\infty}^{\infty} x\,f_X(x)\,dx$$

The variance measures spread around the mean:

$$\text{Var}(X) = E[(X - \mu)^2] = \int_{-\infty}^{\infty}(x - \mu)^2 f_X(x)\,dx$$

A useful computational shortcut is the alternate form $\text{Var}(X) = E[X^2] - (E[X])^2$, which often simplifies integration.
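A small numeric sketch of the shortcut (the `moment` helper is ours): for Uniform(0, 1), integrating $x f(x)$ and $x^2 f(x)$ gives $E[X] = 1/2$ and $E[X^2] = 1/3$, so the variance comes out to $1/3 - 1/4 = 1/12$ without ever integrating $(x - \mu)^2$ directly.

```python
def moment(pdf, k, a, b, n=200_000):
    """Numerically integrate x^k * pdf(x) over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return sum(((a + (i + 0.5) * h) ** k) * pdf(a + (i + 0.5) * h) * h
               for i in range(n))

uniform_pdf = lambda x: 1.0              # Uniform(0, 1) density on [0, 1]

ex  = moment(uniform_pdf, 1, 0.0, 1.0)   # E[X]   -> 1/2
ex2 = moment(uniform_pdf, 2, 0.0, 1.0)   # E[X^2] -> 1/3
var = ex2 - ex ** 2                      # shortcut: E[X^2] - (E[X])^2 -> 1/12
```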

Moment generating functions

The moment generating function (MGF) of $X$ is defined as:

$$M_X(t) = E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} f_X(x)\,dx$$

provided this expectation exists in a neighborhood of $t = 0$. MGFs are powerful for two reasons:

  • Extracting moments: The $n$th moment is $E[X^n] = M_X^{(n)}(0)$, i.e., the $n$th derivative evaluated at $t = 0$. So $E[X] = M_X'(0)$ and $E[X^2] = M_X''(0)$.
  • Sums of independent variables: If $X$ and $Y$ are independent, $M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$. This makes MGFs a clean tool for finding the distribution of sums.

If two random variables share the same MGF (and it exists in a neighborhood of zero), they have the same distribution. This uniqueness property is what makes MGFs so useful for identification.
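You can see the moment-extraction property numerically (a sketch, using the known MGF of Exp($\lambda$), $M_X(t) = \lambda/(\lambda - t)$, and finite differences in place of symbolic derivatives):

```python
lam = 2.0
M = lambda t: lam / (lam - t)             # MGF of Exp(lam), valid for t < lam

h = 1e-4
# central finite differences approximate the derivatives at t = 0
m1 = (M(h) - M(-h)) / (2 * h)             # ~ M'(0)  = E[X]   = 1/lam   = 0.5
m2 = (M(h) - 2 * M(0) + M(-h)) / h**2     # ~ M''(0) = E[X^2] = 2/lam^2 = 0.5
var = m2 - m1**2                          # ~ 1/lam^2 = 0.25
```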

Common continuous distributions

Several distributions appear repeatedly in stochastic processes. Each one models a different type of random phenomenon, and knowing their parameters and properties is essential.

Uniform distribution

The uniform distribution on $[a, b]$ assigns equal density to every point in the interval:

$$f_X(x) = \frac{1}{b - a}, \quad a \leq x \leq b$$

  • Mean: $E[X] = \frac{a + b}{2}$
  • Variance: $\text{Var}(X) = \frac{(b-a)^2}{12}$

This distribution is the natural model when you have no reason to favor any value over another within a range. A classic example: if a bus arrives every 20 minutes and you show up at a random time, your waiting time is $\text{Uniform}(0, 20)$.
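The bus example is easy to simulate (a quick seeded sketch; the sample mean and variance should sit near the theoretical $10$ and $400/12 \approx 33.3$):

```python
import random

random.seed(42)                                  # reproducible draws
waits = [random.uniform(0, 20) for _ in range(100_000)]

mean = sum(waits) / len(waits)                   # theory: (0 + 20)/2 = 10
var = sum((w - mean) ** 2 for w in waits) / len(waits)   # theory: 20^2/12
```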

Normal distribution

The normal (Gaussian) distribution with mean $\mu$ and variance $\sigma^2$ has PDF:

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

The standard normal is the special case $\mu = 0$, $\sigma = 1$, often denoted $Z$. Any normal variable can be standardized: $Z = \frac{X - \mu}{\sigma}$.

The normal distribution dominates applied probability for a deep reason: the central limit theorem guarantees that sums of many independent random variables, once centered and scaled, converge to it regardless of the original distribution. This is why measurement errors, aggregate biological traits, and financial log-returns are often modeled as normal.
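Standardization is exactly how you compute normal probabilities in code: convert to $Z$, then evaluate the standard normal CDF $\Phi$, which can be written with the error function. A small sketch (the exam-score numbers are made up for illustration):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), via standardization and erf."""
    z = (x - mu) / sigma                          # Z = (X - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# e.g. scores ~ N(70, 10^2): probability of scoring at most 85
p = normal_cdf(85, mu=70, sigma=10)               # = Phi(1.5) ≈ 0.9332
```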

Exponential distribution

The exponential distribution with rate $\lambda > 0$ has PDF:

$$f_X(x) = \lambda e^{-\lambda x}, \quad x \geq 0$$

  • Mean: $E[X] = \frac{1}{\lambda}$
  • Variance: $\text{Var}(X) = \frac{1}{\lambda^2}$

It models the waiting time between events in a Poisson process. Its defining property is memorylessness: $P(X > s + t \mid X > s) = P(X > t)$. The exponential distribution is the only continuous distribution with this property. This makes it the natural model for lifetimes of components that don't age or wear out.
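Memorylessness falls straight out of the survival function $P(X > x) = e^{-\lambda x}$, since $e^{-\lambda(s+t)}/e^{-\lambda s} = e^{-\lambda t}$. A two-line check (the parameter values are arbitrary):

```python
import math

def survival(x, lam):                 # P(X > x) for Exp(lam)
    return math.exp(-lam * x)

lam, s, t = 0.5, 3.0, 2.0
lhs = survival(s + t, lam) / survival(s, lam)   # P(X > s+t | X > s)
rhs = survival(t, lam)                          # P(X > t)
# lhs == rhs up to floating point: the s hours already survived are irrelevant
```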


Gamma distribution

The gamma distribution $\text{Gamma}(\alpha, \beta)$ generalizes the exponential by adding a shape parameter $\alpha > 0$ alongside the rate parameter $\beta > 0$:

$$f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}, \quad x > 0$$

  • Mean: $E[X] = \frac{\alpha}{\beta}$
  • Variance: $\text{Var}(X) = \frac{\alpha}{\beta^2}$

When $\alpha$ is a positive integer $n$, the $\text{Gamma}(n, \beta)$ distribution is exactly the distribution of the sum of $n$ independent $\text{Exp}(\beta)$ random variables. This makes it a natural model for the total waiting time until the $n$th event in a Poisson process. The special case $\alpha = 1$ recovers the exponential distribution.
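The sum-of-exponentials characterization is easy to test by simulation (a seeded sketch with $n = 3$, $\beta = 2$, so the sum should have mean $n/\beta = 1.5$ and variance $n/\beta^2 = 0.75$):

```python
import random

random.seed(7)
n, beta = 3, 2.0          # Gamma(3, 2) built as a sum of three Exp(2) waits

samples = [sum(random.expovariate(beta) for _ in range(n))
           for _ in range(100_000)]

mean = sum(samples) / len(samples)                          # theory: 1.5
var = sum((s - mean) ** 2 for s in samples) / len(samples)  # theory: 0.75
```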

Beta distribution

The beta distribution $\text{Beta}(\alpha, \beta)$ is defined on $[0, 1]$ with PDF:

$$f_X(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1}(1 - x)^{\beta - 1}, \quad 0 < x < 1$$

  • Mean: $E[X] = \frac{\alpha}{\alpha + \beta}$
  • Variance: $\text{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$

The beta distribution is extremely flexible. Depending on $\alpha$ and $\beta$, it can be uniform ($\alpha = \beta = 1$), U-shaped, J-shaped, or bell-shaped. It's the standard choice for modeling random proportions or probabilities, and it serves as the conjugate prior for the binomial likelihood in Bayesian inference.
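Python's standard library can draw beta variates directly, which makes the mean formula easy to verify (a seeded sketch with $\alpha = 2$, $\beta = 5$, so the mean should be $2/7$):

```python
import random

random.seed(0)
a, b = 2.0, 5.0
draws = [random.betavariate(a, b) for _ in range(100_000)]

mean = sum(draws) / len(draws)            # theory: a / (a + b) = 2/7 ≈ 0.286
in_unit = all(0.0 <= x <= 1.0 for x in draws)   # support is [0, 1]
```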

Transformations of random variables

Transformations let you derive the distribution of a new random variable defined as a function of an existing one. This is a core technique you'll use constantly in stochastic processes.

Distribution of functions of random variables

If $Y = g(X)$ where $g$ is a monotone, differentiable function with inverse $g^{-1}$, the PDF of $Y$ follows from the change-of-variables formula:

$$f_Y(y) = f_X(g^{-1}(y)) \left|\frac{d}{dy}g^{-1}(y)\right|$$

The absolute value of the derivative of the inverse function acts as a "Jacobian" that accounts for how $g$ stretches or compresses the probability density.

Steps for applying the formula:

  1. Write $Y = g(X)$ and solve for $X = g^{-1}(Y)$
  2. Compute $\frac{d}{dy}g^{-1}(y)$
  3. Take the absolute value
  4. Substitute into the formula, and determine the new support (range of valid $y$ values)

If $g$ is not monotone, you need to split the domain into regions where it is monotone and sum the contributions from each branch.
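The steps above can be walked through on a concrete case (our choice, for illustration): $X \sim \text{Exp}(1)$ and $Y = e^X$, where $g$ is monotone on the support.

```python
import math, random

# Step 1: g(x) = e^x, so g^{-1}(y) = ln(y); the new support is y > 1 since x > 0.
# Steps 2-3: |d/dy ln(y)| = 1/y.
# Step 4: f_Y(y) = f_X(ln y) * (1/y) = e^{-ln y} / y = 1 / y^2 for y > 1.
f_Y = lambda y: 1.0 / y**2 if y > 1 else 0.0

# Check: P(Y <= 2) from the derived density vs direct simulation.
h = 1e-4
p_formula = sum(f_Y(1 + (i + 0.5) * h) * h for i in range(10_000))  # midpoint on (1, 2]

random.seed(1)
p_sim = sum(math.exp(random.expovariate(1.0)) <= 2
            for _ in range(100_000)) / 100_000
# both sit near P(X <= ln 2) = 1 - e^{-ln 2} = 1/2
```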

Convolutions and sums of random variables

When $X$ and $Y$ are independent continuous random variables, the PDF of $Z = X + Y$ is given by the convolution integral:

$$f_Z(z) = \int_{-\infty}^{\infty} f_X(x)\,f_Y(z - x)\,dx$$

This integral "slides" one density across the other and accumulates the overlap. A few important results:

  • The sum of two independent normals $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ is $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
  • The sum of independent $\text{Gamma}(\alpha_1, \beta)$ and $\text{Gamma}(\alpha_2, \beta)$ (same rate) is $\text{Gamma}(\alpha_1 + \alpha_2, \beta)$

In practice, MGFs often provide a faster route than direct convolution: multiply the MGFs, then identify the resulting distribution.
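A numeric convolution makes the formula tangible (a sketch with two Exp(1) variables, whose sum is Gamma(2, 1) with the closed-form PDF $\lambda^2 z\,e^{-\lambda z}$):

```python
import math

lam = 1.0
f = lambda x: lam * math.exp(-lam * x) if x >= 0 else 0.0   # Exp(1) PDF

def f_Z(z, n=20_000):
    """Convolution integral for Z = X + Y; the integrand is nonzero only on [0, z]."""
    if z <= 0:
        return 0.0
    h = z / n
    return sum(f((i + 0.5) * h) * f(z - (i + 0.5) * h) * h for i in range(n))

z = 1.5
exact = lam**2 * z * math.exp(-lam * z)   # Gamma(2, lam) density at z
approx = f_Z(z)
```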

Product distribution

For the product $Z = XY$ of two independent continuous random variables, the PDF is:

$$f_Z(z) = \int_{-\infty}^{\infty} \frac{1}{|x|}\,f_X(x)\,f_Y\!\left(\frac{z}{x}\right)dx$$

This formula comes from the same change-of-variables logic, with the $\frac{1}{|x|}$ factor acting as the Jacobian. Product distributions arise in contexts like modeling the area of a rectangle with random dimensions, or in signal processing where a signal is multiplied by a random gain.
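The rectangle-area example can be simulated directly (a seeded sketch; for two Uniform(0, 1) sides the product CDF works out by the integral above to $P(Z \leq z) = z - z\ln z$ on $0 < z < 1$, and independence gives $E[XY] = E[X]E[Y] = 1/4$):

```python
import math, random

random.seed(3)
# Area of a rectangle with independent Uniform(0, 1) sides
areas = [random.random() * random.random() for _ in range(200_000)]

mean = sum(areas) / len(areas)       # independence: E[XY] = E[X]E[Y] = 0.25

z = 0.25
p_exact = z - z * math.log(z)        # product-of-uniforms CDF at z ≈ 0.597
p_sim = sum(a <= z for a in areas) / len(areas)
```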

Joint continuous distributions

Joint distributions describe the simultaneous behavior of two or more continuous random variables. They capture not just individual behavior but also the dependence structure between variables.

Joint probability density functions

The joint PDF $f_{X,Y}(x,y)$ of two continuous random variables must satisfy:

  • $f_{X,Y}(x,y) \geq 0$ for all $(x,y)$
  • $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx\,dy = 1$

The probability that $(X,Y)$ falls in a region $A$ is:

$$P((X,Y) \in A) = \iint_A f_{X,Y}(x,y)\,dx\,dy$$

Two random variables are independent if and only if their joint PDF factors: $f_{X,Y}(x,y) = f_X(x) \cdot f_Y(y)$ for all $(x,y)$.


Marginal and conditional distributions

Marginal distributions recover the distribution of a single variable by integrating out the other:

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy \qquad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx$$

Conditional distributions describe one variable given a fixed value of the other:

$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}, \quad \text{provided } f_X(x) > 0$$

This is the continuous analog of conditional probability. The conditional expectation $E[Y \mid X = x] = \int_{-\infty}^{\infty} y\,f_{Y|X}(y|x)\,dy$ is particularly important in stochastic processes, where it forms the basis of filtering and prediction.

Covariance and correlation

Covariance measures the linear co-movement of two random variables:

$$\text{Cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]$$

The computational form $E[XY] - E[X]E[Y]$ is usually easier to evaluate than the definition.

The correlation coefficient normalizes covariance to the range $[-1, 1]$:

$$\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$$

  • $\rho = 1$ or $\rho = -1$: perfect linear relationship
  • $\rho = 0$: no linear relationship (but the variables may still be dependent in a nonlinear way)
  • If $X$ and $Y$ are independent, then $\text{Cov}(X,Y) = 0$. The converse is not true in general.
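The classic counterexample to the converse is $Y = X^2$ with $X$ symmetric about zero: $Y$ is completely determined by $X$, yet $\text{Cov}(X, X^2) = E[X^3] = 0$. A seeded simulation sketch:

```python
import random

random.seed(5)
xs = [random.gauss(0, 1) for _ in range(100_000)]
ys = [x * x for x in xs]             # Y = X^2: fully determined by X, yet...

mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
# cov ≈ 0: zero covariance, but clearly dependent variables
```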

Order statistics

Order statistics deal with the sorted values of a random sample. If you draw $n$ independent observations from the same continuous distribution and sort them, the $k$th smallest value $X_{(k)}$ is the $k$th order statistic.

Distribution of the kth order statistic

Given $n$ i.i.d. continuous random variables with PDF $f_X(x)$ and CDF $F_X(x)$, the PDF of the $k$th order statistic $X_{(k)}$ is:

$$f_{X_{(k)}}(x) = \frac{n!}{(k-1)!(n-k)!}\,[F_X(x)]^{k-1}[1 - F_X(x)]^{n-k}f_X(x)$$

The intuition: for $X_{(k)}$ to have a density at $x$, exactly $k-1$ observations must fall below $x$, one observation must be at $x$, and $n-k$ must fall above. The combinatorial prefactor counts the ways to assign observations to these three groups.

Two special cases come up constantly:

  • Minimum ($k = 1$): $f_{X_{(1)}}(x) = n[1 - F_X(x)]^{n-1}f_X(x)$
  • Maximum ($k = n$): $f_{X_{(n)}}(x) = n[F_X(x)]^{n-1}f_X(x)$

The CDF of the $k$th order statistic is:

$$F_{X_{(k)}}(x) = \sum_{i=k}^{n} \binom{n}{i}[F_X(x)]^i[1 - F_X(x)]^{n-i}$$
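The CDF formula is just "at least $k$ of the $n$ draws land at or below $x$," which makes it easy to check by simulation. A seeded sketch for the sample median ($k = 3$) of $n = 5$ uniforms, where symmetry forces $P(X_{(3)} \leq 0.5) = 1/2$:

```python
import random
from math import comb

def order_stat_cdf(x, n, k, F):
    """P(X_(k) <= x): at least k of n i.i.d. draws fall at or below x."""
    p = F(x)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, k = 5, 3
F_unif = lambda x: x                            # CDF of Uniform(0, 1)
p_formula = order_stat_cdf(0.5, n, k, F_unif)   # = 16/32 = 0.5 by symmetry

random.seed(11)
trials = 100_000
hits = sum(sorted(random.random() for _ in range(n))[k - 1] <= 0.5
           for _ in range(trials))
p_sim = hits / trials
```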

Extreme value distributions

When $n$ grows large, the distribution of the sample maximum (after appropriate centering and scaling) converges to one of three extreme value distributions, classified by the tail behavior of the parent distribution:

  • Gumbel (Type I): For distributions with exponentially decaying tails (e.g., normal, exponential). The CDF is $\exp(-e^{-(x-\mu)/\beta})$.
  • Fréchet (Type II): For distributions with heavy (polynomial) tails (e.g., Pareto, Cauchy).
  • Weibull (Type III): For distributions with a finite upper endpoint (e.g., uniform, beta).

These three families are unified by the Generalized Extreme Value (GEV) distribution, parameterized by a shape parameter $\xi$ that determines which type applies. Extreme value theory is central to risk modeling in finance, hydrology, and engineering, where you need to estimate the probability of rare, large events.

Limit theorems

Limit theorems describe what happens to sums and averages of random variables as the sample size grows. They provide the theoretical justification for much of statistical inference.

Law of large numbers for continuous variables

The law of large numbers (LLN) says that the sample mean converges to the population mean as $n$ grows. Formally, for i.i.d. random variables $X_1, X_2, \ldots$ with mean $\mu$:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} \mu \quad \text{as } n \to \infty$$

The weak LLN gives convergence in probability (shown above). The strong LLN gives almost sure convergence, meaning $P(\lim_{n\to\infty} \bar{X}_n = \mu) = 1$. The strong version requires $E[|X|] < \infty$; the weak version can hold under slightly weaker conditions.

The LLN justifies using sample averages as estimators and underpins simulation methods like Monte Carlo estimation.
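A minimal Monte Carlo sketch of the LLN in action (seeded, with Exp(2) draws whose true mean is $0.5$): the running sample mean's error shrinks as $n$ grows, roughly like $1/\sqrt{n}$.

```python
import random

random.seed(2)
lam = 2.0
true_mean = 1 / lam                 # Exp(2) has mean 0.5

errors = {}
total, count = 0.0, 0
for n in (100, 10_000, 1_000_000):
    while count < n:                # extend the same running sum to n draws
        total += random.expovariate(lam)
        count += 1
    errors[n] = abs(total / count - true_mean)
# errors[n] typically shrinks toward 0 as n grows
```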

Central limit theorem for continuous variables

The central limit theorem (CLT) is arguably the most important result in probability. For i.i.d. random variables with mean $\mu$ and finite variance $\sigma^2$:

$$Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty$$

Equivalently, for large $n$, the sum $S_n = \sum_{i=1}^n X_i$ is approximately $N(n\mu, n\sigma^2)$.

The CLT holds regardless of the shape of the original distribution, as long as the variance is finite. This is why normal-based confidence intervals and hypothesis tests work even when the underlying data aren't normal, provided the sample size is large enough. As a rough guideline, $n \geq 30$ is often sufficient for moderately skewed distributions, but highly skewed or heavy-tailed distributions may require larger samples.
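A seeded simulation sketch of the CLT: even though Exp(1) is heavily right-skewed, the standardized sample mean $Z_n$ of $n = 50$ draws already behaves almost like a standard normal, so the fraction of $|Z_n| \leq 1.96$ lands near the normal value $0.95$.

```python
import math, random

random.seed(9)
n, reps = 50, 20_000
mu, sigma = 1.0, 1.0                # Exp(1): mean 1, variance 1 — far from normal

def z_n():
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    return (xbar - mu) / (sigma / math.sqrt(n))   # standardized sample mean

inside = sum(abs(z_n()) <= 1.96 for _ in range(reps)) / reps
# for a standard normal, P(|Z| <= 1.96) ≈ 0.95; the simulated fraction sits nearby
```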