1.2 Random variables and distributions

Written by the Fiveable Content Team • Last updated August 2025

Random variables and distributions provide the mathematical language for modeling uncertainty in causal inference. They let you formally describe treatment assignments, potential outcomes, and observed data, which are the building blocks for estimating causal effects. This topic covers the types of random variables, their probability distributions and properties, common named distributions, and how to transform random variables.

Types of random variables

A random variable assigns a numerical value to each outcome of a random process. In causal inference, random variables represent things like whether a subject received treatment, what outcome they experienced, or what covariates they have. The type of random variable you're working with determines which mathematical tools you'll use.

Discrete vs continuous

Discrete random variables take on a countable set of values, each with a positive probability. Think of counts: the number of patients who recover, the number of defective items in a batch.

Continuous random variables take on uncountably many values within some interval of real numbers. Think of measurements: blood pressure, income, time until an event.

This distinction matters because discrete variables use sums (and probability mass functions), while continuous variables use integrals (and probability density functions). Choosing the wrong framework leads to incorrect calculations.

Bernoulli random variables

Bernoulli random variables are the simplest type: they model a single binary outcome. The variable equals 1 ("success") with probability $p$ and 0 ("failure") with probability $1-p$.

In causal inference, Bernoulli variables show up constantly. Treatment assignment in a randomized experiment is often Bernoulli: each subject either gets the treatment (1) or doesn't (0).

  • Expected value: $E[X] = p$
  • Variance: $Var(X) = p(1-p)$

Notice the variance is maximized when $p = 0.5$ and shrinks toward zero as $p$ approaches 0 or 1.
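
As a quick numerical check, here's a minimal NumPy sketch (the seed, sample size, and $p = 0.3$ are illustrative) that simulates Bernoulli treatment assignment and compares the empirical mean and variance to $p$ and $p(1-p)$:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
p = 0.3                                   # illustrative treatment probability
x = rng.binomial(n=1, p=p, size=100_000)  # Bernoulli draws: 1 = treated, 0 = control

print(x.mean())  # ≈ p = 0.3
print(x.var())   # ≈ p * (1 - p) = 0.21
```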

Binomial random variables

If you repeat a Bernoulli trial $n$ independent times and count the total successes, you get a binomial random variable. For example, if 10 patients each independently have a 0.3 probability of recovery, the total number who recover follows a Binomial(10, 0.3) distribution.

  • PMF: $P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$
  • Expected value: $E[X] = np$
  • Variance: $Var(X) = np(1-p)$

The binomial is just the sum of $n$ independent Bernoulli variables, so its mean and variance follow directly from linearity of expectation and independence.
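
A small simulation can confirm this sum-of-Bernoullis view; a minimal sketch, with $n = 10$ and $p = 0.3$ chosen to match the example above:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n, p = 10, 0.3

# Each row is one experiment: n independent Bernoulli trials.
# Summing across a row gives one Binomial(n, p) draw.
trials = rng.binomial(n=1, p=p, size=(100_000, n))
totals = trials.sum(axis=1)

print(totals.mean())  # ≈ n * p = 3.0
print(totals.var())   # ≈ n * p * (1 - p) = 2.1
```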

Poisson random variables

Poisson random variables model the count of events occurring in a fixed interval of time or space, when events happen independently at a constant average rate. Examples: the number of hospital admissions per day, or the number of mutations in a stretch of DNA.

The single parameter $\lambda$ represents the average rate of events per interval.

  • PMF: $P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$
  • Expected value: $E[X] = \lambda$
  • Variance: $Var(X) = \lambda$

A useful fact: the mean equals the variance. If you see count data where the variance is much larger than the mean, a Poisson model may not be appropriate (this is called overdispersion).
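
The mean-variance comparison makes a simple diagnostic: compute both from your count data and look at their ratio. A minimal sketch with simulated data ($\lambda = 4$ is illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
counts = rng.poisson(lam=4.0, size=100_000)  # e.g., admissions per day

mean, var = counts.mean(), counts.var()
print(mean, var)   # both ≈ λ = 4.0 for Poisson-distributed counts
print(var / mean)  # dispersion ratio; values well above 1 suggest overdispersion
```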

Gaussian random variables

The Gaussian (normal) distribution is the most widely used continuous distribution. It's symmetric and bell-shaped, fully characterized by its mean $\mu$ and standard deviation $\sigma$.

  • PDF: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
  • Expected value: $E[X] = \mu$
  • Variance: $Var(X) = \sigma^2$

The central limit theorem explains why the Gaussian is so important: the sum (or average) of many independent random variables tends toward a Gaussian distribution regardless of the original distribution, as long as certain regularity conditions hold. This is why sample means are approximately normal in large samples, which underpins most of the hypothesis testing and confidence interval construction you'll encounter in causal inference.
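
You can watch the CLT in action by averaging draws from a skewed distribution; the sample means cluster symmetrically around the true mean. A sketch, with the exponential distribution and $n = 50$ as illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# 10,000 sample means, each computed from n = 50 exponential (skewed) draws
n = 50
means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

# The means concentrate near 1 (the exponential's mean), with spread ≈ 1/sqrt(n)
print(means.mean(), means.std())  # ≈ 1.0 and ≈ 0.141
```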

Probability distributions

Probability distributions formalize how likely each possible value of a random variable is. In causal inference, they model uncertainty in treatment assignments, potential outcomes, and observed data. The three main ways to describe a distribution are through mass functions, density functions, and cumulative distribution functions.

Probability mass functions

A probability mass function (PMF) applies to discrete random variables. It gives the probability that the variable takes each specific value: $P(X = x)$.

Two requirements for a valid PMF:

  • $P(X=x) \geq 0$ for all $x$
  • $\sum_x P(X=x) = 1$

You can read probabilities directly off a PMF. For instance, if $X$ is Binomial(3, 0.5), then $P(X=2) = \binom{3}{2}(0.5)^2(0.5)^1 = 0.375$.
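
Here's that same calculation done two ways, once from the PMF formula with Python's built-in binomial coefficient and once with SciPy (a minimal check, nothing beyond the formula above):

```python
from math import comb

from scipy.stats import binom

n, p, k = 3, 0.5, 2
print(comb(n, k) * p**k * (1 - p) ** (n - k))  # 0.375, straight from the PMF formula
print(binom.pmf(k, n, p))                      # 0.375, via scipy.stats
```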

Probability density functions

A probability density function (PDF) applies to continuous random variables. Unlike a PMF, the PDF value at a single point is not a probability. Instead, you integrate the PDF over an interval to get the probability of falling in that interval:

$$P(a \leq X \leq b) = \int_a^b f(x)\, dx$$

Two requirements for a valid PDF:

  • $f(x) \geq 0$ for all $x$
  • $\int_{-\infty}^{\infty} f(x)\, dx = 1$

A common mistake is interpreting $f(x)$ as a probability. It's a density, so it can actually exceed 1 at some points, as long as the total area under the curve equals 1.
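
Both points are easy to see concretely. The Uniform(0, 0.5) density equals 2 everywhere on its support yet integrates to 1, and integrating a normal density over an interval recovers a probability. A minimal SciPy sketch (the interval $[-1, 1]$ is illustrative):

```python
from scipy.integrate import quad
from scipy.stats import norm, uniform

# Uniform(0, 0.5): the density is 2 on [0, 0.5] -- a density above 1 is fine
print(uniform.pdf(0.25, loc=0, scale=0.5))  # 2.0

# P(-1 <= X <= 1) for a standard normal, by integrating the PDF
area, _ = quad(norm.pdf, -1, 1)
print(area)                                 # ≈ 0.6827
```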

Cumulative distribution functions

The cumulative distribution function (CDF) works for both discrete and continuous variables. It gives the probability that the variable is less than or equal to a given value:

$$F(x) = P(X \leq x)$$

Key properties:

  • Non-decreasing: if $a < b$, then $F(a) \leq F(b)$
  • $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$
  • For continuous variables, the CDF is a continuous curve; for discrete variables, it's a step function

The CDF is especially handy for computing interval probabilities: $P(a < X \leq b) = F(b) - F(a)$.
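
A one-line check of that identity with SciPy (the standard normal and endpoints ±1 are illustrative); it should match the integration result from the PDF section:

```python
from scipy.stats import norm

a, b = -1.0, 1.0
# P(a < X <= b) = F(b) - F(a) for a standard normal
print(norm.cdf(b) - norm.cdf(a))  # ≈ 0.6827
```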

Joint distributions

Joint distributions describe the simultaneous behavior of two or more random variables. For discrete variables $X$ and $Y$, the joint PMF is $P(X=x, Y=y)$. For continuous variables, the joint PDF is $f(x, y)$.

Joint distributions are essential in causal inference because you're almost always dealing with multiple variables at once (treatment, outcome, covariates). From a joint distribution, you can derive conditional and marginal distributions, which are the tools for reasoning about how variables relate to each other.

Conditional distributions

A conditional distribution describes the distribution of one variable given a known value of another. This is central to causal inference, where you often want to know the distribution of an outcome given a particular treatment.

  • Discrete case: $P(Y=y \mid X=x) = \frac{P(X=x, Y=y)}{P(X=x)}$
  • Continuous case: $f(y \mid x) = \frac{f(x, y)}{f(x)}$

The denominator must be nonzero (you can only condition on events that have positive probability or density). Conditional distributions are what connect observed associations to potential causal relationships, though moving from association to causation requires additional assumptions.

Marginal distributions

A marginal distribution gives the distribution of a single variable, ignoring (or "marginalizing out") the others.

  • Discrete case: $P(X=x) = \sum_y P(X=x, Y=y)$
  • Continuous case: $f(x) = \int_{-\infty}^{\infty} f(x, y)\, dy$

You recover the marginal by summing or integrating the joint distribution over all values of the other variable. In causal inference, comparing marginal distributions of outcomes across treatment groups is one way to assess average treatment effects.
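
For discrete variables, both operations reduce to row and column arithmetic on a table. A sketch using a hypothetical 2×2 joint PMF of a binary treatment X and binary outcome Y (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical joint PMF: rows index X (treatment 0/1), columns index Y (outcome 0/1)
joint = np.array([[0.30, 0.20],
                  [0.15, 0.35]])  # entries sum to 1

marginal_x = joint.sum(axis=1)    # P(X = x): marginalize out Y
marginal_y = joint.sum(axis=0)    # P(Y = y): marginalize out X

# Conditional P(Y = y | X = x): divide each row of the joint by its marginal
cond_y_given_x = joint / marginal_x[:, None]

print(marginal_x)         # [0.5 0.5]
print(cond_y_given_x[1])  # [P(Y=0 | X=1), P(Y=1 | X=1)] = [0.3 0.7]
```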

Properties of distributions

These summary quantities let you describe and compare distributions without specifying the full PMF or PDF. In causal inference, they're used to quantify treatment effects, measure precision of estimates, and assess relationships between variables.

Expected value

The expected value (mean) measures the center of a distribution. It's the long-run average you'd observe if you could repeat the random process infinitely many times.

  • Discrete: $E[X] = \sum_x x\, P(X=x)$
  • Continuous: $E[X] = \int_{-\infty}^{\infty} x\, f(x)\, dx$

The most important property of expected value is linearity: for any random variables $X$ and $Y$ and constants $a$ and $b$,

$$E[aX + bY] = aE[X] + bE[Y]$$

This holds whether or not $X$ and $Y$ are independent. Linearity is used constantly in deriving estimators for causal effects.
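
A simulation makes the "no independence needed" point concrete. In the sketch below (the dependence is made up), Y is constructed directly from X, yet the sample average of $aX + bY$ still lands on $aE[X] + bE[Y]$:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.normal(loc=2.0, scale=1.0, size=200_000)  # E[X] = 2
y = 0.5 * x + rng.normal(size=200_000)            # depends on x; E[Y] = 1

a, b = 3.0, -2.0
print(np.mean(a * x + b * y))  # ≈ 4.0
print(a * 2.0 + b * 1.0)       # a*E[X] + b*E[Y] = 3*2 - 2*1 = 4.0
```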

Variance and standard deviation

Variance measures how spread out a distribution is around its mean:

$$Var(X) = E[(X - E[X])^2]$$

A useful computational shortcut: $Var(X) = E[X^2] - (E[X])^2$.

The standard deviation is $\sqrt{Var(X)}$, which has the same units as $X$ and is often easier to interpret.

Unlike expectation, variance is not linear. For a constant $a$: $Var(aX) = a^2 Var(X)$. For independent variables: $Var(X + Y) = Var(X) + Var(Y)$. If they're not independent, you need the covariance term: $Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)$.

In causal inference, variance quantifies the precision of your treatment effect estimates and drives the width of confidence intervals.
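
The sketch below (with a made-up dependence between X and Y) verifies the version with the covariance term; dropping that term would visibly miss:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.normal(size=200_000)
y = x + rng.normal(size=200_000)  # correlated with x by construction

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, ddof=0)[0, 1]
print(lhs, rhs)  # equal; Var(X) + Var(Y) alone would understate the spread
```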

Covariance and correlation

Covariance measures the linear association between two random variables:

$$Cov(X, Y) = E[(X - E[X])(Y - E[Y])]$$

Positive covariance means the variables tend to move together; negative means they tend to move in opposite directions; zero means no linear relationship (but there could still be a nonlinear one).

Correlation standardizes covariance to the range $[-1, 1]$:

$$Corr(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X)\,Var(Y)}}$$

A correlation of $\pm 1$ means a perfect linear relationship. In causal inference, covariance and correlation help identify potential confounders and assess relationships between variables, though correlation alone never establishes causation.
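
NumPy computes both quantities directly. A minimal sketch with a made-up linear relationship, where the theoretical values are $Cov = 2$ and $Corr = 2/\sqrt{5} \approx 0.894$:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(size=100_000)  # positive linear association plus noise

print(np.cov(x, y)[0, 1])       # covariance ≈ 2.0
print(np.corrcoef(x, y)[0, 1])  # correlation ≈ 0.894
```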

Moment generating functions

A moment generating function (MGF) uniquely characterizes a distribution (when it exists). For a random variable $X$:

$$M_X(t) = E[e^{tX}]$$

The name comes from the fact that you can extract moments by differentiation:

  • $E[X] = M_X'(0)$ (first derivative at $t=0$)
  • $E[X^2] = M_X''(0)$ (second derivative at $t=0$)

MGFs are particularly useful for proving that sums of independent random variables follow specific distributions, since the MGF of a sum equals the product of the individual MGFs.
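
The differentiation can be done symbolically. A SymPy sketch for a Bernoulli(p) variable, using the standard result that its MGF is $M(t) = (1-p) + p e^t$:

```python
import sympy as sp

t, p = sp.symbols('t p')
M = (1 - p) + p * sp.exp(t)        # MGF of a Bernoulli(p) random variable

EX = sp.diff(M, t).subs(t, 0)      # first moment: M'(0)
EX2 = sp.diff(M, t, 2).subs(t, 0)  # second moment: M''(0)

print(EX)                        # p
print(sp.simplify(EX2 - EX**2))  # variance: p - p**2 = p(1 - p)
```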

Characteristic functions

A characteristic function (CF) serves a similar role to the MGF but always exists:

$$\phi_X(t) = E[e^{itX}]$$

where $i$ is the imaginary unit. The MGF may not exist for distributions with heavy tails (like the Cauchy distribution), but the CF always does. CFs are used in more advanced theoretical work, such as proving convergence results for sequences of random variables.

Common distributions

These named distributions appear repeatedly across statistics and causal inference. Each one models a particular type of data-generating process.

Uniform distribution

The uniform distribution assigns equal density to every value in an interval $[a, b]$, so subintervals of equal length are equally likely.

  • PDF: $f(x) = \frac{1}{b-a}$ for $a \leq x \leq b$, and 0 otherwise
  • Expected value: $\frac{a+b}{2}$
  • Variance: $\frac{(b-a)^2}{12}$

In causal inference, uniform distributions naturally model completely random treatment assignment, where every unit has the same probability of being assigned to each group.
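
One way this plays out in practice: draw a Uniform(0, 1) number per subject and compare it to the assignment probability. A minimal sketch (the probability 0.5 and the tiny sample are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
u = rng.uniform(size=10)         # one Uniform(0, 1) draw per subject
treated = (u < 0.5).astype(int)  # treat with probability 0.5

print(treated)  # a vector of 0s and 1s -- a Bernoulli(0.5) assignment
```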

Exponential distribution

The exponential distribution models waiting times between events in a Poisson process.

  • PDF: $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$
  • Expected value: $\frac{1}{\lambda}$
  • Variance: $\frac{1}{\lambda^2}$

The exponential distribution has the memoryless property: $P(X > s + t \mid X > s) = P(X > t)$. The probability of waiting another $t$ units doesn't depend on how long you've already waited. This is the only continuous distribution with this property.
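
A quick empirical check of memorylessness (the rate $\lambda = 1$ and thresholds $s = 1.5$, $t = 2.0$ are illustrative); both estimates should sit near $e^{-2} \approx 0.135$:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.exponential(scale=1.0, size=1_000_000)  # rate λ = 1
s, t = 1.5, 2.0

cond = (x[x > s] > s + t).mean()  # estimate of P(X > s + t | X > s)
uncond = (x > t).mean()           # estimate of P(X > t)

print(cond, uncond)  # both ≈ 0.135
```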

Gamma distribution

The gamma distribution generalizes the exponential. While the exponential models the time until the first event, the gamma models the time until the $k$-th event in a Poisson process.

  • PDF: $f(x) = \frac{\lambda^k x^{k-1} e^{-\lambda x}}{\Gamma(k)}$ for $x \geq 0$, where $\Gamma(\cdot)$ is the gamma function
  • Expected value: $\frac{k}{\lambda}$
  • Variance: $\frac{k}{\lambda^2}$

When $k = 1$, the gamma reduces to the exponential. The gamma is useful for modeling positive, right-skewed continuous variables.

Beta distribution

The beta distribution is defined on $[0, 1]$, making it natural for modeling probabilities or proportions.

  • PDF: $f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$ for $0 \leq x \leq 1$, where $B(\cdot, \cdot)$ is the beta function
  • Expected value: $\frac{\alpha}{\alpha + \beta}$
  • Variance: $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$

The beta is extremely flexible: depending on $\alpha$ and $\beta$, it can be uniform ($\alpha = \beta = 1$), U-shaped, skewed left, skewed right, or symmetric and peaked. It's also the conjugate prior for Bernoulli and binomial likelihoods, which makes it central to Bayesian approaches in causal inference.

Chi-squared distribution

The chi-squared distribution with $k$ degrees of freedom is the distribution of the sum of $k$ independent squared standard normal variables: if $Z_1, \ldots, Z_k \sim N(0,1)$ independently, then $\sum_{i=1}^k Z_i^2 \sim \chi^2_k$.

  • Expected value: $k$
  • Variance: $2k$

Chi-squared distributions are used in hypothesis testing (goodness-of-fit tests, tests of independence) and in constructing confidence intervals for variances.
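
The defining construction is easy to replicate by simulation; a sketch with $k = 4$ (an illustrative choice) confirming the stated mean and variance:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
k = 4
z = rng.normal(size=(200_000, k))  # k independent standard normals per row
chi2 = (z**2).sum(axis=1)          # row sums of squares: chi-squared with k df

print(chi2.mean(), chi2.var())     # ≈ k = 4 and 2k = 8
```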

Student's t-distribution

The t-distribution with $\nu$ degrees of freedom arises when you estimate the mean of a normal population using the sample standard deviation instead of the true standard deviation.

  • It's symmetric and bell-shaped like the normal, but has heavier tails, meaning extreme values are more likely
  • As $\nu \to \infty$, the t-distribution converges to the standard normal
  • At small sample sizes (say $\nu < 30$), the heavier tails matter and produce wider confidence intervals than you'd get using a normal distribution

The t-distribution is the workhorse for hypothesis tests and confidence intervals about means when the population variance is unknown, which is almost always the case in practice.
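
Comparing 97.5% quantiles makes the heavier tails visible (the 95% confidence level is an illustrative choice); the t critical value shrinks toward the normal's 1.96 as the degrees of freedom grow:

```python
from scipy.stats import norm, t

for df in (5, 30, 1000):
    print(df, t.ppf(0.975, df))  # 2.571, 2.042, 1.962
print(norm.ppf(0.975))           # 1.960, the normal benchmark
```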

F-distribution

The F-distribution with $d_1$ and $d_2$ degrees of freedom is the distribution of the ratio of two independent chi-squared variables, each divided by their degrees of freedom.

  • It's defined only for positive values and is right-skewed
  • Used primarily in ANOVA (comparing means across multiple groups) and in F-tests for comparing variances

In causal inference, F-tests come up when testing whether treatment effects differ across multiple treatment arms simultaneously.

Transformations of random variables

Transformations create new random variables by applying functions to existing ones. In causal inference, you'll encounter transformations when computing propensity scores, inverse probability weights, log-transformed outcomes, and other derived quantities.

Linear transformations

A linear transformation takes the form $Y = aX + b$, where $a$ and $b$ are constants.

The effect on the distribution's summary statistics is straightforward:

  • $E[Y] = aE[X] + b$
  • $Var(Y) = a^2 Var(X)$

Notice that adding a constant $b$ shifts the mean but doesn't affect the variance. Multiplying by $a$ scales the mean by $a$ and the standard deviation by $|a|$, and scales the variance by $a^2$.

A common example: standardizing a variable by computing $Z = \frac{X - \mu}{\sigma}$. This is a linear transformation that produces a variable with mean 0 and variance 1.
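
A NumPy version of standardization (the blood-pressure-like parameters are made up), using sample estimates in place of the population $\mu$ and $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.normal(loc=120.0, scale=15.0, size=50_000)  # e.g., blood pressure readings

z = (x - x.mean()) / x.std()  # linear transformation: shift by -mean, scale by 1/sd

print(z.mean(), z.std())      # ≈ 0.0 and 1.0
```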

Functions of random variables

For a general (possibly nonlinear) transformation $Y = g(X)$, finding the distribution of $Y$ requires more work. Two standard approaches:

  1. CDF method: Find $F_Y(y) = P(Y \leq y) = P(g(X) \leq y)$, then differentiate to get the PDF.
  2. Change of variables formula: If $g$ is monotonic and differentiable, the PDF of $Y$ is:

$$f_Y(y) = f_X(g^{-1}(y)) \cdot \left|\frac{d}{dy} g^{-1}(y)\right|$$
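
As a concrete check of the change-of-variables formula, take $Y = e^X$ with $X$ standard normal, so $g^{-1}(y) = \log y$ and the Jacobian term is $1/y$; the result should match SciPy's lognormal density (the evaluation point $y = 2$ is illustrative):

```python
import numpy as np
from scipy.stats import lognorm, norm

y = 2.0
manual = norm.pdf(np.log(y)) / y  # f_X(g^{-1}(y)) * |d/dy g^{-1}(y)|
print(manual)                     # ≈ 0.157
print(lognorm.pdf(y, s=1.0))      # scipy's lognormal density agrees
```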

The expected value of $Y = g(X)$ can be computed without first finding the distribution of $Y$, using the law of the unconscious statistician (LOTUS):

  • Discrete: $E[g(X)] = \sum_x g(x)\, P(X=x)$
  • Continuous: $E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f(x)\, dx$

LOTUS is a practical shortcut you'll use frequently when computing expected values of transformed variables.
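
A final sketch applying LOTUS with $g(x) = x^2$ and $X$ standard normal: both numerical integration of $g(x) f(x)$ and plain simulation return $E[X^2] = 1$, with no need to derive the distribution of $X^2$ first:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

g = lambda x: x**2  # the transformation of interest

# LOTUS: E[g(X)] = integral of g(x) * f(x), without finding the law of g(X)
val, _ = quad(lambda x: g(x) * norm.pdf(x), -np.inf, np.inf)
print(val)  # 1.0

# Simulation agrees
rng = np.random.default_rng(seed=0)
print(g(rng.normal(size=500_000)).mean())  # ≈ 1.0
```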