1.2 Random variables and distributions

Written by the Fiveable Content Team • Last updated August 2025

Random variables and distributions provide the mathematical language for modeling uncertainty in causal inference. They let you formally describe treatment assignments, potential outcomes, and observed data, which are the building blocks for estimating causal effects. This topic covers the types of random variables, their probability distributions and properties, common named distributions, and how to transform random variables.

Types of random variables

A random variable assigns a numerical value to each outcome of a random process. In causal inference, random variables represent things like whether a subject received treatment, what outcome they experienced, or what covariates they have. The type of random variable you're working with determines which mathematical tools you'll use.

Discrete vs continuous

Discrete random variables take on a countable set of values, each with a positive probability. Think of counts: the number of patients who recover, the number of defective items in a batch.

Continuous random variables take on uncountably many values within some interval of real numbers. Think of measurements: blood pressure, income, time until an event.

This distinction matters because discrete variables use sums (and probability mass functions), while continuous variables use integrals (and probability density functions). Choosing the wrong framework leads to incorrect calculations.

Bernoulli random variables

Bernoulli random variables are the simplest type: they model a single binary outcome. The variable equals 1 ("success") with probability $p$ and 0 ("failure") with probability $1-p$.

In causal inference, Bernoulli variables show up constantly. Treatment assignment in a randomized experiment is often Bernoulli: each subject either gets the treatment (1) or doesn't (0).

  • Expected value: $E[X] = p$
  • Variance: $Var(X) = p(1-p)$

Notice the variance is maximized when $p = 0.5$ and shrinks toward zero as $p$ approaches 0 or 1.
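
As a quick numerical check, here's a minimal NumPy sketch (the seed, sample size, and $p = 0.3$ are illustrative) that simulates Bernoulli treatment assignment and compares the empirical mean and variance to $p$ and $p(1-p)$:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
p = 0.3                                   # illustrative treatment probability
x = rng.binomial(n=1, p=p, size=100_000)  # Bernoulli draws: 1 = treated, 0 = control

print(x.mean())  # ≈ p = 0.3
print(x.var())   # ≈ p * (1 - p) = 0.21
```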

Binomial random variables

If you repeat a Bernoulli trial $n$ independent times and count the total successes, you get a binomial random variable. For example, if 10 patients each independently have a 0.3 probability of recovery, the total number who recover follows a Binomial(10, 0.3) distribution.

  • PMF: $P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$
  • Expected value: $E[X] = np$
  • Variance: $Var(X) = np(1-p)$

The binomial is just the sum of $n$ independent Bernoulli variables, so its mean and variance follow directly from linearity of expectation and independence.
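
A small simulation can confirm this sum-of-Bernoullis view; a minimal sketch, with $n = 10$ and $p = 0.3$ chosen to match the example above:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n, p = 10, 0.3

# Each row is one experiment: n independent Bernoulli trials.
# Summing across a row gives one Binomial(n, p) draw.
trials = rng.binomial(n=1, p=p, size=(100_000, n))
totals = trials.sum(axis=1)

print(totals.mean())  # ≈ n * p = 3.0
print(totals.var())   # ≈ n * p * (1 - p) = 2.1
```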

Poisson random variables

Poisson random variables model the count of events occurring in a fixed interval of time or space, when events happen independently at a constant average rate. Examples: the number of hospital admissions per day, or the number of mutations in a stretch of DNA.

The single parameter $\lambda$ represents the average rate of events per interval.

  • PMF: $P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$
  • Expected value: $E[X] = \lambda$
  • Variance: $Var(X) = \lambda$

A useful fact: the mean equals the variance. If you see count data where the variance is much larger than the mean, a Poisson model may not be appropriate (this is called overdispersion).
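
The mean-variance comparison makes a simple diagnostic: compute both from your count data and look at their ratio. A minimal sketch with simulated data ($\lambda = 4$ is illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
counts = rng.poisson(lam=4.0, size=100_000)  # e.g., admissions per day

mean, var = counts.mean(), counts.var()
print(mean, var)   # both ≈ λ = 4.0 for Poisson-distributed counts
print(var / mean)  # dispersion ratio; values well above 1 suggest overdispersion
```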

Gaussian random variables

The Gaussian (normal) distribution is the most widely used continuous distribution. It's symmetric and bell-shaped, fully characterized by its mean $\mu$ and standard deviation $\sigma$.

  • PDF: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
  • Expected value: $E[X] = \mu$
  • Variance: $Var(X) = \sigma^2$

The central limit theorem explains why the Gaussian is so important: the sum (or average) of many independent random variables tends toward a Gaussian distribution regardless of the original distribution, as long as certain regularity conditions hold. This is why sample means are approximately normal in large samples, which underpins most of the hypothesis testing and confidence interval construction you'll encounter in causal inference.
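
You can watch the CLT in action by averaging draws from a skewed distribution; the sample means cluster symmetrically around the true mean. A sketch, with the exponential distribution and $n = 50$ as illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# 10,000 sample means, each computed from n = 50 exponential (skewed) draws
n = 50
means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

# The means concentrate near 1 (the exponential's mean), with spread ≈ 1/sqrt(n)
print(means.mean(), means.std())  # ≈ 1.0 and ≈ 0.141
```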

Probability distributions

Probability distributions formalize how likely each possible value of a random variable is. In causal inference, they model uncertainty in treatment assignments, potential outcomes, and observed data. The three main ways to describe a distribution are through mass functions, density functions, and cumulative distribution functions.

Probability mass functions

A probability mass function (PMF) applies to discrete random variables. It gives the probability that the variable takes each specific value: $P(X = x)$.

Two requirements for a valid PMF:

  • $P(X=x) \geq 0$ for all $x$
  • $\sum_x P(X=x) = 1$

You can read probabilities directly off a PMF. For instance, if $X$ is Binomial(3, 0.5), then $P(X=2) = \binom{3}{2}(0.5)^2(0.5)^1 = 0.375$.
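
Here's that same calculation done two ways, once from the PMF formula with Python's built-in binomial coefficient and once with SciPy (a minimal check, nothing beyond the formula above):

```python
from math import comb

from scipy.stats import binom

n, p, k = 3, 0.5, 2
print(comb(n, k) * p**k * (1 - p) ** (n - k))  # 0.375, straight from the PMF formula
print(binom.pmf(k, n, p))                      # 0.375, via scipy.stats
```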

Probability density functions

A probability density function (PDF) applies to continuous random variables. Unlike a PMF, the PDF value at a single point is not a probability. Instead, you integrate the PDF over an interval to get the probability of falling in that interval:

$$P(a \leq X \leq b) = \int_a^b f(x)\, dx$$

Two requirements for a valid PDF:

  • $f(x) \geq 0$ for all $x$
  • $\int_{-\infty}^{\infty} f(x)\, dx = 1$

A common mistake is interpreting $f(x)$ as a probability. It's a density, so it can actually exceed 1 at some points, as long as the total area under the curve equals 1.
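
Both points are easy to see concretely. The Uniform(0, 0.5) density equals 2 everywhere on its support yet integrates to 1, and integrating a normal density over an interval recovers a probability. A minimal SciPy sketch (the interval $[-1, 1]$ is illustrative):

```python
from scipy.integrate import quad
from scipy.stats import norm, uniform

# Uniform(0, 0.5): the density is 2 on [0, 0.5] -- a density above 1 is fine
print(uniform.pdf(0.25, loc=0, scale=0.5))  # 2.0

# P(-1 <= X <= 1) for a standard normal, by integrating the PDF
area, _ = quad(norm.pdf, -1, 1)
print(area)                                 # ≈ 0.6827
```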

Cumulative distribution functions

The cumulative distribution function (CDF) works for both discrete and continuous variables. It gives the probability that the variable is less than or equal to a given value:

$$F(x) = P(X \leq x)$$

Key properties:

  • Non-decreasing: if $a < b$, then $F(a) \leq F(b)$
  • $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$
  • For continuous variables, the CDF is a continuous curve; for discrete variables, it's a step function

The CDF is especially handy for computing interval probabilities: $P(a < X \leq b) = F(b) - F(a)$.
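
A one-line check of that identity with SciPy (the standard normal and endpoints ±1 are illustrative); it should match the integration result from the PDF section:

```python
from scipy.stats import norm

a, b = -1.0, 1.0
# P(a < X <= b) = F(b) - F(a) for a standard normal
print(norm.cdf(b) - norm.cdf(a))  # ≈ 0.6827
```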

Joint distributions

Joint distributions describe the simultaneous behavior of two or more random variables. For discrete variables $X$ and $Y$, the joint PMF is $P(X=x, Y=y)$. For continuous variables, the joint PDF is $f(x, y)$.

Joint distributions are essential in causal inference because you're almost always dealing with multiple variables at once (treatment, outcome, covariates). From a joint distribution, you can derive conditional and marginal distributions, which are the tools for reasoning about how variables relate to each other.

Conditional distributions

A conditional distribution describes the distribution of one variable given a known value of another. This is central to causal inference, where you often want to know the distribution of an outcome given a particular treatment.

  • Discrete case: $P(Y=y \mid X=x) = \frac{P(X=x, Y=y)}{P(X=x)}$
  • Continuous case: $f(y \mid x) = \frac{f(x, y)}{f(x)}$

The denominator must be nonzero (you can only condition on events that have positive probability or density). Conditional distributions are what connect observed associations to potential causal relationships, though moving from association to causation requires additional assumptions.

Marginal distributions

A marginal distribution gives the distribution of a single variable, ignoring (or "marginalizing out") the others.

  • Discrete case: $P(X=x) = \sum_y P(X=x, Y=y)$
  • Continuous case: $f(x) = \int_{-\infty}^{\infty} f(x, y)\, dy$

You recover the marginal by summing or integrating the joint distribution over all values of the other variable. In causal inference, comparing marginal distributions of outcomes across treatment groups is one way to assess average treatment effects.
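
For discrete variables, both operations reduce to row and column arithmetic on a table. A sketch using a hypothetical 2×2 joint PMF of a binary treatment X and binary outcome Y (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical joint PMF: rows index X (treatment 0/1), columns index Y (outcome 0/1)
joint = np.array([[0.30, 0.20],
                  [0.15, 0.35]])  # entries sum to 1

marginal_x = joint.sum(axis=1)    # P(X = x): marginalize out Y
marginal_y = joint.sum(axis=0)    # P(Y = y): marginalize out X

# Conditional P(Y = y | X = x): divide each row of the joint by its marginal
cond_y_given_x = joint / marginal_x[:, None]

print(marginal_x)         # [0.5 0.5]
print(cond_y_given_x[1])  # [P(Y=0 | X=1), P(Y=1 | X=1)] = [0.3 0.7]
```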

Properties of distributions

These summary quantities let you describe and compare distributions without specifying the full PMF or PDF. In causal inference, they're used to quantify treatment effects, measure precision of estimates, and assess relationships between variables.

Expected value

The expected value (mean) measures the center of a distribution. It's the long-run average you'd observe if you could repeat the random process infinitely many times.

  • Discrete: $E[X] = \sum_x x\, P(X=x)$
  • Continuous: $E[X] = \int_{-\infty}^{\infty} x\, f(x)\, dx$

The most important property of expected value is linearity: for any random variables $X$ and $Y$ and constants $a$ and $b$,

$$E[aX + bY] = aE[X] + bE[Y]$$

This holds whether or not $X$ and $Y$ are independent. Linearity is used constantly in deriving estimators for causal effects.
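
A simulation makes the "no independence needed" point concrete. In the sketch below (the dependence is made up), Y is constructed directly from X, yet the sample average of $aX + bY$ still lands on $aE[X] + bE[Y]$:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.normal(loc=2.0, scale=1.0, size=200_000)  # E[X] = 2
y = 0.5 * x + rng.normal(size=200_000)            # depends on x; E[Y] = 1

a, b = 3.0, -2.0
print(np.mean(a * x + b * y))  # ≈ 4.0
print(a * 2.0 + b * 1.0)       # a*E[X] + b*E[Y] = 3*2 - 2*1 = 4.0
```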

Variance and standard deviation

Variance measures how spread out a distribution is around its mean:

$$Var(X) = E[(X - E[X])^2]$$

A useful computational shortcut: $Var(X) = E[X^2] - (E[X])^2$.

The standard deviation is $\sqrt{Var(X)}$, which has the same units as $X$ and is often easier to interpret.

Unlike expectation, variance is not linear. For a constant $a$: $Var(aX) = a^2 Var(X)$. For independent variables: $Var(X + Y) = Var(X) + Var(Y)$. If they're not independent, you need the covariance term: $Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)$.

In causal inference, variance quantifies the precision of your treatment effect estimates and drives the width of confidence intervals.
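
The sketch below (with a made-up dependence between X and Y) verifies the version with the covariance term; dropping that term would visibly miss:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.normal(size=200_000)
y = x + rng.normal(size=200_000)  # correlated with x by construction

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, ddof=0)[0, 1]
print(lhs, rhs)  # equal; Var(X) + Var(Y) alone would understate the spread
```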

Covariance and correlation

Covariance measures the linear association between two random variables:

$$Cov(X, Y) = E[(X - E[X])(Y - E[Y])]$$

Positive covariance means the variables tend to move together; negative means they tend to move in opposite directions; zero means no linear relationship (but there could still be a nonlinear one).

Correlation standardizes covariance to the range $[-1, 1]$:

$$Corr(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X)\,Var(Y)}}$$

A correlation of $\pm 1$ means a perfect linear relationship. In causal inference, covariance and correlation help identify potential confounders and assess relationships between variables, though correlation alone never establishes causation.
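
NumPy computes both quantities directly. A minimal sketch with a made-up linear relationship, where the theoretical values are $Cov = 2$ and $Corr = 2/\sqrt{5} \approx 0.894$:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(size=100_000)  # positive linear association plus noise

print(np.cov(x, y)[0, 1])       # covariance ≈ 2.0
print(np.corrcoef(x, y)[0, 1])  # correlation ≈ 0.894
```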

Moment generating functions

A moment generating function (MGF) uniquely characterizes a distribution (when it exists). For a random variable $X$:

$$M_X(t) = E[e^{tX}]$$

The name comes from the fact that you can extract moments by differentiation:

  • $E[X] = M_X'(0)$ (first derivative at $t=0$)
  • $E[X^2] = M_X''(0)$ (second derivative at $t=0$)

MGFs are particularly useful for proving that sums of independent random variables follow specific distributions, since the MGF of a sum equals the product of the individual MGFs.
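
The differentiation can be done symbolically. A SymPy sketch for a Bernoulli(p) variable, using the standard result that its MGF is $M(t) = (1-p) + p e^t$:

```python
import sympy as sp

t, p = sp.symbols('t p')
M = (1 - p) + p * sp.exp(t)        # MGF of a Bernoulli(p) random variable

EX = sp.diff(M, t).subs(t, 0)      # first moment: M'(0)
EX2 = sp.diff(M, t, 2).subs(t, 0)  # second moment: M''(0)

print(EX)                        # p
print(sp.simplify(EX2 - EX**2))  # variance: p - p**2 = p(1 - p)
```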

Characteristic functions

A characteristic function (CF) serves a similar role to the MGF but always exists:

$$\phi_X(t) = E[e^{itX}]$$

where $i$ is the imaginary unit. The MGF may not exist for distributions with heavy tails (like the Cauchy distribution), but the CF always does. CFs are used in more advanced theoretical work, such as proving convergence results for sequences of random variables.

Common distributions

These named distributions appear repeatedly across statistics and causal inference. Each one models a particular type of data-generating process.

Uniform distribution

The uniform distribution assigns equal density to every value in an interval $[a, b]$, so subintervals of equal length are equally likely.

  • PDF: $f(x) = \frac{1}{b-a}$ for $a \leq x \leq b$, and 0 otherwise
  • Expected value: $\frac{a+b}{2}$
  • Variance: $\frac{(b-a)^2}{12}$

In causal inference, uniform distributions naturally model completely random treatment assignment, where every unit has the same probability of being assigned to each group.
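
One way this plays out in practice: draw a Uniform(0, 1) number per subject and compare it to the assignment probability. A minimal sketch (the probability 0.5 and the tiny sample are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
u = rng.uniform(size=10)         # one Uniform(0, 1) draw per subject
treated = (u < 0.5).astype(int)  # treat with probability 0.5

print(treated)  # a vector of 0s and 1s -- a Bernoulli(0.5) assignment
```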

Exponential distribution

The exponential distribution models waiting times between events in a Poisson process.

  • PDF: $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$
  • Expected value: $\frac{1}{\lambda}$
  • Variance: $\frac{1}{\lambda^2}$

The exponential distribution has the memoryless property: $P(X > s + t \mid X > s) = P(X > t)$. The probability of waiting another $t$ units doesn't depend on how long you've already waited. This is the only continuous distribution with this property.
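
A quick empirical check of memorylessness (the rate $\lambda = 1$ and thresholds $s = 1.5$, $t = 2.0$ are illustrative); both estimates should sit near $e^{-2} \approx 0.135$:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.exponential(scale=1.0, size=1_000_000)  # rate λ = 1
s, t = 1.5, 2.0

cond = (x[x > s] > s + t).mean()  # estimate of P(X > s + t | X > s)
uncond = (x > t).mean()           # estimate of P(X > t)

print(cond, uncond)  # both ≈ 0.135
```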

Gamma distribution

The gamma distribution generalizes the exponential. While the exponential models the time until the first event, the gamma models the time until the $k$-th event in a Poisson process.

  • PDF: $f(x) = \frac{\lambda^k x^{k-1} e^{-\lambda x}}{\Gamma(k)}$ for $x \geq 0$, where $\Gamma(\cdot)$ is the gamma function
  • Expected value: $\frac{k}{\lambda}$
  • Variance: $\frac{k}{\lambda^2}$

When $k = 1$, the gamma reduces to the exponential. The gamma is useful for modeling positive, right-skewed continuous variables.

Beta distribution

The beta distribution is defined on $[0, 1]$, making it natural for modeling probabilities or proportions.

  • PDF: $f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$ for $0 \leq x \leq 1$, where $B(\cdot, \cdot)$ is the beta function
  • Expected value: $\frac{\alpha}{\alpha + \beta}$
  • Variance: $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$

The beta is extremely flexible: depending on $\alpha$ and $\beta$, it can be uniform ($\alpha = \beta = 1$), U-shaped, skewed left, skewed right, or symmetric and peaked. It's also the conjugate prior for Bernoulli and binomial likelihoods, which makes it central to Bayesian approaches in causal inference.

Chi-squared distribution

The chi-squared distribution with $k$ degrees of freedom is the distribution of the sum of $k$ independent squared standard normal variables: if $Z_1, \ldots, Z_k \sim N(0,1)$ independently, then $\sum_{i=1}^k Z_i^2 \sim \chi^2_k$.

  • Expected value: $k$
  • Variance: $2k$

Chi-squared distributions are used in hypothesis testing (goodness-of-fit tests, tests of independence) and in constructing confidence intervals for variances.
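
The defining construction is easy to replicate by simulation; a sketch with $k = 4$ (an illustrative choice) confirming the stated mean and variance:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
k = 4
z = rng.normal(size=(200_000, k))  # k independent standard normals per row
chi2 = (z**2).sum(axis=1)          # row sums of squares: chi-squared with k df

print(chi2.mean(), chi2.var())     # ≈ k = 4 and 2k = 8
```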

Student's t-distribution

The t-distribution with $\nu$ degrees of freedom arises when you estimate the mean of a normal population using the sample standard deviation instead of the true standard deviation.

  • It's symmetric and bell-shaped like the normal, but has heavier tails, meaning extreme values are more likely
  • As $\nu \to \infty$, the t-distribution converges to the standard normal
  • At small sample sizes (say $\nu < 30$), the heavier tails matter and produce wider confidence intervals than you'd get using a normal distribution

The t-distribution is the workhorse for hypothesis tests and confidence intervals about means when the population variance is unknown, which is almost always the case in practice.
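
Comparing 97.5% quantiles makes the heavier tails visible (the 95% confidence level is an illustrative choice); the t critical value shrinks toward the normal's 1.96 as the degrees of freedom grow:

```python
from scipy.stats import norm, t

for df in (5, 30, 1000):
    print(df, t.ppf(0.975, df))  # 2.571, 2.042, 1.962
print(norm.ppf(0.975))           # 1.960, the normal benchmark
```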

F-distribution

The F-distribution with $d_1$ and $d_2$ degrees of freedom is the distribution of the ratio of two independent chi-squared variables, each divided by their degrees of freedom.

  • It's defined only for positive values and is right-skewed
  • Used primarily in ANOVA (comparing means across multiple groups) and in F-tests for comparing variances

In causal inference, F-tests come up when testing whether treatment effects differ across multiple treatment arms simultaneously.

Transformations of random variables

Transformations create new random variables by applying functions to existing ones. In causal inference, you'll encounter transformations when computing propensity scores, inverse probability weights, log-transformed outcomes, and other derived quantities.

Linear transformations

A linear transformation takes the form $Y = aX + b$, where $a$ and $b$ are constants.

The effect on the distribution's summary statistics is straightforward:

  • $E[Y] = aE[X] + b$
  • $Var(Y) = a^2 Var(X)$

Notice that adding a constant $b$ shifts the mean but doesn't affect the variance. Multiplying by $a$ scales the mean by $a$ and the standard deviation by $|a|$, and scales the variance by $a^2$.

A common example: standardizing a variable by computing $Z = \frac{X - \mu}{\sigma}$. This is a linear transformation that produces a variable with mean 0 and variance 1.
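
A NumPy version of standardization (the blood-pressure-like parameters are made up), using sample estimates in place of the population $\mu$ and $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.normal(loc=120.0, scale=15.0, size=50_000)  # e.g., blood pressure readings

z = (x - x.mean()) / x.std()  # linear transformation: shift by -mean, scale by 1/sd

print(z.mean(), z.std())      # ≈ 0.0 and 1.0
```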

Functions of random variables

For a general (possibly nonlinear) transformation $Y = g(X)$, finding the distribution of $Y$ requires more work. Two standard approaches:

  1. CDF method: Find $F_Y(y) = P(Y \leq y) = P(g(X) \leq y)$, then differentiate to get the PDF.
  2. Change of variables formula: If $g$ is monotonic and differentiable, the PDF of $Y$ is:

$$f_Y(y) = f_X(g^{-1}(y)) \cdot \left|\frac{d}{dy} g^{-1}(y)\right|$$
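
As a concrete check of the change-of-variables formula, take $Y = e^X$ with $X$ standard normal, so $g^{-1}(y) = \log y$ and the Jacobian term is $1/y$; the result should match SciPy's lognormal density (the evaluation point $y = 2$ is illustrative):

```python
import numpy as np
from scipy.stats import lognorm, norm

y = 2.0
manual = norm.pdf(np.log(y)) / y  # f_X(g^{-1}(y)) * |d/dy g^{-1}(y)|
print(manual)                     # ≈ 0.157
print(lognorm.pdf(y, s=1.0))      # scipy's lognormal density agrees
```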

The expected value of $Y = g(X)$ can be computed without first finding the distribution of $Y$, using the law of the unconscious statistician (LOTUS):

  • Discrete: $E[g(X)] = \sum_x g(x)\, P(X=x)$
  • Continuous: $E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f(x)\, dx$

LOTUS is a practical shortcut you'll use frequently when computing expected values of transformed variables.
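
A final sketch applying LOTUS with $g(x) = x^2$ and $X$ standard normal: both numerical integration of $g(x) f(x)$ and plain simulation return $E[X^2] = 1$, with no need to derive the distribution of $X^2$ first:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

g = lambda x: x**2  # the transformation of interest

# LOTUS: E[g(X)] = integral of g(x) * f(x), without finding the law of g(X)
val, _ = quad(lambda x: g(x) * norm.pdf(x), -np.inf, np.inf)
print(val)  # 1.0

# Simulation agrees
rng = np.random.default_rng(seed=0)
print(g(rng.normal(size=500_000)).mean())  # ≈ 1.0
```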