Fiveable

📊Causal Inference Unit 1 Review


1.1 Probability theory


Written by the Fiveable Content Team • Last updated August 2025

Probability theory provides the mathematical foundation for causal inference. It gives you the tools to quantify uncertainty, model relationships between variables, and reason about cause and effect. This guide covers the core probability concepts you'll need: axioms, distributions, independence, Bayes' theorem, expectation and variance, common distributions, limit theorems, and how probability connects to causal reasoning.

Basics of probability

Probability quantifies how likely an event is to occur, on a scale from 0 (impossible) to 1 (certain). In causal inference, you'll use probability constantly to express uncertainty about whether one thing actually caused another, and to assess how strong your evidence is for a causal claim.

Probability axioms

These are the foundational rules that all probabilities must follow:

  • Non-negativity: The probability of any event is at least zero: P(A) \geq 0
  • Normalization: The probabilities across the entire sample space sum to 1: P(S) = 1
  • Additivity: For mutually exclusive events A and B (they can't both happen), P(A \cup B) = P(A) + P(B)
  • Complement rule: An event and its complement (everything that isn't that event) always sum to 1: P(A) + P(A') = 1

The complement rule is derived from the first three axioms, but it's used so frequently that it's worth memorizing on its own.

Sample spaces and events

The sample space (S) is the set of all possible outcomes of a random experiment. For a coin toss, S = \{H, T\}. For rolling a die, S = \{1, 2, 3, 4, 5, 6\}.

An event is any subset of the sample space. Events can be:

  • Simple: a single outcome (rolling a 3)
  • Compound: a combination of outcomes (rolling an even number, which is \{2, 4, 6\})
  • Mutually exclusive: two events that cannot occur at the same time (rolling a 1 and rolling a 6 on a single roll)
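
These definitions are easy to check by direct enumeration. A minimal sketch using a fair die as the sample space (the `prob` helper and the uniform-probability assumption are ours for illustration):

```python
from fractions import Fraction

# Sample space for one roll of a fair die; all outcomes equally likely.
S = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(event) under the uniform distribution on S."""
    return Fraction(len(event & S), len(S))

even = {2, 4, 6}          # compound event: roll an even number
p_even = prob(even)       # 3/6 = 1/2
p_odd = prob(S - even)    # complement rule: P(A') = 1 - P(A)
```

Note how the axioms fall out of counting: probabilities are non-negative, the whole space has probability 1, and an event plus its complement sums to 1.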

Conditional probability

Conditional probability P(A|B) is the probability of event A occurring given that event B has already occurred. It's calculated as:

P(A|B) = \frac{P(A \cap B)}{P(B)}

where P(A \cap B) is the probability that both A and B occur, and P(B) > 0.

This formula captures a simple idea: once you know B happened, you restrict your attention to only those outcomes where B is true, then ask how often A also occurs within that restricted set. Conditional probability is central to causal inference because it lets you update your beliefs as new information arrives and reason about how variables depend on each other.
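
The restrict-then-count idea translates directly into code. A small sketch with two fair dice (the `cond_prob` helper is an illustrative name, not a library function):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (first die, second die).
space = list(product(range(1, 7), repeat=2))

def cond_prob(A, B):
    """P(A|B): restrict to outcomes where B holds, then count A among them."""
    given = [w for w in space if B(w)]
    return Fraction(sum(1 for w in given if A(w)), len(given))

# P(sum = 8 | first die shows 3): among the six outcomes with first die 3,
# only (3, 5) has sum 8.
p = cond_prob(lambda w: w[0] + w[1] == 8, lambda w: w[0] == 3)
```

Conditioning changes the answer: unconditionally, P(sum = 8) is 5/36, but knowing the first die shows 3 raises it to 1/6.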

Probability distributions

A probability distribution is a function that assigns probabilities to the possible values of a random variable. Distributions are how you formally model uncertainty and variability in data.

Discrete probability distributions

Discrete random variables take on a countable number of values (e.g., the number of defective items in a batch, or the number of heads in 10 coin flips).

A probability mass function (PMF) assigns a probability to each possible value. For any valid PMF, all probabilities are non-negative and they sum to 1.

Common discrete distributions include the Bernoulli, binomial, and Poisson (covered in detail below).

Continuous probability distributions

Continuous random variables can take any value within a range (e.g., height, weight, temperature).

A probability density function (PDF) describes the relative likelihood of different values. Unlike a PMF, the PDF value at a single point is not a probability. Instead, you calculate probabilities by integrating the PDF over an interval:

P(a \leq X \leq b) = \int_a^b f(x) \, dx

Common continuous distributions include the normal, exponential, and uniform.

Joint probability distributions

A joint probability distribution describes the probabilities of two or more random variables occurring together, denoted P(X, Y) for random variables X and Y.

Joint distributions let you model how variables relate to each other. From a joint distribution, you can derive both marginal and conditional distributions.

Marginal probability distributions

A marginal distribution gives the probability distribution of a single variable, ignoring the others. You obtain it by summing out (discrete case) or integrating out (continuous case) the other variables from the joint distribution.

For example, if you have P(X, Y), the marginal distribution of X is:

P(X = x) = \sum_y P(X = x, Y = y)

This is useful when you want to focus on one variable's behavior without worrying about the others.
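
Summing out a variable is a short loop over a joint table. A sketch with an illustrative joint PMF for binary X and Y (the numbers are made up; only the marginalization logic matters):

```python
# A toy joint PMF P(X, Y); the four probabilities sum to 1.
joint = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}

def marginal(joint, axis):
    """Sum out the other variable: axis=0 gives P(X), axis=1 gives P(Y)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

px = marginal(joint, 0)   # P(X): {0: 0.30, 1: 0.70}
py = marginal(joint, 1)   # P(Y): {0: 0.40, 1: 0.60}
```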

Independence and dependence

These concepts describe whether knowing about one event tells you anything about another. Getting this right is critical for causal inference, because confusing independence with dependence (or vice versa) leads to incorrect causal conclusions.

Independent events

Events A and B are independent if the occurrence of one doesn't change the probability of the other. Formally:

P(A|B) = P(A) \quad \text{and} \quad P(B|A) = P(B)

An equivalent condition: P(A \cap B) = P(A) \times P(B)

Example: Flipping a fair coin twice. The outcome of the second flip is independent of the first. Knowing you got heads on flip 1 tells you nothing about flip 2.

Dependent events

Events A and B are dependent if the occurrence of one does change the probability of the other:

P(A|B) \neq P(A)

Example: Drawing cards from a deck without replacement. If you draw an ace first, the probability of drawing another ace on the second draw drops from \frac{4}{52} to \frac{3}{51}.


Conditional independence

Events A and B are conditionally independent given event C if:

P(A|B, C) = P(A|C) \quad \text{and} \quad P(B|A, C) = P(B|C)

This means that once you know C, learning about A gives you no additional information about B (and vice versa). Two variables can be dependent overall but become independent once you condition on a third variable.

Conditional independence is especially important in causal inference. It's the key concept behind adjusting for confounders: if you condition on the right set of variables, treatment assignment becomes independent of the potential outcomes, so comparisons made within levels of those variables are free of confounding.
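
A tiny discrete common-cause model makes this concrete: A and B are dependent overall but independent once you condition on C. All probabilities below are illustrative choices, not from the text:

```python
# Common-cause model: C is a fair coin; given C, A and B are independent
# coin flips whose bias depends on C.
pC = {0: 0.5, 1: 0.5}
pA_given_C = {0: 0.2, 1: 0.8}   # P(A=1 | C)
pB_given_C = {0: 0.3, 1: 0.9}   # P(B=1 | C)

def p(a, b, c):
    """Joint P(A=a, B=b, C=c) under the model above."""
    pa = pA_given_C[c] if a else 1 - pA_given_C[c]
    pb = pB_given_C[c] if b else 1 - pB_given_C[c]
    return pC[c] * pa * pb

# Marginally, P(A=1, B=1) != P(A=1) * P(B=1): A and B are dependent.
p_ab = sum(p(1, 1, c) for c in (0, 1))
p_a = sum(p(1, b, c) for b in (0, 1) for c in (0, 1))
p_b = sum(p(a, 1, c) for a in (0, 1) for c in (0, 1))

# Conditioning on C = 1 restores independence (true by construction).
p_ab_c1 = p(1, 1, 1) / pC[1]
p_a_c1 = sum(p(1, b, 1) for b in (0, 1)) / pC[1]
p_b_c1 = sum(p(a, 1, 1) for a in (0, 1)) / pC[1]
```

Here P(A=1, B=1) = 0.39 while P(A=1)P(B=1) = 0.5 × 0.6 = 0.30, so A and B are dependent; within C = 1 the product rule holds exactly.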

Bayes' theorem

Bayes' theorem tells you how to update your beliefs about an event after observing new evidence. It's one of the most important results in probability for causal inference.

Bayes' rule

P(A|B) = \frac{P(B|A) \, P(A)}{P(B)}

Each piece has a name:

  • P(A): the prior (your belief about A before seeing evidence B)
  • P(B|A): the likelihood (how probable the evidence is if A is true)
  • P(B): the marginal likelihood or evidence (total probability of observing B)
  • P(A|B): the posterior (your updated belief about A after seeing B)

Example: Suppose a disease affects 1% of the population (P(D) = 0.01). A test has a 95% true positive rate (P(+|D) = 0.95) and a 5% false positive rate (P(+|\neg D) = 0.05). If someone tests positive, the probability they actually have the disease is:

P(D|+) = \frac{0.95 \times 0.01}{(0.95 \times 0.01) + (0.05 \times 0.99)} = \frac{0.0095}{0.059} \approx 0.161

Only about 16%. This result surprises many people, but it makes sense: when the disease is rare, most positive tests come from the large pool of healthy people.
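
The calculation is a direct application of the formula. A sketch (the `posterior` helper is our own naming, not a library function):

```python
def posterior(prior, tpr, fpr):
    """P(D | +) via Bayes' rule: likelihood * prior / evidence."""
    evidence = tpr * prior + fpr * (1 - prior)   # P(+), by total probability
    return tpr * prior / evidence

# The disease-testing example: 1% prevalence, 95% TPR, 5% FPR.
p = posterior(prior=0.01, tpr=0.95, fpr=0.05)   # ~0.161
```

Try raising the prior to 0.10: the posterior jumps to about 0.68, which shows how strongly base rates drive the answer.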

Prior vs posterior probabilities

The prior probability P(A) reflects what you know before observing data. The posterior probability P(A|B) reflects what you know after. Bayes' rule is the bridge between them.

The prior can come from previous studies, domain knowledge, or population-level data. As you observe more evidence, the posterior gets increasingly driven by the data and less by the prior.

Bayesian inference

Bayesian inference applies Bayes' theorem systematically: you specify a prior distribution over parameters of interest, observe data, and compute a posterior distribution.

This approach naturally incorporates prior knowledge into your analysis and provides a full distribution over parameter values (not just a point estimate). In causal inference, Bayesian methods are used for estimating causal effects, handling missing data, and conducting sensitivity analyses to see how results change under different assumptions.
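
The simplest worked case is Beta-Binomial updating, where the posterior has a closed form: a Beta(a, b) prior on a success probability, plus k observed successes and m failures, gives a Beta(a + k, b + m) posterior. A sketch under that standard conjugacy result (the treatment-success framing is our illustration):

```python
def update_beta(a, b, successes, failures):
    """Beta(a, b) prior + Binomial data -> Beta(a + k, b + m) posterior."""
    return a + successes, b + failures

a, b = 1, 1                        # Beta(1, 1): a uniform prior
a, b = update_beta(a, b, 7, 3)     # observe 7 successes, 3 failures
posterior_mean = a / (a + b)       # mean of Beta(8, 4) = 8/12
```

The posterior mean 8/12 sits between the prior mean (1/2) and the raw data rate (7/10), and with more data it would be pulled ever closer to the data.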

Expectation and variance

These two quantities summarize the center and spread of a probability distribution. You'll use them constantly when defining and estimating causal effects.

Expected value

The expected value (mean) of a random variable X, written E(X), is its long-run average if you could repeat the experiment infinitely many times.

  • Discrete: E(X) = \sum_{x} x \cdot P(X = x)
  • Continuous: E(X) = \int_{-\infty}^{\infty} x \cdot f(x) \, dx

For example, the expected value of a fair six-sided die is E(X) = \frac{1+2+3+4+5+6}{6} = 3.5. You'll never actually roll a 3.5, but it's the average over many rolls.
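
The discrete formula is a short sum. A sketch using exact fractions for the die example:

```python
from fractions import Fraction

def expectation(pmf):
    """E(X) = sum of x * P(X = x) over the support."""
    return sum(x * p for x, p in pmf.items())

die = {x: Fraction(1, 6) for x in range(1, 7)}   # fair six-sided die
mean = expectation(die)                           # 7/2 = 3.5
```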

Variance and standard deviation

Variance measures how spread out a distribution is around its mean:

Var(X) = E[(X - E(X))^2]

A useful computational shortcut: Var(X) = E[X^2] - (E[X])^2

Standard deviation \sigma = \sqrt{Var(X)} is in the same units as X, which makes it easier to interpret than variance. A small standard deviation means values cluster tightly around the mean; a large one means they're spread out.
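
Both the definition and the shortcut are easy to compute for the die example, and they agree exactly:

```python
from fractions import Fraction

die = {x: Fraction(1, 6) for x in range(1, 7)}   # fair six-sided die

def E(pmf, f=lambda x: x):
    """E[f(X)] for a discrete PMF."""
    return sum(f(x) * p for x, p in pmf.items())

mean = E(die)                                     # 7/2
var_def = E(die, lambda x: (x - mean) ** 2)       # E[(X - E(X))^2]
var_short = E(die, lambda x: x * x) - mean ** 2   # E[X^2] - (E[X])^2
```

For a fair die both routes give Var(X) = 35/12, roughly 2.92, so σ ≈ 1.71.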

Covariance and correlation

Covariance measures how two random variables move together:

Cov(X, Y) = E[(X - E(X))(Y - E(Y))]

  • Positive covariance: X and Y tend to increase together
  • Negative covariance: when X increases, Y tends to decrease
  • Zero covariance: no linear relationship (but there could still be a nonlinear one)

Correlation standardizes covariance to a [-1, 1] scale:

\rho(X, Y) = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}

A correlation of +1 or -1 means a perfect linear relationship. A correlation of 0 means no linear relationship. A critical point for causal inference: correlation does not imply causation. Two variables can be highly correlated because of a shared confounder, not because one causes the other.
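
Pearson correlation can be computed directly from these definitions. A hand-rolled sketch (it uses population, not sample, standard deviations):

```python
import math

def correlation(xs, ys):
    """Pearson correlation: Cov(X, Y) / (sigma_X * sigma_Y), from paired data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

r_pos = correlation([1, 2, 3, 4], [2, 4, 6, 8])   # exactly linear: ~ +1
r_neg = correlation([1, 2, 3, 4], [8, 6, 4, 2])   # exactly linear: ~ -1
```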

Common probability distributions

Knowing these distributions lets you choose the right model for your data. Each one captures a different type of random process.

Bernoulli and binomial distributions

The Bernoulli distribution models a single trial with two outcomes: success (probability p) or failure (probability 1 - p).

P(X = 1) = p, \quad P(X = 0) = 1 - p

The binomial distribution extends this to n independent trials, counting the total number of successes:

P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}

For example, if a drug has a 70% success rate (p = 0.7) and you treat 10 patients (n = 10), the binomial distribution tells you the probability of exactly k patients improving.
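
The PMF is a one-liner with `math.comb`. A sketch for the drug example:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability that exactly 7 of 10 patients improve when p = 0.7.
p7 = binom_pmf(7, 10, 0.7)                              # ~0.267

# Sanity check that the PMF sums to 1 over k = 0..10.
total = sum(binom_pmf(k, 10, 0.7) for k in range(11))
```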


Poisson distribution

The Poisson distribution models the count of events occurring in a fixed interval of time or space, given a constant average rate \lambda:

P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}

It's commonly used for rare events: the number of hospital admissions per hour, the number of typos per page, or the number of mutations per gene. Its mean and variance are both equal to \lambda.
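
A sketch of the PMF, numerically confirming that the mean and variance both equal λ (the sums are truncated at k = 100, which is harmless for moderate λ):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) = e^(-lam) * lam^k / k!."""
    return exp(-lam) * lam ** k / factorial(k)

lam = 3.0
mean = sum(k * poisson_pmf(k, lam) for k in range(100))
ex2 = sum(k * k * poisson_pmf(k, lam) for k in range(100))
var = ex2 - mean ** 2   # should also be lam
```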

Normal distribution

The normal (Gaussian) distribution is the symmetric, bell-shaped distribution defined by its mean \mu and standard deviation \sigma:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}

About 68% of values fall within \pm 1\sigma of the mean, 95% within \pm 2\sigma, and 99.7% within \pm 3\sigma. The normal distribution shows up everywhere, partly because of the Central Limit Theorem (see below): averages of many independent random variables tend toward a normal distribution regardless of the original distribution.
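
The 68-95-99.7 rule can be verified from the normal CDF, which the standard library exposes via the error function:

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2), written in terms of erf."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

within_1sd = normal_cdf(1) - normal_cdf(-1)   # ~0.683
within_2sd = normal_cdf(2) - normal_cdf(-2)   # ~0.954
within_3sd = normal_cdf(3) - normal_cdf(-3)   # ~0.997
```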

Exponential distribution

The exponential distribution models the time between events in a Poisson process:

f(x) = \lambda e^{-\lambda x} \quad \text{for } x \geq 0

Its mean is \frac{1}{\lambda}. A distinctive property is memorylessness: the probability of waiting another t minutes is the same regardless of how long you've already waited. This makes it useful for modeling waiting times (time between customer arrivals, time until equipment failure) when there's no "aging" effect.
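
Memorylessness is easy to verify from the survival function P(X > t) = e^{-λt}. A quick check with illustrative values of s and t:

```python
from math import exp

def surv(t, lam):
    """P(X > t) for an Exponential(lam) random variable."""
    return exp(-lam * t)

lam, s, t = 0.5, 2.0, 3.0
# Memorylessness: P(X > s + t | X > s) equals P(X > t).
lhs = surv(s + t, lam) / surv(s, lam)
rhs = surv(t, lam)
```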

Limit theorems

Limit theorems describe what happens to sample statistics as you collect more and more data. They justify many of the statistical methods used in causal inference.

Law of large numbers

The law of large numbers (LLN) says that the sample average of i.i.d. (independent and identically distributed) random variables converges to the true expected value as the sample size grows.

  • Weak LLN: The sample mean converges in probability to E(X). Roughly, the probability that the sample mean is far from E(X) shrinks toward zero.
  • Strong LLN: The sample mean converges almost surely to E(X). This is a stronger guarantee that says the sample mean will eventually stay close to E(X) with probability 1.

The practical takeaway: larger samples give you more reliable estimates of population quantities.
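
A quick simulation shows the sample mean of die rolls settling near 3.5 as the sample grows (seeded for reproducibility; the specific values are illustrative):

```python
import random

random.seed(0)

def sample_mean(n):
    """Mean of n fair-die rolls; the LLN says this approaches 3.5."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

small = sample_mean(10)        # can easily be far from 3.5
large = sample_mean(100_000)   # reliably very close to 3.5
```

With 100,000 rolls the standard error of the mean is about 0.005, so the large-sample estimate lands within a few hundredths of 3.5.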

Central limit theorem

The Central Limit Theorem (CLT) states that the standardized sum (or average) of a large number of i.i.d. random variables approaches a standard normal distribution, regardless of the original distribution of those variables.

Formally, if X_1, X_2, \ldots, X_n are i.i.d. with mean \mu and variance \sigma^2, then:

\frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty

This is why normal-based methods (confidence intervals, z-tests, t-tests) work even when your data isn't normally distributed, as long as your sample size is large enough. In practice, n \geq 30 is a common rule of thumb, though the required size depends on how skewed the underlying distribution is.
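
A deterministic way to see the CLT at work: a Binomial(100, 0.5) variable is a sum of 100 i.i.d. Bernoulli trials, so after standardizing, its CDF should be close to the standard normal CDF. A sketch (the continuity correction of 0.5 is a standard refinement, not from the text):

```python
from math import comb, erf, sqrt

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_cdf(x):
    """Standard normal CDF via erf."""
    return 0.5 * (1 + erf(x / sqrt(2)))

n, p = 100, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))        # 50 and 5
exact = binom_cdf(55, n, p)                     # exact tail probability
approx = normal_cdf((55 + 0.5 - mu) / sigma)    # CLT with continuity correction
```

The two numbers agree to within about a tenth of a percent, even though the underlying variables are coin flips, not normals.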

Convergence in probability vs distribution

These are two different ways a sequence of random variables can "settle down" as nn grows:

  • Convergence in probability: X_n \xrightarrow{P} X means that for any \epsilon > 0, P(|X_n - X| > \epsilon) \to 0 as n \to \infty. The random variables get arbitrarily close to X with high probability.
  • Convergence in distribution: X_n \xrightarrow{d} X means the CDF of X_n approaches the CDF of X at all continuity points. The shape of the distribution stabilizes, but individual realizations might not get close to X.

Convergence in probability is the stronger condition. If X_n converges in probability to X, it also converges in distribution, but not vice versa. The LLN involves convergence in probability; the CLT involves convergence in distribution.

Probability in causal inference

This is where probability theory connects directly to the causal questions you'll spend the rest of the course on. The key idea: causal effects are defined using potential outcomes (also called counterfactuals), and probability is the language for expressing uncertainty about those outcomes.

Probability of causation

The probability of causation (PC) asks: given that the cause was present and the effect occurred, what's the probability the effect wouldn't have occurred without the cause?

PC = P(Y_0 = 0 \mid Y_1 = 1)

Here Y_1 is the outcome when the cause is present, and Y_0 is the counterfactual outcome when the cause is absent. PC quantifies how responsible a specific cause is for an observed effect. This concept appears in legal and medical contexts (e.g., "Did this exposure cause the plaintiff's illness?").

Probability of necessity and sufficiency

These refine the idea of causation into two distinct questions:

  • Probability of necessity (PN): Among cases where the effect occurred, how likely is it that removing the cause would have prevented the effect?

PN = P(Y_0 = 0 \mid Y = 1)

  • Probability of sufficiency (PS): Among cases where the effect did not occur, how likely is it that introducing the cause would have produced the effect?

PS = P(Y_1 = 1 \mid Y = 0)

High PN means the cause is necessary for the effect (without it, the effect likely wouldn't happen). High PS means the cause is sufficient (with it, the effect likely would happen). A cause can be necessary without being sufficient, or vice versa.

Probability and counterfactuals

Counterfactuals are hypothetical scenarios: "What would have happened if things had been different?" In causal inference, you define causal effects by comparing potential outcomes under different conditions.

The average causal effect (ACE) is defined as:

ACE = E[Y_1] - E[Y_0]

where Y_1 is the potential outcome under treatment and Y_0 is the potential outcome under control. The fundamental problem is that you can never observe both Y_1 and Y_0 for the same individual. Probability provides the framework for estimating these quantities from data, using assumptions about independence, conditional independence, and distributional properties to bridge the gap between what you observe and what you want to know.
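
A simulation makes the bridge concrete: when treatment is randomized, so that it's independent of the potential outcomes, the difference in observed group means recovers the ACE, which is known here by construction. All numbers below are illustrative:

```python
import random

random.seed(1)

# Simulate potential outcomes with a known constant effect: Y1 = Y0 + 2,
# so the true ACE is exactly 2 by construction.
n = 50_000
y0 = [random.gauss(0, 1) for _ in range(n)]
y1 = [y + 2.0 for y in y0]

# Coin-flip treatment assignment: independent of (Y0, Y1).
t = [random.random() < 0.5 for _ in range(n)]

# We only ever observe Y1 for the treated and Y0 for the controls,
# yet the difference in means estimates E[Y1] - E[Y0].
treated = [y1[i] for i in range(n) if t[i]]
control = [y0[i] for i in range(n) if not t[i]]
ace_hat = sum(treated) / len(treated) - sum(control) / len(control)
```

The estimate lands close to the true value of 2 precisely because randomization guarantees the independence assumption; with confounded treatment assignment, the same difference in means would be biased.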