Definition of conditional probability
Conditional probability measures the likelihood of an event occurring given that another event has already happened. Instead of asking "what's the chance of A?", you're asking "what's the chance of A now that I know B happened?" This distinction matters because new information often changes the odds.
Notation for conditional probability
The notation P(A | B) reads as "the probability of A given B." That vertical bar means "given that" or "conditional on."
The formula:

P(A | B) = P(A ∩ B) / P(B)

where P(A ∩ B) is the joint probability of both A and B occurring, and P(B) is the probability of the event you already know happened. Note that P(B) must be greater than zero (you can't condition on an impossible event).
Quick example: Suppose 30% of students play sports, 20% play sports and are on the honor roll, and you want to know the probability a student is on the honor roll given they play sports.

P(honor roll | sports) = P(honor roll ∩ sports) / P(sports) = 0.20 / 0.30 ≈ 0.667

So about 66.7% of student-athletes are on the honor roll.
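The student example works out to a one-line calculation:

```python
# Conditional probability from the student example:
# P(honor roll | sports) = P(honor roll and sports) / P(sports)
p_sports = 0.30            # P(plays sports)
p_honor_and_sports = 0.20  # P(on honor roll and plays sports)

p_honor_given_sports = p_honor_and_sports / p_sports
print(round(p_honor_given_sports, 3))  # 0.667
```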
Difference from joint probability
These two get confused a lot, so keep them straight:
- Joint probability P(A ∩ B): the chance that both A and B happen together
- Conditional probability P(A | B): the chance that A happens assuming B already did
They're related by this rearrangement of the conditional probability formula:

P(A ∩ B) = P(A | B) × P(B)
Joint probability looks at the full sample space. Conditional probability shrinks the sample space down to only the cases where B occurred, then asks how often A shows up within that smaller group.
Fundamental concepts
Dependence vs. independence
Two events are independent if knowing one happened tells you nothing new about the other. Mathematically, for independent events:

P(A | B) = P(A), P(B | A) = P(B), and P(A ∩ B) = P(A) × P(B)
If these equalities don't hold, the events are dependent, meaning one event's occurrence changes the probability of the other.
For example, drawing two cards from a deck without replacement creates dependent events: the first draw changes what's left in the deck. Drawing with replacement keeps them independent.
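A quick check of the card example, using exact fractions rather than simulation:

```python
from fractions import Fraction

# Without replacement: probability the second card is an ace,
# given the first card drawn was an ace.
p_second_ace_given_first_ace = Fraction(3, 51)  # 3 aces left among 51 cards
p_any_ace = Fraction(4, 52)                     # unconditional ace probability

print(p_second_ace_given_first_ace)  # 1/17
print(p_any_ace)                     # 1/13

# The two differ, so the draws are dependent. With replacement the
# deck is restored, so the conditional probability equals the
# marginal one and the draws are independent.
```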
Multiplication rule of probability
The multiplication rule lets you find the probability of multiple events all happening:

P(A ∩ B) = P(A | B) × P(B)

This generalizes to chains of events:

P(A₁ ∩ A₂ ∩ … ∩ Aₙ) = P(A₁) × P(A₂ | A₁) × P(A₃ | A₁ ∩ A₂) × … × P(Aₙ | A₁ ∩ … ∩ Aₙ₋₁)

When events are independent, it simplifies because the conditional probabilities just become regular probabilities:

P(A ∩ B) = P(A) × P(B)
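As an illustration of the chain rule (my own example, not from the text above): the chance of drawing three hearts in a row without replacement:

```python
from fractions import Fraction

# P(H1 and H2 and H3) = P(H1) * P(H2 | H1) * P(H3 | H1 and H2)
p = Fraction(13, 52) * Fraction(12, 51) * Fraction(11, 50)
print(p)         # 11/850
print(float(p))  # ≈ 0.0129
```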
Bayes' theorem
Bayes' theorem lets you "flip" a conditional probability. If you know P(B | A) but need P(A | B), Bayes' theorem gets you there.
Formula and components

P(A | B) = P(B | A) × P(A) / P(B)

Each piece has a name:

- P(A) = the prior: your initial belief about A before seeing any new evidence
- P(B | A) = the likelihood: how probable the evidence B is if A were true
- P(B) = the marginal likelihood: the overall probability of observing B (this acts as a normalizing constant)
- P(A | B) = the posterior: your updated belief about A after accounting for B
Applications of Bayes' theorem
Medical diagnosis is the classic example. Suppose a disease affects 1% of the population, and a test is 95% accurate (both for true positives and true negatives). If you test positive, what's the actual probability you have the disease?
- Prior: P(disease) = 0.01
- Likelihood: P(positive | disease) = 0.95
- Marginal likelihood: P(positive) = (0.95 × 0.01) + (0.05 × 0.99) = 0.0095 + 0.0495 = 0.059
- Posterior: P(disease | positive) = 0.0095 / 0.059 ≈ 0.161
Only about 16.1%. That surprises most people, and it's exactly why understanding conditional probability matters.
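The calculation can be reproduced directly in code:

```python
# Medical-test example: 1% prevalence, 95% sensitivity and specificity.
p_disease = 0.01
p_pos_given_disease = 0.95  # sensitivity (true positive rate)
p_pos_given_healthy = 0.05  # 1 - specificity (false positive rate)

# Marginal likelihood via the law of total probability.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior via Bayes' theorem.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # 0.161
```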
Other applications include spam filtering (is this email spam given the words it contains?), forensic science (probability of guilt given DNA evidence), and machine learning classifiers.
Law of total probability
The law of total probability lets you calculate P(A) when you don't know it directly but do know how A behaves under different scenarios.
Formula and explanation
If B₁, B₂, …, Bₙ are mutually exclusive events that cover all possibilities (they're exhaustive), then:

P(A) = P(A | B₁) × P(B₁) + P(A | B₂) × P(B₂) + … + P(A | Bₙ) × P(Bₙ)
You're breaking A into pieces based on which scenario occurs, calculating the probability of A in each scenario, and adding them up weighted by how likely each scenario is.
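A small sketch with made-up numbers: suppose 60% of parts come from factory 1 (2% defective) and 40% from factory 2 (5% defective). The overall defect rate is the weighted sum:

```python
# Law of total probability with hypothetical factory data.
# Each scenario: (P(scenario), P(defective | scenario)).
scenarios = [
    (0.60, 0.02),  # factory 1
    (0.40, 0.05),  # factory 2
]

p_defective = sum(p_b * p_a_given_b for p_b, p_a_given_b in scenarios)
print(round(p_defective, 3))  # 0.032
```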
This is exactly what we used in the medical diagnosis example above to find P(positive).
Connection to decision trees
Decision trees (or probability trees) are a visual way to apply the law of total probability:
- Each branch from a node represents one of the mutually exclusive scenarios
- You multiply probabilities along a path to get joint probabilities
- You add across paths that lead to the same final outcome to get the marginal probability
These trees are especially helpful when a problem has multiple stages and you need to track how probabilities combine.
Conditional probability in the real world
Medical diagnosis examples
Medical testing is where conditional probability shows up most vividly. Doctors routinely deal with questions like:
- Given a positive screening result, what's the actual chance the patient has the disease? (This depends heavily on how common the disease is.)
- How does the probability of a diagnosis change as new test results and symptoms come in?
- Is a treatment effective for patients with specific characteristics?
The key insight: a "95% accurate" test does not mean a positive result gives you a 95% chance of being sick. The base rate of the disease matters enormously.
Legal applications
In courtrooms, conditional probability helps evaluate evidence:
- How strong is DNA evidence? (What's the probability of a match given innocence vs. guilt?)
- How should multiple independent pieces of evidence be combined?
- How reliable is eyewitness testimony given known error rates?
Getting these calculations wrong has real consequences, which is why the fallacies in the next section are so important.
Common misconceptions
Base rate fallacy
The base rate fallacy happens when you ignore or underweight the prior probability (base rate) of an event. In the medical example above, people hear "95% accurate test" and jump to thinking a positive result means 95% chance of disease. They forget that the disease only affects 1% of people, which dramatically lowers the posterior probability.
To avoid this: always ask "how common is this event in the first place?" before interpreting conditional evidence.
Prosecutor's fallacy
This fallacy confuses two very different probabilities:
- P(evidence | innocent): the chance of seeing this evidence if the person is innocent
- P(innocent | evidence): the chance the person is innocent given the evidence
A prosecutor might say "there's only a 1-in-a-million chance an innocent person would match this DNA," implying the defendant is almost certainly guilty. But that ignores the base rate: in a city of millions, multiple innocent people could match. Bayes' theorem is the proper way to handle this reasoning.
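A rough numeric illustration (made-up numbers): in a city of 5 million with one guilty person, a 1-in-a-million match rate for innocents still leaves several expected innocent matches:

```python
population = 5_000_000
p_match_given_innocent = 1e-6

# Expected number of innocent people who match by chance.
expected_innocent_matches = (population - 1) * p_match_given_innocent
print(round(expected_innocent_matches, 1))  # 5.0

# If the guilty person matches with certainty, a randomly chosen
# matching individual is guilty with probability only about:
p_guilty_given_match = 1 / (1 + expected_innocent_matches)
print(round(p_guilty_given_match, 2))  # 0.17
```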
Calculating conditional probabilities
Using Venn diagrams
Venn diagrams work well for simple problems with two or three events. Overlapping circles represent events, and the overlap region represents P(A ∩ B).
To find P(A | B) from a Venn diagram:
- Identify the circle for event B (this is your new, restricted sample space)
- Look at the overlap region where A and B intersect
- Divide the overlap area by the total area of B
This visual approach helps build intuition for why conditioning shrinks the sample space.
Two-way tables for calculations
Two-way tables (also called contingency tables) organize data by two categorical variables. They're one of the most practical tools for conditional probability.
| | Disease | No Disease | Total |
|---|---|---|---|
| Test Positive | 95 | 495 | 590 |
| Test Negative | 5 | 9,405 | 9,410 |
| Total | 100 | 9,900 | 10,000 |
To find P(disease | positive): look at the "Test Positive" row only, then divide the Disease cell by the row total: 95 / 590 ≈ 0.161. This matches the Bayes' theorem calculation from earlier.
Marginal probabilities come from the "Total" row and column. Conditional probabilities come from restricting your attention to a single row or column.
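The same table arithmetic, sketched in code:

```python
# Counts from the two-way table (rows: test result, columns: disease status).
table = {
    ("positive", "disease"): 95,
    ("positive", "no_disease"): 495,
    ("negative", "disease"): 5,
    ("negative", "no_disease"): 9405,
}

total = sum(table.values())  # 10,000
row_positive = table[("positive", "disease")] + table[("positive", "no_disease")]

# Conditional: restrict to the "Test Positive" row, then divide.
p_disease_given_positive = table[("positive", "disease")] / row_positive
print(round(p_disease_given_positive, 3))  # 0.161

# Marginal: use a column total over the grand total.
p_disease = (table[("positive", "disease")] + table[("negative", "disease")]) / total
print(p_disease)  # 0.01
```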
Conditional vs. marginal probability
Differences and similarities
- Marginal probability P(A): the overall probability of A, considering all possibilities
- Conditional probability P(A | B): the probability of A within the restricted world where B has occurred
You can always recover marginal probabilities from conditional ones using the law of total probability. And if A and B are independent, the conditional probability equals the marginal probability.
When to use each
Use marginal probabilities when you have no additional information, or when events are independent and extra information doesn't help.
Use conditional probabilities when you have relevant information that changes the likelihood of the event you care about. In practice, this is most situations: you almost always know something that narrows things down.
Conditional independence
Definition and properties
Events A and B are conditionally independent given C if:

P(A ∩ B | C) = P(A | C) × P(B | C)

This means once you know C, learning B gives you no extra information about A. Equivalently:

P(A | B ∩ C) = P(A | C)
An important subtlety: conditional independence given C does not mean A and B are independent overall (marginally). And the reverse is also true: marginally independent events can become dependent once you condition on something.
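The subtlety can be checked by enumeration. A sketch (my own toy setup): C picks one of two biased coins, and A and B are two flips of the chosen coin. Given C the flips are independent, but marginally they are not:

```python
from itertools import product

# C chooses a coin: coin 0 lands heads with prob 0.9, coin 1 with prob 0.1.
p_c = {0: 0.5, 1: 0.5}
p_heads = {0: 0.9, 1: 0.1}

# Joint distribution over (A, B, C), where A and B are flips of coin C.
joint = {}
for c, a, b in product([0, 1], [True, False], [True, False]):
    pa = p_heads[c] if a else 1 - p_heads[c]
    pb = p_heads[c] if b else 1 - p_heads[c]
    joint[(a, b, c)] = p_c[c] * pa * pb

def p(pred):
    return sum(v for k, v in joint.items() if pred(*k))

# Conditional independence given C = 0 holds:
p_ab_given_c0 = p(lambda a, b, c: a and b and c == 0) / p_c[0]
p_a_given_c0 = p(lambda a, b, c: a and c == 0) / p_c[0]
p_b_given_c0 = p(lambda a, b, c: b and c == 0) / p_c[0]
assert abs(p_ab_given_c0 - p_a_given_c0 * p_b_given_c0) < 1e-12

# But marginal independence fails: P(A and B) != P(A) * P(B).
print(round(p(lambda a, b, c: a and b), 2))                   # 0.41
print(round(p(lambda a, b, c: a) * p(lambda a, b, c: b), 2))  # 0.25
```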
Examples in statistics
- Naive Bayes classifiers assume that all features are conditionally independent given the class label. This simplification often works surprisingly well in practice.
- Markov chains assume the future state depends only on the current state, not on the history of past states. This is a form of conditional independence.
- In medical studies, two symptoms might be correlated overall but conditionally independent once you account for the underlying disease.
Probability trees
Construction and interpretation
Probability trees represent multi-step random processes visually.
To build one:
- Start with a root node representing the initial situation
- Draw branches for each possible outcome of the first event, labeling each with its probability
- From each of those endpoints, draw branches for the next event's outcomes with their conditional probabilities
- Continue until all stages are represented
- Check that branches from any single node sum to 1
The leaf nodes (endpoints) represent complete sequences of outcomes.
Solving multi-step problems
Once the tree is built:
- Find a joint probability: multiply all the probabilities along a single path from root to leaf
- Find a marginal probability: identify all paths that lead to the outcome you care about, calculate each path's joint probability, and add them together
- Find a conditional probability: use the joint and marginal results in the formula P(A | B) = P(A ∩ B) / P(B)
Trees keep you organized when problems have multiple dependent stages, and they make it harder to accidentally skip a step.
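The medical-test tree from earlier, walked as paths (a sketch):

```python
# Each path from root to leaf: stage 1 is disease status,
# stage 2 is the test result given that status.
paths = {
    ("disease", "positive"): [0.01, 0.95],
    ("disease", "negative"): [0.01, 0.05],
    ("healthy", "positive"): [0.99, 0.05],
    ("healthy", "negative"): [0.99, 0.95],
}

# Joint probability of a path: multiply along its branches.
joint = {leaf: probs[0] * probs[1] for leaf, probs in paths.items()}

# Sanity check: all leaves together cover the sample space.
assert abs(sum(joint.values()) - 1.0) < 1e-12

# Marginal P(positive): add the paths ending in "positive".
p_positive = sum(p for (status, result), p in joint.items() if result == "positive")
print(round(p_positive, 4))  # 0.059

# Conditional P(disease | positive) = joint / marginal.
print(round(joint[("disease", "positive")] / p_positive, 3))  # 0.161
```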
Conditional probability in machine learning
Naive Bayes classifier
The Naive Bayes classifier predicts which class a data point belongs to using Bayes' theorem. For features X₁, X₂, …, Xₙ and class C:

P(C | X₁, …, Xₙ) ∝ P(C) × P(X₁ | C) × P(X₂ | C) × … × P(Xₙ | C)
The "naive" part is the assumption that features are conditionally independent given the class. This is rarely true in reality, but the classifier still performs well for tasks like spam filtering, sentiment analysis, and document classification.
Hidden Markov models
Hidden Markov models (HMMs) deal with sequences where the underlying states aren't directly observable. They use conditional probabilities in two ways:
- Transition probabilities: the chance of moving from one hidden state to another
- Emission probabilities: the chance of observing a particular output given the current hidden state
HMMs are used in speech recognition, gene sequence analysis, and natural language processing. Algorithms like the Viterbi algorithm use these conditional probabilities to find the most likely sequence of hidden states behind an observed sequence of data.
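A compact Viterbi sketch over a toy HMM (hypothetical weather/activity numbers, chosen only for illustration):

```python
# Toy HMM: hidden weather states, observed activities.
states = ["rainy", "sunny"]
start = {"rainy": 0.6, "sunny": 0.4}
trans = {                                  # transition probabilities
    "rainy": {"rainy": 0.7, "sunny": 0.3},
    "sunny": {"rainy": 0.4, "sunny": 0.6},
}
emit = {                                   # emission probabilities
    "rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

def viterbi(observations):
    # best[s] = (probability, path) of the best path ending in state s.
    best = {s: (start[s] * emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        best = {
            s: max(
                (best[prev][0] * trans[prev][s] * emit[s][obs], best[prev][1] + [s])
                for prev in states
            )
            for s in states
        }
    return max(best.values())

prob, path = viterbi(["walk", "shop", "clean"])
print(path)  # ['sunny', 'rainy', 'rainy']
```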