1.2 Conditional probability and independence

Written by the Fiveable Content Team • Last updated August 2025

Conditional probability and independence are the tools you use to update beliefs when new information arrives and to determine whether events influence each other. In stochastic processes, nearly everything builds on these ideas: Markov chains rely on conditional independence, Bayesian inference is a direct application of Bayes' theorem, and the multiplication rule shows up constantly when you work with joint distributions.

Definition of conditional probability

Conditional probability quantifies how likely an event is once you know something else has happened. Instead of looking at the entire sample space, you zoom in on just the outcomes where the given event occurred and ask how probable your target event is within that restricted space.

Probability of an event given another

The notation P(A|B) is read as "the probability of A given B." It represents the probability that event A occurs when you already know event B has occurred. You're essentially restricting your attention to the subset of outcomes where B happens, then measuring how much of that subset also contains A.

Notation and formula

The conditional probability of A given B is:

P(A|B) = \frac{P(A \cap B)}{P(B)}

  • P(A \cap B) is the joint probability of A and B both occurring.
  • P(B) is the marginal probability of B.
  • This formula is only defined when P(B) > 0, since division by zero is undefined.

The intuition: you take the probability that both events happen and normalize it by the probability of the event you're conditioning on.
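As a quick sanity check, here is the definition applied to a small example (the die-roll events are assumed for illustration):

```python
# Conditional probability from the definition P(A|B) = P(A ∩ B) / P(B).
# Example (assumed): one fair six-sided die,
# A = "roll is even", B = "roll is greater than 3".

outcomes = range(1, 7)                     # sample space of a fair die
A = {o for o in outcomes if o % 2 == 0}    # {2, 4, 6}
B = {o for o in outcomes if o > 3}         # {4, 5, 6}

p_B = len(B) / 6                 # P(B) = 3/6
p_A_and_B = len(A & B) / 6       # P(A ∩ B) = 2/6, from {4, 6}

p_A_given_B = p_A_and_B / p_B    # restrict to B, then normalize
print(round(p_A_given_B, 4))     # 0.6667, i.e. 2/3
```

Within the restricted space {4, 5, 6}, two of the three outcomes are even, which is exactly what the formula returns.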

Calculating conditional probabilities

There are several practical methods for computing conditional probabilities, depending on what information you have.

From a joint probability distribution

When you have the full joint distribution of two (or more) random variables, you can read off conditional probabilities directly. Find P(A \cap B) from the joint table, then divide by the marginal P(B). The marginal is obtained by summing across the appropriate row or column.
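A minimal sketch of this procedure, using an illustrative joint table (the probabilities are assumed, not from the text):

```python
# Reading P(A|B) off a joint distribution table (values assumed for illustration).
# A takes values {a0, a1}; B takes values {b0, b1}.
joint = {
    ("a0", "b0"): 0.10, ("a0", "b1"): 0.30,
    ("a1", "b0"): 0.25, ("a1", "b1"): 0.35,
}

# Marginal P(B = b1): sum the joint over all values of A.
p_b1 = sum(p for (a, b), p in joint.items() if b == "b1")   # 0.30 + 0.35 = 0.65

# Conditional P(A = a1 | B = b1) = P(a1, b1) / P(b1).
p_a1_given_b1 = joint[("a1", "b1")] / p_b1
print(round(p_a1_given_b1, 4))   # 0.35 / 0.65 ≈ 0.5385
```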

Using a tree diagram

Tree diagrams break a problem into sequential stages. Each branch represents a possible outcome, and the probability on each branch is a conditional probability given the preceding branches.

  1. Draw the first set of branches for the initial event (e.g., B or B'), labeling each with its probability.
  2. From each first-stage branch, draw branches for the second event (e.g., A or A'), labeling each with the appropriate conditional probability.
  3. To find P(A \cap B), multiply the probabilities along the path that leads to both A and B.
  4. To find P(A|B), take that product and divide by P(B).
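The steps above can be sketched numerically; the branch probabilities here are assumed for illustration:

```python
# Tree-diagram computation with assumed branch probabilities:
# first stage: P(B) = 0.4, P(B') = 0.6;
# second stage: P(A|B) = 0.7, P(A|B') = 0.2.
p_B, p_Bc = 0.4, 0.6
p_A_given_B, p_A_given_Bc = 0.7, 0.2

# Step 3: multiply along each path that ends in A.
p_A_and_B = p_B * p_A_given_B        # path B  -> A: 0.28
p_A_and_Bc = p_Bc * p_A_given_Bc     # path B' -> A: 0.12

# Step 4: divide the path product by P(B) to recover the conditional.
print(round(p_A_and_B / p_B, 2))     # 0.7, consistent with the branch label
```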

With a contingency table

A contingency table (two-way table) displays frequencies or probabilities for two categorical variables. To calculate P(A|B):

  1. Locate the cell where A and B intersect. This gives you the joint count (or probability).
  2. Find the row or column total for B.
  3. Divide the joint value by the B total.

For example, if 30 out of 200 people are in both category A and category B, and 80 people total are in category B, then P(A|B) = \frac{30}{80} = 0.375.
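In code, the same calculation works directly on counts, since the overall total cancels out of the ratio:

```python
# Contingency-table counts from the example: 200 people total,
# 30 in both A and B, 80 in B overall.
n_A_and_B = 30
n_B = 80

# Dividing counts is equivalent to dividing probabilities:
# (30/200) / (80/200) = 30/80, because the total of 200 cancels.
p_A_given_B = n_A_and_B / n_B
print(p_A_given_B)   # 0.375
```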

Multiplication rule

The multiplication rule connects joint probabilities to conditional probabilities. It's the workhorse formula for computing the probability that multiple events all occur.

Deriving the general formula

Rearranging the conditional probability definition gives:

P(A \cap B) = P(A) \cdot P(B|A)

Equivalently, P(A \cap B) = P(B) \cdot P(A|B). Both forms are useful depending on which conditional probability is easier to find.

For three events, the rule extends via the chain rule:

P(A \cap B \cap C) = P(A) \cdot P(B|A) \cdot P(C|A \cap B)

This pattern generalizes to any finite number of events and is heavily used when working with stochastic processes.
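A sketch of the three-event chain rule, using an urn model assumed for illustration (5 red and 3 blue balls, drawn without replacement):

```python
# Chain rule for three events: drawing three balls without replacement
# from an urn of 5 red and 3 blue (numbers assumed for illustration).
# A = 1st ball red, B = 2nd ball red, C = 3rd ball red.
p_A = 5 / 8                  # P(A)
p_B_given_A = 4 / 7          # P(B|A): 4 reds remain among 7 balls
p_C_given_AB = 3 / 6         # P(C|A ∩ B): 3 reds remain among 6 balls

# P(A ∩ B ∩ C) = P(A) · P(B|A) · P(C|A ∩ B)
p_all_red = p_A * p_B_given_A * p_C_given_AB
print(round(p_all_red, 4))   # 5/28 ≈ 0.1786
```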

For independent events

When A and B are independent, knowing one occurred tells you nothing about the other. The conditional probability collapses: P(B|A) = P(B). The multiplication rule simplifies to:

P(A \cap B) = P(A) \cdot P(B)

For dependent events

When A and B are dependent, the occurrence of one changes the probability of the other. You must use the full multiplication rule with the conditional probability:

P(A \cap B) = P(A) \cdot P(B|A)

A classic example: drawing cards without replacement. The probability of drawing two aces is P(\text{1st ace}) \cdot P(\text{2nd ace}|\text{1st ace}) = \frac{4}{52} \cdot \frac{3}{51}.
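The card example can be checked two ways: with the multiplication rule, and by brute-force enumeration of all ordered pairs of positions in the deck:

```python
# Two aces without replacement, computed by the multiplication rule
# and verified by enumerating all ordered pairs of card positions.
from itertools import permutations

p_rule = (4 / 52) * (3 / 51)           # P(1st ace) · P(2nd ace | 1st ace)

deck = ["A"] * 4 + ["x"] * 48          # 4 aces, 48 other cards
pairs = list(permutations(range(52), 2))   # 52 · 51 = 2652 ordered draws
hits = sum(1 for i, j in pairs if deck[i] == "A" and deck[j] == "A")
p_count = hits / len(pairs)            # 12 / 2652

print(round(p_rule, 6), round(p_count, 6))   # 0.004525 0.004525
```

Both routes give 1/221, confirming that the conditional factor 3/51 correctly accounts for the missing ace.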

Law of total probability

The law of total probability lets you compute P(A) by breaking the sample space into simpler pieces and summing up contributions from each piece.

Partitioning the sample space

A partition of the sample space is a collection of events B_1, B_2, \ldots, B_n that are:

  • Mutually exclusive: no two can occur at the same time (B_i \cap B_j = \emptyset for i \neq j).
  • Exhaustive: together they cover every possible outcome (B_1 \cup B_2 \cup \cdots \cup B_n = \Omega).

Applying the law of total probability

Given a partition B_1, B_2, \ldots, B_n:

P(A) = \sum_{i=1}^{n} P(A|B_i) \cdot P(B_i)

Each term P(A|B_i) \cdot P(B_i) represents the contribution to P(A) from the scenario where B_i occurs. You weight each conditional probability by how likely that scenario is, then add them up.
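A short sketch of the weighted sum, with a three-part partition whose probabilities are assumed for illustration:

```python
# Law of total probability over a three-part partition
# (all probabilities assumed for illustration).
partition = {"B1": 0.5, "B2": 0.3, "B3": 0.2}   # P(B_i); sums to 1
cond = {"B1": 0.1, "B2": 0.6, "B3": 0.9}        # P(A | B_i)

# P(A) = Σ_i P(A|B_i) · P(B_i)
p_A = sum(cond[b] * partition[b] for b in partition)
print(round(p_A, 4))   # 0.05 + 0.18 + 0.18 = 0.41
```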

With a tree diagram

On a tree diagram, the law of total probability corresponds to summing all paths that lead to event A:

  1. The first-level branches represent the partition B_1, \ldots, B_n.
  2. The second-level branches represent A occurring (or not) given each B_i.
  3. Multiply along each path that ends at A.
  4. Sum all those products to get P(A).

Bayes' theorem

Bayes' theorem lets you "reverse" a conditional probability. If you know P(B|A) but need P(A|B), Bayes' theorem is the bridge.

Derivation using conditional probability

Start from two expressions for the joint probability:

P(A \cap B) = P(B|A) \cdot P(A) = P(A|B) \cdot P(B)

Setting these equal and solving for P(A|B):

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

The denominator P(B) is often computed using the law of total probability.
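Putting the two pieces together, here is a minimal sketch (prior and likelihoods assumed for illustration) that expands P(B) over the partition {A, A'} and then applies Bayes' theorem:

```python
# Bayes' theorem with the denominator from the law of total probability
# (numbers assumed for illustration).
p_A = 0.3                    # prior P(A)
p_B_given_A = 0.8            # likelihood P(B|A)
p_B_given_Ac = 0.2           # P(B|A')

# Evidence P(B) = P(B|A)·P(A) + P(B|A')·P(A')
p_B = p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)   # 0.24 + 0.14 = 0.38

# Posterior P(A|B) = P(B|A)·P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 4))   # 0.24 / 0.38 ≈ 0.6316
```

Observing B raises the probability of A from the prior 0.3 to roughly 0.63, because B is four times likelier under A than under A'.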

Updating probabilities with new information

Bayes' theorem has a natural interpretation in terms of updating beliefs:

  • Prior P(A): your belief about A before seeing any new data.
  • Likelihood P(B|A): how probable the observed data B is if A were true.
  • Posterior P(A|B): your updated belief about A after observing B.
  • Evidence P(B): the total probability of observing B (acts as a normalizing constant).

The posterior is proportional to the prior times the likelihood: P(A|B) \propto P(B|A) \cdot P(A).

Applications in decision making

Bayes' theorem appears wherever you need to update probabilities with evidence:

  • Medical diagnosis: A test has sensitivity P(+|\text{disease}) and specificity P(-|\text{no disease}). Bayes' theorem converts these into the clinically useful quantity: P(\text{disease}|+), the positive predictive value.
  • Spam filtering: Given features of an email, Bayes' theorem updates the probability that it's spam.
  • Machine learning: Model parameters are updated as new data arrives, treating the parameter as the "hypothesis" and the data as the "evidence."

Independence of events

Independence means that learning one event occurred gives you zero information about whether another event occurred. This property dramatically simplifies calculations.

Definition of independence

Two events A and B are independent if and only if:

P(A \cap B) = P(A) \cdot P(B)

Equivalently, P(A|B) = P(A) (assuming P(B) > 0). The joint probability factors into the product of the marginals.

Checking for independence

To test whether A and B are independent:

  1. Compute P(A), P(B), and P(A \cap B).
  2. Check whether P(A \cap B) = P(A) \cdot P(B).
  3. If equality holds, the events are independent. If not, they're dependent.

Alternatively, check whether P(A|B) = P(A). Both tests are equivalent whenever P(B) > 0.
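This test can be run by enumeration on a finite sample space; the two-dice events below are assumed for illustration:

```python
# Independence check on two rolls of a fair die (example assumed):
# A = "first roll is even", B = "sum is 7".
from itertools import product

space = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes

def P(event):
    """Probability of an event under the uniform distribution on `space`."""
    return sum(1 for o in space if event(o)) / len(space)

A = lambda o: o[0] % 2 == 0          # first die even
B = lambda o: o[0] + o[1] == 7       # sum equals 7
both = lambda o: A(o) and B(o)

# P(A ∩ B) = 3/36 and P(A)·P(B) = (1/2)(1/6) = 3/36, so the events pass the test.
print(abs(P(both) - P(A) * P(B)) < 1e-12)   # True
```

Perhaps surprisingly, "sum is 7" is independent of the first roll: every value of the first die leaves exactly one second-die value that completes a 7.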

Properties of independent events

  • If A and B are independent, then A and B' are also independent (and likewise A' and B, and A' and B').
  • Pairwise independence does not imply mutual independence. Three events A, B, C can each be pairwise independent while still failing to be mutually independent. Mutual independence requires that every subset satisfies the product rule, including the triple: P(A \cap B \cap C) = P(A) \cdot P(B) \cdot P(C).
  • For mutually independent events, any sub-collection is also independent.

Conditional independence

Conditional independence extends the idea of independence to situations where a third event provides context. Two events that are dependent overall might become independent once you condition on a third event (or vice versa).

Definition and notation

Events A and B are conditionally independent given C if:

P(A \cap B|C) = P(A|C) \cdot P(B|C)

This is denoted A \perp B \mid C. Intuitively, once you know C has occurred, learning A gives you no additional information about B.

Conditional independence vs marginal independence

These are genuinely different properties. Neither one implies the other:

  • A and B can be marginally dependent but conditionally independent given C. (Example: two diseases that share a common symptom become independent once you condition on the symptom.)
  • A and B can be marginally independent but conditionally dependent given C. (This is known as Berkson's paradox or "explaining away.")

Confusing these two types of independence is a common source of modeling errors.

Markov chains and conditional independence

Markov chains are stochastic processes where the Markov property holds: the future state depends only on the present state, not on the full history. Formally, for states X_0, X_1, \ldots, X_n:

P(X_{n+1} \mid X_n, X_{n-1}, \ldots, X_0) = P(X_{n+1} \mid X_n)

This is a conditional independence statement: X_{n+1} \perp (X_0, \ldots, X_{n-1}) \mid X_n. The Markov property is what makes these processes tractable, since you only need to track the current state rather than the entire past.
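The property can be checked empirically on a simulated two-state chain (the transition matrix is assumed for illustration): conditioning on extra history should not change the estimated next-step probabilities.

```python
# Empirical check of the Markov property on a simulated two-state chain
# (transition matrix assumed for illustration).
import random

random.seed(0)
T = {0: [0.9, 0.1], 1: [0.4, 0.6]}   # T[s] = [P(next=0|s), P(next=1|s)]

def step(s):
    return 0 if random.random() < T[s][0] else 1

xs = [0]
for _ in range(200_000):
    xs.append(step(xs[-1]))

# Compare P(X_{n+1}=1 | X_n=1) with P(X_{n+1}=1 | X_n=1, X_{n-1}=0):
# under the Markov property both should be T[1][1] = 0.6.
given_1 = [xs[i+1] for i in range(1, len(xs)-1) if xs[i] == 1]
given_01 = [xs[i+1] for i in range(1, len(xs)-1) if xs[i] == 1 and xs[i-1] == 0]

p1 = sum(given_1) / len(given_1)
p01 = sum(given_01) / len(given_01)
print(round(p1, 1), round(p01, 1))   # 0.6 0.6
```

Adding the extra condition X_{n-1} = 0 leaves the estimate unchanged (up to sampling noise), which is exactly the conditional independence the Markov property asserts.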

Applications of conditional probability

In medical testing and diagnosis

Medical tests are characterized by two conditional probabilities:

  • Sensitivity: P(\text{positive} | \text{disease}), the true positive rate.
  • Specificity: P(\text{negative} | \text{no disease}), the true negative rate.

What patients and doctors actually want is the positive predictive value P(\text{disease} | \text{positive}). Bayes' theorem, combined with the disease prevalence (the prior), converts sensitivity and specificity into this clinically relevant number. When prevalence is low, even a highly specific test can have a surprisingly low positive predictive value.
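A sketch of the low-prevalence effect, with sensitivity, specificity, and prevalence values assumed for illustration:

```python
# Positive predictive value via Bayes' theorem, illustrating the
# low-prevalence effect (all three inputs assumed for illustration).
sensitivity = 0.99      # P(+ | disease)
specificity = 0.95      # P(- | no disease)
prevalence = 0.01       # prior P(disease)

# P(+) by the law of total probability: true positives + false positives.
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# PPV = P(disease | +) = P(+|disease)·P(disease) / P(+)
ppv = sensitivity * prevalence / p_pos
print(round(ppv, 3))    # 0.167: most positives are false positives
```

Even with 99% sensitivity and 95% specificity, only about one positive result in six corresponds to an actual case, because the 5% false-positive rate applies to the much larger healthy population.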

In machine learning and data science

  • Naive Bayes classifiers predict class labels by applying Bayes' theorem and assuming conditional independence among features given the class. Despite this strong assumption, they often perform well in practice (e.g., text classification).
  • Bayesian networks represent probabilistic relationships among variables as directed acyclic graphs. Each node stores a conditional probability distribution given its parents, and the full joint distribution factors according to the graph structure.

In genetics and probability

Conditional probability underlies genetic inheritance calculations. Punnett squares, for instance, display the conditional probabilities of offspring genotypes given parental genotypes. In population genetics, conditional probability is used to study how allele frequencies shift under processes like natural selection, genetic drift, and migration.