Definition of probability axioms
Probability axioms are the basic rules that all of probability theory is built on. Without them, there'd be no consistent way to assign numbers to uncertain events or to check whether a probability calculation makes sense.
Kolmogorov's axioms
In 1933, Andrey Kolmogorov proposed three axioms that define what a valid probability measure must satisfy. Every probability rule you'll encounter in this course can be traced back to these three statements.
- Non-negativity: The probability of any event is never negative: P(A) ≥ 0 for any event A.
- Normalization: The probability of the entire sample space equals 1. Something has to happen: P(S) = 1, where S is the sample space.
- Additivity: For mutually exclusive events (events that can't happen at the same time), the probability of one or the other occurring equals the sum of their individual probabilities: P(A ∪ B) = P(A) + P(B) when A and B are mutually exclusive.
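The three axioms can be checked mechanically on a concrete distribution. Here is a minimal sketch using a fair six-sided die; the `die` dictionary is a hypothetical example chosen for illustration:

```python
# A fair six-sided die: each face gets probability 1/6.
die = {face: 1/6 for face in range(1, 7)}

# Axiom 1 (non-negativity): every probability is >= 0.
assert all(p >= 0 for p in die.values())

# Axiom 2 (normalization): the whole sample space has probability 1.
assert abs(sum(die.values()) - 1) < 1e-12

# Axiom 3 (additivity): for the disjoint events {2} and {5},
# P(2 or 5) = P(2) + P(5).
p_2_or_5 = die[2] + die[5]
assert abs(p_2_or_5 - (die[2] + die[5])) < 1e-12
```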
These three axioms look simple, but they're powerful. Every theorem in probability, from Bayes' theorem to the law of large numbers, is derived from them.
Why these axioms matter
- They guarantee consistency: you can't get contradictory results if you follow the axioms correctly.
- They allow mathematical modeling of real-world uncertainty, from insurance pricing to quantum mechanics.
- All advanced techniques (hypothesis testing, confidence intervals, Bayesian inference) rest on this foundation.
Properties of probability
Non-negativity
Probabilities can never be negative. If you calculate a negative probability somewhere, that's a signal you've made an error. This applies to every kind of event, whether simple (rolling a 3) or compound (rolling a 3 and flipping heads).
P(A) ≥ 0 for any event A
Normalization
The probabilities of all possible outcomes in a sample space must add up to exactly 1: P(S) = 1. This reflects the certainty that something in the sample space will occur.
This also means any single event's probability falls between 0 and 1 inclusive. You can convert directly to percentages by multiplying by 100: a probability of 0.25 is 25%.
Additivity vs. multiplicativity
These two operations show up constantly, and mixing them up is one of the most common mistakes in probability.
- Additivity is for mutually exclusive events (events that can't both happen). You add their probabilities to find the probability of either one occurring.
P(A ∪ B) = P(A) + P(B) when A and B are mutually exclusive
- Multiplicativity is for independent events (events where one doesn't affect the other). You multiply their probabilities to find the probability of both occurring.
P(A ∩ B) = P(A) × P(B) when A and B are independent
Quick check: "or" problems often involve addition; "and" problems often involve multiplication. But always verify whether the events are actually mutually exclusive or independent before applying these rules.
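The "or adds, and multiplies" distinction can be made concrete with a die and a coin; the events below are illustrative examples:

```python
p_roll_3 = 1/6   # P(rolling a 3)
p_roll_4 = 1/6   # P(rolling a 4)
p_heads = 1/2    # P(coin lands heads)

# "3 or 4" on one roll: mutually exclusive, so ADD.
p_3_or_4 = p_roll_3 + p_roll_4

# "3 and heads" across one roll and one flip: independent, so MULTIPLY.
p_3_and_heads = p_roll_3 * p_heads
```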
Sample space and events
Universal set (sample space)
The sample space is the complete set of all possible outcomes of a probability experiment. It's denoted by S or Ω.
- For a coin toss: S = {heads, tails}
- For a single die roll: S = {1, 2, 3, 4, 5, 6}
Sample spaces can also be countably infinite (number of coin tosses until you get heads) or uncountably infinite (the exact time a radioactive atom decays). The key requirement is that the sample space is exhaustive: it includes every possible outcome with no overlaps between individual outcomes.
Mutually exclusive events
Two events are mutually exclusive if they cannot occur at the same time. Their intersection is the empty set: A ∩ B = ∅.
For example, on a single die roll, "rolling a 2" and "rolling a 5" are mutually exclusive. You can't get both at once. This property is what makes the addition rule (Axiom 3) work cleanly.
Exhaustive events
A set of events is exhaustive if together they cover every outcome in the sample space. Their union equals S, so their probabilities sum to 1.
For example, on a die roll, the events "roll 1–3" and "roll 4–6" are both mutually exclusive and exhaustive. Exhaustive events are useful for partitioning a sample space, which is the basis for the law of total probability.
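The "roll 1–3" / "roll 4–6" partition from the example can be verified directly as sets:

```python
# Two events partitioning a die's sample space (example from the text).
low = {1, 2, 3}
high = {4, 5, 6}
sample_space = set(range(1, 7))

assert low | high == sample_space   # exhaustive: union covers S
assert low & high == set()          # mutually exclusive: empty intersection

# Because they partition S, their probabilities sum to 1.
p_low = len(low) / 6
p_high = len(high) / 6
assert abs(p_low + p_high - 1) < 1e-12
```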
Probability of compound events

Union of events
The union A ∪ B is the event that at least one of A or B occurs; P(A ∪ B) is its probability.
- Mutually exclusive events: P(A ∪ B) = P(A) + P(B)
- Non-mutually exclusive events: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
You subtract in the second formula to avoid double-counting the overlap. This generalizes to more than two events through the inclusion-exclusion principle.
Intersection of events
The intersection A ∩ B is the event that both A and B occur; P(A ∩ B) is its probability.
- Independent events: P(A ∩ B) = P(A) × P(B)
- Dependent events: P(A ∩ B) = P(A) × P(B|A)
Here P(B|A) is the conditional probability of B given that A has occurred. For example, drawing two aces in a row from a deck without replacement involves dependent events: the first draw changes the deck composition for the second.
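The two-aces example works out like this, using exact fractions:

```python
from fractions import Fraction

# Dependent events: drawing two aces in a row without replacement.
p_first_ace = Fraction(4, 52)           # 4 aces in a full 52-card deck
p_second_given_first = Fraction(3, 51)  # one ace and one card are gone

# P(both aces) = P(first ace) * P(second ace | first ace)
p_both = p_first_ace * p_second_given_first
```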
Complement of events
The complement of event A (written A′ or Aᶜ) is everything in the sample space that is not A: P(Aᶜ) = 1 − P(A).
This is surprisingly useful. When a probability is hard to calculate directly, it's often easier to find the complement and subtract from 1. For example, "probability of rolling at least one 6 in four rolls" is much easier to compute as 1 − P(no 6 in four rolls) = 1 − (5/6)⁴ ≈ 0.518.
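The at-least-one-6 calculation in code:

```python
# "At least one 6 in four rolls" via the complement rule.
p_no_six_one_roll = 5/6
p_no_six_four_rolls = p_no_six_one_roll ** 4   # rolls are independent
p_at_least_one_six = 1 - p_no_six_four_rolls   # about 0.518
```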
Conditional probability
Definition and notation
Conditional probability is the probability of an event occurring given that another event has already occurred. It's written P(A|B), read as "the probability of A given B."
P(A|B) = P(A ∩ B) / P(B), where P(B) > 0
Think of it as narrowing the sample space. Once you know B happened, you're no longer considering all of S; you're only looking at outcomes within B. The formula asks: of all the ways B can happen, how many of those also include A?
Bayes' theorem
Bayes' theorem lets you "reverse" a conditional probability. If you know P(B|A) but need P(A|B), Bayes' theorem provides the bridge:
P(A|B) = P(B|A) × P(A) / P(B)
This is essential when reasoning from an observed effect back to its cause. A classic example: a medical test comes back positive (event B). What's the probability you actually have the disease (event A)? You need the test's accuracy P(B|A), the disease's prevalence P(A), and the overall positive rate P(B) to answer correctly.
Bayes' theorem is widely used in medical diagnosis, spam filtering, machine learning, and forensic science.
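Here is a sketch of the medical-test example. The numbers are hypothetical (1-in-1000 prevalence, 99% sensitivity, 5% false-positive rate), not taken from the text:

```python
p_disease = 0.001             # P(A): prior probability (prevalence)
p_pos_given_disease = 0.99    # P(B|A): test sensitivity
p_pos_given_healthy = 0.05    # false-positive rate

# P(B): overall positive rate, via the law of total probability.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
# Roughly 0.019: even after a positive test, the disease is unlikely,
# because the low base rate dominates.
```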
Independence of events
Definition of independence
Events A and B are independent if knowing one occurred tells you nothing about whether the other occurred. Formally:
P(A|B) = P(A)
which is equivalent to:
P(A ∩ B) = P(A) × P(B)
Coin flips are the classic example: the result of the first flip has zero effect on the second. Independence is also a common assumption in statistical models, so recognizing when it holds (and when it doesn't) is critical.
Pairwise vs. mutual independence
For three or more events, there's an important distinction:
- Pairwise independence means every pair satisfies the multiplication rule: P(A ∩ B) = P(A)P(B), P(A ∩ C) = P(A)P(C), and P(B ∩ C) = P(B)P(C).
- Mutual independence requires pairwise independence plus the joint condition: P(A ∩ B ∩ C) = P(A)P(B)P(C).
Pairwise independence does not guarantee mutual independence. There are known counterexamples where all pairs are independent but the three events together are not. For most applications in this course, you'll want mutual independence, so be careful about which condition you're checking.
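One standard counterexample uses two fair coin flips with A = "first is heads", B = "second is heads", and C = "the flips differ" (this specific construction is a well-known illustration, not from the text):

```python
from fractions import Fraction

# All four outcomes of two fair coin flips, equally likely.
outcomes = [(a, b) for a in "HT" for b in "HT"]

def p(event):
    # Probability = favorable outcomes / total outcomes.
    return Fraction(sum(event(o) for o in outcomes), len(outcomes))

A = lambda o: o[0] == "H"       # first flip heads
B = lambda o: o[1] == "H"       # second flip heads
C = lambda o: o[0] != o[1]      # flips differ

# Every pair is independent ...
assert p(lambda o: A(o) and B(o)) == p(A) * p(B)
assert p(lambda o: A(o) and C(o)) == p(A) * p(C)
assert p(lambda o: B(o) and C(o)) == p(B) * p(C)

# ... but the three together are not: P(A and B and C) = 0,
# while P(A)P(B)P(C) = 1/8.
assert p(lambda o: A(o) and B(o) and C(o)) == 0
assert p(A) * p(B) * p(C) == Fraction(1, 8)
```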
Applications of probability axioms
Risk assessment
Probability theory quantifies potential negative outcomes so that decisions can be made rationally. Insurance companies use it to set premiums, engineers use it to estimate failure rates, and public health officials use it to model disease spread. Conditional probability is especially important here: the risk of a flood given that it rained heavily upstream is very different from the unconditional risk.

Statistical inference
Statistical inference uses sample data to draw conclusions about a larger population. Confidence intervals, p-values, and hypothesis tests all rely on probability distributions that follow Kolmogorov's axioms. Without the axioms guaranteeing consistency, these tools wouldn't be trustworthy.
Decision theory
Decision theory combines probabilities with utility (how much you value different outcomes) to identify optimal choices under uncertainty. Expected value calculations weight each outcome's value by its probability. Investment strategies, medical treatment decisions, and AI planning algorithms all use this framework.
Common misconceptions
Gambler's fallacy
The gambler's fallacy is the mistaken belief that past outcomes influence future independent events. If a fair coin lands heads five times in a row, the probability of heads on the next flip is still 0.5. The coin has no memory.
This fallacy violates the principle of independence. It shows up frequently in casino gambling and even in everyday reasoning ("I'm due for some good luck"). Understanding independence is the direct antidote.
Base rate fallacy
The base rate fallacy occurs when you ignore the overall prevalence of an event and focus only on specific evidence. For example, if a disease affects 1 in 10,000 people and a test is 99% accurate, a positive result still doesn't mean you almost certainly have the disease. The low base rate (prior probability) matters enormously.
This is a direct misapplication of Bayes' theorem. The fix is to always account for the prior probability before updating with new evidence.
Probability distributions
Discrete vs. continuous distributions
Probability distributions describe how probability is spread across possible outcomes. They come in two types:
- Discrete distributions apply to countable outcomes (number of heads in 10 flips, number of customers per hour). Each outcome has a specific probability assigned by a probability mass function (PMF).
- Common examples: binomial, Poisson
- Continuous distributions apply to outcomes on a continuous scale (height, temperature, time). Individual points have probability zero; instead, probabilities are assigned to intervals using a probability density function (PDF).
- Common examples: normal, exponential
Both types follow the probability axioms. The total probability across all outcomes (summed for discrete, integrated for continuous) equals 1.
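For a discrete example, the binomial PMF for 10 fair flips can be built directly and checked against the axioms:

```python
from math import comb

# Binomial PMF: P(k heads in n fair flips) = C(n, k) * p^k * (1-p)^(n-k)
n, p = 10, 0.5
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

assert all(prob >= 0 for prob in pmf.values())   # non-negativity
assert abs(sum(pmf.values()) - 1) < 1e-12        # normalization: sums to 1
```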
Cumulative distribution functions
The cumulative distribution function (CDF) gives the probability that a random variable X takes a value less than or equal to some number x:
F(x) = P(X ≤ x)
Key properties of CDFs:
- Always non-decreasing (it can stay flat but never goes down)
- Ranges from 0 to 1
- Works for both discrete and continuous distributions
- For continuous distributions, the PDF is the derivative of the CDF
CDFs are useful for finding probabilities over ranges and for computing quantiles (e.g., "what value does 95% of the distribution fall below?").
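A sketch of a discrete CDF, built by accumulating the binomial(10, 0.5) PMF, showing the non-decreasing and ranges-to-1 properties:

```python
from math import comb

# PMF of the number of heads in 10 fair flips, then its CDF.
n, p = 10, 0.5
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
cdf = [sum(pmf[:k + 1]) for k in range(n + 1)]   # F(k) = P(X <= k)

# CDF properties: non-decreasing, and tops out at 1.
assert all(cdf[k] <= cdf[k + 1] + 1e-12 for k in range(n))
assert abs(cdf[-1] - 1) < 1e-12
```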
Axioms in different interpretations
Frequentist vs. Bayesian approaches
The axioms themselves are the same in both frameworks. What differs is the interpretation of what probability means.
- Frequentist interpretation: Probability is the long-run relative frequency of an event. If you flip a fair coin infinitely many times, the proportion of heads converges to 0.5. Frequentist methods include hypothesis testing and confidence intervals.
- Bayesian interpretation: Probability represents a degree of belief, updated as new evidence arrives. Bayesian methods use prior distributions and Bayes' theorem to compute posterior probabilities and credible intervals.
Both approaches produce valid mathematics because both obey Kolmogorov's axioms. The debate is about philosophy and methodology, not about the underlying math.
Subjective probability
Subjective probability assigns a probability based on personal judgment, knowledge, and experience. It still must follow the axioms (your beliefs shouldn't violate non-negativity or normalization), but it doesn't require repeatable experiments.
This is useful for one-of-a-kind events. What's the probability a specific startup succeeds? There's no long-run frequency to appeal to, but a knowledgeable investor can still assign a meaningful probability. Subjective probability is the foundation of Bayesian statistics and decision theory.