
🧠Thinking Like a Mathematician Unit 6 Review


6.4 Probability distributions

Written by the Fiveable Content Team • Last updated August 2025

Probability distributions give you a framework for understanding and predicting random events. They're the backbone of statistical analysis, connecting raw randomness to patterns you can actually work with for predictions and decision-making.

This guide covers discrete and continuous distributions, their key properties (mean, variance, skewness), joint distributions, sampling distributions, and how all of this gets applied in fields like finance, engineering, and data science.

Fundamentals of probability distributions

Probability distributions describe how likely different outcomes are for a random process. Once you understand the type of distribution you're dealing with, you can calculate probabilities, make predictions, and compare different scenarios mathematically.

Concept of random variables

A random variable assigns a numerical value to each outcome of a random process. There are two types:

  • Discrete random variables take on distinct, countable values. Think: the number of heads in 10 coin flips. Their probabilities are described by a probability mass function (PMF), which gives the exact probability of each specific outcome.
  • Continuous random variables can take any value within a range. Think: the height of a randomly selected person. Their probabilities are described by a probability density function (PDF), which gives the relative likelihood across a continuum of values.

Types of probability distributions

  • Discrete distributions deal with countable outcomes (binomial, Poisson)
  • Continuous distributions handle outcomes that can be any value in a range (normal, exponential)
  • Univariate distributions involve a single random variable
  • Multivariate distributions describe relationships between two or more random variables
  • Empirical distributions are derived from observed data rather than theoretical models

Probability density functions

A PDF is a mathematical function that describes the likelihood of different outcomes for a continuous random variable. You don't read probabilities directly off the curve. Instead, the area under the curve over an interval gives you the probability of the variable falling in that range.

  • The function must be non-negative for all possible values
  • The total area under the curve always equals 1 (total probability)
  • The shape of the PDF reveals characteristics like symmetry and spread

Cumulative distribution functions

A cumulative distribution function (CDF) gives the probability that a random variable takes a value less than or equal to a given point. It answers the question: what's the probability of getting this value or lower?

  • For discrete distributions, you calculate it by summing probabilities of all values up to that point
  • For continuous distributions, you integrate the PDF up to that point
  • The CDF is always monotonically increasing, ranging from 0 to 1
  • CDFs are especially useful for calculating probabilities over ranges and finding percentiles
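A minimal sketch of the discrete case: the CDF is just a running sum of the PMF. The fair die here is an illustrative assumption, not an example from the guide.

```python
# Discrete CDF as a running sum of the PMF, illustrated with a fair die.
from fractions import Fraction

pmf = {k: Fraction(1, 6) for k in range(1, 7)}  # P(X = k) for a fair die

def cdf(x):
    """P(X <= x): sum the PMF over all values up to x."""
    return sum(p for k, p in pmf.items() if k <= x)

print(cdf(3))  # P(X <= 3) = 1/2
print(cdf(6))  # total probability = 1
```

Using exact fractions sidesteps floating-point noise, which makes the "ranges from 0 to 1" property easy to see.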

Discrete probability distributions

Discrete distributions model situations where outcomes are countable. Each distribution below fits a specific type of scenario, so recognizing which one applies is half the battle.

Bernoulli distribution

The simplest discrete distribution. It models a single trial with exactly two outcomes: success (1) or failure (0).

  • PMF: P(X=x) = p^x(1-p)^{1-x} where x is 0 or 1
  • Mean: E(X) = p
  • Variance: Var(X) = p(1-p)

A coin flip is the classic example. Any yes/no question or single pass/fail event follows a Bernoulli distribution.

Binomial distribution

This extends the Bernoulli to multiple trials. It models the number of successes in n independent trials, each with the same probability p of success.

  • PMF: P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}
  • Mean: E(X) = np
  • Variance: Var(X) = np(1-p)

For example, if a factory produces items with a 3% defect rate and you inspect 50 items, the number of defective items follows a binomial distribution with n=50 and p=0.03.
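The defect-inspection example above can be computed directly from the PMF. This is a sketch using only the standard library; the printed values are checks of the formulas, not data from the guide.

```python
# Binomial PMF via math.comb, using the defect-inspection example
# from the text (n = 50 items, p = 0.03 defect rate).
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 50, 0.03
probs = [binom_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(probs))

print(round(binom_pmf(0, n, p), 4))  # probability of zero defectives (~0.218)
print(round(mean, 6))                # recovers E(X) = n*p = 1.5
```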

Poisson distribution

Models the number of events occurring in a fixed interval of time or space, where events happen independently at a constant average rate.

  • PMF: P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}
  • Both the mean and variance equal \lambda (the rate parameter)
  • Approximates the binomial distribution when n is large and p is small

A hospital emergency room that averages 4.2 arrivals per hour would use a Poisson distribution with \lambda = 4.2 to model the number of arrivals in any given hour.
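A quick sketch of the emergency-room example: the PMF formula evaluated with the standard library, plus a truncated-sum check that the mean really is \lambda.

```python
# Poisson PMF for the ER example in the text: lambda = 4.2 arrivals/hour.
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) = lambda^k * e^(-lambda) / k!"""
    return lam**k * exp(-lam) / factorial(k)

lam = 4.2
p4 = poisson_pmf(4, lam)  # probability of exactly 4 arrivals (~0.194)

# Recover the mean from the PMF; terms beyond k = 59 are negligible here
mean = sum(k * poisson_pmf(k, lam) for k in range(60))
print(round(p4, 4))
print(round(mean, 6))  # ~ 4.2, matching the rate parameter
```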

Geometric distribution

Models how many trials you need before getting your first success in a sequence of independent Bernoulli trials.

  • PMF: P(X=k) = (1-p)^{k-1}p
  • Mean: E(X) = \frac{1}{p}
  • Variance: Var(X) = \frac{1-p}{p^2}

If you're rolling a die waiting for a six, the number of rolls needed follows a geometric distribution with p = \frac{1}{6}, giving an expected value of 6 rolls.
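The die example can be verified numerically: summing k times the PMF (truncated far into the negligible tail) recovers E(X) = 1/p. A sketch:

```python
# Geometric distribution for the die example: p = 1/6, waiting for the
# first six. Verifies E(X) = 1/p by summing the PMF directly.
p = 1 / 6

def geom_pmf(k, p):
    """P(X = k) = (1-p)^(k-1) * p, for k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p

# Truncated expected-value sum; the tail beyond k = 2000 is negligible
expected = sum(k * geom_pmf(k, p) for k in range(1, 2000))
print(round(expected, 4))  # -> 6.0, matching E(X) = 1/p
```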

Continuous probability distributions

Continuous distributions model variables that can take any value in a range. Because there are infinitely many possible values, the probability of any single exact value is technically zero. You always work with probabilities over intervals.

Uniform distribution

The simplest continuous distribution: all outcomes in a range [a, b] are equally likely.

  • PDF: f(x) = \frac{1}{b-a} for a \leq x \leq b
  • Mean: E(X) = \frac{a+b}{2}
  • Variance: Var(X) = \frac{(b-a)^2}{12}

Random number generators typically produce values from a uniform distribution. If a bus arrives at a stop every 15 minutes and you show up at a random time, your wait time follows a uniform distribution on [0, 15].

Normal distribution

The famous bell curve. It shows up constantly in nature and statistics because of the Central Limit Theorem.

  • PDF: f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
  • Fully characterized by its mean (\mu) and standard deviation (\sigma)
  • Symmetric around the mean
  • The 68-95-99.7 rule: approximately 68% of data falls within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3
  • The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as sample size increases, regardless of the population's shape

Applications include modeling human heights, IQ scores, and measurement errors in experiments.
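The 68-95-99.7 rule can be checked directly from the standard normal CDF, which is expressible with the error function. A minimal sketch using only the standard library:

```python
# Check the 68-95-99.7 rule from the standard normal CDF,
# Phi(z) = 0.5 * (1 + erf(z / sqrt(2))).
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

for k in (1, 2, 3):
    within = phi(k) - phi(-k)  # P(mean - k*sd < X < mean + k*sd)
    print(f"within {k} sd: {within:.4f}")
# prints 0.6827, 0.9545, 0.9973 -- the rule's 68%, 95%, 99.7%
```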

Exponential distribution

Models the time between consecutive events in a Poisson process.

  • PDF: f(x) = \lambda e^{-\lambda x} for x \geq 0
  • Mean: E(X) = \frac{1}{\lambda}
  • Variance: Var(X) = \frac{1}{\lambda^2}
  • Has the memoryless property: the probability of waiting another 5 minutes is the same whether you've already waited 2 minutes or 20 minutes

If customers arrive at a store at an average rate of 3 per hour, the time between arrivals follows an exponential distribution with \lambda = 3, giving a mean wait of \frac{1}{3} hour (20 minutes).
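The memoryless property can be verified with the survival function S(x) = e^(-\lambda x): the conditional probability of waiting longer, given you have already waited, equals the unconditional one. The specific s and t values are illustrative.

```python
# Memoryless property of the exponential: P(X > s+t | X > s) = P(X > t).
# lambda = 3 per hour, per the store example in the text.
from math import exp, isclose

lam = 3.0

def survival(x):
    """P(X > x) = e^(-lambda * x) for the exponential distribution."""
    return exp(-lam * x)

s, t = 2.0, 5.0 / 60  # already waited 2 hours; another 5 minutes
conditional = survival(s + t) / survival(s)  # P(X > s+t) / P(X > s)
print(isclose(conditional, survival(t)))  # -> True
```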

Gamma distribution

Generalizes the exponential distribution. While the exponential models the time until one event, the gamma models the time until \alpha events occur.

  • PDF: f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x} for x > 0
  • Shape parameter (\alpha) and rate parameter (\beta) determine the distribution's behavior
  • Mean: E(X) = \frac{\alpha}{\beta}
  • Variance: Var(X) = \frac{\alpha}{\beta^2}
  • When \alpha = 1, the gamma distribution reduces to the exponential distribution

Used in modeling rainfall amounts, insurance claim sizes, and total service times in queuing systems.


Properties of distributions

These properties give you standardized ways to describe and compare any distribution. Knowing the mean and variance of a distribution tells you a lot, but skewness and kurtosis fill in the rest of the picture.

Expected value

The expected value (mean) represents the long-run average outcome of a random variable. It's the value you'd converge on if you repeated the experiment infinitely many times.

  • For discrete distributions: E(X) = \sum_{i} x_i P(X=x_i)
  • For continuous distributions: E(X) = \int_{-\infty}^{\infty} x f(x) dx

The expected value provides a measure of central tendency and is widely used in decision-making and risk assessment. For example, the expected return on an investment helps you compare options.

Variance and standard deviation

Variance measures how spread out a distribution is around its mean. A small variance means values cluster tightly; a large variance means they're more dispersed.

  • For discrete distributions: Var(X) = \sum_{i} (x_i-\mu)^2 P(X=x_i)
  • For continuous distributions: Var(X) = \int_{-\infty}^{\infty} (x-\mu)^2 f(x) dx

Standard deviation is the square root of variance. It's often more useful in practice because it's in the same units as the original data, making it easier to interpret.
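The discrete formulas above are short enough to compute by hand. A sketch that recovers the mean, variance, and standard deviation of a fair die (an illustrative choice):

```python
# Mean, variance, and standard deviation computed directly from a PMF.
from math import sqrt

pmf = {k: 1 / 6 for k in range(1, 7)}  # fair six-sided die

mean = sum(x * p for x, p in pmf.items())                    # E(X)
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())  # Var(X)
std = sqrt(variance)

print(mean)                # -> 3.5
print(round(variance, 4))  # -> 2.9167 (exactly 35/12)
```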

Skewness and kurtosis

Skewness measures asymmetry in a distribution:

  • Positive skew (right-skewed): longer tail on the right, bulk of data on the left. Income distributions are a classic example.
  • Negative skew (left-skewed): longer tail on the left.
  • A perfectly symmetric distribution (like the normal) has skewness of 0.

Kurtosis measures how heavy the tails are:

  • Leptokurtic (kurtosis > 3): heavier tails and a sharper peak than normal. More extreme outliers.
  • Platykurtic (kurtosis < 3): lighter tails and a flatter peak.
  • Mesokurtic (kurtosis = 3): the normal distribution is the baseline.

Both are used in financial modeling to assess risk beyond what variance alone can capture.

Moments of distributions

Moments provide a systematic way to describe a distribution's shape. Each moment captures a different aspect:

  • First moment: mean (location)
  • Second central moment: variance (spread)
  • Third central moment: related to skewness (asymmetry)
  • Fourth central moment: related to kurtosis (tail weight)

Moment generating functions (MGFs) uniquely determine a probability distribution. If two distributions have the same MGF, they're the same distribution. MGFs also simplify calculations when working with sums of independent random variables.

Joint probability distributions

When you're dealing with two or more random variables at once, joint distributions let you analyze how they behave together, not just individually.

Bivariate distributions

A bivariate distribution describes the joint behavior of two random variables. For discrete variables, you use a joint PMF; for continuous variables, a joint PDF.

  • These allow you to calculate probabilities for events involving both variables simultaneously
  • Visualized using 3D surface plots or contour plots for continuous cases
  • Used for analyzing relationships like height and weight, or pairs of stock prices

Marginal distributions

A marginal distribution extracts the distribution of one variable from a joint distribution, ignoring the other variable entirely.

  • For discrete variables: P(X=x) = \sum_y P(X=x, Y=y)
  • For continuous variables: f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dy

You "marginalize out" the variable you don't care about by summing (discrete) or integrating (continuous) over all its possible values.

Conditional distributions

A conditional distribution describes the probability distribution of one variable given that the other variable takes a specific value.

  • For discrete variables: P(Y=y|X=x) = \frac{P(X=x, Y=y)}{P(X=x)}
  • For continuous variables: f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}

This is the foundation of Bayesian inference, where you update your beliefs about one variable after observing another.

Covariance and correlation

Covariance measures how two random variables move together:

Cov(X,Y) = E[(X-\mu_X)(Y-\mu_Y)]

  • Positive covariance: variables tend to increase together
  • Negative covariance: when one increases, the other tends to decrease
  • Zero covariance: no linear relationship (but there could still be a non-linear one)

The correlation coefficient normalizes covariance to a -1 to 1 scale, making it easier to interpret:

\rho_{X,Y} = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}

A correlation of +1 means a perfect positive linear relationship; -1 means perfect negative; 0 means no linear relationship.
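A sketch of both formulas on made-up paired data. The points satisfy y = 2x + 1 exactly, so the correlation should come out as +1.

```python
# Covariance and correlation from paired data (population formulas).
from math import sqrt

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2 * x + 1 for x in xs]  # perfectly linear, so rho should be +1

mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)

cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
sx = sqrt(sum((x - mx) ** 2 for x in xs) / len(xs))
sy = sqrt(sum((y - my) ** 2 for y in ys) / len(ys))
rho = cov / (sx * sy)

print(round(rho, 6))  # -> 1.0 (perfect positive linear relationship)
```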

Sampling distributions

A sampling distribution describes how a sample statistic (like the sample mean) varies across all possible samples of a given size from a population. This is what makes statistical inference possible.

Central limit theorem

The Central Limit Theorem (CLT) is one of the most important results in statistics. It states that the distribution of sample means approaches a normal distribution as sample size increases, regardless of the population's original shape (as long as the population has finite variance).

  • A sample size of at least 30 is the common rule of thumb for the CLT to kick in
  • The mean of the sampling distribution equals the population mean
  • The standard error decreases as sample size increases, meaning larger samples give more precise estimates

This is why the normal distribution appears so often in statistical methods.

Distribution of sample mean

For a sample of size n drawn from a population with mean \mu and standard deviation \sigma:

  • The sampling distribution of \bar{X} has mean \mu
  • The standard error is: SE_{\bar{X}} = \frac{\sigma}{\sqrt{n}}
  • For large n, this distribution is approximately normal (by the CLT)

This is what you use when constructing confidence intervals or performing hypothesis tests about population means.
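A small simulation makes the standard error formula concrete: draw many samples, compute each sample mean, and compare the spread of those means to \sigma / \sqrt{n}. The population (Uniform(0, 1)), sample size, and seed are illustrative choices.

```python
# Simulating the sampling distribution of the mean: its spread should
# be close to sigma / sqrt(n).
import random
from math import sqrt

random.seed(42)
n, trials = 50, 5000
sigma = sqrt(1 / 12)  # sd of a Uniform(0, 1) population

means = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]
grand_mean = sum(means) / trials
se_observed = sqrt(sum((m - grand_mean) ** 2 for m in means) / trials)

print(round(se_observed, 4))      # spread of simulated sample means...
print(round(sigma / sqrt(n), 4))  # ...vs the theoretical standard error
```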

Distribution of sample variance

For samples drawn from a normally distributed population:

  • The quantity \frac{(n-1)S^2}{\sigma^2} follows a chi-square distribution with n-1 degrees of freedom
  • Mean of the sampling distribution: E(S^2) = \sigma^2 (the sample variance is an unbiased estimator)
  • Variance of the sampling distribution: Var(S^2) = \frac{2\sigma^4}{n-1}

This result is used in hypothesis tests and confidence intervals for population variance.


Chi-square distribution

The chi-square distribution arises from summing the squares of independent standard normal random variables. It's characterized by its degrees of freedom (df).

  • Mean = df
  • Variance = 2 × df
  • Right-skewed, but becomes more symmetric as df increases
  • Used in goodness-of-fit tests, tests of independence for categorical data, and variance-related inference

Applications of probability distributions

Statistical inference

Statistical inference uses probability distributions to draw conclusions about populations from sample data. The core activities are:

  • Estimation: finding point estimates and confidence intervals for population parameters
  • Hypothesis testing: making formal decisions about population characteristics
  • Bayesian inference: updating probability estimates as new evidence arrives

Sampling distributions quantify the uncertainty in your estimates, which is what separates statistical reasoning from guessing.

Hypothesis testing

Hypothesis testing is a formal procedure for deciding whether sample data provides enough evidence to reject a claim about a population.

  1. State the null hypothesis (H_0): the default assumption (e.g., no effect, no difference)
  2. State the alternative hypothesis (H_1): the claim you're testing
  3. Calculate a test statistic from the sample data. Under H_0, this statistic follows a known distribution
  4. Find the p-value: the probability of getting results as extreme as (or more extreme than) what you observed, assuming H_0 is true
  5. Compare the p-value to your significance level (\alpha, typically 0.05). If the p-value is less than \alpha, reject H_0
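The steps above can be sketched as a two-sided z-test with a known population \sigma (an assumption made so the test statistic is exactly normal; with \sigma estimated from the sample, a t-test is the usual choice). The sample numbers are made up.

```python
# Two-sided z-test: p-value for H0: mu = mu0, known population sigma.
from math import erf, sqrt

def z_test_p_value(sample_mean, mu0, sigma, n):
    """Steps 3-4 above: test statistic, then two-sided p-value."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))        # test statistic
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))            # normal CDF at |z|
    return 2 * (1 - phi)                               # two-sided p-value

# Illustrative numbers: n = 36, sample mean 103, claimed mean 100, sigma 9
p = z_test_p_value(103, 100, 9, 36)  # z = 2.0
print(round(p, 4))  # -> 0.0455, so reject H0 at alpha = 0.05
```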

Applications include testing whether a new drug is more effective than a placebo, or whether two manufacturing processes produce different defect rates.

Confidence intervals

A confidence interval provides a range of plausible values for a population parameter, along with a specified confidence level (commonly 95%).

  • For a population mean (small sample): \bar{X} \pm t_{\alpha/2, \, n-1} \frac{s}{\sqrt{n}}
  • For a population proportion: \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}

The width of the interval depends on three things: the confidence level, the sample size, and the variability in the data. Larger samples and lower confidence levels produce narrower intervals.
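A sketch of the proportion formula above. The sample counts (120 successes out of 200) are illustrative.

```python
# 95% confidence interval for a population proportion.
from math import sqrt

successes, n = 120, 200
p_hat = successes / n
z = 1.96  # z_(alpha/2) for 95% confidence

margin = z * sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - margin, p_hat + margin
print(f"({lo:.4f}, {hi:.4f})")  # -> (0.5321, 0.6679)
```

Doubling n (with the same p_hat) shrinks the margin by a factor of \sqrt{2}, which is the "larger samples produce narrower intervals" point in code form.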

Risk assessment and decision making

  • Expected value and variance guide risk-reward tradeoffs
  • Value at Risk (VaR) uses distribution tails to quantify potential losses at a given confidence level
  • Monte Carlo simulations generate thousands of random outcomes based on specified distributions to model complex scenarios
  • Decision trees incorporate probabilities of different outcomes to evaluate choices

Applications span financial portfolio management, insurance pricing, and project planning.

Transformations of random variables

Transformations let you derive new distributions from existing ones. If you know the distribution of X, you can often figure out the distribution of Y = g(X).

Linear transformations

A linear transformation takes the form Y = aX + b. The effects on the distribution are predictable:

  • Mean: E(Y) = aE(X) + b
  • Variance: Var(Y) = a^2 Var(X)

The shape of the distribution stays the same; only the location and scale change. Converting temperatures from Celsius to Fahrenheit (F = 1.8C + 32) is a linear transformation. Standardization (Z = \frac{X - \mu}{\sigma}) is another common example, converting any distribution to one with mean 0 and standard deviation 1.
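Standardization is easy to check on a dataset: after the transform, the z-scores should have mean 0 and standard deviation 1. The sample values are illustrative.

```python
# Standardization as a linear transformation: Z = (X - mu) / sigma.
from math import sqrt

xs = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]  # illustrative sample
mu = sum(xs) / len(xs)
sigma = sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

zs = [(x - mu) / sigma for x in xs]
z_mean = sum(zs) / len(zs)
z_sd = sqrt(sum(z ** 2 for z in zs) / len(zs))
print(z_mean)  # ~ 0 (up to floating-point rounding)
print(z_sd)    # ~ 1
```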

Non-linear transformations

Non-linear transformations like Y = X^2 or Y = \ln(X) change the shape of the distribution, not just its location and scale.

For continuous distributions, you use the change of variable technique:

  1. Express x in terms of y: find the inverse function x = g^{-1}(y)
  2. Compute the Jacobian (the derivative \frac{dx}{dy}), which accounts for how the transformation stretches or compresses probability
  3. The new PDF is: f_Y(y) = f_X(g^{-1}(y)) \cdot \left|\frac{dx}{dy}\right|

Logarithmic transformations are commonly used to convert right-skewed data (like income) into something closer to a normal distribution.
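A worked instance of the three steps, chosen for illustration: Y = X^2 with X ~ Uniform(0, 1). The inverse is x = \sqrt{y}, the Jacobian is dx/dy = 1/(2\sqrt{y}), and since f_X = 1, the new PDF is f_Y(y) = 1/(2\sqrt{y}) on (0, 1]. A numeric integration checks that this PDF still carries total probability 1.

```python
# Change-of-variable check for Y = X^2, X ~ Uniform(0, 1):
# f_Y(y) = f_X(sqrt(y)) * |dx/dy| = 1 / (2 * sqrt(y)) on (0, 1].
from math import sqrt

def f_Y(y):
    return 1 / (2 * sqrt(y))

# Midpoint-rule integral of f_Y over (0, 1]; should be ~1
steps = 100_000
total = sum(f_Y((i + 0.5) / steps) / steps for i in range(steps))
print(round(total, 2))  # -> 1.0
```
The midpoint rule is used because f_Y blows up at y = 0; evaluating at cell midpoints avoids the singularity.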

Convolution of distributions

Convolution gives you the distribution of the sum of independent random variables.

  • For discrete variables: the PMF of the sum is the convolution of the individual PMFs
  • For continuous variables: the PDF of the sum is the convolution of the individual PDFs
  • The convolution theorem states that the Fourier transform of a convolution equals the product of the individual Fourier transforms, which often simplifies computation

This comes up when modeling total waiting times, aggregate insurance claims, or combined signals in engineering.
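The discrete case is a nested sum. A sketch convolving two fair-die PMFs to get the familiar distribution of their total:

```python
# Convolution of two fair-die PMFs gives the distribution of their sum.
from itertools import product
from fractions import Fraction

die = {k: Fraction(1, 6) for k in range(1, 7)}

def convolve(pmf_a, pmf_b):
    """PMF of A + B for independent discrete A and B."""
    out = {}
    for (a, pa), (b, pb) in product(pmf_a.items(), pmf_b.items()):
        out[a + b] = out.get(a + b, Fraction(0)) + pa * pb
    return out

two_dice = convolve(die, die)
print(two_dice[7])             # -> 1/6 (six of 36 outcomes sum to 7)
print(sum(two_dice.values()))  # -> 1 (still a valid PMF)
```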

Moment generating functions

The moment generating function (MGF) of a random variable X is defined as:

M_X(t) = E[e^{tX}]

Why MGFs are useful:

  • They uniquely determine a distribution. Same MGF = same distribution.
  • You can find the nth moment by taking the nth derivative of M_X(t) and evaluating at t = 0
  • For sums of independent random variables, the MGF of the sum equals the product of the individual MGFs, which is much simpler than convolution
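The derivative property can be checked numerically. For a Bernoulli(p) variable, the MGF is M(t) = (1-p) + p e^t (a standard result), and a finite-difference derivative at t = 0 should recover the first moment E(X) = p. A sketch with an illustrative p:

```python
# MGF of Bernoulli(p): M(t) = (1-p) + p*e^t. A numerical derivative
# at t = 0 recovers the first moment, E(X) = p.
from math import exp

p = 0.3

def M(t):
    return (1 - p) + p * exp(t)

h = 1e-6
first_moment = (M(h) - M(-h)) / (2 * h)  # central difference for M'(0)
print(round(first_moment, 6))  # -> 0.3, the mean of Bernoulli(0.3)
```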

Probability distributions in the real world

Financial modeling

  • Normal distribution: models short-term stock returns
  • Log-normal distribution: models asset prices over time (prices can't go negative, and returns compound)
  • Student's t-distribution: captures the heavier tails observed in financial returns compared to the normal
  • Poisson distribution: models rare events like defaults or market crashes
  • Copulas: model dependencies between multiple financial variables
  • VaR calculations rely on distribution tails to estimate worst-case losses at a given confidence level

Quality control

  • Binomial distribution: models defective items in a sample (e.g., 3 defectives out of 100 inspected)
  • Poisson distribution: models rare defects in large production runs
  • Normal distribution: describes variation in continuous measurements (part dimensions, fill weights)
  • Exponential distribution: models time between failures
  • Weibull distribution: characterizes product lifetimes with varying failure rates over time
  • Control charts use these distributions to set limits and flag when a process drifts out of specification

Reliability engineering

  • Exponential distribution: models constant failure rates (common for electronics)
  • Weibull distribution: handles failure rates that change over a product's lifetime (increasing wear-out, or decreasing infant mortality)
  • Gamma distribution: models cumulative damage or wear-out processes
  • Log-normal distribution: represents repair times or time-to-failure for certain systems
  • Extreme value distributions: model maximum loads or stresses on structures
  • Reliability functions derived from these distributions help predict maintenance schedules and design redundant systems

Data science applications

  • Normal distribution: underlies many classical statistical techniques (regression, ANOVA)
  • Poisson distribution: models count data in large datasets (click-through rates, fraud occurrences)
  • Exponential and Pareto distributions: describe heavy-tailed phenomena in network science and web traffic
  • Multinomial distribution: models categorical outcomes in classification tasks
  • Beta distribution: represents probabilities or proportions in Bayesian inference
  • Dirichlet distribution: generalizes the beta distribution for modeling distributions over multiple categories (used in topic modeling)