🧮 Data Science Numerical Analysis

Common Statistical Distributions

Why This Matters

Statistical distributions are the mathematical backbone of everything you'll do in data science and statistics. When you're fitting models, running hypothesis tests, or making predictions, you're implicitly assuming your data follows some underlying distribution. Choose the wrong one, and your confidence intervals are meaningless, your p-values are garbage, and your predictions fall apart. The distributions in this guide aren't just abstract math—they're the tools you'll use to model everything from customer arrivals to stock prices to survival times.

You're being tested on more than just memorizing PDFs and parameters. Exams will ask you to identify which distribution fits a scenario, derive relationships between distributions, and justify computational choices. The key concepts here include discrete vs. continuous modeling, conjugate relationships, limiting behaviors, and the role of parameters in shaping distributions. Don't just memorize formulas—know what problem each distribution solves and when one distribution approximates or generalizes another.


Foundational Discrete Distributions

These distributions model countable outcomes—successes, failures, and events. They form the building blocks of discrete probability and appear constantly in sampling, quality control, and event modeling.

Bernoulli Distribution

  • Single binary trial—the simplest random variable, taking value 1 (success) with probability $p$ and 0 (failure) with probability $1-p$
  • PMF: $P(X=x) = p^x(1-p)^{1-x}$ for $x \in \{0,1\}$, with mean $\mu = p$ and variance $\sigma^2 = p(1-p)$
  • Foundation for compound distributions—the binomial, geometric, and negative binomial all build on independent Bernoulli trials

Binomial Distribution

  • Counts successes in $n$ fixed trials—each trial independent with success probability $p$, giving $X \sim \text{Binomial}(n, p)$
  • PMF: $P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}$, with mean $np$ and variance $np(1-p)$
  • Normal approximation applies when $np \geq 10$ and $n(1-p) \geq 10$—critical for computational efficiency in large-sample problems
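
A quick numerical check of that approximation, sketched with scipy (the particular $n$, $p$, and cutoff below are illustrative choices, not values from the text):

```python
from scipy import stats

n, p = 100, 0.3                    # np = 30 and n(1-p) = 70, so the rule of thumb holds
k = 35

exact = stats.binom.cdf(k, n, p)   # exact P(X <= 35)
mu = n * p
sigma = (n * p * (1 - p)) ** 0.5
approx = stats.norm.cdf(k + 0.5, loc=mu, scale=sigma)  # normal approximation with continuity correction

print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```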

Geometric Distribution

  • Trials until first success—models waiting time in discrete steps, with $X \sim \text{Geometric}(p)$
  • The only memoryless discrete distribution: $P(X > m + n \mid X > m) = P(X > n)$—past failures don't affect future success probability (verified numerically below)
  • Mean $1/p$ and variance $\frac{1-p}{p^2}$—useful in reliability testing and retry-until-success algorithms
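
A minimal sketch of the memoryless property using scipy's survival function (the values of $p$, $m$, and $n$ are arbitrary illustrations):

```python
from scipy import stats

p, m, n = 0.2, 3, 5
geom = stats.geom(p)                         # scipy's geom counts trials until the first success

conditional = geom.sf(m + n) / geom.sf(m)    # P(X > m + n | X > m)
unconditional = geom.sf(n)                   # P(X > n)
print(f"conditional = {conditional:.6f}, unconditional = {unconditional:.6f}")  # identical
```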

Compare: Geometric vs. Negative Binomial—both model trials until success, but geometric stops at the first success while negative binomial waits for $r$ successes. If an FRQ asks about "number of attempts needed," identify whether you need one success or multiple.


Count and Rate-Based Distributions

When you're modeling the number of events in a fixed interval—whether time, space, or another continuous domain—these distributions capture the underlying randomness of arrival processes.

Poisson Distribution

  • Events in fixed intervals—models count data when events occur independently at a constant rate $\lambda$, giving $X \sim \text{Poisson}(\lambda)$
  • PMF: $P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$, with the elegant property that mean = variance = $\lambda$
  • Binomial limit: as $n \to \infty$ and $p \to 0$ with $np = \lambda$ fixed, $\text{Binomial}(n,p) \to \text{Poisson}(\lambda)$—this is your go-to approximation for rare events
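
A small sketch of the rare-event approximation (the $n$, $p$, and $k$ below are illustrative):

```python
from scipy import stats

n, p = 10_000, 0.0003            # large n, tiny p; lambda = np = 3
lam = n * p
k = 5

exact = stats.binom.pmf(k, n, p)
approx = stats.poisson.pmf(k, lam)
print(f"binomial = {exact:.6f}, Poisson approximation = {approx:.6f}")  # nearly identical
```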

Negative Binomial Distribution

  • Trials until $r$ successes—generalizes the geometric distribution with parameters $r$ (target number of successes) and $p$ (success probability)
  • Handles overdispersion in count data where variance exceeds mean—unlike Poisson, allows $\text{Var}(X) > E[X]$
  • Counting the failures before the $r$-th success gives mean $\frac{r(1-p)}{p}$ and variance $\frac{r(1-p)}{p^2}$ (counting total trials shifts the mean to $r/p$; the variance is unchanged)—preferred over Poisson when modeling clustered or bursty events
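
A sketch of the overdispersion point using scipy, which parameterizes the negative binomial by failures before the $r$-th success (the $r$ and $p$ here are illustrative):

```python
from scipy import stats

r, p = 5, 0.4
nb = stats.nbinom(r, p)          # failures before the r-th success

print(f"mean = {nb.mean():.2f}, variance = {nb.var():.2f}")
# variance > mean: overdispersion that a Poisson (variance == mean) cannot capture
```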

Compare: Poisson vs. Negative Binomial—both model counts, but Poisson assumes mean equals variance. When your data shows overdispersion (variance > mean), negative binomial is the better choice. This distinction appears frequently in regression model selection.


Continuous Distributions for Modeling Data

These workhorses model measurements, times, and proportions. Understanding their shapes, parameters, and relationships is essential for both theoretical derivations and practical modeling.

Normal (Gaussian) Distribution

  • The universal limit—symmetric, bell-shaped with $X \sim N(\mu, \sigma^2)$; PDF is $f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
  • Central Limit Theorem: sample means of any distribution with finite variance converge to normal as $n \to \infty$—this justifies most large-sample inference (see the simulation sketch below)
  • Standard normal $Z \sim N(0,1)$ is the reference; transform via $Z = \frac{X - \mu}{\sigma}$ for probability calculations and hypothesis tests
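
A short simulation of the CLT using a heavily skewed parent distribution (the exponential); the sample size and repetition count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 10_000

# Means of n skewed (exponential) observations, repeated many times
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print(f"mean of sample means = {means.mean():.3f} (theory: 1.0)")
print(f"std of sample means  = {means.std():.3f} (theory: {1 / np.sqrt(n):.3f})")
# A histogram of `means` looks close to a normal curve despite the skewed parent
```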

Uniform Distribution

  • Equal probability across $[a, b]$—the "maximum entropy" distribution when you only know the range, with PDF $f(x) = \frac{1}{b-a}$
  • Mean $\frac{a+b}{2}$ and variance $\frac{(b-a)^2}{12}$—memorize these for quick calculations
  • Simulation foundation: if $U \sim \text{Uniform}(0,1)$, you can generate any distribution via inverse transform sampling—this is how uniform random numbers get turned into draws from other distributions
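
A minimal inverse transform sketch: generate Exponential($\lambda$) draws from Uniform(0,1) via the inverse CDF $F^{-1}(u) = -\ln(1-u)/\lambda$ (the rate and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)     # U ~ Uniform(0, 1)

lam = 2.0
x = -np.log(1 - u) / lam          # apply the exponential's inverse CDF to the uniforms

print(f"sample mean = {x.mean():.4f}, theory = {1 / lam:.4f}")
```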

Lognormal Distribution

  • Exponential of normal—if $Y \sim N(\mu, \sigma^2)$, then $X = e^Y$ is lognormal; always positive and right-skewed
  • Multiplicative processes naturally produce lognormal data—stock returns, biological growth, income distributions
  • Mean $e^{\mu + \sigma^2/2}$ and variance $(e^{\sigma^2} - 1)e^{2\mu + \sigma^2}$—note that $\mu$ and $\sigma$ are the mean and standard deviation of the underlying normal (the log scale), not of $X$ itself
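
A quick simulation check of the mean formula, with illustrative $\mu$ and $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 0.8

x = np.exp(rng.normal(mu, sigma, size=500_000))   # X = e^Y with Y ~ N(mu, sigma^2)
print(f"sample mean = {x.mean():.3f}, theory = {np.exp(mu + sigma**2 / 2):.3f}")
```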

Compare: Normal vs. Lognormal—normal models additive effects (measurement error, heights), while lognormal models multiplicative effects (returns, concentrations). If your data is strictly positive and right-skewed, try logging it and checking for normality.


Waiting Time and Reliability Distributions

These continuous distributions model duration, lifetime, and time-to-event data. They're fundamental in survival analysis, queuing theory, and engineering reliability.

Exponential Distribution

  • Time between Poisson events—the continuous analog of the geometric, with rate $\lambda$ and PDF $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$
  • Memoryless property: $P(X > s + t \mid X > s) = P(X > t)$—the only continuous memoryless distribution, modeling "fresh start" scenarios
  • Mean $1/\lambda$ and variance $1/\lambda^2$—connects directly to Poisson: if arrivals are Poisson($\lambda$), inter-arrival times are Exponential($\lambda$)
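
A simulation sketch of that connection: exponential inter-arrival gaps with rate $\lambda$ should produce, on average, $\lambda T$ arrivals in a window of length $T$ (the rate, window, and repetition count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, T, reps = 4.0, 1.0, 20_000

counts = []
for _ in range(reps):
    gaps = rng.exponential(scale=1 / lam, size=50)   # 50 gaps is more than enough to cover T = 1
    arrivals = np.cumsum(gaps)
    counts.append(np.sum(arrivals <= T))

print(f"mean count = {np.mean(counts):.3f}, Poisson mean = {lam * T:.3f}")
```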

Gamma Distribution

  • Sum of exponentials—if $X_1, \ldots, X_k$ are i.i.d. Exponential($\lambda$), then $\sum X_i \sim \text{Gamma}(k, \lambda)$
  • Two-parameter flexibility with shape $k$ and rate $\lambda$ (or scale $\theta = 1/\lambda$); mean $k/\lambda$ and variance $k/\lambda^2$
  • Special cases: Exponential is Gamma(1, $\lambda$); Chi-square($\nu$) is Gamma($\nu/2$, $1/2$)—these relationships are heavily tested
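
A sketch verifying the sum-of-exponentials relationship with a Kolmogorov–Smirnov comparison (shape, rate, and sample size are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, lam = 3, 2.0

sums = rng.exponential(scale=1 / lam, size=(50_000, k)).sum(axis=1)  # sums of k Exp(lam) draws
ks = stats.kstest(sums, stats.gamma(a=k, scale=1 / lam).cdf)
print(f"KS statistic = {ks.statistic:.4f}")   # near zero: the sums match Gamma(k, lam)
```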

Weibull Distribution

  • Flexible failure modeling—shape parameter $k$ controls hazard rate behavior: $k < 1$ (decreasing), $k = 1$ (constant, i.e., exponential), $k > 1$ (increasing)
  • PDF: $f(x) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}e^{-(x/\lambda)^k}$ for $x \geq 0$, with scale $\lambda$
  • Reliability standard—models infant mortality ($k < 1$), random failures ($k = 1$), and wear-out ($k > 1$) in a single framework
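
A small sketch of how the shape parameter changes the hazard rate $h(x) = f(x)/S(x)$, evaluated at a few illustrative points:

```python
import numpy as np
from scipy import stats

x = np.array([0.5, 1.0, 2.0])
for k in (0.5, 1.0, 2.0):                  # decreasing, constant, increasing hazard
    w = stats.weibull_min(k, scale=1.0)
    hazard = w.pdf(x) / w.sf(x)            # h(x) = f(x) / S(x)
    print(f"k = {k}: hazard at {x} -> {np.round(hazard, 3)}")
```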

Compare: Exponential vs. Weibull—exponential assumes constant failure rate (memoryless), while Weibull allows the failure rate to change over time. If a problem mentions "aging" or "wear-out," Weibull with $k > 1$ is your answer.


Distributions for Proportions and Bounded Data

When your random variable is constrained to a finite interval—especially $[0, 1]$—these distributions provide the necessary flexibility.

Beta Distribution

  • Flexible on $[0, 1]$—shape parameters $\alpha$ and $\beta$ control skewness; PDF $f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$
  • Conjugate prior for the binomial likelihood in Bayesian inference—if the prior is Beta($\alpha, \beta$) and you observe $k$ successes in $n$ trials, the posterior is Beta($\alpha + k, \beta + n - k$); see the sketch after this list
  • Mean $\frac{\alpha}{\alpha + \beta}$ and special cases: Uniform(0,1) = Beta(1,1); symmetric when $\alpha = \beta$
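
A minimal sketch of the prior-to-posterior update (the prior parameters and observed counts are illustrative):

```python
from scipy import stats

alpha, beta = 2, 2          # prior Beta(2, 2): mild belief that the proportion is near 0.5
k, n = 7, 10                # observe 7 successes in 10 trials

posterior = stats.beta(alpha + k, beta + n - k)     # conjugate update
lo, hi = posterior.interval(0.95)
print(f"posterior mean = {posterior.mean():.3f}, 95% credible interval = ({lo:.3f}, {hi:.3f})")
```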

Compare: Beta vs. Uniform—uniform is just Beta(1,1), assuming no prior information. As you observe data, the beta posterior concentrates around the true proportion. This prior-to-posterior update is a classic Bayesian exam topic.


Sampling and Inference Distributions

These distributions arise from sampling theory and are essential for hypothesis testing, confidence intervals, and model comparison. They're derived from the normal distribution and appear whenever you're doing inference.

Chi-Square Distribution

  • Sum of squared normals—if $Z_1, \ldots, Z_\nu$ are i.i.d. $N(0,1)$, then $\sum Z_i^2 \sim \chi^2_\nu$ with $\nu$ degrees of freedom
  • Variance inference: the sample variance $S^2$ from normal data satisfies $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$—used for confidence intervals on the variance, as sketched below
  • Goodness-of-fit and independence tests—the test statistic $\sum \frac{(O_i - E_i)^2}{E_i}$ follows a chi-square distribution under the null hypothesis
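
A sketch of that pivotal quantity turned into a 95% confidence interval for $\sigma^2$ (the simulated data and true parameters are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=3, size=25)    # n = 25 observations, true sigma^2 = 9
n, s2 = len(x), x.var(ddof=1)               # sample variance with n - 1 in the denominator

lower = (n - 1) * s2 / stats.chi2.ppf(0.975, df=n - 1)
upper = (n - 1) * s2 / stats.chi2.ppf(0.025, df=n - 1)
print(f"95% CI for sigma^2: ({lower:.2f}, {upper:.2f})")   # should usually contain 9
```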

Student's t-Distribution

  • Ratio of normal to chi-square—if $Z \sim N(0,1)$ and $V \sim \chi^2_\nu$ independently, then $T = \frac{Z}{\sqrt{V/\nu}} \sim t_\nu$
  • Heavier tails than the normal—accounts for uncertainty in estimating $\sigma$; converges to $N(0,1)$ as $\nu \to \infty$
  • Small-sample inference: use $t_{n-1}$ for confidence intervals and hypothesis tests on means when $\sigma$ is unknown—this is the default for real data
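
A small sketch of a t-based confidence interval for a mean when $\sigma$ is unknown (the simulated sample is illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5, scale=2, size=12)       # small sample, sigma unknown

n, xbar, s = len(x), x.mean(), x.std(ddof=1)
tcrit = stats.t.ppf(0.975, df=n - 1)          # critical value from t with n - 1 df
half = tcrit * s / np.sqrt(n)
print(f"95% CI for the mean: ({xbar - half:.2f}, {xbar + half:.2f})")
```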

F-Distribution

  • Ratio of chi-squares—if $U \sim \chi^2_{d_1}$ and $V \sim \chi^2_{d_2}$ independently, then $F = \frac{U/d_1}{V/d_2} \sim F_{d_1, d_2}$
  • ANOVA test statistic: compares between-group variance to within-group variance; a large $F$ suggests the group means differ (see the sketch below)
  • Regression significance: the overall F-test checks whether at least one predictor has a nonzero coefficient—always report it alongside $R^2$
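
A minimal one-way ANOVA sketch with three simulated groups (the group means and sizes are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(10.0, 2, size=30)
b = rng.normal(10.5, 2, size=30)
c = rng.normal(12.0, 2, size=30)

F, pval = stats.f_oneway(a, b, c)      # between-group vs. within-group variance
print(f"F = {F:.2f}, p = {pval:.4f}")  # large F and small p suggest the group means differ
```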

Compare: t vs. F distributions—t-tests compare one or two means, while F-tests compare variances or multiple means simultaneously. Note that $t^2_\nu = F_{1,\nu}$—a two-sided t-test is equivalent to an F-test with 1 numerator degree of freedom.


Quick Reference

  • Discrete counts (fixed trials): Bernoulli, Binomial, Geometric, Negative Binomial
  • Event rates in continuous time/space: Poisson, Exponential
  • Symmetric continuous data: Normal, Student's t
  • Positive, right-skewed data: Lognormal, Gamma, Weibull, Exponential
  • Bounded proportions on $[0,1]$: Beta, Uniform
  • Variance and model comparison: Chi-Square, F
  • Small-sample mean inference: Student's t
  • Bayesian conjugate priors: Beta (for binomial), Gamma (for Poisson)

Self-Check Questions

  1. A call center receives an average of 4 calls per minute. Which distribution models the number of calls in a 5-minute window, and which models the time until the next call?

  2. You're modeling the proportion of defective items in a batch using Bayesian inference with a binomial likelihood. What distribution family should your prior belong to, and why?

  3. Compare the exponential and Weibull distributions: under what conditions does Weibull reduce to exponential, and when would you prefer Weibull in a reliability analysis?

  4. Your sample variance from 25 observations is used to construct a confidence interval for the population variance. What distribution does the pivotal quantity follow, and how many degrees of freedom does it have?

  5. Explain why the normal distribution appears so frequently in inference, even when the underlying data is clearly non-normal. What theorem justifies this, and what conditions must hold?