Inferential statistics is the bridge between your sample data and the broader population you actually care about. Every time you see a poll's "margin of error," read about a drug trial's results, or encounter a claim that one algorithm outperforms another, you're seeing inferential statistics at work. In data science, you're constantly being tested on your ability to quantify uncertainty, compare groups rigorously, and draw defensible conclusions from limited data.
The concepts here aren't just isolated techniques—they form an interconnected toolkit. Hypothesis testing and confidence intervals are two sides of the same coin. Point estimation feeds into maximum likelihood, which connects to Bayesian inference. Understanding these relationships is what separates memorization from mastery. Don't just learn what each method does—know when to use it, what assumptions it requires, and how it relates to the others.
Estimation: From Sample to Population
Before you can test anything, you need to estimate population parameters from your sample. These methods answer the fundamental question: given what I observed, what can I infer about the true underlying value?
Point Estimation
Single-value estimates—a single "best guess" for a population parameter, such as using x̄ to estimate μ
Common estimators include the sample mean, sample proportion, and sample variance, each targeting its population counterpart
No uncertainty information—point estimates alone don't tell you how confident you should be in that single number
Confidence Intervals
Range of plausible values—a 95% CI means if you repeated sampling many times, about 95% of intervals would contain the true parameter
Width reflects precision—narrower intervals indicate more certainty, driven by larger sample sizes and lower variability
Interpretation matters—the parameter is fixed; it's the interval that's random across repeated samples
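As a quick sketch of the mechanics (assuming NumPy and SciPy are available; the data below are simulated placeholders), a 95% t-interval for a mean looks like this:

```python
import numpy as np
from scipy import stats

# Simulated stand-in for observed measurements
rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=10, size=40)

n = len(sample)
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# 95% CI from the t-distribution with n - 1 degrees of freedom
ci_low, ci_high = stats.t.interval(0.95, n - 1, loc=mean, scale=se)
print(f"mean = {mean:.2f}, 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

Rerunning with a larger `size` shrinks the interval, which is the sample-size effect described above.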
Maximum Likelihood Estimation
Finds parameters that maximize L(θ∣data)—the likelihood function measures how probable your observed data is under different parameter values
Asymptotically optimal—MLEs are consistent, efficient, and approximately normal for large samples
Foundation for advanced models—used in logistic regression, GLMs, survival analysis, and most modern statistical software
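To make the likelihood idea concrete, here is a minimal sketch (assuming SciPy; the data are simulated) that recovers normal parameters by minimizing the negative log-likelihood, which is equivalent to maximizing L(θ∣data):

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # simulated observations

def neg_log_likelihood(params, x):
    mu, log_sigma = params           # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=sigma))

result = optimize.minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"MLE: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
# For the normal, these match the sample mean and the (ddof=0) sample SD
```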
Compare: Point Estimation vs. Confidence Intervals—both estimate population parameters, but point estimates give a single value while CIs quantify the uncertainty around that estimate. FRQs often ask you to interpret CIs correctly—remember, it's about the procedure's reliability, not the probability that the parameter falls in a specific interval.
Hypothesis Testing Framework
Hypothesis testing formalizes the process of using data to make decisions. The core logic: assume nothing interesting is happening (null hypothesis), then ask whether your data are surprising enough to reject that assumption.
Hypothesis Testing
Null vs. alternative—H0 represents "no effect" or "no difference"; Ha is what you're trying to find evidence for
Significance level α—typically 0.05, this is your threshold for how much Type I error (false positive) risk you'll accept
p-value interpretation—the probability of observing data this extreme if H0 were true; small p-values suggest H0 is unlikely
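A minimal sketch of the full loop (assuming SciPy; the sample is simulated): state H0, compute a p-value, compare to α.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=10.4, scale=2.0, size=30)  # simulated measurements

# H0: mu = 10 vs. Ha: mu != 10
t_stat, p_value = stats.ttest_1samp(sample, popmean=10.0)

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```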
Power Analysis
Determines required sample size—calculates how many observations you need to detect a true effect with high probability
Balances four quantities—effect size, α, sample size n, and power (1−β); fixing three determines the fourth
Minimizes Type II errors—ensures your study can actually detect meaningful effects rather than being underpowered
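As an illustration (this assumes the statsmodels package; the effect size of 0.5 is a hypothetical "medium" effect), fixing three of the four quantities and solving for the sample size needed to reach 80% power:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Fix effect size (Cohen's d), alpha, and power; solve for n per group
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Need roughly {n_per_group:.0f} observations per group")  # about 64
```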
Effect Size Estimation
Quantifies practical significance—statistical significance doesn't mean the effect matters; effect size tells you how much it matters
Common measures include Cohen's d (standardized mean difference), r² (variance explained), and η² (for ANOVA)
Essential for interpretation—a tiny effect can be statistically significant with large n; always report effect sizes alongside p-values
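Cohen's d is easy to compute directly; this sketch (NumPy only, with simulated groups) also shows how a small effect can coexist with a huge n:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1)
                  + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

# Simulated groups with a true standardized difference of 0.1
rng = np.random.default_rng(8)
a = rng.normal(101.5, 15.0, 5000)
b = rng.normal(100.0, 15.0, 5000)
print(f"d = {cohens_d(a, b):.2f}")  # ~0.1: "small" by Cohen's benchmarks
```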
Compare: p-values vs. Effect Sizes—p-values tell you whether an effect exists, while effect sizes tell you whether it's meaningful. A study with n=100,000 might find p<0.001 for a practically negligible difference. If an FRQ asks about study interpretation, address both.
Comparing Groups: Tests for Means and Variances
When you need to determine whether groups differ, these tests provide the statistical machinery. The choice of test depends on how many groups you're comparing, whether data are paired, and what assumptions you can justify.
t-Tests
Compares means of two groups—independent t-tests for separate groups, paired t-tests for matched or repeated measurements
Test statistic—t = (x̄1 − x̄2) / SE—measures how many standard errors apart the means are
Assumptions—normality (relaxed for large n) and equal variances (Welch's t-test handles unequal variances)
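A sketch of an independent-samples comparison (assuming SciPy; the groups are simulated). Setting `equal_var=False` gives Welch's test, the safer default when variances may differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(100, 15, size=40)  # simulated scores
group_b = rng.normal(108, 25, size=35)  # different mean and spread

# Welch's t-test: equal_var=False drops the equal-variances assumption
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```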
z-Tests
Used when σ is known or n is large—relies on the standard normal distribution rather than the t-distribution
Test statistic—z = (x̄ − μ0) / (σ/√n)—appropriate for proportions and large-sample means
Converges with t-test—as n→∞, the t-distribution approaches the standard normal
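For proportions, the most common large-sample case, a z-test might look like this sketch (assuming statsmodels; the counts are hypothetical A/B-test results):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical conversions: 120 of 1000 (variant A) vs. 150 of 1000 (variant B)
count = [120, 150]
nobs = [1000, 1000]

z_stat, p_value = proportions_ztest(count, nobs)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```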
Analysis of Variance (ANOVA)
Compares means across 3+ groups simultaneously—avoids inflated Type I error from multiple t-tests
F-statistic—F = MS_between / MS_within—ratio of between-group variance to within-group variance
Assumptions—normality, independence, and homogeneity of variances (Levene's test can check the latter)
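A one-way ANOVA sketch (assuming SciPy; three simulated groups), with Levene's test as the variance-homogeneity check mentioned above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
g1 = rng.normal(50, 8, 25)
g2 = rng.normal(53, 8, 25)
g3 = rng.normal(49, 8, 25)

f_stat, p_value = stats.f_oneway(g1, g2, g3)   # omnibus test of equal means
lev_stat, lev_p = stats.levene(g1, g2, g3)     # check the equal-variance assumption
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")
print(f"Levene: p = {lev_p:.4f}")
```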
F-Tests
Compares variances between groups—tests whether populations have equal spread, not just equal centers
Underlies ANOVA and regression—the F-statistic in ANOVA is an F-test; regression uses it to test overall model significance
Sensitive to non-normality—violations of normality affect F-tests more than they affect t-tests
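The classical two-sample variance F-test is simple enough to compute by hand; a sketch with simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
a = rng.normal(0, 1.0, 30)
b = rng.normal(0, 1.5, 30)

f_stat = np.var(a, ddof=1) / np.var(b, ddof=1)   # ratio of sample variances
df1, df2 = len(a) - 1, len(b) - 1

# Two-sided p-value from the F distribution
p_value = 2 * min(stats.f.cdf(f_stat, df1, df2), stats.f.sf(f_stat, df1, df2))
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

Remember the caveat above: this p-value is only trustworthy when both samples are close to normal.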
Compare: t-Test vs. ANOVA—both compare means, but t-tests handle exactly two groups while ANOVA handles three or more. Running multiple t-tests inflates your false positive rate; ANOVA controls this with a single omnibus test. Follow-up with post-hoc tests (Tukey, Bonferroni) to identify which groups differ.
Categorical Data and Associations
Not all data are continuous. When you're working with counts, categories, or proportions, these methods test for relationships and goodness of fit. The key insight: compare what you observed to what you'd expect if variables were independent.
Chi-Square Tests
Tests association between categorical variables—compares observed cell counts to expected counts under independence
Test statistic—χ² = Σ (O − E)² / E—large values indicate observed data deviate substantially from expectation
Two main uses—goodness-of-fit (does data match a distribution?) and test of independence (are variables related?)
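Here is a sketch of the independence test on a hypothetical 2×2 table (assuming SciPy), which also surfaces the expected counts the statistic is built from:

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows = treatment/control, cols = success/failure
observed = np.array([[45, 55],
                     [30, 70]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
print("Expected counts under independence:\n", expected)
```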
Compare: Chi-Square vs. t-Test—chi-square tests work with categorical data and frequencies, while t-tests require continuous outcomes. If your dependent variable is "pass/fail" or "category A/B/C," reach for chi-square. If it's a measurement like height or score, use t-tests.
Modeling Relationships
Regression methods go beyond "is there a difference?" to ask "how are variables related, and can we predict outcomes?" These techniques model the functional form of relationships and quantify how predictors influence responses.
Regression Analysis
Coefficients have interpretations—in linear regression, β1 is the expected change in Y for a one-unit increase in X, holding other predictors constant
Inference on coefficients—t-tests assess whether each individual βj = 0; an F-test assesses overall model significance
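A compact sketch (assuming statsmodels; x and y are simulated) showing both levels of inference, since the fitted summary reports a t-test per coefficient plus the overall F-test:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.8 * x + rng.normal(0, 1.5, 100)  # simulated linear relationship

X = sm.add_constant(x)        # adds the intercept column
model = sm.OLS(y, X).fit()
print(model.summary())        # per-coefficient t-tests plus the overall F-test
```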
Compare: ANOVA vs. Regression—mathematically equivalent for comparing group means (ANOVA is regression with dummy variables), but regression extends to continuous predictors and complex models. Use ANOVA language for experimental designs, regression language for predictive modeling.
Alternative Inference Frameworks
Classical frequentist methods aren't the only game in town. These approaches either relax distributional assumptions or incorporate prior knowledge, expanding your inferential toolkit.
Bayesian Inference
Updates beliefs with data—uses Bayes' theorem: P(θ | data) ∝ P(data | θ) · P(θ)
Prior + likelihood = posterior—your prior beliefs about parameters combine with observed evidence to produce updated probabilities
Naturally quantifies uncertainty—posterior distributions give full probability statements about parameters, not just point estimates
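The Beta-Binomial pair is the classic worked example because the posterior has a closed form; a sketch assuming SciPy and hypothetical trial counts:

```python
from scipy import stats

# Prior belief about a conversion rate theta: Beta(2, 2), weakly centered at 0.5
prior_a, prior_b = 2, 2

# Hypothetical data: 18 successes in 50 trials
successes, trials = 18, 50

# Conjugacy: posterior = Beta(a + successes, b + failures)
posterior = stats.beta(prior_a + successes, prior_b + trials - successes)
print(f"Posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```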
Non-parametric Tests
No distributional assumptions—don't require normality; work with ranks rather than raw values
Key examples—Mann-Whitney U (alternative to the independent t-test), Wilcoxon signed-rank (alternative to the paired t-test), Kruskal-Wallis (alternative to one-way ANOVA)
Robust but less powerful—sacrifice some ability to detect true effects in exchange for validity under violated assumptions
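A rank-based comparison in practice, sketched with SciPy on simulated skewed data where a t-test's normality assumption would be shaky:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
a = rng.exponential(1.0, 20)   # heavily skewed, small samples
b = rng.exponential(1.5, 20)

# Mann-Whitney U: rank-based alternative to the independent t-test
u_stat, p_value = stats.mannwhitneyu(a, b, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```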
Bootstrapping
Resamples with replacement—creates thousands of "new" datasets from your original sample to approximate the sampling distribution
Estimates SEs and CIs without formulas—particularly valuable when theoretical distributions are unknown or complex
Handles unusual estimators—works for medians, ratios, or any statistic where closed-form standard errors don't exist
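A percentile-bootstrap CI for a median, sketched with plain NumPy on simulated skewed data (the resample count of 10,000 is a conventional choice, not a requirement):

```python
import numpy as np

rng = np.random.default_rng(7)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=50)  # simulated skewed data

# Resample with replacement many times, recomputing the median each time
boot_medians = np.array([
    np.median(rng.choice(sample, size=len(sample), replace=True))
    for _ in range(10_000)
])

# Percentile method: take the middle 95% of the bootstrap distribution
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(f"median = {np.median(sample):.3f}, "
      f"95% bootstrap CI: ({ci_low:.3f}, {ci_high:.3f})")
```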
Compare: Frequentist vs. Bayesian—frequentists treat parameters as fixed and data as random; Bayesians treat parameters as random variables with distributions. Frequentist CIs say "95% of intervals constructed this way contain the true value"; Bayesian credible intervals say "there's a 95% probability the parameter lies here." Know which framework a problem assumes.
Quick Reference Table
| Concept | Best Examples |
| --- | --- |
| Point vs. Interval Estimation | Point Estimation, Confidence Intervals, MLE |
| Testing Framework | Hypothesis Testing, Power Analysis, Effect Size |
| Comparing Two Means | t-Tests, z-Tests |
| Comparing 3+ Means | ANOVA, F-Tests |
| Categorical Data | Chi-Square Tests |
| Modeling Relationships | Regression Analysis |
| Distribution-Free Methods | Non-parametric Tests, Bootstrapping |
| Incorporating Prior Knowledge | Bayesian Inference |
Self-Check Questions
What's the key difference between a confidence interval and a Bayesian credible interval, and when would you prefer one over the other?
You're comparing customer satisfaction scores across four different product versions. Which test should you use, and why would running six separate t-tests be problematic?
Compare and contrast: How do maximum likelihood estimation and Bayesian inference differ in their treatment of parameters? What additional input does Bayesian inference require?
Your data are heavily skewed and your sample size is only 15. Which methods from this guide would still be valid, and which would you avoid?
A study reports p=0.03 but Cohen's d=0.1. Explain what this combination tells you about the results and what you'd want to know about the study design.