Inferential statistics is the bridge between your sample data and the broader population you actually care about. Every time you see a poll's "margin of error," read about a drug trial's results, or encounter a claim that one algorithm outperforms another, you're seeing inferential statistics at work. In data science and statistics courses, you're constantly being tested on your ability to quantify uncertainty, compare groups rigorously, and draw defensible conclusions from limited data.
These concepts form an interconnected toolkit, not a list of isolated techniques. Hypothesis testing and confidence intervals are two sides of the same coin. Point estimation feeds into maximum likelihood, which connects to Bayesian inference. Understanding these relationships is what separates memorization from mastery. For each method, know when to use it, what assumptions it requires, and how it relates to the others.
Before you can test anything, you need to estimate population parameters from your sample. These methods answer a fundamental question: given what I observed, what can I infer about the true underlying value?
A point estimate is your single "best guess" for a population parameter. For example, the sample mean x̄ estimates the population mean μ, and the sample proportion p̂ estimates the population proportion p.
A confidence interval (CI) provides a range of plausible values for a parameter, built around a point estimate. The general form is:

point estimate ± (critical value) × (standard error)
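A minimal sketch of this formula in Python, using made-up measurement data (the `sample` values are purely illustrative):

```python
from statistics import NormalDist, mean, stdev
from math import sqrt

# Hypothetical sample of measurements (illustrative data)
sample = [4.8, 5.1, 4.9, 5.3, 5.0, 4.7, 5.2, 5.1, 4.9, 5.0]

n = len(sample)
x_bar = mean(sample)              # point estimate of the population mean
se = stdev(sample) / sqrt(n)      # standard error of the mean

# 95% CI using the normal critical value (a t critical value is more
# accurate for small n; z is used here to stay in the standard library)
z_star = NormalDist().inv_cdf(0.975)   # about 1.96
ci_low, ci_high = x_bar - z_star * se, x_bar + z_star * se

print(f"{x_bar:.2f} +/- {z_star * se:.2f} -> ({ci_low:.2f}, {ci_high:.2f})")
```

Note the structure mirrors the formula exactly: estimate, critical value, standard error.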
Maximum likelihood estimation (MLE) finds the parameter values that make your observed data most probable. You write a likelihood function L(θ) and find the θ that maximizes it (often by maximizing the log-likelihood instead, since sums are easier to work with than products).
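As a sketch of the idea, here is a brute-force grid search for the Bernoulli MLE on hypothetical coin-flip data. The analytic answer is the sample proportion; the numeric search recovers it:

```python
from math import log

# Hypothetical coin-flip data: 1 = heads (illustrative)
flips = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # 7 heads in 10 flips

def log_likelihood(p, data):
    """Bernoulli log-likelihood: sum of log p for 1s, log(1-p) for 0s."""
    return sum(log(p) if x == 1 else log(1 - p) for x in data)

# Grid search over candidate values of p (avoiding 0 and 1, where log blows up)
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, flips))

print(p_hat)  # the maximizer equals the sample proportion, 0.7
```

In practice you would maximize analytically (set the derivative to zero) or with a numerical optimizer, but the grid makes the "find the θ that maximizes L(θ)" logic explicit.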
Compare: Point Estimation vs. Confidence Intervals: both estimate population parameters, but point estimates give a single value while CIs quantify the uncertainty around that estimate. If you're asked to interpret a CI, remember it's about the procedure's reliability across repeated samples, not the probability that the parameter falls in one specific interval.
Hypothesis testing formalizes the process of using data to make decisions. The core logic: assume nothing interesting is happening (null hypothesis), then ask whether your data are surprising enough to reject that assumption.
The process follows a consistent structure:

1. State the null hypothesis H₀ and the alternative H₁.
2. Choose a significance level α (commonly 0.05).
3. Compute a test statistic from the sample.
4. Find the p-value (or compare the statistic to a critical value).
5. Reject H₀ if the p-value falls below α; otherwise, fail to reject.
The p-value is the probability of observing data this extreme or more extreme, assuming H₀ is true. A small p-value means your data would be unlikely under the null, which is evidence against it. A large p-value does not prove H₀ is true; it just means you lack sufficient evidence to reject it.
Two types of errors to track:

- Type I error (α): rejecting H₀ when it is actually true (a false positive).
- Type II error (β): failing to reject H₀ when it is actually false (a false negative).
Statistical power (1 − β) is the probability of correctly rejecting H₀ when a real effect exists. Power analysis is typically done before collecting data to determine how large your sample needs to be.
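A sketch of a pre-study power calculation, using the standard normal-approximation formula for a two-sided, two-sample comparison (the effect size d = 0.5 is a hypothetical planning assumption):

```python
from statistics import NormalDist
from math import ceil

def sample_size_two_groups(d, alpha=0.05, power=0.80):
    """Normal-approximation n per group for a two-sided two-sample test.

    d is the standardized effect size (Cohen's d). This ignores the small
    t-distribution correction, so treat the result as a rough lower bound.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

# Hypothetical planning scenario: detect a medium effect (d = 0.5)
# at alpha = 0.05 with 80% power
n_per_group = sample_size_two_groups(0.5)
print(n_per_group)  # 63 per group
```

Notice the inverse-square dependence on d: halving the effect size you want to detect quadruples the required sample.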
Statistical significance alone doesn't tell you whether an effect is meaningful. Effect size quantifies the magnitude of a difference or relationship.
A tiny effect can be statistically significant with a large enough n. Always report and consider effect sizes alongside p-values.
Compare: p-values vs. Effect Sizes: p-values tell you whether an effect is likely real, while effect sizes tell you whether it's practically meaningful. A study with an enormous sample might find p < 0.05 for a negligible difference. When interpreting results, address both.
When you need to determine whether groups differ, these tests provide the statistical machinery. The choice of test depends on how many groups you're comparing, whether data are paired, and what assumptions you can justify.
The t-test compares means when the population standard deviation is unknown (which is almost always the case in practice).
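A minimal sketch of the two-sample t statistic (Welch's version, which doesn't assume equal variances); the group scores are made up for illustration:

```python
from statistics import mean, variance
from math import sqrt

# Hypothetical scores for two groups (illustrative data)
group_a = [82, 75, 90, 68, 77, 85, 80, 74]
group_b = [70, 65, 80, 72, 66, 74, 68, 71]

def welch_t(x, y):
    """Welch's t statistic: difference in means over its standard error,
    with no equal-variance assumption between groups."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

t_stat = welch_t(group_a, group_b)
print(round(t_stat, 2))
```

In coursework you would then compare this statistic to a t distribution (or use `scipy.stats.ttest_ind` with `equal_var=False`) to get the p-value; the sketch stays in the standard library to show the formula itself.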
The z-test is used when the population standard deviation is known or when n is large enough that the normal approximation applies (common for proportions).
ANOVA compares means across three or more groups simultaneously, avoiding the inflated Type I error that comes from running multiple t-tests.
The F-test compares two variances by taking their ratio. It's the engine behind ANOVA and also tests overall significance in regression.
Compare: t-Test vs. ANOVA: both compare means, but t-tests handle exactly two groups while ANOVA handles three or more. Running multiple t-tests (e.g., six pairwise comparisons for four groups) inflates your false positive rate well beyond α. ANOVA controls this with a single omnibus test, then you follow up with post-hoc tests to pinpoint which groups differ.
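The one-way ANOVA F statistic can be computed by hand to see what it measures: between-group variability relative to within-group variability. A sketch with four hypothetical groups (the scores are made up):

```python
from statistics import mean

# Hypothetical scores for four groups (illustrative data)
groups = [
    [72, 75, 78, 71],
    [80, 83, 79, 82],
    [68, 70, 66, 69],
    [74, 77, 73, 76],
]

def one_way_f(groups):
    """One-way ANOVA F statistic: between-group vs within-group variance."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)

    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)   # df = k - 1
    ms_within = ss_within / (n - k)     # df = n - k
    return ms_between / ms_within

f_stat = one_way_f(groups)
print(round(f_stat, 2))
```

A large F says the group means spread out more than the noise within groups would explain; the p-value then comes from the F distribution with (k − 1, n − k) degrees of freedom.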
Not all data are continuous. When you're working with counts, categories, or proportions, chi-square methods test for relationships and goodness of fit. The core idea: compare what you observed to what you'd expect if variables were independent.
The chi-square test compares observed frequencies O to expected frequencies E using the statistic:

χ² = Σ (O − E)² / E

Large χ² values mean the observed data deviate substantially from what's expected.
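A sketch of the statistic for a 2×2 test of independence, with a hypothetical contingency table (the counts are made up for illustration):

```python
# Hypothetical 2x2 contingency table (illustrative counts)
observed = [[30, 20],   # rows: group A / group B
            [15, 35]]   # cols: outcome 1 / outcome 2

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: (row total * column total) / grand total
def expected(i, j):
    return row_totals[i] * col_totals[j] / total

chi_square = sum(
    (observed[i][j] - expected(i, j)) ** 2 / expected(i, j)
    for i in range(2) for j in range(2)
)
print(round(chi_square, 2))
```

The statistic is then compared to a χ² distribution with (rows − 1)(cols − 1) degrees of freedom, here just 1.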
Two main applications:

- Goodness-of-fit: does the distribution of one categorical variable match a hypothesized distribution?
- Test of independence: are two categorical variables associated, or are they independent?
Assumptions: observations are independent, and expected cell counts should generally be at least 5. For small expected counts, consider Fisher's exact test instead.
Compare: Chi-Square vs. t-Test: chi-square tests work with categorical data and frequencies, while t-tests require a continuous outcome variable. If your dependent variable is "pass/fail" or "category A/B/C," use chi-square. If it's a measurement like height or test score, use a t-test.
Regression methods go beyond "is there a difference?" to ask "how are variables related, and can we predict outcomes?" These techniques model the functional form of relationships and quantify how predictors influence responses.
The general linear regression model is y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε, where ε represents random error.
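For one predictor, the least-squares estimates have a closed form: the slope is the covariance of x and y over the variance of x. A sketch with hypothetical data (the x/y values are made up for illustration):

```python
from statistics import mean

# Hypothetical data: one predictor x, one response y (illustrative)
x = [1, 2, 3, 4, 5, 6]
y = [52, 58, 61, 67, 70, 78]

x_bar, y_bar = mean(x), mean(y)

# Ordinary least squares for one predictor: slope = cov(x, y) / var(x)
beta1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
beta0 = y_bar - beta1 * x_bar   # the fitted line passes through (x_bar, y_bar)

print(f"y = {beta0:.2f} + {beta1:.2f} x + error")
```

With multiple predictors the same idea generalizes to the matrix normal equations, which is what library routines like `numpy.linalg.lstsq` solve.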
Compare: ANOVA vs. Regression: these are mathematically equivalent for comparing group means (ANOVA is just regression with dummy-coded categorical predictors). Regression extends naturally to continuous predictors and more complex models. Use ANOVA language for experimental designs with categorical factors, and regression language for predictive modeling with continuous or mixed predictors.
Classical frequentist methods aren't the only approach. These alternatives either relax distributional assumptions or incorporate prior knowledge, expanding your inferential toolkit.
Bayesian inference updates prior beliefs with observed data using Bayes' theorem:

P(θ | data) = P(data | θ) · P(θ) / P(data)

In words: posterior ∝ likelihood × prior.
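A sketch of the update in the simplest conjugate case, a Beta prior on a success rate with Binomial data (the prior and counts are hypothetical):

```python
# Hypothetical success-rate example with a conjugate Beta prior (illustrative)
prior_alpha, prior_beta = 2, 8      # prior belief: rate somewhere around 20%
successes, failures = 25, 75        # observed data: 25 successes in 100 trials

# Beta-Binomial conjugacy makes the update a simple addition of counts
post_alpha = prior_alpha + successes
post_beta = prior_beta + failures

posterior_mean = post_alpha / (post_alpha + post_beta)
print(round(posterior_mean, 3))
```

The posterior mean (27/110 ≈ 0.245) sits between the prior mean (0.2) and the observed proportion (0.25), pulled toward the data because there are far more observations than prior pseudo-counts. With more data, the prior's influence shrinks further.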
Non-parametric tests don't assume a specific distribution (like normality) for the data. They typically work with ranks rather than raw values.
Bootstrapping estimates the sampling distribution of a statistic by resampling with replacement from your original data, typically thousands of times.
This approach is especially valuable when theoretical formulas for standard errors don't exist or when the sampling distribution is hard to derive analytically (e.g., for medians, correlation coefficients, or custom statistics).
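A sketch of a percentile bootstrap CI for a median, exactly the kind of statistic with no tidy standard-error formula (the skewed data values are made up for illustration):

```python
import random
from statistics import median

random.seed(42)  # reproducible resampling

# Hypothetical right-skewed data, e.g., response times (illustrative)
data = [12, 14, 15, 15, 16, 18, 21, 25, 40, 95]

# Resample with replacement many times, recomputing the statistic each time
n_boot = 5000
boot_medians = sorted(
    median(random.choices(data, k=len(data))) for _ in range(n_boot)
)

# Percentile bootstrap 95% CI for the median: cut off 2.5% in each tail
ci_low = boot_medians[int(0.025 * n_boot)]
ci_high = boot_medians[int(0.975 * n_boot)]
print(ci_low, ci_high)
```

The spread of the resampled medians stands in for the sampling distribution you couldn't derive analytically, and no normality assumption is needed.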
Compare: Frequentist vs. Bayesian: frequentists treat parameters as fixed and data as random; Bayesians treat parameters as random variables with distributions. A frequentist 95% CI says "95% of intervals constructed this way would contain the true value." A Bayesian 95% credible interval says "there's a 95% probability the parameter lies in this range." Know which framework a problem assumes, because the interpretation changes.
| Concept | Best Examples |
|---|---|
| Point vs. Interval Estimation | Point Estimation, Confidence Intervals, MLE |
| Testing Framework | Hypothesis Testing, Power Analysis, Effect Size |
| Comparing Two Means | t-Tests, z-Tests |
| Comparing 3+ Means | ANOVA, F-Tests |
| Categorical Data | Chi-Square Tests |
| Modeling Relationships | Regression Analysis |
| Distribution-Free Methods | Non-parametric Tests, Bootstrapping |
| Incorporating Prior Knowledge | Bayesian Inference |
What's the key difference between a confidence interval and a Bayesian credible interval, and when would you prefer one over the other?
You're comparing customer satisfaction scores across four different product versions. Which test should you use, and why would running six separate t-tests be problematic?
How do maximum likelihood estimation and Bayesian inference differ in their treatment of parameters? What additional input does Bayesian inference require?
Your data are heavily skewed and your sample size is only 15. Which methods from this guide would still be valid, and which would you avoid?
A study reports a statistically significant p-value but a very small Cohen's d. Explain what this combination tells you about the results and what you'd want to know about the study design.