Hypothesis testing is how you move from "I think there's a pattern here" to "I can confidently claim this effect is real." Every time a study claims a new drug works, a policy made a difference, or two groups behave differently, hypothesis testing is doing the heavy lifting behind the scenes. You're being tested on your ability to choose the right test for the right situation, interpret results correctly, and understand what "statistically significant" actually means.
The methods in this guide aren't just formulas to memorize. They represent different tools for different jobs. Some compare means, others compare variances or distributions. Some require your data to be normally distributed; others don't care. The key concepts to master include parametric vs. non-parametric approaches, comparing means vs. variances vs. distributions, and when assumptions matter. Don't just memorize which test does what. Know why you'd reach for one tool instead of another.
Most hypothesis tests you'll encounter ask a simple question: are these means different? The tests below handle this question under different conditions: known vs. unknown variance, one group vs. two, independent vs. paired observations.
The test statistic has the form Z = (x̄ − μ₀) / (σ / √n). The numerator measures how far your sample mean is from the hypothesized population mean. The denominator is the standard error, which captures how much sampling variability you'd expect. A large value means your sample mean is far from μ₀ relative to what random chance would produce.
Compare: Z-test vs. T-test: both compare means to a reference value, but Z-tests require known population variance while T-tests estimate it from sample data. On exams, if they give you the population standard deviation σ, think Z-test; if they give you only the sample standard deviation s, think T-test. As the sample size n grows large, the T-distribution converges to the Z-distribution, so the distinction matters most with small samples.
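A minimal sketch of the distinction, using a made-up sample and a hypothesized mean of 50 (all values illustrative). The only difference between the two statistics is whether the denominator uses the known σ or the estimated s:

```python
import math
import statistics

# Hypothetical exam-style sample (assumed data)
sample = [52.1, 49.8, 51.5, 50.9, 48.7, 53.2, 50.4, 51.8]
mu0 = 50.0      # hypothesized population mean
sigma = 2.0     # population std dev, known -> Z-test applies

n = len(sample)
xbar = statistics.mean(sample)

# Z-test: known population variance
z = (xbar - mu0) / (sigma / math.sqrt(n))

# T-test: variance estimated from the sample (n - 1 degrees of freedom)
s = statistics.stdev(sample)
t = (xbar - mu0) / (s / math.sqrt(n))

print(f"xbar = {xbar:.3f}, Z = {z:.3f}, T = {t:.3f}")
```

Note that Z and T differ here only because s ≠ σ; with a large n, s converges to σ and the two statistics (and their reference distributions) coincide.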
When you have three or more groups, running multiple T-tests inflates your Type I error rate (the probability of falsely rejecting a true null hypothesis). With three groups, you'd need three pairwise T-tests, and the chance of at least one false positive climbs well above your chosen significance level α. ANOVA solves this by testing all groups simultaneously using variance decomposition.
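The inflation is easy to quantify for independent tests: with k tests each run at level α, the familywise error rate is 1 − (1 − α)^k. A minimal sketch:

```python
# Familywise Type I error across k independent tests, each at level alpha.
def familywise_error(k, alpha=0.05):
    """P(at least one false positive) across k independent tests."""
    return 1 - (1 - alpha) ** k

for k in [1, 3, 6, 10]:
    print(f"{k} tests -> familywise error = {familywise_error(k):.3f}")
```

With three pairwise tests at α = 0.05, the familywise error is already about 0.143, nearly three times the nominal level.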
Compare: T-test vs. ANOVA: T-tests handle two groups; ANOVA handles three or more. If a problem gives you multiple treatment conditions, ANOVA is your go-to. Remember: a significant ANOVA result tells you something differs but not what. You need post-hoc tests (like Tukey's HSD) to identify which specific group means are different from each other.
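A minimal sketch of the variance decomposition behind one-way ANOVA, using made-up data for three treatment groups. The F statistic is the ratio of between-group variability to within-group variability:

```python
import statistics

# Three hypothetical treatment groups (assumed data)
groups = [
    [23.0, 25.0, 21.0, 24.0],   # treatment A
    [30.0, 28.0, 31.0, 29.0],   # treatment B
    [22.0, 24.0, 23.0, 25.0],   # treatment C
]

k = len(groups)                      # number of groups
n = sum(len(g) for g in groups)      # total observations
grand_mean = sum(sum(g) for g in groups) / n

# Between-group sum of squares: distance of each group mean from the grand mean
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f"F({k - 1}, {n - k}) = {f_stat:.2f}")
```

A large F means the group means spread out far more than chance alone would explain; a significant result would then call for post-hoc comparisons to locate which groups differ.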
These methods go beyond "are groups different?" to ask "how are variables related?" and "which model explains the data better?" Regression quantifies relationships; likelihood ratio tests compare competing model specifications.
Compare: Regression coefficient tests vs. Likelihood ratio tests: coefficient tests ask "does this one predictor matter?" while likelihood ratio tests ask "does this set of predictors improve the model overall?" Use likelihood ratio tests when comparing models with different numbers of parameters, especially outside the OLS framework.
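A sketch of the likelihood ratio mechanics, using hypothetical log-likelihood values (not from real data). The statistic is twice the log-likelihood gain of the larger nested model, compared against a chi-square with degrees of freedom equal to the number of extra parameters:

```python
# Hypothetical log-likelihoods for two *nested* models (assumed values)
ll_reduced = -120.4   # smaller model
ll_full = -115.1      # model with one extra parameter

lr_stat = 2 * (ll_full - ll_reduced)   # ~ chi-square(df) under H0
df = 1                                 # difference in parameter counts
chi2_crit_df1 = 3.841                  # chi-square 95th percentile, df = 1

print(f"LR = {lr_stat:.2f}; reject H0 at 5%? {lr_stat > chi2_crit_df1}")
```

Here LR = 10.60 exceeds the critical value, so the extra parameter earns its keep; a non-significant LR would favor the simpler model.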
Not all data is continuous. When you're working with counts, categories, or frequencies, you need tests designed for discrete distributions. The chi-square test is your primary tool here.
The chi-square test compares what you actually observe in your data to what you'd expect if there were no association between the variables.
Compare: Chi-square vs. T-test: Chi-square handles categorical data (counts in categories); T-tests handle continuous data (measured values). If your data is "how many people chose option A vs. B," think chi-square. If it's "what was the average score," think T-test.
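A sketch of the chi-square computation for a hypothetical 2×2 table of counts, building each expected count from the row and column totals under the independence assumption:

```python
# Observed counts (assumed data): rows = group, columns = chose option A vs. B
observed = [
    [30, 10],   # group 1
    [20, 40],   # group 2
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: (row total * column total) / grand total
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi-square = {chi2:.2f} with df = {df}")
```

The usual caveat applies: the approximation relies on expected counts being large enough (a common rule of thumb is at least 5 per cell).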
Parametric tests (Z, T, ANOVA) assume your data is normally distributed. When that assumption is violated due to small samples, skewed distributions, or ordinal data, non-parametric tests provide valid alternatives by working with ranks instead of raw values.
Compare: Wilcoxon vs. Kolmogorov-Smirnov: both are non-parametric, but Wilcoxon focuses on whether one group tends to have larger values (similar to a median comparison), while K-S tests whether the entire distributions match in shape. Wilcoxon is your T-test replacement; K-S is for full distribution comparison.
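The K-S statistic is just the largest vertical gap between the two empirical CDFs. A minimal sketch with made-up samples:

```python
def ecdf(sample, x):
    """Empirical CDF: fraction of observations in sample that are <= x."""
    return sum(1 for v in sample if v <= x) / len(sample)

# Two small hypothetical samples (assumed data)
a = [1.2, 2.4, 3.1, 4.8, 5.0]
b = [2.0, 3.5, 4.1, 6.2, 7.7]

# The maximum gap is attained at an observed data point,
# so checking all observed points suffices.
points = sorted(a + b)
d_stat = max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)
print(f"K-S D = {d_stat:.2f}")
```

Because D depends only on ranks of the pooled data, no normality assumption is needed; larger D means the two distributions diverge more.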
When you can't rely on theoretical distributions or your sample is unusual, bootstrap methods let you empirically estimate sampling distributions by repeatedly resampling your own data.
Here's how bootstrapping works:

1. Draw a resample of the same size as your original sample, sampling with replacement.
2. Compute your statistic of interest (mean, median, correlation, etc.) on that resample.
3. Repeat many times (often 1,000 or more) to build up an empirical distribution of the statistic.
You can use this bootstrap distribution to estimate confidence intervals and standard errors without requiring normality or known formulas. This is particularly useful for complex statistics like medians, ratios, or correlation coefficients where theoretical distributions are hard to derive.
The bootstrap is assumption-light but not assumption-free. It does assume your original sample reasonably represents the population. If your sample is biased or too small to capture the population's structure, bootstrapping won't fix that.
Compare: Bootstrap vs. Traditional tests: traditional tests use theoretical distributions (Z, T, F, chi-square); bootstrap builds the distribution empirically from your data. When a problem mentions "violated assumptions" or asks about inference for an unusual statistic, bootstrap is often the right approach.
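A minimal sketch of a bootstrap percentile confidence interval for a median, using made-up skewed data where a normal-theory interval would be dubious:

```python
import random
import statistics

random.seed(0)  # reproducible resampling

# Skewed sample (assumed data) with outliers pulling the mean upward
data = [2.1, 2.3, 2.4, 2.6, 2.9, 3.0, 3.3, 4.8, 7.5, 12.0]

# Bootstrap: resample with replacement, recompute the statistic each time
boot_medians = []
for _ in range(5000):
    resample = random.choices(data, k=len(data))
    boot_medians.append(statistics.median(resample))

# Percentile CI: take the 2.5th and 97.5th percentiles of the bootstrap distribution
boot_medians.sort()
lo = boot_medians[int(0.025 * len(boot_medians))]
hi = boot_medians[int(0.975 * len(boot_medians))]
print(f"sample median = {statistics.median(data)}, 95% CI = ({lo}, {hi})")
```

No theoretical sampling distribution for the median was needed; the interval comes entirely from the resampled statistics, which is exactly why this approach extends to ratios and correlations as well.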
| Situation | Best Test(s) |
|---|---|
| Comparing one mean to a known value | Z-test (variance known), One-sample T-test (variance unknown) |
| Comparing two independent means | Two-sample T-test, Wilcoxon rank-sum |
| Comparing paired/dependent observations | Paired T-test |
| Comparing three or more means | One-way ANOVA, Two-way ANOVA |
| Comparing variances | F-test |
| Testing categorical associations | Chi-square test |
| Modeling relationships between variables | Simple regression, Multiple regression |
| Comparing nested model fit | Likelihood ratio test |
| Distribution comparison (non-parametric) | Kolmogorov-Smirnov test |
| Assumption-light inference | Bootstrap methods |
You have two independent groups and non-normal data with several outliers. Which two tests could you use, and why might you prefer the non-parametric option?
A researcher wants to test whether a new teaching method improves scores by measuring the same students before and after the intervention. Which test is appropriate, and why would a two-sample T-test be incorrect here?
Compare one-way ANOVA and the two-sample T-test. Under what conditions does ANOVA become necessary, and what additional information does two-way ANOVA provide?
A problem presents count data showing how many customers preferred each of four product designs across three age groups. Which test would you use, and what assumption must you verify before proceeding?
Your data violates normality assumptions, and you need to construct a 95% confidence interval for the median. Which method allows you to do this without relying on theoretical distributions, and briefly describe how it works.