🎲 Data, Inference, and Decisions

Hypothesis Testing Methods


Why This Matters

Hypothesis testing is how you move from "I think there's a pattern here" to "I can confidently claim this effect is real." Every time a study claims a new drug works, a policy made a difference, or two groups behave differently, hypothesis testing is doing the heavy lifting behind the scenes. You're being tested on your ability to choose the right test for the right situation, interpret results correctly, and understand what "statistically significant" actually means.

The methods in this guide aren't just formulas to memorize. They represent different tools for different jobs. Some compare means, others compare variances or distributions. Some require your data to be normally distributed; others don't care. The key concepts to master include parametric vs. non-parametric approaches, comparing means vs. variances vs. distributions, and when assumptions matter. Don't just memorize which test does what. Know why you'd reach for one tool instead of another.


Comparing Means: The Workhorses of Hypothesis Testing

Most hypothesis tests you'll encounter ask a simple question: are these means different? The tests below handle this question under different conditions: known vs. unknown variance, one group vs. two, independent vs. paired observations.

Z-Test

  • Use when population variance is known. This is rare in practice but common on exams as a foundational concept.
  • Requires normality or large samples ($n > 30$) thanks to the Central Limit Theorem.
  • Test statistic: $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$, where $\sigma$ is the known population standard deviation.

The numerator measures how far your sample mean is from the hypothesized population mean. The denominator is the standard error, which captures how much sampling variability you'd expect. A large $Z$ value means your sample mean is far from $\mu$ relative to what random chance would produce.
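As a sketch of the formula above, the Z statistic and its two-sided p-value can be computed with nothing beyond the standard library; the sample numbers here are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n):
    """One-sample Z-test. Returns (z, two-sided p-value).

    sigma is the *known* population standard deviation."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))   # distance from mu0 in standard errors
    p = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided tail probability
    return z, p

# Hypothetical numbers: n = 36, sample mean 52, testing mu = 50 with sigma = 6
z, p = one_sample_z_test(52, 50, 6, 36)  # z = 2.0, p ≈ 0.0455
```

With the standard error $6/\sqrt{36} = 1$, the sample mean sits 2 standard errors above the hypothesized mean, which is significant at the usual 5% level.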

T-Test (One-Sample, Two-Sample, Paired)

  • One-sample T-test compares a sample mean to a hypothesized population mean when variance is unknown. It uses the sample standard deviation $s$ instead of $\sigma$, which adds uncertainty and produces heavier tails in the distribution.
  • Two-sample T-test compares means from two independent groups. You choose between the pooled version (assumes equal variances in both groups) and Welch's version (allows unequal variances). When in doubt, Welch's is the safer choice.
  • Paired T-test handles dependent observations (before/after measurements, matched subjects). It analyzes the differences within pairs, not the raw scores, effectively reducing the problem to a one-sample T-test on those differences.
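All three variants have direct counterparts in `scipy.stats`; here is a quick sketch on simulated data (assuming SciPy is available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.normal(100, 10, size=20)        # hypothetical pre-treatment scores
after = before + rng.normal(2, 5, size=20)   # same subjects after treatment
other = rng.normal(105, 12, size=25)         # an independent comparison group

# One-sample: is the mean of `before` different from 100?
t1, p1 = stats.ttest_1samp(before, popmean=100)

# Two-sample Welch test (equal_var=False allows unequal variances -- the safer default):
t2, p2 = stats.ttest_ind(before, other, equal_var=False)

# Paired: equivalent to a one-sample test on the within-pair differences
t3, p3 = stats.ttest_rel(after, before)
```

Note that `ttest_rel(after, before)` gives exactly the same result as `ttest_1samp(after - before, 0)`, which is the "reduce to differences" idea in the paired bullet above.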

Compare: Z-test vs. T-test: both compare means to a reference value, but Z-tests require known population variance while T-tests estimate it from sample data. On exams, if they give you $\sigma$, think Z-test; if they give you $s$, think T-test. As $n$ grows large, the T-distribution converges to the Z-distribution, so the distinction matters most with small samples.


Comparing Multiple Groups: Beyond Two Means

When you have three or more groups, running multiple T-tests inflates your Type I error rate (the probability of falsely rejecting a true null hypothesis). With three groups, you'd need three pairwise T-tests, and the chance of at least one false positive climbs well above your chosen $\alpha$. ANOVA solves this by testing all groups simultaneously using variance decomposition.

ANOVA (One-Way, Two-Way)

  • One-way ANOVA tests whether means differ across three or more groups based on a single factor. The null hypothesis is $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$.
  • Two-way ANOVA examines two factors simultaneously and can detect interaction effects, which occur when the impact of one factor depends on the level of another. For example, a drug might work differently for men and women.
  • The F-statistic is the ratio of between-group variance to within-group variance: $F = \frac{MS_{between}}{MS_{within}}$. Larger values suggest the group means truly differ, because variation between groups is large relative to the noise within groups.
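A minimal sketch with hypothetical scores for three teaching methods, checking that `scipy.stats.f_oneway` agrees with the hand-computed ratio MS_between / MS_within (assuming SciPy is available):

```python
import numpy as np
from scipy import stats

# Hypothetical scores under three teaching methods
g1 = np.array([82., 85., 88., 90., 79.])
g2 = np.array([75., 78., 80., 74., 77.])
g3 = np.array([90., 92., 88., 95., 91.])

F, p = stats.f_oneway(g1, g2, g3)

# The same F by hand, as the ratio of mean squares:
groups = [g1, g2, g3]
grand = np.concatenate(groups).mean()
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_between = ss_between / (len(groups) - 1)                          # df1 = k - 1
ms_within = ss_within / (sum(len(g) for g in groups) - len(groups))  # df2 = N - k
```

Because the three group means (roughly 85, 77, and 91) are far apart relative to the within-group spread, the F-statistic is large and the p-value is tiny.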

F-Test

  • Compares two variances to test if they're significantly different: $F = \frac{s_1^2}{s_2^2}$, which follows an F-distribution under the null.
  • Underlies ANOVA calculations. Every ANOVA table includes an F-statistic for testing mean equality.
  • Assumes normality and independence. The F-test is sensitive to violations of normality, especially with small samples, so check your assumptions before relying on it.
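A sketch of the variance-ratio test on hypothetical measurements (assuming SciPy); one common convention is to put the larger sample variance in the numerator and double the upper-tail area for a two-sided p-value:

```python
import numpy as np
from scipy import stats

# Hypothetical measurements from two machines
a = np.array([12.1, 11.8, 12.5, 12.0, 11.9, 12.3])
b = np.array([12.4, 10.9, 13.5, 11.2, 12.8, 13.9])

s_a, s_b = np.var(a, ddof=1), np.var(b, ddof=1)  # sample variances (ddof=1)
F = s_b / s_a                                    # larger variance on top, by convention
df_num, df_den = len(b) - 1, len(a) - 1
p = 2 * stats.f.sf(F, df_num, df_den)            # two-sided: double the upper tail
```

Here machine `b` is visibly more variable than machine `a`, so the ratio lands far out in the F-distribution's tail.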

Compare: T-test vs. ANOVA: T-tests handle two groups; ANOVA handles three or more. If a problem gives you multiple treatment conditions, ANOVA is your go-to. Remember: a significant ANOVA result tells you something differs but not what. You need post-hoc tests (like Tukey's HSD) to identify which specific group means are different from each other.


Testing Relationships: Regression and Model Comparison

These methods go beyond "are groups different?" to ask "how are variables related?" and "which model explains the data better?" Regression quantifies relationships; likelihood ratio tests compare competing model specifications.

Regression Analysis (Simple and Multiple)

  • Simple regression models the relationship between one predictor and one outcome: $Y = \beta_0 + \beta_1 X + \epsilon$. Here $\beta_0$ is the intercept, $\beta_1$ is the slope (the change in $Y$ for a one-unit change in $X$), and $\epsilon$ is the error term.
  • Multiple regression includes two or more predictors, allowing you to control for confounding variables and assess each predictor's unique contribution while holding the others constant.
  • Hypothesis tests on coefficients (testing $H_0: \beta_j = 0$) determine whether a predictor has a statistically significant linear relationship with the outcome. These typically use T-tests on individual coefficients and F-tests for joint significance of multiple coefficients.
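A sketch of simple regression on simulated data (assuming SciPy): `scipy.stats.linregress` returns the fitted slope along with the T-test p-value for the null hypothesis that the slope is zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=40)  # true beta0 = 2, beta1 = 0.5, plus noise

res = stats.linregress(x, y)
# res.slope estimates beta1; res.pvalue is the T-test of H0: beta1 = 0
```

Since the data were generated with a genuine slope of 0.5, the estimate lands near 0.5 and the p-value is far below 0.05.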

Likelihood Ratio Test

  • Compares nested models by examining the ratio of their maximum likelihoods: $\Lambda = \frac{L(\text{restricted})}{L(\text{unrestricted})}$. The restricted (null) model is a special case of the unrestricted (alternative) model with some parameters set to zero or constrained.
  • The test statistic $-2 \ln(\Lambda)$ follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between the two models.
  • Essential for logistic regression and generalized linear models where you can't simply compare $R^2$ values or use standard F-tests.
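A worked sketch with hypothetical coin-flip data: the restricted model fixes p = 0.5, the unrestricted model uses the sample proportion, and one freed parameter gives df = 1. Since a chi-square variable with 1 df is a squared standard normal, the p-value needs only the standard library:

```python
from math import log, sqrt
from statistics import NormalDist

# Hypothetical data: 62 heads in 100 coin flips.
# Restricted (null) model: p = 0.5. Unrestricted: p = p_hat.
n, k = 100, 62
p_hat = k / n

def binom_loglik(p):
    # Log-likelihood of k successes in n trials
    # (the binomial coefficient is common to both models and cancels in the ratio)
    return k * log(p) + (n - k) * log(1 - p)

stat = -2 * (binom_loglik(0.5) - binom_loglik(p_hat))  # -2 ln(Lambda)

# One parameter is freed, so df = 1; chi-square(1) is a squared Z,
# so P(chi2_1 > x) = 2 * (1 - Phi(sqrt(x)))
p_value = 2 * (1 - NormalDist().cdf(sqrt(stat)))
```

The statistic comes out near 5.8 with a p-value around 0.016, so the restricted p = 0.5 model is rejected at the 5% level.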

Compare: Regression coefficient tests vs. Likelihood ratio tests: coefficient tests ask "does this one predictor matter?" while likelihood ratio tests ask "does this set of predictors improve the model overall?" Use likelihood ratio tests when comparing models with different numbers of parameters, especially outside the OLS framework.


Categorical Data: When Means Don't Apply

Not all data is continuous. When you're working with counts, categories, or frequencies, you need tests designed for discrete distributions. The chi-square test is your primary tool here.

Chi-Square Test

The chi-square test compares what you actually observe in your data to what you'd expect if there were no association between the variables.

  • Test of independence: tests whether two categorical variables are associated by comparing observed cell frequencies to expected frequencies under independence.
  • Goodness-of-fit: tests whether observed frequencies match a hypothesized distribution (e.g., are dice rolls uniformly distributed?).
  • Test statistic: $\chi^2 = \sum \frac{(O - E)^2}{E}$, where $O$ = observed counts and $E$ = expected counts under the null hypothesis.
  • Requires adequate expected frequencies. The rule of thumb is $E \geq 5$ in each cell. If this isn't met, consider Fisher's exact test instead.
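The formula above can be applied by hand to a hypothetical 2x2 table; expected counts come from (row total x column total) / grand total, and because a 2x2 table has df = 1, the p-value again needs only the standard normal CDF:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical 2x2 table: preference for design A vs. B, by age group (rows)
observed = [[30, 20],
            [18, 32]]

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
grand = sum(row_totals)

chi2 = 0.0
for i, r in enumerate(observed):
    for j, o in enumerate(r):
        expected = row_totals[i] * col_totals[j] / grand  # E under independence
        chi2 += (o - expected) ** 2 / expected

# df = (2-1)*(2-1) = 1, and chi-square(1) is a squared Z:
p = 2 * (1 - NormalDist().cdf(sqrt(chi2)))
```

Every expected count here is at least 24, so the $E \geq 5$ rule of thumb is comfortably satisfied.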

Compare: Chi-square vs. T-test: Chi-square handles categorical data (counts in categories); T-tests handle continuous data (measured values). If your data is "how many people chose option A vs. B," think chi-square. If it's "what was the average score," think T-test.


Non-Parametric Alternatives: When Assumptions Fail

Parametric tests (Z, T, ANOVA) assume your data is normally distributed. When that assumption is violated due to small samples, skewed distributions, or ordinal data, non-parametric tests provide valid alternatives by working with ranks instead of raw values.

Wilcoxon Rank-Sum Test

  • Non-parametric alternative to the two-sample T-test. Compares distributions of two independent groups without assuming normality.
  • Works with ranks rather than actual values, making it robust to outliers and skewed data. You pool all observations, rank them from smallest to largest, then compare the sum of ranks between groups.
  • Also called the Mann-Whitney U test. Same procedure, different name. Know both for exams.
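The pool-and-rank procedure can be sketched with the normal approximation to the rank-sum distribution; this simplified version assumes no ties, and for very small samples an exact test would be preferable:

```python
from math import sqrt
from statistics import NormalDist

def rank_sum_test(x, y):
    """Wilcoxon rank-sum via the normal approximation (assumes no ties)."""
    pooled = sorted(x + y)
    rank = {v: i + 1 for i, v in enumerate(pooled)}
    W = sum(rank[v] for v in x)              # sum of x's ranks in the pooled sample
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2            # E[W] under H0
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # sd of W under H0
    z = (W - mean) / sd
    return W, 2 * NormalDist().cdf(-abs(z))  # two-sided p-value

# Hypothetical groups where y tends to have larger values
W, p = rank_sum_test([1.1, 2.3, 2.9, 4.0], [3.5, 5.2, 6.1, 7.8, 8.4])
```

Because the statistic uses only ranks, replacing 8.4 with 8400 would not change the result at all, which is exactly the robustness to outliers described above.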

Kolmogorov-Smirnov Test

  • Compares entire distributions, not just central tendency. Tests whether two samples come from the same underlying population.
  • Examines cumulative distribution functions (CDFs). The test statistic is the maximum vertical distance between the two empirical CDFs, making it sensitive to differences in location, spread, and shape.
  • The one-sample version tests whether data follows a specific theoretical distribution (e.g., "is this sample normally distributed?").
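The K-S statistic itself is easy to compute directly: evaluate both empirical CDFs at every data point and take the largest gap. A minimal sketch (statistic only, no p-value):

```python
from bisect import bisect_right

def ks_two_sample(x, y):
    """Two-sample K-S statistic: the largest vertical gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in xs + ys:                        # the ECDF gap can only change at data points
        fx = bisect_right(xs, v) / len(xs)   # ECDF of x evaluated at v
        fy = bisect_right(ys, v) / len(ys)   # ECDF of y evaluated at v
        d = max(d, abs(fx - fy))
    return d

d1 = ks_two_sample([1, 2, 3], [4, 5, 6])        # completely separated samples
d2 = ks_two_sample([1, 3, 5, 7], [2, 4, 6, 8])  # interleaved samples
```

Completely separated samples give the maximum possible statistic of 1.0, while interleaved samples give a small value, matching the intuition that the K-S test measures how far apart the two distributions sit.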

Compare: Wilcoxon vs. Kolmogorov-Smirnov: both are non-parametric, but Wilcoxon focuses on whether one group tends to have larger values (similar to a median comparison), while K-S tests whether the entire distribution shapes match. Wilcoxon is your T-test replacement; K-S is for full distribution comparison.


Resampling Methods: Let the Data Speak

When you can't rely on theoretical distributions or your sample is unusual, bootstrap methods let you empirically estimate sampling distributions by repeatedly resampling your own data.

Bootstrap Methods

Here's how bootstrapping works:

  1. Take your original sample of $n$ observations.
  2. Draw a new sample of size $n$ with replacement from the original data. Some observations will appear multiple times; others won't appear at all.
  3. Compute the statistic of interest (mean, median, ratio, etc.) on this resampled data.
  4. Repeat steps 2-3 thousands of times (typically 1,000 to 10,000).
  5. The distribution of those computed statistics is your bootstrap distribution, which approximates the true sampling distribution.

You can use this bootstrap distribution to estimate confidence intervals and standard errors without requiring normality or known formulas. This is particularly useful for complex statistics like medians, ratios, or correlation coefficients where theoretical distributions are hard to derive.
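The steps above can be sketched in a few lines of standard-library Python, here building a simple percentile 95% confidence interval for the median of a hypothetical sample:

```python
import random
from statistics import median

random.seed(42)
data = [2.1, 3.4, 1.8, 5.6, 2.9, 4.2, 3.1, 2.5, 6.0, 3.8]  # hypothetical sample

B = 5000
boot_medians = sorted(
    median(random.choices(data, k=len(data)))  # resample WITH replacement, same n
    for _ in range(B)
)

# Simple percentile 95% CI: the 2.5th and 97.5th percentiles of the bootstrap distribution
lo, hi = boot_medians[int(0.025 * B)], boot_medians[int(0.975 * B)]
```

This is the basic percentile method; refinements such as BCa intervals correct for bias and skew in the bootstrap distribution, but the resampling loop stays the same.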

The bootstrap is assumption-light but not assumption-free. It does assume your original sample reasonably represents the population. If your sample is biased or too small to capture the population's structure, bootstrapping won't fix that.

Compare: Bootstrap vs. Traditional tests: traditional tests use theoretical distributions (Z, T, F, chi-square); bootstrap builds the distribution empirically from your data. When a problem mentions "violated assumptions" or asks about inference for an unusual statistic, bootstrap is often the right approach.


Quick Reference Table

| Situation | Best Test(s) |
| --- | --- |
| Comparing one mean to a known value | Z-test (variance known), One-sample T-test (variance unknown) |
| Comparing two independent means | Two-sample T-test, Wilcoxon rank-sum |
| Comparing paired/dependent observations | Paired T-test |
| Comparing three or more means | One-way ANOVA, Two-way ANOVA |
| Comparing variances | F-test |
| Testing categorical associations | Chi-square test |
| Modeling relationships between variables | Simple regression, Multiple regression |
| Comparing nested model fit | Likelihood ratio test |
| Distribution comparison (non-parametric) | Kolmogorov-Smirnov test |
| Assumption-free inference | Bootstrap methods |

Self-Check Questions

  1. You have two independent groups and non-normal data with several outliers. Which two tests could you use, and why might you prefer the non-parametric option?

  2. A researcher wants to test whether a new teaching method improves scores by measuring the same students before and after the intervention. Which test is appropriate, and why would a two-sample T-test be incorrect here?

  3. Compare one-way ANOVA and the two-sample T-test. Under what conditions does ANOVA become necessary, and what additional information does two-way ANOVA provide?

  4. A problem presents count data showing how many customers preferred each of four product designs across three age groups. Which test would you use, and what assumption must you verify before proceeding?

  5. Your data violates normality assumptions, and you need to construct a 95% confidence interval for the median. Which method allows you to do this without relying on theoretical distributions, and briefly describe how it works.