
🎲 Data, Inference, and Decisions

Hypothesis Testing Methods


Why This Matters

Hypothesis testing is the backbone of statistical inference—it's how you move from "I think there's a pattern here" to "I can confidently claim this effect is real." Every time you see a study claiming a new drug works, a policy made a difference, or two groups behave differently, hypothesis testing is doing the heavy lifting behind the scenes. You're being tested on your ability to choose the right test for the right situation, interpret results correctly, and understand what "statistically significant" actually means.

The methods in this guide aren't just formulas to memorize—they represent different tools for different jobs. Some compare means, others compare variances or distributions. Some require your data to be normally distributed; others don't care. The key concepts you need to master include parametric vs. non-parametric approaches, comparing means vs. variances vs. distributions, and when assumptions matter. Don't just memorize which test does what—know why you'd reach for one tool instead of another.


Comparing Means: The Workhorses of Hypothesis Testing

Most hypothesis tests you'll encounter ask a simple question: are these means different? The tests below handle this question under different conditions—known vs. unknown variance, one group vs. two, independent vs. paired observations.

Z-Test

  • Use when population variance is known—this is rare in practice but common on exams as a foundational concept
  • Requires normality or large samples ($n > 30$) thanks to the Central Limit Theorem
  • Test statistic formula: $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$, where $\sigma$ is the known population standard deviation
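
A minimal sketch in Python of computing this Z statistic by hand; the sample values, hypothesized mean, and "known" standard deviation below are invented purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical data and parameters (illustration only)
x = np.array([102, 98, 110, 105, 99, 107, 103, 101, 108, 104])
mu_0 = 100   # hypothesized population mean
sigma = 15   # known population standard deviation

# Z = (sample mean - hypothesized mean) / (sigma / sqrt(n))
z = (x.mean() - mu_0) / (sigma / np.sqrt(len(x)))

# Two-sided p-value from the standard normal distribution
p_value = 2 * stats.norm.sf(abs(z))
print(f"Z = {z:.3f}, p = {p_value:.3f}")
```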

T-Test (One-Sample, Two-Sample, Paired)

  • One-sample T-test compares a sample mean to a hypothesized population mean when variance is unknown—uses $s$ instead of $\sigma$
  • Two-sample T-test compares means from two independent groups; choose pooled or Welch's version based on whether variances are equal
  • Paired T-test handles dependent observations (before/after, matched subjects)—analyzes the differences within pairs, not raw scores
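
A quick sketch of all three T-test variants using scipy.stats; the data arrays are simulated purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(50, 10, size=20)   # hypothetical scores, group A
group_b = rng.normal(55, 10, size=20)   # hypothetical scores, group B
before = rng.normal(70, 8, size=15)     # same subjects measured twice
after = before + rng.normal(3, 4, size=15)

# One-sample: is the mean of group_a different from 52?
t1, p1 = stats.ttest_1samp(group_a, popmean=52)

# Two-sample: Welch's version (equal_var=False) does not assume equal variances
t2, p2 = stats.ttest_ind(group_a, group_b, equal_var=False)

# Paired: analyzes the within-pair differences
t3, p3 = stats.ttest_rel(before, after)

print(f"one-sample  t={t1:.2f}, p={p1:.3f}")
print(f"two-sample  t={t2:.2f}, p={p2:.3f}")
print(f"paired      t={t3:.2f}, p={p3:.3f}")
```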

Compare: Z-test vs. T-test—both compare means to a reference value, but Z-tests require known population variance while T-tests estimate it from sample data. On exams, if they give you $\sigma$, think Z-test; if they give you $s$, think T-test.


Comparing Multiple Groups: Beyond Two Means

When you have three or more groups, running multiple T-tests inflates your Type I error rate. ANOVA solves this by testing all groups simultaneously using variance decomposition—comparing variation between groups to variation within groups.

ANOVA (One-Way, Two-Way)

  • One-way ANOVA tests whether means differ across three or more groups based on a single factor—the null hypothesis is $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$
  • Two-way ANOVA examines two factors simultaneously and can detect interaction effects—when the impact of one factor depends on the level of another
  • F-statistic is the ratio of between-group variance to within-group variance; larger values suggest group means truly differ
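
A brief one-way ANOVA sketch with scipy.stats.f_oneway, using three simulated treatment groups as stand-in data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical scores for three treatment groups (illustration only)
g1 = rng.normal(60, 8, size=25)
g2 = rng.normal(63, 8, size=25)
g3 = rng.normal(67, 8, size=25)

# One-way ANOVA: F = between-group variance / within-group variance
f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# A small p-value says *some* mean differs; a post-hoc procedure
# (e.g., Tukey's HSD) is needed to say which pairs differ.
```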

F-Test

  • Compares two variances to test if they're significantly different—the ratio $F = \frac{s_1^2}{s_2^2}$ follows an F-distribution
  • Underlies ANOVA calculations—every ANOVA table includes an F-statistic for testing mean equality
  • Assumes normality and independence—sensitive to violations, especially with small samples
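
One way to carry out the variance-ratio F-test is to compute the ratio directly and compare it to the F distribution in scipy.stats; a sketch with simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample_1 = rng.normal(0, 5, size=30)   # hypothetical group 1
sample_2 = rng.normal(0, 3, size=25)   # hypothetical group 2

# F statistic: ratio of sample variances (ddof=1 gives the unbiased estimator)
f_stat = np.var(sample_1, ddof=1) / np.var(sample_2, ddof=1)
df1, df2 = len(sample_1) - 1, len(sample_2) - 1

# Two-sided p-value from the F distribution
p_value = 2 * min(stats.f.cdf(f_stat, df1, df2),
                  stats.f.sf(f_stat, df1, df2))
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```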

Compare: T-test vs. ANOVA—T-tests handle two groups; ANOVA handles three or more. If an FRQ gives you multiple treatment conditions, ANOVA is your go-to. Remember: ANOVA tells you something differs but not what—you need post-hoc tests for that.


Testing Relationships: Regression and Model Comparison

These methods go beyond "are groups different?" to ask "how are variables related?" and "which model explains the data better?" Regression quantifies relationships; likelihood ratio tests compare competing explanations.

Regression Analysis (Simple and Multiple)

  • Simple regression models the relationship between one predictor and one outcome: $Y = \beta_0 + \beta_1 X + \epsilon$
  • Multiple regression includes two or more predictors, allowing you to control for confounding variables and assess unique contributions
  • Hypothesis tests on coefficients (is $\beta_1 = 0$?) determine whether predictors have statistically significant effects
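
A short sketch of a multiple regression with coefficient t-tests using statsmodels; the predictors and outcome are simulated for illustration, with x2 deliberately given no true effect:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)   # hypothetical predictor 1
x2 = rng.normal(size=n)   # hypothetical predictor 2
y = 2.0 + 1.5 * x1 + 0.0 * x2 + rng.normal(scale=1.0, size=n)

# Multiple regression: y = b0 + b1*x1 + b2*x2 + error
X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()

# Each coefficient gets a t-test of H0: beta_j = 0
print(model.params)    # estimated coefficients
print(model.pvalues)   # p-values for each coefficient test
```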

Likelihood Ratio Test

  • Compares nested models by examining the ratio of their maximum likelihoods: $\Lambda = \frac{L(\text{null})}{L(\text{alternative})}$
  • Test statistic $-2 \ln(\Lambda)$ follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between the two models
  • Essential for logistic regression and generalized linear models where traditional methods don't apply
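
A sketch of a likelihood ratio test comparing two nested logistic regression models in statsmodels; the data is simulated, and x2 is given no true effect so the smaller model should be adequate:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x1)))   # true model uses x1 only
y = rng.binomial(1, p)

# Null model: intercept + x1.  Alternative: intercept + x1 + x2 (nested).
X_null = sm.add_constant(x1)
X_alt = sm.add_constant(np.column_stack([x1, x2]))
fit_null = sm.Logit(y, X_null).fit(disp=0)
fit_alt = sm.Logit(y, X_alt).fit(disp=0)

# -2 ln(Lambda) = 2 * (loglik_alt - loglik_null), chi-square with df = extra parameters
lr_stat = 2 * (fit_alt.llf - fit_null.llf)
df = X_alt.shape[1] - X_null.shape[1]
p_value = stats.chi2.sf(lr_stat, df)
print(f"LR statistic = {lr_stat:.3f}, df = {df}, p = {p_value:.3f}")
```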

Compare: Regression coefficients vs. Likelihood ratio tests—coefficient tests ask "does this one predictor matter?" while likelihood ratio tests ask "does this set of predictors improve the model?" Use likelihood ratios when comparing models with different numbers of parameters.


Categorical Data: When Means Don't Apply

Not all data is continuous. When you're working with counts, categories, or frequencies, you need tests designed for discrete distributions. The chi-square test is your primary tool here.

Chi-Square Test

  • Tests association between categorical variables by comparing observed frequencies to expected frequencies under independence
  • Test statistic: $\chi^2 = \sum \frac{(O - E)^2}{E}$, where $O$ = observed and $E$ = expected counts
  • Requires adequate expected frequencies—rule of thumb is $E \geq 5$ in each cell; otherwise, consider Fisher's exact test
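
A compact sketch of a chi-square test of independence with scipy.stats.chi2_contingency; the contingency table counts are made up for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows = age group, columns = preferred design
observed = np.array([[30, 14, 25, 21],
                     [20, 26, 15, 19],
                     [10, 28, 18, 22]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")

# Verify the rule of thumb before trusting the result
print("all expected counts >= 5:", (expected >= 5).all())
```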

Compare: Chi-square vs. T-test—Chi-square handles categorical data (counts in categories); T-tests handle continuous data (measured values). If your data is "how many people chose option A vs. B," think chi-square. If it's "what was the average score," think T-test.


Non-Parametric Alternatives: When Assumptions Fail

Parametric tests (Z, T, ANOVA) assume your data is normally distributed. When that assumption is violated—small samples, skewed distributions, ordinal data—non-parametric tests save the day by working with ranks instead of raw values.

Wilcoxon Rank-Sum Test

  • Non-parametric alternative to the two-sample T-test—compares distributions of two independent groups without assuming normality
  • Works with ranks rather than actual values, making it robust to outliers and skewed data
  • Also called Mann-Whitney U test—same procedure, different name; know both for exams
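
A short sketch running the comparison under both names with scipy.stats, on simulated skewed data; the two functions implement the same rank-based idea but differ slightly in details (continuity correction, tie handling), so their p-values may not match exactly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Hypothetical skewed data (illustration only)
group_a = rng.exponential(scale=2.0, size=30)
group_b = rng.exponential(scale=3.0, size=30)

# Mann-Whitney U and Wilcoxon rank-sum: same rank-based comparison, two names
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
w_stat, p_w = stats.ranksums(group_a, group_b)
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_u:.4f}")
print(f"rank-sum W = {w_stat:.2f}, p = {p_w:.4f}")
```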

Kolmogorov-Smirnov Test

  • Compares entire distributions, not just central tendency—tests whether two samples come from the same population
  • Examines cumulative distribution functions—sensitive to differences in location, spread, and shape
  • One-sample version tests whether data follows a specific theoretical distribution (e.g., is this sample normally distributed?)
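
A sketch of both the two-sample and one-sample K-S tests with scipy.stats, on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
sample_a = rng.normal(0, 1, size=100)
sample_b = rng.normal(0.3, 1.5, size=100)   # different location and spread

# Two-sample K-S: do the two samples come from the same distribution?
d_two, p_two = stats.ks_2samp(sample_a, sample_b)

# One-sample K-S: does sample_a follow a standard normal distribution?
d_one, p_one = stats.kstest(sample_a, "norm")

print(f"two-sample D = {d_two:.3f}, p = {p_two:.4f}")
print(f"one-sample D = {d_one:.3f}, p = {p_one:.4f}")
```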

Compare: Wilcoxon vs. Kolmogorov-Smirnov—both are non-parametric, but Wilcoxon focuses on whether one group tends to have larger values (like a median comparison), while K-S tests whether the entire distribution shapes match. Wilcoxon is your T-test replacement; K-S is for distribution comparison.


Resampling Methods: Let the Data Speak

When you can't rely on theoretical distributions or your sample is unusual, bootstrap methods let you empirically estimate sampling distributions by repeatedly resampling your own data.

Bootstrap Methods

  • Resamples with replacement from your original data thousands of times to build an empirical sampling distribution
  • Estimates confidence intervals and standard errors without requiring normality or known formulas—particularly useful for complex statistics like medians or ratios
  • Assumption-light approach—doesn't require parametric assumptions, but does assume your sample reasonably represents the population
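
A minimal percentile-bootstrap sketch in plain NumPy, estimating a 95% confidence interval for a median from simulated skewed data:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical skewed sample (illustration only)
data = rng.lognormal(mean=0.0, sigma=1.0, size=80)

# Resample with replacement many times and recompute the median each time
n_boot = 10_000
boot_medians = np.array([
    np.median(rng.choice(data, size=len(data), replace=True))
    for _ in range(n_boot)
])

# Percentile 95% confidence interval for the median
lower, upper = np.percentile(boot_medians, [2.5, 97.5])
print(f"sample median = {np.median(data):.3f}")
print(f"95% bootstrap CI: ({lower:.3f}, {upper:.3f})")
```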

Compare: Bootstrap vs. Traditional tests—traditional tests use theoretical distributions (Z, T, F, chi-square); bootstrap builds the distribution empirically from your data. When exam questions mention "violated assumptions" or "unusual statistics," bootstrap is often the answer.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Comparing one mean to a known value | Z-test (variance known), One-sample T-test (variance unknown) |
| Comparing two independent means | Two-sample T-test, Wilcoxon rank-sum |
| Comparing paired/dependent observations | Paired T-test |
| Comparing three or more means | One-way ANOVA, Two-way ANOVA |
| Comparing variances | F-test |
| Testing categorical associations | Chi-square test |
| Modeling relationships between variables | Simple regression, Multiple regression |
| Comparing model fit | Likelihood ratio test |
| Distribution comparison (non-parametric) | Kolmogorov-Smirnov test |
| Assumption-free inference | Bootstrap methods |

Self-Check Questions

  1. You have two independent groups and non-normal data with several outliers. Which two tests could you use, and why might you prefer the non-parametric option?

  2. A researcher wants to test whether a new teaching method improves scores by measuring the same students before and after the intervention. Which test is appropriate, and why would a two-sample T-test be incorrect here?

  3. Compare and contrast one-way ANOVA and the two-sample T-test. Under what conditions does ANOVA become necessary, and what additional information does two-way ANOVA provide?

  4. An FRQ presents count data showing how many customers preferred each of four product designs across three age groups. Which test would you use, and what assumption must you verify before proceeding?

  5. Your data violates normality assumptions, and you need to construct a 95% confidence interval for the median. Which method allows you to do this without relying on theoretical distributions, and how does it work?