🎲 Data, Inference, and Decisions

Hypothesis Testing Methods


Why This Matters

Hypothesis testing is how you move from "I think there's a pattern here" to "I can confidently claim this effect is real." Every time a study claims a new drug works, a policy made a difference, or two groups behave differently, hypothesis testing is doing the heavy lifting behind the scenes. You're being tested on your ability to choose the right test for the right situation, interpret results correctly, and understand what "statistically significant" actually means.

The methods in this guide aren't just formulas to memorize. They represent different tools for different jobs. Some compare means, others compare variances or distributions. Some require your data to be normally distributed; others don't care. The key concepts to master include parametric vs. non-parametric approaches, comparing means vs. variances vs. distributions, and when assumptions matter. Don't just memorize which test does what. Know why you'd reach for one tool instead of another.


Comparing Means: The Workhorses of Hypothesis Testing

Most hypothesis tests you'll encounter ask a simple question: are these means different? The tests below handle this question under different conditions: known vs. unknown variance, one group vs. two, independent vs. paired observations.

Z-Test

  • Use when population variance is known. This is rare in practice but common on exams as a foundational concept.
  • Requires normality or large samples ($n > 30$) thanks to the Central Limit Theorem.
  • Test statistic: $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$, where $\sigma$ is the known population standard deviation.

The numerator measures how far your sample mean is from the hypothesized population mean. The denominator is the standard error, which captures how much sampling variability you'd expect. A large $Z$ value means your sample mean is far from $\mu$ relative to what random chance would produce.
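As a sketch of the formula above, the Z statistic and its two-sided p-value can be computed with nothing beyond the standard library; the sample numbers here are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n):
    """One-sample Z-test. Returns (z, two-sided p-value).

    sigma is the *known* population standard deviation."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))   # distance from mu0 in standard errors
    p = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided tail probability
    return z, p

# Hypothetical numbers: n = 36, sample mean 52, testing mu = 50 with sigma = 6
z, p = one_sample_z_test(52, 50, 6, 36)  # z = 2.0, p ≈ 0.0455
```

With the standard error $6/\sqrt{36} = 1$, the sample mean sits 2 standard errors above the hypothesized mean, which is significant at the usual 5% level.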

T-Test (One-Sample, Two-Sample, Paired)

  • One-sample T-test compares a sample mean to a hypothesized population mean when variance is unknown. It uses the sample standard deviation $s$ instead of $\sigma$, which adds uncertainty and produces heavier tails in the distribution.
  • Two-sample T-test compares means from two independent groups. You choose between the pooled version (assumes equal variances in both groups) and Welch's version (allows unequal variances). When in doubt, Welch's is the safer choice.
  • Paired T-test handles dependent observations (before/after measurements, matched subjects). It analyzes the differences within pairs, not the raw scores, effectively reducing the problem to a one-sample T-test on those differences.
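All three variants have direct counterparts in `scipy.stats`; here is a quick sketch on simulated data (assuming SciPy is available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.normal(100, 10, size=20)        # hypothetical pre-treatment scores
after = before + rng.normal(2, 5, size=20)   # same subjects after treatment
other = rng.normal(105, 12, size=25)         # an independent comparison group

# One-sample: is the mean of `before` different from 100?
t1, p1 = stats.ttest_1samp(before, popmean=100)

# Two-sample Welch test (equal_var=False allows unequal variances -- the safer default):
t2, p2 = stats.ttest_ind(before, other, equal_var=False)

# Paired: equivalent to a one-sample test on the within-pair differences
t3, p3 = stats.ttest_rel(after, before)
```

Note that `ttest_rel(after, before)` gives exactly the same result as `ttest_1samp(after - before, 0)`, which is the "reduce to differences" idea in the paired bullet above.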

Compare: Z-test vs. T-test: both compare means to a reference value, but Z-tests require known population variance while T-tests estimate it from sample data. On exams, if they give you $\sigma$, think Z-test; if they give you $s$, think T-test. As $n$ grows large, the T-distribution converges to the Z-distribution, so the distinction matters most with small samples.


Comparing Multiple Groups: Beyond Two Means

When you have three or more groups, running multiple T-tests inflates your Type I error rate (the probability of falsely rejecting a true null hypothesis). With three groups, you'd need three pairwise T-tests, and the chance of at least one false positive climbs well above your chosen $\alpha$. ANOVA solves this by testing all groups simultaneously using variance decomposition.

ANOVA (One-Way, Two-Way)

  • One-way ANOVA tests whether means differ across three or more groups based on a single factor. The null hypothesis is $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$.
  • Two-way ANOVA examines two factors simultaneously and can detect interaction effects, which occur when the impact of one factor depends on the level of another. For example, a drug might work differently for men and women.
  • The F-statistic is the ratio of between-group variance to within-group variance: $F = \frac{MS_{between}}{MS_{within}}$. Larger values suggest the group means truly differ, because variation between groups is large relative to the noise within groups.
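A minimal sketch with hypothetical scores for three teaching methods, checking that `scipy.stats.f_oneway` agrees with the hand-computed ratio MS_between / MS_within (assuming SciPy is available):

```python
import numpy as np
from scipy import stats

# Hypothetical scores under three teaching methods
g1 = np.array([82., 85., 88., 90., 79.])
g2 = np.array([75., 78., 80., 74., 77.])
g3 = np.array([90., 92., 88., 95., 91.])

F, p = stats.f_oneway(g1, g2, g3)

# The same F by hand, as the ratio of mean squares:
groups = [g1, g2, g3]
grand = np.concatenate(groups).mean()
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_between = ss_between / (len(groups) - 1)                          # df1 = k - 1
ms_within = ss_within / (sum(len(g) for g in groups) - len(groups))  # df2 = N - k
```

Because the three group means (roughly 85, 77, and 91) are far apart relative to the within-group spread, the F-statistic is large and the p-value is tiny.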

F-Test

  • Compares two variances to test if they're significantly different: $F = \frac{s_1^2}{s_2^2}$, which follows an F-distribution under the null.
  • Underlies ANOVA calculations. Every ANOVA table includes an F-statistic for testing mean equality.
  • Assumes normality and independence. The F-test is sensitive to violations of normality, especially with small samples, so check your assumptions before relying on it.
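A sketch of the variance-ratio test on hypothetical measurements (assuming SciPy); one common convention is to put the larger sample variance in the numerator and double the upper-tail area for a two-sided p-value:

```python
import numpy as np
from scipy import stats

# Hypothetical measurements from two machines
a = np.array([12.1, 11.8, 12.5, 12.0, 11.9, 12.3])
b = np.array([12.4, 10.9, 13.5, 11.2, 12.8, 13.9])

s_a, s_b = np.var(a, ddof=1), np.var(b, ddof=1)  # sample variances (ddof=1)
F = s_b / s_a                                    # larger variance on top, by convention
df_num, df_den = len(b) - 1, len(a) - 1
p = 2 * stats.f.sf(F, df_num, df_den)            # two-sided: double the upper tail
```

Here machine `b` is visibly more variable than machine `a`, so the ratio lands far out in the F-distribution's tail.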

Compare: T-test vs. ANOVA: T-tests handle two groups; ANOVA handles three or more. If a problem gives you multiple treatment conditions, ANOVA is your go-to. Remember: a significant ANOVA result tells you something differs but not what. You need post-hoc tests (like Tukey's HSD) to identify which specific group means are different from each other.


Testing Relationships: Regression and Model Comparison

These methods go beyond "are groups different?" to ask "how are variables related?" and "which model explains the data better?" Regression quantifies relationships; likelihood ratio tests compare competing model specifications.

Regression Analysis (Simple and Multiple)

  • Simple regression models the relationship between one predictor and one outcome: $Y = \beta_0 + \beta_1 X + \epsilon$. Here $\beta_0$ is the intercept, $\beta_1$ is the slope (the change in $Y$ for a one-unit change in $X$), and $\epsilon$ is the error term.
  • Multiple regression includes two or more predictors, allowing you to control for confounding variables and assess each predictor's unique contribution while holding the others constant.
  • Hypothesis tests on coefficients (testing $H_0: \beta_j = 0$) determine whether a predictor has a statistically significant linear relationship with the outcome. These typically use T-tests on individual coefficients and F-tests for joint significance of multiple coefficients.
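A sketch of simple regression on simulated data (assuming SciPy): `scipy.stats.linregress` returns the fitted slope along with the T-test p-value for the null hypothesis that the slope is zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=40)  # true beta0 = 2, beta1 = 0.5, plus noise

res = stats.linregress(x, y)
# res.slope estimates beta1; res.pvalue is the T-test of H0: beta1 = 0
```

Since the data were generated with a genuine slope of 0.5, the estimate lands near 0.5 and the p-value is far below 0.05.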

Likelihood Ratio Test

  • Compares nested models by examining the ratio of their maximum likelihoods: $\Lambda = \frac{L(\text{restricted})}{L(\text{unrestricted})}$. The restricted (null) model is a special case of the unrestricted (alternative) model with some parameters set to zero or constrained.
  • The test statistic $-2 \ln(\Lambda)$ follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between the two models.
  • Essential for logistic regression and generalized linear models where you can't simply compare $R^2$ values or use standard F-tests.
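A worked sketch with hypothetical coin-flip data: the restricted model fixes p = 0.5, the unrestricted model uses the sample proportion, and one freed parameter gives df = 1. Since a chi-square variable with 1 df is a squared standard normal, the p-value needs only the standard library:

```python
from math import log, sqrt
from statistics import NormalDist

# Hypothetical data: 62 heads in 100 coin flips.
# Restricted (null) model: p = 0.5. Unrestricted: p = p_hat.
n, k = 100, 62
p_hat = k / n

def binom_loglik(p):
    # Log-likelihood of k successes in n trials
    # (the binomial coefficient is common to both models and cancels in the ratio)
    return k * log(p) + (n - k) * log(1 - p)

stat = -2 * (binom_loglik(0.5) - binom_loglik(p_hat))  # -2 ln(Lambda)

# One parameter is freed, so df = 1; chi-square(1) is a squared Z,
# so P(chi2_1 > x) = 2 * (1 - Phi(sqrt(x)))
p_value = 2 * (1 - NormalDist().cdf(sqrt(stat)))
```

The statistic comes out near 5.8 with a p-value around 0.016, so the restricted p = 0.5 model is rejected at the 5% level.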

Compare: Regression coefficient tests vs. Likelihood ratio tests: coefficient tests ask "does this one predictor matter?" while likelihood ratio tests ask "does this set of predictors improve the model overall?" Use likelihood ratio tests when comparing models with different numbers of parameters, especially outside the OLS framework.


Categorical Data: When Means Don't Apply

Not all data is continuous. When you're working with counts, categories, or frequencies, you need tests designed for discrete distributions. The chi-square test is your primary tool here.

Chi-Square Test

The chi-square test compares what you actually observe in your data to what you'd expect if there were no association between the variables.

  • Test of independence: tests whether two categorical variables are associated by comparing observed cell frequencies to expected frequencies under independence.
  • Goodness-of-fit: tests whether observed frequencies match a hypothesized distribution (e.g., are dice rolls uniformly distributed?).
  • Test statistic: $\chi^2 = \sum \frac{(O - E)^2}{E}$, where $O$ = observed counts and $E$ = expected counts under the null hypothesis.
  • Requires adequate expected frequencies. The rule of thumb is $E \geq 5$ in each cell. If this isn't met, consider Fisher's exact test instead.
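The formula above can be applied by hand to a hypothetical 2x2 table; expected counts come from (row total x column total) / grand total, and because a 2x2 table has df = 1, the p-value again needs only the standard normal CDF:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical 2x2 table: preference for design A vs. B, by age group (rows)
observed = [[30, 20],
            [18, 32]]

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
grand = sum(row_totals)

chi2 = 0.0
for i, r in enumerate(observed):
    for j, o in enumerate(r):
        expected = row_totals[i] * col_totals[j] / grand  # E under independence
        chi2 += (o - expected) ** 2 / expected

# df = (2-1)*(2-1) = 1, and chi-square(1) is a squared Z:
p = 2 * (1 - NormalDist().cdf(sqrt(chi2)))
```

Every expected count here is at least 24, so the $E \geq 5$ rule of thumb is comfortably satisfied.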

Compare: Chi-square vs. T-test: Chi-square handles categorical data (counts in categories); T-tests handle continuous data (measured values). If your data is "how many people chose option A vs. B," think chi-square. If it's "what was the average score," think T-test.


Non-Parametric Alternatives: When Assumptions Fail

Parametric tests (Z, T, ANOVA) assume your data is normally distributed. When that assumption is violated due to small samples, skewed distributions, or ordinal data, non-parametric tests provide valid alternatives by working with ranks instead of raw values.

Wilcoxon Rank-Sum Test

  • Non-parametric alternative to the two-sample T-test. Compares distributions of two independent groups without assuming normality.
  • Works with ranks rather than actual values, making it robust to outliers and skewed data. You pool all observations, rank them from smallest to largest, then compare the sum of ranks between groups.
  • Also called the Mann-Whitney U test. Same procedure, different name. Know both for exams.
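The pool-and-rank procedure can be sketched with the normal approximation to the rank-sum distribution; this simplified version assumes no ties, and for very small samples an exact test would be preferable:

```python
from math import sqrt
from statistics import NormalDist

def rank_sum_test(x, y):
    """Wilcoxon rank-sum via the normal approximation (assumes no ties)."""
    pooled = sorted(x + y)
    rank = {v: i + 1 for i, v in enumerate(pooled)}
    W = sum(rank[v] for v in x)              # sum of x's ranks in the pooled sample
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2            # E[W] under H0
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # sd of W under H0
    z = (W - mean) / sd
    return W, 2 * NormalDist().cdf(-abs(z))  # two-sided p-value

# Hypothetical groups where y tends to have larger values
W, p = rank_sum_test([1.1, 2.3, 2.9, 4.0], [3.5, 5.2, 6.1, 7.8, 8.4])
```

Because the statistic uses only ranks, replacing 8.4 with 8400 would not change the result at all, which is exactly the robustness to outliers described above.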

Kolmogorov-Smirnov Test

  • Compares entire distributions, not just central tendency. Tests whether two samples come from the same underlying population.
  • Examines cumulative distribution functions (CDFs). The test statistic is the maximum vertical distance between the two empirical CDFs, making it sensitive to differences in location, spread, and shape.
  • The one-sample version tests whether data follows a specific theoretical distribution (e.g., "is this sample normally distributed?").
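The K-S statistic itself is easy to compute directly: evaluate both empirical CDFs at every data point and take the largest gap. A minimal sketch (statistic only, no p-value):

```python
from bisect import bisect_right

def ks_two_sample(x, y):
    """Two-sample K-S statistic: the largest vertical gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in xs + ys:                        # the ECDF gap can only change at data points
        fx = bisect_right(xs, v) / len(xs)   # ECDF of x evaluated at v
        fy = bisect_right(ys, v) / len(ys)   # ECDF of y evaluated at v
        d = max(d, abs(fx - fy))
    return d

d1 = ks_two_sample([1, 2, 3], [4, 5, 6])        # completely separated samples
d2 = ks_two_sample([1, 3, 5, 7], [2, 4, 6, 8])  # interleaved samples
```

Completely separated samples give the maximum possible statistic of 1.0, while interleaved samples give a small value, matching the intuition that the K-S test measures how far apart the two distributions sit.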

Compare: Wilcoxon vs. Kolmogorov-Smirnov: both are non-parametric, but Wilcoxon focuses on whether one group tends to have larger values (similar to a median comparison), while K-S tests whether the entire distribution shapes match. Wilcoxon is your T-test replacement; K-S is for full distribution comparison.


Resampling Methods: Let the Data Speak

When you can't rely on theoretical distributions or your sample is unusual, bootstrap methods let you empirically estimate sampling distributions by repeatedly resampling your own data.

Bootstrap Methods

Here's how bootstrapping works:

  1. Take your original sample of $n$ observations.
  2. Draw a new sample of size $n$ with replacement from the original data. Some observations will appear multiple times; others won't appear at all.
  3. Compute the statistic of interest (mean, median, ratio, etc.) on this resampled data.
  4. Repeat steps 2-3 thousands of times (typically 1,000 to 10,000).
  5. The distribution of those computed statistics is your bootstrap distribution, which approximates the true sampling distribution.

You can use this bootstrap distribution to estimate confidence intervals and standard errors without requiring normality or known formulas. This is particularly useful for complex statistics like medians, ratios, or correlation coefficients where theoretical distributions are hard to derive.
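The steps above can be sketched in a few lines of standard-library Python, here building a simple percentile 95% confidence interval for the median of a hypothetical sample:

```python
import random
from statistics import median

random.seed(42)
data = [2.1, 3.4, 1.8, 5.6, 2.9, 4.2, 3.1, 2.5, 6.0, 3.8]  # hypothetical sample

B = 5000
boot_medians = sorted(
    median(random.choices(data, k=len(data)))  # resample WITH replacement, same n
    for _ in range(B)
)

# Simple percentile 95% CI: the 2.5th and 97.5th percentiles of the bootstrap distribution
lo, hi = boot_medians[int(0.025 * B)], boot_medians[int(0.975 * B)]
```

This is the basic percentile method; refinements such as BCa intervals correct for bias and skew in the bootstrap distribution, but the resampling loop stays the same.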

The bootstrap is assumption-light but not assumption-free. It does assume your original sample reasonably represents the population. If your sample is biased or too small to capture the population's structure, bootstrapping won't fix that.

Compare: Bootstrap vs. Traditional tests: traditional tests use theoretical distributions (Z, T, F, chi-square); bootstrap builds the distribution empirically from your data. When a problem mentions "violated assumptions" or asks about inference for an unusual statistic, bootstrap is often the right approach.


Quick Reference Table

| Situation | Best Test(s) |
| --- | --- |
| Comparing one mean to a known value | Z-test (variance known), One-sample T-test (variance unknown) |
| Comparing two independent means | Two-sample T-test, Wilcoxon rank-sum |
| Comparing paired/dependent observations | Paired T-test |
| Comparing three or more means | One-way ANOVA, Two-way ANOVA |
| Comparing variances | F-test |
| Testing categorical associations | Chi-square test |
| Modeling relationships between variables | Simple regression, Multiple regression |
| Comparing nested model fit | Likelihood ratio test |
| Distribution comparison (non-parametric) | Kolmogorov-Smirnov test |
| Assumption-free inference | Bootstrap methods |

Self-Check Questions

  1. You have two independent groups and non-normal data with several outliers. Which two tests could you use, and why might you prefer the non-parametric option?

  2. A researcher wants to test whether a new teaching method improves scores by measuring the same students before and after the intervention. Which test is appropriate, and why would a two-sample T-test be incorrect here?

  3. Compare one-way ANOVA and the two-sample T-test. Under what conditions does ANOVA become necessary, and what additional information does two-way ANOVA provide?

  4. A problem presents count data showing how many customers preferred each of four product designs across three age groups. Which test would you use, and what assumption must you verify before proceeding?

  5. Your data violates normality assumptions, and you need to construct a 95% confidence interval for the median. Which method allows you to do this without relying on theoretical distributions, and briefly describe how it works.