Hypothesis testing is how we move from "I think there's a pattern here" to "I can demonstrate this effect is real." Every time you determine whether results are statistically significant, compare group means, or assess whether data fits an expected distribution, you're applying these tests.
Don't just memorize test names and formulas. Focus on when to use each test, what assumptions must hold, and how to interpret the results. Know the difference between parametric and non-parametric approaches, understand why sample size matters, and recognize the relationship between tests that serve similar purposes under different conditions.
Comparing Means: The Core Parametric Tests
These tests answer a fundamental question: Is the difference between group means large enough to be meaningful, or just random noise? They assume your data follows a normal distribution and use that structure to calculate precise probabilities.
Z-Test
Used when the population standard deviation (σ) is known, which is rare in practice but common in textbook problems
Compares a sample mean to a population mean using the standard normal distribution
Requires either a large sample (n ≥ 30, where the Central Limit Theorem kicks in) or a normally distributed population
Test statistic: z = (x̄ − μ) / (σ / √n)
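As a concrete sketch, the one-sample z-test takes only a few lines in Python using `scipy.stats.norm`; the sample numbers below are made up for illustration.

```python
from math import sqrt
from scipy.stats import norm

def z_test(sample_mean, mu, sigma, n):
    """One-sample z-test; valid only because sigma (the population SD) is known."""
    z = (sample_mean - mu) / (sigma / sqrt(n))
    p = 2 * norm.sf(abs(z))  # two-tailed p-value from the standard normal
    return z, p

# Made-up example: n = 36, sample mean 103, population mean 100, sigma = 15
z, p = z_test(sample_mean=103.0, mu=100.0, sigma=15.0, n=36)
# z = (103 - 100) / (15 / 6) = 1.2, which falls short of the 1.96 cutoff at alpha = 0.05
```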
T-Test (One-Sample, Two-Sample, Paired)
The T-test is the workhorse of hypothesis testing because in most real situations, you don't know the population standard deviation. You estimate it from your sample using s.
One-sample T-test: compares a sample mean to a known or hypothesized value
Two-sample T-test: compares means of two independent groups
Paired T-test: compares measurements from the same subjects at two different times or under two conditions. Because the observations are linked, you actually analyze the differences within each pair.
Degrees of freedom affect the shape of the t-distribution. Smaller samples produce wider, flatter curves with more area in the tails, meaning you need a larger test statistic to reach significance.
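All three variants are single calls in Python's `scipy.stats`; the before/after measurements below are made up. Note how the paired test, which analyzes within-pair differences, detects the shift far more decisively than the (inappropriate) independent-samples test on the same numbers.

```python
from scipy import stats

# Made-up measurements of the same six subjects before and after a treatment
before = [12.1, 11.8, 13.0, 12.4, 11.5, 12.9]
after = [12.8, 12.5, 13.6, 12.9, 12.2, 13.4]

t1, p1 = stats.ttest_1samp(before, popmean=12.0)  # one-sample: mean vs. 12.0
t2, p2 = stats.ttest_ind(before, after)           # two-sample: wrongly treats groups as independent
t3, p3 = stats.ttest_rel(before, after)           # paired: analyzes the within-pair differences
```

Because the within-pair differences here are very consistent, the paired test's p-value is orders of magnitude smaller than the independent-samples p-value.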
Compare: Z-test vs. T-test: both compare means to detect significant differences, but Z-tests require known population variance while T-tests estimate it from sample data. If you're given σ, think Z-test; if you're given s, think T-test. As sample size grows large, the t-distribution approaches the standard normal, so the two tests converge.
ANOVA (Analysis of Variance)
Running multiple T-tests to compare several groups inflates your Type I error rate (the chance of a false positive). ANOVA solves this by comparing all group means simultaneously in a single test.
One-way ANOVA uses one independent variable (e.g., comparing test scores across three teaching methods)
Two-way ANOVA examines two factors and their interaction effect (e.g., teaching method and class size)
Assumptions: normality within each group, independence of observations, and homogeneity of variances (roughly equal variances across groups)
A significant ANOVA result tells you at least one group differs, but not which one. You need post-hoc tests (like Tukey's HSD) to pinpoint specific differences.
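A minimal one-way ANOVA in Python uses `scipy.stats.f_oneway`; the scores below are made up, with one method deliberately higher so the test comes out significant.

```python
from scipy import stats

# Made-up exam scores under three teaching methods
method_a = [78, 82, 75, 80, 77]
method_b = [85, 88, 84, 90, 86]
method_c = [79, 81, 78, 83, 80]

f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)
# A small p-value says at least one mean differs; a post-hoc test
# (e.g., Tukey's HSD, available in statsmodels) is still needed to say which.
```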
Compare: T-test vs. ANOVA: T-tests handle two-group comparisons, while ANOVA extends this to three or more groups. If a problem asks you to compare multiple treatment conditions, ANOVA is your answer.
Comparing Variances and Relationships
Sometimes the question isn't about means. It's about spread or association. These tests examine whether variability differs between groups or whether variables move together in predictable ways.
F-Test
Compares variances of two populations to determine if they're significantly different
Test statistic: F = s₁² / s₂², the ratio of the two sample variances
Always produces positive values since it's a ratio of squared terms; the F-distribution is right-skewed and starts at zero
Often used as an assumption check before ANOVA to verify homogeneity of variances
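scipy has no single-call two-sample variance F-test, so a short manual sketch built from the F-distribution works; the samples below are made up, with deliberately different spreads.

```python
import numpy as np
from scipy import stats

def f_test(sample1, sample2):
    """Two-sided F-test for equality of two variances (a sketch, not a library routine)."""
    s1 = np.var(sample1, ddof=1)  # sample variances (ddof=1 gives the unbiased estimate)
    s2 = np.var(sample2, ddof=1)
    f = s1 / s2
    dfn, dfd = len(sample1) - 1, len(sample2) - 1
    # Two-sided p-value: double the smaller tail area of the F-distribution
    p = 2 * min(stats.f.cdf(f, dfn, dfd), stats.f.sf(f, dfn, dfd))
    return f, p

# Made-up samples: the first is far more spread out than the second
f, p = f_test([1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 10.1, 10.2, 9.9, 9.8])
```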
Regression Analysis
Models the relationship between variables. Simple linear regression uses one predictor: y = a + bx. Multiple regression uses several: y = a + b₁x₁ + b₂x₂ + …
The coefficient of determination (R²) tells you the proportion of variance in the dependent variable explained by the model. An R² of 0.85 means the model accounts for 85% of the variability in y.
Residual analysis checks whether the model's assumptions hold. You want random scatter in residual plots. Patterns (curves, funnels) signal that the linear model isn't appropriate or that variance isn't constant.
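Simple linear regression, R², and residuals can all be sketched with `scipy.stats.linregress`; the hours-studied vs. exam-score numbers below are made up and nearly linear by construction.

```python
from scipy import stats

# Made-up data: hours studied (x) vs. exam score (y)
x = [1, 2, 3, 4, 5, 6]
y = [52, 58, 61, 67, 71, 78]

res = stats.linregress(x, y)
r_squared = res.rvalue ** 2  # proportion of variance in y explained by the model

# Residuals should show random scatter if the linear model is appropriate
predicted = [res.intercept + res.slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]
```

By the least-squares construction, the residuals always sum to zero; what matters for diagnostics is whether their scatter looks patternless.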
Compare: F-test vs. ANOVA: both use the F-distribution, but F-tests compare two variances directly while ANOVA uses F-statistics to compare variation between groups to variation within groups. The ANOVA F-statistic is F = MS_between / MS_within, where MS stands for mean square.
Categorical Data Analysis
When your data consists of counts or categories rather than continuous measurements, you need tests designed for frequencies. These compare what you observed to what you'd expect under the null hypothesis.
Chi-Square Test
There are two main versions, and knowing which one applies matters:
Goodness-of-fit test: checks whether a single categorical variable follows a hypothesized distribution (e.g., do M&M color frequencies match the company's stated proportions?)
Test of independence: checks whether two categorical variables are associated in a contingency table (e.g., is there a relationship between gender and voting preference?)
Test statistic: χ² = Σ (O − E)² / E, where O is the observed count and E is the expected count
Key condition: all expected cell counts should be at least 5. If they aren't, the chi-square approximation breaks down.
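Both chi-square versions are available in Python's `scipy.stats`; the counts below are made up, and note that for the goodness-of-fit test the expected counts must sum to the same total as the observed counts.

```python
from scipy import stats

# Goodness of fit: made-up color counts vs. an expected uniform split
observed = [18, 22, 20, 25, 15]
expected = [20, 20, 20, 20, 20]  # same total (100) as the observed counts
chi2_gof, p_gof = stats.chisquare(f_obs=observed, f_exp=expected)

# Test of independence: a made-up 2x2 contingency table
table = [[30, 10],
         [20, 40]]
chi2_ind, p_ind, dof, expected_counts = stats.chi2_contingency(table)
# chi2_contingency also returns the expected counts, handy for
# checking the "all expected cells at least 5" condition
```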
Compare: Chi-square test vs. T-test: Chi-square handles categorical data (counts in categories), while T-tests handle continuous data (measurements on a numerical scale). If your data involves frequencies or proportions in a table, Chi-square is your tool.
Non-Parametric Alternatives
When your data violates normality assumptions or uses ordinal scales (ranked data like survey ratings), non-parametric tests step in. They work with ranks rather than raw values, making fewer assumptions about the underlying distribution.
Wilcoxon Rank-Sum Test
Non-parametric alternative to the two-sample T-test for comparing two independent groups
Procedure: combine all observations from both groups, rank them from smallest to largest, then compare the sum of ranks for each group
Works with ordinal data or continuous data that isn't normally distributed
Also called the Mann-Whitney U test. Same underlying procedure, different name. Recognize both.
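scipy exposes this test under its Mann-Whitney name; the ordinal ratings below are made up, with the groups clearly separated so the test comes out significant.

```python
from scipy import stats

# Made-up ordinal ratings (1-7 scale) from two independent groups
group_a = [3, 4, 2, 5, 4, 3]
group_b = [5, 6, 7, 5, 6, 6]

# Rank-based comparison; no normality assumption required
u_stat, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
```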
Kruskal-Wallis Test
Non-parametric alternative to one-way ANOVA for comparing three or more independent groups
Uses ranks instead of raw values to test whether samples come from the same distribution
Like ANOVA, a significant result only tells you something differs. Follow-up pairwise tests are needed to identify which specific groups differ.
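The Kruskal-Wallis test is one call in `scipy.stats`; the three groups below are made up and fully separated in rank, so the result is significant.

```python
from scipy import stats

# Three made-up independent groups; no normality assumed
g1 = [1.2, 2.3, 1.8, 2.1]
g2 = [3.9, 4.5, 4.1, 3.8]
g3 = [2.8, 2.5, 3.1, 2.9]

h_stat, p = stats.kruskal(g1, g2, g3)
# A small p only says some group differs; pairwise rank tests
# (e.g., Mann-Whitney with a multiple-comparison correction) follow up
```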
Compare: Wilcoxon vs. Kruskal-Wallis: Wilcoxon handles two-group comparisons (parallels the T-test), while Kruskal-Wallis extends to three or more groups (parallels ANOVA). Both are rank-based and don't require normality.
Testing Assumptions: Normality Checks
Before applying parametric tests, you should verify that the normality assumption holds. These tests assess whether your data plausibly comes from a normal distribution. Note that for both tests, the null hypothesis is that the data is normal, so a small p-value means you reject normality.
Shapiro-Wilk Test
Preferred for small to moderate samples (roughly n<50) and generally considered the most powerful test for detecting departures from normality
A significant result (low p-value) means you have evidence the data is not normally distributed
Caveat with large samples: very large n can produce significant results even when the departure from normality is trivially small and practically irrelevant
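A quick sketch in Python: feed `scipy.stats.shapiro` one randomly generated normal sample and one exponential (strongly skewed) sample. The W statistic always lies in (0, 1], with values near 1 indicating good agreement with normality.

```python
import random
from scipy import stats

random.seed(0)
normal_data = [random.gauss(0, 1) for _ in range(40)]
skewed_data = [random.expovariate(1.0) for _ in range(40)]  # exponential: strongly skewed

w_normal, p_normal = stats.shapiro(normal_data)  # p is typically large: no evidence against normality
w_skewed, p_skewed = stats.shapiro(skewed_data)  # p is typically small: reject normality
```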
Kolmogorov-Smirnov Test
More versatile: can compare a sample against any specified distribution, not just the normal distribution. Also works for two-sample comparisons (testing whether two samples come from the same distribution).
Measures the maximum distance between the empirical cumulative distribution function and the theoretical one
Less powerful than Shapiro-Wilk for normality testing specifically, so default to Shapiro-Wilk when normality is your only question
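Both K-S variants are available in `scipy.stats` (the data below is randomly generated for illustration). One caveat worth a comment: the one-sample version assumes the reference distribution's parameters are fixed in advance, not estimated from the same sample.

```python
import random
from scipy import stats

random.seed(1)
data = [random.gauss(5, 2) for _ in range(100)]

# One-sample K-S against a fully specified N(5, 2); the parameters are
# given up front, not estimated from this same sample
d_stat, p_one = stats.kstest(data, "norm", args=(5, 2))

# Two-sample K-S: do two independent samples come from the same distribution?
other = [random.gauss(5, 2) for _ in range(100)]
d_two, p_two = stats.ks_2samp(data, other)
```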
Compare: Shapiro-Wilk vs. Kolmogorov-Smirnov: Shapiro-Wilk is more powerful for normality testing, while K-S is more flexible for testing against other distributions. Use Shapiro-Wilk for normality checks unless you need to test a different reference distribution.
Quick Reference Table
Scenario: Best Test(s)
Comparing means (2 groups): Z-test, T-test (two-sample), Paired T-test
Comparing means (3+ groups): ANOVA (one-way, two-way)
Comparing variances: F-test
Modeling relationships: Regression analysis (simple, multiple)
Categorical data: Chi-square test (goodness-of-fit or independence)
Non-parametric (2 groups): Wilcoxon rank-sum / Mann-Whitney U test
Non-parametric (3+ groups): Kruskal-Wallis test
Testing normality: Shapiro-Wilk test, Kolmogorov-Smirnov test
Self-Check Questions
You have three treatment groups and want to compare their means, but a Shapiro-Wilk test shows significant non-normality. Which test should you use instead of ANOVA, and why?
Compare the Z-test and T-test: what assumption distinguishes when you'd use each, and how does sample size factor into this decision?
A researcher wants to determine if gender and voting preference are independent. Which test is appropriate, and what would the null hypothesis state?
Which two tests serve as non-parametric alternatives to the T-test and ANOVA, respectively? What do they have in common methodologically?
A before-and-after study measures the same subjects twice. Which specific type of T-test applies, and why would a two-sample T-test be inappropriate here?