📊 Probability and Statistics

Key Concepts in Statistical Hypothesis Tests


Why This Matters

Hypothesis testing is how we move from "I think there's a pattern here" to "I can demonstrate this effect is real." Every time you determine whether results are statistically significant, compare group means, or assess whether data fits an expected distribution, you're applying these tests.

Don't just memorize test names and formulas. Focus on when to use each test, what assumptions must hold, and how to interpret the results. Know the difference between parametric and non-parametric approaches, understand why sample size matters, and recognize the relationship between tests that serve similar purposes under different conditions.


Comparing Means: The Core Parametric Tests

These tests answer a fundamental question: Is the difference between group means large enough to be meaningful, or just random noise? They assume your data follows a normal distribution and use that structure to calculate precise probabilities.

Z-Test

  • Used when the population standard deviation ($\sigma$) is known, which is rare in practice but common in textbook problems
  • Compares a sample mean to a population mean using the standard normal distribution
  • Requires either a large sample ($n \geq 30$, where the Central Limit Theorem kicks in) or a normally distributed population
  • Test statistic: $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
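
The test statistic above is simple enough to compute by hand with SciPy's normal distribution; this is a minimal sketch with made-up numbers for illustration:

```python
import math
from scipy.stats import norm

# Hypothetical setup: population mean 100, known sigma = 15,
# sample of n = 36 with sample mean 104
mu, sigma, n, xbar = 100, 15, 36, 104

# z = (xbar - mu) / (sigma / sqrt(n))
z = (xbar - mu) / (sigma / math.sqrt(n))

# Two-sided p-value from the standard normal distribution
p = 2 * norm.sf(abs(z))
```

With these numbers, $z = 1.6$, which falls short of the usual two-sided 5% cutoff of 1.96, so the sample mean is not significantly different from 100.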

T-Test (One-Sample, Two-Sample, Paired)

The T-test is the workhorse of hypothesis testing because in most real situations, you don't know the population standard deviation. You estimate it from your sample using $s$.

  • One-sample T-test: compares a sample mean to a known or hypothesized value
  • Two-sample T-test: compares means of two independent groups
  • Paired T-test: compares measurements from the same subjects at two different times or under two conditions. Because the observations are linked, you actually analyze the differences within each pair.
  • Degrees of freedom affect the shape of the t-distribution. Smaller samples produce wider, flatter curves with more area in the tails, meaning you need a larger test statistic to reach significance.
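
All three variants are one-liners in SciPy. The data below is invented purely to illustrate the calls; note how the paired version operates on linked before/after measurements:

```python
from scipy import stats

# Two-sample T-test: hypothetical reaction times (ms) for two
# independent groups
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
group_b = [12.8, 13.1, 12.9, 13.3, 12.7, 13.0]
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Paired T-test: hypothetical before/after measurements on the SAME
# five subjects; internally this tests whether the per-subject
# differences average to zero
before = [140, 152, 138, 147, 150]
after  = [135, 148, 136, 144, 146]
t_paired, p_paired = stats.ttest_rel(before, after)
```

Running `stats.ttest_ind(before, after)` instead of `ttest_rel` would ignore the pairing and waste the within-subject information, which is why the paired test exists.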

Compare: Z-test vs. T-test: both compare means to detect significant differences, but Z-tests require known population variance while T-tests estimate it from sample data. If you're given $\sigma$, think Z-test; if you're given $s$, think T-test. As sample size grows large, the t-distribution approaches the standard normal, so the two tests converge.

ANOVA (Analysis of Variance)

Running multiple T-tests to compare several groups inflates your Type I error rate (the chance of a false positive). ANOVA solves this by comparing all group means simultaneously in a single test.

  • One-way ANOVA uses one independent variable (e.g., comparing test scores across three teaching methods)
  • Two-way ANOVA examines two factors and their interaction effect (e.g., teaching method and class size)
  • Assumptions: normality within each group, independence of observations, and homogeneity of variances (roughly equal variances across groups)
  • A significant ANOVA result tells you at least one group differs, but not which one. You need post-hoc tests (like Tukey's HSD) to pinpoint specific differences.
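
A one-way ANOVA takes one call in SciPy. The scores below are fabricated to mirror the three-teaching-methods example:

```python
from scipy import stats

# Hypothetical test scores under three teaching methods
method_a = [78, 82, 85, 80, 79]
method_b = [88, 90, 86, 91, 89]
method_c = [70, 74, 72, 68, 73]

# One call compares all three means at once, keeping the overall
# Type I error rate controlled
f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)
```

A small p-value here says only that at least one mean differs; a post-hoc procedure such as Tukey's HSD is still needed to identify which pairs differ.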

Compare: T-test vs. ANOVA: T-tests handle two-group comparisons, while ANOVA extends this to three or more groups. If a problem asks you to compare multiple treatment conditions, ANOVA is your answer.


Comparing Variances and Relationships

Sometimes the question isn't about means. It's about spread or association. These tests examine whether variability differs between groups or whether variables move together in predictable ways.

F-Test

  • Compares variances of two populations to determine if they're significantly different
  • Test statistic: $F = \frac{s_1^2}{s_2^2}$, the ratio of the two sample variances
  • Always produces positive values since it's a ratio of squared terms; the F-distribution is right-skewed and starts at zero
  • Often used as an assumption check before ANOVA to verify homogeneity of variances
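
The variance-ratio statistic can be computed directly and compared against the F-distribution. This sketch uses invented measurements and the common convention of putting the larger variance in the numerator:

```python
import numpy as np
from scipy.stats import f

# Hypothetical measurements from two processes
s1 = np.array([9.8, 10.2, 10.1, 9.9, 10.0, 10.3])
s2 = np.array([9.0, 11.0, 10.5, 8.8, 11.2, 9.5])

# Sample variances (ddof=1 gives the unbiased estimator)
v1, v2 = s1.var(ddof=1), s2.var(ddof=1)

# Larger variance on top so F >= 1
F = max(v1, v2) / min(v1, v2)
df1 = df2 = len(s1) - 1

# Two-sided p-value from the right tail of the F-distribution
p = 2 * f.sf(F, df1, df2)
```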

Regression Analysis

  • Models the relationship between variables. Simple linear regression uses one predictor: $y = a + bx$. Multiple regression uses several: $y = a + b_1 x_1 + b_2 x_2 + \ldots$
  • The coefficient of determination ($R^2$) tells you the proportion of variance in the dependent variable explained by the model. An $R^2$ of 0.85 means the model accounts for 85% of the variability in $y$.
  • Residual analysis checks whether the model's assumptions hold. You want random scatter in residual plots. Patterns (curves, funnels) signal that the linear model isn't appropriate or that variance isn't constant.
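
For the simple linear case, `scipy.stats.linregress` returns the slope, intercept, and correlation in one shot. The study-hours data here is made up for illustration:

```python
from scipy import stats

# Hypothetical data: study hours vs exam score
hours  = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 65, 70, 74]

result = stats.linregress(hours, scores)

# Fitted model: score = intercept + slope * hours
# R^2 is the square of the correlation coefficient
r_squared = result.rvalue ** 2

# Residuals for the assumption check: plot these against hours and
# look for random scatter rather than curves or funnels
residuals = [y - (result.intercept + result.slope * x)
             for x, y in zip(hours, scores)]
```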

Compare: F-test vs. ANOVA: both use the F-distribution, but F-tests compare two variances directly while ANOVA uses F-statistics to compare variation between groups to variation within groups. The ANOVA F-statistic is $F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}}$, where MS stands for mean square.


Categorical Data Analysis

When your data consists of counts or categories rather than continuous measurements, you need tests designed for frequencies. These compare what you observed to what you'd expect under the null hypothesis.

Chi-Square Test

There are two main versions, and knowing which one applies matters:

  • Goodness-of-fit test: checks whether a single categorical variable follows a hypothesized distribution (e.g., do M&M color frequencies match the company's stated proportions?)
  • Test of independence: checks whether two categorical variables are associated in a contingency table (e.g., is there a relationship between gender and voting preference?)
  • Test statistic: $\chi^2 = \sum \frac{(O - E)^2}{E}$, where $O$ is the observed count and $E$ is the expected count
  • Key condition: all expected cell counts should be at least 5. If they aren't, the chi-square approximation breaks down.
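
SciPy splits the two versions across two functions. The counts below are invented; `chisquare` defaults to equal expected counts, and `chi2_contingency` computes expected counts from the table's margins:

```python
from scipy.stats import chisquare, chi2_contingency

# Goodness-of-fit: observed color counts vs equal expected proportions
observed = [18, 22, 25, 15, 20]           # sums to 100, so E = 20 per cell
chi2_gof, p_gof = chisquare(observed)     # default: uniform expected counts

# Test of independence on a hypothetical 2x2 contingency table
# (rows: gender, columns: voting preference)
table = [[30, 10],
         [20, 40]]
chi2_ind, p_ind, dof, expected = chi2_contingency(table)
```

Checking the returned `expected` array against the "all cells at least 5" condition is a quick way to confirm the approximation is trustworthy.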

Compare: Chi-square test vs. T-test: Chi-square handles categorical data (counts in categories), while T-tests handle continuous data (measurements on a numerical scale). If your data involves frequencies or proportions in a table, Chi-square is your tool.


Non-Parametric Alternatives

When your data violates normality assumptions or uses ordinal scales (ranked data like survey ratings), non-parametric tests step in. They work with ranks rather than raw values, making fewer assumptions about the underlying distribution.

Wilcoxon Rank-Sum Test

  • Non-parametric alternative to the two-sample T-test for comparing two independent groups
  • Procedure: combine all observations from both groups, rank them from smallest to largest, then compare the sum of ranks for each group
  • Works with ordinal data or continuous data that isn't normally distributed
  • Also called the Mann-Whitney U test. Same underlying procedure, different name. Recognize both.
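
SciPy exposes the test under its Mann-Whitney name. The ratings below are invented ordinal-style data where a parametric T-test would be questionable:

```python
from scipy.stats import mannwhitneyu

# Hypothetical ordinal ratings from two independent groups
group_a = [3, 4, 2, 5, 4, 3]
group_b = [6, 7, 5, 8, 6, 7]

# Rank-based comparison; two-sided alternative by default in
# recent SciPy versions
u_stat, p_value = mannwhitneyu(group_a, group_b)
```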

Kruskal-Wallis Test

  • Non-parametric alternative to one-way ANOVA for comparing three or more independent groups
  • Uses ranks instead of raw values to test whether samples come from the same distribution
  • Like ANOVA, a significant result only tells you something differs. Follow-up pairwise tests are needed to identify which specific groups differ.
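
The three-group rank-based comparison is a single call; these satisfaction ratings are fabricated for illustration:

```python
from scipy.stats import kruskal

# Hypothetical satisfaction ratings (ordinal) across three groups
g1 = [2, 3, 3, 4, 2]
g2 = [5, 6, 5, 7, 6]
g3 = [8, 9, 7, 9, 8]

# H statistic is computed from ranks pooled across all groups
h_stat, p_value = kruskal(g1, g2, g3)
```

As with ANOVA, a small p-value here only says the groups are not all alike; pairwise follow-up tests locate the differences.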

Compare: Wilcoxon vs. Kruskal-Wallis: Wilcoxon handles two-group comparisons (parallels the T-test), while Kruskal-Wallis extends to three or more groups (parallels ANOVA). Both are rank-based and don't require normality.


Testing Assumptions: Normality Checks

Before applying parametric tests, you should verify that the normality assumption holds. These tests assess whether your data plausibly comes from a normal distribution. Note that for both tests, the null hypothesis is that the data is normal, so a small p-value means you reject normality.

Shapiro-Wilk Test

  • Preferred for small to moderate samples (roughly $n < 50$) and generally considered the most powerful test for detecting departures from normality
  • A significant result (low p-value) means you have evidence the data is not normally distributed
  • Caveat with large samples: very large $n$ can produce significant results even when the departure from normality is trivially small and practically irrelevant
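
A quick sketch with simulated data shows the test in action; the seed and distributions are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)

# Sample actually drawn from a normal distribution: the test should
# typically NOT reject normality here
normal_sample = rng.normal(loc=0, scale=1, size=40)
stat_n, p_n = shapiro(normal_sample)

# Heavily skewed (exponential) sample: the test typically rejects
# normality with a small p-value
skewed_sample = rng.exponential(scale=1, size=40)
stat_s, p_s = shapiro(skewed_sample)
```

The W statistic lies in (0, 1], with values near 1 indicating data consistent with normality.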

Kolmogorov-Smirnov Test

  • More versatile: can compare a sample against any specified distribution, not just the normal distribution. Also works for two-sample comparisons (testing whether two samples come from the same distribution).
  • Measures the maximum distance between the empirical cumulative distribution function and the theoretical one
  • Less powerful than Shapiro-Wilk for normality testing specifically, so default to Shapiro-Wilk when normality is your only question
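
Both the one-sample and two-sample forms are available in SciPy; the simulated samples here are arbitrary illustrations:

```python
import numpy as np
from scipy.stats import kstest, ks_2samp

rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=100)

# One-sample K-S test: maximum distance between the sample's empirical
# CDF and the standard normal CDF
d_stat, p_value = kstest(sample, 'norm')

# Two-sample K-S test: compares the empirical CDFs of two samples
other = rng.uniform(-2, 2, size=100)
d2, p2 = ks_2samp(sample, other)
```

Note that `kstest(sample, 'norm')` tests against the *standard* normal; testing against a normal with estimated mean and standard deviation changes the null distribution, which is one reason Shapiro-Wilk is preferred for plain normality checks.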

Compare: Shapiro-Wilk vs. Kolmogorov-Smirnov: Shapiro-Wilk is more powerful for normality testing, while K-S is more flexible for testing against other distributions. Use Shapiro-Wilk for normality checks unless you need to test a different reference distribution.


Quick Reference Table

  • Comparing means (2 groups): Z-test, T-test (two-sample), Paired T-test
  • Comparing means (3+ groups): ANOVA (one-way, two-way)
  • Comparing variances: F-test
  • Modeling relationships: Regression analysis (simple, multiple)
  • Categorical data: Chi-square test (goodness-of-fit or independence)
  • Non-parametric (2 groups): Wilcoxon rank-sum / Mann-Whitney U test
  • Non-parametric (3+ groups): Kruskal-Wallis test
  • Testing normality: Shapiro-Wilk test, Kolmogorov-Smirnov test

Self-Check Questions

  1. You have three treatment groups and want to compare their means, but a Shapiro-Wilk test shows significant non-normality. Which test should you use instead of ANOVA, and why?

  2. Compare the Z-test and T-test: what assumption distinguishes when you'd use each, and how does sample size factor into this decision?

  3. A researcher wants to determine if gender and voting preference are independent. Which test is appropriate, and what would the null hypothesis state?

  4. Which two tests serve as non-parametric alternatives to the T-test and ANOVA, respectively? What do they have in common methodologically?

  5. A before-and-after study measures the same subjects twice. Which specific type of T-test applies, and why would a two-sample T-test be inappropriate here?