
Statistical Inference

Nonparametric Tests


Why This Matters

When your data refuses to play by the rules—skewed distributions, ordinal measurements, small samples, or stubborn outliers—nonparametric tests become your best friends. These methods don't require the normality assumptions that parametric tests demand, making them essential tools when real-world data gets messy. You're being tested on understanding when to choose nonparametric over parametric approaches, how these tests use ranks instead of raw values, and why they sacrifice some statistical power for greater flexibility.

The core principle uniting these methods is elegantly simple: rank the data and work with those ranks. This transformation strips away the influence of extreme values and distributional quirks while preserving the essential ordering information. Whether you're comparing groups, measuring associations, or testing distributional assumptions, mastering nonparametric tests means knowing which tool fits which scenario—and being able to justify that choice on an FRQ. Don't just memorize test names; understand what each test's parametric equivalent is and what assumptions you're escaping by going nonparametric.


Comparing Two Groups: Paired vs. Independent Designs

The most fundamental distinction in group comparisons is whether your observations are linked (paired/related) or completely separate (independent). Paired designs control for individual variability by using each subject as their own control, while independent designs compare entirely different subjects.

Wilcoxon Signed-Rank Test

  • Nonparametric alternative to the paired t-test—use when you have two related measurements per subject but can't assume normally distributed differences
  • Ranks the absolute differences between paired observations, then applies the original signs to those ranks—this preserves both magnitude and direction information
  • Assumes symmetric distribution of differences around the median; if this fails, consider the simpler Sign Test instead
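In Python, SciPy's `wilcoxon` function implements this test. A minimal sketch with hypothetical pre/post scores for eight subjects:

```python
from scipy.stats import wilcoxon

# Hypothetical paired measurements for 8 subjects (illustrative data)
before = [72, 65, 80, 75, 68, 71, 77, 69]
after = [75, 70, 78, 82, 74, 70, 85, 73]

# wilcoxon ranks the absolute differences, reapplies the signs,
# and sums the signed ranks to form the test statistic
stat, p = wilcoxon(after, before)
print(f"W = {stat}, p = {p:.4f}")
```

Passing the two paired samples directly lets SciPy compute the differences; you could equivalently pass a single list of differences.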

Mann-Whitney U Test

  • Compares two independent groups by pooling all observations, ranking them together, and examining whether one group's ranks cluster higher than the other's
  • Tests whether one distribution is stochastically greater—essentially asking if randomly selected values from Group A tend to exceed those from Group B
  • The U statistic counts how many times a value from one group exceeds values from the other group; extreme U values indicate group differences
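The same idea in SciPy, using hypothetical reaction times for two independent groups (`mannwhitneyu` handles the pooling and ranking internally):

```python
from scipy.stats import mannwhitneyu

# Hypothetical reaction times (ms) for two independent groups
group_a = [210, 225, 198, 240, 215, 230]
group_b = [250, 262, 244, 271, 258, 239]

# U counts how often a value in group_a exceeds a value in group_b;
# with 6 observations per group, U can range from 0 to 36
u, p = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u}, p = {p:.4f}")
```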

Sign Test

  • Simplest paired comparison test—only considers whether differences are positive or negative, ignoring magnitude entirely
  • Extremely robust to outliers since a difference of 0.01 counts the same as a difference of 1,000; trade-off is reduced statistical power
  • Best for small samples or severely non-normal data where even the Wilcoxon Signed-Rank assumptions seem questionable
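Because the Sign Test only counts directions, it reduces to a binomial test on the number of positive differences. A sketch with hypothetical paired differences, using SciPy's `binomtest`:

```python
from scipy.stats import binomtest

# Hypothetical paired differences (post - pre); exact zeros would be dropped
diffs = [0.3, -0.1, 1.2, 0.5, 0.05, -0.2, 0.9, 0.4, 0.7, 0.6]

n_pos = sum(d > 0 for d in diffs)   # positive differences: 8
n = sum(d != 0 for d in diffs)      # non-zero differences: 10

# Under H0 (median difference = 0) the signs are fair coin flips
result = binomtest(n_pos, n, p=0.5, alternative="two-sided")
print(f"{n_pos}/{n} positive, p = {result.pvalue:.4f}")  # p = 0.1094
```

Note that a difference of 0.05 and a difference of 1.2 contribute identically, which is exactly the robustness-for-power trade described above.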

Compare: Wilcoxon Signed-Rank vs. Sign Test—both handle paired data, but Wilcoxon uses magnitude information (ranks of differences) while the Sign Test only counts directions. If your FRQ mentions outliers or asks about the most robust option, the Sign Test is your answer; if it asks about power, Wilcoxon wins.


Extending to Multiple Groups: One-Way and Repeated Measures

When you have three or more groups to compare, you need tests that generalize the two-group methods. The key distinction remains whether groups are independent or related (repeated measures/blocked designs).

Kruskal-Wallis Test

  • Nonparametric alternative to one-way ANOVA—compares three or more independent groups using rank sums; when the groups share a similar distribution shape, this amounts to comparing medians
  • Test statistic (H) is based on comparing each group's mean rank to the overall mean rank; larger H indicates greater between-group differences
  • Significant result requires follow-up with pairwise comparisons (like Dunn's test) to identify which specific groups differ
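SciPy's `kruskal` accepts each group as a separate argument. A sketch with three hypothetical independent groups:

```python
from scipy.stats import kruskal

# Hypothetical scores from three independent groups
g1 = [27, 30, 25, 29, 31]
g2 = [34, 38, 33, 36, 40]
g3 = [26, 28, 24, 27, 30]

# H compares each group's mean rank to the overall mean rank;
# under H0 it is approximately chi-squared with (k - 1) df
h, p = kruskal(g1, g2, g3)
print(f"H = {h:.3f}, p = {p:.4f}")
```

A significant `p` here only tells you that at least one group differs; pairwise follow-ups (e.g., Dunn's test) are a separate step.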

Friedman Test

  • Nonparametric alternative to repeated measures ANOVA—handles three or more related groups or blocked designs
  • Ranks data within each block (subject or matched set), then compares rank sums across conditions—this controls for individual baseline differences
  • Common applications include crossover studies, taste tests with the same judges rating multiple products, or pre/post/follow-up measurements
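With SciPy's `friedmanchisquare`, each argument is one condition's measurements, ordered by subject so that rows line up across conditions. Hypothetical data for six judges rating three products:

```python
from scipy.stats import friedmanchisquare

# Hypothetical ratings: the same 6 judges score 3 products;
# position i in each list is judge i's rating
product_a = [7, 6, 8, 5, 7, 6]
product_b = [8, 7, 9, 7, 8, 8]
product_c = [5, 5, 6, 4, 6, 5]

# Ranks are computed within each judge (block), then rank sums
# are compared across products
stat, p = friedmanchisquare(product_a, product_b, product_c)
print(f"chi2 = {stat:.3f}, p = {p:.4f}")
```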

Compare: Kruskal-Wallis vs. Friedman—both extend to 3+ groups, but Kruskal-Wallis assumes independence while Friedman requires related/blocked data. Think of Kruskal-Wallis as "stacked Mann-Whitney" and Friedman as "stacked Wilcoxon Signed-Rank."


Measuring Association: Rank-Based Correlations

When examining relationships between two variables, rank correlations provide robust alternatives to Pearson's r. These methods assess monotonic relationships—whether variables consistently increase or decrease together—rather than strictly linear ones.

Spearman's Rank Correlation

  • Converts both variables to ranks, then calculates Pearson's correlation on those ranks—denoted r_s or ρ (rho)
  • Measures monotonic association rather than linear; a perfect Spearman correlation means the relationship is perfectly monotonic, not necessarily straight
  • More robust than Pearson's to outliers and non-normality; ideal for ordinal data or continuous data with questionable distributions
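The "Pearson on ranks" definition can be verified directly. A sketch with hypothetical monotonic-but-nonlinear data containing an outlier (`rankdata` converts values to ranks):

```python
from scipy.stats import spearmanr, pearsonr, rankdata

# Hypothetical data: perfectly monotonic, far from linear, with an outlier
x = [1, 2, 3, 4, 5, 100]
y = [2, 4, 9, 15, 28, 310]

rho, _ = spearmanr(x, y)
r_on_ranks, _ = pearsonr(rankdata(x), rankdata(y))

# Spearman's r_s is exactly Pearson's r computed on the ranks;
# here the relationship is perfectly monotonic, so r_s = 1.0
print(f"r_s = {rho:.3f}, Pearson on ranks = {r_on_ranks:.3f}")
```

Pearson's r on the raw `x` and `y` would be dragged around by the outlier; the rank transform removes that influence entirely.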

Kendall's Tau

  • Counts concordant vs. discordant pairs—a pair is concordant if both variables rank the same subject higher; discordant if they disagree
  • Better for small samples and data with ties than Spearman's; also has cleaner statistical properties for hypothesis testing
  • Tau-b version adjusts for ties, making it the standard choice when tied ranks are common in your data
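SciPy's `kendalltau` computes the tau-b variant by default, so ties are handled automatically. A sketch with hypothetical ordinal ratings that include tied ranks:

```python
from scipy.stats import kendalltau

# Hypothetical ordinal ratings from two reviewers (ties present)
reviewer_1 = [1, 2, 2, 3, 4, 5]
reviewer_2 = [2, 1, 3, 3, 5, 4]

# tau-b counts concordant vs. discordant pairs, adjusting for ties
tau, p = kendalltau(reviewer_1, reviewer_2)
print(f"tau-b = {tau:.3f}, p = {p:.4f}")
```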

Compare: Spearman's r_s vs. Kendall's τ—both measure monotonic association, but Spearman transforms to ranks then correlates, while Kendall directly counts agreement/disagreement between pairs. Kendall's is preferred for small samples; Spearman's is more intuitive and commonly reported.


Testing Distributions: Goodness of Fit

Sometimes you need to test whether your data follows a specific distribution or whether two samples come from the same population. These tests examine the entire shape of distributions rather than just central tendency.

Kolmogorov-Smirnov Test

  • Measures maximum vertical distance between the empirical cumulative distribution function (ECDF) and a theoretical distribution (one-sample) or between two ECDFs (two-sample)
  • Sensitive to any distributional difference—location, spread, or shape—making it a general-purpose goodness-of-fit test
  • D statistic represents the supremum of absolute differences; reject the null when D exceeds critical values for your sample size
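A one-sample sketch in SciPy, testing simulated data against an exponential distribution (the `scale` value and seed are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import kstest, expon

rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=200)  # simulated "reaction times"

# D is the largest vertical gap between the sample's ECDF
# and the hypothesized exponential CDF
d, p = kstest(sample, expon(scale=2.0).cdf)
print(f"D = {d:.4f}, p = {p:.4f}")
```

For the two-sample version, `scipy.stats.ks_2samp(sample_a, sample_b)` compares two ECDFs the same way.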

Compare: Kolmogorov-Smirnov vs. Mann-Whitney U—both can compare two samples, but K-S tests whether distributions are identical in any way (shape, spread, location), while Mann-Whitney specifically tests for location shift. Use K-S when you care about the whole distribution; use Mann-Whitney when you're focused on central tendency.


Resampling Methods: Distribution-Free Inference

Modern computing enables powerful nonparametric approaches that build reference distributions directly from your data. These methods make minimal assumptions and provide exact or approximate inference through computational brute force.

Permutation Tests

  • Generate the null distribution by repeatedly shuffling group labels and recalculating the test statistic—if groups don't differ, shuffling shouldn't matter
  • Provide exact p-values when all permutations are computed; approximations work well with random subsets of permutations
  • Highly flexible—can be applied to virtually any test statistic, making them ideal when no standard test fits your situation
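The shuffle-and-recompute loop is short enough to write by hand. A minimal sketch for a difference in means, with hypothetical data (the function name and the add-one p-value correction are conventions, not a fixed recipe):

```python
import numpy as np

def perm_test_mean_diff(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # shuffling = randomly reassigning group labels
        diff = abs(pooled[:len(a)].mean() - pooled[len(a):].mean())
        count += diff >= observed
    return (count + 1) / (n_perm + 1)  # add-one keeps p strictly above zero

# Clearly separated groups: shuffled labels rarely reproduce the observed gap
p = perm_test_mean_diff([1, 2, 3], [10, 11, 12])
print(f"p = {p:.4f}")
```

With only 3 observations per group there are just 20 label assignments, so the smallest achievable two-sided p-value is 2/20 = 0.1; the random-shuffle estimate lands near that floor.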

Bootstrap Methods

  • Resample with replacement from your observed data to create thousands of "bootstrap samples," each the same size as the original
  • Estimate sampling distributions of any statistic—means, medians, regression coefficients, or complex estimators—without distributional assumptions
  • Confidence intervals can be constructed using percentile method, bias-corrected methods, or other bootstrap approaches; particularly valuable for small samples
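A sketch of the percentile method for a median (one of several bootstrap CI constructions; the helper name and the skewed sample are illustrative):

```python
import numpy as np

def bootstrap_ci(data, stat=np.median, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-method bootstrap CI for an arbitrary statistic."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, float)
    # Resample with replacement, same size as the original, n_boot times
    boots = [stat(rng.choice(data, size=data.size, replace=True))
             for _ in range(n_boot)]
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Hypothetical skewed sample with an outlier (median = 5.5)
sample = [2, 3, 3, 4, 5, 6, 7, 8, 9, 50]
lo, hi = bootstrap_ci(sample)
print(f"95% CI for the median: ({lo:.1f}, {hi:.1f})")
```

Swapping `stat=np.median` for any other function (a trimmed mean, a regression coefficient) is what makes the bootstrap so general.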

Compare: Permutation Tests vs. Bootstrap—permutation tests shuffle labels to test hypotheses under the null, while bootstrap resamples to estimate the variability of statistics. Use permutation for hypothesis testing ("is there a difference?"); use bootstrap for estimation ("what's the confidence interval?").


Quick Reference Table

Concept | Best Examples
Paired two-group comparison | Wilcoxon Signed-Rank, Sign Test
Independent two-group comparison | Mann-Whitney U
Multiple independent groups | Kruskal-Wallis
Multiple related groups/blocked designs | Friedman Test
Rank-based correlation | Spearman's r_s, Kendall's τ
Distributional goodness of fit | Kolmogorov-Smirnov
Hypothesis testing via resampling | Permutation Tests
Confidence intervals via resampling | Bootstrap Methods

Self-Check Questions

  1. You have pre-test and post-test scores from 15 participants, but the differences are heavily skewed with two extreme outliers. Which nonparametric test would be most robust, and which would have more power if the outliers weren't so extreme?

  2. A researcher wants to compare customer satisfaction ratings (on a 1-5 scale) across four different store locations with different customers at each location. Which test should they use, and what's the parametric equivalent they're avoiding?

  3. Compare and contrast Spearman's rank correlation and Kendall's tau: What do they both measure, how do their calculations differ, and when might you prefer one over the other?

  4. An FRQ asks you to test whether a sample of reaction times comes from an exponential distribution. Which nonparametric test is designed for this type of question, and what does its test statistic represent?

  5. Explain why permutation tests and bootstrap methods are both called "resampling methods" but serve fundamentally different purposes. Give a scenario where each would be the appropriate choice.