
10.5 Hypothesis Testing for Two Means and Two Proportions


Written by the Fiveable Content Team • Last updated August 2025

Hypothesis Testing for Two Means and Two Proportions

When you're comparing two groups, you need a formal way to decide whether the difference you observe is real or just due to random chance. That's what two-sample hypothesis testing does. This topic covers how to test differences between two means and two proportions, how to interpret your results, and what factors affect the strength of your conclusions.

Hypothesis Testing for Two Means

There are two setups for comparing means, and picking the right one depends on how your data were collected.

Methods for two-population mean tests

Independent samples test applies when your two samples come from separate populations with no natural link between individual observations. For example, comparing test scores from two different classrooms.

  • Sample sizes don't need to be equal.
  • The test statistic is:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}

  • The denominator is the standard error of the difference, built from each sample's variance and size.
  • When population variances are unknown (which is almost always the case), degrees of freedom are approximated using the Welch-Satterthwaite equation. You won't need to compute this by hand in most intro courses, but know that it adjusts the degrees of freedom downward to be more conservative.
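As a sketch, the independent-samples statistic and the Welch-Satterthwaite degrees of freedom can be computed with the standard library alone. The scores below are invented for illustration:

```python
# Welch's two-sample t statistic from raw data (hypothetical scores).
import math
import statistics

group1 = [82, 75, 90, 68, 77, 85, 79, 88]   # e.g. classroom A
group2 = [70, 65, 80, 72, 68, 74, 71, 69]   # e.g. classroom B

def welch_t(x, y):
    """Return the t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)  # sample variances
    se = math.sqrt(v1 / n1 + v2 / n2)        # standard error of the difference
    t = (statistics.mean(x) - statistics.mean(y)) / se
    # Welch-Satterthwaite approximation; always <= n1 + n2 - 2
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    return t, df

t, df = welch_t(group1, group2)
print(f"t = {t:.3f}, df = {df:.1f}")
```

Note that the approximated degrees of freedom (about 11.7 here) come out smaller than the pooled value of 14, which is the conservative adjustment described above.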

Paired samples test applies when each observation in one sample is naturally matched with an observation in the other. Think before-and-after measurements on the same subjects.

  • Sample sizes must be equal (since every observation needs a partner).
  • You first compute the difference d_i for each pair, then treat those differences as a single sample.
  • The test statistic is:

t = \frac{\bar{d}}{\frac{s_d}{\sqrt{n}}}

where \bar{d} is the mean of the differences, s_d is the standard deviation of the differences, and n is the number of pairs.

  • Degrees of freedom = n - 1.

A common mistake: using an independent samples test when the data are actually paired. If subjects appear in both groups, use the paired test. It's more powerful because it controls for individual variability.
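The paired procedure reduces to a one-sample test on the differences. A minimal sketch, using invented before-and-after measurements on the same six subjects:

```python
# Paired t statistic: compute d_i for each pair, then t = d_bar / (s_d / sqrt(n)).
import math
import statistics

before = [120, 135, 118, 142, 128, 131]   # hypothetical pre-treatment values
after  = [115, 130, 120, 136, 122, 127]   # same subjects, post-treatment

diffs = [b - a for b, a in zip(before, after)]   # d_i for each pair
n = len(diffs)
d_bar = statistics.mean(diffs)                   # mean of the differences
s_d = statistics.stdev(diffs)                    # sd of the differences

t = d_bar / (s_d / math.sqrt(n))                 # compare to t with df = n - 1
print(f"t = {t:.3f}, df = {n - 1}")
```

Because each subject serves as their own control, the variability in s_d reflects only within-subject change, which is why pairing tends to give a larger t than an (incorrect) independent-samples analysis of the same numbers.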

[Image: Hypothesis Test for a Difference in Two Population Means (1 of 2), from Concepts in Statistics]

Hypothesis Testing for Two Proportions

This test compares the proportion of "successes" between two groups, such as whether the pass rate differs between two sections of a course.

The test statistic is:

z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}

where \hat{p} is the pooled proportion, calculated as the total number of successes across both samples divided by the total sample size:

\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}

You use the pooled proportion because the null hypothesis assumes the two population proportions are equal.
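The whole calculation is short enough to do directly. A sketch with hypothetical counts (passes out of students in two course sections):

```python
# Pooled two-proportion z statistic; counts are invented for illustration.
import math

x1, n1 = 45, 60   # successes / sample size, section 1
x2, n2 = 38, 62   # successes / sample size, section 2

p1_hat = x1 / n1
p2_hat = x2 / n2
p_pool = (x1 + x2) / (n1 + n2)   # pooled proportion, used because H0 says p1 = p2

se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se
print(f"p_pool = {p_pool:.3f}, z = {z:.3f}")
```

With these counts all four normality conditions are met (n1*p1_hat = 45, n1*(1 - p1_hat) = 15, n2*p2_hat = 38, n2*(1 - p2_hat) = 24, each at least 10), so the z approximation is reasonable.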

Conditions for using this test: The sampling distribution is approximately normal (by the Central Limit Theorem) when all four of these quantities are at least 10:

  • n_1\hat{p}_1, n_1(1 - \hat{p}_1), n_2\hat{p}_2, and n_2(1 - \hat{p}_2)

Making the decision: Reject the null hypothesis if the test statistic falls in the rejection region, which depends on your significance level (\alpha) and whether the test is left-tailed, right-tailed, or two-tailed. For a two-tailed test at \alpha = 0.05, the critical values are z = \pm 1.96.
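Equivalently, you can convert the z statistic to a two-tailed p-value and compare it to \alpha. The standard normal CDF can be written with the error function from the standard library; the z value below is hypothetical:

```python
# Two-tailed p-value for a z statistic, using the identity
# Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) for the standard normal CDF.
import math

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

z = 1.62                                   # hypothetical test statistic
p_value = 2 * (1 - norm_cdf(abs(z)))       # both tails for a two-tailed test
alpha = 0.05
print(f"p = {p_value:.4f}; reject H0: {p_value < alpha}")
```

Here |z| = 1.62 is inside (-1.96, 1.96), and correspondingly the p-value (about 0.105) exceeds 0.05, so both the critical-value and p-value approaches lead to the same decision, as they always do.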

Interpreting Results

[Image: Estimating the Difference in Two Population Means, from Concepts in Statistics]

Interpretation of two-sample test results

Using p-values:

The p-value is the probability of getting a test statistic as extreme as (or more extreme than) what you observed, assuming the null hypothesis is true.

  • If p-value < \alpha (your significance level), reject the null hypothesis. There's sufficient evidence of a difference.
  • If p-value ≥ \alpha, fail to reject the null hypothesis. You don't have enough evidence to conclude a difference exists.

Using confidence intervals:

A confidence interval for the difference between two parameters gives a range of plausible values for that difference.

  • If the interval does not contain 0, reject the null hypothesis. Zero would mean "no difference," and it's not a plausible value.
  • If the interval does contain 0, you can't rule out the possibility that there's no real difference.
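As a sketch of the interval approach for two proportions (using the unpooled standard error that is conventional for confidence intervals, and the same hypothetical pass counts as before):

```python
# 95% confidence interval for p1 - p2; counts are hypothetical.
import math

x1, n1 = 45, 60
x2, n2 = 38, 62
p1, p2 = x1 / n1, x2 / n2

# Unpooled standard error: each proportion estimated separately.
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_star = 1.96                                  # critical value for 95% confidence

lo = (p1 - p2) - z_star * se
hi = (p1 - p2) + z_star * se
contains_zero = lo <= 0 <= hi
print(f"95% CI: ({lo:.3f}, {hi:.3f}); contains 0: {contains_zero}")
```

The interval comes out to roughly (-0.026, 0.300): it contains 0, so at the 5% level you cannot rule out "no difference" between the two sections, matching the fail-to-reject decision from the z test on the same data.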

Drawing conclusions carefully:

  • Rejecting the null means you found a statistically significant difference between the two populations.
  • Failing to reject the null does not prove the populations are the same. It only means your data weren't strong enough to detect a difference. This could happen because the sample was too small or the data were highly variable.

Statistical Power and Effect Size

These concepts help you evaluate the quality of your test, not just the result.

  • Effect size measures how large the difference between groups actually is, separate from whether it's statistically significant. A tiny difference can be "significant" with a huge sample, but that doesn't mean it matters practically.
  • Statistical power is the probability that your test correctly rejects the null hypothesis when a real difference exists. Higher power means you're less likely to miss a real effect.
  • Three factors increase power:
    • Larger sample sizes
    • Larger effect sizes (bigger real differences are easier to detect)
    • Higher significance level (e.g., \alpha = 0.10 vs. \alpha = 0.05), though this also increases the risk of a false positive
  • Power analysis is done before collecting data to figure out how large your samples need to be to have a reasonable chance (often 80% power) of detecting a meaningful difference.
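A back-of-the-envelope version of this planning step uses the normal-approximation formula n ≈ 2(z_{\alpha/2} + z_\beta)^2 / d^2 per group, where d is the standardized effect size you hope to detect. The target values below (\alpha = 0.05, 80% power, d = 0.5) are assumptions chosen for illustration:

```python
# Rough per-group sample size for a two-sample test of means,
# via the normal approximation (a t-based calculation adds a little more).
import math

z_alpha_2 = 1.96    # two-tailed critical value for alpha = 0.05
z_beta = 0.84       # z value for 80% power (Phi(0.84) ≈ 0.80)
d = 0.5             # hypothetical standardized effect size (Cohen's d)

n_per_group = 2 * (z_alpha_2 + z_beta) ** 2 / d ** 2
print(f"n per group ≈ {math.ceil(n_per_group)}")
```

This gives about 63 subjects per group; note how the n you need grows with 1/d², so halving the effect size you want to detect quadruples the required sample.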