📊Honors Statistics Unit 10 Review

10.1 Two Population Means with Unknown Standard Deviations

Written by the Fiveable Content Team • Last updated August 2025
Comparing Two Population Means with Unknown Standard Deviations

When you want to compare the averages of two groups but don't know the population standard deviations (which is almost always the case with real data), you use a two-sample t-test. This test determines whether the difference between two sample means is statistically significant or just due to random sampling variability.

The core idea: calculate a t-statistic from your data, find the corresponding p-value, and compare it to your significance level to make a decision. This section covers the formula, degrees of freedom, effect size, and the errors you can make along the way.


T-Statistic Calculation for Two Population Means

Before you can use this test, two assumptions need to hold:

  • Both populations are approximately normally distributed (or sample sizes are large enough for the Central Limit Theorem to kick in)
  • The two samples are independent, meaning the data in one group doesn't influence or relate to the data in the other

The test statistic formula is:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}

where:

  • \bar{x}_1 and \bar{x}_2 are the sample means
  • s_1^2 and s_2^2 are the sample variances
  • n_1 and n_2 are the sample sizes

The numerator captures how far apart the two sample means are. The denominator is the standard error of the difference, which accounts for how much variability exists within each group and how large the samples are. A bigger difference in means or less variability produces a larger t-value, which is stronger evidence against the null hypothesis.
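The formula above translates directly into a few lines of Python. This is a minimal sketch from summary statistics (the function name `welch_t` is my own label, not a standard API):

```python
import math

def welch_t(xbar1, xbar2, s1, s2, n1, n2):
    """t-statistic for comparing two means with unknown standard deviations."""
    # Standard error of the difference: sqrt(s1^2/n1 + s2^2/n2)
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    # Numerator: how far apart the sample means are
    return (xbar1 - xbar2) / se
```

Plugging in a wider gap between means, or smaller sample variances, makes |t| larger — exactly the "stronger evidence" behavior described above.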

Setting up the hypotheses:

  • Null hypothesis (H_0): \mu_1 = \mu_2 (no difference between population means)
  • Alternative hypothesis (H_a) takes one of three forms depending on your research question:
    • Two-tailed: \mu_1 \neq \mu_2
    • Left-tailed: \mu_1 < \mu_2
    • Right-tailed: \mu_1 > \mu_2

The p-value is the probability of observing a t-statistic at least as extreme as the one you calculated, assuming H_0 is true. If the p-value is less than your significance level (\alpha, typically 0.05), you reject H_0.

Quick example: Suppose Class A (n_1 = 30) scores an average of 78 with s_1 = 10, and Class B (n_2 = 35) scores an average of 83 with s_2 = 12. The numerator is 78 - 83 = -5. The denominator is \sqrt{\frac{100}{30} + \frac{144}{35}} = \sqrt{3.333 + 4.114} = \sqrt{7.447} \approx 2.729. So t \approx \frac{-5}{2.729} \approx -1.832. You'd then compare this to the t-distribution with the appropriate degrees of freedom.
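If SciPy is available (an assumption — it isn't part of the standard library), `scipy.stats.ttest_ind_from_stats` with `equal_var=False` runs this same test directly from summary statistics, which makes a handy check on the hand calculation:

```python
from scipy import stats

# Class A: mean 78, s = 10, n = 30; Class B: mean 83, s = 12, n = 35
res = stats.ttest_ind_from_stats(
    mean1=78, std1=10, nobs1=30,
    mean2=83, std2=12, nobs2=35,
    equal_var=False,  # no equal-variance assumption (Welch's t-test)
)
print(res.statistic)  # about -1.832, matching the hand calculation
print(res.pvalue)     # two-tailed p-value
```

Note that the default is two-tailed; pass `alternative="less"` or `"greater"` for a one-tailed test.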

[Image: T-score calculation for population means — Hypothesis Test for a Difference in Two Population Means (1 of 2) | Concepts in Statistics]

Degrees of Freedom in T-Distributions

Because the two samples can have different variances and different sizes, the degrees of freedom for this test aren't as simple as just adding sample sizes. You use the Welch-Satterthwaite equation:

df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1-1} + \frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2-1}}

This formula almost never gives a whole number, so you round down to the nearest integer. Rounding down is the conservative choice because it makes the critical value slightly larger, making it harder to reject H_0.

Why degrees of freedom matter:

  • Smaller df produces a t-distribution with heavier tails, meaning you need a more extreme t-value to reach significance
  • Larger df makes the t-distribution look more like the standard normal distribution
  • Getting the df right ensures your p-value is accurate

For the example above, you'd plug the values into the Welch-Satterthwaite formula to get the exact df before looking up the p-value. In practice, your calculator or software handles this automatically.
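The Welch-Satterthwaite formula is easy to sketch yourself; here it is applied to the class example's numbers (the function name `welch_df` is illustrative, not a library API):

```python
def welch_df(s1, s2, n1, n2):
    """Welch-Satterthwaite degrees of freedom for a two-sample t-test."""
    # Per-group variance of the sample mean: s^2 / n
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

df = welch_df(10, 12, 30, 35)
print(df)       # about 62.96 for the class example
print(int(df))  # round down to 62, the conservative choice
```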

[Image: T-score calculation for population means — Comparing two means | Learning Statistics with R]

Cohen's d for Effect Size

A statistically significant result doesn't necessarily mean the difference is practically meaningful. With a large enough sample, even a tiny difference can produce a small p-value. Cohen's d measures the actual size of the difference in standard deviation units.

d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}

where s_p is the pooled standard deviation:

s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}

The pooled standard deviation is a weighted average of the two sample standard deviations, giving more weight to the larger sample.

Interpretation benchmarks (these are conventions, not rigid cutoffs):

  • d \approx 0.2: small effect
  • d \approx 0.5: medium effect
  • d \approx 0.8: large effect

For instance, if two drugs lower blood pressure by an average difference of 3 mmHg with a pooled standard deviation of 6 mmHg, then d = 3/6 = 0.5, a medium effect. That tells you the difference is about half a standard deviation, which is clinically noticeable.

Cohen's d is independent of sample size, so you can compare effect sizes across different studies even when sample sizes vary widely.
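Both formulas above can be sketched in a few lines; here they're applied to the class example from earlier (the function name is my own, for illustration):

```python
import math

def cohens_d(xbar1, xbar2, s1, s2, n1, n2):
    """Cohen's d using the pooled standard deviation."""
    # Pooled SD: variance average weighted by each group's df (n - 1)
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (xbar1 - xbar2) / sp

d = cohens_d(78, 83, 10, 12, 30, 35)
print(d)  # about -0.45: a small-to-medium effect, regardless of sample size
```

Compare with the blood-pressure example: `cohens_d(3, 0, 6, 6, 50, 50)` returns exactly 0.5, the medium effect described above.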

Statistical Inference and Error Types

Every hypothesis test carries the risk of making the wrong conclusion. Understanding these errors helps you interpret results honestly.

  • Type I error (false positive): You reject H_0 when it's actually true. The probability of this equals your significance level \alpha. If \alpha = 0.05, you accept a 5% chance of this error.
  • Type II error (false negative): You fail to reject H_0 when it's actually false. The probability of this is denoted \beta.
  • Statistical power = 1 - \beta, the probability of correctly detecting a real difference. Power increases with larger sample sizes, larger effect sizes, and higher \alpha levels.
  • Confidence interval for the difference in means: Instead of just a yes/no decision, you can construct an interval estimate for \mu_1 - \mu_2. If the interval doesn't contain 0, that's consistent with rejecting H_0.
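A sketch of the confidence-interval approach, assuming SciPy is available for the t critical value (the function name is illustrative, and the numbers reuse the class example):

```python
from scipy import stats

def diff_of_means_ci(xbar1, xbar2, s1, s2, n1, n2, conf=0.95):
    """Welch confidence interval for mu1 - mu2."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    se = (v1 + v2) ** 0.5                                         # standard error
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))   # Welch df
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df)                  # critical t-value
    diff = xbar1 - xbar2
    return diff - t_crit * se, diff + t_crit * se

lo, hi = diff_of_means_ci(78, 83, 10, 12, 30, 35)
print(lo, hi)  # roughly (-10.45, 0.45): the interval contains 0
```

Because this 95% interval contains 0, it agrees with the hypothesis test at \alpha = 0.05: t ≈ -1.832 was not extreme enough to reject H_0 two-tailed.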

A common mistake: thinking "fail to reject H_0" means the two means are equal. It doesn't. It just means you didn't find enough evidence to conclude they're different, which could be because the sample was too small (low power) rather than because no difference exists.