
🎣 Statistical Inference

Key Concepts in Statistical Power Calculations


Why This Matters

Statistical power is the backbone of good study design—it determines whether your research can actually detect the effects you're looking for. When you understand power calculations, you're not just plugging numbers into formulas; you're grasping the fundamental trade-offs between sample size, effect size, significance level, and error rates that govern all hypothesis testing. These concepts appear repeatedly in inference questions, and the AP exam loves asking you to reason through why a study might fail to find a significant result even when an effect exists.

Don't just memorize that "larger samples = more power." You need to understand why each factor influences power and how these factors interact. When an FRQ asks you to critique a study design or explain a non-significant result, your ability to connect power concepts to real statistical reasoning will earn you full credit. Let's break this down by the underlying principles.


The Core Framework: What Power Actually Measures

Power quantifies your test's sensitivity—its ability to say "yes, there's an effect" when an effect truly exists. Think of it as your statistical radar's detection capability. The higher your power, the less likely you are to miss a real signal.

Definition of Statistical Power

  • Power equals $1 - \beta$—where $\beta$ is the probability of a Type II error (failing to detect a true effect)
  • Target power of 0.80 is conventional, meaning researchers typically accept a 20% chance of missing a real effect
  • Power applies only when the alternative hypothesis is true—it's meaningless to discuss power when $H_0$ is actually correct
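
To make the definition concrete, here is a minimal simulation sketch in Python: it repeatedly draws samples from a population where the alternative is true and records how often the test rejects $H_0$. The true mean of 0.5, $\sigma = 1$, $n = 30$, and $\alpha = 0.05$ are illustrative assumptions, not values from this guide.

```python
# Minimal sketch: power as the long-run rejection rate when the alternative is true.
# All parameter values below are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mean, sigma, n, alpha = 0.5, 1.0, 30, 0.05
n_sims = 10_000

rejections = 0
for _ in range(n_sims):
    sample = rng.normal(true_mean, sigma, size=n)      # the alternative is true here
    t_stat, p_value = stats.ttest_1samp(sample, 0.0)   # test H_0: mu = 0
    rejections += (p_value < alpha)

power_hat = rejections / n_sims        # empirical estimate of 1 - beta
print(f"Estimated power: {power_hat:.3f}  (so beta is roughly {1 - power_hat:.3f})")
```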

Type I and Type II Errors

  • Type I error ($\alpha$) occurs when you reject a true null hypothesis—a false positive that sees an effect that isn't there
  • Type II error ($\beta$) occurs when you fail to reject a false null hypothesis—a false negative that misses a real effect
  • The $\alpha$-$\beta$ tradeoff is unavoidable—lowering one error rate typically increases the other unless you increase sample size

Compare: Type I vs. Type II errors—both represent incorrect conclusions, but Type I means claiming an effect exists when it doesn't, while Type II means missing an effect that's really there. If an FRQ describes a study that "failed to find significance," consider whether low power (high Type II risk) might explain the result.
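
One way to see the tradeoff numerically is to hold the sample size and effect size fixed and watch $\beta$ grow as $\alpha$ shrinks. The sketch below uses Python's statsmodels library as one option (G*Power or R's pwr package, mentioned later in this guide, would work just as well); $d = 0.5$ and $n = 64$ per group are assumed for illustration.

```python
# Sketch of the alpha-beta tradeoff: with n and effect size held fixed, a smaller
# alpha shrinks the rejection region, so beta (missed real effects) grows.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alpha in (0.05, 0.01, 0.001):
    power = analysis.power(effect_size=0.5, nobs1=64, alpha=alpha)  # assumed d and n
    print(f"alpha = {alpha:<6}  power = {power:.3f}  beta = {1 - power:.3f}")
```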


The Four Levers: Factors You Can Control

Power isn't fixed—it responds to choices you make during study design. Understanding these relationships helps you diagnose underpowered studies and design better ones. Each lever pulls power in a predictable direction.

Sample Size Effects

  • Larger samples increase power by reducing standard error and narrowing the sampling distribution
  • The relationship is nonlinear—doubling sample size doesn't double power; gains diminish as $n$ grows
  • Sample size is often the most practical lever since effect size and variability may be fixed by the research context
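
A quick sketch of the diminishing-returns point, assuming a two-sample t-test with $d = 0.4$ and $\alpha = 0.05$ (both illustrative): each doubling of $n$ adds less power than the one before it.

```python
# Diminishing returns: doubling the per-group n does not double power.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n in (25, 50, 100, 200, 400):
    power = analysis.power(effect_size=0.4, nobs1=n, alpha=0.05)  # assumed d and alpha
    print(f"n per group = {n:>3}  power = {power:.3f}")
```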

Significance Level (α\alpha) Effects

  • Higher $\alpha$ increases power by making it easier to reject $H_0$—you're widening the rejection region
  • The standard $\alpha = 0.05$ balances Type I error control against reasonable power
  • Lowering $\alpha$ to 0.01 requires larger samples to maintain the same power level
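
To see the cost of a stricter $\alpha$ in sample-size terms, the sketch below solves for the per-group $n$ that keeps power at 0.80 for an assumed $d = 0.5$; the specific numbers are illustrative.

```python
# Holding power at 0.80 and d at 0.5, tightening alpha raises the required n.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alpha in (0.05, 0.01):
    n_needed = analysis.solve_power(effect_size=0.5, alpha=alpha, power=0.80)
    print(f"alpha = {alpha}: roughly {n_needed:.0f} subjects per group")
```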

Compare: Sample size vs. significance level—both affect power, but increasing $n$ improves power without increasing Type I error risk, while raising $\alpha$ boosts power at the cost of more false positives. This is why researchers typically adjust sample size rather than $\alpha$.

Variability in the Data

  • Less variability increases power because the signal (effect) is easier to distinguish from noise
  • Standard deviation appears in power formulas—smaller $\sigma$ means narrower distributions and clearer separation
  • Study design can reduce variability through blocking, matched pairs, or controlling extraneous variables
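
Variability enters the calculation through the standardized effect size: the same raw difference divided by a larger $\sigma$ is a smaller $d$, and power drops accordingly. The raw 5-point difference, the two $\sigma$ values, and $n = 50$ below are all assumed for illustration.

```python
# Same raw difference, different sigma: more noise means a smaller d and less power.
from statsmodels.stats.power import TTestIndPower

raw_difference = 5.0
analysis = TTestIndPower()
for sigma in (10.0, 20.0):
    d = raw_difference / sigma                       # standardized effect size
    power = analysis.power(effect_size=d, nobs1=50, alpha=0.05)
    print(f"sigma = {sigma:>4}  d = {d:.2f}  power at n = 50 per group: {power:.3f}")
```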

Effect Size: The Signal You're Trying to Detect

Effect size measures how big the difference or relationship is in standardized terms. Larger effects are easier to detect, just as louder sounds are easier to hear. Different contexts require different effect size measures.

Cohen's d

  • Measures mean differences in standard deviation units—calculated as $d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}$
  • Benchmarks: 0.2 = small, 0.5 = medium, 0.8 = large—though context matters more than arbitrary cutoffs
  • Used primarily for t-tests comparing two group means
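
Here is a short sketch of the formula above applied to two made-up samples; the data are invented purely to show the pooled-standard-deviation arithmetic.

```python
# Cohen's d from two samples using the pooled standard deviation.
import numpy as np

group1 = np.array([82, 75, 90, 68, 77, 85, 79, 88])   # made-up scores
group2 = np.array([70, 64, 73, 59, 66, 71, 75, 62])   # made-up scores

n1, n2 = len(group1), len(group2)
s1, s2 = group1.std(ddof=1), group2.std(ddof=1)        # sample standard deviations
s_pooled = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (group1.mean() - group2.mean()) / s_pooled
print(f"Cohen's d = {d:.2f}")
```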

Odds Ratio

  • Compares odds of an outcome between groups—an OR of 2.0 means the odds of the outcome are twice as high in one group as in the other
  • OR = 1 indicates no effect; values further from 1 represent stronger associations
  • Common in medical and social science research involving binary outcomes
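
A small sketch of the odds-ratio arithmetic from a 2×2 table of counts; the counts are invented for illustration.

```python
# Odds ratio from a 2x2 table: rows are groups, columns are outcome yes/no.
import numpy as np

table = np.array([[30, 70],    # group 1: 30 with the outcome, 70 without
                  [15, 85]])   # group 2: 15 with the outcome, 85 without

odds1 = table[0, 0] / table[0, 1]       # odds of the outcome in group 1
odds2 = table[1, 0] / table[1, 1]       # odds of the outcome in group 2
print(f"Odds ratio = {odds1 / odds2:.2f}")   # > 1 means higher odds in group 1
```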

Correlation Coefficient

  • Measures linear relationship strength between two quantitative variables, ranging from $-1$ to $+1$
  • Benchmarks: 0.1 = small, 0.3 = medium, 0.5 = large for $|r|$ values
  • Squared correlation ($r^2$) gives proportion of variance explained—directly relevant to regression power
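
For correlations, power is commonly approximated with the Fisher $z$ transformation. The sketch below uses that normal approximation with an assumed true $\rho = 0.3$ and a few illustrative sample sizes.

```python
# Approximate power for testing H_0: rho = 0 via the Fisher z transformation.
import numpy as np
from scipy.stats import norm

rho, alpha = 0.3, 0.05                 # assumed true correlation and alpha
z_crit = norm.ppf(1 - alpha / 2)

for n in (30, 85, 200):
    shift = np.arctanh(rho) * np.sqrt(n - 3)   # mean of the transformed statistic
    power = norm.sf(z_crit - shift) + norm.cdf(-z_crit - shift)
    print(f"n = {n:>3}  approximate power = {power:.3f}")
```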

Compare: Cohen's d vs. correlation coefficient—both standardize effect sizes, but $d$ applies to group comparisons while $r$ applies to relationships between continuous variables. Know which measure fits which test type.


Power Analysis by Test Type

Different statistical tests have different power characteristics because they're asking different questions about data. The underlying logic is the same, but the calculations and considerations vary.

Power Analysis for t-Tests

  • Compares means between two groups (independent) or two conditions (paired)
  • Paired designs typically have higher power because they control for individual differences, reducing variability
  • Requires estimates of effect size ($d$) and standard deviation to calculate needed sample size
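
The sketch below contrasts the two designs for the same underlying effect, assuming a 0.5 SD raw difference and a within-pair correlation of 0.6 (both invented for illustration). The pairing shrinks the standard deviation of the differences, which inflates the effective effect size.

```python
# Independent-groups vs. matched-pairs power for the same underlying effect.
import numpy as np
from statsmodels.stats.power import TTestIndPower, TTestPower

delta = 0.5        # true mean difference, in units of sigma (assumed)
rho = 0.6          # assumed correlation within matched pairs
n = 40             # per-group size (independent) or number of pairs (paired)

# Independent groups: effect size is the usual Cohen's d = delta / sigma.
power_ind = TTestIndPower().power(effect_size=delta, nobs1=n, alpha=0.05)

# Paired design: the relevant spread is the SD of the pair differences,
# sd_diff = sigma * sqrt(2 * (1 - rho)), which shrinks as rho rises.
d_paired = delta / np.sqrt(2 * (1 - rho))
power_pair = TTestPower().power(effect_size=d_paired, nobs=n, alpha=0.05)

print(f"Independent groups ({n} per group): power = {power_ind:.3f}")
print(f"Matched pairs ({n} pairs):          power = {power_pair:.3f}")
```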

Power Analysis for ANOVA

  • Extends to three or more groups with effect size often measured by $\eta^2$ or $f$
  • Power depends on the pattern of means—detecting one very different group is easier than detecting small differences among all groups
  • Total sample size matters, but so does balance—equal group sizes maximize power
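
For a one-way ANOVA, an a priori calculation typically works from Cohen's $f$. The sketch below assumes a medium $f = 0.25$ and three balanced groups; note that statsmodels' FTestAnovaPower works with the total sample size across groups.

```python
# A priori one-way ANOVA power: solve for the total n given Cohen's f.
from statsmodels.stats.power import FTestAnovaPower

anova_power = FTestAnovaPower()
n_total = anova_power.solve_power(effect_size=0.25, alpha=0.05,
                                  power=0.80, k_groups=3)   # assumed f and k
print(f"Total n needed: about {n_total:.0f} "
      f"(about {n_total / 3:.0f} per group if balanced)")
```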

Power Analysis for Regression

  • Evaluates whether predictors explain significant variance in the outcome
  • Power increases when added predictors have real effects, but adding noise predictors can reduce power by using up degrees of freedom
  • Effect size is often expressed as $R^2$, the proportion of variance explained by the model, or as $f^2 = \frac{R^2}{1 - R^2}$
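
Regression power can be computed directly from the noncentral F distribution; the sketch below uses Cohen's convention $\lambda = f^2(u + v + 1)$ for the noncentrality parameter. The population $R^2 = 0.13$, three predictors, and $n = 80$ are illustrative assumptions.

```python
# Power of the overall F test in multiple regression via the noncentral F.
from scipy import stats

r_squared = 0.13                       # assumed population R^2
f2 = r_squared / (1 - r_squared)       # Cohen's f^2
p, n, alpha = 3, 80, 0.05              # predictors, sample size, alpha (assumed)

dfn, dfd = p, n - p - 1                # numerator and denominator df
lam = f2 * (dfn + dfd + 1)             # noncentrality, Cohen's convention

f_crit = stats.f.ppf(1 - alpha, dfn, dfd)
power = stats.ncf.sf(f_crit, dfn, dfd, lam)
print(f"Power to detect R^2 = {r_squared} with n = {n}: {power:.3f}")
```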

Compare: t-test vs. ANOVA power analysis—both examine mean differences, but ANOVA spreads the effect across multiple comparisons, often requiring larger total samples to detect the same underlying differences.


Timing Matters: When to Calculate Power

The timing of your power analysis fundamentally changes its purpose and usefulness. Planning ahead and looking backward yield very different insights.

A Priori Power Analysis

  • Conducted before data collection to determine the sample size needed for adequate power
  • Requires specifying desired power (typically 0.80), $\alpha$, and an expected effect size—the effect size estimate often comes from pilot studies or prior literature
  • Essential for research planning and often required by funding agencies and IRBs
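
A compact a priori sketch, assuming the pilot-based effect estimate is $d = 0.6$ and the conventional targets of $\alpha = 0.05$ and power 0.80 (all inputs are illustrative):

```python
# A priori sample-size calculation for a two-sample t-test (illustrative inputs).
import math
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.6, alpha=0.05, power=0.80)
print(f"Plan for at least {math.ceil(n_per_group)} subjects per group")
```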

Post Hoc Power Analysis

  • Performed after data collection using the observed effect size and sample size
  • Controversial and often misleading—observed power is mathematically determined by the p-value, so it adds no new information
  • Better alternative: report confidence intervals for effect sizes rather than post hoc power

Compare: A priori vs. post hoc power analysis—a priori helps you design a study that can succeed, while post hoc merely restates your results in different terms. If an FRQ asks about improving a study, focus on a priori planning for future replications.


Tools and Visualization

Power Curves

  • Graph power (y-axis) against sample size (x-axis) for fixed effect size and $\alpha$
  • Curves show diminishing returns—power rises steeply at first, then flattens as $n$ increases
  • Multiple curves for different effect sizes reveal how much harder small effects are to detect
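
A power curve like the one described above can be drawn in a few lines; the sketch below plots power against per-group $n$ for three assumed effect sizes at $\alpha = 0.05$, using matplotlib and statsmodels as one possible toolchain.

```python
# Power curves: power vs. per-group n for several assumed effect sizes.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
sample_sizes = np.arange(5, 201, 5)

fig, ax = plt.subplots()
for d in (0.2, 0.5, 0.8):                              # small / medium / large
    powers = [analysis.power(effect_size=d, nobs1=n, alpha=0.05)
              for n in sample_sizes]
    ax.plot(sample_sizes, powers, label=f"d = {d}")

ax.axhline(0.80, linestyle="--", linewidth=1)          # conventional 0.80 target
ax.set_xlabel("Sample size per group")
ax.set_ylabel("Power")
ax.legend()
plt.show()
```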

Power Analysis Software

  • G*Power is free and comprehensive—handles most common test types with graphical output
  • R packages (pwr, simr) offer flexibility for complex designs and simulation-based power analysis
  • Built-in functions in statistical software (SAS, SPSS, Stata) integrate power analysis into research workflows

Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Power definition | $1 - \beta$, probability of detecting true effect |
| Type I error | False positive, rejecting true $H_0$, controlled by $\alpha$ |
| Type II error | False negative, failing to reject false $H_0$, equals $\beta$ |
| Factors increasing power | Larger $n$, larger effect size, higher $\alpha$, lower variability |
| Effect size measures | Cohen's d (means), odds ratio (binary), correlation (relationships) |
| Power by test type | t-test, ANOVA, regression—each with specific formulas |
| Analysis timing | A priori (planning) vs. post hoc (after the fact) |
| Visualization tools | Power curves, G*Power software, R packages |

Self-Check Questions

  1. A study fails to reject the null hypothesis despite theory suggesting an effect exists. Which two factors could you increase to improve power in a replication study, and why does each work?

  2. Compare Cohen's d and the correlation coefficient: What type of research question does each address, and how would you interpret $d = 0.5$ versus $r = 0.5$?

  3. A researcher conducts post hoc power analysis and finds power was only 0.40. Why is this analysis less useful than it might seem, and what should the researcher report instead?

  4. Explain the tradeoff between Type I and Type II error rates. If a medical researcher testing a new drug lowers $\alpha$ from 0.05 to 0.01, what happens to power, and how could the researcher compensate for the loss?

  5. Two studies examine the same treatment effect: Study A uses independent groups ($n = 50$ per group), while Study B uses a matched-pairs design ($n = 50$ pairs). Which likely has higher power, and what statistical principle explains the difference?