Why This Matters
Statistical power is the backbone of good study design—it determines whether your research can actually detect the effects you're looking for. When you understand power calculations, you're not just plugging numbers into formulas; you're grasping the fundamental trade-offs between sample size, effect size, significance level, and error rates that govern all hypothesis testing. These concepts appear repeatedly in inference questions, and the AP exam loves asking you to reason through why a study might fail to find a significant result even when an effect exists.
Don't just memorize that "larger samples = more power." You need to understand why each factor influences power and how these factors interact. When an FRQ asks you to critique a study design or explain a non-significant result, your ability to connect power concepts to real statistical reasoning will earn you full credit. Let's break this down by the underlying principles.
The Core Framework: What Power Actually Measures
Power quantifies your test's sensitivity—its ability to say "yes, there's an effect" when an effect truly exists. Think of it as your statistical radar's detection capability. The higher your power, the less likely you are to miss a real signal.
Definition of Statistical Power
- Power equals 1−β—where β is the probability of a Type II error (failing to detect a true effect)
- Target power of 0.80 is conventional, meaning researchers typically accept a 20% chance of missing a real effect
- Power applies only when the alternative hypothesis is true—it's meaningless to discuss power when H0 is actually correct
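To make the definition concrete, here is a minimal sketch of the calculation for a two-sided one-sample z-test with known σ. All numbers are hypothetical planning values, and SciPy is assumed to be available:

```python
from scipy.stats import norm

# Hypothetical planning values: true mean shift, population SD, sample size, alpha
delta, sigma, n, alpha = 5.0, 15.0, 40, 0.05

z_crit = norm.ppf(1 - alpha / 2)   # two-sided critical value
ncp = delta / (sigma / n ** 0.5)   # shift of the true sampling distribution, in SE units

# Power = probability the test statistic lands in the rejection region when H0 is false
power = norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)
beta = 1 - power                   # Type II error rate

print(f"power = {power:.3f}, beta = {beta:.3f}")   # roughly 0.56 and 0.44 here
```

With these numbers the study falls short of the conventional 0.80 standard, which is exactly the kind of diagnosis the rest of this guide shows how to fix.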
Type I and Type II Errors
- Type I error (α) occurs when you reject a true null hypothesis—a false positive that sees an effect that isn't there
- Type II error (β) occurs when you fail to reject a false null hypothesis—a false negative that misses a real effect
- The α-β tradeoff is unavoidable—lowering one error rate typically increases the other unless you increase sample size
Compare: Type I vs. Type II errors—both represent incorrect conclusions, but Type I means claiming an effect exists when it doesn't, while Type II means missing an effect that's really there. If an FRQ describes a study that "failed to find significance," consider whether low power (high Type II risk) might explain the result.
The Four Levers: Factors You Can Control
Power isn't fixed—it responds to choices you make during study design. Understanding these relationships helps you diagnose underpowered studies and design better ones. Each lever pulls power in a predictable direction.
Sample Size Effects
- Larger samples increase power by reducing standard error and narrowing the sampling distribution
- The relationship is nonlinear—doubling sample size doesn't double power; gains diminish as n grows
- Sample size is often the most practical lever since effect size and variability may be fixed by the research context
Significance Level (α) Effects
- Higher α increases power by making it easier to reject H0—you're widening the rejection region
- The standard α=0.05 balances Type I error control against reasonable power
- Lowering α to 0.01 requires larger samples to maintain the same power level
Compare: Sample size vs. significance level—both affect power, but increasing n improves power without increasing Type I error risk, while raising α boosts power at the cost of more false positives. This is why researchers typically adjust sample size rather than α.
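A quick numerical sketch of that tradeoff, reusing the hypothetical one-sample setting from above (normal approximation, SciPy assumed): power drops as α is tightened unless n grows to compensate.

```python
from scipy.stats import norm

delta, sigma, n, target_power = 5.0, 15.0, 40, 0.80   # hypothetical planning values

for alpha in (0.10, 0.05, 0.01):
    z_crit = norm.ppf(1 - alpha / 2)
    ncp = (delta / sigma) * n ** 0.5
    power_at_n = norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)
    # Approximate n needed to reach the target power at this alpha
    n_needed = ((z_crit + norm.ppf(target_power)) * sigma / delta) ** 2
    print(f"alpha = {alpha:.2f}: power at n = {n} is {power_at_n:.2f}; "
          f"need n of about {n_needed:.0f} for power {target_power:.2f}")
```

Tightening α from 0.10 to 0.01 roughly doubles the sample size required to keep power at 0.80 in this setting.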
Variability in the Data
- Less variability increases power because the signal (effect) is easier to distinguish from noise
- Standard deviation appears in power formulas—smaller σ means narrower distributions and clearer separation
- Study design can reduce variability through blocking, matched pairs, or controlling extraneous variables
Effect Size: The Signal You're Trying to Detect
Effect size measures how big the difference or relationship is in standardized terms. Larger effects are easier to detect, just as louder sounds are easier to hear. Different contexts require different effect size measures.
Cohen's d
- Measures mean differences in standard deviation units—calculated as d = (x̄₁ − x̄₂) / s_pooled
- Benchmarks: 0.2 = small, 0.5 = medium, 0.8 = large—though context matters more than arbitrary cutoffs
- Used primarily for t-tests comparing two group means
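A short sketch of the calculation with made-up scores for two groups (NumPy assumed):

```python
import numpy as np

# Hypothetical samples from two groups (e.g., treatment vs. control scores)
group1 = np.array([82, 90, 77, 85, 88, 79, 91, 84])
group2 = np.array([75, 80, 72, 78, 83, 70, 77, 74])

# Pooled SD weights each group's variance by its degrees of freedom
n1, n2 = len(group1), len(group2)
s_pooled = np.sqrt(((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1))
                   / (n1 + n2 - 2))

d = (group1.mean() - group2.mean()) / s_pooled
print(f"Cohen's d = {d:.2f}")   # about 1.8 here, a large effect by the benchmarks above
```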
Odds Ratio
- Compares odds of an outcome between groups—an OR of 2.0 means the odds are twice as high in one group
- OR = 1 indicates no effect; values further from 1 represent stronger associations
- Common in medical and social science research involving binary outcomes
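A tiny sketch of the arithmetic with a hypothetical 2×2 table of counts:

```python
# Hypothetical 2x2 table: rows = treatment/control, columns = outcome yes/no
treat_yes, treat_no = 30, 70
ctrl_yes, ctrl_no = 15, 85

odds_treat = treat_yes / treat_no   # odds of the outcome in the treatment group
odds_ctrl = ctrl_yes / ctrl_no      # odds of the outcome in the control group
odds_ratio = odds_treat / odds_ctrl
print(f"OR = {odds_ratio:.2f}")     # about 2.4: odds of the outcome are ~2.4 times higher under treatment
```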
Correlation Coefficient
- Measures linear relationship strength between two quantitative variables, ranging from −1 to +1
- Benchmarks: 0.1 = small, 0.3 = medium, 0.5 = large for |r| values
- Squared correlation (r2) gives proportion of variance explained—directly relevant to regression power
Compare: Cohen's d vs. correlation coefficient—both standardize effect sizes, but d applies to group comparisons while r applies to relationships between continuous variables. Know which measure fits which test type.
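A matching sketch for r and r² on hypothetical paired data (NumPy assumed):

```python
import numpy as np

# Hypothetical paired observations (e.g., study hours vs. exam score)
hours = np.array([2, 4, 5, 7, 8, 10, 12, 15])
score = np.array([58, 60, 65, 62, 71, 75, 78, 85])

r = np.corrcoef(hours, score)[0, 1]
print(f"r = {r:.2f}, r^2 = {r**2:.2f}")   # about 0.97; r^2 is the proportion of variance explained
```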
Power Analysis by Test Type
Different statistical tests have different power characteristics because they're asking different questions about data. The underlying logic is the same, but the calculations and considerations vary.
Power Analysis for t-Tests
- Compares means between two groups (independent) or two conditions (paired)
- Paired designs typically have higher power because they control for individual differences, reducing variability
- Requires estimates of effect size (d) and standard deviation to calculate needed sample size
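The paired-design advantage is easy to see by simulation. This is a minimal sketch, assuming normal data, a hypothetical within-pair correlation of 0.6, and NumPy/SciPy available; it simply counts how often each test rejects at α = 0.05 when the effect is real.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
delta, sigma, rho, n, alpha, reps = 5.0, 15.0, 0.6, 30, 0.05, 2000   # hypothetical values

hits_ind = hits_paired = 0
cov = [[sigma**2, rho * sigma**2], [rho * sigma**2, sigma**2]]       # covariance of a matched pair

for _ in range(reps):
    # Independent-groups design: two unrelated samples of size n
    a = rng.normal(0.0, sigma, n)
    b = rng.normal(delta, sigma, n)
    hits_ind += stats.ttest_ind(a, b).pvalue < alpha

    # Matched-pairs design: n correlated (pre, post) measurements on the same subjects
    pairs = rng.multivariate_normal([0.0, delta], cov, size=n)
    hits_paired += stats.ttest_rel(pairs[:, 0], pairs[:, 1]).pvalue < alpha

print(f"independent-groups power ~ {hits_ind / reps:.2f}")   # noticeably lower
print(f"matched-pairs power      ~ {hits_paired / reps:.2f}")
```

Because the pairing removes subject-to-subject variability, the paired test rejects far more often with the same number of measurements.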
Power Analysis for ANOVA
- Extends to three or more groups with effect size often measured by η2 or f
- Power depends on the pattern of means—detecting one very different group is easier than detecting small differences among all groups
- Total sample size matters, but so does balance—equal group sizes maximize power
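For a conventional a priori ANOVA calculation, power software handles the noncentral-F math. A sketch assuming statsmodels is installed and a hypothetical medium effect (Cohen's f = 0.25) across three groups:

```python
from statsmodels.stats.power import FTestAnovaPower

# Solve for the total sample size needed to detect a medium effect across 3 groups
n_total = FTestAnovaPower().solve_power(
    effect_size=0.25,   # Cohen's f (hypothetical, e.g., from pilot data)
    alpha=0.05,
    power=0.80,
    k_groups=3,
)
print(f"total sample size needed: about {n_total:.0f}, split equally across groups")
```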
Power Analysis for Regression
- Evaluates whether predictors explain significant variance in the outcome
- Power increases when added predictors have true effects, but noise predictors can reduce power by consuming degrees of freedom without explaining additional variance
- Effect size often expressed as R2 or f2—the proportion of variance explained by the model
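The conversion between R² and Cohen's f² is a single line; a quick sketch with hypothetical R² values chosen to match the commonly cited f² benchmarks:

```python
# Cohen's f^2 = R^2 / (1 - R^2); benchmarks of 0.02 / 0.15 / 0.35 are commonly cited
for r2 in (0.02, 0.13, 0.26):
    f2 = r2 / (1 - r2)
    print(f"R^2 = {r2:.2f}  ->  f^2 = {f2:.2f}")
```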
Compare: t-test vs. ANOVA power analysis—both examine mean differences, but ANOVA spreads the effect across multiple comparisons, often requiring larger total samples to detect the same underlying differences.
Timing Matters: When to Calculate Power
The timing of your power analysis fundamentally changes its purpose and usefulness. Planning ahead and looking backward yield very different insights.
A Priori Power Analysis
- Conducted before data collection to determine the sample size needed for adequate power
- Requires specifying desired power (typically 0.80), α, and expected effect size—the effect size estimate often comes from pilot studies or literature
- Essential for research planning and often required by funding agencies and IRBs
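A minimal a priori sketch, assuming statsmodels is installed and that a pilot study suggested a medium effect (d ≈ 0.5) for an independent-samples t-test:

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the per-group n that delivers 80% power
n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,   # expected Cohen's d (hypothetical pilot estimate)
    alpha=0.05,
    power=0.80,
    alternative="two-sided",
)
print(f"plan for about {n_per_group:.0f} participants per group")   # roughly 64
```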
Post Hoc Power Analysis
- Performed after data collection using the observed effect size and sample size
- Controversial and often misleading—observed power is mathematically determined by the p-value, so it adds no new information
- Better alternative: report confidence intervals for effect sizes rather than post hoc power
Compare: A priori vs. post hoc power analysis—a priori helps you design a study that can succeed, while post hoc merely restates your results in different terms. If an FRQ asks about improving a study, focus on a priori planning for future replications.
Power Curves
- Graph power (y-axis) against sample size (x-axis) for fixed effect size and α
- Curves show diminishing returns—power rises steeply at first, then flattens as n increases
- Multiple curves for different effect sizes reveal how much harder small effects are to detect
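The shape of these curves can be sketched without plotting software. The grid below uses the same normal approximation as the earlier examples (hypothetical effect sizes, SciPy assumed) and shows the steep-then-flat pattern:

```python
from scipy.stats import norm

def approx_power(d, n, alpha=0.05):
    """Normal-approximation power for a two-sided test of standardized effect size d."""
    z_crit = norm.ppf(1 - alpha / 2)
    ncp = d * n ** 0.5
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)

# Rows = effect sizes, columns = sample sizes: power rises steeply, then flattens
for d in (0.2, 0.5, 0.8):
    row = "  ".join(f"n={n}: {approx_power(d, n):.2f}" for n in (20, 40, 80, 160, 320))
    print(f"d = {d}:  {row}")
```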
Power Analysis Software
- G*Power is free and comprehensive—handles most common test types with graphical output
- R packages (pwr, simr) offer flexibility for complex designs and simulation-based power analysis
- Built-in functions in statistical software (SAS, SPSS, Stata) integrate power analysis into research workflows
Quick Reference Table
| Concept | Key Points |
| --- | --- |
| Power definition | 1−β, probability of detecting true effect |
| Type I error | False positive, rejecting true H0, controlled by α |
| Type II error | False negative, failing to reject false H0, equals β |
| Factors increasing power | Larger n, larger effect size, higher α, lower variability |
| Effect size measures | Cohen's d (means), odds ratio (binary), correlation (relationships) |
| Power by test type | t-test, ANOVA, regression—each with specific formulas |
| Analysis timing | A priori (planning) vs. post hoc (after the fact) |
| Visualization tools | Power curves, G*Power software, R packages |
Self-Check Questions
- A study fails to reject the null hypothesis despite theory suggesting an effect exists. Which two factors could you increase to improve power in a replication study, and why does each work?
- Compare Cohen's d and the correlation coefficient: What type of research question does each address, and how would you interpret d = 0.5 versus r = 0.5?
- A researcher conducts post hoc power analysis and finds power was only 0.40. Why is this analysis less useful than it might seem, and what should the researcher report instead?
- Explain the tradeoff between Type I and Type II error rates. If a medical researcher testing a new drug lowers α from 0.05 to 0.01, what happens to power, and how could this be compensated?
- Two studies examine the same treatment effect: Study A uses independent groups (n = 50 per group), while Study B uses a matched-pairs design (n = 50 pairs). Which likely has higher power, and what statistical principle explains the difference?