Why This Matters
Statistical inference isn't just about determining whether an effect exists; it's about understanding how big that effect actually is. While p-values tell you whether results are statistically significant, effect size measures tell you whether those results are practically meaningful. This distinction is critical: a study with thousands of participants might find a "significant" difference that's so tiny it has no real-world importance. You're being tested on your ability to choose the right effect size measure for different study designs, interpret what the values mean, and explain why effect sizes matter for evidence-based decision-making, meta-analysis, and replication studies.
Effect size measures fall into distinct families based on what they quantify: standardized differences between groups, variance explained, and measures of association for categorical outcomes. Understanding these categories helps you quickly identify which measure fits which research scenario. Don't just memorize formulas; know what type of data each measure handles and when you'd choose one over another.
Standardized Mean Differences
These measures express the difference between group means in standard deviation units, making them ideal for comparing results across studies that use different measurement scales. The core principle: divide the mean difference by a measure of variability to create a unit-free comparison.
Cohen's d
- Most widely used effect size for two-group comparisons, calculated as the difference between means divided by the pooled standard deviation: $d = \frac{\bar{X}_1 - \bar{X}_2}{s_{\text{pooled}}}$ (see the sketch after this list)
- Benchmark interpretations of small (0.2), medium (0.5), and large (0.8) come from Cohen's original guidelines, though context matters
- Best applied when both groups have similar variances and sample sizes; assumes equal standard deviations across groups
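A minimal sketch of the calculation, assuming NumPy and two independent samples (the function name `cohens_d` is illustrative, not a library API):

```python
import numpy as np

def cohens_d(x1, x2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    # Pooled SD: df-weighted average of the two sample variances (ddof=1)
    s_pooled = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1))
                       / (n1 + n2 - 2))
    return (x1.mean() - x2.mean()) / s_pooled

# Toy data: treatment vs. control scores
print(cohens_d([5, 6, 7, 8], [3, 4, 5, 6]))  # ~1.55, large by Cohen's benchmarks
```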
Hedges' g
- Corrects Cohen's d for small-sample bias by applying a correction factor that becomes negligible with larger samples (typically n > 20)
- Preferred in meta-analyses because it provides unbiased estimates when combining studies with varying sample sizes
- Interpretation is identical to Cohen's d; the same benchmarks apply, making results directly comparable (see the sketch below)
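A sketch of the correction, using the widely cited approximation $J = 1 - \frac{3}{4\,df - 1}$ to the exact gamma-function factor (again assuming NumPy; names are illustrative):

```python
import numpy as np

def hedges_g(x1, x2):
    """Hedges' g: Cohen's d scaled by the small-sample correction factor J."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    df = n1 + n2 - 2
    s_pooled = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / df)
    d = (x1.mean() - x2.mean()) / s_pooled
    j = 1 - 3 / (4 * df - 1)  # J -> 1 as df grows, so g -> d in large samples
    return j * d

# Same toy data as the Cohen's d example: g < d because the samples are small
print(hedges_g([5, 6, 7, 8], [3, 4, 5, 6]))  # ~1.35
```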
Glass's Delta
- Uses only the control group's standard deviation for standardization: $\Delta = \frac{\bar{X}_1 - \bar{X}_2}{s_{\text{control}}}$ (sketched below)
- Ideal when treatment affects variability: if an intervention changes not just the mean but also the spread of scores, pooling standard deviations would be misleading
- Common in experimental designs where you want to express treatment effects relative to baseline variability
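A sketch, assuming the second argument is the control group whose SD serves as the yardstick:

```python
import numpy as np

def glass_delta(treatment, control):
    """Glass's delta: mean difference standardized by the control group's SD only."""
    t = np.asarray(treatment, dtype=float)
    c = np.asarray(control, dtype=float)
    return (t.mean() - c.mean()) / c.std(ddof=1)

# Treatment inflates the spread, but delta stays anchored to baseline variability
print(glass_delta([5, 6, 9, 12], [3, 4, 5, 6]))  # ~2.71
```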
Compare: Cohen's d, Hedges' g, and Glass's delta all measure standardized mean differences, but they differ in which standard deviation they use and whether they correct for bias. If an FRQ asks you to justify your choice of effect size, explain whether groups have equal variances and whether sample sizes are small.
Standardized Mean Difference (SMD)
- Umbrella term encompassing Cohen's d, Hedges' g, and Glass's delta; refers to any effect size that standardizes group differences
- Essential for meta-analysis because it allows combining results from studies using different measurement instruments
- Watch the terminology: some software outputs "SMD" generically, so always check which specific formula was applied
Variance-Explained Measures
These measures tell you what proportion of the outcome's variability can be attributed to your predictor(s). The underlying logic: partition total variance into explained and unexplained components.
R-squared (R²)
- Proportion of variance explained by the regression model; ranges from 0 to 1, where $R^2 = 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}}$ (see the sketch after this list)
- Interpretation is context-dependent: an R² of 0.30 might be excellent in psychology but weak in physics; always consider the field's standards
- Limitation: R² always increases when you add predictors, even useless ones, which is why adjusted R² exists for multiple regression
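A sketch of the computation from observed values and model predictions (assuming NumPy; `y_hat` would come from whatever model you fit):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SS_residual / SS_total."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    ss_residual = np.sum((y - y_hat) ** 2)   # unexplained variation
    ss_total = np.sum((y - y.mean()) ** 2)   # total variation around the mean
    return 1 - ss_residual / ss_total

# Toy example: predictions fairly close to the observations
print(r_squared([2, 4, 6, 8], [2.5, 3.5, 6.5, 7.5]))  # 0.95
```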
Eta-squared (η²)
- ANOVA's version of R²; represents the proportion of total variance in the dependent variable explained by the factor: $\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}$ (sketched after this list)
- Tends to overestimate population effect sizes, especially with small samples or multiple factors
- Quick benchmarks: small (0.01), medium (0.06), large (0.14), but these are rough guidelines, not rigid cutoffs
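A sketch for a one-way design, assuming NumPy and one argument per group:

```python
import numpy as np

def eta_squared(*groups):
    """Eta-squared for a one-way ANOVA: SS_between / SS_total."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    scores = np.concatenate(groups)
    grand_mean = scores.mean()
    # Between-group SS: each group mean's deviation from the grand mean, weighted by n
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = np.sum((scores - grand_mean) ** 2)
    return ss_between / ss_total

print(eta_squared([1, 2, 3], [4, 5, 6], [7, 8, 9]))  # 0.9: the factor explains 90%
```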
Partial Eta-squared
- Controls for other factors in the model; shows the unique variance explained by one factor after removing variance explained by others
- Standard output in factorial ANOVA; most statistical software reports partial η² by default rather than eta-squared
- Not directly comparable to eta-squared because denominators differ: partial η² uses $\frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}$, so partial values are typically larger for the same effect (see the sketch below)
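Given sums of squares from an ANOVA table, the computation is a one-liner; a sketch with made-up SS values:

```python
def partial_eta_squared(ss_effect, ss_error):
    """Partial eta-squared: SS_effect / (SS_effect + SS_error).

    Other factors' SS are excluded from the denominator, which is why
    partial values run larger than classical eta-squared in factorial designs.
    """
    return ss_effect / (ss_effect + ss_error)

# Hypothetical two-way ANOVA: SS_A = 30, SS_B = 50, SS_error = 120
print(partial_eta_squared(30, 120))  # 0.20
print(30 / (30 + 50 + 120))          # 0.15: classical eta-squared is smaller
```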
Compare: R² vs. η². Both measure variance explained, but R² is used in regression (continuous predictors) while η² is used in ANOVA (categorical predictors). On exams, match the measure to the analysis type.
Correlation-Based Measures
Correlation coefficients quantify the strength and direction of relationships between variables. The key insight: these measures are already standardized, making them natural effect sizes.
Pearson's Correlation Coefficient (r)
- Measures linear association between two continuous variables; ranges from −1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear relationship
- Effect size benchmarks: small (0.10), medium (0.30), large (0.50); note these differ from Cohen's d benchmarks
- Squaring gives variance explained: r² tells you the proportion of variance shared between variables, directly connecting correlation to regression
Compare: Pearson's r vs. R². Pearson's r captures direction and strength of a bivariate relationship, while R² in multiple regression captures total variance explained by all predictors combined. In simple linear regression with one predictor, R² = r².
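A quick numerical check of that identity, using simulated data (`np.polyfit` stands in for any least-squares fit):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)      # linear signal plus noise

r = np.corrcoef(x, y)[0, 1]             # Pearson's r

slope, intercept = np.polyfit(x, y, 1)  # simple least-squares regression
y_hat = slope * x + intercept
ss_residual = np.sum((y - y_hat) ** 2)
ss_total = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_residual / ss_total         # R^2 of the fitted model

print(np.isclose(r ** 2, r2))           # True: r squared matches R^2
```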
Measures for Categorical Outcomes
When your outcome is binary (yes/no, disease/no disease), you need effect sizes designed for proportions and odds. These measures compare event rates or odds between groups rather than means.
Odds Ratio
- Compares odds between groups, calculated as $OR = \frac{\text{odds}_{\text{group 1}}}{\text{odds}_{\text{group 2}}}$, where odds = probability of event / probability of no event (see the sketch after this list)
- Interpretation anchored at 1.0: values above 1 indicate higher odds in the numerator group; values below 1 indicate lower odds
- Standard in logistic regression and case-control studies: when you can't calculate risk directly (retrospective designs), odds ratios are your go-to measure
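A sketch from 2×2 table counts; the a/b/c/d cell labels are an assumption about how the table is laid out:

```python
def odds_ratio(a, b, c, d):
    """OR from a 2x2 table.

    a, b = events / non-events in group 1 (e.g., cases or exposed)
    c, d = events / non-events in group 2 (e.g., controls or unexposed)
    """
    return (a / b) / (c / d)  # equivalently (a * d) / (b * c)

# Case-control example: 30/70 exposed among cases, 10/90 exposed among controls
print(odds_ratio(30, 70, 10, 90))  # ~3.86: cases have nearly 4x the odds of exposure
```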
Risk Ratio (Relative Risk)
- Compares probabilities directly; calculated as $RR = \frac{P(\text{event} \mid \text{exposed})}{P(\text{event} \mid \text{unexposed})}$ (sketched below)
- More intuitive than odds ratios for most audiences: "twice the risk" is easier to grasp than "twice the odds"
- Requires prospective data; only valid in cohort studies or RCTs where you can calculate actual incidence rates
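A companion sketch using the same 2×2 layout; because it needs actual incidence, the counts must come from a prospective design:

```python
def risk_ratio(a, b, c, d):
    """RR from a 2x2 table (prospective data only).

    a, b = events / non-events among the exposed
    c, d = events / non-events among the unexposed
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed

# Cohort example: 30/100 events among exposed vs. 10/100 among unexposed
print(risk_ratio(30, 70, 10, 90))  # 3.0: the exposed group has triple the risk
```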
Compare: Odds ratio vs. risk ratio. Both measure association in categorical data, but odds ratios work in any design while risk ratios require prospective data. When the outcome is rare (< 10%), odds ratios approximate risk ratios closely, as the example below illustrates. FRQs often ask when each is appropriate.
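A quick illustration of that convergence with made-up cohort counts and a rare (~1-2%) outcome:

```python
# Hypothetical cohort: 4/200 events among exposed, 2/200 among unexposed
a, b = 4, 196  # exposed: events, non-events
c, d = 2, 198  # unexposed: events, non-events

rr = (a / (a + b)) / (c / (c + d))  # 2.00
or_ = (a / b) / (c / d)             # ~2.02
print(rr, or_)  # with a rare outcome, the OR closely approximates the RR
```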
Quick Reference Table
| Purpose | Measure(s) |
| --- | --- |
| Standardized group differences | Cohen's d, Hedges' g, Glass's delta |
| Small-sample correction | Hedges' g |
| Variance explained (regression) | R² |
| Variance explained (ANOVA) | Eta-squared, Partial eta-squared |
| Correlation strength | Pearson's r |
| Categorical outcomes (any design) | Odds ratio |
| Categorical outcomes (prospective) | Risk ratio |
| Meta-analysis applications | Hedges' g, SMD, odds ratio |
Self-Check Questions
- You're comparing treatment effects across three studies that used different depression scales. Which effect size measure would allow valid comparisons, and why?
- A researcher reports η² = 0.08 from a one-way ANOVA. How would you interpret this value, and what benchmark category does it fall into?
- Compare and contrast the odds ratio and risk ratio: In what study designs is each appropriate, and when do their values converge?
- Why might a researcher choose Glass's delta over Cohen's d when evaluating an educational intervention? What assumption about the data motivates this choice?
- An FRQ presents a multiple regression with R² = 0.45 and asks whether adding another predictor improved the model. Why is R² alone insufficient to answer this question, and what would you need to know?