🥖Linear Modeling Theory Unit 11 Review

11.4 Post-hoc Analysis for Two-Way ANOVA

Written by the Fiveable Content Team • Last updated August 2025

Post-Hoc Analysis in ANOVA

Two-Way ANOVA tells you that differences exist between groups, but not which groups differ. Post-hoc analysis fills that gap by testing specific pairwise comparisons after a significant main effect or interaction effect has been found.

These tests matter because every additional comparison you run inflates your chance of a false positive. Post-hoc methods keep that risk in check so your conclusions stay trustworthy.

Purpose and Importance

A significant F-test in Two-Way ANOVA tells you that at least one group mean differs from the others, but it stops there. Post-hoc tests pick up where the omnibus test leaves off, comparing specific pairs of group means to pinpoint exactly where the differences are.

  • Conducted only after a significant main effect or interaction is found
  • Identify which specific group means differ from each other
  • Control the familywise error rate, which is the probability of making at least one Type I error across all your comparisons
  • Without post-hoc analysis, you know something differs but can't say what, which severely limits the practical value of your results

Controlling Type I Error

The core problem is simple: the more pairwise comparisons you run, the higher your chance of finding a "significant" result purely by luck. If you run 10 comparisons each at α = 0.05, your probability of at least one false positive is far higher than 5%.

The familywise error rate captures this cumulative risk. For c independent comparisons, the upper bound is:

1 - (1 - α)^c

With 10 comparisons at α = 0.05, that's roughly 0.40, meaning a 40% chance of at least one false positive.
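
The bound above is easy to verify numerically. A minimal sketch in plain Python (no libraries needed):

```python
# Familywise error rate upper bound for c independent comparisons,
# each tested at significance level alpha: 1 - (1 - alpha)^c
def familywise_error_rate(alpha: float, c: int) -> float:
    return 1 - (1 - alpha) ** c

for c in (1, 5, 10, 20):
    print(c, round(familywise_error_rate(0.05, c), 3))
# With 10 comparisons at alpha = 0.05 the bound is about 0.401,
# and it keeps climbing as c grows.
```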

Post-hoc tests solve this by adjusting the significance threshold for each individual comparison so that the overall error rate stays at your desired level (typically 0.05). Different tests accomplish this adjustment in different ways, which is why choosing the right one matters.

Choosing Post-Hoc Tests

Factors to Consider

Not every post-hoc test fits every situation. Here are the main factors that guide your choice:

  • Number of comparisons: Conservative tests like Bonferroni work well when you have relatively few comparisons. With many comparisons, they become overly strict and lose statistical power. Tukey's HSD handles larger sets of comparisons more gracefully.
  • Equal vs. unequal sample sizes: Tukey's HSD assumes balanced group sizes. With unequal n, the Tukey-Kramer modification or Games-Howell test is more appropriate.
  • Homogeneity of variances: Tukey's HSD assumes equal variances across groups. If Levene's test or a similar check suggests unequal variances, Games-Howell is a better choice because it doesn't rely on that assumption.
  • Type of comparison: If your research question is specifically about comparing treatment groups to a single control group, Dunnett's test is purpose-built for that and more powerful than testing all possible pairs.
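
The decision points above can be sketched as a small helper. This is a hypothetical function of my own construction (the name and rules simply encode the guidelines in this list, not any library API):

```python
# Hypothetical helper encoding the decision points above. A real analysis
# would base these flags on diagnostics (e.g., a Levene's test for equal
# variances), not on hard-coded booleans.
def choose_posthoc(equal_variances: bool, balanced: bool,
                   vs_control_only: bool = False) -> str:
    if vs_control_only:
        return "Dunnett"        # each treatment vs. a single control
    if not equal_variances:
        return "Games-Howell"   # no homogeneity-of-variance assumption
    if not balanced:
        return "Tukey-Kramer"   # Tukey's HSD adjusted for unequal n
    return "Tukey HSD"          # balanced design, equal variances

print(choose_posthoc(equal_variances=False, balanced=False))
# → Games-Howell
```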

Commonly Used Post-Hoc Tests

Tukey's HSD (Honestly Significant Difference): The most common choice when you want to compare all possible pairs of means. It uses the studentized range distribution to set a single critical difference threshold. Best suited for balanced designs with equal variances.

Bonferroni correction: Divides the desired α by the number of comparisons (α_adj = α / c). Straightforward and flexible, but increasingly conservative as the number of comparisons grows. With 15 or more comparisons, it becomes very hard to detect real differences.
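
As a quick sketch, the Bonferroni adjustment and the comparison count it divides by are one line each; with k groups, an all-pairs analysis has k(k - 1)/2 comparisons:

```python
from math import comb

# Bonferroni: test each of c comparisons at alpha / c, so the
# familywise error rate stays at or below alpha.
def bonferroni_alpha(alpha: float, c: int) -> float:
    return alpha / c

# All pairwise comparisons among k group means: k choose 2.
k = 5
c = comb(k, 2)                       # 10 comparisons for 5 groups
print(c, bonferroni_alpha(0.05, c))  # each pair tested at 0.005
```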

Scheffé's test: The most conservative of the common options. It can test any linear contrast (not just pairwise comparisons), which gives it flexibility, but at the cost of lower power for simple pairwise tests. Use it when you're exploring complex contrasts or when variance assumptions are questionable.

Dunnett's test: Designed specifically for comparing each treatment group against a single control group. Because it tests fewer comparisons than an all-pairs method, it has more power for that specific purpose.

Games-Howell: A good default when group variances are unequal or sample sizes are unbalanced. It doesn't assume homogeneity of variances and adjusts for unequal n.
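
In Python, Tukey's HSD is available in statsmodels. A minimal sketch, assuming statsmodels is installed; the group data below are made up for illustration:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Made-up data: three groups of 30 observations each.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(10.0, 2, 30),
                         rng.normal(12.0, 2, 30),
                         rng.normal(10.5, 2, 30)])
groups = ["A"] * 30 + ["B"] * 30 + ["C"] * 30

# Compares every pair of group means with the studentized range statistic.
res = pairwise_tukeyhsd(values, groups, alpha=0.05)
print(res.summary())  # one row per pair: meandiff, p-adj, CI, reject
```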

Interpreting Post-Hoc Results

Presenting Results

Post-hoc results are typically displayed in a comparison matrix or table. Each row represents a pair of groups, and the columns show:

  • The mean difference between the two groups
  • The standard error of that difference
  • The p-value (adjusted for multiple comparisons)
  • Often a confidence interval for the mean difference

A p-value below your threshold (usually 0.05) means that pair of groups differs significantly. A confidence interval that does not contain zero tells you the same thing.
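
The CI check is mechanical: a pair is significant when its interval excludes zero. A sketch with made-up numbers (the pairs and values below are hypothetical):

```python
# Hypothetical post-hoc rows: (pair, mean difference, CI lower, CI upper).
rows = [
    ("A-B", -2.1, -3.9, -0.3),
    ("A-C",  0.4, -1.2,  2.0),
    ("B-C",  2.5,  0.8,  4.2),
]

# A confidence interval excludes zero when it lies entirely on one side.
significant = [pair for pair, diff, lo, hi in rows if lo > 0 or hi < 0]
print(significant)  # → ['A-B', 'B-C']
```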

Drawing Conclusions

Focus on the comparisons that are relevant to your research question. A Two-Way ANOVA with factors at 3 and 4 levels has 12 cell means, giving up to 66 possible pairwise comparisons, and not all of them will be theoretically interesting.

  • Statistical vs. practical significance: A small mean difference can be statistically significant with a large sample. Always report and consider effect sizes (such as Cohen's d or η²) alongside p-values.
  • Non-significant results: A non-significant pairwise comparison means you failed to detect a difference, not that the groups are identical. Avoid stating that two groups are "the same."
  • Interaction effects: When a significant interaction is present, post-hoc comparisons on main effects alone can be misleading. You'll typically want to compare simple effects, examining differences within one factor at each level of the other factor, rather than collapsing across levels.
  • Reporting: Clearly state which post-hoc test you used and why. Include mean differences, confidence intervals, and effect sizes alongside p-values. This makes your results more interpretable and easier for others to replicate.