Factorial designs let you study the effects of multiple independent variables (factors) at the same time within a single experiment. Instead of running separate experiments for each factor, you cross all factor levels together, which reveals not only how each factor affects the outcome on its own but also how factors combine to produce effects that neither would produce alone. This makes factorial designs one of the most efficient and informative structures available for randomized experiments.
Factorial design overview
A factorial design is an experiment in which every level of every factor is combined with every level of every other factor. Subjects are then randomly assigned to these combinations. The result is that you can estimate main effects (the average impact of each factor) and interaction effects (whether the impact of one factor changes depending on the level of another).
Factors and levels
A factor is an independent variable that the experimenter manipulates. Each factor has two or more levels, which are the specific values or categories it can take.
For example, in a 2×2 factorial design studying weight loss, you might have:
- Factor A: Diet with levels low-fat and high-fat
- Factor B: Exercise with levels sedentary and active
The notation "2×2" tells you there are two factors, each with two levels.
Treatment combinations
A treatment combination is one specific pairing of factor levels. You get the total number of combinations by multiplying the number of levels across all factors.
- A 2×2 design has 2 × 2 = 4 treatment combinations
- A 2×3 design has 2 × 3 = 6 treatment combinations
- A 2×2×2 design has 2 × 2 × 2 = 8 treatment combinations
In the diet-and-exercise example, the four treatment combinations are: low-fat/sedentary, low-fat/active, high-fat/sedentary, and high-fat/active.
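The multiplication rule above means the full set of cells is just the Cartesian product of the factor levels; a minimal sketch in Python using the diet-and-exercise example:

```python
from itertools import product

# Factor levels from the diet-and-exercise example
diet = ["low-fat", "high-fat"]
exercise = ["sedentary", "active"]

# Every treatment combination is one element of the Cartesian product
cells = list(product(diet, exercise))
print(len(cells))  # 2 x 2 = 4 combinations
for d, e in cells:
    print(f"{d}/{e}")
```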
Balanced vs unbalanced designs
- A balanced factorial design assigns an equal number of subjects to each treatment combination. This is preferred because it gives equal precision for all effect estimates and simplifies the analysis.
- An unbalanced design has unequal cell sizes, which can happen due to dropout or practical constraints. Unbalanced designs require more complex analysis (e.g., Type III sums of squares) and can make interaction effects harder to estimate cleanly.
Benefits of factorial designs
Efficiency vs single-factor experiments
Factorial designs get more information per subject than running separate one-factor-at-a-time experiments. Consider a 2×2 design with 20 subjects per cell (80 total). Each main effect estimate uses all 80 subjects (40 per level of each factor), so you effectively get two experiments' worth of information from one sample.
Running two separate single-factor experiments with the same precision would require about 160 subjects in total (80 per experiment), twice as many, and you'd still learn nothing about how the factors interact.
Interaction effects detection
An interaction effect occurs when the effect of one factor depends on the level of another factor. Only designs that cross the factors can detect these, because detection requires observing the factors in combination.
For example, suppose a study finds that medication reduces depression symptoms by 10 points on average, and therapy reduces them by 8 points. If the combination reduces symptoms by 25 points rather than the expected 18, there's a positive interaction: the treatments are synergistic. You'd miss this entirely if you studied medication and therapy in separate experiments.
Cost and time savings
By combining factors into a single experiment, you avoid duplicating recruitment, setup, and measurement procedures across multiple studies. This matters especially in settings where running experiments is expensive or time-sensitive, such as clinical trials or policy evaluations.
Assumptions of factorial designs
The standard analysis of factorial designs (ANOVA-based) relies on several assumptions. Violations can bias your estimates or inflate your Type I error rate.
Independence of observations
Each observation should be independent of every other observation. If subjects can influence each other's outcomes, this assumption breaks down. A common violation occurs in cluster settings: students in the same classroom, patients in the same clinic, etc. When clustering is present, you need to account for it (e.g., cluster-robust standard errors or multilevel models).
Normality of residuals
The residuals (observed minus predicted values) should be approximately normally distributed. You can check this with Q-Q plots or the Shapiro-Wilk test. Moderate departures from normality are usually tolerable with large samples due to the central limit theorem, but heavily skewed or heavy-tailed distributions may call for data transformations or nonparametric alternatives.
Homogeneity of variances
The variance of residuals should be roughly equal across all treatment combinations. This is called homoscedasticity. You can check it with Levene's test or by plotting residuals against predicted values. If variances differ substantially across cells, consider using Welch-type corrections or robust standard errors.

Designing factorial experiments
Choosing factors and levels
- Select factors that have a plausible causal relationship with the outcome and are directly relevant to your research question.
- Choose levels that are meaningfully distinct from each other and representative of the range you care about.
- Consider practical feasibility and ethics. Can you actually manipulate each factor at the chosen levels without harming participants?
For instance, in an agricultural experiment on plant growth, you might choose temperature levels of 15°C, 25°C, and 35°C to span a realistic range, rather than picking two values that are barely different.
Determining sample size
Use power analysis before collecting data to figure out how many subjects you need per cell. The required sample size depends on:
- The effect size you want to detect (smaller effects need larger samples)
- Your chosen significance level (typically α = .05)
- Your desired power (typically 80% or higher)
- The number of treatment combinations (more cells means more subjects overall)
For a 2×2 design aiming to detect a medium-sized interaction effect (Cohen's f = 0.25) with 80% power at α = .05, a power analysis indicates you need roughly 32 subjects per cell (128 total). Software like G*Power can compute these numbers for you.
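The same computation can be reproduced from the noncentral F distribution: the noncentrality parameter for an ANOVA effect is λ = f²·N, and power is the probability that a noncentral F exceeds the critical value. A sketch (scipy assumed) that searches for the smallest balanced per-cell n:

```python
from scipy.stats import f as f_dist, ncf

def anova_cell_size(f_effect=0.25, alpha=0.05, target_power=0.80,
                    n_cells=4, df_effect=1):
    """Smallest per-cell n for one effect in a balanced factorial ANOVA."""
    n = 2
    while True:
        N = n * n_cells                    # total sample size
        df_error = N - n_cells
        nc = f_effect**2 * N               # noncentrality lambda = f^2 * N
        crit = f_dist.ppf(1 - alpha, df_effect, df_error)
        power = 1 - ncf.cdf(crit, df_effect, df_error, nc)
        if power >= target_power:
            return n, power
        n += 1

# Medium interaction (f = 0.25) in a 2x2 design, df = 1
n, power = anova_cell_size()
print(n, n * 4, round(power, 3))
```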
Randomization and blocking
Randomization is what makes factorial designs valid for causal inference. Randomly assigning subjects to treatment combinations ensures that, in expectation, the groups are comparable on both observed and unobserved characteristics.
Blocking adds precision on top of randomization. You group subjects by a known source of variability (the blocking variable), then randomize treatment assignments within each block. For example, if prior academic performance predicts the outcome, you could create blocks of high-, medium-, and low-performing students, then randomly assign treatments within each block. This reduces residual variance and increases power.
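Within-block randomization for that student example can be sketched with the standard library; the subject IDs and block sizes here are hypothetical, chosen so each block receives every treatment combination once:

```python
import random

random.seed(42)

# Blocks formed from prior academic performance (hypothetical subjects)
subjects_by_block = {
    "high":   ["s1", "s2", "s3", "s4"],
    "medium": ["s5", "s6", "s7", "s8"],
    "low":    ["s9", "s10", "s11", "s12"],
}
treatments = ["low-fat/sedentary", "low-fat/active",
              "high-fat/sedentary", "high-fat/active"]

assignment = {}
for block, subjects in subjects_by_block.items():
    random.shuffle(subjects)          # randomize within each block
    for subj, treat in zip(subjects, treatments):
        assignment[subj] = treat      # each cell appears once per block

print(assignment)
```

Because every block contains all four cells, block-level variability cancels out of the treatment comparisons.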
Analyzing factorial designs
ANOVA for factorial designs
Analysis of variance (ANOVA) is the standard tool. It partitions the total sum of squares in the outcome into components:
- Main effect of Factor A
- Main effect of Factor B
- Interaction effect A×B
- Residual (error)
Each component is tested with an F-test, which compares the variance explained by that component to the residual variance. The ANOVA table reports degrees of freedom, sums of squares, mean squares, F-values, and p-values for each source.
Main effects vs interaction effects
A main effect is the average effect of one factor, collapsing across all levels of the other factor(s). A significant main effect of diet means that, on average across exercise conditions, one diet produces different outcomes than the other.
An interaction effect means the effect of one factor changes depending on the level of another. If diet matters a lot for sedentary people but barely matters for active people, that's an interaction.
When a significant interaction is present, interpreting main effects in isolation can be misleading. The main effect of diet might look small on average, but that average hides the fact that diet has a large effect for one subgroup and no effect for another. Always check for interactions before drawing conclusions from main effects alone.
Multiple comparisons and post-hoc tests
When a main effect or interaction is significant, you often want to know which specific groups differ. Post-hoc tests do this while controlling for the inflated Type I error rate that comes from making multiple comparisons.
Common procedures include:
- Tukey's HSD: Controls the family-wise error rate across all pairwise comparisons. Good default choice.
- Bonferroni correction: Divides α by the number of comparisons. Conservative but simple.
- Scheffé's method: Most conservative; useful when you're testing complex contrasts, not just pairwise differences.
For example, if a significant diet×exercise interaction is found, post-hoc tests might reveal that the low-fat diet leads to significantly more weight loss than the high-fat diet among sedentary individuals, but not among active individuals.
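All pairwise cell comparisons with Tukey's HSD are available in statsmodels (assumed installed); a sketch on simulated data, with hypothetical cell labels and means:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(2)
# One label per diet/exercise cell, 20 simulated subjects each
labels = np.repeat(["lowfat_sed", "lowfat_act",
                    "highfat_sed", "highfat_act"], 20)
means = {"lowfat_sed": 12.0, "lowfat_act": 13.0,
         "highfat_sed": 8.0, "highfat_act": 12.5}
loss = np.array([rng.normal(means[g], 2.0) for g in labels])

# Tukey's HSD over all pairwise comparisons, family-wise alpha = .05
result = pairwise_tukeyhsd(endog=loss, groups=labels, alpha=0.05)
print(result.summary())
```

With four cells there are six pairwise comparisons, each reported with an adjusted confidence interval and a reject/retain decision.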
Effect size and power
Statistical significance alone doesn't tell you whether an effect is practically meaningful. Always report effect sizes alongside p-values.
- Partial eta-squared (η_p²): The proportion of variance in the outcome explained by a given effect, after removing variance due to other effects; computed as SS_effect / (SS_effect + SS_error). Values of 0.01, 0.06, and 0.14 are often considered small, medium, and large.
- Cohen's f: Related to η_p² by f = √(η_p² / (1 − η_p²)). Values of 0.10, 0.25, and 0.40 correspond to small, medium, and large effects.
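Both effect sizes follow directly from the sums of squares in the ANOVA table. A small sketch, using the standard formulas η_p² = SS_effect / (SS_effect + SS_error) and f = √(η_p² / (1 − η_p²)), with made-up sums of squares:

```python
from math import sqrt

def partial_eta_sq(ss_effect, ss_error):
    """eta_p^2 = SS_effect / (SS_effect + SS_error)."""
    return ss_effect / (ss_effect + ss_error)

def cohens_f(eta_p_sq):
    """f = sqrt(eta_p^2 / (1 - eta_p^2))."""
    return sqrt(eta_p_sq / (1 - eta_p_sq))

# Example: sums of squares read off a (hypothetical) ANOVA table
eta = partial_eta_sq(ss_effect=30.0, ss_error=450.0)
f = cohens_f(eta)
print(round(eta, 4), round(f, 4))  # 0.0625 -> f around 0.258 (medium)
```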
A study with a very large sample might find a statistically significant effect that is too small to matter in practice. Conversely, a small study might miss a real and important effect. Reporting both significance and effect size gives the full picture.
Interpreting factorial results
Significance of main effects
A significant main effect tells you that the average outcome differs across levels of that factor, collapsing over the other factor(s). Check the direction (which level produces higher outcomes) and the magnitude (how large is the difference relative to the variability in the data).
Remember: if there's also a significant interaction involving that factor, the main effect is an average that may not describe any particular subgroup well.

Significance of interaction effects
A significant interaction means the story changes depending on which combination of factor levels you're looking at. To understand the interaction, examine the simple effects: the effect of one factor at each fixed level of the other factor.
For example, if study method and subject difficulty interact in their effect on test scores, you'd look at the effect of study method separately for easy subjects and hard subjects. The interaction tells you these simple effects are not the same.
Graphical representations of effects
Interaction plots are the most useful visual tool. Plot the mean outcome on the y-axis, levels of one factor on the x-axis, and use separate lines for each level of the other factor.
- Parallel lines indicate no interaction: the effect of one factor is the same regardless of the other factor's level.
- Non-parallel lines suggest an interaction. The more the lines diverge or cross, the stronger the interaction.
Bar graphs with error bars (showing confidence intervals) for each cell mean are also helpful for comparing specific treatment combinations.
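The "parallel lines" criterion can also be checked numerically: with two levels per factor, the interaction is just the difference between the two simple effects (the slopes of the two lines). The cell means below are hypothetical:

```python
# Cell means for the diet-and-exercise example (hypothetical numbers)
means = {
    ("low_fat", "sedentary"): 12.0, ("low_fat", "active"): 13.0,
    ("high_fat", "sedentary"): 8.0, ("high_fat", "active"): 12.5,
}

# Simple effect of diet at each exercise level (the two plot "lines")
diet_effect_sed = means[("low_fat", "sedentary")] - means[("high_fat", "sedentary")]
diet_effect_act = means[("low_fat", "active")] - means[("high_fat", "active")]

# Equal simple effects -> parallel lines -> no interaction
interaction_contrast = diet_effect_sed - diet_effect_act
print(diet_effect_sed, diet_effect_act, interaction_contrast)  # 4.0 0.5 3.5
```

A contrast near zero means parallel lines; here the nonzero value (3.5) says diet matters far more for sedentary participants.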
Factorial design variations
Two-way vs three-way designs
A two-way design has two factors and can estimate two main effects and one two-way interaction. A three-way design has three factors and can estimate three main effects, three two-way interactions, and one three-way interaction.
Three-way (and higher) designs provide richer information but come with costs: more treatment combinations, larger required sample sizes, and harder-to-interpret higher-order interactions. Use them when you have strong theoretical reasons to expect a three-way interaction and sufficient resources to power the design.
Within-subjects vs between-subjects factors
- Between-subjects: Each participant experiences only one treatment combination. Simple to implement, but individual differences add noise.
- Within-subjects (repeated measures): Each participant experiences all levels of a factor. This controls for individual differences and increases power, but introduces risks of carryover effects (earlier conditions affecting later ones) and order effects.
Counterbalancing (varying the order of conditions across participants) helps mitigate order effects in within-subjects designs.
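A complete counterbalancing scheme can be generated from the standard library: enumerate every presentation order and assign them cyclically so each order is used equally often (participant IDs here are hypothetical):

```python
from itertools import permutations, cycle

conditions = ["morning", "afternoon"]    # within-subjects factor levels
orders = list(permutations(conditions))  # all possible presentation orders

# Assign orders cyclically so each order appears equally often
participants = [f"p{i}" for i in range(1, 9)]
schedule = dict(zip(participants, cycle(orders)))
for p, order in schedule.items():
    print(p, "->", " then ".join(order))
```

With more than two or three conditions, full counterbalancing requires many orders (k! of them), and a balanced Latin square is the usual substitute.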
Mixed factorial designs
A mixed design combines at least one between-subjects factor and at least one within-subjects factor. This is common when one factor can't be repeated (e.g., a drug vs. placebo, assigned once) but another naturally involves repeated measurement (e.g., time points).
For example, a study might assign participants to caffeine or placebo (between-subjects) and test them in both the morning and afternoon (within-subjects). The mixed design lets you estimate the main effects of caffeine and time of day, plus their interaction.
Limitations of factorial designs
Large number of treatment combinations
The number of cells grows multiplicatively. A 2×2×2×2 design has 16 cells; a 3×3×3 design has 27. Each cell needs enough subjects for adequate power, so the total sample size can become impractical quickly.
Fractional factorial designs address this by running only a strategically chosen subset of all possible treatment combinations. You sacrifice the ability to estimate some higher-order interactions, but you can still estimate main effects and lower-order interactions with far fewer subjects.
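A half-fraction of a 2×2×2 design can be constructed from the defining relation I = ABC: code each factor's levels as ±1 and keep only the runs whose coded levels multiply to +1. A sketch:

```python
from itertools import product

# Full 2^3 design coded as +/-1; keep the half-fraction where A*B*C = +1
full = list(product([-1, 1], repeat=3))
half = [run for run in full if run[0] * run[1] * run[2] == 1]

print(len(full), len(half))  # 8 runs reduced to 4
for a, b, c in half:
    print(a, b, c)
```

The price of the reduction is aliasing: under I = ABC, each main effect is confounded with a two-way interaction (A with BC, and so on), so this fraction is only appropriate when those interactions can be assumed negligible.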
Difficulty interpreting higher-order interactions
A two-way interaction is relatively straightforward: the effect of A depends on B. A three-way interaction means the two-way interaction between A and B itself changes across levels of C. Four-way interactions are even harder to parse.
Higher-order interactions are also more likely to be spurious, especially when you're testing many effects simultaneously. Treat them with caution and look for replication before building theory around them.
Confounding and lurking variables
Randomization protects against confounding in expectation, but it doesn't guarantee balance in any single experiment, especially with small samples. Lurking variables are unmeasured factors that could influence the outcome and co-vary with your treatment assignments by chance.
To guard against this:
- Use randomization (the most important step)
- Block on known prognostic variables
- Check for baseline balance across treatment groups
- Consider covariate adjustment in the analysis
Factorial designs in practice
Real-world examples and case studies
- Psychology: Bandura, Ross, and Ross (1961) used a factorial design to study children's aggression after observing aggressive vs. non-aggressive models who were either rewarded or punished. The interaction revealed that model reward moderated the effect of observed aggression on children's behavior.
- Marketing: A company might test price (low, medium, high) crossed with ad type (emotional, informational) to see which combination maximizes purchase intent. The interaction could show that emotional ads work best at high price points.
- Healthcare: The STAR*D trial and similar studies use factorial-like structures to evaluate combinations of medications and psychotherapy, testing whether treatments are additive or synergistic.
Reporting factorial design results
Follow established reporting guidelines (APA Style, CONSORT for trials). A complete report should include:
- A clear description of all factors, their levels, and the resulting treatment combinations
- Sample size per cell and overall, with information on how the sample size was determined
- The randomization procedure and any blocking variables
- Full ANOVA results: F-values, degrees of freedom, p-values, and effect sizes for every main effect and interaction
- Post-hoc comparisons where relevant, with the correction method specified
- Graphical displays (interaction plots, bar graphs with confidence intervals) to make the pattern of results accessible to readers