Fiveable

🥖Linear Modeling Theory Unit 10 Review

10.4 ANOVA as a Special Case of Linear Regression

Written by the Fiveable Content Team • Last updated August 2025

ANOVA vs Regression with Categorical Predictors

ANOVA and linear regression with categorical predictors are mathematically the same procedure. They both compare group means, produce identical test statistics, and lead to the same conclusions. The reason this equivalence matters is that regression gives you a more flexible framework: it can handle unbalanced designs, include continuous covariates, and scale up to more complex models without switching methods.

Mathematical Equivalence

  • One-way ANOVA and linear regression with a dummy-coded categorical predictor produce identical results: the same sums of squares, the same F-statistic, and the same p-value.
  • The overall F-test in ANOVA is exactly the overall F-test for the regression model.
  • The t-tests for individual regression coefficients correspond to pairwise comparisons between each group and the reference group. Squaring any of those t-statistics gives you the equivalent F-statistic for that specific comparison.
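The equivalence above can be checked numerically. The sketch below (pure Python, made-up data) computes the F-statistic two ways on the same three groups: once from the ANOVA between/within decomposition, and once from the regression model/residual decomposition, using the fact that a dummy-coded fit predicts each observation's group mean.

```python
# Numeric check: one-way ANOVA F equals the regression F on the same data.
# Under dummy coding, the fitted value for every observation is its group
# mean, so SS(model) = SS(between) and SS(residual) = SS(within).

groups = {"A": [10, 12], "B": [15, 17], "C": [20, 24]}  # made-up data
k = len(groups)
n = sum(len(v) for v in groups.values())
grand = sum(sum(v) for v in groups.values()) / n
means = {g: sum(v) / len(v) for g, v in groups.items()}

# ANOVA decomposition: between-group and within-group sums of squares
ss_between = sum(len(v) * (means[g] - grand) ** 2 for g, v in groups.items())
ss_within = sum((y - means[g]) ** 2 for g, v in groups.items() for y in v)
f_anova = (ss_between / (k - 1)) / (ss_within / (n - k))

# Regression decomposition: fitted value = group mean under dummy coding
ss_model = sum((means[g] - grand) ** 2 for g, v in groups.items() for _ in v)
ss_resid = sum((y - means[g]) ** 2 for g, v in groups.items() for y in v)
f_reg = (ss_model / (k - 1)) / (ss_resid / (n - k))

print(round(f_anova, 4), round(f_reg, 4))  # 15.1667 15.1667
```

The two F values agree because the sums of squares are term-by-term identical; the labels ("between" vs. "model") are the only difference.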

Variable Types and Group Comparisons

In one-way ANOVA, you have a categorical independent variable (the grouping factor with k levels) and a continuous dependent variable. Regression handles this by converting the categorical variable into a set of dummy variables, which are numeric columns the model can work with. This lets you compare group means within the same regression framework you'd use for continuous predictors, and it opens the door to controlling for covariates.

Linear Regression Model for ANOVA

Dummy Variables

A dummy variable is a binary indicator coded 0 or 1 that flags whether an observation belongs to a particular category.

Here's how the coding works:

  1. Start with a categorical variable that has k levels (groups).

  2. Choose one level as the reference level (baseline). All its dummy variables will equal 0.

  3. Create k - 1 dummy variables. Each one equals 1 for observations in its corresponding group and 0 otherwise.

  4. You use k - 1 (not k) dummies to avoid perfect multicollinearity, where one predictor is a perfect linear combination of the others.

Each regression coefficient on a dummy variable then tells you the difference in means between that group and the reference group.
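The coding steps above can be sketched in a few lines of pure Python. The helper name `dummy_code` and the group labels are illustrative, not from any particular library:

```python
# Dummy-code a categorical variable with k = 3 levels ("A" as reference),
# producing k - 1 = 2 indicator columns.

def dummy_code(labels, reference):
    """Return one 0/1 indicator column per non-reference level."""
    levels = sorted(set(labels))
    levels.remove(reference)          # the reference level gets no column
    return {lvl: [1 if x == lvl else 0 for x in labels] for lvl in levels}

groups = ["A", "A", "B", "B", "C", "C"]
dummies = dummy_code(groups, reference="A")
print(dummies)  # {'B': [0, 0, 1, 1, 0, 0], 'C': [0, 0, 0, 0, 1, 1]}
```

Observations in the reference group "A" are all zeros across both columns, which is exactly why the intercept ends up being the reference-group mean.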

Model Specification

The regression model equivalent to a one-way ANOVA with k groups is:

Y = β₀ + β₁D₁ + β₂D₂ + ⋯ + βₖ₋₁Dₖ₋₁ + ε

  • Y is the continuous outcome variable.
  • β₀ is the intercept, equal to the mean of the reference group (Ȳ_ref).
  • βᵢ is the coefficient for the i-th dummy variable, equal to Ȳᵢ - Ȳ_ref.
  • Dᵢ is the i-th dummy variable (1 if the observation is in group i, 0 otherwise).
  • ε is the error term capturing variation not explained by group membership.

For example, with three groups (A, B, C) and A as the reference, the predicted value for a group B observation is Ŷ = β₀ + β₁(1) + β₂(0) = β₀ + β₁. That's just the mean of group A plus the B-vs-A difference, which equals the mean of group B.
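This can be verified by actually running least squares on dummy-coded data. The sketch below solves the normal equations (X'X)β = X'y with a small hand-rolled Gaussian elimination (pure Python, made-up data; the solver is only meant for tiny systems like this one):

```python
# Fit OLS on dummy-coded data and check that the coefficients reproduce
# the group means: β₀ = mean(A), β₁ = mean(B) - mean(A), β₂ = mean(C) - mean(A).

def ols(X, y):
    """Solve the normal equations (X'X) beta = X'y by Gaussian elimination."""
    p = len(X[0])
    A = [[sum(X[i][j] * X[i][k] for i in range(len(X))) for k in range(p)]
         for j in range(p)]                         # X'X
    b = [sum(X[i][j] * y[i] for i in range(len(X))) for j in range(p)]  # X'y
    for col in range(p):                            # elimination with pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            A[r] = [a - f * c for a, c in zip(A[r], A[col])]
            b[r] -= f * b[col]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):                  # back substitution
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, p))) / A[r][r]
    return beta

# Three groups, two observations each: A = (10, 12), B = (15, 17), C = (20, 24)
y = [10, 12, 15, 17, 20, 24]
X = [[1, 0, 0], [1, 0, 0],   # group A (reference): D1 = D2 = 0
     [1, 1, 0], [1, 1, 0],   # group B: D1 = 1
     [1, 0, 1], [1, 0, 1]]   # group C: D2 = 1
b0, b1, b2 = ols(X, y)
print([round(v, 6) for v in (b0, b1, b2)])  # [11.0, 5.0, 11.0]
```

Here mean(A) = 11, mean(B) = 16, mean(C) = 22, so the fitted intercept is the A mean and each slope is a mean difference, exactly as the interpretation above says.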

Interpreting Regression Coefficients for Group Comparisons

Coefficient Interpretation

  • Intercept (β₀): the sample mean of the reference group.
  • Each slope (βᵢ): the difference in sample means between group i and the reference group.
    • A positive βᵢ means group i has a higher mean than the reference.
    • A negative βᵢ means group i has a lower mean than the reference.
    • The magnitude of βᵢ is the size of that difference.

This means you can recover every group mean directly from the coefficients. The reference group mean is β₀, and any other group mean is β₀ + βᵢ.

Hypothesis Testing and Confidence Intervals

  • The t-test for each βᵢ tests H₀: μᵢ = μ_ref, i.e., whether group i differs significantly from the reference group.
  • A confidence interval for βᵢ gives a range of plausible values for the true mean difference μᵢ - μ_ref. If that interval excludes 0, the difference is statistically significant at the chosen α level.
  • The overall F-test tests all dummy coefficients at once (H₀: β₁ = ⋯ = βₖ₋₁ = 0). Since every βᵢ being zero means every group mean equals the reference mean, this is the same null hypothesis as in ANOVA: H₀: μ₁ = μ₂ = ⋯ = μₖ.
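With only two groups, the t/F relationship is especially transparent: the regression slope's t-statistic, squared, equals the ANOVA F-statistic. A minimal pure-Python check with made-up data:

```python
# Two-group case: t-statistic for the slope (regression view) versus
# F-statistic (ANOVA view). Mathematically, t² = F.
import math

a = [10, 12, 11]          # reference group
b = [15, 17, 16]          # comparison group
n_a, n_b = len(a), len(b)
mean_a, mean_b = sum(a) / n_a, sum(b) / n_b

# Regression view: slope = mean difference, with the usual OLS standard error
ss_within = sum((y - mean_a) ** 2 for y in a) + sum((y - mean_b) ** 2 for y in b)
mse = ss_within / (n_a + n_b - 2)                 # residual mean square
se_slope = math.sqrt(mse * (1 / n_a + 1 / n_b))
t = (mean_b - mean_a) / se_slope

# ANOVA view: between-group mean square over within-group mean square
grand = (sum(a) + sum(b)) / (n_a + n_b)
ss_between = n_a * (mean_a - grand) ** 2 + n_b * (mean_b - grand) ** 2
f = (ss_between / 1) / mse                        # 1 numerator df for k = 2

print(round(t ** 2, 6), round(f, 6))  # both 37.5
```

This is also the sense in which the t-test for each coefficient "is" a pairwise F-test: square the t and you get the F for that one comparison.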

Advantages and Limitations of Regression for ANOVA

Advantages

  • Covariates: Regression lets you add continuous control variables (e.g., age, baseline score) directly into the model, turning a one-way ANOVA into an ANCOVA without switching software procedures.
  • Unbalanced designs: When group sizes are unequal, regression handles the unequal weighting naturally through its least-squares estimation.
  • Modeling flexibility: You can add interaction terms, polynomial terms, or additional categorical predictors within the same framework, building toward the general linear model.

Limitations and Considerations

  • Regression still requires the same assumptions ANOVA does: independence, normality of residuals, and homogeneity of variance across groups. Violating homogeneity (heteroscedasticity) is a concern in both approaches.
  • The linearity assumption in regression refers to linearity in the parameters, which is automatically satisfied with dummy coding. However, if you later add continuous predictors, you do need to check for non-linear relationships.
  • Researchers accustomed to traditional ANOVA output (sums of squares decomposition, mean squares, F-ratio tables) may find regression output less intuitive at first, even though the underlying math is identical.

When to use which framing: If you have a balanced design, no covariates, and a straightforward group comparison, classical ANOVA notation is simpler and more familiar. Once you need covariates, unequal group sizes, or more complex model terms, the regression framework is the natural choice. Since they're the same model, the decision is about clarity and convenience, not correctness.