Categorical Predictors in Regression
Understanding Categorical Predictors
Categorical predictors represent distinct groups or categories rather than continuous numerical values. In regression, they let you compare how different groups relate to the response variable.
A few common examples:
- Gender (male/female)
- Education level (high school / college / graduate)
- Product type (A / B / C)
Categorical predictors with exactly two levels are called binary (or dichotomous) variables. Those with more than two levels are called polytomous variables. The distinction matters because polytomous variables require more than one dummy variable to encode, which directly affects how you build and interpret the model.
Role of Categorical Predictors in Regression Models
Including categorical predictors lets you estimate the effect of belonging to each category on the response variable, holding other predictors constant. A regression model with categorical predictors can identify whether the response variable differs significantly across groups.
Beyond main effects, you can also include interaction terms between categorical predictors, or between a categorical and a continuous predictor. These interactions test whether the effect of one predictor changes depending on the level of another. Overall, categorical predictors can meaningfully improve a model's explanatory power by capturing group-level variation that continuous predictors alone would miss.
Creating Dummy Variables
Concept of Dummy Variables
Because regression requires numerical inputs, you can't plug a variable like "region = {North, South, East, West}" directly into a model. Dummy variables (also called indicator variables) solve this by converting each category into a binary 0/1 variable.
The key rule: a categorical predictor with k levels requires k - 1 dummy variables. The level that doesn't get its own dummy variable becomes the reference category (or baseline). Every coefficient for the remaining dummies is then interpreted relative to that reference category.
For example, suppose "region" has three levels: North, South, and West. If you choose North as the reference category, you create two dummy variables:
- D_South: equals 1 if the observation is South, 0 otherwise
- D_West: equals 1 if the observation is West, 0 otherwise
An observation from the North has D_South = 0 and D_West = 0. The reference category is identified by all dummy variables equaling zero.
Why only k - 1 dummies? If you included all k, the dummy columns would sum to 1 for every observation, creating perfect multicollinearity with the intercept. The model can't be estimated under those conditions.
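The multicollinearity point is easy to verify numerically. Below is a minimal sketch with made-up region data (the variable names `d_south`, `d_west`, etc. are illustrative): with all k dummies plus an intercept, the design matrix loses full rank.

```python
import numpy as np

# Hypothetical "region" observations (data are illustrative).
regions = ["North", "South", "West", "South", "North", "West"]

# With North as the reference category, create k - 1 = 2 dummies.
d_south = np.array([1 if r == "South" else 0 for r in regions])
d_west = np.array([1 if r == "West" else 0 for r in regions])

# A North observation is identified by both dummies equaling zero.
print(d_south[0], d_west[0])  # 0 0

# Why not a third dummy? With all k dummies, the dummy columns sum
# to 1 in every row -- identical to the intercept column -- so the
# design matrix is rank-deficient and the model can't be estimated.
d_north = np.array([1 if r == "North" else 0 for r in regions])
X_bad = np.column_stack([np.ones(len(regions)), d_north, d_south, d_west])
print(np.linalg.matrix_rank(X_bad))  # 3, not 4: perfect multicollinearity
```

Dropping any one of the three dummy columns restores full rank, which is exactly why the k - 1 rule exists.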

Process of Creating Dummy Variables
- Identify the categorical predictor and its levels. Confirm how many distinct categories exist in your data.
- Choose a reference category. This is often the most common level, a natural control group, or whichever category makes interpretation most meaningful for your research question.
- Create dummy variables. For each non-reference level, create a binary variable that equals 1 when an observation belongs to that level and 0 otherwise.
- Include the dummy variables in the regression model as predictors alongside any continuous variables.
- Interpret each dummy coefficient as the estimated difference in the mean response between that level and the reference category.
The choice of reference category doesn't change the model's overall fit or predictions. It only changes which comparisons the coefficients directly represent.
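That invariance can be checked directly. This sketch (with simulated region data; the group means are made up) fits the same model under two different reference categories and confirms the fitted values are identical:

```python
import numpy as np

rng = np.random.default_rng(0)
regions = np.array(["North", "South", "West"] * 10)
# Hypothetical response: group-level means plus noise (values illustrative).
means = {"North": 50.0, "South": 55.0, "West": 60.0}
y = np.array([means[r] for r in regions]) + rng.normal(0, 2, size=regions.size)

def fit(reference):
    # Build intercept + k - 1 dummies for the non-reference levels.
    levels = [l for l in ["North", "South", "West"] if l != reference]
    X = np.column_stack(
        [np.ones(regions.size)] + [(regions == l).astype(float) for l in levels]
    )
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta  # fitted values

# Predictions are identical regardless of which level is the baseline.
print(np.allclose(fit("North"), fit("South")))  # True
```

Only the coefficients change between the two fits; the column space of the design matrix, and therefore every prediction, stays the same.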
Interpreting Dummy Variable Coefficients
Coefficient Interpretation
The coefficient on a dummy variable tells you the estimated difference in the mean response between that category and the reference category, holding all other predictors constant.
Suppose you're modeling salary with education level as a predictor, using "high school" as the reference:
- The intercept is the estimated mean salary for the reference group (high school).
- The coefficient on the college dummy, +$12,000, means college-educated individuals earn, on average, $12,000 more than high school-educated individuals.
- The coefficient on the graduate dummy, +$24,000, means graduate-educated individuals earn, on average, $24,000 more than the high school group.
A positive coefficient means that category has a higher mean response than the reference. A negative coefficient means a lower mean response. The magnitude tells you the size of that difference.
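The "intercept = reference mean, coefficient = difference in means" interpretation can be confirmed with a toy version of the salary example (the salary figures below, in thousands, are invented to match the $12,000 and $24,000 gaps):

```python
import numpy as np

# Hypothetical salaries in thousands of dollars for three education groups.
edu = np.array(["hs"] * 4 + ["college"] * 4 + ["grad"] * 4)
salary = np.array([40, 42, 38, 40,       # high school
                   52, 54, 50, 52,       # college
                   64, 66, 62, 64], float)  # graduate

# Dummies with "hs" as the reference category.
X = np.column_stack([np.ones(edu.size),
                     (edu == "college").astype(float),
                     (edu == "grad").astype(float)])
b0, b_college, b_grad = np.linalg.lstsq(X, salary, rcond=None)[0]

# Intercept = mean of the reference group; each dummy coefficient =
# that group's mean minus the reference group's mean.
print(round(b0, 1))                           # 40.0 (high school mean)
print(round(b_college, 1), round(b_grad, 1))  # 12.0 24.0
```

With only a categorical predictor in the model, the fitted values are exactly the group means, which is why the coefficients reduce to differences of means.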
Assessing Significance
Each dummy variable coefficient has its own hypothesis test and p-value. The null hypothesis is that there's no difference in the mean response between that category and the reference category.
- A significant coefficient (typically p < 0.05) indicates the difference from the reference category is unlikely to be due to chance alone.
- A non-significant coefficient suggests the data don't provide strong evidence of a difference from the reference group.
- Confidence intervals around each coefficient give you a range of plausible values for the true difference in means.
One thing to watch: the individual t-tests only compare each level to the reference. If you want to test whether the categorical predictor matters overall (across all levels simultaneously), use an F-test for the joint significance of all the dummy variables together.
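The joint F-test compares the full model against a reduced model with all the dummies dropped. A minimal sketch, reusing the hypothetical salary data from above:

```python
import numpy as np

# Hypothetical salary data (thousands of dollars), as before.
edu = np.array(["hs"] * 4 + ["college"] * 4 + ["grad"] * 4)
salary = np.array([40, 42, 38, 40, 52, 54, 50, 52, 64, 66, 62, 64], float)

def rss(X, y):
    # Residual sum of squares from an OLS fit.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

# Full model: intercept + both education dummies.
X_full = np.column_stack([np.ones(12),
                          (edu == "college").astype(float),
                          (edu == "grad").astype(float)])
# Reduced model: intercept only (education dropped entirely).
X_null = np.ones((12, 1))

q = 2                  # number of dummy coefficients being tested jointly
df_resid = 12 - 3      # n minus parameters in the full model
F = ((rss(X_null, salary) - rss(X_full, salary)) / q) / (rss(X_full, salary) / df_resid)
print(round(F, 1))  # 216.0 -- far beyond the F(2, 9) critical value
```

A large F statistic here says the education dummies jointly explain real variation in salary, even though no single t-test on its own answers that overall question.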

Comparing Category Effects
Comparing Coefficients
With dummy variables in the model, you can compare the effects of different categories on the response. Larger absolute coefficients indicate a stronger estimated effect relative to the reference group.
In the salary example above, the graduate coefficient (+$24,000) is twice the college coefficient (+$12,000), suggesting the salary gap between graduate and high school education is about twice the gap between college and high school education.
However, comparing two non-reference categories to each other requires a bit more work, since each coefficient is defined relative to the reference, not relative to the other categories.
Pairwise Comparisons
To directly compare two non-reference categories, you have a couple of options:
- Change the reference category and re-estimate the model. For instance, switching the reference to "college" would give you a coefficient that directly estimates the graduate-vs.-college difference.
- Compute the difference between coefficients from the original model (e.g., the graduate coefficient minus the college coefficient) and use a linear contrast or Wald test to assess its significance.
Both approaches yield the same point estimate, but the second avoids re-running the model.
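The equivalence of the two approaches is easy to demonstrate with the same toy salary data (values invented to match the example):

```python
import numpy as np

# Hypothetical salary data (thousands of dollars), as before.
edu = np.array(["hs"] * 4 + ["college"] * 4 + ["grad"] * 4)
salary = np.array([40, 42, 38, 40, 52, 54, 50, 52, 64, 66, 62, 64], float)

def coefs(reference):
    # Fit OLS with the given level as the reference category.
    levels = [l for l in ["hs", "college", "grad"] if l != reference]
    X = np.column_stack([np.ones(edu.size)] +
                        [(edu == l).astype(float) for l in levels])
    b = np.linalg.lstsq(X, salary, rcond=None)[0]
    return dict(zip(["intercept"] + levels, b))

# Option 1: difference of coefficients from the original ("hs") model.
b = coefs("hs")
diff = b["grad"] - b["college"]   # 24 - 12 = 12

# Option 2: re-reference to "college"; its grad coefficient is the same gap.
print(round(diff, 1), round(coefs("college")["grad"], 1))  # 12.0 12.0
```

The point estimates agree exactly; the standard error for the difference in Option 1 would come from a contrast (it involves the covariance between the two coefficients, not just their individual standard errors).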
When you're making many pairwise comparisons, be cautious about Type I error inflation. Testing all possible pairs increases the chance of finding a "significant" difference by chance. Corrections like Bonferroni or Tukey's HSD can help control the overall error rate.
Interaction Effects
Interaction terms let you test whether the effect of one predictor depends on the level of another. For example, an interaction between gender and education would test whether the salary boost from a graduate degree differs for men and women.
In the model, an interaction between a categorical and a continuous predictor creates new terms by multiplying the dummy variable by the continuous variable. A significant interaction coefficient means the slope of the continuous predictor differs across categories.
Interpreting interactions requires looking at the interaction coefficients together with the main effects. You can't interpret either in isolation. Interaction plots are especially helpful here: they graph the predicted response across levels of one predictor, with separate lines for each level of the other predictor. Non-parallel lines signal an interaction effect.
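A categorical-by-continuous interaction can be sketched as follows. The data here are simulated with two groups ("A" and "B", names illustrative) whose true slopes on a continuous predictor genuinely differ, so the interaction term recovers that difference:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: response vs. a continuous predictor ("years"),
# with a steeper true slope for group B than group A.
n = 40
group = np.array(["A"] * n + ["B"] * n)
years = np.tile(np.linspace(0, 10, n), 2)
true_slope = np.where(group == "A", 1.0, 2.0)
y = 30 + 5 * (group == "B") + true_slope * years + rng.normal(0, 0.5, 2 * n)

d = (group == "B").astype(float)
# Interaction term = dummy variable multiplied by the continuous predictor.
X = np.column_stack([np.ones(2 * n), d, years, d * years])
b0, b_d, b_yrs, b_int = np.linalg.lstsq(X, y, rcond=None)[0]

# Slope for group A is b_yrs; slope for group B is b_yrs + b_int.
# A nonzero interaction coefficient means the slopes differ by group.
print(round(b_yrs, 2), round(b_yrs + b_int, 2))  # roughly 1.0 and 2.0
```

Plotting `y` against `years` with one fitted line per group would show the non-parallel lines the text describes; parallel lines would correspond to an interaction coefficient near zero.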