Fiveable

🥖Linear Modeling Theory Unit 4 Review


4.1 Residual Analysis and Plots


Written by the Fiveable Content Team • Last updated August 2025

Residuals in Linear Regression

Understanding Residuals

A residual is the difference between an observed value and the value your regression model predicts for that point:

e_i = y_i - \hat{y}_i

Each data point has its own residual. A positive residual means the model underpredicted; a negative residual means it overpredicted. One fundamental property: in an ordinary least squares regression with an intercept, the residuals always sum to zero, because the fitted line passes through (\bar{x}, \bar{y}) by construction.
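As a quick illustration, the definition above translates directly into code. This is a minimal sketch with hypothetical numbers, assuming numpy is available:

```python
import numpy as np

# Hypothetical data: fit a simple OLS line and compute each residual.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares slope and intercept
y_hat = intercept + slope * x               # predicted values
residuals = y - y_hat                       # e_i = y_i - y_hat_i

# With an intercept in the model, OLS residuals sum to zero
# (up to floating-point noise).
print(abs(residuals.sum()) < 1e-10)  # → True
```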

Residuals matter because they're your primary diagnostic tool. The regression model rests on a set of assumptions (linearity, independence, homoscedasticity, and normality), and you can't check most of those assumptions by looking at the raw data alone. You check them by looking at the residuals.

Why Residuals Matter for Diagnostics

Residuals help you spot two broad categories of problems:

  • Outliers and influential observations. An outlier is a point whose residual is unusually large in magnitude. An influential observation is a point that, if removed, would substantially change the estimated regression coefficients. These aren't the same thing: a point can be an outlier without being influential, and vice versa.
  • Systematic patterns. If the residuals show a non-random structure (curves, fans, clusters), that's evidence the model is missing something. Random-looking residuals suggest the model assumptions hold; patterned residuals suggest at least one assumption is violated.
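To make the outlier-versus-influence distinction concrete, here is a sketch with hypothetical data, assuming numpy: leverage flags points with the *potential* to be influential, while studentized residuals flag outliers.

```python
import numpy as np

# Hypothetical data with one point (x = 10) far from the rest.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
y = np.array([1.2, 1.9, 3.1, 4.2, 9.5])

n = x.size
slope, intercept = np.polyfit(x, y, deg=1)
resid = y - (intercept + slope * x)

# Leverage in simple linear regression: h_i = 1/n + (x_i - xbar)^2 / Sxx.
# High leverage marks points with the potential to be influential.
Sxx = np.sum((x - x.mean()) ** 2)
leverage = 1.0 / n + (x - x.mean()) ** 2 / Sxx

# Internally studentized residuals rescale each residual by its own
# standard error; unusually large values flag outliers.
s2 = np.sum(resid ** 2) / (n - 2)
studentized = resid / np.sqrt(s2 * (1.0 - leverage))

print(int(leverage.argmax()))  # → 4: the x = 10 point has the most leverage
```

Note that the x = 10 point has high leverage whether or not its residual is large, which is exactly why leverage and residual size must be examined separately.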

Interpreting Residual Plots


What an Ideal Residual Plot Looks Like

A residual plot is a scatterplot with predicted values (\hat{y}) on the horizontal axis and residuals (e_i) on the vertical axis. You can also plot residuals against the predictor variable x in simple linear regression.

In an ideal residual plot:

  • The points scatter randomly around the horizontal line at e = 0
  • There's no visible pattern, trend, or change in spread
  • The vertical spread of points stays roughly constant from left to right

This random cloud of points tells you the model's assumptions are reasonably satisfied.
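These properties can be checked numerically as well as visually. The sketch below uses simulated (hypothetical) data from a correctly specified model, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated (hypothetical) data from a correctly specified linear model.
x = np.linspace(0.0, 10.0, 100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=x.size)

slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
resid = y - fitted

# Two exact properties of OLS residuals (with an intercept): they average
# to zero, and they are uncorrelated with the fitted values. Any visible
# trend in the residual plot therefore signals a modeling problem.
print(abs(resid.mean()) < 1e-10, abs(np.corrcoef(fitted, resid)[0, 1]) < 1e-8)
```

Because OLS forces the residuals to be uncorrelated with the fitted values, any pattern you do see in the plot reflects structure the model failed to capture, not an artifact of the fitting procedure.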

Common Patterns and What They Mean

Funnel shape (heteroscedasticity). The spread of residuals increases (or decreases) as the predicted values get larger. This means the variance of the errors isn't constant. For example, you might see residuals tightly clustered near small \hat{y} values but widely spread near large ones. This violates the homoscedasticity assumption.

Curved pattern (non-linearity). The residuals form a U-shape or inverted U-shape instead of scattering randomly. This means a straight line doesn't capture the true relationship between x and y. A quadratic term, a different functional form, or a variable transformation may be needed.

Clusters or repeating patterns. Distinct groups of residuals or a wave-like pattern can indicate subgroups in the data, omitted variables, or autocorrelation (common in time-series data where consecutive observations are related). This points to a violation of the independence assumption.
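The funnel pattern can be screened for numerically. Here is a rough sketch on simulated (hypothetical) funnel-shaped data, assuming numpy; correlating |residual| with the fitted values is an informal stand-in for formal tests such as Breusch-Pagan:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated (hypothetical) data whose error spread grows with x,
# producing the classic funnel in the residual plot.
x = np.linspace(1.0, 10.0, 200)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5 * x)  # error sd proportional to x

slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
resid = y - fitted

# Crude numeric check for a funnel: does |residual| grow with the
# fitted value? A clearly positive correlation suggests non-constant
# error variance.
r = np.corrcoef(fitted, np.abs(resid))[0, 1]
print(r > 0.2)
```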

Residual Analysis for Model Fit


Evaluating Goodness-of-Fit

Beyond visual inspection, several numerical summaries help you assess how well the model fits:

  • Residual standard error (RSE): The standard deviation of the residuals, measuring the typical size of a prediction error in the units of y. A smaller RSE means predictions are, on average, closer to the observed values.
  • Coefficient of determination (R^2): The proportion of variance in y explained by the model, ranging from 0 to 1. An R^2 of 0.85 means the predictor accounts for 85% of the variability in the response.
  • Adjusted R^2: A modified version that penalizes for adding predictors that don't genuinely improve the model. In simple linear regression with one predictor, R^2 and adjusted R^2 will be very close, but the distinction becomes important when you move to multiple regression.
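All three summaries can be computed from the residual and total sums of squares. A minimal sketch with hypothetical one-predictor data, assuming numpy:

```python
import numpy as np

# Hypothetical one-predictor data: compute RSE, R^2, and adjusted R^2.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])

n, p = x.size, 1                       # n observations, p predictors
slope, intercept = np.polyfit(x, y, deg=1)
resid = y - (intercept + slope * x)

rss = np.sum(resid ** 2)               # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)      # total sum of squares

rse = np.sqrt(rss / (n - p - 1))       # typical prediction error, units of y
r2 = 1.0 - rss / tss                   # proportion of variance explained
adj_r2 = 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))

print(f"RSE={rse:.3f}  R^2={r2:.4f}  adj R^2={adj_r2:.4f}")
```

Note that adjusted R^2 is always at most R^2; the gap widens as predictors are added without a matching drop in RSS.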

These numbers summarize overall fit, but they won't tell you where or how the model fails. That's what the residual plots are for. A model can have a decent R^2 and still show clear assumption violations in its residual plot.

Identifying Issues and Next Steps

Residual analysis can reveal several specific problems:

  • Lack of fit: The model doesn't capture the underlying relationship (often visible as a curved residual pattern even when R^2 looks acceptable).
  • Outliers: Individual points with unusually large residuals that may distort the fitted line.
  • Assumption violations: Non-linearity, heteroscedasticity, non-normality, or autocorrelation.

When you find problems, common remedies include:

  1. Transforming the predictor or response variable (e.g., log or square root transformations)
  2. Adding polynomial or interaction terms
  3. Using weighted least squares to handle non-constant variance
  4. Investigating and, with justification, removing truly anomalous observations

Removing data points should always be a last resort and requires a substantive reason, not just "it made R^2 go up."
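As a sketch of the first remedy, consider hypothetical multiplicative data where the error scales with the response: a log transform of y turns the model into a straight line with roughly constant error variance. This assumes numpy and is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical multiplicative model: y = 2 * exp(0.8 x) * noise.
# On the original scale the error spread grows with y (a funnel);
# taking logs gives log y = log 2 + 0.8 x + constant-variance noise.
x = np.linspace(0.0, 4.0, 100)
y = 2.0 * np.exp(0.8 * x) * rng.lognormal(0.0, 0.2, size=x.size)

log_y = np.log(y)
slope, intercept = np.polyfit(x, log_y, deg=1)
resid_log = log_y - (intercept + slope * x)

# After the transform, |residual| no longer tracks x: the funnel is gone,
# and the fitted slope recovers the true coefficient 0.8.
r = np.corrcoef(x, np.abs(resid_log))[0, 1]
print(round(slope, 2), abs(r) < 0.3)
```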

Checking the Normality Assumption

Tools for Assessing Normality

The normality assumption states that the residuals follow a normal distribution. Two common tools for checking this:

  • Histogram of residuals: Should look roughly bell-shaped and symmetric around zero. With small sample sizes, don't expect perfection; you're looking for severe skewness or multiple peaks.
  • Normal Q-Q plot: Plots the ordered residuals against the quantiles you'd expect from a normal distribution. If normality holds, the points fall approximately along a straight diagonal line. Systematic departures (S-curves, bowing away from the line) indicate non-normality.
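The coordinates of a Q-Q plot can be built by hand, which makes the idea transparent. This sketch uses hypothetical residuals, assuming numpy plus the standard library's NormalDist:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)

# Hypothetical residuals drawn from a normal distribution.
resid = rng.normal(0.0, 1.0, size=50)

n = resid.size
ordered = np.sort(resid)                       # sample quantiles

# Theoretical normal quantiles at plotting positions (i - 0.5) / n.
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])

# If normality holds, the points (theoretical, ordered) hug a straight
# line, so their correlation is very close to 1.
r = np.corrcoef(theoretical, ordered)[0, 1]
print(r > 0.95)
```

Systematic departures show up as curvature in these paired points: skewed residuals bow away from the line on one side, and heavy tails produce an S-shape at both ends.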

Why Normality Matters

Mild departures from normality usually aren't a serious problem, especially with larger samples, because the Central Limit Theorem helps stabilize inference. However, strong skewness or heavy tails can make confidence intervals and hypothesis tests unreliable.

If normality is clearly violated, options include transforming the response variable or using robust regression methods that don't rely as heavily on the normality assumption.

The four assumptions to check (LINE): Linearity, Independence, Normality, Equal variance (homoscedasticity). Residual plots are your primary tool for the first and last of these; Q-Q plots and histograms handle normality; independence often requires thinking about how the data were collected.