Residuals in Linear Regression
Understanding Residuals
A residual is the difference between an observed value and the value your regression model predicts for that point:

e_i = y_i - ŷ_i

where y_i is the observed value of the response and ŷ_i is the fitted (predicted) value.
Each data point has its own residual. A positive residual means the model underpredicted; a negative residual means it overpredicted. One fundamental property: in an ordinary least squares regression with an intercept, the residuals always sum to zero, because the fitted line passes through the point of means (x̄, ȳ) by construction.
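These two properties are easy to verify numerically. A minimal sketch, assuming numpy and made-up data:

```python
import numpy as np

# Illustrative data: a linear trend plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=x.size)

# Ordinary least squares fit (a degree-1 polynomial is a line with intercept).
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

# Residual = observed minus predicted, one per data point.
residuals = y - y_hat

# With an intercept in the model, OLS residuals sum to zero
# (up to floating-point rounding).
print(abs(residuals.sum()) < 1e-8)  # True
```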
Residuals matter because they're your primary diagnostic tool. The regression model rests on a set of assumptions (linearity, independence, homoscedasticity, and normality), and you can't check most of those assumptions by looking at the raw data alone. You check them by looking at the residuals.
Why Residuals Matter for Diagnostics
Residuals help you spot two broad categories of problems:
- Outliers and influential observations. An outlier is a point whose residual is unusually large in magnitude. An influential observation is a point that, if removed, would substantially change the estimated regression coefficients. These aren't the same thing: a point can be an outlier without being influential, and vice versa.
- Systematic patterns. If the residuals show a non-random structure (curves, fans, clusters), that's evidence the model is missing something. Random-looking residuals suggest the model assumptions hold; patterned residuals suggest at least one assumption is violated.
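The outlier-versus-influence distinction can be made concrete with a small numeric sketch (invented data, numpy only): a point in the middle of the x-range with a huge residual barely moves the slope, while a far-out point with a modest residual drags the whole line.

```python
import numpy as np

def slope_of(x, y):
    """OLS slope of y on x (fit includes an intercept)."""
    return np.polyfit(x, y, deg=1)[0]

# Clean base data lying exactly on the line y = 2x.
x = np.arange(1.0, 11.0)
y = 2.0 * x

# Case 1: outlier at the center of the x-range (x = 5.5 is the mean of the
# augmented x values, so it has minimal leverage despite its huge residual).
x1, y1 = np.append(x, 5.5), np.append(y, 30.0)

# Case 2: high-leverage point far out in x, well below the true line.
x2, y2 = np.append(x, 30.0), np.append(y, 20.0)

print(round(slope_of(x1, y1), 3))  # 2.0   -> outlier, but not influential
print(round(slope_of(x2, y2), 3))  # 0.582 -> influential: slope dragged far from 2
```

Note that in case 2 the influential point ends up with a fairly small residual, precisely because it pulled the fitted line toward itself; that is why influence cannot be judged from residual size alone.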
Interpreting Residual Plots

What an Ideal Residual Plot Looks Like
A residual plot is a scatterplot with the predicted values (ŷ) on the horizontal axis and the residuals (e = y - ŷ) on the vertical axis. In simple linear regression you can also plot the residuals against the predictor variable.
In an ideal residual plot:
- The points scatter randomly around the horizontal line at zero
- There's no visible pattern, trend, or change in spread
- The vertical spread of points stays roughly constant from left to right
This random cloud of points tells you the model's assumptions are reasonably satisfied.
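For OLS with an intercept, the first two properties are partly guaranteed by construction: the residuals average to zero and are uncorrelated with the fitted values, so any visible structure in the plot must be nonlinear or variance-related. A quick numeric confirmation, using numpy and made-up data:

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 10, 80)
y = 1.0 + 2.0 * x + rng.normal(size=x.size)

slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
resid = y - fitted

# Both quantities are ~0 by the OLS normal equations, which is why a
# well-specified model yields a patternless band around zero.
print(abs(resid.mean()) < 1e-10,
      abs(np.corrcoef(fitted, resid)[0, 1]) < 1e-6)  # True True
```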
Common Patterns and What They Mean
Funnel shape (heteroscedasticity). The spread of residuals increases (or decreases) as the predicted values get larger. This means the variance of the errors isn't constant. For example, you might see residuals tightly clustered near small values but widely spread near large ones. This violates the homoscedasticity assumption.
Curved pattern (non-linearity). The residuals form a U-shape or inverted U-shape instead of scattering randomly. This means a straight line doesn't capture the true relationship between x and y. A quadratic term, a different functional form, or a variable transformation may be needed.
Clusters or repeating patterns. Distinct groups of residuals or a wave-like pattern can indicate subgroups in the data, omitted variables, or autocorrelation (common in time-series data where consecutive observations are related). This points to a violation of the independence assumption.
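The curved pattern is easy to reproduce with simulated data (a sketch, numpy only): fit a straight line to a truly quadratic relationship and the residuals come out negative in the middle of the range and positive at both ends.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 100)
y = (x - 5.0) ** 2 + rng.normal(scale=1.0, size=x.size)  # truly quadratic

slope, intercept = np.polyfit(x, y, deg=1)
resid = y - (intercept + slope * x)

# A straight-line fit to a quadratic leaves a U-shaped residual pattern:
# negative in the middle of the x-range, positive at both ends.
middle = resid[(x > 3) & (x < 7)].mean()
ends = resid[(x <= 3) | (x >= 7)].mean()
print(middle < 0 < ends)  # True
```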
Residual Analysis for Model Fit

Evaluating Goodness-of-Fit
Beyond visual inspection, several numerical summaries help you assess how well the model fits:
- Residual standard error (RSE): The standard deviation of the residuals, measuring the typical size of a prediction error in the units of y. A smaller RSE means predictions are, on average, closer to the observed values.
- Coefficient of determination (R²): The proportion of variance in y explained by the model, ranging from 0 to 1. An R² of 0.85 means the predictor accounts for 85% of the variability in the response.
- Adjusted R²: A modified version that penalizes R² for adding predictors that don't genuinely improve the model. In simple linear regression with one predictor, R² and adjusted R² will be very close, but the distinction becomes important when you move to multiple regression.
These numbers summarize overall fit, but they won't tell you where or how the model fails. That's what the residual plots are for. A model can have a decent R² and still show clear assumption violations in its residual plot.
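All three summaries can be computed directly from the residuals. The sketch below (numpy, simulated data) uses the standard formulas RSE = sqrt(RSS / (n - p - 1)), R² = 1 - RSS/TSS, and the adjusted-R² correction:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 40)
y = 3.0 + 0.8 * x + rng.normal(scale=1.5, size=x.size)

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

n, p = x.size, 1                     # observations, predictors
rss = np.sum(residuals ** 2)         # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)    # total sum of squares

rse = np.sqrt(rss / (n - p - 1))     # typical prediction error, in units of y
r2 = 1 - rss / tss                   # proportion of variance explained
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(rse, 2), round(r2, 3), round(adj_r2, 3))
```

Since the adjustment factor (n - 1)/(n - p - 1) exceeds 1, adjusted R² is always slightly below R² whenever the fit is imperfect.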
Identifying Issues and Next Steps
Residual analysis can reveal several specific problems:
- Lack of fit: The model doesn't capture the underlying relationship (often visible as a curved residual pattern even when R² looks acceptable).
- Outliers: Individual points with unusually large residuals that may distort the fitted line.
- Assumption violations: Non-linearity, heteroscedasticity, non-normality, or autocorrelation.
When you find problems, common remedies include:
- Transforming the predictor or response variable (e.g., log or square root transformations)
- Adding polynomial or interaction terms
- Using weighted least squares to handle non-constant variance
- Investigating and, with justification, removing truly anomalous observations
Removing data points should always be a last resort and requires a substantive reason, not just "it made R² go up."
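As an illustration of the first remedy, here is a sketch with simulated multiplicative-error data: on the raw scale the residual spread grows with the fitted values (a funnel), while fitting log(y) instead gives residuals with roughly constant spread. The split-in-half spread comparison is a crude stand-in for eyeballing the plot.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 200)
# Multiplicative noise: y = exp(1 + 0.3x) * exp(eps), a classic funnel source.
y = np.exp(1.0 + 0.3 * x + rng.normal(scale=0.3, size=x.size))

def spread_ratio(response):
    """Residual spread in the upper vs lower half of fitted values."""
    slope, intercept = np.polyfit(x, response, deg=1)
    fitted = intercept + slope * x
    resid = response - fitted
    upper = fitted > np.median(fitted)
    return resid[upper].std() / resid[~upper].std()

print(spread_ratio(y))          # far above 1: heteroscedastic on the raw scale
print(spread_ratio(np.log(y)))  # near 1: the log transform stabilizes the variance
```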
Checking the Normality Assumption
Tools for Assessing Normality
The normality assumption states that the residuals follow a normal distribution. Two common tools for checking this:
- Histogram of residuals: Should look roughly bell-shaped and symmetric around zero. With small sample sizes, don't expect perfection; you're looking for severe skewness or multiple peaks.
- Normal Q-Q plot: Plots the ordered residuals against the quantiles you'd expect from a normal distribution. If normality holds, the points fall approximately along a straight diagonal line. Systematic departures (S-curves, bowing away from the line) indicate non-normality.
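The Q-Q idea can be sketched numerically without drawing the plot (numpy plus the standard library's NormalDist; simulated residuals): compute the theoretical normal quantiles, pair them with the sorted residuals, and measure how close the points lie to a straight line via their correlation.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(5)
n = 200

# Theoretical normal quantiles at plotting positions (i - 0.5) / n.
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])

def qq_corr(resid):
    """Correlation of sorted residuals with normal quantiles (1 = perfectly straight)."""
    return np.corrcoef(theoretical, np.sort(resid))[0, 1]

normal_resid = rng.normal(size=n)            # what well-behaved residuals look like
skewed_resid = rng.exponential(size=n) - 1   # strongly right-skewed residuals

print(qq_corr(normal_resid) > qq_corr(skewed_resid))  # True
```

Near-normal residuals give a correlation very close to 1; skewness or heavy tails pull it down, mirroring the bowing you would see on the plot itself.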
Why Normality Matters
Mild departures from normality usually aren't a serious problem, especially with larger samples: by the Central Limit Theorem, the sampling distributions of the coefficient estimates are approximately normal even when the errors are not. However, strong skewness or heavy tails can make confidence intervals and hypothesis tests unreliable.
If normality is clearly violated, options include transforming the response variable or using robust regression methods that don't rely as heavily on the normality assumption.
The four assumptions to check (LINE): Linearity, Independence, Normality, Equal variance (homoscedasticity). Residual plots are your primary tool for the first and last of these; Q-Q plots and histograms handle normality; independence often requires thinking about how the data were collected.