In linear regression, not all data points are created equal. Some observations wield disproportionate power over your model's coefficients, predictions, and overall fit. A single problematic point can completely distort your regression line, leading to misleading conclusions and poor predictions. The goal here is to distinguish between points that are merely unusual and points that actually change your model's story.
The key concepts involve leverage, influence, and residual behavior. These three ideas are related but distinct, and exam questions love to conflate them. A point can have high leverage without being influential, or it can be an outlier in $y$ without affecting coefficients much. Mastering these diagnostics means understanding what each metric actually measures and when to use which tool. Don't just memorize formulas and thresholds. Know what concept each diagnostic captures and how they work together to protect your model's integrity.
These metrics identify observations with unusual predictor values, meaning points that sit far from the center of your $x$-space. High leverage means the observation has the potential to influence the regression, though it may not actually do so if its $y$-value falls in line with the pattern set by other points.
The hat matrix $H = X(X^{\top}X)^{-1}X^{\top}$ maps observed responses to fitted values ($\hat{y} = Hy$), and its diagonal elements $h_{ii}$ are the leverage values. Each $h_{ii}$ measures how far observation $i$'s predictor values are from the centroid of all predictors.
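The hat-matrix construction can be sketched directly in numpy. This is a minimal illustration on simulated data (the dataset and dimensions are hypothetical), and it checks two standard properties of leverage for a model with an intercept: each $h_{ii}$ is at least $1/n$, and the leverages sum to the number of estimated coefficients.

```python
import numpy as np

# Hypothetical small dataset: intercept plus two predictors.
rng = np.random.default_rng(0)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

# Hat matrix H = X (X'X)^{-1} X'; the leverages are its diagonal.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

# Two identities worth remembering:
#  - each h_ii lies in [1/n, 1] for a model with an intercept
#  - the leverages sum to p, the number of estimated coefficients
print(leverage.min() >= 1 / n, np.isclose(leverage.sum(), X.shape[1]))
```

In practice you would never form the full $n \times n$ matrix for large $n$; the diagonal alone can be computed row by row.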
Mahalanobis distance measures how far an observation is from the centroid of the predictor space, but unlike simple Euclidean distance, it accounts for the correlation structure among predictors. It does this by scaling distances according to the covariance matrix, so it adjusts for how variables covary and for differences in variable spread.
Compare: Leverage vs. Mahalanobis Distance: both measure position in predictor space, but Mahalanobis explicitly accounts for correlations between variables through the covariance matrix, while leverage is computed directly from the hat matrix. In practice, for a regression model with an intercept the two are monotonically related, so they rank observations identically. For multivariate outlier detection outside a regression context, Mahalanobis is the stronger choice because it directly incorporates the covariance structure.
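The connection between the two metrics can be checked numerically. This sketch (simulated data, hypothetical dimensions) verifies the exact identity $h_{ii} = 1/n + D_i^2/(n-1)$, which holds for any regression with an intercept, where $D_i^2$ is the squared Mahalanobis distance computed with the sample covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
Xpred = rng.normal(size=(n, 2))           # hypothetical predictors (no intercept column)
X = np.column_stack([np.ones(n), Xpred])  # design matrix with intercept

# Leverage from the hat matrix.
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

# Squared Mahalanobis distance from the predictor centroid,
# using the sample covariance (ddof = 1 by default in np.cov).
Z = Xpred - Xpred.mean(axis=0)
S_inv = np.linalg.inv(np.cov(Xpred, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', Z, S_inv, Z)

# With an intercept in the model, the two are linked exactly:
#   h_ii = 1/n + d_i^2 / (n - 1)
print(np.allclose(h, 1 / n + d2 / (n - 1)))  # True
```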
These diagnostics measure how much the regression results actually change when you remove a specific observation. Influence combines leverage with residual size. A point must be unusual in predictor space and poorly fit by the model to truly distort your results.
Cook's Distance summarizes the overall influence of observation $i$ on all fitted values simultaneously. It's computed as:

$$D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

where $e_i$ is the residual, $h_{ii}$ is the leverage, $p$ is the number of estimated coefficients, and $MSE$ is the mean squared error. Notice how it explicitly combines residual size (the $e_i^2$ term) with leverage (the $h_{ii}/(1-h_{ii})^2$ term).
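The closed form can be verified against the deletion definition of Cook's Distance (how much all fitted values move when observation $i$ is dropped). A minimal numpy sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))
mse = e @ e / (n - p)

# Closed-form Cook's Distance.
cooks = e**2 / (p * mse) * h / (1 - h)**2

# Cross-check against the deletion definition:
#   D_i = sum_j (yhat_j - yhat_j^{(i)})^2 / (p * MSE)
yhat = X @ beta
check = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    check[i] = np.sum((yhat - X @ b_i)**2) / (p * mse)
print(np.allclose(cooks, check))  # True
```

The closed form means you never actually need to refit $n$ models; one fit gives every $D_i$.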
DFFITS measures how much the predicted value for observation $i$ changes when that observation is deleted from the model:

$$\text{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{s_{(i)}\sqrt{h_{ii}}}$$

where $\hat{y}_{i(i)}$ is the predicted value from the model fit without observation $i$, and $s_{(i)}$ is the error standard deviation estimated without it.
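DFFITS also has a closed form, $t_i \sqrt{h_{ii}/(1-h_{ii})}$, where $t_i$ is the externally studentized residual. This sketch on simulated data computes it that way and cross-checks against the deletion definition above:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(scale=0.4, size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))
sse = e @ e

# Leave-one-out error variance s_(i)^2, then the closed form
#   DFFITS_i = t_i * sqrt(h_ii / (1 - h_ii))
s2_i = (sse - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(s2_i * (1 - h))          # externally studentized residual
dffits = t * np.sqrt(h / (1 - h))

# Cross-check against the deletion definition.
check = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    check[i] = (X[i] @ beta - X[i] @ b_i) / np.sqrt(s2_i[i] * h[i])
print(np.allclose(dffits, check))  # True
```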
DFBETAS provides coefficient-specific influence. It measures how much each individual regression coefficient changes when observation $i$ is removed:

$$\text{DFBETAS}_{j(i)} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{s_{(i)}\sqrt{(X^{\top}X)^{-1}_{jj}}}$$
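The definition translates directly into a leave-one-out loop. A minimal sketch on simulated data, producing one DFBETAS value per observation and per coefficient:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
s2_i = (e @ e - e**2 / (1 - h)) / (n - p - 1)   # leave-one-out error variance

# DFBETAS_{j(i)} = (beta_j - beta_j^{(i)}) / (s_(i) * sqrt[(X'X)^{-1}_{jj}])
dfbetas = np.empty((n, p))
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    dfbetas[i] = (beta - b_i) / np.sqrt(s2_i[i] * np.diag(XtX_inv))
print(dfbetas.shape)  # (25, 3): one row per observation, one column per coefficient
```

Like Cook's Distance and DFFITS, DFBETAS has a closed form ($\hat{\beta} - \hat{\beta}_{(i)} = (X^{\top}X)^{-1}x_i e_i/(1-h_{ii})$), so production code avoids the explicit refits.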
Compare: Cook's Distance vs. DFBETAS: Cook's Distance gives you one number summarizing overall influence, while DFBETAS tells you which specific coefficients are being distorted. If a question asks about influence on a particular predictor's effect, DFBETAS is your answer. If it asks about overall model stability, go with Cook's Distance.
These metrics focus on how far observations fall from the regression line in the $y$-direction. Large residuals indicate poor fit, but a point can be an outlier in $y$ without having much influence if it lacks leverage.
Ordinary residuals have a problem: their variance depends on leverage. Points with high leverage tend to have artificially small residuals because the regression line is pulled toward them. Externally studentized residuals (also called studentized deleted residuals) solve this by standardizing each residual with an error-variance estimate computed without that observation:

$$t_i = \frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}}$$
This leave-one-out approach prevents a potential outlier from inflating the variance estimate used to standardize its own residual.
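The contrast between internal and external studentization can be made concrete. This sketch (simulated data) computes both; the external version divides by $s_{(i)}$, so a genuine outlier cannot shrink its own denominator:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(scale=0.6, size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

# Externally studentized: standardize e_i by the error variance s_(i)^2
# estimated WITHOUT observation i, and by sqrt(1 - h_ii).
s2_i = (e @ e - e**2 / (1 - h)) / (n - p - 1)
t_ext = e / np.sqrt(s2_i * (1 - h))

# Internally studentized version for contrast (uses the full-sample MSE).
mse = e @ e / (n - p)
t_int = e / np.sqrt(mse * (1 - h))

# The two are linked by a monotone transformation; the external version
# magnifies residuals that were already large relative to the fit.
print(np.round(np.abs(t_ext).max(), 2))
```

Under the usual assumptions, $t_i$ follows a $t$ distribution with $n - p - 1$ degrees of freedom, which is what makes the common $|t_i| > 2$ rule of thumb interpretable.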
Z-scores and the IQR method are univariate screening tools, useful for initial exploration but not regression-specific.
Compare: Studentized Residuals vs. Z-scores: Studentized residuals are regression-specific and account for leverage through the hat matrix, while Z-scores treat each variable independently. For regression diagnostics, always prefer studentized residuals. Save Z-scores for univariate screening of individual variables.
Plots provide intuitive ways to identify problematic observations by combining multiple diagnostic dimensions. Visualization often reveals patterns that numerical summaries miss.
These plots typically display leverage ($h_{ii}$) on the x-axis and studentized residuals on the y-axis, with point size proportional to Cook's Distance.
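The three quantities behind an influence plot can be assembled without any plotting library. This sketch plants one hypothetical bad point in simulated data and flags it using common rules of thumb (the exact cutoffs vary by textbook):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)
X[0, 1], y[0] = 6.0, -5.0   # plant one high-leverage, poorly fit point

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))          # x-axis
s2_i = (e @ e - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(s2_i * (1 - h))                          # y-axis
mse = e @ e / (n - p)
cooks = e**2 * h / (p * mse * (1 - h)**2)                # point size

# Rule-of-thumb flags: h_ii > 2p/n together with |t_i| > 2.
flags = (h > 2 * p / n) & (np.abs(t) > 2)
print(np.flatnonzero(flags))  # the planted point at index 0 is flagged
```

Feeding `h`, `t`, and `cooks` into a scatter plot (size mapped to `cooks`) reproduces the standard influence plot.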
A partial regression plot (added-variable plot) for predictor $x_j$ works by:

1. Regressing $y$ on all predictors except $x_j$ and saving the residuals.
2. Regressing $x_j$ on those same remaining predictors and saving the residuals.
3. Plotting the first set of residuals against the second.

The slope of this scatterplot equals the coefficient $\hat{\beta}_j$ in the full multiple regression model. These plots are valuable because they can reveal masked outliers: points that appear normal in marginal scatterplots of $y$ vs. $x_j$ but become influential once the effects of other predictors are removed.
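The slope claim (the Frisch–Waugh–Lovell result) can be checked numerically. A minimal sketch on simulated data with two predictors:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

def resid(a, B):
    """Residuals from regressing a on the columns of B."""
    return a - B @ np.linalg.lstsq(B, a, rcond=None)[0]

# Added-variable plot for x1: residualize both y and x1 on the OTHER
# predictors (intercept and x2), then regress one set of residuals
# on the other.  The slope equals the full-model coefficient on x1.
others = np.column_stack([np.ones(n), x2])
ry = resid(y, others)
rx = resid(x1, others)
slope = (rx @ ry) / (rx @ rx)

beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.isclose(slope, beta_full[1]))  # True
```

Plotting `ry` against `rx` gives the partial regression plot itself; the scatter around that slope is exactly what the diagnostics in this section interrogate.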
Compare: Influence Plots vs. Partial Regression Plots: Influence plots give you a global view of all observations' leverage and residuals, while partial regression plots focus on one predictor's relationship with $y$ after adjusting for everything else. Use influence plots for overall screening, and partial regression plots for understanding how specific observations affect individual coefficient estimates.
| Concept | Best Examples |
|---|---|
| Overall influence | Cook's Distance, DFFITS |
| Coefficient-specific influence | DFBETAS |
| Position in predictor space | Leverage, Mahalanobis Distance |
| Residual magnitude | Studentized Residuals, Z-score |
| Robust univariate screening | IQR method |
| Visual diagnostics | Influence Plots, Partial Regression Plots |
| Multivariate outlier detection | Mahalanobis Distance |
| Single best summary metric | Cook's Distance |
A point has high leverage but a small studentized residual. Is it influential? Explain why Cook's Distance would be low in this case.
Which two diagnostics would you use together to determine both whether a point is influential and which coefficients it affects most?
Compare and contrast Mahalanobis Distance and simple leverage. When would you prefer one over the other, and what assumption does Mahalanobis make that leverage doesn't?
A regression has one observation with Cook's Distance of 0.8. What threshold would you cite, and what additional diagnostics might inform your decision about whether to remove it?
You're examining a partial regression plot and notice one point far from the regression line. Explain why this point might not have appeared unusual in a simple scatterplot of $y$ versus that predictor.