In linear regression, not all data points are created equal—some observations wield disproportionate power over your model's coefficients, predictions, and overall fit. Understanding influential outliers is essential because a single problematic point can completely distort your regression line, leading to misleading conclusions and poor predictions. You're being tested on your ability to distinguish between points that are merely unusual and points that actually change your model's story.
The key concepts here involve leverage, influence, and residual behavior: three related but distinct ideas that exam questions love to conflate. A point can have high leverage without being influential, or it can be an outlier in y without affecting the coefficients much. Mastering these diagnostics means understanding what each metric actually measures and when to use which tool. Don't just memorize formulas and thresholds; know what concept each diagnostic illustrates and how they work together to protect your model's integrity.
These metrics identify observations with unusual predictor values, points that sit far from the center of your x-space. High leverage means the observation has the potential to influence the regression, though it may not actually do so if its y-value falls in line with the overall pattern.
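To make this concrete, here is a minimal numpy sketch on invented data: leverage values are the diagonal entries of the hat matrix, and a point with an extreme x-value gets flagged by the common 2p/n rule of thumb.

```python
import numpy as np

# Toy data (illustrative, not from the guide): the last observation has an
# extreme predictor value.
rng = np.random.default_rng(0)
x = np.append(rng.normal(0.0, 1.0, 20), 10.0)  # x = 10 sits far from the rest
X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept

# Hat matrix H = X (X'X)^{-1} X'; the leverages h_i are its diagonal entries.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Leverages always sum to the number of parameters p (here 2), and a common
# rule of thumb flags any h_i > 2p/n.
flagged = leverage > 2 * X.shape[1] / len(x)
print(leverage.sum())   # ~2.0
print(flagged[-1])      # True: the extreme-x point has high leverage
```

Note that leverage depends only on X, not on y: the flagged point would have high leverage no matter what its response value turned out to be.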
Compare: Leverage vs. Mahalanobis Distance. Both measure distance from the center of predictor space, and both account for correlations among the predictors; leverage is read directly off the hat matrix, while Mahalanobis distance is computed from the sample mean and covariance (in a regression with an intercept, the two are monotonically related). For FRQs asking about multivariate outlier detection, Mahalanobis is your go-to example.
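A small sketch on invented data illustrates the multivariate point: a pair of coordinates can each look ordinary on its own while jointly violating the correlation structure, and Mahalanobis distance catches exactly that.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two strongly correlated predictors; the appended point (2, -2) is
# unremarkable in each coordinate alone but breaks the correlation.
z = rng.normal(size=(50, 1))
X = np.hstack([z, z + rng.normal(scale=0.3, size=(50, 1))])
X = np.vstack([X, [[2.0, -2.0]]])

mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
# Squared Mahalanobis distance: (x_i - mu)' S^{-1} (x_i - mu)
d2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)

print(int(np.argmax(d2)))  # 50: the anti-correlated point stands out
```

A univariate screen (Z-scores or the IQR method applied to each column) would miss this point entirely, since neither coordinate is extreme by itself.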
These diagnostics measure how much the regression results actually change when you remove a specific observation. Influence combines leverage with residual size; a point must be unusual in both x and y to truly distort your model.
Compare: Cook's Distance vs. DFBETAS—Cook's Distance gives you one number summarizing overall influence, while DFBETAS tells you which specific coefficients are being distorted. If an FRQ asks about influence on a particular predictor's effect, DFBETAS is your answer.
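Here is a from-scratch sketch on invented data (the data, seed, and `fit` helper are illustrative; statsmodels' `OLSInfluence` provides the same quantities): Cook's Distance comes from a closed-form formula, while DFBETAS is computed by refitting without each observation in turn.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = np.append(rng.normal(0.0, 1.0, n - 1), 4.0)   # last point: high leverage
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, n)
y[-1] += 6.0                                      # ...and a poor fit
X = np.column_stack([np.ones(n), x])

def fit(Xm, ym):
    return np.linalg.lstsq(Xm, ym, rcond=None)[0]

b = fit(X, y)
p = X.shape[1]
resid = y - X @ b
s2 = resid @ resid / (n - p)
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Cook's Distance: one number per point summarizing the overall change in
# fitted values if that point were dropped.
cooks = resid**2 * h / (p * s2 * (1 - h)**2)

# DFBETAS: per-coefficient change when point i is dropped, scaled by that
# coefficient's standard error (using the deleted residual variance s_(i)).
XtX_inv = np.linalg.inv(X.T @ X)
dfbetas = np.empty((n, p))
for i in range(n):
    keep = np.arange(n) != i
    b_i = fit(X[keep], y[keep])
    r_i = y[keep] - X[keep] @ b_i
    s_i = np.sqrt(r_i @ r_i / (n - 1 - p))
    dfbetas[i] = (b - b_i) / (s_i * np.sqrt(np.diag(XtX_inv)))

print(int(np.argmax(cooks)))                  # 29: the doctored point
print(int(np.argmax(np.abs(dfbetas[:, 1]))))  # 29: it distorts the slope most
```

The two diagnostics answer the questions in the order an FRQ usually asks them: Cook's Distance says "is this point influential at all?", and DFBETAS says "which coefficient is it dragging?"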
These metrics focus on how far observations fall from the regression line in the y-direction. Large residuals indicate poor fit, but a point can be an outlier in y without having much influence if it lacks leverage.
Compare: Studentized Residuals vs. Z-scores—Studentized residuals are regression-specific and account for leverage, while Z-scores treat each variable independently. For regression diagnostics, always prefer studentized residuals; save Z-scores for univariate screening.
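The contrast can be demonstrated on invented data: a high-leverage point placed exactly on the true line is extreme in y (large Z-score) yet perfectly consistent with the regression (small studentized residual).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 25
x = rng.normal(0.0, 1.0, n)
y = 3.0 - 1.5 * x + rng.normal(0.0, 1.0, n)
# Append a high-leverage point lying exactly on the true line: extreme in y,
# but exactly where the model predicts it should be.
x = np.append(x, 6.0)
y = np.append(y, 3.0 - 1.5 * 6.0)
n += 1

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
p = X.shape[1]

# Internally studentized residuals scale each raw residual by its own
# standard error, which shrinks with leverage: Var(e_i) = sigma^2 (1 - h_i).
s = np.sqrt(e @ e / (n - p))
r_stud = e / (s * np.sqrt(1 - h))

# Z-scores of y ignore the regression entirely.
z = (y - y.mean()) / y.std(ddof=1)

print(int(np.argmax(np.abs(z))))              # 25: z-scores scream "outlier"
print(int(np.argmax(np.abs(r_stud))) == 25)   # False: the model fits it fine
```

This is exactly why Z-scores belong in univariate screening: they flag the appended point as the most extreme y-value, while the studentized residuals correctly report that it is the regression's best-behaved observation.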
Plots provide intuitive ways to identify problematic observations by combining multiple diagnostic dimensions. Visualization often reveals patterns that numerical summaries miss.
Compare: Influence Plots vs. Partial Regression Plots—Influence plots give you a global view of all observations' leverage and residuals, while partial regression plots focus on one predictor's relationship. Use influence plots for overall screening, partial regression plots for understanding specific coefficient estimates.
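As a sketch (toy data; assumes matplotlib is installed), a hand-rolled influence plot puts leverage on the horizontal axis, studentized residuals on the vertical axis, and scales each bubble by Cook's Distance, so the three diagnostics appear in one picture:

```python
import io

import numpy as np
import matplotlib
matplotlib.use("Agg")           # render off-screen, no display needed
import matplotlib.pyplot as plt

# Toy data: one high-leverage, poorly fit point.
rng = np.random.default_rng(4)
n = 40
x = np.append(rng.normal(0.0, 1.0, n - 1), 5.0)
y = 2.0 + x + rng.normal(0.0, 1.0, n)
y[-1] -= 4.0
X = np.column_stack([np.ones(n), x])

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
p = X.shape[1]
s2 = e @ e / (n - p)
r_stud = e / np.sqrt(s2 * (1 - h))
cooks = e**2 * h / (p * s2 * (1 - h)**2)

fig, ax = plt.subplots()
ax.scatter(h, r_stud, s=1000 * cooks, alpha=0.5)  # bubble area ~ Cook's D
ax.axhline(0, linewidth=0.5)
ax.set_xlabel("leverage $h_i$")
ax.set_ylabel("studentized residual")
buf = io.BytesIO()
fig.savefig(buf, format="png")
```

The problematic point lands in the upper-right or lower-right corner as a large bubble: far right (high leverage), far from zero vertically (large residual), and big (high Cook's Distance).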
| Concept | Best Examples |
|---|---|
| Overall influence | Cook's Distance, DFFITS |
| Coefficient-specific influence | DFBETAS |
| Position in predictor space | Leverage, Mahalanobis Distance |
| Residual magnitude | Studentized Residuals, Z-score |
| Robust univariate screening | IQR method |
| Visual diagnostics | Influence Plots, Partial Regression Plots |
| Multivariate outlier detection | Mahalanobis Distance |
| Single best summary metric | Cook's Distance |
A point has high leverage but a small studentized residual. Is it influential? Explain why Cook's Distance would be low in this case.
Which two diagnostics would you use together to determine both whether a point is influential and which coefficients it affects most?
Compare and contrast Mahalanobis Distance and simple leverage—when would you prefer one over the other, and what assumption does Mahalanobis make that leverage doesn't?
An FRQ presents a regression with one observation having Cook's Distance of 0.8 and asks whether it should be removed. What threshold would you cite, and what additional diagnostics might inform your decision?
You're examining a partial regression plot and notice one point far from the regression line. Explain why this point might not have appeared unusual in a simple scatterplot of y versus that predictor.