Linear Modeling Theory

Influential Linear Regression Outliers

Why This Matters

In linear regression, not all data points are created equal—some observations wield disproportionate power over your model's coefficients, predictions, and overall fit. Understanding influential outliers is essential because a single problematic point can completely distort your regression line, leading to misleading conclusions and poor predictions. You're being tested on your ability to distinguish between points that are merely unusual and points that actually change your model's story.

The key concepts here involve leverage, influence, and residual behavior—three related but distinct ideas that exam questions love to conflate. A point can have high leverage without being influential, or it can be an outlier in $Y$ without affecting coefficients much. Mastering these diagnostics means understanding what each metric actually measures and when to use which tool. Don't just memorize formulas and thresholds—know what concept each diagnostic illustrates and how they work together to protect your model's integrity.


Leverage-Based Diagnostics

These metrics identify observations with unusual predictor values—points that sit far from the center of your $X$-space. High leverage means the observation has the potential to influence the regression, though it may not actually do so if its $Y$-value falls in line.

Leverage (Hat Values)

  • Measures distance in predictor space—specifically, how far an observation's $X$-values are from the mean of all predictors
  • High leverage threshold is typically $2(p+1)/n$, where $p$ is the number of predictors and $n$ is sample size
  • Leverage alone doesn't guarantee influence—a point needs both high leverage AND a large residual to actually distort your model
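
A minimal sketch of how hat values could be extracted and compared with the $2(p+1)/n$ rule of thumb, assuming Python with NumPy and statsmodels; the simulated data and variable names are illustrative only:

```python
# Leverage (hat values) sketch -- simulated data, illustrative names only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 50, 2                                      # n observations, p predictors
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()
leverage = model.get_influence().hat_matrix_diag  # diagonal of the hat matrix

threshold = 2 * (p + 1) / n                       # common high-leverage cutoff
print("high-leverage rows:", np.where(leverage > threshold)[0])
```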

Mahalanobis Distance

  • Accounts for correlation structure—measures multivariate distance from the centroid while adjusting for how variables covary
  • Superior to Euclidean distance when predictors are correlated, since it considers the shape of the data cloud
  • Threshold based on chi-squared distribution with $p$ degrees of freedom, often using critical values at $\alpha = 0.01$
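
A minimal sketch of squared Mahalanobis distances checked against a chi-squared cutoff at $\alpha = 0.01$, assuming Python with NumPy and SciPy; the correlated simulated predictors are illustrative only:

```python
# Mahalanobis distance sketch -- simulated data, illustrative names only.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=n)

center = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - center
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared distances from centroid

cutoff = chi2.ppf(0.99, df=p)                       # chi-squared critical value, alpha = 0.01
print("multivariate outliers:", np.where(d2 > cutoff)[0])
```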

Compare: Leverage vs. Mahalanobis Distance—both measure position in predictor space, but Mahalanobis accounts for correlations between variables while leverage is computed directly from the hat matrix. For FRQs asking about multivariate outlier detection, Mahalanobis is your go-to example.


Influence Metrics

These diagnostics measure how much the regression results actually change when you remove a specific observation. Influence combines leverage with residual size—a point must be unusual in both $X$ and $Y$ to truly distort your model.

Cook's Distance

  • Combines leverage and residual size into a single summary of overall influence on all fitted values simultaneously
  • Threshold of $4/n$ is commonly used, though some texts suggest flagging values above 1
  • Best single metric for overall influence—if you can only compute one diagnostic, this is the one to choose
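
A minimal sketch of Cook's Distance flagged against the $4/n$ rule, assuming Python with NumPy and statsmodels; the planted point and variable names are made up for illustration:

```python
# Cook's Distance sketch -- simulated data with one planted influential point.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)
X[0] = [4.0, -4.0]                                # unusual predictor values (high leverage)
y[0] = 20.0                                       # and a response far from the trend

model = sm.OLS(y, sm.add_constant(X)).fit()
cooks_d, _ = model.get_influence().cooks_distance  # (distances, p-values)

print("influential rows by the 4/n rule:", np.where(cooks_d > 4 / n)[0])
```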

DFFITS (Difference in Fits)

  • Measures change in predicted value for observation $i$ when that observation is deleted from the model
  • Threshold of $\pm 2\sqrt{(p+1)/n}$ identifies points that substantially shift their own fitted values
  • Scaled version of Cook's Distance—they identify similar points but DFFITS focuses on the observation's own prediction
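
A minimal sketch comparing DFFITS against the $\pm 2\sqrt{(p+1)/n}$ cutoff, again assuming Python with statsmodels; data and names are illustrative:

```python
# DFFITS sketch -- simulated data, illustrative names only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()
dffits, _ = model.get_influence().dffits          # statsmodels also returns its own cutoff

threshold = 2 * np.sqrt((p + 1) / n)
print("rows that shift their own fitted value:", np.where(np.abs(dffits) > threshold)[0])
```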

DFBETAS (Difference in Beta Estimates)

  • Coefficient-specific influence—measures how much each $\beta_j$ changes when observation $i$ is removed
  • Threshold of $\pm 2/\sqrt{n}$ flags observations that substantially shift individual coefficient estimates
  • Produces a matrix of values—one DFBETA for each observation-coefficient combination, revealing which coefficients are vulnerable
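
A minimal sketch of the DFBETAS matrix screened with the $\pm 2/\sqrt{n}$ cutoff, assuming Python with statsmodels; data and names are illustrative:

```python
# DFBETAS sketch -- one value per observation-coefficient pair.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()
dfbetas = model.get_influence().dfbetas           # shape (n, p+1): intercept plus each slope

rows, coefs = np.where(np.abs(dfbetas) > 2 / np.sqrt(n))
for i, j in zip(rows, coefs):
    print(f"observation {i} shifts coefficient {j} (0 = intercept)")
```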

Compare: Cook's Distance vs. DFBETAS—Cook's Distance gives you one number summarizing overall influence, while DFBETAS tells you which specific coefficients are being distorted. If an FRQ asks about influence on a particular predictor's effect, DFBETAS is your answer.


Residual-Based Diagnostics

These metrics focus on how far observations fall from the regression line in the $Y$-direction. Large residuals indicate poor fit, but a point can be an outlier in $Y$ without having much influence if it lacks leverage.

Studentized Residuals

  • Standardized using leave-one-out variance—divides residual by an estimate of standard error that excludes that observation
  • Follows a t-distribution under normality assumptions, enabling formal hypothesis testing for outliers
  • Threshold of $|t| > 2$ or $3$ commonly used, with Bonferroni correction for multiple testing when screening many points
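
A minimal sketch of externally studentized residuals with a Bonferroni-adjusted t cutoff, assuming Python with statsmodels and SciPy; data and names are illustrative:

```python
# Studentized (deleted) residual sketch -- simulated data, illustrative names only.
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()
r_student = model.get_influence().resid_studentized_external

alpha = 0.05
df_resid = n - 3 - 1                              # n minus (intercept + 2 slopes) minus 1 for the deleted fit
t_crit = stats.t.ppf(1 - alpha / (2 * n), df=df_resid)  # Bonferroni: alpha split over n tests
print("flagged Y-outliers:", np.where(np.abs(r_student) > t_crit)[0])
```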

Outlier Detection Methods (Z-score, IQR)

  • Z-score method flags observations more than 2-3 standard deviations from the mean, assuming approximate normality
  • IQR method uses $Q_1 - 1.5 \times IQR$ and $Q_3 + 1.5 \times IQR$ as fences, robust to non-normality
  • Univariate approaches—useful for initial screening but don't account for regression structure or multivariate relationships
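
A minimal sketch of both univariate rules applied to a single variable, assuming Python with NumPy; the planted extreme values are illustrative:

```python
# Univariate screening sketch: Z-score rule and Tukey's IQR fences.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=200)
x[:3] = [25.0, -4.0, 30.0]                        # plant a few extreme values

# Z-score rule: |z| > 3 (assumes rough normality)
z = (x - x.mean()) / x.std(ddof=1)
z_flags = np.where(np.abs(z) > 3)[0]

# IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (robust to non-normality)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_flags = np.where((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr))[0]

print("Z-score flags:", z_flags)
print("IQR flags:", iqr_flags)
```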

Compare: Studentized Residuals vs. Z-scores—Studentized residuals are regression-specific and account for leverage, while Z-scores treat each variable independently. For regression diagnostics, always prefer studentized residuals; save Z-scores for univariate screening.


Visual Diagnostic Tools

Plots provide intuitive ways to identify problematic observations by combining multiple diagnostic dimensions. Visualization often reveals patterns that numerical summaries miss.

Influence Plots

  • Combines leverage (x-axis) and studentized residuals (y-axis) with point size proportional to Cook's Distance
  • Upper-right and lower-right corners contain the most dangerous points—high leverage plus large residuals
  • Quick visual screening tool—immediately identifies observations warranting further investigation
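
A minimal sketch using statsmodels' built-in influence plot (leverage on the x-axis, studentized residuals on the y-axis, bubble size driven by Cook's Distance); Python with matplotlib is assumed and the data are simulated:

```python
# Influence plot sketch -- simulated data, illustrative names only.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()
sm.graphics.influence_plot(model, criterion="cooks")  # bubble size from Cook's Distance
plt.show()
```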

Partial Regression Plots (Added Variable Plots)

  • Shows relationship between $X_j$ and $Y$ after removing effects of other predictors—both variables are residualized first
  • Slope equals the coefficient $\beta_j$ in the full multiple regression model
  • Reveals masked outliers—points that appear normal marginally but are influential in the partial relationship
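
A minimal sketch of added variable plots via statsmodels' plotting helper, one panel per predictor; Python with matplotlib is assumed and the data are simulated:

```python
# Added variable (partial regression) plot sketch -- simulated data.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()
sm.graphics.plot_partregress_grid(model)          # one added-variable plot per predictor
plt.show()
```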

Compare: Influence Plots vs. Partial Regression Plots—Influence plots give you a global view of all observations' leverage and residuals, while partial regression plots focus on one predictor's relationship. Use influence plots for overall screening, partial regression plots for understanding specific coefficient estimates.


Quick Reference Table

Concept | Best Examples
Overall influence | Cook's Distance, DFFITS
Coefficient-specific influence | DFBETAS
Position in predictor space | Leverage, Mahalanobis Distance
Residual magnitude | Studentized Residuals, Z-score
Robust univariate screening | IQR method
Visual diagnostics | Influence Plots, Partial Regression Plots
Multivariate outlier detection | Mahalanobis Distance
Single best summary metric | Cook's Distance

Self-Check Questions

  1. A point has high leverage but a small studentized residual. Is it influential? Explain why Cook's Distance would be low in this case.

  2. Which two diagnostics would you use together to determine both whether a point is influential and which coefficients it affects most?

  3. Compare and contrast Mahalanobis Distance and simple leverage—when would you prefer one over the other, and what assumption does Mahalanobis make that leverage doesn't?

  4. An FRQ presents a regression with one observation having Cook's Distance of 0.8 and asks whether it should be removed. What threshold would you cite, and what additional diagnostics might inform your decision?

  5. You're examining a partial regression plot and notice one point far from the regression line. Explain why this point might not have appeared unusual in a simple scatterplot of $Y$ versus that predictor.