Linear Modeling Theory

Influential Linear Regression Outliers


Why This Matters

In linear regression, not all data points are created equal. Some observations wield disproportionate power over your model's coefficients, predictions, and overall fit. A single problematic point can completely distort your regression line, leading to misleading conclusions and poor predictions. The goal here is to distinguish between points that are merely unusual and points that actually change your model's story.

The key concepts involve leverage, influence, and residual behavior. These three ideas are related but distinct, and exam questions love to conflate them. A point can have high leverage without being influential, or it can be an outlier in Y without affecting coefficients much. Mastering these diagnostics means understanding what each metric actually measures and when to use which tool. Don't just memorize formulas and thresholds. Know what concept each diagnostic captures and how they work together to protect your model's integrity.


Leverage-Based Diagnostics

These metrics identify observations with unusual predictor values, meaning points that sit far from the center of your X-space. High leverage means the observation has the potential to influence the regression, though it may not actually do so if its Y-value falls in line with the pattern set by other points.

Leverage (Hat Values)

The hat matrix H = X(X^TX)^{-1}X^T maps observed responses to fitted values, and the diagonal elements h_{ii} are the leverage values. Each h_{ii} measures how far observation i's predictor values are from the centroid of all predictors.

  • Leverage values always fall between 1/n and 1
  • The common high-leverage threshold is 2(p+1)/n, where p is the number of predictors and n is the sample size
  • Leverage alone doesn't guarantee influence. A point needs both high leverage AND a large residual to actually distort your model. Think of leverage as opportunity to influence, not influence itself.
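
As a concrete sketch, leverage can be computed straight from the hat matrix with NumPy. The data here are made-up illustrative values, with one point deliberately placed far from the others in X:

```python
import numpy as np

# Toy data: one predictor plus an intercept column; the last x-value is extreme.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept
n, k = X.shape                               # k = p + 1 estimated parameters

# Hat matrix H = X (X^T X)^{-1} X^T; its diagonal holds the leverages h_ii.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Leverages lie in [1/n, 1] and sum to p + 1; flag with the 2(p+1)/n rule.
threshold = 2 * k / n
flagged = np.where(leverage > threshold)[0]
```

Here the point at x = 10 is flagged, while the others sit near the 1/n floor, illustrating that leverage depends only on X, not on the responses.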

Mahalanobis Distance

Mahalanobis distance measures how far an observation is from the centroid of the predictor space, but unlike simple Euclidean distance, it accounts for the correlation structure among predictors. It does this by scaling distances according to the covariance matrix, so it adjusts for how variables covary and for differences in variable spread.

  • Superior to Euclidean distance when predictors are correlated, since it considers the shape of the data cloud rather than treating all directions equally
  • Under multivariate normality, the squared Mahalanobis distance follows a chi-squared distribution with p degrees of freedom, so thresholds are typically set using chi-squared critical values at α = 0.01
  • Particularly useful for detecting multivariate outliers that look normal on any single variable but are unusual in combination
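
A minimal NumPy/SciPy sketch of this idea, using simulated correlated predictors (all numbers here are illustrative): the appended point is ordinary on each coordinate separately but violates the correlation pattern, so only Mahalanobis distance catches it.

```python
import numpy as np
from scipy.stats import chi2

# Two strongly correlated predictors, plus one point that is unusual
# only in combination (large x1 with small x2).
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=100)
X = np.vstack([X, [2.0, -2.0]])

center = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - center
# Squared Mahalanobis distance: (x_i - xbar)^T S^{-1} (x_i - xbar)
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# Threshold from the chi-squared distribution with p = 2 df at alpha = 0.01.
cutoff = chi2.ppf(0.99, df=X.shape[1])
outliers = np.where(d2 > cutoff)[0]
```

The appended row (index 100) is flagged decisively even though each of its coordinates would look unremarkable in a univariate screen.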

Compare: Leverage vs. Mahalanobis Distance: both measure position in predictor space, and in a model with an intercept they are directly linked: leverage is a monotone function of Mahalanobis distance, h_{ii} = 1/n + MD_i^2/(n - 1). For multivariate outlier detection, Mahalanobis is often the more convenient choice because its squared values have a reference chi-squared distribution under multivariate normality, giving you a formal cutoff.


Influence Metrics

These diagnostics measure how much the regression results actually change when you remove a specific observation. Influence combines leverage with residual size. A point must be unusual in predictor space and poorly fit by the model to truly distort your results.

Cook's Distance

Cook's Distance summarizes the overall influence of observation i on all fitted values simultaneously. It's computed as:

D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}

where e_i is the residual, h_{ii} is the leverage, and MSE is the mean squared error. Notice how it explicitly combines residual size (the e_i^2 term) with leverage (the h_{ii} term).

  • A threshold of 4/n is commonly used for flagging, though some texts suggest values above 0.5 deserve attention and values above 1 are almost certainly influential
  • If you can only compute one influence diagnostic, this is the one to choose because it captures the joint effect of leverage and residual magnitude in a single number
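
As a concrete sketch, the formula can be evaluated in one pass over a fitted model. This toy NumPy example (made-up data) follows the convention in which the denominator counts all k = p + 1 estimated parameters, intercept included:

```python
import numpy as np

# Toy data: the last observation has high leverage AND a large residual.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 10.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 2.0])   # last response breaks the trend
X = np.column_stack([np.ones_like(x), x])       # design matrix with intercept
n, k = X.shape                                  # k = p + 1 estimated parameters

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages h_ii
mse = resid @ resid / (n - k)

# Cook's distance: D_i = e_i^2 / (k * MSE) * h_ii / (1 - h_ii)^2
cooks_d = resid**2 / (k * mse) * h / (1 - h)**2
```

Only the last point exceeds both the 4/n rule and the "almost certainly influential" D > 1 level; the other points stay below 4/n even though some have sizable residuals.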

DFFITS (Difference in Fits)

DFFITS measures how much the predicted value for observation i changes when that observation is deleted from the model:

DFFITS_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\sqrt{MSE_{(i)} \cdot h_{ii}}}

where \hat{y}_{i(i)} is the predicted value for observation i from the model fit without observation i.

  • A threshold of \pm 2\sqrt{(p+1)/n} identifies points that substantially shift their own fitted values
  • DFFITS and Cook's Distance tend to flag the same observations. The conceptual difference is that DFFITS focuses specifically on the change in observation i's own prediction, while Cook's Distance summarizes the change across all predictions.
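
One way to make the definition concrete is to compute DFFITS by brute force: refit the model without each observation and compare the two predictions for that point. A toy NumPy sketch (made-up data):

```python
import numpy as np

# Toy data: the last observation is the influential one.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 10.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 2.0])
X = np.column_stack([np.ones_like(x), x])
n, k = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
yhat = H @ y                                     # fitted values from the full model

dffits = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]  # fit without point i
    r = y[keep] - X[keep] @ beta_i
    mse_i = r @ r / (n - 1 - k)                                # MSE_(i)
    # (yhat_i - yhat_i(i)) / sqrt(MSE_(i) * h_ii)
    dffits[i] = (yhat[i] - X[i] @ beta_i) / np.sqrt(mse_i * h[i])

threshold = 2 * np.sqrt(k / n)                   # +/- 2*sqrt((p+1)/n)
```

Deleting the last point lets the line snap back to the clean trend of the other five, so its own prediction moves dramatically and its DFFITS dwarfs the threshold.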

DFBETAS (Difference in Beta Estimates)

DFBETAS provides coefficient-specific influence. It measures how much each individual regression coefficient \hat{\beta}_j changes when observation i is removed:

DFBETAS_{j(i)} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\sqrt{MSE_{(i)} \cdot (X^TX)^{-1}_{jj}}}

  • A threshold of \pm 2/\sqrt{n} flags observations that substantially shift individual coefficient estimates
  • This produces a matrix of values: one DFBETA for each observation-coefficient combination, revealing which coefficients are most vulnerable to specific observations
  • Especially useful when you care about the interpretation of a particular predictor's effect
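
The observation-by-coefficient matrix can likewise be built by explicit deletion refits. A toy NumPy sketch (made-up data, same illustrative setup as before):

```python
import numpy as np

# Toy data: one influential point at the end.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 10.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 2.0])
X = np.column_stack([np.ones_like(x), x])
n, k = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                         # full-model coefficients

dfbetas = np.empty((n, k))                       # one row per observation,
for i in range(n):                               # one column per coefficient
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    r = y[keep] - X[keep] @ beta_i
    mse_i = r @ r / (n - 1 - k)                  # MSE_(i)
    # (beta_j - beta_j(i)) / sqrt(MSE_(i) * (X^T X)^{-1}_jj), all j at once
    dfbetas[i] = (beta - beta_i) / np.sqrt(mse_i * np.diag(XtX_inv))

threshold = 2 / np.sqrt(n)
```

Inspecting the slope column shows the last observation drags the slope estimate far past the \pm 2/\sqrt{n} cutoff, pinpointing exactly which coefficient it distorts.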

Compare: Cook's Distance vs. DFBETAS: Cook's Distance gives you one number summarizing overall influence, while DFBETAS tells you which specific coefficients are being distorted. If a question asks about influence on a particular predictor's effect, DFBETAS is your answer. If it asks about overall model stability, go with Cook's Distance.


Residual-Based Diagnostics

These metrics focus on how far observations fall from the regression line in the Y-direction. Large residuals indicate poor fit, but a point can be an outlier in Y without having much influence if it lacks leverage.

Studentized Residuals

Ordinary residuals have a problem: their variance depends on leverage. Points with high leverage tend to have artificially small residuals because the regression line is pulled toward them. Externally studentized residuals (also called studentized deleted residuals) solve this by:

  1. Deleting observation i from the dataset
  2. Refitting the model to get MSE_{(i)}, the error variance estimate without that point
  3. Dividing the residual by \sqrt{MSE_{(i)} \cdot (1 - h_{ii})}

This leave-one-out approach prevents a potential outlier from inflating the variance estimate used to standardize its own residual.

  • Under normality assumptions, externally studentized residuals follow a t-distribution with n - p - 2 degrees of freedom, enabling formal hypothesis testing
  • Common thresholds are |t| > 2 or |t| > 3, with a Bonferroni correction applied when screening many observations simultaneously (to control the family-wise error rate)
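
The three-step recipe above can be sketched directly in NumPy; the loop makes the leave-one-out logic explicit (toy data, same illustrative setup as the influence examples):

```python
import numpy as np

# Toy data: the last observation is badly fit by the trend of the others.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 10.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 2.0])
X = np.column_stack([np.ones_like(x), x])
n, k = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
resid = y - H @ y                                # ordinary residuals

t = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]  # steps 1-2: delete & refit
    r = y[keep] - X[keep] @ beta_i
    mse_i = r @ r / (n - 1 - k)                                # MSE_(i)
    t[i] = resid[i] / np.sqrt(mse_i * (1 - h[i]))              # step 3: standardize
```

Because the suspect point is excluded from its own variance estimate, its studentized residual comes out enormous instead of being masked by the inflated MSE it would otherwise cause.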

Outlier Detection Methods (Z-score, IQR)

These are univariate screening tools, useful for initial exploration but not regression-specific.

  • Z-score method flags observations more than 2 or 3 standard deviations from the mean, assuming approximate normality. It can be misleading when the distribution is skewed or heavy-tailed.
  • IQR method uses Q_1 - 1.5 × IQR and Q_3 + 1.5 × IQR as fences. Because it's based on quartiles, it's robust to non-normality and less sensitive to extreme values.
  • Neither method accounts for regression structure or multivariate relationships, so they're best used for preliminary variable-by-variable screening before fitting a model.
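
A quick sketch of both univariate screens on a made-up vector. It also illustrates a classic failure mode: in a small sample, one extreme value inflates the standard deviation enough that the |z| > 3 rule misses it, while the quartile-based fences still catch it:

```python
import numpy as np

# Made-up values with one obvious extreme at the end.
v = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 2.0, 40.0])

# Z-score screen: the outlier inflates the sd, masking itself at the 3-sd rule.
z = (v - v.mean()) / v.std(ddof=1)
z_flags = np.where(np.abs(z) > 3)[0]             # empty here

# IQR screen: quartiles ignore the extreme value, so the fences stay tight.
q1, q3 = np.percentile(v, [25, 75])
iqr = q3 - q1
iqr_flags = np.where((v < q1 - 1.5 * iqr) | (v > q3 + 1.5 * iqr))[0]
```

This masking effect is exactly why the IQR method is described as more robust: the fences are built from quartiles that a single wild value cannot move.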

Compare: Studentized Residuals vs. Z-scores: Studentized residuals are regression-specific and account for leverage through the hat matrix, while Z-scores treat each variable independently. For regression diagnostics, always prefer studentized residuals. Save Z-scores for univariate screening of individual variables.


Visual Diagnostic Tools

Plots provide intuitive ways to identify problematic observations by combining multiple diagnostic dimensions. Visualization often reveals patterns that numerical summaries miss.

Influence Plots

These plots typically display leverage (h_{ii}) on the x-axis and studentized residuals on the y-axis, with point size proportional to Cook's Distance.

  • The upper-right and lower-right corners contain the most dangerous points: high leverage combined with large residuals
  • Points with large bubbles (high Cook's Distance) but moderate leverage or moderate residuals help you see how the two components trade off
  • This is your best quick visual screening tool for identifying observations that warrant further investigation

Partial Regression Plots (Added Variable Plots)

A partial regression plot for predictor X_j works by:

  1. Regressing Y on all predictors except X_j and saving the residuals e_{Y|X_{-j}}
  2. Regressing X_j on all other predictors and saving the residuals e_{X_j|X_{-j}}
  3. Plotting e_{Y|X_{-j}} against e_{X_j|X_{-j}}

The slope of this scatterplot equals the coefficient \hat{\beta}_j from the full multiple regression model. These plots are valuable because they can reveal masked outliers: points that appear normal in a marginal scatterplot of Y vs. X_j but become influential once the effects of the other predictors are removed.
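
The slope identity is easy to verify numerically (it is the Frisch-Waugh-Lovell result). A NumPy sketch with simulated data, where x2 plays the role of X_j:

```python
import numpy as np

# Simulated data: two correlated predictors and a known linear signal.
rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)               # x2 correlated with x1
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

# Full multiple regression: intercept, x1, x2.
X_full = np.column_stack([np.ones(n), x1, x2])
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# Added-variable construction for x2: residualize y and x2 on the rest.
X_rest = np.column_stack([np.ones(n), x1])
e_y = y - X_rest @ np.linalg.lstsq(X_rest, y, rcond=None)[0]     # step 1
e_x2 = x2 - X_rest @ np.linalg.lstsq(X_rest, x2, rcond=None)[0]  # step 2

# Step 3: the slope of e_y on e_x2 equals the full-model coefficient on x2.
slope = (e_x2 @ e_y) / (e_x2 @ e_x2)
```

The agreement is exact (up to floating-point error), which is what makes the plot a faithful picture of one coefficient's estimation after adjusting for everything else.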

Compare: Influence Plots vs. Partial Regression Plots: Influence plots give you a global view of all observations' leverage and residuals, while partial regression plots focus on one predictor's relationship with Y after adjusting for everything else. Use influence plots for overall screening, and partial regression plots for understanding how specific observations affect individual coefficient estimates.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Overall influence | Cook's Distance, DFFITS |
| Coefficient-specific influence | DFBETAS |
| Position in predictor space | Leverage, Mahalanobis Distance |
| Residual magnitude | Studentized Residuals, Z-score |
| Robust univariate screening | IQR method |
| Visual diagnostics | Influence Plots, Partial Regression Plots |
| Multivariate outlier detection | Mahalanobis Distance |
| Single best summary metric | Cook's Distance |

Self-Check Questions

  1. A point has high leverage but a small studentized residual. Is it influential? Explain why Cook's Distance would be low in this case.

  2. Which two diagnostics would you use together to determine both whether a point is influential and which coefficients it affects most?

  3. Compare and contrast Mahalanobis Distance and simple leverage. When would you prefer one over the other, and what assumption does Mahalanobis make that leverage doesn't?

  4. A regression has one observation with Cook's Distance of 0.8. What threshold would you cite, and what additional diagnostics might inform your decision about whether to remove it?

  5. You're examining a partial regression plot and notice one point far from the regression line. Explain why this point might not have appeared unusual in a simple scatterplot of Y versus that predictor.