Influential Linear Regression Outliers to Know for Linear Modeling Theory

Understanding influential outliers in linear regression is crucial for accurate modeling. Key metrics like Cook's Distance, DFFITS, and leverage help identify observations that significantly impact the regression results, ensuring reliable predictions and insights from the data.

  1. Cook's Distance

    • Measures the influence of each observation on the overall regression model.
    • A high Cook's Distance indicates that the observation significantly affects the fitted values.
    • Typically, a threshold of 4/n (where n is the number of observations) is used to identify influential points.
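Cook's Distance has a closed form in terms of the residuals and leverages, so no refitting is needed. A minimal numpy sketch on synthetic data (the dataset, seed, and injected outlier are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)
y[0] += 10.0                                # inject one influential outlier

X = np.column_stack([np.ones(n), x])        # design matrix with intercept
p = X.shape[1]                              # number of parameters
Q, _ = np.linalg.qr(X)
h = np.sum(Q**2, axis=1)                    # leverages h_ii (hat-matrix diagonal)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                            # raw residuals
s2 = e @ e / (n - p)                        # residual mean square
cooks = (e**2 / (p * s2)) * h / (1.0 - h)**2
flagged = np.flatnonzero(cooks > 4.0 / n)   # conventional 4/n cutoff
```

The point at index 0 lands in `flagged`: its large residual dominates the distance even though its leverage is ordinary.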
  2. DFFITS (Difference in Fits)

    • Quantifies the change in fitted values when a specific observation is removed from the dataset.
    • A large DFFITS value suggests that the observation has a strong influence on the regression results.
    • Commonly, a threshold of ±2√(p/n) (where p is the number of model parameters, including the intercept) is used for detection.
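DFFITS is the externally studentized residual scaled by a leverage factor, which can also be computed without refitting. A numpy sketch on synthetic data (dataset and injected outlier are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 2
x = rng.normal(size=n)
y = 1.0 + 3.0 * x + rng.normal(scale=0.4, size=n)
y[5] += 8.0                                 # make one point influential

X = np.column_stack([np.ones(n), x])
Q, _ = np.linalg.qr(X)
h = np.sum(Q**2, axis=1)                    # leverages
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
s2 = e @ e / (n - p)
# externally studentized residuals: variance estimated without point i
s2_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(s2_i * (1 - h))
dffits = t * np.sqrt(h / (1 - h))
cutoff = 2.0 * np.sqrt(p / n)
flagged = np.flatnonzero(np.abs(dffits) > cutoff)
```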
  3. DFBETAS (Difference in Beta Estimates)

    • Measures the change in each regression coefficient when a specific observation is excluded.
    • High absolute values indicate that the observation has a significant impact on the coefficient estimates.
    • A common threshold for identifying influential points is ±2/√n.
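A direct way to see DFBETAS is the leave-one-out loop itself: refit without each observation and scale the coefficient change. A numpy sketch on synthetic data (the off-trend point at index 3 is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x = rng.normal(size=n)
y = 0.5 + 2.0 * x + rng.normal(scale=0.3, size=n)
x[3], y[3] = 4.0, -5.0                      # high-leverage, off-trend point

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
C = np.linalg.inv(X.T @ X)                  # (X'X)^{-1}, for coefficient scales

dfbetas = np.empty((n, p))
for i in range(n):
    Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
    bi, *_ = np.linalg.lstsq(Xi, yi, rcond=None)
    ei = yi - Xi @ bi
    s_i = np.sqrt(ei @ ei / (n - 1 - p))    # residual std. dev. without point i
    dfbetas[i] = (beta - bi) / (s_i * np.sqrt(np.diag(C)))

cutoff = 2.0 / np.sqrt(n)
flagged = np.flatnonzero(np.any(np.abs(dfbetas) > cutoff, axis=1))
```

Each row of `dfbetas` shows how much each coefficient moves, in standard-error units, when that observation is dropped.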
  4. Leverage (Hat Values)

    • Indicates how far an observation's predictor values are from the mean of the predictor values.
    • High leverage points can disproportionately affect the regression line.
    • A leverage value greater than 2p/n (twice the average leverage, since the hat values sum to p) is often considered high.
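The hat-matrix diagonal can be computed cheaply from a thin QR decomposition of the design matrix. A numpy sketch (synthetic data; the extreme predictor value is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x = rng.normal(size=n)
x[7] = 6.0                                  # predictor value far from the rest

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
Q, _ = np.linalg.qr(X)                      # thin QR: h_ii = sum_j Q_ij^2
h = np.sum(Q**2, axis=1)
flagged = np.flatnonzero(h > 2.0 * p / n)   # 2p/n = twice the average leverage
```

Note that leverage depends only on the predictor values, not on y: a point can have high leverage without being an outlier in the response.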
  5. Studentized Residuals

    • Residuals scaled by an estimate of their standard deviation, which accounts for each observation's leverage; the externally studentized version excludes the observation itself from the variance estimate.
    • Helps identify outliers by comparing the residuals to a t-distribution.
    • A common threshold for identifying outliers is an absolute value greater than 2 or 3.
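Both the internally and externally studentized versions follow from the residuals and leverages. A numpy sketch on synthetic data (the vertical outlier at index 10 is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 80
x = rng.normal(size=n)
y = -1.0 + 1.5 * x + rng.normal(scale=0.6, size=n)
y[10] += 6.0                                # vertical outlier at a typical x

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
Q, _ = np.linalg.qr(X)
h = np.sum(Q**2, axis=1)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
s2 = e @ e / (n - p)
# internally studentized: full-sample variance estimate
r = e / np.sqrt(s2 * (1 - h))
# externally studentized: drops point i from the variance estimate
s2_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(s2_i * (1 - h))
flagged = np.flatnonzero(np.abs(t) > 3.0)
```

The external version follows a t-distribution with n − p − 1 degrees of freedom under the model, which justifies the rule-of-thumb cutoffs of 2 or 3.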
  6. Mahalanobis Distance

    • Measures the distance of a point from the mean of a multivariate distribution, accounting for correlations between variables.
    • A high Mahalanobis distance indicates that an observation is an outlier in the context of the multivariate space.
    • Typically, a threshold based on the chi-squared distribution is used for detection.
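Mahalanobis distance catches points whose *combination* of predictor values is unusual even when no single coordinate is extreme. A numpy sketch with two correlated predictors (data and the planted point are illustrative; the cutoff 7.378 is the chi-squared 0.975 quantile with 2 degrees of freedom):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
cov = np.array([[1.0, 0.8], [0.8, 1.0]])    # two correlated predictors
Xp = rng.multivariate_normal([0.0, 0.0], cov, size=n)
Xp[0] = [2.5, -2.5]                         # unusual combination, not extreme per axis

mu = Xp.mean(axis=0)
S = np.cov(Xp, rowvar=False)
Sinv = np.linalg.inv(S)
# squared distances d_i^2 = (x_i - mu)' S^{-1} (x_i - mu)
d2 = np.einsum('ij,jk,ik->i', Xp - mu, Sinv, Xp - mu)
# chi-squared(df=2) 0.975 quantile: -2*ln(0.025) ≈ 7.378
flagged = np.flatnonzero(d2 > 7.378)
```

The planted point sits well inside the marginal ranges of each predictor but violates their positive correlation, so its distance is large.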
  7. Outlier Detection Methods (e.g., Z-score, IQR)

    • Z-score identifies outliers based on how many standard deviations a point is from the mean.
    • IQR method identifies outliers by calculating the interquartile range and determining points outside 1.5 times the IQR.
    • Both methods provide a straightforward approach to flagging potential outliers.
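Both rules are a few lines of numpy. A sketch on synthetic data (the planted value at index 42 is illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(loc=10.0, scale=2.0, size=100)
data[42] = 30.0                             # obvious outlier

# Z-score rule: more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
z_out = np.flatnonzero(np.abs(z) > 3.0)

# IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_out = np.flatnonzero((data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr))
```

The IQR rule is more robust because the quartiles themselves are barely moved by the outlier, whereas the outlier inflates the mean and standard deviation used by the Z-score rule.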
  8. Influence Plots

    • Graphical representation that combines leverage and studentized residuals to identify influential observations.
    • Points that are far from the center and have high residuals are flagged as influential.
    • Useful for visualizing the impact of individual data points on the regression model.
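The two axes of an influence plot (leverage and studentized residual) and the usual bubble size (Cook's Distance) can all be computed directly; rendering is then a single scatter call. A numpy sketch (synthetic data; the point at index 0 is planted to be both high-leverage and off-trend):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 5.0, -3.0                      # high leverage AND large residual

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
Q, _ = np.linalg.qr(X)
h = np.sum(Q**2, axis=1)                    # x-axis: leverage
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
s2 = e @ e / (n - p)
r = e / np.sqrt(s2 * (1 - h))               # y-axis: studentized residual
cooks = r**2 * h / (p * (1 - h))            # bubble size: Cook's Distance
influential = np.flatnonzero((h > 2 * p / n) & (np.abs(r) > 2))
# e.g. plt.scatter(h, r, s=1000 * cooks) would render the plot (matplotlib)
```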
  9. Partial Regression Plots

    • Illustrate the relationship between a specific predictor and the response variable while controlling for other predictors.
    • Help identify the influence of individual predictors on the response.
    • Useful for detecting outliers that may affect the relationship being studied.
  10. Added Variable Plots

    • Show the effect of adding a predictor to a regression model, controlling for other predictors.
    • Help visualize the contribution of a specific variable to the model.
    • Can reveal influential points that may distort the perceived relationship between variables.
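Partial regression plots and added variable plots rest on the same construction: residualize both y and the predictor of interest on all the other predictors, then plot one set of residuals against the other. By the Frisch–Waugh–Lovell theorem, the slope through that cloud equals the predictor's coefficient in the full multiple regression, which is why an influential point in the plot distorts that coefficient. A numpy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)   # correlated predictors
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

X_full = np.column_stack([np.ones(n), x1, x2])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# residualize y and x2 on everything except x2
X_other = np.column_stack([np.ones(n), x1])
ry = y - X_other @ np.linalg.lstsq(X_other, y, rcond=None)[0]
rx = x2 - X_other @ np.linalg.lstsq(X_other, x2, rcond=None)[0]

slope = (rx @ ry) / (rx @ rx)       # slope of the added-variable cloud
# Frisch–Waugh–Lovell: slope equals the full-model coefficient on x2
```

Plotting `ry` against `rx` gives the added variable plot; a single point far from the cloud's trend line signals an observation that is distorting the coefficient on x2.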


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.
