Fiveable

🥖Linear Modeling Theory Unit 9 Review


9.3 Influence Diagnostics and Leverage Points


Written by the Fiveable Content Team • Last updated August 2025

Influence and leverage in regression

Influence diagnostics help you figure out which observations are driving your regression results. A single unusual data point can shift your estimated coefficients, change your predicted values, and distort your overall model fit. These tools let you identify those points so you can decide what to do about them.

Understanding influence and leverage

Influence and leverage are related but distinct concepts:

  • Influence refers to how much a single observation changes the estimated regression coefficients or predicted values. If removing one data point substantially shifts your fitted model, that point is influential.
  • Leverage measures how far an observation's predictor values are from the center of the predictor space. A high-leverage point has an unusual combination of predictor values, sitting far from where most of the data lives.

The key relationship: a high-leverage point can be influential, but it isn't necessarily so. A point far out in predictor space that falls right along the regression surface won't distort your estimates much. But a high-leverage point with a large residual will pull the fitted model toward itself, potentially biasing your coefficients. Influence is roughly the product of leverage and the size of the residual.

Calculating leverage values

Leverage values come from the hat matrix H = X(X^TX)^{-1}X^T, which maps observed responses to fitted values (\hat{y} = Hy). The diagonal element h_{ii} gives the leverage of the i-th observation.

Properties of leverage values:

  • Each h_{ii} falls between 1/n and 1
  • The leverage values sum to p, the number of parameters (including the intercept), so the average leverage is p/n
  • An observation is typically flagged as high-leverage if h_{ii} > 2p/n (a common cutoff) or sometimes 3p/n for a more conservative threshold
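These properties are easy to verify numerically. Below is a minimal numpy sketch using a small invented dataset whose last observation sits far from the center of the predictor space; the data and variable names are illustrative, not from any particular source.

```python
import numpy as np

# Toy data: one predictor plus intercept; the last point is far from the
# center of the predictor space (a candidate high-leverage point).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
X = np.column_stack([np.ones_like(x), x])   # n x p design matrix
n, p = X.shape

# Hat matrix H = X (X^T X)^{-1} X^T; its diagonal holds the leverages.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

print(np.round(h, 3))               # each h_ii lies between 1/n and 1
print(np.isclose(h.sum(), p))       # leverages sum to p -> True
print(np.where(h > 2 * p / n)[0])   # flag points above the 2p/n cutoff -> [5]
```

Only the extreme point exceeds the 2p/n cutoff here; the five clustered points all have leverage near the average p/n.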

To assess whether a high-leverage point actually matters, compare the regression results with and without that observation. If removing it substantially changes the estimated coefficients or fit statistics, the point is both high-leverage and influential.

Identifying influential observations


Cook's distance

Cook's distance is the most widely used single-number summary of an observation's influence. It combines residual size and leverage into one measure, capturing how much all the fitted values shift when observation i is deleted.

The formula is:

D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}

where r_i is the internally studentized residual, p is the number of parameters, and h_{ii} is the leverage.

Notice how D_i increases when either the residual or the leverage is large. A point needs to be unusual in at least one of these dimensions to register a high Cook's distance, and points that are extreme in both dimensions will dominate.

Cutoff rules of thumb:

  • D_i > 4/(n - p) is a common threshold for flagging potentially influential points
  • Some texts use D_i > 1 or compare D_i to the 50th percentile of the F_{p, n-p} distribution
  • These are guidelines, not hard rules. Always investigate flagged points rather than automatically removing them.
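As a sketch of the computation, here is a numpy-only example on a small invented dataset (five points near y ≈ x plus one off-trend, high-leverage point); the data are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 25.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta                                # ordinary residuals
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages
s2 = e @ e / (n - p)                            # residual variance estimate
r = e / np.sqrt(s2 * (1 - h))                   # internally studentized residuals

D = r**2 / p * h / (1 - h)                      # Cook's distance
print(np.round(D, 2))
print(np.where(D > 4 / (n - p))[0])             # common cutoff -> [5]
```

The high-leverage, off-trend point dominates: its D_i is far above both the 4/(n - p) cutoff and the D_i > 1 rule, while every other point stays well below.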

Studentized deleted residuals

Ordinary residuals have a problem: the fitted model was influenced by the very observation you're trying to evaluate. Studentized deleted residuals (also called externally studentized residuals) solve this by refitting the model with observation i removed and then measuring how far y_i falls from that prediction.

The studentized deleted residual for observation i is:

t_i = e_i \cdot \sqrt{\frac{n - p - 1}{SSE(1 - h_{ii}) - e_i^2}}

where e_i is the ordinary residual and SSE is the residual sum of squares from the full model.

Under standard assumptions, t_i follows a t-distribution with n - p - 1 degrees of freedom. Observations with |t_i| > 2 deserve a closer look, and |t_i| > 3 is a strong signal of an outlier. Because you're testing multiple observations, consider a Bonferroni correction (compare against t_{\alpha/(2n), n-p-1}) to control for multiple comparisons.
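The formula above avoids n separate refits: the deleted residual comes out of full-model quantities alone. A numpy sketch on a small invented dataset (five points near y ≈ x plus one high-leverage point off the trend):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 25.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
sse = e @ e

# Studentized deleted residuals: each t_i reflects a fit that excludes
# point i, computed via the algebraic shortcut (no explicit refitting).
t = e * np.sqrt((n - p - 1) / (sse * (1 - h) - e**2))

print(np.round(t, 2))
print(np.where(np.abs(t) > 2)[0])   # candidates worth a closer look -> [5]
```

Note the contrast with ordinary residuals: the extreme point's raw residual is small because the fit was dragged toward it, but its deleted residual is enormous once the model is judged without it.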

Impact of leverage points


Effects on regression coefficients and model fit

High-leverage points exert disproportionate pull on the regression surface because the least squares fit minimizes the sum of squared residuals, and a point far from the center of the predictor space has more "lever arm" to tilt the fitted plane.

Specific consequences include:

  • Biased slope estimates: A high-leverage point with a large residual can drag the regression surface toward itself, shifting one or more slope coefficients substantially.
  • Inflated or deflated R^2: A high-leverage point that happens to follow the trend can artificially increase R^2, while one that doesn't can decrease it.
  • Deceptively small residuals at the leverage point: Because Var(e_i) = \sigma^2(1 - h_{ii}), the fit is pulled close to high-leverage observations, so their ordinary residuals can look unremarkable even when the point is distorting the model.
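To see the pull concretely, here is a tiny made-up example: five points following y ≈ x, plus one high-leverage point well off that trend. Fitting with and without it shows the slope shift (numpy-only sketch; the data are invented for illustration).

```python
import numpy as np

# Five well-behaved points following y ≈ x, plus one high-leverage
# point whose response sits far above the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 25.0])

def slope(xv, yv):
    """Least squares slope of yv on xv (with intercept)."""
    X = np.column_stack([np.ones_like(xv), xv])
    return np.linalg.lstsq(X, yv, rcond=None)[0][1]

print(round(slope(x, y), 2))            # full fit, pulled by the outlier -> 1.77
print(round(slope(x[:-1], y[:-1]), 2))  # without it, near the true trend -> 1.02
```

One observation moves the slope by roughly 75%, which is exactly the "lever arm" effect: the same residual at x = 3 would barely move the fit.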

Assessing the influence of leverage points

A practical workflow for evaluating leverage points:

  1. Flag observations where h_{ii} > 2p/n
  2. Check whether those flagged points also have large studentized deleted residuals or Cook's distances
  3. Refit the model excluding each flagged point (or groups of flagged points) and compare the estimated coefficients, standard errors, and R^2
  4. Investigate the flagged observations themselves: Are they data entry errors? Do they come from a distinct subpopulation? Are they genuine but rare extreme values?
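Steps 1 and 3 of this workflow can be sketched in a few lines of numpy. The dataset is invented for illustration; in practice you would substitute your own design matrix and response.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 25.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta = np.linalg.lstsq(X, y, rcond=None)[0]
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Step 1: flag high-leverage observations.
flagged = np.where(h > 2 * p / n)[0]

# Step 3: refit without each flagged point and compare coefficients.
for i in flagged:
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    print(i, np.round(beta - beta_i, 2))   # change in (intercept, slope)
```

A large coefficient change on deletion is the direct, model-level confirmation that a flagged point is influential, not merely high-leverage.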

Points that are high-leverage but have small residuals are generally less concerning. They sit far out in predictor space but conform to the pattern established by the rest of the data. Points that are both high-leverage and have large residuals are the ones that demand the most scrutiny.

Outliers and high-leverage observations

Identifying outliers and high-leverage observations

Several graphical and numerical tools work together here:

  • Residual plots (residuals vs. fitted values, residuals vs. each predictor): Outliers show up as points with unusually large vertical distance from zero. High-leverage points show up at the extremes of the horizontal axis in predictor-specific plots.
  • Leverage values (h_{ii}): Directly flag observations with unusual predictor combinations. A plot of h_{ii} against observation index makes high-leverage points easy to spot.
  • Studentized deleted residuals (t_i): Identify outliers in the response direction while accounting for each point's leverage.
  • Cook's distance plot: Combines both dimensions into a single diagnostic. Plotting D_i against observation index quickly highlights the most influential cases.

Using these tools together is more reliable than relying on any single measure. An observation might have moderate leverage and a moderate residual, neither of which triggers a threshold on its own, but Cook's distance captures the combined effect.
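One convenient way to use the measures together is a per-observation diagnostic table. The sketch below (numpy only, invented data) flags each point for high leverage (L), outlying deleted residual (O), and high Cook's distance (I); the flag letters and layout are my own convention, not a standard.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 25.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
sse = e @ e

t = e * np.sqrt((n - p - 1) / (sse * (1 - h) - e**2))   # deleted residuals
r = e / np.sqrt(sse / (n - p) * (1 - h))                # internally studentized
D = r**2 / p * h / (1 - h)                              # Cook's distance

print(" i     h_ii     t_i      D_i   flags")
for i in range(n):
    flags = ("L" if h[i] > 2 * p / n else " ") + \
            ("O" if abs(t[i]) > 2 else " ") + \
            ("I" if D[i] > 4 / (n - p) else " ")
    print(f"{i:2d}  {h[i]:7.3f} {t[i]:8.2f} {D[i]:8.2f}   {flags}")
```

In this toy dataset one observation trips all three flags at once, which is precisely the high-leverage, large-residual combination the text singles out as most dangerous.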

Handling outliers and high-leverage observations

Once you've identified unusual observations, the goal is to understand them, not to automatically delete them.

Step-by-step approach:

  1. Verify the data. Check for transcription errors, unit mismatches, or instrument malfunctions. If the value is simply wrong, correct it or remove it.
  2. Investigate the context. Does the observation come from a different population or a rare but real scenario? Understanding why a point is unusual often matters more than the diagnostics themselves.
  3. Compare models with and without the point. Report both sets of results. If your conclusions change depending on one observation, that's important information for your audience.
  4. Consider robust alternatives. Methods like M-estimation or least trimmed squares down-weight extreme observations automatically, producing coefficient estimates that are less sensitive to outliers and high-leverage points. These are especially useful when you have multiple influential points that are difficult to assess individually.
  5. Report transparently. Whether you retain or remove unusual observations, document the decision and its justification. Show the sensitivity of your results to these points so readers can evaluate the robustness of your conclusions.
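For step 4, least trimmed squares minimizes the sum of the h smallest squared residuals rather than all of them. The brute-force sketch below (numpy only, invented data) approximates LTS by fitting on every size-h subset, which is feasible only for very small n; production implementations use randomized subset search instead.

```python
import numpy as np
from itertools import combinations

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 25.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
hsize = n // 2 + (p + 1) // 2          # common LTS subset size (here 4)

best = (np.inf, None)
for idx in combinations(range(n), hsize):
    idx = list(idx)
    b = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    # Trimmed criterion: sum of the hsize smallest squared residuals.
    sq = np.sort((y - X @ b) ** 2)[:hsize].sum()
    if sq < best[0]:
        best = (sq, b)

print(np.round(best[1], 2))   # robust (intercept, slope), near (0, 1)
```

Unlike the ordinary least squares fit, the trimmed fit effectively ignores the high-leverage outlier and recovers a slope near the trend of the clean points.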

Deleting observations just because they're inconvenient is bad practice. Retaining clearly erroneous data is equally problematic. The right call depends on the specific context, and the diagnostics covered here give you the evidence to make that call responsibly.