Outliers
Outliers are data points that fall far from the general pattern in your dataset. In the context of linear regression, they matter because even a single outlier can drag your regression line off course, distort your correlation coefficient, and lead you to wrong conclusions. Knowing how to detect and handle them is a core skill in statistics.

Outliers Using the Standard Deviations Rule
One straightforward way to flag potential outliers is to use the mean and standard deviation of your dataset. The idea: if a data point is unusually far from the center of the data, it deserves a closer look.
- Calculate the mean (x̄) and standard deviation (s) of your variable.
- Set boundaries at two standard deviations from the mean:
  - Lower boundary: x̄ − 2s
  - Upper boundary: x̄ + 2s
- Any data point outside these boundaries (i.e., with a z-score greater than 2 or less than −2) is considered a potential outlier.
"Potential" is the key word here. A point outside the boundaries isn't automatically invalid. It could be a measurement error, a data entry mistake, or simply a rare but legitimate observation. You need to investigate why it's unusual before deciding what to do with it.
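The steps above can be sketched in a few lines of Python (the dataset here is made up purely for illustration):

```python
# Two-standard-deviations rule: flag points more than 2 SDs from the mean.
data = [4.1, 5.0, 4.8, 5.2, 4.9, 5.1, 9.7, 4.7]

n = len(data)
mean = sum(data) / n
# Sample standard deviation (divide by n - 1)
sd = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5

lower, upper = mean - 2 * sd, mean + 2 * sd
potential_outliers = [x for x in data if x < lower or x > upper]
print(potential_outliers)  # the 9.7 falls above the upper boundary
```

Remember that this only flags candidates; as noted above, each flagged value still needs to be investigated before you decide to remove it.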

Effects of Outlier Removal
Outliers can have a surprisingly large effect on both the regression line (the line that minimizes the sum of squared residuals) and the correlation coefficient (r, which measures the strength and direction of a linear relationship on a scale from −1 to 1).
Effect on the regression line:
- An outlier far from the cluster of other points can "pull" the regression line toward itself, changing both the slope and the y-intercept.
- Removing that outlier often produces a line that better represents the trend in the rest of the data.
Effect on the correlation coefficient:
- An outlier can artificially inflate r (e.g., a point that happens to extend the linear pattern) or deflate it (e.g., a point that falls far off the trend).
- After removal, r may increase or decrease depending on where the outlier sat relative to the pattern.
This is why you should always recalculate both the regression equation and r after removing an outlier, and compare the results to see how much influence that single point had.
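To see this before-and-after comparison concretely, here is a small sketch that fits a least-squares line and computes r with and without a single outlier (the data and the `fit` helper are invented for illustration):

```python
# Compare slope, intercept, and r with and without one outlier.
def fit(xs, ys):
    """Least-squares slope, intercept, and correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 4.0, 6.2, 7.9, 10.1, 30.0]  # last point sits far off the trend

print(fit(xs, ys))            # with the outlier: steeper slope, weaker r
print(fit(xs[:-1], ys[:-1]))  # without it: slope near 2, r near 1
```

Removing the single point at (6, 30.0) pulls the slope back toward the trend of the other five points and strengthens r, which is exactly the recalculate-and-compare check described above.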

Standard Deviation of Residuals
The standard deviations rule above looks at how far a point is from the mean of x or y. But in regression, a more targeted approach is to check how far each point falls from the regression line itself. That's where residuals come in.
A residual is the difference between what you actually observed and what the regression line predicted: residual = y − ŷ.
To use residuals for outlier detection:
- Use your regression equation to calculate the predicted value (ŷ) for each data point.
- Compute each residual by subtracting the predicted value from the observed value.
- Find the standard deviation of all the residuals (s_res).
- Flag any point whose residual falls outside ±2 × s_res (i.e., more than two standard deviations of residuals from zero).
This method is more useful in a regression context than the basic standard deviations rule because it identifies points that don't fit the linear trend, not just points that are far from the overall mean. A data point could have a perfectly normal x-value and y-value individually but still be an outlier relative to the regression line.

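Putting those steps together, here is a sketch of residual-based outlier detection (the data are invented; the standard deviation of the residuals is computed with n − 2 degrees of freedom, a common convention in regression, though some texts divide by n or n − 1):

```python
# Flag points whose residual exceeds 2 standard deviations of the residuals.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 25.0, 16.1]  # the point at x = 7 is off the trend

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# residual = observed y minus predicted y
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
s_res = (sum(e ** 2 for e in residuals) / (n - 2)) ** 0.5

flagged = [(x, y) for (x, y), e in zip(zip(xs, ys), residuals) if abs(e) > 2 * s_res]
print(flagged)  # only the point off the trend is flagged
```

Note that the point (8, 16.1) has the largest y-value after the outlier but is not flagged, because it sits close to the fitted line; that is the sense in which this method targets the trend rather than the raw values.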
Additional Outlier Detection Methods
- Interquartile Range (IQR) method: Uses the spread of the middle 50% of the data to set boundaries. A point is flagged as an outlier if it falls below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. This is the rule behind the "whiskers" and individual points you see on a boxplot. The IQR method is resistant to extreme values, which makes it a good complement to the standard deviation approach.
- Leverage: Measures how far a data point's x-value is from the mean of x. High-leverage points sit at the edges of the predictor variable's range, giving them more potential to tilt the regression line.
- Influential points: These are points that, when removed, cause a large change in the regression results. A point can have high leverage without being influential (if it falls right on the trend), or it can be influential because it's both far out in x and far off the trend. Cook's distance quantifies this by measuring how much all the predicted values shift when a single observation is removed.
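Of these, the IQR method is the easiest to sketch in code. The example below uses Python's standard-library `statistics.quantiles` to get Q1 and Q3 (quartile conventions vary between textbooks and software, so boundary values may differ slightly from hand calculations):

```python
# 1.5 * IQR rule, the same rule a boxplot uses for its whiskers.
import statistics

data = [4.1, 4.7, 4.8, 4.9, 5.0, 5.1, 5.2, 9.7]

q1, _, q3 = statistics.quantiles(data, n=4)  # default "exclusive" method
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # only the 9.7 falls outside the fences
```

Because Q1 and Q3 ignore the extremes entirely, the 9.7 cannot widen its own fences the way a large value inflates the mean and standard deviation; that resistance is what makes the IQR rule a useful cross-check.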