📊Honors Statistics Unit 12 Review

12.5 Outliers

Written by the Fiveable Content Team • Last updated August 2025
Outliers

Outliers are data points that fall far from the general pattern in your dataset. In the context of linear regression, they matter because even a single outlier can drag your regression line off course, distort your correlation coefficient, and lead you to wrong conclusions. Knowing how to detect and handle them is a core skill in statistics.


Outliers Using the Standard Deviations Rule

One straightforward way to flag potential outliers is to use the mean and standard deviation of your dataset. The idea: if a data point is unusually far from the center of the data, it deserves a closer look.

  • Calculate the mean ($\bar{x}$) and standard deviation ($s$) of your variable.
  • Set boundaries at two standard deviations from the mean:
    • Lower boundary: $\bar{x} - 2s$
    • Upper boundary: $\bar{x} + 2s$
  • Any data point outside these boundaries (i.e., with a z-score greater than 2 or less than -2) is considered a potential outlier.

"Potential" is the key word here. A point outside the boundaries isn't automatically invalid. It could be a measurement error, a data entry mistake, or simply a rare but legitimate observation. You need to investigate why it's unusual before deciding what to do with it.
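The boundary check above can be sketched in a few lines of Python using the standard library. The function name and the sample scores are illustrative, not from the source:

```python
import statistics

def flag_outliers_2sd(data):
    """Flag potential outliers more than 2 standard deviations from the mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)               # sample standard deviation
    lower, upper = mean - 2 * s, mean + 2 * s
    return [x for x in data if x < lower or x > upper]

# Hypothetical exam scores with one suspicious value
scores = [70, 72, 75, 78, 80, 82, 85, 88, 90, 150]
print(flag_outliers_2sd(scores))  # → [150]
```

Note that the flagged value is only a *candidate*: the code tells you which points to investigate, not which points to delete.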

Figure: Outliers using the standard deviations rule (standard score illustration, wikidoc)

Effects of Outlier Removal

Outliers can have a surprisingly large effect on both the regression line (the line that minimizes the sum of squared residuals) and the correlation coefficient ($r$, which measures the strength and direction of a linear relationship on a scale from -1 to 1).

Effect on the regression line:

  • An outlier far from the cluster of other points can "pull" the regression line toward itself, changing both the slope and the y-intercept.
  • Removing that outlier often produces a line that better represents the trend in the rest of the data.

Effect on the correlation coefficient:

  • An outlier can artificially inflate $r$ (e.g., a point that happens to extend the linear pattern) or deflate it (e.g., a point that falls far off the trend).
  • After removal, $r$ may increase or decrease depending on where the outlier sat relative to the pattern.

This is why you should always recalculate both the regression equation and rr after removing an outlier, and compare the results to see how much influence that single point had.
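A quick before-and-after comparison makes this concrete. The sketch below (assuming NumPy is available; the data are made up for illustration) fits the line and computes $r$ with and without a single off-trend point:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1, 12.0,
              14.2, 15.8, 18.1, 5.0])   # last point falls far off the trend

def fit_and_r(x, y):
    slope, intercept = np.polyfit(x, y, 1)   # least-squares regression line
    r = np.corrcoef(x, y)[0, 1]              # correlation coefficient
    return slope, intercept, r

with_outlier = fit_and_r(x, y)
without_outlier = fit_and_r(x[:-1], y[:-1])

print("with outlier:    slope=%.2f, r=%.2f" % (with_outlier[0], with_outlier[2]))
print("without outlier: slope=%.2f, r=%.2f" % (without_outlier[0], without_outlier[2]))
```

Here the outlier deflates both the slope and $r$; removing it restores a nearly perfect linear fit. With a differently placed outlier the comparison could go the other way, which is exactly why you recompute both quantities.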

Figure: Outliers using the standard deviations rule (normal distribution illustration, Wikiversity)

Standard Deviation of Residuals

The standard deviations rule above looks at how far a point is from the mean of $x$ or $y$. But in regression, a more targeted approach is to check how far each point falls from the regression line itself. That's where residuals come in.

A residual is the difference between what you actually observed and what the regression line predicted:

$$\text{residual} = \text{observed value} - \text{predicted value}$$

To use residuals for outlier detection:

  1. Use your regression equation to calculate the predicted value $\hat{y}$ for each data point.
  2. Compute each residual by subtracting the predicted value from the observed value.
  3. Find the standard deviation of all the residuals ($s_e$).
  4. Flag any point whose residual falls outside $\pm 2s_e$ (i.e., more than two standard deviations of residuals from zero).

This method is more useful in a regression context than the basic standard deviations rule because it identifies points that don't fit the linear trend, not just points that are far from the overall mean. A data point could have a perfectly normal $x$-value and $y$-value individually but still be an outlier relative to the regression line.
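The four steps above translate directly to code. This is a minimal sketch (assuming NumPy; the dataset is invented, with the 6th point deliberately placed off the trend):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.0, 4.1, 5.9, 8.2, 10.0, 18.0, 14.1, 16.0])  # 6th point strays

# Step 1: fit the line and compute predicted values
slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

# Step 2: residual = observed - predicted
residuals = y - predicted

# Step 3: standard deviation of the residuals (s_e)
s_e = residuals.std(ddof=1)

# Step 4: flag points whose residual falls outside +/- 2 * s_e
flagged = x[np.abs(residuals) > 2 * s_e]
print(flagged)  # → [6.]
```

Notice that the point at $x = 6$ has an ordinary $x$-value and a $y$-value within the range of the data; only its distance from the fitted line gives it away.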

Additional Outlier Detection Methods

  • Interquartile Range (IQR) method: Uses the spread of the middle 50% of the data to set boundaries. A point is flagged as an outlier if it falls below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$. This is the rule behind the "whiskers" and individual points you see on a boxplot. The IQR method is resistant to extreme values, which makes it a good complement to the standard deviation approach.
  • Leverage: Measures how far a data point's $x$-value is from the mean of $x$. High-leverage points sit at the edges of the predictor variable's range, giving them more potential to tilt the regression line.
  • Influential points: These are points that, when removed, cause a large change in the regression results. A point can have high leverage without being influential (if it falls right on the trend), or it can be influential because it's both far out in $x$ and far off the trend. Cook's distance quantifies this by measuring how much all the predicted values shift when a single observation is removed.
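The IQR fences can also be computed with the standard library. One caveat for this sketch: Python's `statistics.quantiles` uses an interpolation method that can give slightly different quartiles than the by-hand textbook procedure, so the fences may not match hand calculations exactly. The function name and data are illustrative:

```python
import statistics

def iqr_outliers(data):
    """Flag points outside the 1.5 * IQR fences (the boxplot rule)."""
    q1, _, q3 = statistics.quantiles(data, n=4)   # quartiles Q1, Q2, Q3
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lower or x > upper]

data = [12, 14, 14, 15, 16, 17, 18, 19, 20, 45]
print(iqr_outliers(data))  # → [45]
```

Because the fences depend on quartiles rather than the mean, the extreme value 45 barely moves them, which is what "resistant" means in practice.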