🎲 Data, Inference, and Decisions

Outlier Detection Methods


Why This Matters

Outlier detection sits at the heart of data inference because a single anomalous point can completely distort your conclusions. You're being tested on your ability to recognize when data violates assumptions, choose appropriate detection methods for different data structures, and justify how removing (or keeping) particular observations changes your statistical conclusions. These methods connect directly to concepts like distributional assumptions, robust estimation, model diagnostics, and the bias-variance tradeoff.

The key insight isn't just knowing these techniques exist; it's understanding when each method is appropriate. A Z-score works great for univariate normal data but fails spectacularly with multivariate correlations or skewed distributions. Density-based methods shine with clustered data but struggle in high dimensions. Don't just memorize formulas; know what data structure and distributional assumption each method requires, and be ready to defend your choice on an FRQ.


Distribution-Based Methods

These classical approaches assume your data follows a known distribution and flag points that fall too far from expected values. The underlying principle is probability: if a point is extremely unlikely under the assumed distribution, it's probably an outlier.

Z-Score Method

  • Measures standard deviations from the mean; points with |Z| > 3 are typically flagged as outliers, representing roughly 0.3% of normally distributed data
  • Assumes normality, which limits applicability; non-normal distributions produce misleading Z-scores and false positives
  • Simple and interpretable for univariate analysis, but ignores relationships between variables in multivariate settings
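
As a concrete illustration, here is a minimal NumPy sketch of the |Z| > 3 rule; the sample size and the two injected outliers are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=500)   # synthetic, roughly normal data
x = np.append(x, [120.0, -40.0])             # inject two obvious outliers

z = (x - x.mean()) / x.std()                 # standardize: distance from the mean in SD units
outliers = x[np.abs(z) > 3]                  # flag points more than 3 standard deviations out
print(outliers)
```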

Interquartile Range (IQR) Method

  • Uses quartiles instead of mean/standard deviation; outliers fall below Q1 - 1.5 \times IQR or above Q3 + 1.5 \times IQR
  • Robust to skewness and non-normality because quartiles aren't affected by extreme values the way means are
  • Distribution-free approach makes it ideal when you can't assume normality or when data has heavy tails
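
A minimal NumPy sketch of the 1.5 × IQR fences on a synthetic right-skewed sample (the lognormal parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
income = rng.lognormal(mean=10, sigma=0.5, size=1000)   # synthetic right-skewed, income-like data

q1, q3 = np.percentile(income, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr           # the standard 1.5 x IQR fences
outliers = income[(income < lower) | (income > upper)]
print(f"{len(outliers)} points flagged outside [{lower:.0f}, {upper:.0f}]")
```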

Mahalanobis Distance

  • Accounts for correlations between variables; measures distance from the multivariate mean using the covariance matrix: D_M = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}
  • Essential for multivariate data where variables are correlated; Euclidean distance would miss outliers that are extreme only in their combination of values
  • Requires estimating covariance, which itself can be corrupted by outliers, creating a chicken-and-egg problem
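
A short NumPy/SciPy sketch of the calculation on synthetic correlated data; the chi-square cutoff assumes approximate multivariate normality, and the 0.999 quantile is an illustrative choice.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
cov = np.array([[1.0, 0.9], [0.9, 1.0]])             # two strongly correlated features
X = rng.multivariate_normal([0.0, 0.0], cov, size=500)
X = np.vstack([X, [[2.5, -2.5]]])                    # extreme only in its combination of values

mu = X.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
d2 = np.einsum("ij,jk,ik->i", diff, Sigma_inv, diff)  # squared Mahalanobis distance per row

cutoff = chi2.ppf(0.999, df=X.shape[1])              # cutoff under multivariate normality
print(X[d2 > cutoff])   # the appended point is flagged even though each coordinate alone is only 2.5 SD out
```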

Compare: Z-score vs. IQR. Both work for univariate data, but the Z-score assumes normality while the IQR is robust to skewness. If an FRQ gives you income data (typically right-skewed), IQR is your safer choice.


Density-Based Methods

These methods detect outliers by examining how isolated a point is relative to its neighbors. The core idea: normal points cluster together, while outliers live in sparse regions of the feature space.

Local Outlier Factor (LOF)

  • Compares a point's local density to its neighbors' densities; a point whose neighbors lie in dense regions while it sits in a sparse pocket gets a high LOF score
  • Handles varying cluster densities where global methods fail; a point might be far from the overall mean but perfectly normal within its local cluster
  • Requires choosing neighborhood size (k), which affects sensitivity: too small captures noise, too large misses local anomalies
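
A minimal sketch using scikit-learn's LocalOutlierFactor on synthetic two-cluster data; the choice of n_neighbors=20 is illustrative, not a recommendation.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
tight = rng.normal(0.0, 0.3, size=(200, 2))     # tight cluster
loose = rng.normal(5.0, 1.5, size=(200, 2))     # looser cluster
X = np.vstack([tight, loose, [[2.5, 2.5]]])     # one point stranded between the clusters

lof = LocalOutlierFactor(n_neighbors=20)        # k = 20 is an illustrative choice
labels = lof.fit_predict(X)                     # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_          # higher score = more anomalous
print(X[labels == -1])
print("max LOF score:", scores.max().round(2))
```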

DBSCAN

  • Simultaneously clusters and detects outliers; points in low-density regions that can't join any cluster are labeled as noise
  • Two parameters control behavior: epsilon (neighborhood radius) and minimum points (density threshold for core points)
  • Discovers clusters of arbitrary shape while naturally identifying outliers, making it powerful for exploratory analysis
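
A minimal sketch using scikit-learn's DBSCAN; the eps and min_samples values are illustrative and would normally be tuned (for example, with a k-distance plot).

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)
cluster1 = rng.normal(0.0, 0.4, size=(150, 2))
cluster2 = rng.normal(6.0, 0.4, size=(150, 2))
X = np.vstack([cluster1, cluster2, [[3.0, 3.0], [10.0, -5.0]]])   # two isolated points

# eps = neighborhood radius, min_samples = density threshold for a core point
db = DBSCAN(eps=0.8, min_samples=5).fit(X)
noise = X[db.labels_ == -1]       # points that join no cluster are labeled -1 (noise)
print(noise)
```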

Compare: LOF vs. DBSCAN. Both use density, but LOF assigns continuous anomaly scores while DBSCAN makes binary cluster/noise decisions. LOF is better when you need to rank outliers by severity.


Model-Based Methods

These approaches learn a model of "normal" data and flag points that don't fit. The principle: outliers are points the model struggles to explain or that disproportionately influence model parameters.

Isolation Forest

  • Isolates outliers through random partitioning; outliers require fewer random splits to separate from the rest of the data
  • Scales efficiently to large, high-dimensional datasets because it samples subsets and doesn't compute distances between all points
  • Ensemble approach reduces variance by averaging across many random trees, making results more stable
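
A minimal sketch using scikit-learn's IsolationForest on synthetic high-dimensional data; the tree count and contamination setting are illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
X = rng.normal(0.0, 1.0, size=(1000, 10))                 # "normal" high-dimensional data
X = np.vstack([X, rng.uniform(5.0, 8.0, size=(5, 10))])   # a few scattered anomalies

iso = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
labels = iso.fit_predict(X)       # -1 = anomaly, 1 = normal
scores = iso.score_samples(X)     # lower score = fewer splits needed to isolate = more anomalous
print((labels == -1).sum(), "points flagged")
```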

One-Class SVM

  • Learns a boundary around normal data; maps points to a high-dimensional feature space and finds a hyperplane separating them from the origin
  • Effective without labeled outliers because it only needs examples of normal behavior to learn what "normal" looks like
  • Kernel choice matters significantly; RBF kernels capture complex boundaries but require careful parameter tuning
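
A minimal sketch using scikit-learn's OneClassSVM with an RBF kernel; the nu and gamma settings are illustrative, and standardizing features first is generally advisable.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(6)
X_train = rng.normal(0.0, 1.0, size=(500, 4))             # training uses only "normal" examples
X_test = np.vstack([rng.normal(0.0, 1.0, size=(20, 4)),
                    rng.normal(6.0, 1.0, size=(5, 4))])   # last 5 rows are anomalous

scaler = StandardScaler().fit(X_train)
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")  # nu roughly bounds the fraction flagged
ocsvm.fit(scaler.transform(X_train))

labels = ocsvm.predict(scaler.transform(X_test))  # -1 = outside the learned boundary
print((labels == -1).sum(), "of", len(X_test), "test points flagged")
```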

Autoencoder-Based Detection

  • Uses reconstruction error as the anomaly score: a neural network learns to compress and reconstruct normal data, and outliers reconstruct poorly
  • Captures non-linear relationships that linear methods miss, making it powerful for complex, high-dimensional data
  • Requires substantial normal data for training and careful architecture design; overfitting can cause the model to reconstruct outliers well
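
A rough sketch of the idea using scikit-learn's MLPRegressor trained to reproduce its own input through a narrow bottleneck; a production autoencoder would more likely be built in a dedicated framework such as PyTorch or Keras, and the layer sizes, iteration count, and threshold here are arbitrary.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)
X_train = rng.normal(0.0, 1.0, size=(2000, 20))              # normal data only
X_test = np.vstack([rng.normal(0.0, 1.0, size=(50, 20)),
                    rng.uniform(4.0, 6.0, size=(5, 20))])    # last 5 rows are anomalies

# Train the network to reproduce its own input through a narrow bottleneck layer.
ae = MLPRegressor(hidden_layer_sizes=(10, 3, 10), max_iter=2000, random_state=0)
ae.fit(X_train, X_train)

recon = ae.predict(X_test)
errors = np.mean((X_test - recon) ** 2, axis=1)              # reconstruction error per point
threshold = np.percentile(errors[:50], 99)                   # calibrate cutoff on presumed-normal points
print("flagged indices:", np.where(errors > threshold)[0])
```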

Compare: Isolation Forest vs. Autoencoder. Both handle high-dimensional data without labeled outliers, but Isolation Forest is fast and relatively interpretable, while autoencoders can model complex non-linear patterns at the cost of acting as black boxes. Choose Isolation Forest for explainability, autoencoders for complex data.


Regression Diagnostics

When your goal is building a predictive model, these methods identify points that unduly influence your fitted parameters. The key question: would my conclusions change substantially if I removed this point?

Cook's Distance

  • Measures each point's influence on regression coefficients; combines leverage (unusual x values) and residual size (unusual y values)
  • Rule of thumb: D_i > 1 suggests high influence, though some analysts use D_i > 4/n as a threshold for larger datasets
  • Essential for regression diagnostics because a single influential point can flip the sign of coefficients or dramatically change predictions
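
A minimal sketch using statsmodels to compute Cook's distance for a simple linear regression with one deliberately influential point; the 4/n cutoff is the rule of thumb mentioned above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.uniform(0.0, 10.0, size=50)
y = 2.0 * x + rng.normal(0.0, 1.0, size=50)
x = np.append(x, 15.0)    # unusual x value (high leverage)...
y = np.append(y, 5.0)     # ...paired with a y far off the trend line (large residual)

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()
cooks_d, _ = results.get_influence().cooks_distance   # one distance per observation

flagged = np.where(cooks_d > 4 / len(y))[0]            # the 4/n rule of thumb
print(flagged, cooks_d[flagged].round(2))
```

Note how the same point would barely register if its y value sat on the trend line: influence needs both leverage and a sizable residual.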

Robust Covariance Estimation (Minimum Covariance Determinant)

  • Estimates covariance while downweighting outliers; finds the subset of points that minimizes the determinant of their covariance matrix
  • Solves the circular problem where outliers corrupt the very covariance estimate you need to detect them
  • Foundation for robust Mahalanobis distance; use the MCD-estimated covariance instead of the classical covariance for more reliable multivariate outlier detection
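
A minimal sketch of robust Mahalanobis distances using scikit-learn's MinCovDet; the contaminated synthetic data and the 0.999 chi-square cutoff are illustrative.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(9)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=300)
X = np.vstack([X, rng.multivariate_normal([4.0, -4.0], 0.1 * np.eye(2), size=15)])  # a clump of outliers

mcd = MinCovDet(random_state=0).fit(X)   # covariance estimated from the "cleanest" subset of points
d2 = mcd.mahalanobis(X)                  # squared robust Mahalanobis distances

cutoff = chi2.ppf(0.999, df=X.shape[1])
print((d2 > cutoff).sum(), "points flagged as multivariate outliers")
```

Because the clump of outliers would inflate a classical covariance estimate, the robust distances typically separate it more cleanly than the plain Mahalanobis calculation shown earlier.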

Compare: Cook's distance vs. Mahalanobis distance. Cook's measures influence on model parameters, while Mahalanobis measures distance in data space. A point can have a high Mahalanobis distance but a low Cook's distance if it falls along the regression line.


Quick Reference Table

Concept | Best Methods
Univariate, normal data | Z-score
Univariate, skewed/robust | IQR method
Multivariate with correlations | Mahalanobis distance, robust covariance (MCD)
Varying density clusters | LOF, DBSCAN
High-dimensional data | Isolation Forest, One-Class SVM, Autoencoder
Regression model diagnostics | Cook's distance
No distributional assumptions | IQR, LOF, Isolation Forest
Complex non-linear patterns | Autoencoder-based detection

Self-Check Questions

  1. You have univariate income data that's heavily right-skewed. Why would IQR outperform Z-score for outlier detection, and what specific assumption does Z-score violate?

  2. Compare Mahalanobis distance and Euclidean distance: in what scenario would they identify different points as outliers, and why does correlation matter?

  3. Both LOF and DBSCAN use density to detect outliers. What's the key difference in their output, and when would you prefer one over the other?

  4. A regression model has a point with high leverage but a small residual. Would Cook's distance flag this point? Explain the relationship between leverage, residuals, and influence.

  5. You're choosing between Isolation Forest and an autoencoder for detecting anomalies in high-dimensional sensor data. What are two advantages of Isolation Forest and one scenario where the autoencoder might perform better?