
Statistical Prediction

Outlier Detection Techniques


Why This Matters

Outlier detection sits at the intersection of data preprocessing, model robustness, and anomaly detection—three areas that appear repeatedly on exams. Whether you're cleaning data before fitting a regression, identifying fraud in transaction data, or spotting sensor malfunctions, the technique you choose depends on your data's structure: Is it univariate or multivariate? Does it follow a normal distribution? Is it high-dimensional? These questions determine which method will actually work.

You're being tested on your ability to match the right technique to the right data scenario. A Z-score is elegant for normally distributed univariate data, but it falls apart with correlated features or non-Gaussian distributions. Density-based methods shine with clustered data but struggle in high dimensions. Don't just memorize the formulas—know what assumptions each method makes and when those assumptions break down.


Statistical Distance Methods

These techniques define outliers based on how far a point lies from some central tendency, using statistical measures of spread. The key assumption is that "normal" data clusters around a center, and outliers deviate significantly from it.

Z-Score Method

  • Measures standard deviations from the mean—a point with $|Z| > 3$ is typically flagged as an outlier
  • Assumes normal distribution, which limits applicability to symmetric, bell-curved data
  • Formula: $Z = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation (a minimal code sketch follows this list)
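
A minimal sketch of the Z-score rule, assuming NumPy; the simulated data and the injected extreme value are purely illustrative, and the threshold of 3 is the conventional cutoff rather than a fixed rule:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 roughly normal observations plus one injected extreme value
x = np.concatenate([rng.normal(loc=50, scale=5, size=200), [95.0]])

z = (x - x.mean()) / x.std()   # standard deviations from the mean
outliers = x[np.abs(z) > 3]    # conventional |Z| > 3 cutoff

print(outliers)                # the injected 95.0 should be flagged
```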

Interquartile Range (IQR) Method

  • Uses quartiles instead of mean/standard deviation—outliers fall below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$
  • Robust to skewed distributions because quartiles aren't affected by extreme values the way means are
  • Non-parametric approach makes it ideal when you can't assume normality (see the sketch below)
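
A quick sketch of the IQR fences, again with NumPy; the sample is illustrative and the 1.5 multiplier is the standard Tukey choice:

```python
import numpy as np

x = np.array([3, 4, 5, 5, 6, 6, 7, 8, 9, 40])   # skewed sample with one extreme value

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # Tukey fences

print(x[(x < lower) | (x > upper)])              # 40 falls outside the upper fence
```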

Mahalanobis Distance

  • Accounts for correlations between variables—measures distance from the multivariate mean using the covariance matrix
  • Formula: $D_M = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$, where $\Sigma$ is the covariance matrix
  • Essential for multivariate outlier detection when features are correlated (Z-scores applied independently would miss such points); a worked sketch follows this list
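
A sketch of the Mahalanobis calculation for correlated 2-D data, using NumPy with a chi-square cutoff from SciPy; the simulated data, the injected point, and the 97.5% quantile are all illustrative choices:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
cov_true = np.array([[1.0, 0.9], [0.9, 1.0]])              # strongly correlated features
X = rng.multivariate_normal(mean=[0, 0], cov=cov_true, size=500)
X = np.vstack([X, [[2.0, -2.0]]])                          # off the correlation axis, yet modest per-feature Z-scores

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)         # squared Mahalanobis distances

cutoff = chi2.ppf(0.975, df=X.shape[1])                    # chi-square cutoff for 2 features
print(X[d2 > cutoff])                                      # the injected point should appear (alongside a few tail points)
```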

Compare: Z-score vs. Mahalanobis Distance—both measure distance from center, but Z-score treats each variable independently while Mahalanobis accounts for covariance structure. If an FRQ gives you correlated features, Mahalanobis is your answer.


Density-Based Methods

These approaches identify outliers as points in regions of unusually low density. The intuition: normal points have many neighbors, while outliers are isolated.

Local Outlier Factor (LOF)

  • Compares local density to neighbors' density—a point is an outlier if its neighborhood is sparser than its neighbors' neighborhoods
  • Handles clusters of varying density because it uses local rather than global density measures
  • Outputs a score where values significantly greater than 1 indicate outliers (see the code sketch after this list)
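
A minimal LOF sketch, assuming scikit-learn; note that sklearn stores a negated score in negative_outlier_factor_, so flipping the sign recovers the "values well above 1 are outliers" scale:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
dense = rng.normal(0, 0.3, size=(100, 2))        # tight cluster
sparse = rng.normal(5, 1.5, size=(100, 2))       # looser cluster
X = np.vstack([dense, sparse, [[2.5, 2.5]]])     # one isolated point between the clusters

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                      # -1 for outliers, 1 for inliers
scores = -lof.negative_outlier_factor_           # back to the usual LOF scale

print(X[labels == -1])                           # the isolated point should be among those flagged
print(scores[-1])                                # its LOF score should sit well above 1
```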

DBSCAN

  • Clustering algorithm that labels noise points as outliers—points not belonging to any cluster are anomalies
  • Requires two parameters: $\epsilon$ (neighborhood radius) and minPts (minimum points to form a dense region)
  • Discovers clusters of arbitrary shape (unlike K-means), making it powerful for complex data geometries; see the sketch below
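
A DBSCAN sketch under the same scikit-learn assumption; eps and min_samples below are illustrative and normally need tuning (a k-distance plot is a common way to pick eps):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
angles = rng.uniform(0, 2 * np.pi, 400)
ring = 5 * np.c_[np.cos(angles), np.sin(angles)] + rng.normal(0, 0.2, (400, 2))
blob = rng.normal(0, 0.4, size=(150, 2))
X = np.vstack([ring, blob, [[10.0, 10.0]]])      # a ring, a blob, and one far-away point

db = DBSCAN(eps=0.8, min_samples=5).fit(X)
print(X[db.labels_ == -1])                       # label -1 marks noise; the far point should land here
print(set(db.labels_))                           # the non-convex ring and the blob should each form one cluster
```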

Compare: LOF vs. DBSCAN—both use local density, but LOF produces a continuous outlier score while DBSCAN gives a binary classification (in cluster or noise). Use LOF when you need to rank anomalies by severity.


Tree-Based and Ensemble Methods

These techniques use random partitioning to isolate anomalies. The core insight: outliers are easier to separate from the bulk of data, requiring fewer random splits.

Isolation Forest

  • Isolates points using random recursive partitions—outliers require fewer splits to isolate
  • Scales efficiently to high-dimensional data with $O(n \log n)$ complexity
  • No distance calculations required, avoiding the curse of dimensionality that plagues density methods (see the example below)
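
A short Isolation Forest sketch, again assuming scikit-learn; contamination=0.01 is an illustrative guess at the outlier fraction:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
X = rng.normal(0, 1, size=(1000, 10))             # 10-dimensional "normal" bulk
X = np.vstack([X, np.full((5, 10), 6.0)])         # five points far outside the bulk

iso = IsolationForest(n_estimators=200, contamination=0.01, random_state=0).fit(X)
labels = iso.predict(X)                           # -1 = outlier, 1 = inlier
scores = iso.decision_function(X)                 # lower score = fewer random splits needed to isolate

print(np.where(labels == -1)[0])                  # indices 1000-1004 should be among those flagged
```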

Robust Random Cut Forest

  • Ensemble of random cut trees that assigns anomaly scores based on how partitioning changes when a point is removed
  • Handles streaming data and can update incrementally as new observations arrive
  • Robust to noise in the training data, unlike methods that assume clean training sets (a streaming sketch follows)
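
A streaming-style sketch using the open-source rrcf package (an assumption on my part; other implementations exist, and the API shown (RCTree, insert_point, forget_point, codisp) follows that package's documented usage). A single tree is used for brevity; the full method averages the CoDisp score over a forest:

```python
import numpy as np
import rrcf   # assumed third-party package: pip install rrcf

rng = np.random.default_rng(5)
stream = rng.normal(0, 1, size=(500, 2))
stream[250] = [8.0, 8.0]                 # one anomalous reading in the stream

window = 64                              # sliding window of recent points
tree = rrcf.RCTree()
scores = {}
for i, point in enumerate(stream):
    if len(tree.leaves) > window:        # once the window is full, forget the oldest point
        tree.forget_point(i - window - 1)
    tree.insert_point(point, index=i)
    scores[i] = tree.codisp(i)           # collusive displacement = anomaly score

print(max(scores, key=scores.get))       # index 250 should score highest (or close to it)
```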

Compare: Isolation Forest vs. Robust Random Cut Forest—both use tree-based isolation, but RRCF is designed for streaming applications and provides displacement-based scoring. Choose Isolation Forest for batch processing, RRCF for real-time anomaly detection.


Model-Based Methods

These techniques fit a model to "normal" data and flag points that don't conform. The assumption: you can characterize what normal looks like, and deviations from that model are outliers.

One-Class SVM

  • Learns a boundary around normal data using only positive (normal) examples—no labeled outliers needed
  • Uses kernel trick to handle non-linear boundaries in high-dimensional feature spaces
  • Hyperparameter $\nu$ controls the fraction of points allowed outside the boundary (illustrated in the sketch below)
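
A One-Class SVM sketch, assuming scikit-learn; the RBF kernel and nu=0.05 are illustrative choices:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(6)
X_train = rng.normal(0, 1, size=(500, 2))        # only "normal" examples available
X_new = np.array([[0.2, -0.1], [4.5, 4.5]])      # one typical point, one clear anomaly

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
print(ocsvm.predict(X_new))                      # expected roughly [ 1, -1 ]
```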

Elliptic Envelope

  • Fits a multivariate Gaussian to the data and defines outliers based on the fitted ellipse
  • Assumes data is normally distributed—performance degrades with skewed or multimodal data
  • Provides probabilistic interpretation through Mahalanobis distance from the fitted distribution (see the sketch below)
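
An Elliptic Envelope sketch under the same scikit-learn assumption; contamination=0.02 is an illustrative prior on the outlier fraction:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(7)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal([0, 0], cov, size=500)
X = np.vstack([X, [[3.0, -3.0]]])                # violates the correlation structure

ee = EllipticEnvelope(contamination=0.02, random_state=0).fit(X)
labels = ee.predict(X)                           # -1 = outlier, 1 = inlier
d2 = ee.mahalanobis(X)                           # squared Mahalanobis distances to the fitted Gaussian

print(labels[-1], d2[-1])                        # the injected point should be flagged with a large distance
```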

Compare: One-Class SVM vs. Elliptic Envelope—both learn a boundary around normal data, but Elliptic Envelope assumes Gaussian distribution while One-Class SVM can learn arbitrary shapes via kernels. Use Elliptic Envelope when normality holds; use One-Class SVM otherwise.


Regression-Specific Methods

When your goal is prediction rather than general anomaly detection, these techniques identify points that disproportionately influence your model.

Cook's Distance

  • Measures each point's influence on regression coefficients—quantifies how much fitted values change when a point is removed
  • Rule of thumb: points with Cook's Distance $> 1$ (or $> 4/n$) warrant investigation
  • Combines leverage and residual information: $D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_{ii}}{(1-h_{ii})^2}$ (a from-scratch sketch follows)
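
A from-scratch sketch that follows the formula above using NumPy (statsmodels exposes the same quantity through its regression influence diagnostics); the simulated data and the injected high-leverage point are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)
x[0], y[0] = 25.0, 0.0                       # high leverage and a large residual

X = np.column_stack([np.ones(n), x])         # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                             # residuals
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix; leverages h_ii on the diagonal
h = np.diag(H)
p = X.shape[1]                               # number of fitted parameters
mse = e @ e / (n - p)

cooks_d = (e**2 / (p * mse)) * h / (1 - h)**2
print(np.argmax(cooks_d), cooks_d.max())     # point 0 should dominate; compare against 1 or 4/n
```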

Compare: Cook's Distance vs. Z-score on residuals—Z-score only considers how far a point's residual is from zero, while Cook's Distance also accounts for leverage (how extreme the predictor values are). High-leverage points with moderate residuals can still be highly influential.


Quick Reference Table

Concept | Best Examples
Univariate, normal data | Z-score, IQR
Multivariate with correlations | Mahalanobis Distance, Elliptic Envelope
Varying density clusters | LOF, DBSCAN
High-dimensional data | Isolation Forest, One-Class SVM
No distributional assumptions | IQR, Isolation Forest, DBSCAN
Streaming/real-time data | Robust Random Cut Forest
Regression influence | Cook's Distance
Only normal examples available | One-Class SVM, Elliptic Envelope

Self-Check Questions

  1. You have a dataset with two highly correlated features. Why would applying Z-scores to each feature independently fail to detect certain outliers, and which method should you use instead?

  2. Compare LOF and Isolation Forest: both detect outliers, but they use fundamentally different approaches. What is the key conceptual difference, and in what scenario would each excel?

  3. A colleague suggests using Elliptic Envelope for outlier detection on customer purchase data that shows a long right tail. What's wrong with this choice, and what would you recommend?

  4. If an FRQ asks you to identify influential points in a linear regression, which technique would you use and why? How does it differ from simply flagging points with large residuals?

  5. You're building a fraud detection system where you only have examples of legitimate transactions (no labeled fraud cases). Which two methods from this guide are designed for exactly this scenario, and what assumption distinguishes them?