
Statistical Prediction

Outlier Detection Techniques


Why This Matters

Outlier detection sits at the intersection of data preprocessing, model robustness, and anomaly detection—three areas that appear repeatedly on exams. Whether you're cleaning data before fitting a regression, identifying fraud in transaction data, or spotting sensor malfunctions, the technique you choose depends on your data's structure: Is it univariate or multivariate? Does it follow a normal distribution? Is it high-dimensional? These questions determine which method will actually work.

You're being tested on your ability to match the right technique to the right data scenario. A Z-score is elegant for normally distributed univariate data, but it falls apart with correlated features or non-Gaussian distributions. Density-based methods shine with clustered data but struggle in high dimensions. Don't just memorize the formulas—know what assumptions each method makes and when those assumptions break down.


Statistical Distance Methods

These techniques define outliers based on how far a point lies from some central tendency, using statistical measures of spread. The key assumption is that "normal" data clusters around a center, and outliers deviate significantly from it.

Z-Score Method

  • Measures standard deviations from the mean—a point with $|Z| > 3$ is typically flagged as an outlier
  • Assumes normal distribution, which limits applicability to symmetric, bell-curved data
  • Formula: $Z = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation (a minimal code sketch follows this list)
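
A minimal sketch of the Z-score rule, assuming NumPy; the simulated data and the injected extreme value are purely illustrative, and the threshold of 3 is the conventional cutoff rather than a fixed rule:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 roughly normal observations plus one injected extreme value
x = np.concatenate([rng.normal(loc=50, scale=5, size=200), [95.0]])

z = (x - x.mean()) / x.std()   # standard deviations from the mean
outliers = x[np.abs(z) > 3]    # conventional |Z| > 3 cutoff

print(outliers)                # the injected 95.0 should be flagged
```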

Interquartile Range (IQR) Method

  • Uses quartiles instead of mean/standard deviation—outliers fall below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$
  • Robust to skewed distributions because quartiles aren't affected by extreme values the way means are
  • Non-parametric approach makes it ideal when you can't assume normality (see the sketch below)
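
A quick sketch of the IQR fences, again with NumPy; the sample is illustrative and the 1.5 multiplier is the standard Tukey choice:

```python
import numpy as np

x = np.array([3, 4, 5, 5, 6, 6, 7, 8, 9, 40])   # skewed sample with one extreme value

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # Tukey fences

print(x[(x < lower) | (x > upper)])              # 40 falls outside the upper fence
```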

Mahalanobis Distance

  • Accounts for correlations between variables—measures distance from the multivariate mean using the covariance matrix
  • Formula: $D_M = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$, where $\Sigma$ is the covariance matrix
  • Essential for multivariate outlier detection when features are correlated (Z-scores applied independently would miss such points); a worked sketch follows this list
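
A sketch of the Mahalanobis calculation for correlated 2-D data, using NumPy with a chi-square cutoff from SciPy; the simulated data, the injected point, and the 97.5% quantile are all illustrative choices:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
cov_true = np.array([[1.0, 0.9], [0.9, 1.0]])              # strongly correlated features
X = rng.multivariate_normal(mean=[0, 0], cov=cov_true, size=500)
X = np.vstack([X, [[2.0, -2.0]]])                          # off the correlation axis, yet modest per-feature Z-scores

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)         # squared Mahalanobis distances

cutoff = chi2.ppf(0.975, df=X.shape[1])                    # chi-square cutoff for 2 features
print(X[d2 > cutoff])                                      # the injected point should appear (alongside a few tail points)
```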

Compare: Z-score vs. Mahalanobis Distance—both measure distance from center, but Z-score treats each variable independently while Mahalanobis accounts for covariance structure. If an FRQ gives you correlated features, Mahalanobis is your answer.


Density-Based Methods

These approaches identify outliers as points in regions of unusually low density. The intuition: normal points have many neighbors, while outliers are isolated.

Local Outlier Factor (LOF)

  • Compares local density to neighbors' density—a point is an outlier if its neighborhood is sparser than its neighbors' neighborhoods
  • Handles clusters of varying density because it uses local rather than global density measures
  • Outputs a score where values significantly greater than 1 indicate outliers (see the code sketch after this list)
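
A minimal LOF sketch, assuming scikit-learn; note that sklearn stores a negated score in negative_outlier_factor_, so flipping the sign recovers the "values well above 1 are outliers" scale:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
dense = rng.normal(0, 0.3, size=(100, 2))        # tight cluster
sparse = rng.normal(5, 1.5, size=(100, 2))       # looser cluster
X = np.vstack([dense, sparse, [[2.5, 2.5]]])     # one isolated point between the clusters

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                      # -1 for outliers, 1 for inliers
scores = -lof.negative_outlier_factor_           # back to the usual LOF scale

print(X[labels == -1])                           # the isolated point should be among those flagged
print(scores[-1])                                # its LOF score should sit well above 1
```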

DBSCAN

  • Clustering algorithm that labels noise points as outliers—points not belonging to any cluster are anomalies
  • Requires two parameters: $\epsilon$ (neighborhood radius) and minPts (minimum points to form a dense region)
  • Discovers clusters of arbitrary shape (unlike K-means), making it powerful for complex data geometries; see the sketch below
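
A DBSCAN sketch under the same scikit-learn assumption; eps and min_samples below are illustrative and normally need tuning (a k-distance plot is a common way to pick eps):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
angles = rng.uniform(0, 2 * np.pi, 400)
ring = 5 * np.c_[np.cos(angles), np.sin(angles)] + rng.normal(0, 0.2, (400, 2))
blob = rng.normal(0, 0.4, size=(150, 2))
X = np.vstack([ring, blob, [[10.0, 10.0]]])      # a ring, a blob, and one far-away point

db = DBSCAN(eps=0.8, min_samples=5).fit(X)
print(X[db.labels_ == -1])                       # label -1 marks noise; the far point should land here
print(set(db.labels_))                           # the non-convex ring and the blob should each form one cluster
```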

Compare: LOF vs. DBSCAN—both use local density, but LOF produces a continuous outlier score while DBSCAN gives a binary classification (in cluster or noise). Use LOF when you need to rank anomalies by severity.


Tree-Based and Ensemble Methods

These techniques use random partitioning to isolate anomalies. The core insight: outliers are easier to separate from the bulk of data, requiring fewer random splits.

Isolation Forest

  • Isolates points using random recursive partitions—outliers require fewer splits to isolate
  • Scales efficiently to high-dimensional data with $O(n \log n)$ complexity
  • No distance calculations required, avoiding the curse of dimensionality that plagues density methods (see the example below)
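
A short Isolation Forest sketch, again assuming scikit-learn; contamination=0.01 is an illustrative guess at the outlier fraction:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
X = rng.normal(0, 1, size=(1000, 10))             # 10-dimensional "normal" bulk
X = np.vstack([X, np.full((5, 10), 6.0)])         # five points far outside the bulk

iso = IsolationForest(n_estimators=200, contamination=0.01, random_state=0).fit(X)
labels = iso.predict(X)                           # -1 = outlier, 1 = inlier
scores = iso.decision_function(X)                 # lower score = fewer random splits needed to isolate

print(np.where(labels == -1)[0])                  # indices 1000-1004 should be among those flagged
```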

Robust Random Cut Forest

  • Ensemble of random cut trees that assigns anomaly scores based on how partitioning changes when a point is removed
  • Handles streaming data and can update incrementally as new observations arrive
  • Robust to noise in the training data, unlike methods that assume clean training sets (a streaming sketch follows)
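
A streaming-style sketch using the open-source rrcf package (an assumption on my part; other implementations exist, and the API shown (RCTree, insert_point, forget_point, codisp) follows that package's documented usage). A single tree is used for brevity; the full method averages the CoDisp score over a forest:

```python
import numpy as np
import rrcf   # assumed third-party package: pip install rrcf

rng = np.random.default_rng(5)
stream = rng.normal(0, 1, size=(500, 2))
stream[250] = [8.0, 8.0]                 # one anomalous reading in the stream

window = 64                              # sliding window of recent points
tree = rrcf.RCTree()
scores = {}
for i, point in enumerate(stream):
    if len(tree.leaves) > window:        # once the window is full, forget the oldest point
        tree.forget_point(i - window - 1)
    tree.insert_point(point, index=i)
    scores[i] = tree.codisp(i)           # collusive displacement = anomaly score

print(max(scores, key=scores.get))       # index 250 should score highest (or close to it)
```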

Compare: Isolation Forest vs. Robust Random Cut Forest—both use tree-based isolation, but RRCF is designed for streaming applications and provides displacement-based scoring. Choose Isolation Forest for batch processing, RRCF for real-time anomaly detection.


Model-Based Methods

These techniques fit a model to "normal" data and flag points that don't conform. The assumption: you can characterize what normal looks like, and deviations from that model are outliers.

One-Class SVM

  • Learns a boundary around normal data using only positive (normal) examples—no labeled outliers needed
  • Uses kernel trick to handle non-linear boundaries in high-dimensional feature spaces
  • Hyperparameter $\nu$ controls the fraction of points allowed outside the boundary (illustrated in the sketch below)
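
A One-Class SVM sketch, assuming scikit-learn; the RBF kernel and nu=0.05 are illustrative choices:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(6)
X_train = rng.normal(0, 1, size=(500, 2))        # only "normal" examples available
X_new = np.array([[0.2, -0.1], [4.5, 4.5]])      # one typical point, one clear anomaly

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
print(ocsvm.predict(X_new))                      # expected roughly [ 1, -1 ]
```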

Elliptic Envelope

  • Fits a multivariate Gaussian to the data and defines outliers based on the fitted ellipse
  • Assumes data is normally distributed—performance degrades with skewed or multimodal data
  • Provides probabilistic interpretation through Mahalanobis distance from the fitted distribution (see the sketch below)
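
An Elliptic Envelope sketch under the same scikit-learn assumption; contamination=0.02 is an illustrative prior on the outlier fraction:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(7)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal([0, 0], cov, size=500)
X = np.vstack([X, [[3.0, -3.0]]])                # violates the correlation structure

ee = EllipticEnvelope(contamination=0.02, random_state=0).fit(X)
labels = ee.predict(X)                           # -1 = outlier, 1 = inlier
d2 = ee.mahalanobis(X)                           # squared Mahalanobis distances to the fitted Gaussian

print(labels[-1], d2[-1])                        # the injected point should be flagged with a large distance
```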

Compare: One-Class SVM vs. Elliptic Envelope—both learn a boundary around normal data, but Elliptic Envelope assumes Gaussian distribution while One-Class SVM can learn arbitrary shapes via kernels. Use Elliptic Envelope when normality holds; use One-Class SVM otherwise.


Regression-Specific Methods

When your goal is prediction rather than general anomaly detection, these techniques identify points that disproportionately influence your model.

Cook's Distance

  • Measures each point's influence on regression coefficients—quantifies how much fitted values change when a point is removed
  • Rule of thumb: points with Cook's Distance $> 1$ (or $> 4/n$) warrant investigation
  • Combines leverage and residual information: $D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_{ii}}{(1-h_{ii})^2}$ (a from-scratch sketch follows)
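
A from-scratch sketch that follows the formula above using NumPy (statsmodels exposes the same quantity through its regression influence diagnostics); the simulated data and the injected high-leverage point are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)
x[0], y[0] = 25.0, 0.0                       # high leverage and a large residual

X = np.column_stack([np.ones(n), x])         # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                             # residuals
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix; leverages h_ii on the diagonal
h = np.diag(H)
p = X.shape[1]                               # number of fitted parameters
mse = e @ e / (n - p)

cooks_d = (e**2 / (p * mse)) * h / (1 - h)**2
print(np.argmax(cooks_d), cooks_d.max())     # point 0 should dominate; compare against 1 or 4/n
```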

Compare: Cook's Distance vs. Z-score on residuals—Z-score only considers how far a point's residual is from zero, while Cook's Distance also accounts for leverage (how extreme the predictor values are). High-leverage points with moderate residuals can still be highly influential.


Quick Reference Table

Concept | Best Examples
Univariate, normal data | Z-score, IQR
Multivariate with correlations | Mahalanobis Distance, Elliptic Envelope
Varying density clusters | LOF, DBSCAN
High-dimensional data | Isolation Forest, One-Class SVM
No distributional assumptions | IQR, Isolation Forest, DBSCAN
Streaming/real-time data | Robust Random Cut Forest
Regression influence | Cook's Distance
Only normal examples available | One-Class SVM, Elliptic Envelope

Self-Check Questions

  1. You have a dataset with two highly correlated features. Why would applying Z-scores to each feature independently fail to detect certain outliers, and which method should you use instead?

  2. Compare LOF and Isolation Forest: both detect outliers, but they use fundamentally different approaches. What is the key conceptual difference, and in what scenario would each excel?

  3. A colleague suggests using Elliptic Envelope for outlier detection on customer purchase data that shows a long right tail. What's wrong with this choice, and what would you recommend?

  4. If an FRQ asks you to identify influential points in a linear regression, which technique would you use and why? How does it differ from simply flagging points with large residuals?

  5. You're building a fraud detection system where you only have examples of legitimate transactions (no labeled fraud cases). Which two methods from this guide are designed for exactly this scenario, and what assumption distinguishes them?