Outlier detection sits at the intersection of numerical analysis, statistical inference, and machine learning—three pillars you'll be tested on repeatedly. These algorithms aren't just data cleaning tools; they represent fundamentally different assumptions about what makes data "normal" and how anomalies reveal themselves mathematically. When you encounter an outlier detection question, you're really being asked to demonstrate your understanding of distributional assumptions, distance metrics, density estimation, and algorithmic complexity.
Don't just memorize which threshold makes something an outlier. Instead, focus on why each method works: What mathematical property does it exploit? What assumptions must hold for it to be valid? When does it fail? The strongest exam responses connect specific algorithms to their underlying numerical principles—whether that's covariance structure, local density estimation, or recursive partitioning. Master the "why," and the "what" becomes easy.
Statistical Distribution-Based Methods
These methods assume your data follows a known distribution and flag points that fall in the extreme tails. The core principle: if we know what "normal" looks like mathematically, outliers are points with very low probability under that model.
Z-Score Method
Measures standard deviations from the mean—calculated as z = (x − μ) / σ, where values beyond |z| > 3 typically indicate outliers
Assumes normal distribution—this is both its strength (simple, interpretable) and its weakness (fails badly with skewed or multimodal data)
Sensitive to the outliers themselves—since μ and σ are computed from all data, extreme values inflate these statistics and can mask other outliers
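A minimal sketch of the z-score rule described above, assuming NumPy; the ±3 cutoff, the toy data, and the helper name are illustrative choices, not a fixed recipe.

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()      # z = (x - mu) / sigma
    return np.abs(z) > threshold      # boolean mask of flagged points

rng = np.random.default_rng(0)
data = np.append(rng.normal(10, 0.5, size=100), 50.0)   # 100 typical values plus one extreme value
print(np.flatnonzero(zscore_outliers(data)))            # the extreme value (index 100) is flagged
```

Note that the extreme value inflates both the mean and the standard deviation it is judged against, so a less dramatic anomaly could slip under the cutoff; that is exactly the masking issue mentioned above.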
Interquartile Range (IQR) Method
Uses quartiles instead of mean/standard deviation—outliers fall below Q1−1.5×IQR or above Q3+1.5×IQR, where IQR=Q3−Q1
Robust to distributional assumptions—works on skewed data because quartiles aren't pulled by extreme values the way means are
The 1.5 multiplier is conventional, not sacred—some applications use 3.0 for "extreme" outliers; know that this threshold is adjustable
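A corresponding sketch of the IQR fences, again assuming NumPy; the default multiplier of 1.5, the sample data, and the function name are illustrative.

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag points below Q1 - k*IQR or above Q3 + k*IQR (k=3.0 for 'extreme' outliers)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

data = np.array([1.2, 3.4, 2.8, 2.1, 3.0, 2.6, 1.9, 2.4, 15.0])   # skewed by one large value
print(np.flatnonzero(iqr_outliers(data)))                         # only the 15.0 should be flagged
```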
Mahalanobis Distance
Accounts for covariance structure—computed as D_M = √((x − μ)ᵀ Σ⁻¹ (x − μ)), where μ is the mean vector and Σ is the covariance matrix
Detects multivariate outliers that univariate methods miss—a point can be normal in each dimension separately but anomalous when correlations are considered
Requires invertible covariance matrix—fails when features are collinear or when n<p (more variables than observations)
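A plain NumPy sketch of the distance computation, under the assumption that the covariance matrix is invertible; the function name and the simulated data are made up for this example.

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of each row of X from the sample mean, accounting for covariance."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)        # must be invertible: fails when features are collinear or n < p
    cov_inv = np.linalg.inv(cov)
    diff = X - mu
    # row-wise quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return np.sqrt(d2)

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=200)
d = mahalanobis_distances(X)
```

Under a Gaussian assumption, the squared distances follow a chi-squared distribution with p degrees of freedom, which gives a principled cutoff (for example, the 97.5th percentile of that distribution).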
Elliptic Envelope
Fits a minimum covariance determinant ellipse—assumes data follows a Gaussian distribution and identifies points outside the fitted ellipsoid
Robust covariance estimation—uses a subset of "clean" points to estimate Σ, reducing the influence of outliers on the fit itself
Provides probabilistic interpretation—outputs can be converted to p-values, useful when you need statistical significance, not just classification
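A short sketch using scikit-learn's EllipticEnvelope; the contamination rate and the synthetic correlated data are assumptions standing in for a real problem.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=300)

# contamination is the assumed fraction of outliers in the data
env = EllipticEnvelope(contamination=0.02, random_state=0).fit(X)
labels = env.predict(X)         # +1 for points inside the fitted ellipsoid, -1 for points outside
scores = env.mahalanobis(X)     # robust squared Mahalanobis distances under the fitted model
```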
Compare: Z-Score vs. Mahalanobis Distance—both measure "distance from center," but Z-Score treats each variable independently while Mahalanobis accounts for correlations. If an FRQ gives you multivariate data with correlated features, Mahalanobis is your answer.
Density-Based Methods
These methods define outliers as points in sparse regions of the feature space. The key insight: outliers live where other points don't.
Local Outlier Factor (LOF)
Compares local density to neighbors' densities—computes the ratio of the neighbors' local reachability density to the point's own; LOF ≈ 1 means similar density to neighbors, LOF >> 1 indicates the point is in a sparser region
Handles clusters of varying density—unlike global methods, LOF can identify outliers within dense clusters that would look normal globally
Requires choosing k (number of neighbors)—results are sensitive to this parameter; too small captures noise, too large misses local structure
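A sketch with scikit-learn's LocalOutlierFactor; the choice of n_neighbors=20 and the two-cluster synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),    # dense cluster
               rng.normal(8, 3, (100, 2)),    # sparser cluster
               [[0.0, 6.0]]])                 # a point near the dense cluster but locally isolated

lof = LocalOutlierFactor(n_neighbors=20)      # n_neighbors is the k discussed above
labels = lof.fit_predict(X)                   # +1 inlier, -1 outlier
scores = -lof.negative_outlier_factor_        # ~1 means typical density; >>1 means much sparser than neighbors
```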
DBSCAN
Clusters dense regions, labels sparse points as noise—requires two parameters: ϵ (neighborhood radius) and minPts (minimum points to form a core)
No assumption about cluster shape—can find arbitrarily shaped clusters, unlike k-means, making it versatile for real-world spatial data
Outliers are a byproduct, not the focus—points that don't belong to any cluster are labeled as noise; useful when you want clustering and outlier detection simultaneously
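A sketch with scikit-learn's DBSCAN; the eps and min_samples values below are illustrative and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)),
               rng.normal(5, 0.5, (100, 2)),
               [[10.0, 10.0]]])               # an isolated point far from both clusters

db = DBSCAN(eps=0.7, min_samples=5).fit(X)    # eps (neighborhood radius) and min_samples (minPts)
noise = db.labels_ == -1                      # points labeled -1 belong to no cluster (the "outliers")
cluster_ids = set(db.labels_) - {-1}          # the cluster assignments come along for free
```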
Compare: LOF vs. DBSCAN—both use local density, but LOF produces a continuous outlier score while DBSCAN produces a binary label. Use LOF when you need to rank anomalies by severity; use DBSCAN when you also need cluster assignments.
Tree-Based and Ensemble Methods
These algorithms exploit the principle that outliers are easier to isolate through random partitioning. The intuition: normal points require many splits to separate; outliers get isolated quickly.
Isolation Forest
Isolates points via random recursive partitioning—builds trees by randomly selecting features and split values; outliers have shorter average path lengths
Computational complexity is O(n log n)—scales efficiently to large datasets, unlike distance-based methods that often require O(n²) comparisons
Excels in high-dimensional spaces—random feature selection naturally handles the curse of dimensionality without explicit dimension reduction
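A sketch with scikit-learn's IsolationForest; the number of trees and the synthetic 10-dimensional data are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 10)),    # bulk of the data in 10 dimensions
               rng.uniform(-8, 8, (10, 10))])  # a handful of scattered anomalies

iso = IsolationForest(n_estimators=200, random_state=0).fit(X)
labels = iso.predict(X)           # +1 inlier, -1 outlier
scores = -iso.score_samples(X)    # higher = shorter average path length = easier to isolate
```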
Robust Random Cut Forest
Designed for streaming data—can update the model incrementally as new points arrive, unlike Isolation Forest, which requires batch retraining
Uses displacement-based scoring—measures how much a point's removal changes the tree structure, providing a principled anomaly score
Handles concept drift—the forest adapts over time, making it suitable for applications where "normal" evolves (e.g., network traffic, sensor data)
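A rough streaming sketch assuming the third-party rrcf package (pip install rrcf); the number of trees, the sliding-window size, and the averaging of CoDisp scores are illustrative choices, not a canonical recipe.

```python
import numpy as np
import rrcf  # third-party robust random cut forest implementation

num_trees, window = 40, 256
forest = [rrcf.RCTree() for _ in range(num_trees)]

def score_point(index, point):
    """Insert one streaming point into every tree and return its average CoDisp score."""
    scores = []
    for tree in forest:
        if len(tree.leaves) > window:          # sliding window: forget the oldest point (handles drift)
            tree.forget_point(index - window)
        tree.insert_point(point, index=index)
        scores.append(tree.codisp(index))      # displacement-based anomaly score for this point
    return float(np.mean(scores))

rng = np.random.default_rng(0)
stream = rng.normal(0, 1, (400, 2))
stream[200] = [8.0, 8.0]                       # inject one anomaly mid-stream
codisp = [score_point(i, p) for i, p in enumerate(stream)]
# codisp[200] should stand out well above the surrounding scores
```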
Compare: Isolation Forest vs. Robust Random Cut Forest—both use tree ensembles, but Isolation Forest is batch-oriented while RRCF handles streaming data. If an exam question mentions "real-time" or "online" detection, RRCF is the appropriate choice.
Model-Based Methods
These approaches learn a boundary or influence measure from the data, treating outlier detection as a modeling problem. The framework: fit a model to normal behavior, then flag points that don't conform.
One-Class SVM
Learns a decision boundary around normal data—maps data to a high-dimensional space via the kernel trick, then finds the hyperplane that separates the data from the origin with maximum margin
Handles non-linear boundaries—with RBF or polynomial kernels, can capture complex "normal" regions that linear methods miss
Sensitive to hyperparameters—the ν parameter controls the fraction of outliers expected; poor tuning leads to over- or under-detection
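A sketch with scikit-learn's OneClassSVM; the RBF kernel, ν = 0.05, and the synthetic training data are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (300, 2))             # assumed to represent "normal" behavior only

# nu bounds the fraction of training points treated as outliers; gamma shapes the RBF boundary
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_new = np.array([[0.1, -0.2], [6.0, 6.0]])
print(ocsvm.predict(X_new))                      # +1 inside the learned boundary, -1 outside
```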
Cook's Distance
Measures influence on regression coefficients—computed as D_i = (ŷ − ŷ(i))ᵀ(ŷ − ŷ(i)) / (p · MSE), where ŷ(i) is the vector of predictions refit without point i and p is the number of model parameters
Threshold of 4/n is common convention—points exceeding this disproportionately affect the fitted model and warrant investigation
Specific to regression contexts—unlike general-purpose methods, Cook's Distance identifies influential points that distort model parameters, not just unusual values
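A sketch using statsmodels' OLS influence diagnostics; the simulated regression and the injected high-influence point are assumptions, and the 4/n cutoff is the convention mentioned above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + rng.normal(0, 1, 50)
y[10] += 25.0                                    # one point that will drag the fitted line

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()
cooks_d, _ = results.get_influence().cooks_distance   # one distance per observation

threshold = 4 / len(y)                           # the conventional 4/n cutoff
print(np.flatnonzero(cooks_d > threshold))       # index 10 should appear among the flagged points
```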
Compare: One-Class SVM vs. Elliptic Envelope—both learn boundaries around normal data, but Elliptic Envelope assumes Gaussian distributions while One-Class SVM can learn arbitrary shapes via kernels. Use Elliptic Envelope when normality is reasonable; use One-Class SVM for complex, non-linear boundaries.
Which two methods both rely on measuring distance from a center point, and what key difference determines when you'd choose one over the other?
You have a dataset with two highly correlated features. A point appears normal when examining each feature's Z-Score individually but is flagged by Mahalanobis Distance. Explain why this happens mathematically.
Compare and contrast Isolation Forest and LOF: What assumptions does each make about how outliers manifest in data, and in what scenario would LOF detect an outlier that Isolation Forest might miss?
An FRQ asks you to detect anomalies in real-time network traffic data that arrives continuously. Which algorithm is most appropriate, and what property makes it suitable for this context?
Why is Cook's Distance fundamentally different from the other outlier detection methods on this list? In what specific analytical context is it the only appropriate choice?