Outlier detection sits at the intersection of numerical analysis, statistical inference, and machine learning—three pillars you'll be tested on repeatedly. These algorithms aren't just data cleaning tools; they represent fundamentally different assumptions about what makes data "normal" and how anomalies reveal themselves mathematically. When you encounter an outlier detection question, you're really being asked to demonstrate your understanding of distributional assumptions, distance metrics, density estimation, and algorithmic complexity.
Don't just memorize which threshold makes something an outlier. Instead, focus on why each method works: What mathematical property does it exploit? What assumptions must hold for it to be valid? When does it fail? The strongest exam responses connect specific algorithms to their underlying numerical principles—whether that's covariance structure, local density estimation, or recursive partitioning. Master the "why," and the "what" becomes easy.
Statistical Distribution-Based Methods
These methods assume your data follows a known distribution and flag points that fall in the extreme tails. The core principle: if we know what "normal" looks like mathematically, outliers are points with very low probability under that model.
Z-Score Method
Measures standard deviations from the mean—calculated as z = (x − μ) / σ, where values beyond |z| > 3 typically indicate outliers
Assumes normal distribution—this is both its strength (simple, interpretable) and its weakness (fails badly with skewed or multimodal data)
Sensitive to the outliers themselves—since μ and σ are computed from all data, extreme values inflate these statistics and can mask other outliers
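A minimal sketch of the z-score rule described above, assuming NumPy; the ±3 cutoff, the toy data, and the helper name are illustrative choices, not a fixed recipe.

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()      # z = (x - mu) / sigma
    return np.abs(z) > threshold      # boolean mask of flagged points

rng = np.random.default_rng(0)
data = np.append(rng.normal(10, 0.5, size=100), 50.0)   # 100 typical values plus one extreme value
print(np.flatnonzero(zscore_outliers(data)))            # the extreme value (index 100) is flagged
```

Note that the extreme value inflates both the mean and the standard deviation it is judged against, so a less dramatic anomaly could slip under the cutoff; that is exactly the masking issue mentioned above.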
Interquartile Range (IQR) Method
Uses quartiles instead of mean/standard deviation—outliers fall below Q1−1.5×IQR or above Q3+1.5×IQR, where IQR=Q3−Q1
Robust to distributional assumptions—works on skewed data because quartiles aren't pulled by extreme values the way means are
The 1.5 multiplier is conventional, not sacred—some applications use 3.0 for "extreme" outliers; know that this threshold is adjustable
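A corresponding sketch of the IQR fences, again assuming NumPy; the default multiplier of 1.5, the sample data, and the function name are illustrative.

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag points below Q1 - k*IQR or above Q3 + k*IQR (k=3.0 for 'extreme' outliers)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

data = np.array([1.2, 3.4, 2.8, 2.1, 3.0, 2.6, 1.9, 2.4, 15.0])   # skewed by one large value
print(np.flatnonzero(iqr_outliers(data)))                         # only the 15.0 should be flagged
```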
Mahalanobis Distance
Accounts for covariance structure—computed as D_M = √((x − μ)ᵀ Σ⁻¹ (x − μ)), where μ is the mean vector and Σ is the covariance matrix
Detects multivariate outliers that univariate methods miss—a point can be normal in each dimension separately but anomalous when correlations are considered
Requires invertible covariance matrix—fails when features are collinear or when n<p (more variables than observations)
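A plain NumPy sketch of the distance computation, under the assumption that the covariance matrix is invertible; the function name and the simulated data are made up for this example.

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of each row of X from the sample mean, accounting for covariance."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)        # must be invertible: fails when features are collinear or n < p
    cov_inv = np.linalg.inv(cov)
    diff = X - mu
    # row-wise quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return np.sqrt(d2)

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=200)
d = mahalanobis_distances(X)
```

Under a Gaussian assumption, the squared distances follow a chi-squared distribution with p degrees of freedom, which gives a principled cutoff (for example, the 97.5th percentile of that distribution).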
Elliptic Envelope
Fits a minimum covariance determinant ellipse—assumes data follows a Gaussian distribution and identifies points outside the fitted ellipsoid
Robust covariance estimation—uses a subset of "clean" points to estimate Σ, reducing the influence of outliers on the fit itself
Provides probabilistic interpretation—outputs can be converted to p-values, useful when you need statistical significance, not just classification
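A short sketch using scikit-learn's EllipticEnvelope; the contamination rate and the synthetic correlated data are assumptions standing in for a real problem.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=300)

# contamination is the assumed fraction of outliers in the data
env = EllipticEnvelope(contamination=0.02, random_state=0).fit(X)
labels = env.predict(X)         # +1 for points inside the fitted ellipsoid, -1 for points outside
scores = env.mahalanobis(X)     # robust squared Mahalanobis distances under the fitted model
```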
Compare: Z-Score vs. Mahalanobis Distance—both measure "distance from center," but Z-Score treats each variable independently while Mahalanobis accounts for correlations. If an FRQ gives you multivariate data with correlated features, Mahalanobis is your answer.
Density-Based Methods
These methods define outliers as points in sparse regions of the feature space. The key insight: outliers live where other points don't.
Local Outlier Factor (LOF)
Compares local density to neighbors' densities—computes the ratio of the neighbors' local reachability density to the point's own; LOF ≈ 1 means similar density to neighbors, LOF >> 1 indicates the point is in a sparser region
Handles clusters of varying density—unlike global methods, LOF can identify outliers within dense clusters that would look normal globally
Requires choosing k (number of neighbors)—results are sensitive to this parameter; too small captures noise, too large misses local structure
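A sketch with scikit-learn's LocalOutlierFactor; the choice of n_neighbors=20 and the two-cluster synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),    # dense cluster
               rng.normal(8, 3, (100, 2)),    # sparser cluster
               [[0.0, 6.0]]])                 # a point near the dense cluster but locally isolated

lof = LocalOutlierFactor(n_neighbors=20)      # n_neighbors is the k discussed above
labels = lof.fit_predict(X)                   # +1 inlier, -1 outlier
scores = -lof.negative_outlier_factor_        # ~1 means typical density; >>1 means much sparser than neighbors
```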
DBSCAN
Clusters dense regions, labels sparse points as noise—requires two parameters: ϵ (neighborhood radius) and minPts (minimum points to form a core)
No assumption about cluster shape—can find arbitrarily shaped clusters, unlike k-means, making it versatile for real-world spatial data
Outliers are a byproduct, not the focus—points that don't belong to any cluster are labeled as noise; useful when you want clustering and outlier detection simultaneously
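A sketch with scikit-learn's DBSCAN; the eps and min_samples values below are illustrative and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)),
               rng.normal(5, 0.5, (100, 2)),
               [[10.0, 10.0]]])               # an isolated point far from both clusters

db = DBSCAN(eps=0.7, min_samples=5).fit(X)    # eps (neighborhood radius) and min_samples (minPts)
noise = db.labels_ == -1                      # points labeled -1 belong to no cluster (the "outliers")
cluster_ids = set(db.labels_) - {-1}          # the cluster assignments come along for free
```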
Compare: LOF vs. DBSCAN—both use local density, but LOF produces a continuous outlier score while DBSCAN produces a binary label. Use LOF when you need to rank anomalies by severity; use DBSCAN when you also need cluster assignments.
Tree-Based and Ensemble Methods
These algorithms exploit the principle that outliers are easier to isolate through random partitioning. The intuition: normal points require many splits to separate; outliers get isolated quickly.
Isolation Forest
Isolates points via random recursive partitioning—builds trees by randomly selecting features and split values; outliers have shorter average path lengths
Computational complexity is O(n log n)—scales efficiently to large datasets, unlike distance-based methods that often require O(n²) comparisons
Excels in high-dimensional spaces—random feature selection naturally handles the curse of dimensionality without explicit dimension reduction
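A sketch with scikit-learn's IsolationForest; the number of trees and the synthetic 10-dimensional data are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 10)),    # bulk of the data in 10 dimensions
               rng.uniform(-8, 8, (10, 10))])  # a handful of scattered anomalies

iso = IsolationForest(n_estimators=200, random_state=0).fit(X)
labels = iso.predict(X)           # +1 inlier, -1 outlier
scores = -iso.score_samples(X)    # higher = shorter average path length = easier to isolate
```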
Robust Random Cut Forest
Designed for streaming data—can update the model incrementally as new points arrive, unlike Isolation Forest, which requires batch retraining
Uses displacement-based scoring—measures how much a point's removal changes the tree structure, providing a principled anomaly score
Handles concept drift—the forest adapts over time, making it suitable for applications where "normal" evolves (e.g., network traffic, sensor data)
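A rough streaming sketch assuming the third-party rrcf package (pip install rrcf); the number of trees, the sliding-window size, and the averaging of CoDisp scores are illustrative choices, not a canonical recipe.

```python
import numpy as np
import rrcf  # third-party robust random cut forest implementation

num_trees, window = 40, 256
forest = [rrcf.RCTree() for _ in range(num_trees)]

def score_point(index, point):
    """Insert one streaming point into every tree and return its average CoDisp score."""
    scores = []
    for tree in forest:
        if len(tree.leaves) > window:          # sliding window: forget the oldest point (handles drift)
            tree.forget_point(index - window)
        tree.insert_point(point, index=index)
        scores.append(tree.codisp(index))      # displacement-based anomaly score for this point
    return float(np.mean(scores))

rng = np.random.default_rng(0)
stream = rng.normal(0, 1, (400, 2))
stream[200] = [8.0, 8.0]                       # inject one anomaly mid-stream
codisp = [score_point(i, p) for i, p in enumerate(stream)]
# codisp[200] should stand out well above the surrounding scores
```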
Compare: Isolation Forest vs. Robust Random Cut Forest—both use tree ensembles, but Isolation Forest is batch-oriented while RRCF handles streaming data. If an exam question mentions "real-time" or "online" detection, RRCF is the appropriate choice.
Model-Based Methods
These approaches learn a boundary or influence measure from the data, treating outlier detection as a modeling problem. The framework: fit a model to normal behavior, then flag points that don't conform.
One-Class SVM
Learns a decision boundary around normal data—maps data to a high-dimensional space via the kernel trick, then finds the hyperplane that separates the data from the origin with maximum margin
Handles non-linear boundaries—with RBF or polynomial kernels, can capture complex "normal" regions that linear methods miss
Sensitive to hyperparameters—the ν parameter controls the fraction of outliers expected; poor tuning leads to over- or under-detection
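A sketch with scikit-learn's OneClassSVM; the RBF kernel, ν = 0.05, and the synthetic training data are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (300, 2))             # assumed to represent "normal" behavior only

# nu bounds the fraction of training points treated as outliers; gamma shapes the RBF boundary
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_new = np.array([[0.1, -0.2], [6.0, 6.0]])
print(ocsvm.predict(X_new))                      # +1 inside the learned boundary, -1 outside
```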
Cook's Distance
Measures influence on regression coefficients—computed as D_i = (ŷ − ŷ(i))ᵀ(ŷ − ŷ(i)) / (p · MSE), where ŷ(i) is the vector of predictions refit without point i and p is the number of model parameters
Threshold of 4/n is common convention—points exceeding this disproportionately affect the fitted model and warrant investigation
Specific to regression contexts—unlike general-purpose methods, Cook's Distance identifies influential points that distort model parameters, not just unusual values
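A sketch using statsmodels' OLS influence diagnostics; the simulated regression and the injected high-influence point are assumptions, and the 4/n cutoff is the convention mentioned above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + rng.normal(0, 1, 50)
y[10] += 25.0                                    # one point that will drag the fitted line

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()
cooks_d, _ = results.get_influence().cooks_distance   # one distance per observation

threshold = 4 / len(y)                           # the conventional 4/n cutoff
print(np.flatnonzero(cooks_d > threshold))       # index 10 should appear among the flagged points
```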
Compare: One-Class SVM vs. Elliptic Envelope—both learn boundaries around normal data, but Elliptic Envelope assumes Gaussian distributions while One-Class SVM can learn arbitrary shapes via kernels. Use Elliptic Envelope when normality is reasonable; use One-Class SVM for complex, non-linear boundaries.
Which two methods both rely on measuring distance from a center point, and what key difference determines when you'd choose one over the other?
You have a dataset with two highly correlated features. A point appears normal when examining each feature's Z-Score individually but is flagged by Mahalanobis Distance. Explain why this happens mathematically.
Compare and contrast Isolation Forest and LOF: What assumptions does each make about how outliers manifest in data, and in what scenario would LOF detect an outlier that Isolation Forest might miss?
An FRQ asks you to detect anomalies in real-time network traffic data that arrives continuously. Which algorithm is most appropriate, and what property makes it suitable for this context?
Why is Cook's Distance fundamentally different from the other outlier detection methods on this list? In what specific analytical context is it the only appropriate choice?