Outlier detection sits at the intersection of data preprocessing, model robustness, and anomaly detection—three areas that appear repeatedly on exams. Whether you're cleaning data before fitting a regression, identifying fraud in transaction data, or spotting sensor malfunctions, the technique you choose depends on your data's structure: Is it univariate or multivariate? Does it follow a normal distribution? Is it high-dimensional? These questions determine which method will actually work.
You're being tested on your ability to match the right technique to the right data scenario. A Z-score is elegant for normally distributed univariate data, but it falls apart with correlated features or non-Gaussian distributions. Density-based methods shine with clustered data but struggle in high dimensions. Don't just memorize the formulas—know what assumptions each method makes and when those assumptions break down.
These techniques define outliers based on how far a point lies from some central tendency, using statistical measures of spread. The key assumption is that "normal" data clusters around a center, and outliers deviate significantly from it.
Compare: Z-score vs. Mahalanobis Distance—both measure distance from center, but Z-score treats each variable independently while Mahalanobis accounts for covariance structure. If an FRQ gives you correlated features, Mahalanobis is your answer.
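To make the contrast concrete, here is a minimal NumPy sketch on synthetic data (the specific values and seed are illustrative): a point that looks ordinary on each axis separately, but breaks the correlation pattern, slips past per-feature Z-scores yet stands out sharply by Mahalanobis distance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two highly correlated features: x2 is x1 plus small noise.
n = 500
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.1, n)
X = np.column_stack([x1, x2])

# A point that is ordinary on each axis but violates the correlation.
outlier = np.array([1.5, -1.5])
X_all = np.vstack([X, outlier])

# Per-feature Z-scores treat each column independently.
z = np.abs((X_all - X_all.mean(axis=0)) / X_all.std(axis=0))

# Mahalanobis distance accounts for the covariance structure.
cov_inv = np.linalg.inv(np.cov(X_all, rowvar=False))
diff = X_all - X_all.mean(axis=0)
mahal = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

# The injected point sits below the usual |z| > 3 cutoff on both axes,
# yet its Mahalanobis distance dwarfs every genuine point's.
print(z[-1], mahal[-1])
```

This is exactly the FRQ scenario above: each coordinate is within about 1.5 standard deviations of its own mean, so independent Z-scores miss it, while the covariance-aware distance flags it immediately.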
These approaches identify outliers as points in regions of unusually low density. The intuition: normal points have many neighbors, while outliers are isolated.
Compare: LOF vs. DBSCAN—both use local density, but LOF produces a continuous outlier score while DBSCAN gives a binary classification (in cluster or noise). Use LOF when you need to rank anomalies by severity.
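Assuming scikit-learn is available, a short sketch of that contrast on synthetic data (cluster size, `eps`, and `n_neighbors` are illustrative choices): LOF yields a continuous score you can rank by, while DBSCAN only tells you cluster-or-noise.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)

# One dense cluster plus a single isolated point.
cluster = rng.normal(0, 0.3, size=(100, 2))
X = np.vstack([cluster, [[4.0, 4.0]]])

# LOF: continuous score via negative_outlier_factor_ (lower = more anomalous),
# plus -1/+1 labels from fit_predict.
lof = LocalOutlierFactor(n_neighbors=20)
labels_lof = lof.fit_predict(X)
scores = lof.negative_outlier_factor_

# DBSCAN: binary outcome only; label -1 marks noise points.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

print(labels_lof[-1], scores[-1], db.labels_[-1])
```

Both methods flag the isolated point, but only LOF's `scores` array lets you sort all points by severity, which is what you want when an exam question asks for the "top k" anomalies.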
These techniques use random partitioning to isolate anomalies. The core insight: outliers are easier to separate from the bulk of data, requiring fewer random splits.
Compare: Isolation Forest vs. Robust Random Cut Forest—both use tree-based isolation, but RRCF is designed for streaming applications and provides displacement-based scoring. Choose Isolation Forest for batch processing, RRCF for real-time anomaly detection.
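A minimal scikit-learn sketch of the batch-processing side of that comparison (RRCF is not in scikit-learn, so only Isolation Forest is shown; the data and parameters are illustrative). The key behavior: an easy-to-isolate point gets short average path lengths across the trees, which shows up as the lowest score.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Bulk of the data plus one point that is trivial to isolate.
X = rng.normal(0, 1, size=(300, 2))
X = np.vstack([X, [[8.0, 8.0]]])

# Anomalies need fewer random splits to isolate, so they end up
# with shorter paths and lower scores.
iso = IsolationForest(n_estimators=200, random_state=0)
labels = iso.fit_predict(X)       # -1 = anomaly, +1 = normal
scores = iso.score_samples(X)     # lower = more anomalous

print(labels[-1], scores[-1])
```

Note that no distributional assumption was needed, which is why Isolation Forest also appears under "No distributional assumptions" in the table below.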
These techniques fit a model to "normal" data and flag points that don't conform. The assumption: you can characterize what normal looks like, and deviations from that model are outliers.
Compare: One-Class SVM vs. Elliptic Envelope—both learn a boundary around normal data, but Elliptic Envelope assumes Gaussian distribution while One-Class SVM can learn arbitrary shapes via kernels. Use Elliptic Envelope when normality holds; use One-Class SVM otherwise.
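Both estimators share scikit-learn's one-class interface, so the "fit on normal data only, then flag nonconforming points" workflow looks identical; a hedged sketch on synthetic Gaussian data (values and hyperparameters are illustrative, and with truly Gaussian data the two should agree):

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(1)

# Train on "normal" examples only, then score new points.
X_train = rng.normal(0, 1, size=(500, 2))
X_test = np.array([[0.0, 0.0], [6.0, 6.0]])  # one normal, one anomalous

# Elliptic Envelope: fits a Gaussian and draws an elliptical boundary.
ee = EllipticEnvelope(contamination=0.05, random_state=0).fit(X_train)

# One-Class SVM: the RBF kernel can learn non-elliptical boundaries.
oc = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

# Both return +1 for inliers and -1 for outliers.
print(ee.predict(X_test))
print(oc.predict(X_test))
```

Here the training data really is Gaussian, so the two agree; on long-tailed or multimodal data the elliptical boundary would misfire, which is the point of the comparison above.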
When your goal is prediction rather than general anomaly detection, these techniques identify points that disproportionately influence your model.
Compare: Cook's Distance vs. Z-score on residuals—Z-score only considers how far a point's residual is from zero, while Cook's Distance also accounts for leverage (how extreme the predictor values are). High-leverage points with moderate residuals can still be highly influential.
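A from-scratch NumPy sketch of that distinction (synthetic data; the injected point's coordinates are illustrative): a point with an extreme predictor value gets high leverage from the hat matrix, so its Cook's distance is large even though its residual Z-score stays moderate.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simple regression y ≈ 2x, plus one high-leverage point: its x value
# is far outside [0, 1], but its y is only mildly off the line.
x = rng.uniform(0, 1, 50)
y = 2 * x + rng.normal(0, 0.1, 50)
x = np.append(x, 5.0)
y = np.append(y, 9.0)

X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Diagonal of the hat matrix H = X (X'X)^{-1} X' gives each point's leverage.
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Cook's distance: D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
p = X.shape[1]
s2 = resid @ resid / (len(y) - p)
cooks = resid**2 / (p * s2) * h / (1 - h) ** 2

# Z-scores of the residuals alone, for comparison.
z_resid = np.abs((resid - resid.mean()) / resid.std())

print(z_resid[-1], cooks[-1])
```

The last point's residual Z-score is unremarkable (the fit bends toward it, shrinking its residual), yet its Cook's distance is by far the largest, well past the common D > 1 rule of thumb, because the leverage factor h/(1-h)² amplifies it.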
| Data Scenario | Recommended Methods |
|---|---|
| Univariate, normal data | Z-score, IQR |
| Multivariate with correlations | Mahalanobis Distance, Elliptic Envelope |
| Varying density clusters | LOF, DBSCAN |
| High-dimensional data | Isolation Forest, One-Class SVM |
| No distributional assumptions | IQR, Isolation Forest, DBSCAN |
| Streaming/real-time data | Robust Random Cut Forest |
| Regression influence | Cook's Distance |
| Only normal examples available | One-Class SVM, Elliptic Envelope |
You have a dataset with two highly correlated features. Why would applying Z-scores to each feature independently fail to detect certain outliers, and which method should you use instead?
Compare LOF and Isolation Forest: both detect outliers, but they use fundamentally different approaches. What is the key conceptual difference, and in what scenario would each excel?
A colleague suggests using Elliptic Envelope for outlier detection on customer purchase data that shows a long right tail. What's wrong with this choice, and what would you recommend?
If an FRQ asks you to identify influential points in a linear regression, which technique would you use and why? How does it differ from simply flagging points with large residuals?
You're building a fraud detection system where you only have examples of legitimate transactions (no labeled fraud cases). Which two methods from this guide are designed for exactly this scenario, and what assumption distinguishes them?