Outlier detection sits at the heart of statistical inference because a single anomalous point can completely distort your conclusions. You're being tested on your ability to recognize when data violates assumptions, choose appropriate detection methods for different data structures, and justify why removing (or keeping) certain observations affects your statistical conclusions. These methods connect directly to concepts like distributional assumptions, robust estimation, model diagnostics, and the bias-variance tradeoff.
The key insight isn't just knowing these techniques exist; it's understanding when each method is appropriate. A Z-score works great for univariate normal data but fails spectacularly with multivariate correlations or skewed distributions. Density-based methods shine with clustered data but struggle in high dimensions. Don't just memorize formulas; know what data structure and distributional assumption each method requires, and be ready to defend your choice on an FRQ.
These classical approaches assume your data follows a known distribution and flag points that fall too far from expected values. The underlying principle is probability: if a point is extremely unlikely under the assumed distribution, it's probably an outlier.
Compare: Z-score vs. IQR. Both work for univariate data, but Z-score assumes normality while IQR is robust to skewness. If an FRQ gives you income data (typically right-skewed), IQR is your safer choice.
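A minimal sketch of the comparison in NumPy, using simulated lognormal "income" data; the |z| > 3 and 1.5 × IQR cutoffs are conventional choices, not requirements.

```python
import numpy as np

rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=10, sigma=0.8, size=1000)  # right-skewed "income" data

# Z-score rule: flag points more than 3 standard deviations from the mean.
# On skewed data the mean and SD are themselves inflated by the long right tail.
z = (incomes - incomes.mean()) / incomes.std()
z_outliers = incomes[np.abs(z) > 3]

# IQR rule: flag points beyond 1.5 * IQR outside the quartiles.
# Quartiles ignore the extreme tail, so the fences stay stable under skew.
q1, q3 = np.percentile(incomes, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = incomes[(incomes < lower) | (incomes > upper)]

print(f"Z-score flags {len(z_outliers)}, IQR flags {len(iqr_outliers)}")
```

In a simulation like this, the Z-score rule tends to flag fewer of the extreme values because the standard deviation it relies on is stretched by the same tail it is trying to detect.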
These methods detect outliers by examining how isolated a point is relative to its neighbors. The core idea: normal points cluster together, while outliers live in sparse regions of the feature space.
Compare: LOF vs. DBSCAN. Both use density, but LOF assigns continuous anomaly scores while DBSCAN makes binary cluster/noise decisions. LOF is better when you need to rank outliers by severity.
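A rough illustration with scikit-learn; the synthetic clusters and the parameter choices (n_neighbors=20, eps=0.5, min_samples=5) are assumptions for demonstration, not recommended defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Two dense clusters plus a few scattered points in sparse regions.
cluster_a = rng.normal(loc=[0, 0], scale=0.3, size=(100, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.3, size=(100, 2))
scattered = rng.uniform(low=-2, high=7, size=(5, 2))
X = np.vstack([cluster_a, cluster_b, scattered])

# LOF: continuous scores that can be ranked by severity.
lof = LocalOutlierFactor(n_neighbors=20)
lof_labels = lof.fit_predict(X)             # -1 = outlier, 1 = inlier
lof_scores = -lof.negative_outlier_factor_  # larger = more isolated relative to neighbors

# DBSCAN: binary decision; points not density-reachable from any cluster get label -1 (noise).
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
dbscan_noise = X[db.labels_ == -1]

print("LOF flags:", int((lof_labels == -1).sum()),
      "| DBSCAN noise points:", len(dbscan_noise))
```

Note the difference in output: `lof_scores` gives a severity ranking, while DBSCAN only tells you which points fell outside every dense cluster.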
These approaches learn a model of "normal" data and flag points that don't fit. The principle: outliers are points the model struggles to explain or that disproportionately influence model parameters.
Compare: Isolation Forest vs. Autoencoder. Both handle high-dimensional data, but Isolation Forest is interpretable and needs no training labels, while autoencoders can model complex non-linear patterns but act as black boxes. Choose Isolation Forest for explainability, autoencoders for complex data.
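A sketch of the Isolation Forest side using scikit-learn; the synthetic data and contamination=0.02 are assumptions for illustration. An autoencoder alternative would instead be trained to reconstruct the data and would flag points with high reconstruction error, at the cost of fitting and tuning a neural network.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# High-dimensional "normal" data with a handful of injected anomalies.
normal = rng.normal(0, 1, size=(500, 20))
anomalies = rng.normal(6, 1, size=(10, 20))
X = np.vstack([normal, anomalies])

# Isolation Forest: anomalies are easier to isolate with random splits,
# so they end up with shorter average path lengths across the random trees.
# No labels are needed at any point.
iso = IsolationForest(contamination=0.02, random_state=0)
labels = iso.fit_predict(X)        # -1 = anomaly, 1 = normal
scores = iso.decision_function(X)  # lower = more anomalous

print("Flagged as anomalies:", int((labels == -1).sum()))
```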
When your goal is building a predictive model, these methods identify points that unduly influence your fitted parameters. The key question: would my conclusions change substantially if I removed this point?
Compare: Cook's distance vs. Mahalanobis distance. Cook's measures influence on model parameters while Mahalanobis measures distance in the data space. A point can have high Mahalanobis distance but low Cook's distance if it falls along the regression line.
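A sketch of that contrast, using statsmodels for Cook's distance and a hand-rolled Mahalanobis calculation; the simulated data and the added point at (6, 12) are assumptions constructed so the point sits near the regression line.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=50)
y = 2 * x + rng.normal(0, 0.5, size=50)

# Add a point far from the bulk of the data but lying on the y = 2x trend:
# large Mahalanobis distance in (x, y) space, yet little pull on the fit.
x = np.append(x, 6.0)
y = np.append(y, 12.0)

# Cook's distance from an OLS fit: combines leverage and residual size.
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
cooks_d = model.get_influence().cooks_distance[0]

# Mahalanobis distance in (x, y) space: distance from the centroid,
# scaled by the inverse covariance so correlated directions are accounted for.
data = np.column_stack([x, y])
diff = data - data.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
mahal = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

print(f"Added point: Mahalanobis = {mahal[-1]:.2f}, Cook's D = {cooks_d[-1]:.4f}")
```

Because the added point lies close to the fitted line, its residual should stay small and its Cook's distance low, even though its Mahalanobis distance from the centroid is large. That is exactly the high-leverage, low-influence case described above.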
| Data scenario | Recommended methods |
|---|---|
| Univariate, normal data | Z-score |
| Univariate, skewed/robust | IQR method |
| Multivariate with correlations | Mahalanobis distance, Robust covariance (MCD) |
| Varying density clusters | LOF, DBSCAN |
| High-dimensional data | Isolation Forest, One-class SVM, Autoencoder |
| Regression model diagnostics | Cook's distance |
| No distributional assumptions | IQR, LOF, Isolation Forest |
| Complex non-linear patterns | Autoencoder-based detection |
You have univariate income data that's heavily right-skewed. Why would IQR outperform Z-score for outlier detection, and what specific assumption does Z-score violate?
Compare Mahalanobis distance and Euclidean distance: in what scenario would they identify different points as outliers, and why does correlation matter?
Both LOF and DBSCAN use density to detect outliers. What's the key difference in their output, and when would you prefer one over the other?
A regression model has a point with high leverage but a small residual. Would Cook's distance flag this point? Explain the relationship between leverage, residuals, and influence.
You're choosing between Isolation Forest and an autoencoder for detecting anomalies in high-dimensional sensor data. What are two advantages of Isolation Forest and one scenario where the autoencoder might perform better?