In machine learning, you'll constantly face the curse of dimensionality—as features increase, models become slower, more prone to overfitting, and harder to interpret. Dimensionality reduction techniques solve this by compressing data while preserving what matters most. You're being tested on understanding when to use each method, what structure it preserves, and the tradeoffs involved. These concepts appear in system design interviews, ML engineering assessments, and real-world pipeline decisions.
The key insight is that not all methods do the same thing. Some preserve global variance, others maintain local neighborhoods, and still others maximize class separation. Don't just memorize algorithm names—know whether a method is linear vs. nonlinear, supervised vs. unsupervised, and what geometric property it optimizes. This conceptual understanding will help you select the right tool and explain your reasoning under pressure.
These techniques assume data lies on or near a low-dimensional linear subspace and work by finding directions that capture maximum variance or signal. They project data onto that subspace using matrix decompositions such as eigendecomposition or SVD.
Compare: PCA vs. Factor Analysis—both find linear combinations of features, but PCA maximizes total variance while Factor Analysis models shared variance and assumes latent factors cause observations. Use PCA for general compression; use Factor Analysis when you believe hidden constructs drive your measurements.
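A minimal sketch of that contrast in scikit-learn; the iris dataset, the two-component setting, and the standard scaling are illustrative assumptions, not requirements.

```python
# Illustrative comparison of PCA and Factor Analysis on a small dataset.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)  # both methods are scale-sensitive

# PCA: orthogonal directions ordered by the total variance they explain
pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)

# Factor Analysis: latent factors model shared variance, plus per-feature noise
fa = FactorAnalysis(n_components=2).fit(X)
X_fa = fa.transform(X)
print("FA noise variance per feature:", fa.noise_variance_)
```

The `noise_variance_` attribute is the giveaway that Factor Analysis treats each observed feature as latent factors plus feature-specific noise, which PCA does not model.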
When you have labeled data, you can do better than unsupervised variance maximization. These methods optimize for class separability, not just spread.
Compare: PCA vs. LDA—PCA is unsupervised and maximizes variance regardless of labels; LDA is supervised and maximizes class separation (between-class scatter relative to within-class scatter). If you have labels and want features for classification, LDA often outperforms PCA as a preprocessing step, though it can produce at most one fewer component than the number of classes.
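A hedged sketch of that comparison as a preprocessing step; the wine dataset, logistic regression classifier, and 5-fold cross-validation are choices made purely for illustration.

```python
# LDA vs. PCA as dimensionality reduction before a classifier.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # 3 classes, 13 features

# LDA is supervised: it uses y and yields at most (n_classes - 1) = 2 components
lda_pipe = make_pipeline(StandardScaler(),
                         LinearDiscriminantAnalysis(n_components=2),
                         LogisticRegression(max_iter=1000))

# PCA is unsupervised: same output dimensionality, but it ignores the labels
pca_pipe = make_pipeline(StandardScaler(),
                         PCA(n_components=2),
                         LogisticRegression(max_iter=1000))

print("LDA + LR accuracy:", cross_val_score(lda_pipe, X, y, cv=5).mean())
print("PCA + LR accuracy:", cross_val_score(pca_pipe, X, y, cv=5).mean())
```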
Real-world data often lies on curved surfaces (manifolds) embedded in high-dimensional space. These methods preserve geometric relationships that linear techniques miss.
Compare: t-SNE vs. Isomap vs. LLE—all handle nonlinear structure, but t-SNE optimizes for visualization and local clusters, Isomap preserves global geodesic distances, and LLE preserves local linear relationships. For publication-quality cluster visualization, use t-SNE; for understanding global manifold structure, try Isomap first.
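As a rough illustration, the sketch below fits all three embeddings on the same data; the digits dataset and the perplexity/neighbor settings are arbitrary assumptions, not tuned recommendations.

```python
# Three nonlinear embeddings of the same 64-dimensional data.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, Isomap, LocallyLinearEmbedding

X, y = load_digits(return_X_y=True)  # 64-dimensional pixel features

# t-SNE: optimizes local neighborhoods for visualization
# (note: no transform() method for new points after fitting)
X_tsne = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(X)

# Isomap: preserves global geodesic distances estimated over a k-NN graph
X_iso = Isomap(n_components=2, n_neighbors=10).fit_transform(X)

# LLE: reconstructs each point from its neighbors and preserves those local weights
X_lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10).fit_transform(X)
```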
Sometimes the goal isn't compression but unmixing—separating independent sources that have been combined. These methods assume the observed data is a mixture of hidden signals.
Compare: ICA vs. PCA—PCA finds uncorrelated components ordered by variance; ICA finds statistically independent components with no natural ordering. Use ICA when you believe your data is a mixture of independent sources (EEG signals, audio); use PCA for general-purpose compression.
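The sketch below shows the classic blind source separation setup with made-up signals and a made-up mixing matrix; FastICA recovers the independent sources (up to sign and scale), while PCA only decorrelates the mixture.

```python
# Blind source separation: ICA unmixes, PCA merely decorrelates.
import numpy as np
from sklearn.decomposition import FastICA, PCA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                       # smooth sinusoidal source
s2 = np.sign(np.sin(3 * t))              # square-wave source
S = np.c_[s1, s2] + 0.05 * rng.standard_normal((2000, 2))

A = np.array([[1.0, 0.5],
              [0.5, 2.0]])               # illustrative mixing matrix
X = S @ A.T                              # observed mixed signals

S_ica = FastICA(n_components=2, random_state=0).fit_transform(X)  # ~ original sources
X_pca = PCA(n_components=2).fit_transform(X)                      # uncorrelated, still mixed
```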
Deep learning offers flexible, learnable dimensionality reduction that can capture complex nonlinear relationships. The encoder-decoder paradigm learns compressed representations end-to-end.
Compare: PCA vs. Linear Autoencoder—a single-layer autoencoder with linear activations learns the same subspace as PCA. Add nonlinear activations and depth to capture structure that PCA misses. If your data has complex nonlinear relationships and you have enough samples, autoencoders often outperform classical methods.
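A minimal autoencoder sketch, assuming PyTorch, synthetic data, and arbitrary layer sizes; it is meant to show the encoder-decoder shape and the reconstruction objective, not a tuned architecture.

```python
# A tiny autoencoder trained on reconstruction error.
import torch
from torch import nn

X = torch.randn(1024, 100)  # placeholder data: 1024 samples, 100 features

bottleneck = 10
model = nn.Sequential(
    nn.Linear(100, 32), nn.ReLU(),
    nn.Linear(32, bottleneck),           # encoder -> compressed representation
    nn.Linear(bottleneck, 32), nn.ReLU(),
    nn.Linear(32, 100),                  # decoder -> reconstruction
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(50):
    opt.zero_grad()
    loss = loss_fn(model(X), X)          # reconstruction error drives the compression
    loss.backward()
    opt.step()

# Compressed codes: run only the encoder half (the first three layers)
with torch.no_grad():
    codes = model[:3](X)                 # shape: (1024, 10)
```

Dropping the ReLU layers and the hidden widths reduces this to the linear case that spans the same subspace as PCA.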
| Concept | Best Examples |
|---|---|
| Linear variance maximization | PCA, Truncated SVD |
| Supervised class separation | LDA |
| Local neighborhood preservation | t-SNE, LLE |
| Global manifold geometry | Isomap, MDS |
| Source separation / independence | ICA |
| Latent factor modeling | Factor Analysis |
| Learnable nonlinear compression | Autoencoders |
| Works on sparse/distance matrices | Truncated SVD, MDS |
You have a labeled dataset with 3 classes and want to reduce 100 features before training a classifier. Would you choose PCA or LDA, and why?
Which two methods both preserve local neighborhood structure but differ in their optimization approach? What's the key difference in what they minimize?
Your colleague used t-SNE to reduce features for a downstream classifier and got poor results. What's wrong with this approach, and what would you recommend instead?
Compare and contrast: When would you choose ICA over PCA? Give a specific example where ICA would succeed and PCA would fail.
You're building a pipeline to process new data points after training. Which methods from this guide cannot transform new points without retraining, and what alternatives would you use in production?