🧠 Machine Learning Engineering

Dimensionality Reduction Methods


Why This Matters

In machine learning, you'll constantly face the curse of dimensionality—as features increase, models become slower, more prone to overfitting, and harder to interpret. Dimensionality reduction techniques solve this by compressing data while preserving what matters most. You're being tested on understanding when to use each method, what structure it preserves, and the tradeoffs involved. These concepts appear in system design interviews, ML engineering assessments, and real-world pipeline decisions.

The key insight is that not all methods do the same thing. Some preserve global variance, others maintain local neighborhoods, and still others maximize class separation. Don't just memorize algorithm names—know whether a method is linear vs. nonlinear, supervised vs. unsupervised, and what geometric property it optimizes. This conceptual understanding will help you select the right tool and explain your reasoning under pressure.


Linear Variance-Based Methods

These techniques assume data lies on or near a linear subspace and work by finding directions that capture maximum variance or signal. They project data onto hyperplanes using matrix decompositions.

Principal Component Analysis (PCA)

  • Maximizes variance—finds orthogonal directions (principal components) that capture the most spread in the data
  • Eigenvalue decomposition of the covariance matrix identifies these directions; components are ranked by explained variance
  • Unsupervised and linear—excellent for preprocessing, noise reduction, and visualizing high-dimensional data when linear assumptions hold
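
As a rough sketch of how this looks in practice, here is PCA on synthetic data using scikit-learn; the array shapes and component count are illustrative assumptions, not prescribed by this guide:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))    # synthetic data: 200 samples, 50 features

pca = PCA(n_components=10)        # keep the 10 highest-variance directions
X_reduced = pca.fit_transform(X)  # projected data, shape (200, 10)

# components are ranked by explained variance; the cumulative curve
# is a common way to decide how many to keep
print(pca.explained_variance_ratio_.cumsum())
```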

Truncated Singular Value Decomposition (SVD)

  • Matrix factorization into UΣVᵀ—retains only the top-k singular values for dimensionality reduction
  • Works on sparse matrices—unlike PCA, doesn't require centering, making it ideal for NLP (LSA) and recommender systems
  • Identifies latent structure—the singular vectors reveal hidden patterns while filtering noise from the smaller components
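
A minimal sketch of Truncated SVD on a sparse matrix, assuming scikit-learn and a synthetic term-document-style matrix (the sizes and density are made up for illustration):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse term-document-style matrix: 1,000 docs x 5,000 terms
X = sparse_random(1000, 5000, density=0.01, format="csr", random_state=0)

svd = TruncatedSVD(n_components=100, random_state=0)  # no centering required
X_latent = svd.fit_transform(X)   # dense latent representation, (1000, 100)
print(svd.explained_variance_ratio_.sum())
```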

Factor Analysis

  • Models observed variables as linear combinations of latent factors plus noise, assuming X = Lf + ε
  • Separates common variance from unique variance—unlike PCA, explicitly accounts for measurement error
  • Interpretable factors—widely used in psychometrics and social sciences to identify underlying constructs like "intelligence" or "satisfaction"
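
To make the latent-factor idea concrete, here is a sketch with scikit-learn's FactorAnalysis on data generated from two hidden factors; the factor count and noise level are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
F = rng.normal(size=(300, 2))                # two hidden factors per sample
L = rng.normal(size=(2, 6))                  # loading matrix
X = F @ L + 0.3 * rng.normal(size=(300, 6))  # observed = factors x loadings + noise

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)     # estimated factor scores per sample
print(fa.components_.shape)      # estimated loadings, (2, 6)
print(fa.noise_variance_)        # per-feature unique (noise) variance
```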

Compare: PCA vs. Factor Analysis—both find linear combinations of features, but PCA maximizes total variance while Factor Analysis models shared variance and assumes latent factors cause observations. Use PCA for general compression; use Factor Analysis when you believe hidden constructs drive your measurements.


Supervised Projection Methods

When you have labeled data, you can do better than unsupervised variance maximization. These methods optimize for class separability, not just spread.

Linear Discriminant Analysis (LDA)

  • Maximizes between-class variance while minimizing within-class variance—finds projections that best separate known classes
  • Projects to at most k - 1 dimensions where k is the number of classes; combines dimensionality reduction with classification
  • Assumes Gaussian classes with equal covariance—works well for preprocessing before classifiers, reducing overfitting on high-dimensional data
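
An illustrative sketch, assuming scikit-learn and a synthetic three-class dataset (so at most two discriminant directions are available):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic labeled data: 3 classes, 100 features
X, y = make_classification(n_samples=600, n_features=100, n_informative=10,
                           n_classes=3, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=2)  # at most n_classes - 1 = 2
X_proj = lda.fit_transform(X, y)                  # supervised: labels required
print(X_proj.shape)  # (600, 2)
```

The projected features can then be fed to any downstream classifier.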

Compare: PCA vs. LDA—PCA is unsupervised and maximizes variance regardless of labels; LDA is supervised and maximizes class separation. If you have labels and want features for classification, LDA often outperforms PCA as a preprocessing step.


Nonlinear Manifold Learning

Real-world data often lies on curved surfaces (manifolds) embedded in high-dimensional space. These methods preserve geometric relationships that linear techniques miss.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

  • Preserves local neighborhoods—converts pairwise similarities to probabilities and minimizes KL divergence between high-D and low-D distributions
  • Heavy-tailed t-distribution in the low-D space prevents crowding and reveals cluster structure clearly
  • Visualization only—computationally expensive, non-deterministic, and doesn't provide a mapping for new points; use for exploration, not feature engineering
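
A small sketch, assuming scikit-learn's TSNE and its bundled digits dataset; note there is no transform step for unseen points, which is why it suits exploration rather than production pipelines:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 64-dimensional digit images

# TSNE has no transform() for new points; everything is embedded at once
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)          # use for plotting / exploration only
print(X_2d.shape)                     # (1797, 2)
```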

Isomap

  • Geodesic distances—approximates the true manifold distance by computing shortest paths through a neighborhood graph
  • Applies classical MDS to the geodesic distance matrix, preserving global geometry better than local methods
  • Sensitive to noise and holes—works beautifully on clean manifolds like the Swiss roll but can fail with sparse or noisy data
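
A sketch on the Swiss roll mentioned above, assuming scikit-learn (the neighbor count is an illustrative choice):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)  # clean 2-D manifold in 3-D

iso = Isomap(n_neighbors=10, n_components=2)  # neighborhood graph, then geodesic MDS
X_unrolled = iso.fit_transform(X)
print(X_unrolled.shape)  # (1000, 2)
```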

Locally Linear Embedding (LLE)

  • Reconstructs points from neighbors—finds weights that express each point as a linear combination of its k nearest neighbors
  • Preserves local geometry—the same weights are used to embed points in the low-dimensional space
  • Efficient and elegant—solves a sparse eigenvalue problem, but sensitive to neighborhood size and struggles with non-uniform sampling
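
A comparable sketch with scikit-learn's LocallyLinearEmbedding; the neighborhood size shown is an arbitrary starting point that usually needs tuning:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Results are sensitive to n_neighbors; treat 12 as a starting point to tune
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_embedded = lle.fit_transform(X)
print(X_embedded.shape)  # (1000, 2)
```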

Compare: t-SNE vs. Isomap vs. LLE—all handle nonlinear structure, but t-SNE optimizes for visualization and local clusters, Isomap preserves global geodesic distances, and LLE preserves local linear relationships. For publication-quality cluster visualization, use t-SNE; for understanding global manifold structure, try Isomap first.


Signal Separation Methods

Sometimes the goal isn't compression but unmixing—separating independent sources that have been combined. These methods assume the observed data is a mixture of hidden signals.

Independent Component Analysis (ICA)

  • Maximizes statistical independence—finds components that are as non-Gaussian and independent as possible
  • Blind source separation—the classic "cocktail party problem" where mixed audio signals are separated without knowing the sources
  • Assumes non-Gaussian sources—fails when sources are Gaussian, since independence then adds nothing beyond uncorrelatedness and the sources can only be recovered up to a rotation; the order and scale of components are arbitrary
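
A minimal cocktail-party-style sketch, assuming scikit-learn's FastICA and two synthetic non-Gaussian sources mixed by a made-up matrix:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                 # source 1: sine wave
s2 = np.sign(np.sin(3 * t))        # source 2: square wave (non-Gaussian)
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5],          # made-up mixing matrix
              [0.5, 1.0]])
X = S @ A.T                        # observed mixtures ("microphone" signals)

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)       # recovered sources; order and scale arbitrary
print(S_est.shape)                 # (2000, 2)
```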

Multidimensional Scaling (MDS)

  • Preserves pairwise distances—given a dissimilarity matrix, finds coordinates that best reproduce those distances in low-D space
  • Classical vs. metric vs. non-metric—classical MDS uses an eigendecomposition of (Euclidean) distances, metric MDS minimizes stress against the given distances, and non-metric MDS preserves only their rank order
  • Flexible input—works directly from distance matrices, useful when you have similarities but not feature vectors
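
A sketch assuming scikit-learn's MDS with a precomputed distance matrix; the distances here are derived from synthetic features purely for illustration:

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
D = pairwise_distances(rng.normal(size=(100, 20)))  # any dissimilarity matrix

# dissimilarity="precomputed" lets MDS start from distances rather than features
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
print(coords.shape)  # (100, 2)
```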

Compare: ICA vs. PCA—PCA finds uncorrelated components ordered by variance; ICA finds statistically independent components with no natural ordering. Use ICA when you believe your data is a mixture of independent sources (EEG signals, audio); use PCA for general-purpose compression.


Neural Network Approaches

Deep learning offers flexible, learnable dimensionality reduction that can capture complex nonlinear relationships. The encoder-decoder paradigm learns compressed representations end-to-end.

Autoencoders

  • Encoder-decoder architecture—compresses input to a bottleneck layer (latent representation), then reconstructs; trained to minimize reconstruction error
  • Nonlinear and flexible—with enough capacity, can learn arbitrarily complex manifolds; variants include denoising, sparse, and variational autoencoders
  • Feature learning powerhouse—the latent space can be used for downstream tasks, anomaly detection (high reconstruction error), or generative modeling (VAEs)
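
A minimal PyTorch sketch of the encoder-decoder idea on synthetic data; the layer sizes, bottleneck width, and training budget are illustrative assumptions, not a recommended architecture:

```python
import torch
from torch import nn

X = torch.randn(512, 100)  # synthetic, roughly standardized data

encoder = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 8))
decoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 100))
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(200):                  # train to minimize reconstruction error
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    optimizer.step()

Z = encoder(X).detach()               # 8-dimensional latent representation
print(Z.shape)                        # torch.Size([512, 8])
```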

Compare: PCA vs. Linear Autoencoder—a single-layer autoencoder with linear activations learns the same subspace as PCA. Add nonlinear activations and depth to capture structure that PCA misses. If your data has complex nonlinear relationships and you have enough samples, autoencoders often outperform classical methods.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Linear variance maximization | PCA, Truncated SVD |
| Supervised class separation | LDA |
| Local neighborhood preservation | t-SNE, LLE |
| Global manifold geometry | Isomap, MDS |
| Source separation / independence | ICA |
| Latent factor modeling | Factor Analysis |
| Learnable nonlinear compression | Autoencoders |
| Works on sparse/distance matrices | Truncated SVD, MDS |

Self-Check Questions

  1. You have a labeled dataset with 3 classes and want to reduce 100 features before training a classifier. Would you choose PCA or LDA, and why?

  2. Which two methods both preserve local neighborhood structure but differ in their optimization approach? What's the key difference in what they minimize?

  3. Your colleague used t-SNE to reduce features for a downstream classifier and got poor results. What's wrong with this approach, and what would you recommend instead?

  4. Compare and contrast: When would you choose ICA over PCA? Give a specific example where ICA would succeed and PCA would fail.

  5. You're building a pipeline to process new data points after training. Which methods from this guide cannot transform new points without retraining, and what alternatives would you use in production?