Dimensionality Reduction Algorithms


Why This Matters

When you're working with datasets containing hundreds or thousands of features, you're facing the curse of dimensionality—a phenomenon where algorithms struggle, visualizations become meaningless, and computational costs explode. Dimensionality reduction isn't just a convenience; it's often the difference between a model that works and one that fails. These techniques connect directly to core concepts you'll be tested on: variance preservation, manifold learning, linear vs. non-linear transformations, and the tradeoff between local and global structure.

Understanding these algorithms means knowing when and why to apply each one—not just how they work mechanically. Are you trying to visualize clusters? Preserve distances for downstream classification? Compress data for storage? Each scenario calls for a different approach. Don't just memorize algorithm names—know what mathematical principle each method optimizes and what structure it preserves.


Linear Methods: Variance and Matrix Factorization

These foundational techniques assume your data lies in or near a linear subspace. They're computationally efficient, mathematically elegant, and form the basis for many advanced methods. The core idea: find directions or components that capture the most important information while discarding noise.

Principal Component Analysis (PCA)

  • Maximizes variance along orthogonal axes—transforms data into uncorrelated principal components ranked by how much variance each captures
  • Unsupervised and deterministic, meaning it finds the same solution every time and doesn't require labels
  • Foundation for preprocessing pipelines—use it for noise reduction, multicollinearity removal, and as input to other algorithms
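
A minimal scikit-learn sketch of this workflow; the digits dataset and the choice of 10 components are illustrative assumptions, not a recommendation:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)            # 1797 samples, 64 features
X_scaled = StandardScaler().fit_transform(X)   # PCA is variance-based, so scale first

pca = PCA(n_components=10)                     # keep the 10 highest-variance directions
X_reduced = pca.fit_transform(X_scaled)

# Variance captured per component, and the running total
print(pca.explained_variance_ratio_.round(3))
print(pca.explained_variance_ratio_.cumsum().round(3))
```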

Singular Value Decomposition (SVD)

  • Factorizes any matrix into three components: $A = U\Sigma V^T$, where $\Sigma$ contains singular values revealing data's intrinsic dimensionality
  • Mathematically equivalent to PCA when applied to centered data, but more numerically stable for large matrices
  • Powers recommender systems and NLP—latent semantic analysis and collaborative filtering rely heavily on truncated SVD
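
A short sketch of the factorization and its truncated form; the random matrix and the choice of 5 components are placeholders for illustration:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))                 # any real matrix works

# Full factorization: A = U @ diag(s) @ Vt, singular values sorted largest first
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))     # True: exact reconstruction

# Truncated SVD keeps only the top-k components (as in LSA and recommenders)
svd = TruncatedSVD(n_components=5, random_state=0)
A_reduced = svd.fit_transform(A)               # shape (100, 5)

# On centered data the right singular vectors match PCA's principal axes
A_centered = A - A.mean(axis=0)
_, _, Vt_centered = np.linalg.svd(A_centered, full_matrices=False)
```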

Linear Discriminant Analysis (LDA)

  • Supervised method that maximizes class separability—finds projections that push class means apart while keeping within-class variance tight
  • Optimizes Fisher's criterion: the ratio of between-class to within-class variance, $\frac{\sigma_{\text{between}}^2}{\sigma_{\text{within}}^2}$
  • Limited to $c-1$ dimensions where $c$ is the number of classes—a key constraint to remember for exam questions
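
A hedged sketch on scikit-learn's wine dataset, chosen only because its 3 classes make the $c-1$ limit concrete:

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_wine(return_X_y=True)              # 178 samples, 13 features, 3 classes

# LDA is supervised: fit_transform needs the class labels
lda = LinearDiscriminantAnalysis(n_components=2)   # at most c - 1 = 2 dimensions here
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)                             # (178, 2)

# Requesting more than c - 1 components is not allowed:
# LinearDiscriminantAnalysis(n_components=3).fit(X, y)  # raises ValueError
```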

Compare: PCA vs. LDA—both find linear projections, but PCA maximizes total variance (unsupervised) while LDA maximizes class discrimination (supervised). If an FRQ asks about dimensionality reduction before classification, discuss whether labels are available.

Factor Analysis

  • Models observed variables as linear combinations of latent factors—assumes hidden variables explain correlations in your data
  • Includes an error term for each variable, unlike PCA which assumes all variance is signal
  • Interpretability-focused—commonly used in psychology and social sciences to identify underlying constructs like "intelligence" or "satisfaction"
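
A brief sketch with scikit-learn's FactorAnalysis; the iris data and the two-factor choice are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X, _ = load_iris(return_X_y=True)              # 4 observed variables

fa = FactorAnalysis(n_components=2, random_state=0)
X_fa = fa.fit_transform(X)

# Loadings: how strongly each observed variable relates to each latent factor
print(fa.components_.round(2))

# Unlike PCA, every variable also gets its own noise (uniqueness) term
print(fa.noise_variance_.round(2))
```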

Compare: PCA vs. Factor Analysis—PCA finds components that explain variance; Factor Analysis finds latent factors that explain correlations. Factor Analysis is better when you believe hidden variables generate your observations.


Manifold Learning: Preserving Geometric Structure

Real-world data often lies on curved, non-linear surfaces (manifolds) embedded in high-dimensional space. These methods unroll the manifold to reveal its true low-dimensional structure. The key insight: Euclidean distance in the original space may not reflect the actual relationships between points.

Isomap

  • Replaces Euclidean distances with geodesic distances—measures along the manifold surface rather than through ambient space
  • Builds a neighborhood graph then computes shortest paths, applying classical MDS to the resulting distance matrix
  • Preserves global structure but sensitive to noise and holes in the data—requires careful parameter tuning for $k$-nearest neighbors
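
A sketch on the classic Swiss-roll toy manifold, assuming scikit-learn; the neighbor count is an illustrative value that would normally need tuning:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# A classic curved manifold: 3-D points that really live on a 2-D sheet
X, color = make_swiss_roll(n_samples=1500, random_state=0)

# n_neighbors controls the neighborhood graph: too small and the graph disconnects,
# too large and shortcuts across the roll distort the geodesic distances
iso = Isomap(n_neighbors=10, n_components=2)
X_iso = iso.fit_transform(X)
print(X_iso.shape)                             # (1500, 2): the unrolled sheet
```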

Locally Linear Embedding (LLE)

  • Reconstructs each point as a weighted combination of its neighbors—then finds low-dimensional coordinates preserving those same weights
  • Focuses on local geometry by assuming small neighborhoods are approximately linear patches of the manifold
  • Computationally efficient but struggles with non-uniform sampling and doesn't handle out-of-sample points naturally
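
A parallel sketch on the same toy manifold; again, the neighbor count is an illustrative assumption:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1500, random_state=0)

# Each point is written as a weighted combination of its 12 nearest neighbors;
# the embedding then preserves those reconstruction weights in 2-D
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_lle = lle.fit_transform(X)

# reconstruction_error_ reports how well the local linear patches were preserved
print(X_lle.shape, lle.reconstruction_error_)
```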

Multidimensional Scaling (MDS)

  • Preserves pairwise distances in the embedding—finds coordinates where distances match the original dissimilarity matrix as closely as possible
  • Classical MDS uses eigendecomposition and is equivalent to PCA when distances are Euclidean; non-metric MDS preserves only rank ordering
  • Flexible input format—can work directly from a distance matrix without needing the original features
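
A sketch of the distance-matrix workflow in scikit-learn; starting from Euclidean distances on iris is just an illustration, since any dissimilarity matrix would do:

```python
from sklearn.datasets import load_iris
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

X, _ = load_iris(return_X_y=True)

# MDS can start from a precomputed dissimilarity matrix: no raw features required
D = pairwise_distances(X, metric="euclidean")

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
X_mds = mds.fit_transform(D)

# stress_ measures how far embedded distances are from the originals (lower is better)
print(X_mds.shape, round(mds.stress_, 2))
```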

Compare: Isomap vs. LLE—both are manifold methods, but Isomap preserves global geodesic distances while LLE preserves local linear reconstructions. Isomap better maintains overall shape; LLE better captures local neighborhoods.


Probabilistic and Neighbor-Based Embeddings

These methods focus on preserving relationships between points rather than distances or variances. They excel at visualization and cluster discovery, using probability distributions or topological structures to define similarity. Particularly powerful for revealing hidden groupings in complex data.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

  • Converts distances to probabilities—uses Gaussian distributions in high-D and heavy-tailed Student's t-distribution in low-D to model point similarities
  • Emphasizes local structure by penalizing close points that end up far apart much more heavily than distant points that end up close together
  • Non-deterministic and non-parametric—different runs produce different results, and you can't easily embed new points without rerunning
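
A sketch with scikit-learn's TSNE on the digits data; the perplexity and seed are illustrative, and a different seed will produce a different layout:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# perplexity roughly sets the effective neighborhood size; the optimization is
# non-convex, so a different random_state gives a different layout
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_tsne = tsne.fit_transform(X)
print(X_tsne.shape)                            # (1797, 2): coordinates for plotting

# There is no .transform() for new points; embedding fresh data means rerunning
```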

Uniform Manifold Approximation and Projection (UMAP)

  • Balances local and global structure—constructs a fuzzy topological representation and optimizes cross-entropy between high-D and low-D graphs
  • Faster than t-SNE with better preservation of continuous structures and cluster relationships
  • Theoretically grounded in Riemannian geometry and algebraic topology, providing more principled hyperparameter choices
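
A sketch assuming the third-party umap-learn package is installed; the hyperparameter values are illustrative starting points, not recommendations:

```python
# pip install umap-learn
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# n_neighbors trades local vs. global structure; min_dist controls how tightly
# points within a cluster are packed
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
X_umap = reducer.fit_transform(X)

# Unlike t-SNE, a fitted UMAP model can embed previously unseen points
X_new = reducer.transform(X[:10])
print(X_umap.shape, X_new.shape)
```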

Compare: t-SNE vs. UMAP—both excel at visualization, but UMAP is faster, preserves more global structure, and handles larger datasets better. t-SNE often produces tighter, more separated clusters; UMAP maintains more continuous relationships. For exploratory analysis, try both.


Neural Network Approaches: Learning Representations

Deep learning offers flexible, powerful dimensionality reduction that can capture arbitrarily complex patterns. The tradeoff: more expressive power requires more data and computational resources.

Autoencoders

  • Encoder-decoder architecture compresses input to a bottleneck layer (latent space) then reconstructs the original—the bottleneck forces learning of efficient representations
  • Loss function drives compression—typically minimizes reconstruction error like $\text{MSE} = \frac{1}{n}\sum_i (x_i - \hat{x}_i)^2$
  • Variants unlock different capabilities—variational autoencoders (VAEs) learn probabilistic latent spaces; denoising autoencoders improve robustness
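
A minimal PyTorch sketch on synthetic data; the layer widths, bottleneck size, and number of epochs are arbitrary choices for illustration:

```python
import torch
from torch import nn

X = torch.randn(256, 64)                  # toy stand-in for standardized features

# Encoder compresses 64 -> 8 (the bottleneck); decoder reconstructs 8 -> 64
model = nn.Sequential(
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 8),                     # latent space
    nn.Linear(8, 32), nn.ReLU(),
    nn.Linear(32, 64),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                    # reconstruction error drives the compression

for epoch in range(50):
    optimizer.zero_grad()
    reconstruction = model(X)
    loss = loss_fn(reconstruction, X)     # compare the output to the original input
    loss.backward()
    optimizer.step()

# The reduced representation comes from running the encoder half only
with torch.no_grad():
    latent = model[:3](X)                 # layers up to the 8-unit bottleneck
print(latent.shape)                       # torch.Size([256, 8])
```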

Compare: PCA vs. Autoencoders—a single-layer linear autoencoder learns the same subspace as PCA. Deep autoencoders with non-linear activations can capture complex manifolds that PCA misses entirely. Use PCA for interpretability; autoencoders for expressiveness.


Quick Reference Table

Concept                             Best Examples
Linear projection (unsupervised)    PCA, SVD
Linear projection (supervised)      LDA
Latent variable modeling            Factor Analysis, Autoencoders
Global structure preservation       Isomap, MDS, UMAP
Local structure preservation        LLE, t-SNE
Visualization of clusters           t-SNE, UMAP
Non-linear manifold learning        Isomap, LLE, Autoencoders
Scalability to large datasets       PCA, SVD, UMAP

Self-Check Questions

  1. You have labeled data and want to reduce dimensions before classification. Which two methods could you use, and how do they differ in their optimization objectives?

  2. A colleague's t-SNE plot shows well-separated clusters, but they conclude the clusters are equally distant from each other. What's wrong with this interpretation, and which alternative method might give more reliable global distance information?

  3. Compare PCA and Factor Analysis: if your goal is to identify interpretable latent constructs (like "customer satisfaction"), which would you choose and why?

  4. You're working with data that lies on a curved surface (like points sampled from a Swiss roll). Why would PCA fail here, and which manifold learning method would you try first?

  5. An FRQ asks you to design a dimensionality reduction pipeline for a dataset with 10,000 samples and 500 features, where you need both visualization and input features for a downstream classifier. Describe your approach using at least two different algorithms and justify your choices.