🧮 Data Science Numerical Analysis

Dimensionality Reduction Techniques


Why This Matters

When you're working with real-world datasets containing hundreds or thousands of features, you're facing the curse of dimensionality—models become computationally expensive, prone to overfitting, and nearly impossible to visualize. Dimensionality reduction techniques are your toolkit for tackling this problem, and they appear throughout numerical analysis, machine learning, and statistics coursework. You're being tested on your ability to choose the right technique for the right problem, understand the mathematical foundations behind each method, and recognize the trade-offs between variance preservation, interpretability, computational cost, and linearity assumptions.

These techniques demonstrate core principles from linear algebra (eigendecomposition, matrix factorization, orthogonal projections), probability theory (likelihood-based embeddings, Gaussian assumptions), and optimization (reconstruction error minimization, manifold learning). Don't just memorize algorithm names—know what mathematical problem each technique solves, when it fails, and how it connects to the broader landscape of numerical methods.


Linear Projection Methods

These techniques project high-dimensional data onto lower-dimensional subspaces using linear transformations. The key insight is that much of the variance in data often lies along a few principal directions, making linear projections surprisingly effective for many real-world datasets.

Principal Component Analysis (PCA)

  • Eigenvalue decomposition of the covariance matrix—identifies orthogonal directions (principal components) ordered by the amount of variance they capture
  • Variance maximization drives the projection; the first $k$ components capture the maximum possible variance in a $k$-dimensional subspace
  • Assumes linear relationships and centered data; sensitive to feature scaling, so standardization is typically required before application
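
A minimal NumPy sketch of the covariance-eigendecomposition route described above; the data is synthetic and the shapes are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # 200 samples, 10 features (synthetic stand-in)

# Center the data (standardize first if features are on different scales)
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigendecomposition of the symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]     # sort directions by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 3
X_reduced = Xc @ eigvecs[:, :k]       # project onto the top-k principal components
print("variance captured:", eigvals[:k].sum() / eigvals.sum())
```

In practice, sklearn.decomposition.PCA does the same job via an SVD of the centered data, which is numerically more stable than forming the covariance matrix explicitly.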

Linear Discriminant Analysis (LDA)

  • Supervised technique that maximizes the ratio $\frac{\sigma_{\text{between}}^2}{\sigma_{\text{within}}^2}$, finding projections that best separate known classes
  • Fisher's criterion guides the optimization—push class means apart while keeping within-class scatter tight
  • Limited to $c-1$ dimensions where $c$ is the number of classes; assumes Gaussian distributions with equal covariance matrices
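
A brief scikit-learn sketch on the Iris data (3 classes, so at most $c-1 = 2$ discriminant directions); the dataset choice is just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)     # 4 features, 3 classes

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)       # supervised: the labels y are required
print(X_lda.shape)                    # (150, 2), the maximum c - 1 dimensions
```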

Random Projections

  • Johnson-Lindenstrauss lemma guarantees that distances are preserved within a factor of $(1 \pm \epsilon)$ when projecting to $O(\log n / \epsilon^2)$ dimensions
  • Computational efficiency is the main advantage—no eigendecomposition required, just multiply by a random matrix
  • Trade-off between accuracy and speed makes this ideal for preprocessing massive datasets before applying more expensive techniques
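
A minimal sketch using scikit-learn's random projection utilities; the data sizes and the distortion tolerance eps are illustrative assumptions:

```python
import numpy as np
from sklearn.random_projection import (GaussianRandomProjection,
                                       johnson_lindenstrauss_min_dim)

n_samples, n_features = 1_000, 10_000
X = np.random.default_rng(0).normal(size=(n_samples, n_features))

# Target dimension the JL lemma suggests for pairwise-distance distortion eps
k = johnson_lindenstrauss_min_dim(n_samples=n_samples, eps=0.2)
print("JL target dimension:", k)

rp = GaussianRandomProjection(n_components=k, random_state=0)
X_small = rp.fit_transform(X)         # just a random matrix multiply, no eigendecomposition
print(X_small.shape)
```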

Compare: PCA vs. LDA—both find linear projections, but PCA is unsupervised (maximizes variance) while LDA is supervised (maximizes class separation). If an exam question involves labeled data and classification, LDA is your answer; for unlabeled exploratory analysis, reach for PCA.


Matrix Factorization Approaches

These methods decompose the original data matrix into products of smaller matrices, revealing latent structure. The mathematical foundation is that any matrix can be approximated by lower-rank matrices, and different factorizations impose different constraints that lead to different interpretations.

Singular Value Decomposition (SVD)

  • Factorizes any matrix $A = U\Sigma V^T$ where $U$ and $V$ are orthonormal and $\Sigma$ contains singular values in descending order
  • Optimal low-rank approximation—truncating to $k$ singular values minimizes the Frobenius norm of the reconstruction error (Eckart-Young theorem)
  • Foundation for PCA when applied to centered data; the right singular vectors are the principal components
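
A short NumPy sketch of truncated reconstruction and the Eckart-Young error identity, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 40))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # singular values arrive sorted

k = 10
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]  # best rank-k approximation in Frobenius norm

err = np.linalg.norm(A - A_k, "fro")
tail = np.sqrt(np.sum(s[k:] ** 2))    # Eckart-Young: error equals the tail singular-value energy
print(err, tail)                      # the two numbers agree to rounding
```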

Truncated SVD (Latent Semantic Analysis)

  • Retains only the top $k$ singular values and corresponding vectors, directly computing the reduced representation without full decomposition
  • Works on sparse matrices without centering, making it essential for text data where TF-IDF matrices are sparse and centering destroys sparsity
  • Latent semantic structure emerges in text applications—documents and terms are mapped to a shared semantic space
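
A small latent-semantic-analysis style sketch; the toy documents below are made up for illustration:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "eigenvalues and eigenvectors of a symmetric matrix",
    "singular value decomposition factorizes a matrix",
    "gradient descent trains neural networks",
    "backpropagation computes gradients for neural networks",
]

tfidf = TfidfVectorizer().fit_transform(docs)   # sparse TF-IDF matrix, never centered
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(tfidf)          # documents embedded in a 2-D "semantic" space
print(doc_vectors.round(2))                     # 2-D coordinates for each document
```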

Non-negative Matrix Factorization (NMF)

  • Decomposes $A \approx WH$ where both $W$ and $H$ have non-negative entries, enabling parts-based representation
  • Interpretability advantage—components represent additive combinations (like facial features or topic mixtures) rather than canceling positive/negative contributions
  • Non-convex optimization means results depend on initialization; no uniqueness guarantee unlike SVD
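
A minimal scikit-learn sketch on synthetic non-negative data (think pixel intensities or word counts); the rank and initialization are illustrative choices:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = rng.random((60, 30))              # non-negative data matrix

model = NMF(n_components=5, init="nndsvd", max_iter=500, random_state=0)
W = model.fit_transform(A)            # non-negative sample weights
H = model.components_                 # non-negative parts (e.g., topics or image parts)

print(np.linalg.norm(A - W @ H, "fro"))   # reconstruction error of the additive model
```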

Compare: SVD vs. NMF—SVD gives the mathematically optimal low-rank approximation but allows negative values; NMF sacrifices optimality for interpretability through non-negativity constraints. For image decomposition or topic modeling where "negative features" make no sense, NMF wins.


Manifold Learning Techniques

When data lies on a curved surface (manifold) embedded in high-dimensional space, linear methods fail. These techniques attempt to "unfold" the manifold by preserving geometric relationships—either local neighborhoods or global geodesic distances.

Isomap

  • Geodesic distance preservation—builds a neighborhood graph, computes shortest paths between all pairs, then applies MDS to the distance matrix
  • Global structure focus distinguishes it from local methods; captures the intrinsic geometry of the entire manifold
  • Sensitive to noise and holes in the data; a single shortcut edge in the neighborhood graph can collapse the embedding

Locally Linear Embedding (LLE)

  • Local linearity assumption—each point is reconstructed as a weighted combination of its $k$ nearest neighbors, then those weights are preserved in the embedding
  • Solves a sparse eigenvalue problem making it computationally tractable for moderately sized datasets
  • Struggles with non-uniform sampling and can produce distorted embeddings when local neighborhoods have varying densities

Multidimensional Scaling (MDS)

  • Distance matrix as input—finds coordinates in low-dimensional space that best preserve pairwise distances (or dissimilarities)
  • Classical MDS uses eigendecomposition and is equivalent to PCA when distances are Euclidean; metric MDS optimizes stress functions for general dissimilarities
  • Visualization tool for understanding relationships; commonly used when you have similarity judgments rather than feature vectors

Compare: Isomap vs. LLE—both are manifold learning methods, but Isomap preserves global geodesic distances while LLE preserves local linear relationships. Isomap works better for smoothly curved manifolds; LLE handles more complex local geometry but may distort global structure.
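
A hedged side-by-side sketch on scikit-learn's synthetic swiss roll, a standard test manifold; the neighbor counts and sample size are illustrative:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding

X, color = make_swiss_roll(n_samples=600, noise=0.05, random_state=0)

embeddings = {
    "Isomap (geodesic distances)": Isomap(n_neighbors=10, n_components=2).fit_transform(X),
    "LLE (local weights)": LocallyLinearEmbedding(n_neighbors=10, n_components=2,
                                                  random_state=0).fit_transform(X),
    "MDS (pairwise distances)": MDS(n_components=2, random_state=0).fit_transform(X),
}
for name, Y in embeddings.items():
    print(name, Y.shape)              # each method returns 2-D coordinates to plot against `color`
```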


Probabilistic and Neighbor-Based Embeddings

These methods model the probability distributions of neighborhoods in high and low dimensions, optimizing embeddings to match these distributions. The key innovation is shifting from geometric distance preservation to probabilistic similarity preservation.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

  • Probabilistic similarity matching—converts distances to probabilities using Gaussian kernels in high-D and Student's t-distribution in low-D
  • Heavy-tailed t-distribution in the embedding space prevents the "crowding problem" where moderate distances all collapse together
  • Perplexity hyperparameter controls the effective number of neighbors; results can vary dramatically with different settings

Uniform Manifold Approximation and Projection (UMAP)

  • Topological foundation—constructs fuzzy simplicial complexes and optimizes cross-entropy between high-D and low-D representations
  • Preserves both local and global structure better than t-SNE while running significantly faster on large datasets
  • Theoretical grounding in Riemannian geometry and algebraic topology provides stronger mathematical justification than t-SNE's heuristics

Compare: t-SNE vs. UMAP—both excel at visualization and cluster separation, but UMAP is faster, scales better, and better preserves global structure. t-SNE remains popular for its well-understood behavior, but UMAP is increasingly preferred for production workflows. Neither produces consistent results across runs without fixing random seeds.
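
A hedged sketch of both on scikit-learn's digits data; the UMAP lines assume the third-party umap-learn package is installed, and the hyperparameter values are typical defaults rather than recommendations:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 1797 samples, 64 features

# t-SNE: perplexity roughly sets the effective neighborhood size
X_tsne = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=42).fit_transform(X)

# UMAP (pip install umap-learn): n_neighbors plays a similar role to perplexity
import umap
X_umap = umap.UMAP(n_neighbors=15, min_dist=0.1,
                   random_state=42).fit_transform(X)

print(X_tsne.shape, X_umap.shape)     # both give 2-D coordinates for plotting, colored by y
```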


Neural Network and Latent Variable Methods

These approaches learn nonlinear mappings through optimization, either via neural networks or statistical latent variable models. The flexibility of learned representations comes at the cost of interpretability and the need for careful regularization.

Autoencoders

  • Encoder-decoder architecture—the encoder maps input $x$ to latent code $z$, the decoder reconstructs $\hat{x}$, and training minimizes $\|x - \hat{x}\|^2$
  • Bottleneck layer forces compression; the dimensionality of $z$ determines the reduction, and nonlinear activations enable learning curved manifolds
  • Variational autoencoders (VAEs) add probabilistic structure by learning $p(z|x)$ and $p(x|z)$, enabling generative modeling alongside dimensionality reduction
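
A minimal PyTorch sketch of a plain (non-variational) autoencoder; the layer sizes, bottleneck width, and random training data are illustrative assumptions:

```python
import torch
from torch import nn

# Encoder-decoder with a 2-D bottleneck
encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))
decoder = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 64))
model = nn.Sequential(encoder, decoder)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                # reconstruction error ||x - x_hat||^2

X = torch.randn(512, 64)              # stand-in for real 64-feature data

for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), X)       # reconstruct the input from the bottleneck
    loss.backward()
    opt.step()

Z = encoder(X).detach()               # the 2-D latent codes are the reduced representation
print(Z.shape)
```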

Factor Analysis

  • Generative latent variable model—assumes observed variables $x = \Lambda f + \epsilon$ where $f$ are latent factors and $\epsilon$ is noise with diagonal covariance
  • Distinguishes signal from noise unlike PCA; the noise model allows factors to explain only the shared variance among variables
  • Maximum likelihood estimation or the EM algorithm fits the model; factor loadings $\Lambda$ reveal how latent constructs influence observed measurements
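
A brief scikit-learn sketch; the Iris data and the two-factor choice are just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X, _ = load_iris(return_X_y=True)

fa = FactorAnalysis(n_components=2, random_state=0)
F = fa.fit_transform(X)               # latent factor scores for each sample
print(fa.components_)                 # loadings Lambda: factors x observed variables
print(fa.noise_variance_)             # per-variable noise, the part PCA has no model for
```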

Independent Component Analysis (ICA)

  • Blind source separation—recovers independent source signals from observed mixtures assuming $x = As$ where sources $s$ are statistically independent
  • Non-Gaussianity is essential—the Central Limit Theorem means mixtures are more Gaussian than sources, so ICA maximizes non-Gaussianity to find original signals
  • Preprocessing with PCA is standard to whiten the data and reduce dimensionality before applying ICA
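
The classic blind-source-separation toy: mix a sine wave and a square wave, then unmix them; the mixing matrix and noise level are arbitrary illustrative values:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two independent, strongly non-Gaussian sources
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
S += 0.05 * np.random.default_rng(0).normal(size=S.shape)

A = np.array([[1.0, 0.5], [0.5, 1.0]])   # unknown mixing matrix
X = S @ A.T                              # observed mixtures x = A s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)             # recovered sources, up to order, sign, and scale
print(S_hat.shape)
```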

Compare: PCA vs. Factor Analysis—PCA finds orthogonal directions of maximum variance with no noise model; Factor Analysis explicitly models measurement error and seeks latent constructs. For psychometric scales or survey instruments where measurement noise matters, Factor Analysis is more appropriate.


Kernel Methods

When linear techniques fail to capture the structure in your data, kernel methods offer a principled way to work in higher-dimensional feature spaces without explicitly computing the transformation.

Kernel PCA

  • Kernel trick application—computes PCA in a feature space $\phi(x)$ implicitly through kernel evaluations $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$
  • Eigendecomposition of the kernel matrix $K$ yields principal components in feature space; common kernels include RBF ($e^{-\gamma\|x-y\|^2}$) and polynomial
  • No explicit low-dimensional coordinates—new points require kernel evaluations with training data, making out-of-sample extension more complex than linear PCA
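
A minimal scikit-learn sketch on concentric circles, a dataset where no linear projection separates the two rings but an RBF kernel can; the gamma value is an illustrative choice:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_linear = PCA(n_components=2).fit_transform(X)
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)

print(X_linear.shape, X_kpca.shape)   # plot each, colored by y, to compare separability
```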

Compare: PCA vs. Kernel PCA—standard PCA finds linear subspaces; Kernel PCA can capture nonlinear structure by implicitly working in high-dimensional feature spaces. The trade-off is computational cost ($O(n^3)$ for kernel matrix eigendecomposition) and the need to choose an appropriate kernel.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Variance maximization (unsupervised) | PCA, SVD, Truncated SVD |
| Class separation (supervised) | LDA |
| Matrix factorization with constraints | NMF (non-negativity), ICA (independence) |
| Local neighborhood preservation | LLE, t-SNE, UMAP |
| Global distance preservation | Isomap, MDS |
| Probabilistic/generative models | Factor Analysis, Autoencoders (VAE) |
| Nonlinear via kernels | Kernel PCA |
| Computational efficiency at scale | Random Projections, Truncated SVD, UMAP |

Self-Check Questions

  1. Both PCA and Factor Analysis identify latent structure in data. What is the key modeling difference between them, and when would you choose Factor Analysis over PCA?

  2. You have a dataset with 10,000 samples and 50,000 features that you need to reduce to 100 dimensions quickly. Which technique would you apply first, and why does the Johnson-Lindenstrauss lemma justify this choice?

  3. Compare t-SNE and Isomap: both are nonlinear dimensionality reduction techniques, but they preserve different types of structure. What does each prioritize, and how does this affect their typical use cases?

  4. An FRQ asks you to explain why NMF produces more interpretable components than SVD for image data. What mathematical constraint drives this difference, and what is sacrificed in exchange?

  5. You're applying ICA to separate mixed audio signals. Why would PCA fail at this task, and what statistical property of the source signals does ICA exploit that PCA ignores?