🧮 Data Science Numerical Analysis

Dimensionality Reduction Techniques


Why This Matters

When you're working with real-world datasets containing hundreds or thousands of features, you're facing the curse of dimensionality—models become computationally expensive, prone to overfitting, and nearly impossible to visualize. Dimensionality reduction techniques are your toolkit for tackling this problem, and they appear throughout numerical analysis, machine learning, and statistics coursework. You're being tested on your ability to choose the right technique for the right problem, understand the mathematical foundations behind each method, and recognize the trade-offs between variance preservation, interpretability, computational cost, and linearity assumptions.

These techniques demonstrate core principles from linear algebra (eigendecomposition, matrix factorization, orthogonal projections), probability theory (likelihood-based embeddings, Gaussian assumptions), and optimization (reconstruction error minimization, manifold learning). Don't just memorize algorithm names—know what mathematical problem each technique solves, when it fails, and how it connects to the broader landscape of numerical methods.


Linear Projection Methods

These techniques project high-dimensional data onto lower-dimensional subspaces using linear transformations. The key insight is that much of the variance in data often lies along a few principal directions, making linear projections surprisingly effective for many real-world datasets.

Principal Component Analysis (PCA)

  • Eigenvalue decomposition of the covariance matrix—identifies orthogonal directions (principal components) ordered by the amount of variance they capture
  • Variance maximization drives the projection; the first $k$ components capture the maximum possible variance in a $k$-dimensional subspace
  • Assumes linear relationships and centered data; sensitive to feature scaling, so standardization is typically required before application
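
A minimal NumPy sketch of the covariance-eigendecomposition route described above; the data is synthetic and the shapes are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # 200 samples, 10 features (synthetic stand-in)

# Center the data (standardize first if features are on different scales)
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigendecomposition of the symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]     # sort directions by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 3
X_reduced = Xc @ eigvecs[:, :k]       # project onto the top-k principal components
print("variance captured:", eigvals[:k].sum() / eigvals.sum())
```

In practice, sklearn.decomposition.PCA does the same job via an SVD of the centered data, which is numerically more stable than forming the covariance matrix explicitly.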

Linear Discriminant Analysis (LDA)

  • Supervised technique that maximizes the ratio $\frac{\sigma_{\text{between}}^2}{\sigma_{\text{within}}^2}$, finding projections that best separate known classes
  • Fisher's criterion guides the optimization—push class means apart while keeping within-class scatter tight
  • Limited to $c-1$ dimensions where $c$ is the number of classes; assumes Gaussian distributions with equal covariance matrices
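
A brief scikit-learn sketch on the Iris data (3 classes, so at most $c-1 = 2$ discriminant directions); the dataset choice is just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)     # 4 features, 3 classes

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)       # supervised: the labels y are required
print(X_lda.shape)                    # (150, 2), the maximum c - 1 dimensions
```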

Random Projections

  • Johnson-Lindenstrauss lemma guarantees that distances are preserved within a factor of $(1 \pm \epsilon)$ when projecting to $O(\log n / \epsilon^2)$ dimensions
  • Computational efficiency is the main advantage—no eigendecomposition required, just multiply by a random matrix
  • Trade-off between accuracy and speed makes this ideal for preprocessing massive datasets before applying more expensive techniques
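
A minimal sketch using scikit-learn's random projection utilities; the data sizes and the distortion tolerance eps are illustrative assumptions:

```python
import numpy as np
from sklearn.random_projection import (GaussianRandomProjection,
                                       johnson_lindenstrauss_min_dim)

n_samples, n_features = 1_000, 10_000
X = np.random.default_rng(0).normal(size=(n_samples, n_features))

# Target dimension the JL lemma suggests for pairwise-distance distortion eps
k = johnson_lindenstrauss_min_dim(n_samples=n_samples, eps=0.2)
print("JL target dimension:", k)

rp = GaussianRandomProjection(n_components=k, random_state=0)
X_small = rp.fit_transform(X)         # just a random matrix multiply, no eigendecomposition
print(X_small.shape)
```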

Compare: PCA vs. LDA—both find linear projections, but PCA is unsupervised (maximizes variance) while LDA is supervised (maximizes class separation). If an exam question involves labeled data and classification, LDA is your answer; for unlabeled exploratory analysis, reach for PCA.


Matrix Factorization Approaches

These methods decompose the original data matrix into products of smaller matrices, revealing latent structure. The mathematical foundation is that any matrix can be approximated by lower-rank matrices, and different factorizations impose different constraints that lead to different interpretations.

Singular Value Decomposition (SVD)

  • Factorizes any matrix $A = U\Sigma V^T$ where $U$ and $V$ are orthonormal and $\Sigma$ contains singular values in descending order
  • Optimal low-rank approximation—truncating to $k$ singular values minimizes the Frobenius norm of the reconstruction error (Eckart-Young theorem)
  • Foundation for PCA when applied to centered data; the right singular vectors are the principal components
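
A short NumPy sketch of truncated reconstruction and the Eckart-Young error identity, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 40))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # singular values arrive sorted

k = 10
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]  # best rank-k approximation in Frobenius norm

err = np.linalg.norm(A - A_k, "fro")
tail = np.sqrt(np.sum(s[k:] ** 2))    # Eckart-Young: error equals the tail singular-value energy
print(err, tail)                      # the two numbers agree to rounding
```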

Truncated SVD (Latent Semantic Analysis)

  • Retains only the top $k$ singular values and corresponding vectors, directly computing the reduced representation without full decomposition
  • Works on sparse matrices without centering, making it essential for text data where TF-IDF matrices are sparse and centering destroys sparsity
  • Latent semantic structure emerges in text applications—documents and terms are mapped to a shared semantic space
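
A small latent-semantic-analysis style sketch; the toy documents below are made up for illustration:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "eigenvalues and eigenvectors of a symmetric matrix",
    "singular value decomposition factorizes a matrix",
    "gradient descent trains neural networks",
    "backpropagation computes gradients for neural networks",
]

tfidf = TfidfVectorizer().fit_transform(docs)   # sparse TF-IDF matrix, never centered
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(tfidf)          # documents embedded in a 2-D "semantic" space
print(doc_vectors.round(2))                     # 2-D coordinates for each document
```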

Non-negative Matrix Factorization (NMF)

  • Decomposes $A \approx WH$ where both $W$ and $H$ have non-negative entries, enabling parts-based representation
  • Interpretability advantage—components represent additive combinations (like facial features or topic mixtures) rather than canceling positive/negative contributions
  • Non-convex optimization means results depend on initialization; no uniqueness guarantee unlike SVD
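
A minimal scikit-learn sketch on synthetic non-negative data (think pixel intensities or word counts); the rank and initialization are illustrative choices:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = rng.random((60, 30))              # non-negative data matrix

model = NMF(n_components=5, init="nndsvd", max_iter=500, random_state=0)
W = model.fit_transform(A)            # non-negative sample weights
H = model.components_                 # non-negative parts (e.g., topics or image parts)

print(np.linalg.norm(A - W @ H, "fro"))   # reconstruction error of the additive model
```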

Compare: SVD vs. NMF—SVD gives the mathematically optimal low-rank approximation but allows negative values; NMF sacrifices optimality for interpretability through non-negativity constraints. For image decomposition or topic modeling where "negative features" make no sense, NMF wins.


Manifold Learning Techniques

When data lies on a curved surface (manifold) embedded in high-dimensional space, linear methods fail. These techniques attempt to "unfold" the manifold by preserving geometric relationships—either local neighborhoods or global geodesic distances.

Isomap

  • Geodesic distance preservation—builds a neighborhood graph, computes shortest paths between all pairs, then applies MDS to the distance matrix
  • Global structure focus distinguishes it from local methods; captures the intrinsic geometry of the entire manifold
  • Sensitive to noise and holes in the data; a single shortcut edge in the neighborhood graph can collapse the embedding

Locally Linear Embedding (LLE)

  • Local linearity assumption—each point is reconstructed as a weighted combination of its $k$ nearest neighbors, then those weights are preserved in the embedding
  • Solves a sparse eigenvalue problem making it computationally tractable for moderately sized datasets
  • Struggles with non-uniform sampling and can produce distorted embeddings when local neighborhoods have varying densities

Multidimensional Scaling (MDS)

  • Distance matrix as input—finds coordinates in low-dimensional space that best preserve pairwise distances (or dissimilarities)
  • Classical MDS uses eigendecomposition and is equivalent to PCA when distances are Euclidean; metric MDS optimizes stress functions for general dissimilarities
  • Visualization tool for understanding relationships; commonly used when you have similarity judgments rather than feature vectors

Compare: Isomap vs. LLE—both are manifold learning methods, but Isomap preserves global geodesic distances while LLE preserves local linear relationships. Isomap works better for smoothly curved manifolds; LLE handles more complex local geometry but may distort global structure.
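
A hedged side-by-side sketch on scikit-learn's synthetic swiss roll, a standard test manifold; the neighbor counts and sample size are illustrative:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding

X, color = make_swiss_roll(n_samples=600, noise=0.05, random_state=0)

embeddings = {
    "Isomap (geodesic distances)": Isomap(n_neighbors=10, n_components=2).fit_transform(X),
    "LLE (local weights)": LocallyLinearEmbedding(n_neighbors=10, n_components=2,
                                                  random_state=0).fit_transform(X),
    "MDS (pairwise distances)": MDS(n_components=2, random_state=0).fit_transform(X),
}
for name, Y in embeddings.items():
    print(name, Y.shape)              # each method returns 2-D coordinates to plot against `color`
```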


Probabilistic and Neighbor-Based Embeddings

These methods model the probability distributions of neighborhoods in high and low dimensions, optimizing embeddings to match these distributions. The key innovation is shifting from geometric distance preservation to probabilistic similarity preservation.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

  • Probabilistic similarity matching—converts distances to probabilities using Gaussian kernels in high-D and Student's t-distribution in low-D
  • Heavy-tailed t-distribution in the embedding space prevents the "crowding problem" where moderate distances all collapse together
  • Perplexity hyperparameter controls the effective number of neighbors; results can vary dramatically with different settings

Uniform Manifold Approximation and Projection (UMAP)

  • Topological foundation—constructs fuzzy simplicial complexes and optimizes cross-entropy between high-D and low-D representations
  • Preserves both local and global structure better than t-SNE while running significantly faster on large datasets
  • Theoretical grounding in Riemannian geometry and algebraic topology provides stronger mathematical justification than t-SNE's heuristics

Compare: t-SNE vs. UMAP—both excel at visualization and cluster separation, but UMAP is faster, scales better, and better preserves global structure. t-SNE remains popular for its well-understood behavior, but UMAP is increasingly preferred for production workflows. Neither produces consistent results across runs without fixing random seeds.
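
A hedged sketch of both on scikit-learn's digits data; the UMAP lines assume the third-party umap-learn package is installed, and the hyperparameter values are typical defaults rather than recommendations:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 1797 samples, 64 features

# t-SNE: perplexity roughly sets the effective neighborhood size
X_tsne = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=42).fit_transform(X)

# UMAP (pip install umap-learn): n_neighbors plays a similar role to perplexity
import umap
X_umap = umap.UMAP(n_neighbors=15, min_dist=0.1,
                   random_state=42).fit_transform(X)

print(X_tsne.shape, X_umap.shape)     # both give 2-D coordinates for plotting, colored by y
```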


Neural Network and Latent Variable Methods

These approaches learn nonlinear mappings through optimization, either via neural networks or statistical latent variable models. The flexibility of learned representations comes at the cost of interpretability and the need for careful regularization.

Autoencoders

  • Encoder-decoder architecture—the encoder maps input $x$ to latent code $z$, the decoder reconstructs $\hat{x}$, and training minimizes $\|x - \hat{x}\|^2$
  • Bottleneck layer forces compression; the dimensionality of $z$ determines the reduction, and nonlinear activations enable learning curved manifolds
  • Variational autoencoders (VAEs) add probabilistic structure by learning $p(z|x)$ and $p(x|z)$, enabling generative modeling alongside dimensionality reduction
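
A minimal PyTorch sketch of a plain (non-variational) autoencoder; the layer sizes, bottleneck width, and random training data are illustrative assumptions:

```python
import torch
from torch import nn

# Encoder-decoder with a 2-D bottleneck
encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))
decoder = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 64))
model = nn.Sequential(encoder, decoder)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                # reconstruction error ||x - x_hat||^2

X = torch.randn(512, 64)              # stand-in for real 64-feature data

for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), X)       # reconstruct the input from the bottleneck
    loss.backward()
    opt.step()

Z = encoder(X).detach()               # the 2-D latent codes are the reduced representation
print(Z.shape)
```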

Factor Analysis

  • Generative latent variable model—assumes observed variables $x = \Lambda f + \epsilon$ where $f$ are latent factors and $\epsilon$ is noise with diagonal covariance
  • Distinguishes signal from noise unlike PCA; the noise model allows factors to explain only the shared variance among variables
  • Maximum likelihood estimation or the EM algorithm fits the model; factor loadings $\Lambda$ reveal how latent constructs influence observed measurements
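
A brief scikit-learn sketch; the Iris data and the two-factor choice are just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X, _ = load_iris(return_X_y=True)

fa = FactorAnalysis(n_components=2, random_state=0)
F = fa.fit_transform(X)               # latent factor scores for each sample
print(fa.components_)                 # loadings Lambda: factors x observed variables
print(fa.noise_variance_)             # per-variable noise, the part PCA has no model for
```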

Independent Component Analysis (ICA)

  • Blind source separation—recovers independent source signals from observed mixtures assuming $x = As$ where sources $s$ are statistically independent
  • Non-Gaussianity is essential—the Central Limit Theorem means mixtures are more Gaussian than sources, so ICA maximizes non-Gaussianity to find original signals
  • Preprocessing with PCA is standard to whiten the data and reduce dimensionality before applying ICA
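
The classic blind-source-separation toy: mix a sine wave and a square wave, then unmix them; the mixing matrix and noise level are arbitrary illustrative values:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two independent, strongly non-Gaussian sources
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
S += 0.05 * np.random.default_rng(0).normal(size=S.shape)

A = np.array([[1.0, 0.5], [0.5, 1.0]])   # unknown mixing matrix
X = S @ A.T                              # observed mixtures x = A s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)             # recovered sources, up to order, sign, and scale
print(S_hat.shape)
```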

Compare: PCA vs. Factor Analysis—PCA finds orthogonal directions of maximum variance with no noise model; Factor Analysis explicitly models measurement error and seeks latent constructs. For psychometric scales or survey instruments where measurement noise matters, Factor Analysis is more appropriate.


Kernel Methods

When linear techniques fail to capture the structure in your data, kernel methods offer a principled way to work in higher-dimensional feature spaces without explicitly computing the transformation.

Kernel PCA

  • Kernel trick application—computes PCA in a feature space $\phi(x)$ implicitly through kernel evaluations $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$
  • Eigendecomposition of the kernel matrix $K$ yields principal components in feature space; common kernels include RBF ($e^{-\gamma\|x-y\|^2}$) and polynomial
  • No explicit low-dimensional coordinates—new points require kernel evaluations with training data, making out-of-sample extension more complex than linear PCA
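
A minimal scikit-learn sketch on concentric circles, a dataset where no linear projection separates the two rings but an RBF kernel can; the gamma value is an illustrative choice:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_linear = PCA(n_components=2).fit_transform(X)
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)

print(X_linear.shape, X_kpca.shape)   # plot each, colored by y, to compare separability
```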

Compare: PCA vs. Kernel PCA—standard PCA finds linear subspaces; Kernel PCA can capture nonlinear structure by implicitly working in high-dimensional feature spaces. The trade-off is computational cost ($O(n^3)$ for kernel matrix eigendecomposition) and the need to choose an appropriate kernel.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Variance maximization (unsupervised) | PCA, SVD, Truncated SVD |
| Class separation (supervised) | LDA |
| Matrix factorization with constraints | NMF (non-negativity), ICA (independence) |
| Local neighborhood preservation | LLE, t-SNE, UMAP |
| Global distance preservation | Isomap, MDS |
| Probabilistic/generative models | Factor Analysis, Autoencoders (VAE) |
| Nonlinear via kernels | Kernel PCA |
| Computational efficiency at scale | Random Projections, Truncated SVD, UMAP |

Self-Check Questions

  1. Both PCA and Factor Analysis identify latent structure in data. What is the key modeling difference between them, and when would you choose Factor Analysis over PCA?

  2. You have a dataset with 10,000 samples and 50,000 features that you need to reduce to 100 dimensions quickly. Which technique would you apply first, and why does the Johnson-Lindenstrauss lemma justify this choice?

  3. Compare t-SNE and Isomap: both are nonlinear dimensionality reduction techniques, but they preserve different types of structure. What does each prioritize, and how does this affect their typical use cases?

  4. An FRQ asks you to explain why NMF produces more interpretable components than SVD for image data. What mathematical constraint drives this difference, and what is sacrificed in exchange?

  5. You're applying ICA to separate mixed audio signals. Why would PCA fail at this task, and what statistical property of the source signals does ICA exploit that PCA ignores?