🧠 Machine Learning Engineering

Dimensionality Reduction Methods


Why This Matters

In machine learning, you'll constantly face the curse of dimensionality—as features increase, models become slower, more prone to overfitting, and harder to interpret. Dimensionality reduction techniques solve this by compressing data while preserving what matters most. You're being tested on understanding when to use each method, what structure it preserves, and the tradeoffs involved. These concepts appear in system design interviews, ML engineering assessments, and real-world pipeline decisions.

The key insight is that not all methods do the same thing. Some preserve global variance, others maintain local neighborhoods, and still others maximize class separation. Don't just memorize algorithm names—know whether a method is linear vs. nonlinear, supervised vs. unsupervised, and what geometric property it optimizes. This conceptual understanding will help you select the right tool and explain your reasoning under pressure.


Linear Variance-Based Methods

These techniques assume data lies on or near a linear subspace and work by finding directions that capture maximum variance or signal. They project data onto hyperplanes using matrix decompositions.

Principal Component Analysis (PCA)

  • Maximizes variance—finds orthogonal directions (principal components) that capture the most spread in the data
  • Eigenvalue decomposition of the covariance matrix identifies these directions; components are ranked by explained variance
  • Unsupervised and linear—excellent for preprocessing, noise reduction, and visualizing high-dimensional data when linear assumptions hold
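
As a rough sketch of how this looks in practice, here is PCA on synthetic data using scikit-learn; the array shapes and component count are illustrative assumptions, not prescribed by this guide:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))    # synthetic data: 200 samples, 50 features

pca = PCA(n_components=10)        # keep the 10 highest-variance directions
X_reduced = pca.fit_transform(X)  # projected data, shape (200, 10)

# components are ranked by explained variance; the cumulative curve
# is a common way to decide how many to keep
print(pca.explained_variance_ratio_.cumsum())
```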

Truncated Singular Value Decomposition (SVD)

  • Matrix factorization into UΣVᵀ—retains only the top-k singular values for dimensionality reduction
  • Works on sparse matrices—unlike PCA, doesn't require centering, making it ideal for NLP (LSA) and recommender systems
  • Identifies latent structure—the singular vectors reveal hidden patterns while filtering noise from the smaller components
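
A minimal sketch of Truncated SVD on a sparse matrix, assuming scikit-learn and a synthetic term-document-style matrix (the sizes and density are made up for illustration):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse term-document-style matrix: 1,000 docs x 5,000 terms
X = sparse_random(1000, 5000, density=0.01, format="csr", random_state=0)

svd = TruncatedSVD(n_components=100, random_state=0)  # no centering required
X_latent = svd.fit_transform(X)   # dense latent representation, (1000, 100)
print(svd.explained_variance_ratio_.sum())
```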

Factor Analysis

  • Models observed variables as linear combinations of latent factors plus noise, assuming X = Lf + ε
  • Separates common variance from unique variance—unlike PCA, explicitly accounts for measurement error
  • Interpretable factors—widely used in psychometrics and social sciences to identify underlying constructs like "intelligence" or "satisfaction"
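
To make the latent-factor idea concrete, here is a sketch with scikit-learn's FactorAnalysis on data generated from two hidden factors; the factor count and noise level are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
F = rng.normal(size=(300, 2))                # two hidden factors per sample
L = rng.normal(size=(2, 6))                  # loading matrix
X = F @ L + 0.3 * rng.normal(size=(300, 6))  # observed = factors x loadings + noise

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)     # estimated factor scores per sample
print(fa.components_.shape)      # estimated loadings, (2, 6)
print(fa.noise_variance_)        # per-feature unique (noise) variance
```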

Compare: PCA vs. Factor Analysis—both find linear combinations of features, but PCA maximizes total variance while Factor Analysis models shared variance and assumes latent factors cause observations. Use PCA for general compression; use Factor Analysis when you believe hidden constructs drive your measurements.


Supervised Projection Methods

When you have labeled data, you can do better than unsupervised variance maximization. These methods optimize for class separability, not just spread.

Linear Discriminant Analysis (LDA)

  • Maximizes between-class variance while minimizing within-class variance—finds projections that best separate known classes
  • Projects to at most k - 1 dimensions where k is the number of classes; combines dimensionality reduction with classification
  • Assumes Gaussian classes with equal covariance—works well for preprocessing before classifiers, reducing overfitting on high-dimensional data
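
An illustrative sketch, assuming scikit-learn and a synthetic three-class dataset (so at most two discriminant directions are available):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic labeled data: 3 classes, 100 features
X, y = make_classification(n_samples=600, n_features=100, n_informative=10,
                           n_classes=3, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=2)  # at most n_classes - 1 = 2
X_proj = lda.fit_transform(X, y)                  # supervised: labels required
print(X_proj.shape)  # (600, 2)
```

The projected features can then be fed to any downstream classifier.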

Compare: PCA vs. LDA—PCA is unsupervised and maximizes variance regardless of labels; LDA is supervised and maximizes class separation. If you have labels and want features for classification, LDA often outperforms PCA as a preprocessing step.


Nonlinear Manifold Learning

Real-world data often lies on curved surfaces (manifolds) embedded in high-dimensional space. These methods preserve geometric relationships that linear techniques miss.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

  • Preserves local neighborhoods—converts pairwise similarities to probabilities and minimizes KL divergence between high-D and low-D distributions
  • Heavy-tailed t-distribution in the low-D space prevents crowding and reveals cluster structure clearly
  • Visualization only—computationally expensive, non-deterministic, and doesn't provide a mapping for new points; use for exploration, not feature engineering
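
A small sketch, assuming scikit-learn's TSNE and its bundled digits dataset; note there is no transform step for unseen points, which is why it suits exploration rather than production pipelines:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 64-dimensional digit images

# TSNE has no transform() for new points; everything is embedded at once
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)          # use for plotting / exploration only
print(X_2d.shape)                     # (1797, 2)
```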

Isomap

  • Geodesic distances—approximates the true manifold distance by computing shortest paths through a neighborhood graph
  • Applies classical MDS to the geodesic distance matrix, preserving global geometry better than local methods
  • Sensitive to noise and holes—works beautifully on clean manifolds like the Swiss roll but can fail with sparse or noisy data
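
A sketch on the Swiss roll mentioned above, assuming scikit-learn (the neighbor count is an illustrative choice):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)  # clean 2-D manifold in 3-D

iso = Isomap(n_neighbors=10, n_components=2)  # neighborhood graph, then geodesic MDS
X_unrolled = iso.fit_transform(X)
print(X_unrolled.shape)  # (1000, 2)
```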

Locally Linear Embedding (LLE)

  • Reconstructs points from neighbors—finds weights that express each point as a linear combination of its k nearest neighbors
  • Preserves local geometry—the same weights are used to embed points in the low-dimensional space
  • Efficient and elegant—solves a sparse eigenvalue problem, but sensitive to neighborhood size and struggles with non-uniform sampling
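
A comparable sketch with scikit-learn's LocallyLinearEmbedding; the neighborhood size shown is an arbitrary starting point that usually needs tuning:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Results are sensitive to n_neighbors; treat 12 as a starting point to tune
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_embedded = lle.fit_transform(X)
print(X_embedded.shape)  # (1000, 2)
```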

Compare: t-SNE vs. Isomap vs. LLE—all handle nonlinear structure, but t-SNE optimizes for visualization and local clusters, Isomap preserves global geodesic distances, and LLE preserves local linear relationships. For publication-quality cluster visualization, use t-SNE; for understanding global manifold structure, try Isomap first.


Signal Separation Methods

Sometimes the goal isn't compression but unmixing—separating independent sources that have been combined. These methods assume the observed data is a mixture of hidden signals.

Independent Component Analysis (ICA)

  • Maximizes statistical independence—finds components that are as non-Gaussian and independent as possible
  • Blind source separation—the classic "cocktail party problem" where mixed audio signals are separated without knowing the sources
  • Assumes non-Gaussian sources—fails when sources are Gaussian, since independence then adds nothing beyond uncorrelatedness and the sources can only be recovered up to a rotation; the order and scale of components are arbitrary
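
A minimal cocktail-party-style sketch, assuming scikit-learn's FastICA and two synthetic non-Gaussian sources mixed by a made-up matrix:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                 # source 1: sine wave
s2 = np.sign(np.sin(3 * t))        # source 2: square wave (non-Gaussian)
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5],          # made-up mixing matrix
              [0.5, 1.0]])
X = S @ A.T                        # observed mixtures ("microphone" signals)

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)       # recovered sources; order and scale arbitrary
print(S_est.shape)                 # (2000, 2)
```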

Multidimensional Scaling (MDS)

  • Preserves pairwise distances—given a dissimilarity matrix, finds coordinates that best reproduce those distances in low-D space
  • Classical vs. metric vs. non-metric—classical MDS uses an eigendecomposition of (Euclidean) distances, metric MDS minimizes stress against the given distances, and non-metric MDS preserves only their rank order
  • Flexible input—works directly from distance matrices, useful when you have similarities but not feature vectors
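
A sketch assuming scikit-learn's MDS with a precomputed distance matrix; the distances here are derived from synthetic features purely for illustration:

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
D = pairwise_distances(rng.normal(size=(100, 20)))  # any dissimilarity matrix

# dissimilarity="precomputed" lets MDS start from distances rather than features
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
print(coords.shape)  # (100, 2)
```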

Compare: ICA vs. PCA—PCA finds uncorrelated components ordered by variance; ICA finds statistically independent components with no natural ordering. Use ICA when you believe your data is a mixture of independent sources (EEG signals, audio); use PCA for general-purpose compression.


Neural Network Approaches

Deep learning offers flexible, learnable dimensionality reduction that can capture complex nonlinear relationships. The encoder-decoder paradigm learns compressed representations end-to-end.

Autoencoders

  • Encoder-decoder architecture—compresses input to a bottleneck layer (latent representation), then reconstructs; trained to minimize reconstruction error
  • Nonlinear and flexible—with enough capacity, can learn arbitrarily complex manifolds; variants include denoising, sparse, and variational autoencoders
  • Feature learning powerhouse—the latent space can be used for downstream tasks, anomaly detection (high reconstruction error), or generative modeling (VAEs)
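
A minimal PyTorch sketch of the encoder-decoder idea on synthetic data; the layer sizes, bottleneck width, and training budget are illustrative assumptions, not a recommended architecture:

```python
import torch
from torch import nn

X = torch.randn(512, 100)  # synthetic, roughly standardized data

encoder = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 8))
decoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 100))
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(200):                  # train to minimize reconstruction error
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    optimizer.step()

Z = encoder(X).detach()               # 8-dimensional latent representation
print(Z.shape)                        # torch.Size([512, 8])
```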

Compare: PCA vs. Linear Autoencoder—a single-layer autoencoder with linear activations learns the same subspace as PCA. Add nonlinear activations and depth to capture structure that PCA misses. If your data has complex nonlinear relationships and you have enough samples, autoencoders often outperform classical methods.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Linear variance maximization | PCA, Truncated SVD |
| Supervised class separation | LDA |
| Local neighborhood preservation | t-SNE, LLE |
| Global manifold geometry | Isomap, MDS |
| Source separation / independence | ICA |
| Latent factor modeling | Factor Analysis |
| Learnable nonlinear compression | Autoencoders |
| Works on sparse/distance matrices | Truncated SVD, MDS |

Self-Check Questions

  1. You have a labeled dataset with 3 classes and want to reduce 100 features before training a classifier. Would you choose PCA or LDA, and why?

  2. Which two methods both preserve local neighborhood structure but differ in their optimization approach? What's the key difference in what they minimize?

  3. Your colleague used t-SNE to reduce features for a downstream classifier and got poor results. What's wrong with this approach, and what would you recommend instead?

  4. Compare and contrast: When would you choose ICA over PCA? Give a specific example where ICA would succeed and PCA would fail.

  5. You're building a pipeline to process new data points after training. Which methods from this guide cannot transform new points without retraining, and what alternatives would you use in production?