
Linear Algebra for Data Science

Principal Component Analysis Steps


Why This Matters

PCA is the workhorse of dimensionality reduction, and you'll encounter it everywhere in data science—from preprocessing high-dimensional datasets to building recommendation systems and visualizing complex data. But here's what exams really test: your understanding of the linear algebra mechanics underneath the algorithm. You're being tested on eigendecomposition, variance maximization, orthogonal projections, and matrix transformations—PCA just happens to be the perfect vehicle for all of these concepts.

Don't just memorize "standardize, then find eigenvectors." Know why each step exists and what linear algebra principle it demonstrates. When an FRQ asks you to explain why we use the covariance matrix or what eigenvalues actually represent, you need to connect the dots between the algorithm and the underlying mathematics. Master these connections, and PCA questions become straightforward applications of concepts you already understand.


Data Preparation: Setting Up the Matrix

Before any linear algebra magic happens, your data needs to be in the right form. Standardization ensures that the covariance matrix reflects true relationships between features, not artifacts of measurement scale.

Data Standardization

  • Z-score transformation—converts each feature to have $\mu = 0$ and $\sigma = 1$ using $z = \frac{x - \mu}{\sigma}$
  • Equal feature contribution ensures that variables measured in different units (dollars vs. counts) don't dominate the analysis due to scale alone
  • Covariance matrix sensitivity makes this step non-negotiable; without standardization, the largest-scale features dominate the covariance matrix, so PCA ends up tracking measurement units rather than meaningful structure (a short code sketch follows this list)
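
A minimal sketch of this step in NumPy; the array `X_raw` and its values are made-up placeholders, with samples as rows and features as columns:

```python
import numpy as np

# Placeholder data: 6 samples, 3 features on very different scales
# (price in dollars, room count, age in years)
X_raw = np.array([
    [120_000, 3, 34],
    [ 95_000, 2, 29],
    [150_000, 4, 45],
    [ 80_000, 1, 23],
    [110_000, 3, 38],
    [135_000, 4, 41],
], dtype=float)

# Z-score transformation: subtract each feature's mean, divide by its standard deviation
# (ddof=1 matches the n-1 convention used for the covariance matrix later)
mu = X_raw.mean(axis=0)
sigma = X_raw.std(axis=0, ddof=1)
X = (X_raw - mu) / sigma

print(np.round(X.mean(axis=0), 10))   # ~0 for every feature
print(X.std(axis=0, ddof=1))          # 1 for every feature
```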

The Covariance Matrix: Capturing Relationships

The covariance matrix is the foundation of PCA—it encodes everything about how your features relate to each other. This symmetric matrix contains all the information needed to find directions of maximum spread in your data.

Covariance Matrix Calculation

  • Pairwise relationships—element $C_{ij}$ of the covariance matrix measures how features $i$ and $j$ vary together, computed as $\text{Cov}(X_i, X_j) = \frac{1}{n-1}\sum_{k=1}^{n}(x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)$ (worked through in the sketch after this list)
  • Symmetric positive semi-definite structure guarantees real, non-negative eigenvalues—a critical property for the next steps
  • Diagonal entries represent each feature's variance; off-diagonal entries reveal correlations that PCA will exploit
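
Continuing the same sketch, the covariance matrix can be computed straight from the formula or with `np.cov`; here `X` is the standardized array from the previous snippet:

```python
# Covariance matrix from the definition: C = X^T X / (n - 1) for mean-centered X
n = X.shape[0]
C_manual = (X.T @ X) / (n - 1)

# Same matrix from NumPy; rowvar=False means columns are the features
C_numpy = np.cov(X, rowvar=False)

print(np.allclose(C_manual, C_numpy))    # True
print(np.allclose(C_numpy, C_numpy.T))   # True: symmetric, as the next steps require
```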

Compare: Covariance matrix vs. Correlation matrix—both capture feature relationships, but the correlation matrix is already standardized (values between -1 and 1). If your data is standardized first, they're identical. FRQs may ask when you'd use one over the other.
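
A quick way to see this equivalence, reusing the toy arrays from the snippets above:

```python
# Correlation matrix of the raw data equals the covariance matrix of the standardized data
corr_raw = np.corrcoef(X_raw, rowvar=False)
cov_std = np.cov(X, rowvar=False)
print(np.allclose(corr_raw, cov_std))   # True
```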


Eigendecomposition: Finding the Principal Directions

This is where core linear algebra takes center stage. Eigenvectors of the covariance matrix point in directions of maximum variance; eigenvalues tell you how much variance each direction captures.

Eigenvalue and Eigenvector Computation

  • Characteristic equation $\det(C - \lambda I) = 0$ yields eigenvalues $\lambda$, each representing variance along its corresponding eigenvector (computed numerically in the sketch after this list)
  • Eigenvectors define axes—these orthogonal vectors form the new coordinate system where your data will live after transformation
  • Spectral theorem guarantees that symmetric matrices (like covariance matrices) have orthogonal eigenvectors and real eigenvalues
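
A numerical sketch of this step, using `np.linalg.eigh` on the covariance matrix `C_numpy` from the earlier snippet (`eigh` is NumPy's routine for symmetric matrices):

```python
# eigh is designed for symmetric matrices: it returns real eigenvalues (in ascending order)
# and orthonormal eigenvectors, exactly as the spectral theorem guarantees
eigenvalues, eigenvectors = np.linalg.eigh(C_numpy)

print(eigenvalues)   # real and non-negative (up to floating-point rounding)
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(len(eigenvalues))))  # orthonormal columns
```

Using `eigh` rather than the general-purpose `eig` avoids spurious tiny imaginary parts in the output and returns the eigenvectors already orthonormalized.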

Sorting Eigenvectors by Eigenvalues

  • Descending order ranking—the eigenvector with the largest eigenvalue captures the most variance and becomes PC1
  • Variance hierarchy means each subsequent component captures progressively less information, enabling principled dimensionality reduction
  • Matrix formation stacks the top $k$ eigenvectors as columns to create the projection matrix $W$ (assembled in the snippet below)
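
Because `np.linalg.eigh` returns eigenvalues in ascending order, the sorting step reverses them before assembling $W$; the choice $k = 2$ below is just a placeholder for the 3-feature toy data:

```python
# Re-order from largest to smallest eigenvalue so that PC1 captures the most variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Projection matrix W: the top-k eigenvectors stacked as columns
k = 2
W = eigenvectors[:, :k]
print(W.shape)   # (3, 2), i.e. (p, k)
```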

Compare: Eigenvalues vs. Singular values—if you run SVD on the standardized data matrix rather than eigendecomposition on its covariance matrix, the two are linked by $\lambda_i = \sigma_i^2 / (n-1)$, where $\sigma_i$ are the singular values. SVD is often more numerically stable than explicitly forming the covariance matrix, which is why libraries like scikit-learn use it internally.
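
This relationship is easy to verify on the running toy example, taking the SVD of the standardized data matrix `X` rather than of the covariance matrix:

```python
# SVD of the standardized data matrix (not of the covariance matrix)
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Squared singular values scaled by 1/(n - 1) recover the covariance eigenvalues
print(np.allclose(S**2 / (n - 1), eigenvalues))   # True (both are in descending order here)
# The rows of Vt point along the same directions as the eigenvectors, up to sign
```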


Dimensionality Reduction: Making the Cut

Choosing how many components to keep is both art and science. The goal is to retain enough variance to preserve meaningful structure while eliminating noise and redundancy.

Selecting Principal Components

  • Explained variance threshold—common choices include retaining components until 90-95% cumulative variance is captured
  • Elbow method plots eigenvalues and looks for a "bend" where additional components add diminishing returns
  • Trade-off awareness is essential: keeping too few components loses signal, while keeping too many defeats the purpose of dimensionality reduction (a selection sketch follows this list)
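
A sketch of threshold-based selection; because the running toy example has only three features, the eigenvalue spectrum below is a made-up placeholder so that a 95% cutoff is meaningful:

```python
# Hypothetical descending eigenvalue spectrum for a higher-dimensional dataset
eigvals = np.array([4.2, 2.1, 0.9, 0.4, 0.2, 0.1, 0.05, 0.05])

explained_ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative explained variance reaches the threshold
threshold = 0.95
k_keep = int(np.argmax(cumulative >= threshold)) + 1
print(k_keep, cumulative[k_keep - 1])
```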

Calculating Explained Variance Ratio

  • Individual ratio for component $i$ equals $\frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j}$, showing each component's contribution
  • Cumulative sum helps determine the minimum number of components needed to hit your variance threshold
  • Scree plots visualize these ratios, making it easy to identify where variance drops off sharply (a plotting sketch follows this list)
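
A scree-plot sketch with matplotlib, reusing the hypothetical `explained_ratio` and `cumulative` arrays from the previous snippet:

```python
import matplotlib.pyplot as plt

components = np.arange(1, len(explained_ratio) + 1)

fig, ax = plt.subplots()
ax.bar(components, explained_ratio, label="individual ratio")
ax.plot(components, cumulative, marker="o", color="black", label="cumulative")
ax.axhline(0.95, linestyle="--", color="gray", label="95% threshold")
ax.set_xlabel("Principal component")
ax.set_ylabel("Explained variance ratio")
ax.set_title("Scree plot")
ax.legend()
plt.show()
```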

Compare: Keeping 2 components vs. keeping 10—with 2 components, you can visualize data in 2D but may lose important structure. With 10, you preserve more information but lose interpretability. The right choice depends on your downstream task.


Projection: Transforming the Data

The final step applies everything you've computed. Projection is a linear transformation that maps your original data onto the subspace spanned by the selected principal components.

Projecting Data onto Principal Components

  • Matrix multiplication $Z = XW$ transforms standardized data $X$ using projection matrix $W$ (columns are top eigenvectors)
  • Dimensionality change—if $X$ is $n \times p$ and you keep $k$ components, $Z$ is $n \times k$ where $k < p$
  • Orthogonal projection ensures the transformed features (principal components) are uncorrelated—no redundant information (see the projection sketch below)
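
The projection itself is one matrix multiplication; a sketch reusing the standardized `X` and projection matrix `W` from the earlier snippets:

```python
# Project the standardized data onto the top-k principal directions
Z = X @ W
print(Z.shape)   # (n, k) = (6, 2) for the toy data

# The new features are (numerically) uncorrelated: their covariance matrix is diagonal,
# and its diagonal entries are the top-k eigenvalues
print(np.round(np.cov(Z, rowvar=False), 6))
```

The diagonal covariance is the numerical face of the orthogonality argument above: projecting onto orthonormal eigenvectors diagonalizes the covariance matrix.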

Interpreting Results and Dimensionality Reduction

  • Loadings analysis examines eigenvector entries to understand which original features contribute most to each PC
  • Reconstruction is possible via $\hat{X} = ZW^T$, though information lost to discarded components cannot be recovered (see the sketch after this list)
  • Downstream applications include visualization (2-3 PCs), noise reduction, and preprocessing for models sensitive to multicollinearity
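
A sketch of the interpretation and reconstruction steps, again reusing `Z`, `W`, and `X` from the running example (the feature names are placeholders for the toy data):

```python
# Loadings: each column of W shows how strongly each original (standardized) feature
# contributes to that principal component
feature_names = ["price", "rooms", "age"]   # placeholder names for the toy features
for j in range(W.shape[1]):
    print(f"PC{j + 1} loadings:", dict(zip(feature_names, np.round(W[:, j], 3))))

# Reconstruction back into the standardized feature space; exact only when k = p
X_hat = Z @ W.T
print("reconstruction error:", np.linalg.norm(X - X_hat))
```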

Compare: PCA projection vs. feature selection—PCA creates new composite features (linear combinations), while feature selection keeps original features intact. PCA is better for correlated features; selection preserves interpretability.


Quick Reference Table

Concept | Best Examples
Matrix centering/scaling | Data standardization, z-score transformation
Symmetric matrix properties | Covariance matrix calculation, guaranteed orthogonal eigenvectors
Eigendecomposition | Eigenvalue/eigenvector computation, characteristic equation
Variance maximization | Sorting by eigenvalues, explained variance ratio
Orthogonal projection | Projecting data onto PCs, matrix multiplication $Z = XW$
Dimensionality trade-offs | Selecting principal components, elbow method
Linear transformation | Final projection, reconstruction via $\hat{X} = ZW^T$

Self-Check Questions

  1. Why must data be standardized before computing the covariance matrix, and what would happen to your principal components if you skipped this step with features on different scales?

  2. Which two PCA steps both rely directly on the eigenvalues of the covariance matrix, and how does each step use them differently?

  3. Compare and contrast the information contained in eigenvectors versus eigenvalues—if someone gave you only the eigenvectors, what could you determine about the data, and what would be missing?

  4. If your first three principal components explain 95% of the variance, what does this tell you about the effective dimensionality of your original dataset? How might this inform your modeling choices?

  5. Explain why the projected data $Z = XW$ has uncorrelated features. What property of the eigenvectors guarantees this, and why is it useful for downstream analysis?