👩‍💻 Foundations of Data Science

Key Regression Analysis Methods

Why This Matters

Regression analysis is the backbone of predictive modeling in data science—it's how we quantify relationships between variables and make data-driven predictions. You're being tested not just on knowing these methods exist, but on understanding when to use each one and why. The core concepts you need to master include linearity assumptions, regularization techniques, dimensionality reduction, and model selection trade-offs.

Think of regression methods as tools in a toolkit: a simple linear regression is your basic hammer, but sometimes you need the precision of regularization or the flexibility of polynomial terms. Don't just memorize formulas—know what problem each method solves, what assumptions it requires, and how it compares to alternatives. That's what separates strong exam responses from weak ones.


Linear Foundation Methods

These methods assume a linear relationship between predictors and outcomes. They're your starting point for regression analysis and the foundation for understanding more complex techniques.

Simple Linear Regression

  • Models the relationship between exactly two variables—one independent (X) and one dependent (Y), fitting the best straight line through your data
  • Core equation: $Y = \beta_0 + \beta_1 X + \varepsilon$, where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon$ represents error
  • Key assumptions: linearity, homoscedasticity (constant variance), and normally distributed residuals—violations compromise your predictions
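
As a concrete illustration, here is a minimal sketch of fitting a simple linear regression in Python with scikit-learn; the toy data, seed, and coefficient values are invented for the example, not taken from any particular dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: one predictor X and a roughly linear response y
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))              # shape (n_samples, 1)
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0, 1, 100)    # Y = beta0 + beta1*X + noise

model = LinearRegression().fit(X, y)
print("intercept (beta0):", model.intercept_)
print("slope (beta1):", model.coef_[0])
print("R^2:", model.score(X, y))
```

The fitted intercept and slope should land close to the true values of 2 and 3, with the gap driven by the noise term.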

Multiple Linear Regression

  • Extends simple regression to multiple predictors—the equation becomes $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$
  • Controls for confounding variables by isolating each predictor's unique contribution while holding others constant
  • Watch for multicollinearity—when predictors are highly correlated, coefficient estimates become unstable and interpretation suffers
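
A brief sketch of a multiple regression fit with statsmodels, plus a variance inflation factor (VIF) check for multicollinearity; the column names (ad_spend, store_visits, sales) and the simulated data are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical data frame with two predictors and one response
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ad_spend": rng.normal(100, 20, 200),
    "store_visits": rng.normal(50, 10, 200),
})
df["sales"] = 5 + 0.8 * df["ad_spend"] + 1.5 * df["store_visits"] + rng.normal(0, 5, 200)

X = sm.add_constant(df[["ad_spend", "store_visits"]])  # adds the intercept column
ols = sm.OLS(df["sales"], X).fit()
print(ols.params)   # estimates of beta0, beta1, beta2

# A common rule of thumb: VIF above roughly 5-10 signals problematic multicollinearity
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))
```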

Compare: Simple Linear Regression vs. Multiple Linear Regression—both assume linearity, but multiple regression handles real-world complexity where outcomes depend on several factors. If an FRQ asks about controlling for confounding variables, multiple regression is your answer.


Non-Linear Relationship Methods

When your data curves, linear methods fail. These approaches capture non-linear patterns while still using regression frameworks you already understand.

Polynomial Regression

  • Captures curved relationships by adding polynomial terms: $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_n X^n + \varepsilon$
  • Higher degrees increase flexibility but dramatically increase overfitting risk—the bias-variance trade-off in action
  • Still technically linear regression—it's linear in the coefficients, just non-linear in the features
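
A short sketch of polynomial regression as a scikit-learn pipeline: the polynomial terms are generated as new features and ordinary least squares is fit on them, which is why the model stays linear in the coefficients. The quadratic toy data is invented for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Curved toy data: quadratic in X plus noise
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(150, 1))
y = 1.0 - 2.0 * X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.5, 150)

# Degree-2 features (X, X^2), then ordinary least squares on those features
poly_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                           LinearRegression())
poly_model.fit(X, y)
print(poly_model.named_steps["linearregression"].coef_)  # estimates for beta1, beta2
```

Raising `degree` makes the curve more flexible, but on a dataset this small a high degree would start chasing noise, which is the overfitting risk noted above.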

Generalized Linear Models (GLMs)

  • Extends regression beyond normal distributions—handles count data (Poisson), binary outcomes (logistic), and other response types
  • Uses a link function to connect the linear predictor to the mean of the response variable, adapting to different data structures
  • Encompasses multiple regression types including logistic and Poisson regression—it's a framework, not a single method
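
As a sketch of the GLM framework, here is a Poisson regression for count data fit with statsmodels; the log link connects the linear predictor to the mean of the counts. The simulated data and coefficients are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

# Count-valued response simulated from a Poisson model with a log link
rng = np.random.default_rng(2)
x = rng.normal(0, 1, 300)
X = sm.add_constant(x)                  # intercept plus one predictor
mu = np.exp(0.3 + 0.7 * x)              # log link: log(mu) = beta0 + beta1*x
y = rng.poisson(mu)

glm = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(glm.params)   # beta0 and beta1 on the log scale
```

Swapping the family (for example, `sm.families.Binomial()` for binary outcomes) changes the assumed response distribution and link while keeping the same fitting framework.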

Compare: Polynomial Regression vs. GLMs—polynomial regression handles curved relationships with continuous outcomes, while GLMs handle different types of outcomes (binary, count, etc.). Know which problem each solves.


Classification Methods

When your outcome is categorical rather than continuous, you need methods designed for classification problems.

Logistic Regression

  • Predicts probability of categorical outcomes—despite the name, it's used for classification, not continuous prediction
  • Uses the logit function: $\text{logit}(P) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$, where $P$ is the probability of the event
  • Assumes linearity in log-odds—the relationship between predictors and the logarithm of the odds must be linear
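
A minimal sketch of logistic regression with scikit-learn; the two predictors and the true coefficients used to simulate the binary outcome are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Binary outcome simulated from a logistic model
rng = np.random.default_rng(3)
X = rng.normal(0, 1, size=(200, 2))
log_odds = -0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]
p = 1 / (1 + np.exp(-log_odds))         # inverse logit
y = rng.binomial(1, p)

clf = LogisticRegression().fit(X, y)
print("coefficients (log-odds scale):", clf.coef_)
print("P(y=1) for first 3 rows:", clf.predict_proba(X[:3])[:, 1])
```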

Regularization Methods

When you have many predictors or multicollinearity issues, standard regression overfits. Regularization adds penalties to keep coefficients in check.

Ridge Regression

  • Adds L2 penalty to the loss function—penalizes the square of coefficient magnitudes to shrink them toward zero
  • Handles multicollinearity by distributing weight across correlated predictors rather than arbitrarily choosing one
  • Never eliminates variables—coefficients shrink but don't reach exactly zero, so all predictors stay in the model
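
A quick sketch of ridge regression with scikit-learn, using two nearly duplicate predictors to show the L2 penalty sharing weight between them; the synthetic data and the alpha value are illustrative, and the predictors are standardized first because the penalty is sensitive to scale.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Two highly correlated predictors; only x1 truly drives y
rng = np.random.default_rng(4)
x1 = rng.normal(0, 1, 100)
x2 = x1 + rng.normal(0, 0.05, 100)       # nearly identical to x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(0, 1, 100)

ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(ridge.named_steps["ridge"].coef_)  # weight is spread across both; neither is exactly zero
```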

Lasso Regression

  • Adds L1 penalty using absolute values of coefficients—can shrink coefficients all the way to zero
  • Performs automatic variable selection—effectively removes irrelevant predictors, creating simpler, more interpretable models
  • Trade-off: introduces bias to reduce variance, but the resulting model often generalizes better to new data
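
A companion sketch of lasso regression on data where only two of ten predictors matter; the data, alpha, and seed are illustrative. Compare the printed coefficients with the ridge example above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Ten candidate predictors, but only the first two affect y
rng = np.random.default_rng(5)
X = rng.normal(0, 1, size=(200, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 1, 200)

lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)
coefs = lasso.named_steps["lasso"].coef_
print("coefficients:", coefs)
print("kept predictors:", np.flatnonzero(coefs))  # the irrelevant ones are typically exactly 0
```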

Compare: Ridge vs. Lasso—both regularize, but Ridge keeps all variables (just shrinks them) while Lasso can eliminate variables entirely. Use Lasso when you suspect many predictors are irrelevant; use Ridge when you want to keep everything but reduce multicollinearity effects.


Dimensionality Reduction Methods

When you have more predictors than observations—or severe multicollinearity—these methods transform your feature space before regression.

Principal Component Regression (PCR)

  • Combines PCA with regression—transforms correlated predictors into uncorrelated principal components, then regresses on those
  • Reduces dimensionality while retaining most variance in the predictors—but ignores the response variable during transformation
  • Addresses multicollinearity by construction, since principal components are orthogonal (uncorrelated) by definition
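
A sketch of principal component regression built as a scikit-learn pipeline: standardize, project onto a few principal components, then regress on them. The synthetic low-rank data and the choice of two components are illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Eight correlated predictors generated from two latent factors
rng = np.random.default_rng(6)
latent = rng.normal(0, 1, size=(150, 2))
X = latent @ rng.normal(0, 1, size=(2, 8)) + rng.normal(0, 0.1, size=(150, 8))
y = latent[:, 0] + rng.normal(0, 0.5, 150)

pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X, y)
print("R^2:", pcr.score(X, y))
```

Note that the PCA step never sees `y`; it keeps the directions of largest predictor variance whether or not they help predict the response.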

Partial Least Squares Regression (PLS)

  • Maximizes covariance with the response—unlike PCR, it considers the outcome variable when creating components
  • Ideal for high-dimensional data where predictors outnumber observations or are highly collinear
  • Balances two goals: reducing dimensionality and explaining variance in Y—often outperforms PCR for prediction
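
For comparison, a sketch of partial least squares on a wider-than-tall dataset (more predictors than observations), where ordinary least squares cannot even be fit uniquely; the simulated data and component count are illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# 60 observations, 100 predictors: p > n, and only two predictors matter
rng = np.random.default_rng(7)
X = rng.normal(0, 1, size=(60, 100))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.5, 60)

pls = PLSRegression(n_components=2).fit(X, y)
print("R^2:", pls.score(X, y))
```

Unlike the PCA step in PCR, the PLS components are chosen using `y`, so the directions kept are the ones most useful for prediction.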

Compare: PCR vs. PLS—both reduce dimensions, but PCR ignores Y when creating components while PLS specifically optimizes for predicting Y. PLS typically performs better for prediction; PCR is simpler to interpret.


Model Selection Methods

These approaches help you choose which predictors to include, balancing model complexity against predictive power.

Stepwise Regression

  • Iteratively adds or removes predictors based on statistical criteria like AIC or BIC—can be forward, backward, or bidirectional
  • Builds parsimonious models by testing whether each variable improves fit enough to justify its inclusion
  • Known limitations: can overfit to sample-specific patterns and may miss the globally optimal subset—use with caution
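
Stepwise selection is often written by hand; here is a rough sketch of greedy forward selection by AIC using statsmodels. The helper `forward_select`, the toy data, and the column names are all invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy data: five candidate predictors, but only "a" and "c" truly matter
rng = np.random.default_rng(8)
df = pd.DataFrame(rng.normal(0, 1, size=(200, 5)), columns=list("abcde"))
y = 1.5 * df["a"] - 2.0 * df["c"] + rng.normal(0, 1, 200)

def forward_select(y, df):
    """Greedy forward selection: repeatedly add the predictor that lowers AIC the most."""
    selected, remaining = [], list(df.columns)
    best_aic = sm.OLS(y, np.ones(len(y))).fit().aic        # intercept-only baseline
    while remaining:
        aics = {c: sm.OLS(y, sm.add_constant(df[selected + [c]])).fit().aic
                for c in remaining}
        best_col = min(aics, key=aics.get)
        if aics[best_col] >= best_aic:                      # no candidate improves AIC: stop
            break
        best_aic = aics[best_col]
        selected.append(best_col)
        remaining.remove(best_col)
    return selected

print(forward_select(y, df))   # typically ['a', 'c'] on this toy data
```

Backward elimination and bidirectional search follow the same pattern with removals as well as additions, and the overfitting caveat above applies to all three.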

Compare: Stepwise Regression vs. Lasso—both perform variable selection, but stepwise uses discrete add/remove decisions while Lasso uses continuous shrinkage. Lasso is generally more stable and less prone to overfitting.


Quick Reference Table

Concept | Best Examples
Linear relationships | Simple Linear Regression, Multiple Linear Regression
Non-linear relationships | Polynomial Regression, GLMs
Binary/categorical outcomes | Logistic Regression, GLMs
Regularization (keeps all variables) | Ridge Regression
Regularization (variable selection) | Lasso Regression
Multicollinearity solutions | Ridge, PCR, PLS
Dimensionality reduction | PCR, PLS
Model selection | Stepwise Regression, Lasso

Self-Check Questions

  1. Which two regression methods both use regularization but differ in whether they can eliminate variables entirely? What mathematical difference causes this?

  2. You have a dataset with 50 observations and 200 predictors. Which methods would be appropriate, and why would simple multiple regression fail?

  3. Compare and contrast PCR and PLS: What do they share, and what key difference affects their predictive performance?

  4. A colleague wants to predict whether customers will churn (yes/no). They suggest using multiple linear regression. What method should they use instead, and why?

  5. If an FRQ asks you to address multicollinearity in a regression model, what three distinct approaches could you discuss, and how does each solve the problem differently?