Why This Matters
Regression analysis is the backbone of predictive modeling in data science—it's how we quantify relationships between variables and make data-driven predictions. You're being tested not just on knowing these methods exist, but on understanding when to use each one and why. The core concepts you need to master include linearity assumptions, regularization techniques, dimensionality reduction, and model selection trade-offs.
Think of regression methods as tools in a toolkit: a simple linear regression is your basic hammer, but sometimes you need the precision of regularization or the flexibility of polynomial terms. Don't just memorize formulas—know what problem each method solves, what assumptions it requires, and how it compares to alternatives. That's what separates strong exam responses from weak ones.
Linear Foundation Methods
These methods assume a linear relationship between predictors and outcomes. They're your starting point for regression analysis and the foundation for understanding more complex techniques.
Simple Linear Regression
- Models the relationship between exactly two variables—one independent (X) and one dependent (Y), fitting the best straight line through your data
- Core equation: Y = β₀ + β₁X + ε, where β₀ is the intercept, β₁ is the slope, and ε represents error
- Key assumptions: linearity, homoscedasticity (constant variance), and normally distributed residuals—violations compromise your predictions
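The closed-form fit is short enough to write out directly. A minimal NumPy sketch, using synthetic data and invented coefficients purely for illustration:

```python
import numpy as np

# Minimal sketch: fit Y = b0 + b1*X by ordinary least squares.
# The data below is synthetic and purely illustrative.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = 2.0 + 3.0 * X  # a noiseless line, so the fit is exact

b1 = np.cov(X, Y, bias=True)[0, 1] / np.var(X)  # slope = Cov(X, Y) / Var(X)
b0 = Y.mean() - b1 * X.mean()                   # intercept from the sample means
print(b0, b1)  # recovers intercept 2 and slope 3 (up to float rounding)
```

The slope formula Cov(X, Y)/Var(X) is exactly what minimizing squared error produces for one predictor.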
Multiple Linear Regression
- Extends simple regression to multiple predictors—the equation becomes Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε
- Controls for confounding variables by isolating each predictor's unique contribution while holding others constant
- Watch for multicollinearity—when predictors are highly correlated, coefficient estimates become unstable and interpretation suffers
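The mechanics carry over directly from the simple case: stack the predictors into a design matrix and solve one least-squares problem. A sketch on synthetic data with invented coefficients:

```python
import numpy as np

# Sketch: multiple regression with two predictors via least squares.
# The coefficients (1, 2, -0.5) are invented for illustration.
rng = np.random.default_rng(0)
X1 = rng.normal(size=100)
X2 = rng.normal(size=100)
X = np.column_stack([np.ones(100), X1, X2])  # intercept column plus predictors
Y = 1.0 + 2.0 * X1 - 0.5 * X2                # noiseless, so lstsq recovers exactly

beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta)  # approximately [1.0, 2.0, -0.5]
```

Each fitted coefficient is the predictor's contribution holding the other column constant, which is the "controlling for confounders" idea in code form.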
Compare: Simple Linear Regression vs. Multiple Linear Regression—both assume linearity, but multiple regression handles real-world complexity where outcomes depend on several factors. If an FRQ asks about controlling for confounding variables, multiple regression is your answer.
Non-Linear Relationship Methods
When your data curves, linear methods fail. These approaches capture non-linear patterns while still using regression frameworks you already understand.
Polynomial Regression
- Captures curved relationships by adding polynomial terms: Y = β₀ + β₁X + β₂X² + ... + βₙXⁿ + ε
- Higher degrees increase flexibility but dramatically increase overfitting risk—the bias-variance trade-off in action
- Still technically linear regression—it's linear in the coefficients, just non-linear in the features
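The "linear in the coefficients" point is easiest to see in code: a quadratic fit is just ordinary least squares on an expanded feature matrix. A sketch with a made-up quadratic:

```python
import numpy as np

# Sketch: degree-2 polynomial regression is still ordinary least squares,
# because the model is linear in b0, b1, b2 even though it is curved in x.
x = np.linspace(-2.0, 2.0, 50)
y = 1.0 - 2.0 * x + 0.5 * x**2                   # a known quadratic, no noise

X = np.column_stack([np.ones_like(x), x, x**2])  # polynomial feature matrix
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [1.0, -2.0, 0.5]
```

Adding higher powers just adds columns to X; nothing about the solver changes, which is why overfitting risk grows so easily with degree.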
Generalized Linear Models (GLMs)
- Extends regression beyond normal distributions—handles count data (Poisson), binary outcomes (logistic), and other response types
- Uses a link function to connect the linear predictor to the mean of the response variable, adapting to different data structures
- Encompasses multiple regression types including logistic and Poisson regression—it's a framework, not a single method
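As one concrete instance of the framework, a Poisson GLM with a log link can be fit by Newton's method (iteratively reweighted least squares). This is a hand-rolled sketch on synthetic data with invented coefficients, not a production fitter:

```python
import numpy as np

# Sketch: Poisson GLM with log link, fit by Newton's method.
# The linear predictor eta = X @ beta maps to the mean via mu = exp(eta),
# the inverse of the log link.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
true_beta = np.array([0.5, 1.0])               # invented for illustration
y = rng.poisson(np.exp(X @ true_beta))         # count-valued response

beta = np.zeros(2)
for _ in range(25):                            # Newton/IRLS iterations
    mu = np.exp(X @ beta)                      # inverse link: mean of response
    grad = X.T @ (y - mu)                      # score of the Poisson log-likelihood
    hess = X.T @ (X * mu[:, None])             # Fisher information
    beta = beta + np.linalg.solve(hess, grad)
print(beta)  # roughly recovers [0.5, 1.0]
```

Swapping the link and the weight terms is all it takes to move between GLM family members, which is why logistic and Poisson regression share one framework.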
Compare: Polynomial Regression vs. GLMs—polynomial regression handles curved relationships with continuous outcomes, while GLMs handle different types of outcomes (binary, count, etc.). Know which problem each solves.
Classification Methods
When your outcome is categorical rather than continuous, you need methods designed for classification problems.
Logistic Regression
- Predicts probability of categorical outcomes—despite the name, it's used for classification, not continuous prediction
- Uses the logit function: logit(P) = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ, where P is the probability of the event
- Assumes linearity in log-odds—the relationship between predictors and the logarithm of the odds must be linear
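A hand-rolled Newton's-method fit on synthetic data makes the log-odds linearity concrete; the coefficients (-0.5, 2.0) are invented for illustration:

```python
import numpy as np

# Sketch: logistic regression fit by Newton's method.
# logit(P) = b0 + b1*x is linear, so P = 1 / (1 + exp(-(b0 + b1*x))).
rng = np.random.default_rng(2)
x = rng.normal(size=500)
X = np.column_stack([np.ones_like(x), x])
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + 2.0 * x)))
y = rng.binomial(1, p_true)                    # binary outcomes

beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))      # predicted probabilities
    grad = X.T @ (y - p)                       # gradient of the log-likelihood
    W = p * (1.0 - p)                          # per-observation weights
    hess = X.T @ (X * W[:, None])              # Fisher information
    beta = beta + np.linalg.solve(hess, grad)
print(beta)  # roughly recovers [-0.5, 2.0]
```

Note the output is a probability, which is then thresholded for classification; the linearity assumption lives entirely on the log-odds scale.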
Regularization Methods
When you have many predictors or multicollinearity issues, standard regression overfits. Regularization adds penalties to keep coefficients in check.
Ridge Regression
- Adds L2 penalty to the loss function—penalizes the square of coefficient magnitudes to shrink them toward zero
- Handles multicollinearity by distributing weight across correlated predictors rather than arbitrarily choosing one
- Never eliminates variables—coefficients shrink but don't reach exactly zero, so all predictors stay in the model
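The closed-form solution makes the shrinkage easy to demonstrate. A sketch on an invented 5-predictor problem:

```python
import numpy as np

# Sketch: ridge regression in closed form.
# Minimizing ||y - X b||^2 + lam ||b||^2 gives b = (X'X + lam I)^{-1} X'y.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 2.0, 0.0, 0.5, -1.0]) + rng.normal(scale=0.1, size=100)

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

b_small = ridge(X, y, 0.01)     # nearly ordinary least squares
b_large = ridge(X, y, 1000.0)   # heavy shrinkage toward zero
print(np.abs(b_large).sum() < np.abs(b_small).sum())  # True: coefficients shrink
print(np.all(b_large != 0))                           # True: but none hit exactly zero
```

The lam * I term also guarantees the matrix being inverted is well-conditioned even when X's columns are highly correlated, which is exactly the multicollinearity fix.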
Lasso Regression
- Adds L1 penalty using absolute values of coefficients—can shrink coefficients all the way to zero
- Performs automatic variable selection—effectively removes irrelevant predictors, creating simpler, more interpretable models
- Trade-off: introduces bias to reduce variance, but the resulting model often generalizes better to new data
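The mathematical reason Lasso can produce exact zeros is its soft-thresholding operator; in the special case of orthonormal predictors, each lasso coefficient is simply the OLS coefficient pushed toward zero and clipped. A sketch:

```python
import numpy as np

# Sketch: the soft-thresholding operator behind the lasso's exact zeros.
# With orthonormal predictors, lasso_b = soft_threshold(ols_b, lam).
# Ridge, by contrast, only rescales coefficients and never zeroes them.
def soft_threshold(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

ols = np.array([3.0, 0.4, -2.0, 0.1])
print(soft_threshold(ols, 0.5))  # the small coefficients become exactly 0
```

Coefficients whose magnitude falls below the penalty are clipped to exactly zero, which is the variable selection; the survivors are shifted toward zero by lam, which is the bias-for-variance trade.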
Compare: Ridge vs. Lasso—both regularize, but Ridge keeps all variables (just shrinks them) while Lasso can eliminate variables entirely. Use Lasso when you suspect many predictors are irrelevant; use Ridge when you want to keep everything but reduce multicollinearity effects.
Dimensionality Reduction Methods
When you have more predictors than observations—or severe multicollinearity—these methods transform your feature space before regression.
Principal Component Regression (PCR)
- Combines PCA with regression—transforms correlated predictors into uncorrelated principal components, then regresses on those
- Reduces dimensionality while retaining most variance in the predictors—but ignores the response variable during transformation
- Addresses multicollinearity by construction, since principal components are orthogonal (uncorrelated) by definition
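The two-stage structure is visible in a short sketch: synthetic data where two predictors are nearly collinear, so two components carry essentially all the predictor variance. Note that y plays no role in choosing the components:

```python
import numpy as np

# Sketch of principal component regression: project centered predictors
# onto the top-k principal components, then run least squares on the
# component scores. The components come from X alone, never from y.
rng = np.random.default_rng(4)
z1 = rng.normal(size=200)
z2 = rng.normal(size=200)
X = np.column_stack([z1, z1 + 0.01 * rng.normal(size=200), z2])  # cols 1, 2 collinear
y = X @ np.array([1.0, 1.0, -1.0])

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # PCA via SVD
k = 2
scores = Xc @ Vt[:k].T                          # orthogonal component scores
gamma, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
beta_pcr = Vt[:k].T @ gamma                     # map back to predictor space
print(beta_pcr)
```

Because the scores are orthogonal by construction, the second-stage regression has no multicollinearity at all, regardless of how correlated the original predictors were.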
Partial Least Squares Regression (PLS)
- Maximizes covariance with the response—unlike PCR, it considers the outcome variable when creating components
- Ideal for high-dimensional data where predictors outnumber observations or are highly collinear
- Balances two goals: reducing dimensionality and explaining variance in Y—often outperforms PCR for prediction
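The covariance-maximizing idea shows up directly in the first PLS weight vector, which is proportional to X'y. A one-component sketch on synthetic data where only one of ten predictors matters:

```python
import numpy as np

# Sketch: the first component of PLS. Where PCA picks the direction of
# maximum variance in X alone, the PLS weight vector w is proportional
# to X'y, so it points toward predictors that covary with the response.
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)  # only column 0 matters

Xc = X - X.mean(axis=0)
yc = y - y.mean()
w = Xc.T @ yc
w /= np.linalg.norm(w)       # first PLS weight vector
t = Xc @ w                   # first PLS component score
coef = (t @ yc) / (t @ t)    # regress y on that single component
print(w[0])  # the informative predictor dominates the weight vector
```

A PCA direction on this data would ignore which column drives y; PLS finds it immediately, which is the intuition behind PLS's edge in prediction.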
Compare: PCR vs. PLS—both reduce dimensions, but PCR ignores Y when creating components while PLS specifically optimizes for predicting Y. PLS typically performs better for prediction; PCR is simpler to interpret.
Model Selection Methods
These approaches help you choose which predictors to include, balancing model complexity against predictive power.
Stepwise Regression
- Iteratively adds or removes predictors based on statistical criteria like AIC or BIC—can be forward, backward, or bidirectional
- Builds parsimonious models by testing whether each variable improves fit enough to justify its inclusion
- Known limitations: can overfit to sample-specific patterns and may miss the globally optimal subset—use with caution
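A minimal forward-selection sketch shows the add-one-at-a-time logic, scored by AIC on synthetic data where only two of six predictors are truly relevant:

```python
import numpy as np

# Sketch: greedy forward selection by AIC using plain least squares.
# At each step, add the predictor that most improves AIC; stop when
# no remaining candidate helps.
rng = np.random.default_rng(6)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + rng.normal(size=n)  # only cols 1 and 4 matter

def aic(cols):
    Xs = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    return n * np.log(rss / n) + 2 * (len(cols) + 1)  # fit term + complexity penalty

selected = []
while True:
    candidates = [c for c in range(p) if c not in selected]
    if not candidates:
        break
    best = min(candidates, key=lambda c: aic(selected + [c]))
    if aic(selected + [best]) >= aic(selected):
        break
    selected.append(best)
print(sorted(selected))  # includes the truly relevant predictors 1 and 4;
                         # weak noise columns occasionally sneak in too
```

That last comment is the known limitation in action: greedy discrete decisions can admit sample-specific noise, which is why Lasso's continuous shrinkage tends to be more stable.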
Compare: Stepwise Regression vs. Lasso—both perform variable selection, but stepwise uses discrete add/remove decisions while Lasso uses continuous shrinkage. Lasso is generally more stable and less prone to overfitting.
Quick Reference Table
| Problem | Method(s) |
| --- | --- |
| Linear relationships | Simple Linear Regression, Multiple Linear Regression |
| Non-linear relationships | Polynomial Regression, GLMs |
| Binary/categorical outcomes | Logistic Regression, GLMs |
| Regularization (keeps all variables) | Ridge Regression |
| Regularization (variable selection) | Lasso Regression |
| Multicollinearity solutions | Ridge, PCR, PLS |
| Dimensionality reduction | PCR, PLS |
| Model selection | Stepwise Regression, Lasso |
Self-Check Questions
1. Which two regression methods both use regularization but differ in whether they can eliminate variables entirely? What mathematical difference causes this?
2. You have a dataset with 50 observations and 200 predictors. Which methods would be appropriate, and why would simple multiple regression fail?
3. Compare and contrast PCR and PLS: What do they share, and what key difference affects their predictive performance?
4. A colleague wants to predict whether customers will churn (yes/no). They suggest using multiple linear regression. What method should they use instead, and why?
5. If an FRQ asks you to address multicollinearity in a regression model, what three distinct approaches could you discuss, and how does each solve the problem differently?