Regression analysis is the backbone of predictive modeling in data science—it's how we quantify relationships between variables and make data-driven predictions. You're being tested not just on knowing these methods exist, but on understanding when to use each one and why. The core concepts you need to master include linearity assumptions, regularization techniques, dimensionality reduction, and model selection trade-offs.
Think of regression methods as tools in a toolkit: a simple linear regression is your basic hammer, but sometimes you need the precision of regularization or the flexibility of polynomial terms. Don't just memorize formulas—know what problem each method solves, what assumptions it requires, and how it compares to alternatives. That's what separates strong exam responses from weak ones.
These methods assume a linear relationship between predictors and outcomes. They're your starting point for regression analysis and the foundation for understanding more complex techniques.
Compare: Simple Linear Regression vs. Multiple Linear Regression—both assume linearity, but multiple regression handles real-world complexity where outcomes depend on several factors. If an FRQ asks about controlling for confounding variables, multiple regression is your answer.
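Here's a minimal sketch of that contrast using scikit-learn and synthetic data (the variable names, coefficients, and the confounding setup are purely illustrative): the simple model's slope absorbs the confounder's effect, while the multiple model controls for it.

```python
# Minimal sketch (illustrative, not from the text): simple vs. multiple linear
# regression with scikit-learn. store_size is a confounder of ad_spend -> sales.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
store_size = rng.uniform(50, 500, n)                  # confounder
ad_spend = 0.2 * store_size + rng.normal(0, 10, n)    # correlated with the confounder
sales = 2.0 * ad_spend + 0.5 * store_size + rng.normal(0, 10, n)

# Simple linear regression: one predictor, confounding left uncontrolled
simple = LinearRegression().fit(ad_spend.reshape(-1, 1), sales)

# Multiple linear regression: controls for store_size as well
X = np.column_stack([ad_spend, store_size])
multiple = LinearRegression().fit(X, sales)

print("simple slope:", simple.coef_)        # inflated by the confounder
print("multiple slopes:", multiple.coef_)   # close to the true 2.0 and 0.5
```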
When your data curves, linear methods fail. These approaches capture non-linear patterns while still using regression frameworks you already understand.
Compare: Polynomial Regression vs. GLMs—polynomial regression handles curved relationships with continuous outcomes, while GLMs handle different types of outcomes (binary, count, etc.). Know which problem each solves.
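A hedged sketch of both ideas in scikit-learn with synthetic data: PolynomialFeatures captures the curve for a continuous outcome, while PoissonRegressor is one example of a GLM for count outcomes (the degree, coefficients, and sample sizes below are illustrative).

```python
# Minimal sketch (synthetic data): polynomial regression for a curved continuous
# outcome, and one example of a GLM (Poisson regression) for a count outcome.
import numpy as np
from sklearn.linear_model import LinearRegression, PoissonRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, (300, 1))

# Curved continuous outcome: a quadratic trend plus noise
y_curved = 1.5 * x[:, 0] ** 2 - x[:, 0] + rng.normal(0, 1, 300)
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y_curved)

# Count outcome: GLM with a log link
y_counts = rng.poisson(np.exp(0.8 * x[:, 0]))
glm_model = PoissonRegressor().fit(x, y_counts)

print("polynomial R^2:", poly_model.score(x, y_curved))
print("Poisson D^2 (deviance explained):", glm_model.score(x, y_counts))
```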
When your outcome is categorical rather than continuous, you need methods designed for classification. Logistic regression is the standard choice for binary outcomes: it models the log-odds of the outcome as a linear function of the predictors and returns probabilities rather than unbounded continuous predictions.
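A minimal sketch with scikit-learn and made-up data (feature names and coefficients are illustrative); note that the model outputs class probabilities via predict_proba instead of raw continuous values.

```python
# Minimal sketch (synthetic data): logistic regression for a binary outcome.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
log_odds = 1.2 * X[:, 0] - 0.8 * X[:, 1]              # linear in the predictors
y = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))      # binary outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print("class probabilities:", clf.predict_proba(X_test[:3]))  # not unbounded values
```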
When you have many predictors or multicollinearity issues, standard regression overfits. Regularization adds penalties to keep coefficients in check.
Compare: Ridge vs. Lasso—both regularize, but Ridge keeps all variables (just shrinks them) while Lasso can eliminate variables entirely. Use Lasso when you suspect many predictors are irrelevant; use Ridge when you want to keep everything but reduce multicollinearity effects.
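To make the contrast concrete, here's a minimal sketch assuming scikit-learn and synthetic data where only a few predictors matter; the alpha values are illustrative, and in practice you'd tune them with cross-validation.

```python
# Minimal sketch (synthetic data): Ridge shrinks every coefficient, Lasso can
# zero some out. Alpha values are illustrative; tune them with cross-validation.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(0, 1, n)  # only 3 predictors matter

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks, never exactly zero
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: can set coefficients to zero

print("ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))  # typically all 20
print("lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))  # typically just a few
```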
When you have more predictors than observations—or severe multicollinearity—these methods transform your feature space before regression.
Compare: PCR vs. PLS—both reduce dimensions, but PCR ignores Y when creating components while PLS specifically optimizes for predicting Y. PLS typically performs better for prediction; PCR is simpler to interpret.
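A hedged sketch of both in scikit-learn: PCR is just PCA followed by linear regression (components built from X alone), while PLSRegression builds components that covary with Y. The data, component count, and fold count below are illustrative; choose n_components by cross-validation in practice.

```python
# Minimal sketch (synthetic data): PCR builds components from X alone, PLS builds
# components that covary with y. n_components=5 is illustrative; tune it via CV.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)
n, p = 50, 200                                       # more predictors than observations
X = rng.normal(size=(n, p))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(0, 0.5, n)

pcr = make_pipeline(PCA(n_components=5), LinearRegression())  # ignores y when reducing
pls = PLSRegression(n_components=5)                           # uses y when reducing

print("PCR cross-validated R^2:", cross_val_score(pcr, X, y, cv=5).mean())
print("PLS cross-validated R^2:", cross_val_score(pls, X, y, cv=5).mean())
```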
These approaches help you choose which predictors to include, balancing model complexity against predictive power.
Compare: Stepwise Regression vs. Lasso—both perform variable selection, but stepwise uses discrete add/remove decisions while Lasso uses continuous shrinkage. Lasso is generally more stable and less prone to overfitting.
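scikit-learn has no classic p-value-based stepwise procedure, but SequentialFeatureSelector captures the same discrete add/remove idea, so this sketch uses it as a stand-in next to a cross-validated Lasso; the data and settings are illustrative.

```python
# Minimal sketch (synthetic data): forward selection makes discrete keep/drop
# decisions, Lasso shrinks continuously. scikit-learn has no p-value stepwise,
# so SequentialFeatureSelector stands in for the stepwise idea here.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(5)
n, p = 200, 15
X = rng.normal(size=(n, p))
y = 4 * X[:, 0] - 2 * X[:, 3] + rng.normal(0, 1, n)   # only columns 0 and 3 matter

stepwise = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward"
).fit(X, y)

lasso = LassoCV(cv=5).fit(X, y)   # alpha chosen by cross-validation

print("forward selection kept columns:", np.flatnonzero(stepwise.get_support()))
print("lasso nonzero columns:", np.flatnonzero(lasso.coef_ != 0))
```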
| Concept | Best Examples |
|---|---|
| Linear relationships | Simple Linear Regression, Multiple Linear Regression |
| Non-linear relationships | Polynomial Regression, GLMs |
| Binary/categorical outcomes | Logistic Regression, GLMs |
| Regularization (keeps all variables) | Ridge Regression |
| Regularization (variable selection) | Lasso Regression |
| Multicollinearity solutions | Ridge, PCR, PLS |
| Dimensionality reduction | PCR, PLS |
| Model selection | Stepwise Regression, Lasso |
1. Which two regression methods both use regularization but differ in whether they can eliminate variables entirely? What mathematical difference causes this?
2. You have a dataset with 50 observations and 200 predictors. Which methods would be appropriate, and why would simple multiple regression fail?
3. Compare and contrast PCR and PLS: what do they share, and what key difference affects their predictive performance?
4. A colleague wants to predict whether customers will churn (yes/no). They suggest using multiple linear regression. What method should they use instead, and why?
5. If an FRQ asks you to address multicollinearity in a regression model, what three distinct approaches could you discuss, and how does each solve the problem differently?