Regression analysis is the backbone of statistical inference and prediction—it's how we move from observing patterns to quantifying relationships and making data-driven decisions. In your intro stats course, you're being tested on your ability to choose the right regression method for different data types and research questions, interpret output correctly (coefficients, significance, fit), and recognize when assumptions are violated. These skills show up repeatedly in both multiple-choice questions and FRQs.
The methods below demonstrate core statistical principles: linearity vs. non-linearity, continuous vs. categorical outcomes, model complexity vs. interpretability, and the bias-variance tradeoff. Don't just memorize which regression does what—understand why you'd pick one over another and what the output actually tells you. That conceptual understanding is what separates a 3 from a 5.
These foundational methods assume your variables have a straight-line relationship. The key mechanism is minimizing the sum of squared residuals to find the best-fitting line through your data.
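To see the least-squares mechanism concretely, here's a minimal sketch in Python using NumPy on made-up data (the numbers and variable names are purely illustrative). Every candidate line has a sum of squared residuals, and the fitted line is the one that makes that sum as small as possible:

```python
import numpy as np

# made-up data with a roughly linear trend
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=x.size)

def ssr(intercept, slope):
    """Sum of squared residuals for one candidate line."""
    residuals = y - (intercept + slope * x)
    return np.sum(residuals ** 2)

# np.polyfit solves the least-squares problem directly (degree 1 = a line)
slope_hat, intercept_hat = np.polyfit(x, y, deg=1)

print("fitted line: intercept =", intercept_hat, "slope =", slope_hat)
print("SSR at the fitted line:", ssr(intercept_hat, slope_hat))
print("SSR for an arbitrary worse line:", ssr(0.0, 1.0))  # larger, as expected
```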
Compare: Simple vs. Multiple Linear Regression—both predict continuous outcomes using least squares, but multiple regression lets you isolate individual predictor effects and control for confounding. If an FRQ gives you several potential explanatory variables, multiple regression is your go-to.
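If you want to see that difference in practice, here's a rough sketch with scikit-learn on synthetic data (predictor names like study_hours and sleep_hours are invented for illustration). In the multiple regression, each coefficient estimates one predictor's effect while the other predictor is held fixed:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200
study_hours = rng.uniform(0, 10, n)
sleep_hours = rng.uniform(4, 9, n)
# the "true" relationship involves both predictors (made up for this sketch)
exam_score = 50 + 3 * study_hours + 2 * sleep_hours + rng.normal(0, 5, n)

# simple linear regression: one predictor
simple = LinearRegression().fit(study_hours.reshape(-1, 1), exam_score)

# multiple linear regression: both predictors, so each coefficient is the
# estimated effect of that variable with the other held fixed
X = np.column_stack([study_hours, sleep_hours])
multiple = LinearRegression().fit(X, exam_score)

print("simple regression slope:", simple.coef_)
print("multiple regression slopes (study, sleep):", multiple.coef_)
```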
When your scatterplot shows curves rather than lines, these methods capture relationships that linear models miss. The underlying principle is adding flexibility to your model while watching for overfitting.
Compare: Polynomial vs. Time Series Regression—both handle non-linearity, but polynomial regression addresses curved relationships between variables, while time series regression addresses temporal dependence. Choose based on whether your "X" is time itself or another continuous variable.
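Here's a small polynomial-regression sketch with scikit-learn on synthetic curved data (the degree and the data are made up; in a real problem you'd choose the degree carefully to avoid overfitting). The straight-line fit underperforms the degree-2 fit because the true relationship is curved:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
# a genuinely curved relationship plus noise
y = 1 + 2 * x.ravel() - 1.5 * x.ravel() ** 2 + rng.normal(0, 1, 100)

straight_line = LinearRegression().fit(x, y)
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

# the quadratic model should track the curve far better than the straight line
print("R^2, straight line :", straight_line.score(x, y))
print("R^2, degree-2 model:", quadratic.score(x, y))
```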
When your dependent variable isn't continuous—especially when it's binary—you need methods designed for classification rather than for predicting numeric values.
Compare: Linear vs. Logistic Regression—linear regression predicts continuous values and can give impossible predictions (like probabilities below 0 or above 1) for binary outcomes. Logistic regression constrains predicted probabilities between 0 and 1. Use logistic regression whenever your outcome is binary; extensions like multinomial logistic handle outcomes with more than two categories.
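A minimal sketch on synthetic churn-style data (the variable names are invented) shows why this matters: the linear model's predicted "probabilities" can fall outside [0, 1], while logistic regression's never do:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(3)
usage = rng.uniform(0, 100, 300).reshape(-1, 1)
# invented rule: heavier users are less likely to churn (1 = churn, 0 = stay)
p_churn = 1 / (1 + np.exp(0.08 * (usage.ravel() - 50)))
churned = rng.binomial(1, p_churn)

linear = LinearRegression().fit(usage, churned)
logistic = LogisticRegression().fit(usage, churned)

new_customers = np.array([[0.0], [50.0], [120.0]])
print("linear 'probabilities':", linear.predict(new_customers))                # may fall outside [0, 1]
print("logistic probabilities:", logistic.predict_proba(new_customers)[:, 1])  # always in [0, 1]
```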
These methods help you decide which predictors to include and prevent your model from becoming overly complex. The core tradeoff is between fitting your current data well and generalizing to new data.
Compare: Ridge vs. Lasso Regression—both add penalties to prevent overfitting, but ridge keeps all predictors (just shrinks them) while lasso can eliminate predictors entirely. Use ridge when you believe all predictors contribute; use lasso when you suspect many predictors are irrelevant.
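Here's a short sketch contrasting the two with scikit-learn on synthetic data where only 3 of 20 predictors actually matter (the alpha penalty values are arbitrary choices, not recommendations). Ridge keeps every coefficient nonzero but small; lasso typically zeroes out many of the irrelevant ones:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
n, p = 100, 20
X = rng.normal(size=(n, p))
true_coefs = np.zeros(p)
true_coefs[:3] = [4.0, -3.0, 2.0]          # only 3 of the 20 predictors matter
y = X @ true_coefs + rng.normal(0, 1, n)

ridge = Ridge(alpha=1.0).fit(X, y)         # L2 penalty: shrinks, never zeroes
lasso = Lasso(alpha=0.1).fit(X, y)         # L1 penalty: can zero coefficients out

print("ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))   # typically all 20
print("lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))   # typically far fewer
```

In a real analysis you would usually pick the penalty strength (alpha) by cross-validation rather than fixing it by hand.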
| Concept | Best Examples |
|---|---|
| Continuous outcome, linear relationship | Simple Linear Regression, Multiple Linear Regression |
| Categorical/binary outcome | Logistic Regression |
| Non-linear relationships | Polynomial Regression |
| Time-dependent data | Time Series Regression |
| Variable selection | Stepwise Regression, Lasso Regression |
| Handling multicollinearity | Ridge Regression |
| Preventing overfitting | Ridge Regression, Lasso Regression |
| High-dimensional data | Lasso Regression |
You have a dataset with 50 observations and 100 potential predictor variables. Which regression method would best help you identify the most important predictors while avoiding overfitting?
Compare and contrast ridge and lasso regression: What penalty does each use, and how does this affect which predictors remain in the final model?
A researcher wants to predict whether customers will churn (yes/no) based on their usage patterns. Why would logistic regression be more appropriate than linear regression for this problem?
Which two regression methods both address non-linear patterns in data, and what distinguishes when you'd use each one?
An FRQ describes a multiple regression output where two predictor variables are highly correlated, causing unstable coefficient estimates. What is this problem called, and which regression technique specifically addresses it?