🎲Intro to Statistics

Types of Regression Analysis


Why This Matters

Regression analysis is the backbone of statistical inference and prediction—it's how we move from observing patterns to quantifying relationships and making data-driven decisions. In your intro stats course, you're being tested on your ability to choose the right regression method for different data types and research questions, interpret output correctly (coefficients, significance, fit), and recognize when assumptions are violated. These skills show up repeatedly in both multiple-choice questions and FRQs.

The methods below demonstrate core statistical principles: linearity vs. non-linearity, continuous vs. categorical outcomes, model complexity vs. interpretability, and the bias-variance tradeoff. Don't just memorize which regression does what—understand why you'd pick one over another and what the output actually tells you. That conceptual understanding is what separates a 3 from a 5.


Modeling Linear Relationships

These foundational methods assume your variables have a straight-line relationship. The key mechanism is minimizing the sum of squared residuals to find the best-fitting line through your data.

Simple Linear Regression

  • One predictor, one outcome—the most basic regression form, modeling the relationship between two continuous variables as Y = a + bX (see the sketch after this list)
  • Slope (b) interpretation tells you the expected change in Y for each one-unit increase in X; the y-intercept (a) is Y's predicted value when X equals zero
  • Residual assumptions require that errors are normally distributed, independent, and have constant variance—check these before trusting your results
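
If you want to see what the slope and intercept mean in practice, here's a rough Python sketch on simulated data. The variable names and the "true" line Y = 55 + 4X are assumptions made up for illustration, not part of any real dataset.

```python
# A rough sketch, assuming simulated data: hours_studied and exam_score
# are made-up names, and the "true" line Y = 55 + 4X is an assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
hours_studied = rng.uniform(0, 10, size=50)                        # predictor X
exam_score = 55 + 4 * hours_studied + rng.normal(0, 5, size=50)    # outcome Y with noise

fit = stats.linregress(hours_studied, exam_score)

# Slope: expected change in Y for each one-unit increase in X.
# Intercept: predicted Y when X = 0.
print(f"slope     = {fit.slope:.2f}")
print(f"intercept = {fit.intercept:.2f}")
print(f"r-squared = {fit.rvalue**2:.3f}")

# Residuals are what you'd inspect for normality and constant variance.
residuals = exam_score - (fit.intercept + fit.slope * hours_studied)
```

Plotting those residuals against the fitted values is exactly how you'd check the assumptions in the last bullet before trusting the output.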

Multiple Linear Regression

  • Multiple predictors, one outcome—extends simple regression to Y = a + b_1X_1 + b_2X_2 + ... + b_nX_n, allowing you to model more realistic scenarios
  • Controlling for confounders is the major advantage; each coefficient represents that predictor's effect while holding all others constant
  • Same four assumptions apply: linearity, independence, normality of residuals, and equal variance (homoscedasticity), often remembered by the acronym LINE

Compare: Simple vs. Multiple Linear Regression—both predict continuous outcomes using least squares, but multiple regression lets you isolate individual predictor effects and control for confounding. If an FRQ gives you several potential explanatory variables, multiple regression is your go-to.
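
Here's a rough sketch of that "holding all others constant" idea, extending the earlier example to two predictors. The setup (score = 40 + 4·study + 2·sleep + noise) is an assumption for illustration only.

```python
# A rough sketch, assuming two made-up predictors; the data-generating
# process (score = 40 + 4*study + 2*sleep + noise) is an assumption.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
hours_studied = rng.uniform(0, 10, n)
hours_slept = rng.uniform(4, 9, n)
exam_score = 40 + 4 * hours_studied + 2 * hours_slept + rng.normal(0, 5, n)

X = sm.add_constant(np.column_stack([hours_studied, hours_slept]))  # adds intercept a
model = sm.OLS(exam_score, X).fit()

# Each slope is the expected change in Y for a one-unit change in that
# predictor while the other predictor is held constant.
print(model.params)    # [a, b_1, b_2], should land near [40, 4, 2]
print(model.pvalues)
```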


Handling Non-Linear Patterns

When your scatterplot shows curves rather than lines, these methods capture relationships that linear models miss. The underlying principle is adding flexibility to your model while watching for overfitting.

Polynomial Regression

  • Captures curves in data—uses powers of X to model non-linear relationships: Y = a + b_1X + b_2X^2 + ... + b_nX^n (see the sketch after this list)
  • Degree selection matters—higher degrees fit training data better but risk overfitting, where the model captures noise rather than true patterns
  • Still technically linear in its parameters (the b coefficients), which means ordinary least squares estimation still works
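
The sketch below shows degree selection in action: training error keeps falling as the degree rises, which is exactly the overfitting trap. The quadratic "truth" behind the simulated data is an assumption for illustration.

```python
# A rough sketch of degree selection; the quadratic "truth" is an assumption.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 40)
y = 2 + 1.5 * x - 0.8 * x**2 + rng.normal(0, 1, size=x.size)

# The model is linear in its coefficients, so least squares (polyfit) still works.
for degree in (1, 2, 9):
    coefs = np.polyfit(x, y, deg=degree)
    fitted = np.polyval(coefs, x)
    sse = np.sum((y - fitted) ** 2)
    print(f"degree {degree}: training SSE = {sse:.1f}")

# Training SSE always drops as the degree rises, but the degree-9 fit is
# mostly chasing noise (overfitting) rather than the true quadratic shape.
```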

Time Series Regression

  • Designed for sequential data—accounts for the fact that observations close in time tend to be related (autocorrelation)
  • Lagged variables include past values as predictors, capturing how yesterday's outcome influences today's
  • Stationarity requirement means the data's statistical properties shouldn't change over time; you may need to difference or transform data first

Compare: Polynomial vs. Time Series Regression—both handle non-linearity, but polynomial regression addresses curved relationships between variables, while time series regression addresses temporal dependence. Choose based on whether your "X" is time itself or another continuous variable.
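
To make the lagged-variable idea concrete, here's a rough sketch that regresses today's value on yesterday's. The series is simulated with a 0.7 autocorrelation, so that number is purely an assumption for illustration.

```python
# A rough sketch: simulate an autocorrelated series, then regress today's
# value on yesterday's. The 0.7 autocorrelation is an assumption.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + rng.normal(0, 1)

y_yesterday = y[:-1]    # lag-1 predictor
y_today = y[1:]         # outcome

model = sm.OLS(y_today, sm.add_constant(y_yesterday)).fit()
print(model.params)     # slope should land near 0.7, reflecting the autocorrelation
```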


Categorical Outcomes

When your dependent variable isn't continuous—especially when it's binary—you need methods designed for classification rather than prediction of values.

Logistic Regression

  • Binary outcomes only—predicts the probability of yes/no, success/failure, or any two-category outcome using an S-shaped logistic function
  • Odds ratios are the key output—coefficients represent the change in log-odds; exponentiate them to get how much the odds multiply for each unit increase in the predictor
  • No residual normality assumption—but requires independent observations and benefits from large samples for stable estimates

Compare: Linear vs. Logistic Regression—linear regression predicts continuous values and can give impossible predictions (like probabilities below 0 or above 1) for binary outcomes. Logistic regression constrains predicted probabilities between 0 and 1. Always use logistic regression when your outcome is binary.
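
Here's a rough sketch of a churn-style logistic regression showing how exponentiating the coefficients turns log-odds into odds ratios. The usage variable and the simulated relationship are assumptions for illustration.

```python
# A rough sketch of a churn model; the usage variable and the "true"
# log-odds relationship are assumptions for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
monthly_usage = rng.uniform(0, 20, n)
log_odds = 1.0 - 0.3 * monthly_usage            # churn gets less likely as usage rises
p_churn = 1 / (1 + np.exp(-log_odds))           # the S-shaped logistic function
churned = rng.binomial(1, p_churn)              # binary 0/1 outcome

model = sm.Logit(churned, sm.add_constant(monthly_usage)).fit(disp=False)

print(model.params)           # coefficients on the log-odds scale
print(np.exp(model.params))   # odds ratios: how the odds multiply per unit of usage
```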


Model Selection and Simplification

These methods help you decide which predictors to include and prevent your model from becoming overly complex. The core tradeoff is between fitting your current data well and generalizing to new data.

Stepwise Regression

  • Automated variable selection—systematically adds or removes predictors based on statistical criteria like p-values, AIC, or BIC
  • Three approaches: forward selection (start empty, add predictors), backward elimination (start full, remove predictors), or bidirectional (both); a forward-selection sketch follows this list
  • Improves interpretability by keeping only meaningful predictors, but can miss important variables if selection criteria are too strict
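
The sketch below hand-rolls forward selection using AIC as the criterion, with statsmodels doing the model fits. The candidate predictors (two real, two pure noise) and the data-generating process are assumptions made up for illustration.

```python
# A rough sketch of forward stepwise selection by AIC; the candidate
# predictors and simulated data are assumptions for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 150
data = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "noise1": rng.normal(size=n),
    "noise2": rng.normal(size=n),
})
y = 3 + 2 * data["x1"] - 1.5 * data["x2"] + rng.normal(0, 1, n)

selected, remaining = [], list(data.columns)
best_aic = sm.OLS(y, np.ones(n)).fit().aic        # intercept-only model

while remaining:
    # Try adding each remaining predictor; keep the one that lowers AIC most.
    aics = {var: sm.OLS(y, sm.add_constant(data[selected + [var]])).fit().aic
            for var in remaining}
    best_var = min(aics, key=aics.get)
    if aics[best_var] >= best_aic:
        break                                      # no improvement: stop adding
    best_aic = aics[best_var]
    selected.append(best_var)
    remaining.remove(best_var)

print("selected predictors:", selected)            # typically ['x1', 'x2']
```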

Ridge Regression

  • Adds a penalty for large coefficients—the penalty term (λ Σ b^2) shrinks coefficients toward zero without eliminating them
  • Solves multicollinearity problems—when predictors are highly correlated, ordinary least squares gives unstable estimates; ridge regression stabilizes them
  • Introduces bias deliberately—the estimates are biased but often have lower overall prediction error due to reduced variance

Lasso Regression

  • Performs automatic variable selection—uses an L1 penalty (λ Σ |b|) that can shrink coefficients all the way to zero
  • Creates sparse models—by eliminating weak predictors entirely, lasso produces simpler, more interpretable results
  • Ideal for high-dimensional data—when you have more predictors than observations, lasso helps identify which variables actually matter

Compare: Ridge vs. Lasso Regression—both add penalties to prevent overfitting, but ridge keeps all predictors (just shrinks them) while lasso can eliminate predictors entirely. Use ridge when you believe all predictors contribute; use lasso when you suspect many predictors are irrelevant.
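
A quick side-by-side sketch of the two penalties using scikit-learn. The data are simulated with mostly irrelevant predictors, so the exact coefficient counts are illustrative, not guaranteed.

```python
# A rough sketch contrasting the ridge (L2) and lasso (L1) penalties;
# the simulated data with mostly irrelevant predictors is an assumption.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(5)
n, p = 100, 20
X = rng.normal(size=(n, p))
# Only the first two predictors truly matter.
y = 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 1, n)

ridge = Ridge(alpha=1.0).fit(X, y)      # L2 penalty: lambda * sum(b^2)
lasso = Lasso(alpha=0.5).fit(X, y)      # L1 penalty: lambda * sum(|b|)

print("ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))   # typically all 20 stay
print("lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))   # typically only a handful survive
```

Notice how ridge keeps every coefficient non-zero (just shrunk) while lasso zeroes most of them out: that is the sparsity the bullets above describe.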


Quick Reference Table

Concept | Best Examples
Continuous outcome, linear relationship | Simple Linear Regression, Multiple Linear Regression
Categorical/binary outcome | Logistic Regression
Non-linear relationships | Polynomial Regression
Time-dependent data | Time Series Regression
Variable selection | Stepwise Regression, Lasso Regression
Handling multicollinearity | Ridge Regression
Preventing overfitting | Ridge Regression, Lasso Regression
High-dimensional data | Lasso Regression

Self-Check Questions

  1. You have a dataset with 50 observations and 100 potential predictor variables. Which regression method would best help you identify the most important predictors while avoiding overfitting?

  2. Compare and contrast ridge and lasso regression: What penalty does each use, and how does this affect which predictors remain in the final model?

  3. A researcher wants to predict whether customers will churn (yes/no) based on their usage patterns. Why would logistic regression be more appropriate than linear regression for this problem?

  4. Which two regression methods both address non-linear patterns in data, and what distinguishes when you'd use each one?

  5. An FRQ describes a multiple regression output where two predictor variables are highly correlated, causing unstable coefficient estimates. What is this problem called, and which regression technique specifically addresses it?