🎲 Intro to Statistics

Types of Regression Analysis


Why This Matters

Regression analysis is how we move from observing patterns to quantifying relationships and making predictions. In intro stats, you need to choose the right regression method for different data types, interpret output correctly (coefficients, significance, fit), and recognize when assumptions are violated.

The methods below cover core statistical principles: linearity vs. non-linearity, continuous vs. categorical outcomes, and model complexity vs. interpretability. Don't just memorize which regression does what. Understanding why you'd pick one over another and what the output actually tells you is what matters most on exams.


Modeling Linear Relationships

These foundational methods assume your variables have a straight-line relationship. They work by minimizing the sum of squared residuals to find the best-fitting line through your data.

Simple Linear Regression

This is the most basic regression form: one predictor, one outcome, both continuous. The model is:

Y = a + bX

The slope (b) tells you the expected change in Y for each one-unit increase in X. The y-intercept (a) is Y's predicted value when X equals zero (which sometimes doesn't make practical sense, so interpret it carefully).

Before trusting your results, check that the residuals are normally distributed, independent, and have constant variance.
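The least-squares slope and intercept can be computed directly from their formulas. A quick sketch with made-up data (hours studied vs. exam score, invented for illustration):

```python
import numpy as np

# Invented toy data: hours studied (X) vs. exam score (Y)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([52.0, 55.0, 61.0, 64.0, 68.0])

# Least-squares formulas: slope from co-deviations, intercept from the means
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()

# Residuals are what you inspect for normality and constant variance
residuals = Y - (a + b * X)
```

With an intercept in the model, the residuals always sum to zero; the assumption checks are about their shape and spread, not their total.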

Multiple Linear Regression

Multiple regression extends simple regression by including more than one predictor:

Y = a + b_1X_1 + b_2X_2 + ... + b_nX_n

The major advantage here is controlling for confounders. Each coefficient represents that predictor's effect while holding all other predictors constant. This lets you isolate individual relationships in a way simple regression can't.

The same four assumptions apply, often remembered as LINE:

  • Linearity between predictors and outcome
  • Independence of observations
  • Normality of residuals
  • Equal variance of residuals (homoscedasticity)
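In practice, a multiple regression is fit by least squares on a design matrix: one column of ones for the intercept, plus one column per predictor. A minimal sketch with invented noise-free data so the coefficients are recovered exactly:

```python
import numpy as np

# Invented noise-free data generated from Y = 2 + 3*X1 - 1.5*X2
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
Y = 2.0 + 3.0 * X1 - 1.5 * X2

# Design matrix: intercept column plus one column per predictor
design = np.column_stack([np.ones_like(X1), X1, X2])

# Solve the least-squares problem; each coefficient is that predictor's
# effect holding the other predictor constant
a, b1, b2 = np.linalg.lstsq(design, Y, rcond=None)[0]
```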

Compare: Simple vs. Multiple Linear Regression: both predict continuous outcomes using least squares, but multiple regression lets you isolate individual predictor effects and control for confounding. If a problem gives you several potential explanatory variables, multiple regression is your go-to.


Handling Non-Linear Patterns

When your scatterplot shows curves rather than lines, these methods capture relationships that linear models miss. The tradeoff is adding flexibility to your model while watching for overfitting.

Polynomial Regression

Polynomial regression uses powers of X to model curved relationships:

Y = a + b_1X + b_2X^2 + ... + b_nX^n

A quadratic (X^2) term captures a single curve, a cubic (X^3) captures an S-shape, and so on. Degree selection matters: higher degrees fit your current data better but risk overfitting, where the model captures random noise rather than the true underlying pattern.

One thing that surprises students: polynomial regression is still technically "linear" in its parameters (the b coefficients), so ordinary least squares estimation still works.
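That point is easy to see in code: fitting a quadratic is just ordinary least squares on a design matrix whose columns are 1, X, and X^2. A sketch with made-up noise-free data:

```python
import numpy as np

# Invented data generated from Y = 1 + 2X + 0.5X^2 (no noise, for illustration)
X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
Y = 1.0 + 2.0 * X + 0.5 * X ** 2

# The model is linear in the coefficients, so the design matrix just
# gains an X^2 column and least squares proceeds as usual
design = np.column_stack([np.ones_like(X), X, X ** 2])
b0, b1, b2 = np.linalg.lstsq(design, Y, rcond=None)[0]
```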

Time Series Regression

Time series regression is designed for sequential data where observations close in time tend to be related. This tendency is called autocorrelation, and it violates the independence assumption of standard regression.

To handle this, time series models use lagged variables, which include past values as predictors. For example, last month's sales might help predict this month's sales.

A key requirement is stationarity, meaning the data's statistical properties (like mean and variance) shouldn't change over time. If they do, you may need to difference the data (subtract consecutive values) or apply other transformations before fitting the model.
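Both ideas, lagging and differencing, can be sketched on an invented sales series with a steady trend:

```python
import numpy as np

# Invented monthly sales with a steady upward trend (+3 per month)
sales = np.array([100.0, 103.0, 106.0, 109.0, 112.0, 115.0])

# Differencing: subtract consecutive values to remove the trend
diff = np.diff(sales)

# Lagged-variable regression: predict this month from last month
lagged, current = sales[:-1], sales[1:]
design = np.column_stack([np.ones_like(lagged), lagged])
a, b = np.linalg.lstsq(design, current, rcond=None)[0]
```

For this perfectly linear series the differenced values are constant (the series is "stationary after differencing"), and the lag model recovers current = 3 + 1·lagged exactly.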

Compare: Polynomial vs. Time Series Regression: both handle non-linearity, but polynomial regression addresses curved relationships between variables, while time series regression addresses temporal dependence. Choose based on whether your "X" is time itself or another continuous variable.


Categorical Outcomes

When your dependent variable isn't continuous, you need a different approach. Standard linear regression can produce impossible predictions for categorical data, like probabilities below 0 or above 1.

Logistic Regression

Logistic regression predicts the probability of a binary outcome (yes/no, success/failure, 0/1) using an S-shaped logistic function that constrains predictions between 0 and 1.

The key output is odds ratios. The raw coefficients represent changes in log-odds, which aren't very intuitive. To interpret them, you exponentiate the coefficient: e^b tells you how much the odds multiply for each one-unit increase in the predictor. For example, if e^b = 1.5, the odds of the outcome increase by 50% for each unit increase in that predictor.

Logistic regression does not require normally distributed residuals, but it does require independent observations and benefits from large samples for stable estimates.
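The odds-ratio arithmetic is easy to verify by hand. A sketch using a hypothetical coefficient (the value ln(1.5) ≈ 0.405 is invented for illustration):

```python
import math

# Hypothetical fitted coefficient on the log-odds scale
b = math.log(1.5)
odds_ratio = math.exp(b)  # exponentiating recovers the 1.5x odds multiplier

def logistic(z):
    """S-shaped logistic function: maps any log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

p = logistic(0.0)  # log-odds of 0 corresponds to a probability of 0.5
```

However extreme the log-odds get, the logistic function keeps predicted probabilities strictly between 0 and 1, which is exactly what linear regression fails to guarantee.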

Compare: Linear vs. Logistic Regression: linear regression predicts continuous values and can give impossible predictions for binary outcomes. Logistic regression constrains predictions between 0 and 1. Always use logistic when your outcome is categorical.


Model Selection and Simplification

These methods help you decide which predictors to include and prevent your model from becoming overly complex. The core tradeoff: fitting your current data well vs. generalizing to new data.

Stepwise Regression

Stepwise regression is an automated approach to variable selection. It systematically adds or removes predictors based on statistical criteria like p-values, AIC (Akaike Information Criterion), or BIC (Bayesian Information Criterion).

Three approaches exist:

  1. Forward selection: start with no predictors, add the most significant one at each step
  2. Backward elimination: start with all predictors, remove the least significant one at each step
  3. Bidirectional: combine both, adding and removing at each step

This improves interpretability by keeping only meaningful predictors, but it can miss important variables if the selection criteria are too strict.
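Forward selection can be sketched in a few lines: at each step, add whichever remaining predictor most reduces the residual sum of squares (this sketch uses RSS directly rather than p-values or AIC, and the data are simulated):

```python
import numpy as np

def rss(cols, y):
    """Residual sum of squares after a least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(y))] + cols)
    coef = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ coef
    return float(resid @ resid)

def forward_select(X, y, n_keep):
    """Greedy forward selection: repeatedly add the predictor that lowers RSS most."""
    chosen, remaining = [], list(range(X.shape[1]))
    while len(chosen) < n_keep:
        best = min(remaining,
                   key=lambda j: rss([X[:, k] for k in chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Simulated data where only predictors 1 and 3 actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3] + rng.normal(scale=0.1, size=40)
selected = forward_select(X, y, 2)
```

Backward elimination is the mirror image: start from all five predictors and repeatedly drop the one whose removal increases RSS the least.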

Ridge Regression

When predictors are highly correlated with each other (multicollinearity), ordinary least squares gives wildly unstable coefficient estimates. Ridge regression fixes this by adding a penalty term for large coefficients:

λ ∑ b^2

This L2 penalty shrinks all coefficients toward zero but never eliminates them entirely. The parameter λ controls how strong the penalty is. Higher λ means more shrinkage.

Ridge regression deliberately introduces a small amount of bias into the estimates, but this tradeoff is often worth it because the reduced variance leads to lower overall prediction error.
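Ridge also has a convenient closed form, b = (XᵀX + λI)⁻¹Xᵀy. A sketch on simulated collinear data (intercept omitted for simplicity, which assumes centered variables):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (no intercept; assumes centered data)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Two nearly identical predictors -> severe multicollinearity
rng = np.random.default_rng(1)
base = rng.normal(size=50)
X = np.column_stack([base, base + rng.normal(scale=0.01, size=50)])
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=50)

b_ols = ridge_fit(X, y, 0.0)    # lam = 0 reduces to ordinary least squares
b_ridge = ridge_fit(X, y, 1.0)  # the penalty shrinks and stabilizes the estimates
```

With lam = 0 the two coefficients can swing far apart in opposite directions; the penalized fit pulls them back toward similar, moderate values.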

Lasso Regression

Lasso regression uses an L1 penalty instead:

λ ∑ |b|

The critical difference from ridge: lasso can shrink coefficients all the way to exactly zero, effectively removing those predictors from the model. This makes lasso a tool for automatic variable selection that produces simpler, more interpretable models.

Lasso is especially useful for high-dimensional data, where you have many predictors (possibly more than observations) and need to identify which variables actually matter.
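The exact zeros come from the soft-thresholding step inside lasso solvers. A tiny coordinate-descent sketch (simulated data with orthogonalized, rescaled columns, an assumption that keeps each update exact):

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; anything inside [-t, t] becomes exactly 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Minimal coordinate-descent lasso (assumes each column's squared norm is n)."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            partial = y - X @ b + X[:, j] * b[j]   # residual ignoring predictor j
            b[j] = soft_threshold(X[:, j] @ partial / n, lam)
    return b

# Simulated data: only the first of four predictors matters
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.normal(size=(100, 4)))
X = Q * 10.0                      # columns scaled so each squared norm is 100
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
b = lasso_cd(X, y, lam=0.5)
```

The relevant coefficient survives (shrunk from 3 toward 2.5 by the penalty) while the three irrelevant ones land at exactly zero, which is the automatic variable selection described above.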

Compare: Ridge vs. Lasso Regression: both add penalties to prevent overfitting, but ridge keeps all predictors (just shrinks them) while lasso can eliminate predictors entirely. Use ridge when you believe all predictors contribute; use lasso when you suspect many predictors are irrelevant.


Quick Reference Table

Concept | Best Examples
Continuous outcome, linear relationship | Simple Linear Regression, Multiple Linear Regression
Categorical/binary outcome | Logistic Regression
Non-linear relationships | Polynomial Regression
Time-dependent data | Time Series Regression
Variable selection | Stepwise Regression, Lasso Regression
Handling multicollinearity | Ridge Regression
Preventing overfitting | Ridge Regression, Lasso Regression
High-dimensional data | Lasso Regression

Self-Check Questions

  1. You have a dataset with 50 observations and 100 potential predictor variables. Which regression method would best help you identify the most important predictors while avoiding overfitting?

  2. Compare and contrast ridge and lasso regression: What penalty does each use, and how does this affect which predictors remain in the final model?

  3. A researcher wants to predict whether customers will churn (yes/no) based on their usage patterns. Why would logistic regression be more appropriate than linear regression for this problem?

  4. Which two regression methods both address non-linear patterns in data, and what distinguishes when you'd use each one?

  5. A multiple regression output shows two predictor variables that are highly correlated, causing unstable coefficient estimates. What is this problem called, and which regression technique specifically addresses it?