Fundamentals of Regression
Regression analysis lets you quantify the relationship between variables so you can make predictions from data. If you've ever drawn a "line of best fit" through a scatter plot, you've already done a basic version of regression. The real power comes from understanding how that line is chosen, what it tells you, and when you can trust it.
Types of Regression Models
Different data patterns call for different regression approaches:
- Linear regression models relationships using a straight line. It's the simplest and most common starting point.
- Polynomial regression fits curves instead of lines, handling relationships that bend or level off.
- Logistic regression predicts binary outcomes (yes/no, pass/fail) by modeling probabilities rather than continuous values.
- Time series regression analyzes data collected over time, accounting for trends and seasonal patterns.
You'll spend most of your time on linear regression in this course, but knowing these other types exist helps you recognize when a straight line isn't enough.
Dependent vs. Independent Variables
- The dependent variable (Y) is the outcome you're trying to predict or explain.
- The independent variable (X) is the predictor you think influences Y.
- The relationship flows from X to Y: changes in X are associated with changes in Y.
- In multiple regression, several independent variables can influence a single dependent variable at the same time.
A quick example: if you're predicting a student's exam score (Y) based on hours studied (X), the exam score is dependent and hours studied is independent.
Correlation vs. Causation
Correlation measures the strength and direction of a relationship between two variables. Causation means changes in one variable directly produce changes in another.
These are not the same thing. Ice cream sales and drowning rates are positively correlated, but ice cream doesn't cause drowning. Both increase in summer because of a lurking third variable: hot weather. This is called a spurious correlation.
Establishing causation typically requires controlled experiments or specialized statistical techniques, not just regression output.
Simple Linear Regression
Simple linear regression models the relationship between one independent variable and one dependent variable using a straight line. It's the foundation for everything else in this unit.
Equation of a Line
The regression equation takes the form:

Ŷ = b₀ + b₁X

where:
- Ŷ is the predicted value of the dependent variable
- X is the independent variable
- b₁ is the slope, representing the change in Y for each one-unit increase in X
- b₀ is the y-intercept, the predicted value of Y when X equals 0

For example, if b₀ = 50 and b₁ = 3.2, then each additional hour of studying predicts a 3.2-point increase in exam score, starting from a baseline of 50.
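A tiny Python sketch of making predictions from this line, using the hypothetical study-hours values above (b₀ = 50, b₁ = 3.2):

```python
# Hypothetical study-hours example: intercept 50, slope 3.2 points per hour.
b0 = 50.0   # predicted score at 0 hours studied
b1 = 3.2    # points gained per additional hour

def predict(hours):
    """Predicted exam score for a given number of study hours."""
    return b0 + b1 * hours

print(predict(0))   # 50.0 (the baseline)
print(predict(5))   # 66.0 (50 + 3.2 * 5)
```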
Least Squares Method
The least squares method finds the line that minimizes the total squared distance between each observed data point and the line's prediction. Here's the process:
- For each data point, calculate the residual: the difference between the observed Y value and the predicted value Ŷ.
- Square each residual (so negative and positive errors don't cancel out).
- Sum all the squared residuals.
- Find the values of b₀ and b₁ that make this sum as small as possible.
The squaring step is why it's called "least squares." This method produces the single best-fitting straight line through your data.
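The four steps above have a closed-form solution in the one-predictor case: b₁ = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², and b₀ = ȳ − b₁x̄. A minimal pure-Python sketch:

```python
# Least squares fit for simple linear regression (closed-form solution).
def least_squares(x, y):
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx              # slope minimizing the squared residuals
    b0 = y_mean - b1 * x_mean   # intercept: line passes through the means
    return b0, b1

# Perfectly linear toy data y = 2 + 3x, so the fit recovers b0 = 2, b1 = 3.
b0, b1 = least_squares([1, 2, 3, 4], [5, 8, 11, 14])
print(b0, b1)  # 2.0 3.0
```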
Regression Coefficients
- The slope (b₁) tells you the direction and rate of the relationship. A positive slope means Y increases as X increases; a negative slope means Y decreases as X increases.
- The intercept (b₀) is the predicted Y when X = 0. Sometimes this has a meaningful interpretation; other times (like predicting weight from height when height = 0), it's just a mathematical anchor for the line.
- Each coefficient has a standard error that tells you how precisely it's estimated. Smaller standard errors mean more reliable estimates.
Coefficient of Determination
R-squared (R²) tells you what proportion of the variation in Y is explained by X.
- R² ranges from 0 to 1.
- An R² of 0.85 means 85% of the variability in Y can be accounted for by the linear relationship with X.
- An R² of 0.20 means the model explains only 20% of the variation, so other factors are likely at play.
R² is calculated as the ratio of explained variance to total variance. It's your primary quick measure of how well the model fits the data.
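A minimal sketch of the R² computation, using the equivalent form 1 − SSres/SStot (numbers are made up for illustration):

```python
# R^2 = 1 - SS_res / SS_tot: the share of total variation the model explains.
def r_squared(y, y_pred):
    y_mean = sum(y) / len(y)
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)               # total variation
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))  # unexplained
    return 1 - ss_res / ss_tot

y      = [3.0, 5.0, 7.0, 10.0]
y_pred = [3.5, 4.5, 7.5,  9.5]   # predictions from some fitted line
print(round(r_squared(y, y_pred), 3))  # 0.963: the line fits closely
```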
Multiple Regression
Multiple regression extends simple linear regression by including two or more independent variables. This lets you model more realistic situations where outcomes depend on several factors at once.
Multiple Independent Variables
The general form is:

Y = b₀ + b₁X₁ + b₂X₂ + … + bₖXₖ + ε

Each coefficient bᵢ represents the effect of its corresponding variable while holding all other variables constant. The term ε represents the error (the part of Y the model can't explain).
For example, predicting house price from both square footage (X₁) and number of bedrooms (X₂) gives you a more complete picture than using either variable alone. It also lets you control for confounding variables by including them in the model.
Interaction Effects
Sometimes the effect of one variable depends on the value of another. For instance, the effect of study hours on exam scores might be stronger for students who also attend office hours.
You model this by adding a product term (e.g., X₁ × X₂) to the equation. Interaction effects can reveal complex relationships that main effects alone would miss, but they require careful interpretation.
Multicollinearity
Multicollinearity occurs when independent variables are highly correlated with each other. This creates problems because the model can't cleanly separate each variable's individual effect.
- Symptoms: coefficient estimates become unstable, standard errors inflate, and signs may flip unexpectedly.
- Detection: use the variance inflation factor (VIF). A VIF above 5 or 10 (depending on the convention) signals concern.
- Remedies: remove one of the correlated variables, combine them, or use techniques like ridge regression.
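In the special case of exactly two predictors, regressing one on the other gives an R² equal to their squared correlation, so the VIF for both reduces to 1/(1 − r²). A small sketch with made-up data:

```python
# VIF in the two-predictor case: VIF = 1 / (1 - r^2), where r is the
# correlation between the two predictors.
def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 8, 11]   # strongly (but not perfectly) related to x1
r = correlation(x1, x2)
vif = 1 / (1 - r ** 2)
print(vif)  # ~31, far above the usual 5-10 cutoff: serious multicollinearity
```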

Assumptions of Regression
Regression results are only trustworthy if certain assumptions hold. Violating these assumptions can produce misleading coefficients and invalid predictions. Always check them before interpreting your model.
Linearity
The relationship between X and Y should be approximately linear. Check this by looking at a scatter plot of the data or a plot of residuals vs. fitted values. If you see a clear curve, a straight line isn't capturing the true pattern. You may need to transform your variables or use a nonlinear model.
Independence of Errors
Residuals should be independent of one another. This assumption is most often violated with time series data, where today's error tends to be similar to yesterday's. The Durbin-Watson statistic tests for this. Values near 2 suggest independence; values near 0 or 4 suggest positive or negative autocorrelation.
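The Durbin-Watson statistic is simple to compute from the residuals: the sum of squared successive differences divided by the sum of squared residuals. A sketch with contrived residual sequences:

```python
# Durbin-Watson: values near 2 suggest independent errors; near 0 or 4
# suggest positive or negative autocorrelation.
def durbin_watson(residuals):
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

print(durbin_watson([1, -1, 1, -1, 1, -1]))  # alternating signs -> toward 4
print(durbin_watson([1, 1, 1, -1, -1, -1]))  # slow drift -> toward 0
```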
Homoscedasticity
The spread (variance) of residuals should stay roughly constant across all levels of X. If residuals fan out or funnel in as fitted values increase, you have heteroscedasticity. This doesn't bias your coefficients, but it makes your standard errors unreliable. Weighted least squares or robust standard errors can help.
Normality of Residuals
Residuals should be approximately normally distributed. You can check this with a Q-Q plot (points should fall close to the diagonal line) or a formal test like the Shapiro-Wilk test. This assumption matters most for hypothesis tests and confidence intervals. With large samples, mild non-normality is usually not a serious problem.
Model Evaluation
Once you've built a regression model, you need to assess whether it actually works well. Model evaluation involves both overall fit and the significance of individual predictors.
R-Squared vs. Adjusted R-Squared
- R-squared always increases when you add more predictors, even if those predictors are meaningless. This makes it misleading for comparing models of different sizes.
- Adjusted R-squared penalizes for each additional predictor. It only increases if the new variable improves the model more than you'd expect by chance.
When comparing models with different numbers of predictors, use adjusted R-squared.
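The adjustment formula is 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the sample size and p the number of predictors. A quick sketch with made-up values:

```python
# Adjusted R^2 penalizes each extra predictor.
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The same R^2 = 0.80 with n = 30 looks worse as predictors pile up:
print(adjusted_r2(0.80, 30, 2))   # ~0.785
print(adjusted_r2(0.80, 30, 10))  # ~0.695
```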
F-Test for Overall Significance
The F-test asks: Does this model, as a whole, explain a significant amount of variance in Y?
- Null hypothesis: all regression coefficients equal zero (the model is no better than just using the mean of Y).
- A low p-value (typically < 0.05) means at least one predictor has a real relationship with Y.
- The F-test doesn't tell you which predictors matter, just that the model collectively does.
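The F statistic can be computed directly from R²: F = (R²/p) / ((1 − R²)/(n − p − 1)). A sketch with hypothetical values:

```python
# Overall F statistic from R^2, with p predictors and n observations.
def f_statistic(r2, n, p):
    return (r2 / p) / ((1 - r2) / (n - p - 1))

# Hypothetical model: R^2 = 0.60 with 3 predictors and 50 observations.
print(round(f_statistic(0.60, 50, 3), 1))  # 23.0 -> compare against F(3, 46)
```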
T-Tests for Individual Predictors
While the F-test evaluates the whole model, t-tests evaluate each predictor individually.
- Null hypothesis: the coefficient for a given predictor is zero (that variable has no effect).
- Each predictor gets its own p-value and confidence interval.
- A predictor with a high p-value (e.g., > 0.05) may not be contributing meaningfully and could be a candidate for removal.
Diagnostics and Remedies
After fitting a model, diagnostic checks help you spot problems and decide how to fix them.
Residual Analysis
Plot your residuals in several ways:
- Residuals vs. fitted values: look for patterns. A random scatter is good. Curves suggest nonlinearity; funneling suggests heteroscedasticity.
- Residuals vs. each predictor: helps pinpoint which variable is causing problems.
- Residuals over time (if applicable): checks for autocorrelation.
Standardized residuals (residuals divided by their standard deviation) help you spot outliers. Points beyond ±2 or ±3 deserve a closer look.
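A short sketch of flagging large standardized residuals (the residual values are contrived for illustration):

```python
# Standardize residuals and flag points beyond +/-2 standard deviations.
def standardized(residuals):
    n = len(residuals)
    mean = sum(residuals) / n
    sd = (sum((e - mean) ** 2 for e in residuals) / (n - 1)) ** 0.5
    return [(e - mean) / sd for e in residuals]

res = [0.5, -0.3, 0.2, -0.4, 6.0, 0.1, -0.2]  # one suspicious residual
flags = [i for i, z in enumerate(standardized(res)) if abs(z) > 2]
print(flags)  # [4]: the large residual stands out
```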
Outliers and Leverage Points
- Outliers have unusual Y values given their X values. They pull the regression line toward them.
- Leverage points have unusual X values. They have the potential to strongly influence the line, though they don't always.
- Cook's distance combines both concepts into a single measure of how much each point influences the overall model. Points with a Cook's distance greater than 1 (or sometimes 4/n) warrant investigation.
Don't automatically delete outliers. Investigate whether they're data errors, or whether they represent real but unusual observations.
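For simple regression, Cook's distance can be computed by hand from the residuals and the leverages hᵢ = 1/n + (xᵢ − x̄)²/Sxx. A sketch with made-up data in which one high-leverage point dominates:

```python
# Cook's distance for simple linear regression:
#   D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2, with p = 2 parameters.
def cooks_distance(x, y):
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    sxx = sum((xi - xm) ** 2 for xi in x)
    b1 = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sxx
    b0 = ym - b1 * xm
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # residuals
    p = 2                                               # intercept + slope
    mse = sum(ei ** 2 for ei in e) / (n - p)
    h = [1 / n + (xi - xm) ** 2 / sxx for xi in x]      # leverages
    return [ei ** 2 / (p * mse) * hi / (1 - hi) ** 2
            for ei, hi in zip(e, h)]

x = [1, 2, 3, 4, 10]            # last point has extreme X (high leverage)
y = [2.1, 3.9, 6.2, 8.0, 30.0]  # and drags the fitted line toward itself
d = cooks_distance(x, y)
print(d.index(max(d)))  # 4: the high-leverage point is the most influential
```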

Transformation of Variables
When assumptions are violated, transforming variables can help:
- Log transformation (log(Y)): useful when data is right-skewed or when the relationship is multiplicative rather than additive.
- Square root transformation (√Y): a milder option for moderate skew.
- Box-Cox transformation: a systematic method that finds the optimal power transformation for your data.
Transformations can improve linearity, stabilize variance, and make residuals more normal, but they change how you interpret the coefficients.
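A quick illustration of why the log transform helps with multiplicative relationships: y = a·bˣ becomes log(y) = log(a) + x·log(b), a straight line (contrived exact-tripling data):

```python
import math

# Multiplicative growth looks curved in raw units but linear after log().
x = [0, 1, 2, 3, 4]
y = [2.0, 6.0, 18.0, 54.0, 162.0]   # exact tripling: y = 2 * 3^x

log_y = [math.log(v) for v in y]
diffs = [log_y[i + 1] - log_y[i] for i in range(len(log_y) - 1)]
print(diffs)  # constant steps (log 3): the transformed data lies on a line
```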
Advanced Regression Techniques
These methods extend basic regression to handle situations where a simple linear model falls short.
Polynomial Regression
Polynomial regression adds squared, cubed, or higher-order terms of a predictor to capture curved relationships. For example:

Ŷ = b₀ + b₁X + b₂X²

This fits a parabola instead of a straight line. Higher-order polynomials can fit more complex curves, but they carry a serious risk of overfitting: the model may chase noise in the training data and perform poorly on new data.
Logistic Regression
Logistic regression is used when the dependent variable is binary (e.g., admitted/not admitted, default/no default). Instead of predicting Y directly, it predicts the probability that Y = 1 using the logistic function:

p = 1 / (1 + e^−(b₀ + b₁X))
Coefficients are interpreted as changes in log-odds. For example, in credit scoring, a logistic regression might predict the probability of loan default based on income and credit history.
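A sketch of the logistic function with hypothetical coefficients (b₀ = −4.0 and b₁ = 0.08 are made up for illustration, e.g. probability of passing as a function of hours of preparation):

```python
import math

# Logistic (sigmoid) function: maps the linear predictor to a probability.
b0, b1 = -4.0, 0.08   # hypothetical coefficients for illustration

def prob(x):
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

print(prob(50))             # 0.5: at x = 50 the log-odds b0 + b1*x equal 0
odds_ratio = math.exp(b1)   # each unit of x multiplies the odds by e^b1
print(round(odds_ratio, 3))
```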
Ridge vs. Lasso Regression
Both are regularization techniques that add a penalty to the size of coefficients, which helps with multicollinearity and overfitting.
- Ridge regression shrinks all coefficients toward zero but never sets any exactly to zero. It keeps all variables in the model.
- Lasso regression can shrink some coefficients to exactly zero, effectively performing variable selection by dropping unimportant predictors.
The choice between them depends on whether you think all your predictors matter (ridge) or suspect some are irrelevant (lasso).
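A stylized one-predictor illustration of the difference (assuming a standardized predictor; the exact shrinkage formulas vary with the penalty convention):

```python
# In the single standardized-predictor case, ridge rescales the OLS
# coefficient, while lasso soft-thresholds it and can zero it out entirely.
def ridge_shrink(b_ols, lam):
    return b_ols / (1 + lam)          # shrunk toward 0, never exactly 0

def lasso_shrink(b_ols, lam):
    if abs(b_ols) <= lam:             # small coefficients are dropped
        return 0.0
    return b_ols - lam if b_ols > 0 else b_ols + lam

for b in [3.0, 0.4]:
    print(ridge_shrink(b, 1.0), lasso_shrink(b, 1.0))
# The small coefficient survives ridge (shrunk) but is zeroed out by lasso.
```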
Applications of Regression
Prediction and Forecasting
Regression models are widely used to estimate future outcomes:
- Finance: forecasting revenue based on advertising spend
- Real estate: predicting home prices from location, size, and features
- Healthcare: estimating patient outcomes based on treatment variables
Be cautious about extrapolation (predicting beyond the range of your data). A model trained on houses between 1,000 and 3,000 square feet may not predict well for a 10,000-square-foot mansion.
Causal Inference
Regression alone doesn't prove causation. Establishing causal claims requires either controlled experiments or specialized techniques like instrumental variables or difference-in-differences analysis. These methods are common in policy evaluation and medical research, where you can't always run a randomized experiment.
The biggest challenge is confounding: unmeasured variables that influence both X and Y, creating a misleading association.
Model Selection
Choosing the best model means balancing simplicity with explanatory power. Common approaches:
- Stepwise regression: adds or removes predictors one at a time based on statistical criteria.
- Best subset selection: evaluates all possible combinations of predictors (computationally expensive with many variables).
- Information criteria (AIC, BIC): score models by balancing fit against complexity. Lower values indicate better models. BIC penalizes complexity more heavily than AIC.
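One common form of the criteria (assuming Gaussian errors and dropping a constant) can be computed from the residual sum of squares; the SSE values below are made up for illustration:

```python
import math

# AIC = n*ln(SSE/n) + 2k,  BIC = n*ln(SSE/n) + k*ln(n),
# where k counts estimated parameters. Lower is better.
def aic(sse, n, k):
    return n * math.log(sse / n) + 2 * k

def bic(sse, n, k):
    return n * math.log(sse / n) + k * math.log(n)

# A 5-parameter model fits slightly better (SSE 38 vs 40) than a
# 3-parameter one. With n = 100, AIC narrowly prefers the larger model,
# while BIC's heavier complexity penalty prefers the smaller one.
n = 100
print(aic(40.0, n, 3), aic(38.0, n, 5))
print(bic(40.0, n, 3), bic(38.0, n, 5))
```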
Limitations and Alternatives
Non-Linear Relationships
Linear regression assumes a straight-line relationship. When the true pattern is curved or more complex, alternatives include generalized additive models (GAMs), which fit flexible smooth curves, and neural networks, which can capture highly complex patterns at the cost of interpretability.
Always plot your data first. A scatter plot can immediately reveal whether a linear model is reasonable.
Time Series Considerations
Standard regression assumes observations are independent, but time series data often violates this. Consecutive observations tend to be correlated (autocorrelation), and trends or seasonal patterns can create non-stationarity. Specialized models like ARIMA and vector autoregression (VAR) are designed to handle these features.
Machine Learning Approaches
Machine learning methods like decision trees, random forests, and support vector machines offer flexible alternatives to regression. They can capture complex nonlinear relationships and interactions automatically. The trade-off: these models are often harder to interpret. Regression gives you clear coefficients with straightforward meaning; a random forest gives you better predictions but less insight into why.