Simple Linear Regression
Concept and Purpose
Simple linear regression models the relationship between two continuous variables by fitting a straight line through the data. You use one variable (the independent or predictor variable, x) to predict the other (the dependent or response variable, y). The goal is to find the line that best captures how y changes as x changes.
The regression line follows this equation:

ŷ = b₀ + b₁x

- b₀ is the y-intercept: the predicted value of y when x = 0
- b₁ is the slope: the predicted change in y for a one-unit increase in x
These coefficients are estimated using the least squares method, which finds the line that minimizes the sum of squared residuals (the squared differences between observed values and the values the line predicts).
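As a concrete sketch (with made-up numbers), the following pure-Python snippet computes the closed-form least-squares estimates and confirms that perturbing the fitted line only increases the sum of squared residuals:

```python
# Toy illustration (made-up data): the least-squares line yields a smaller
# sum of squared residuals (SSR) than any perturbed candidate line.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.3, 5.9, 8.2, 9.8]

def ssr(b0, b1):
    """Sum of squared residuals for the line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Least-squares estimates via the closed-form formulas.
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

print(ssr(b0, b1))          # SSR at the least-squares line
print(ssr(b0 + 0.5, b1))    # shifting the line makes the SSR strictly larger
```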
Simple linear regression serves several purposes:
- Describing relationships: Quantifying the strength and direction of a linear association (e.g., the link between height and weight)
- Prediction: Estimating y for a given x (e.g., forecasting sales based on advertising spend)
- Identifying outliers: Flagging observations that fall far from the regression line
- Assessing explanatory power: Determining how much of the variation in y is accounted for by x
Assumptions of Linear Regression
Core Assumptions
For the results of a simple linear regression to be trustworthy, four key conditions need to hold. These apply to the residuals (the differences between observed and predicted values), not to the raw data itself.
- Linearity: The true relationship between x and y is linear. A one-unit change in x produces a constant change in y, regardless of where you are on the x axis. If the relationship is curved, a straight line will systematically miss the pattern.
- Independence: Each observation is independent of the others. The value of one data point doesn't influence or predict another. This is largely determined by how the data were collected (e.g., random sampling supports independence; repeated measurements on the same subject may violate it).
- Homoscedasticity (constant variance): The spread of the residuals stays roughly the same across all values of x. If the residuals fan out (getting larger as x increases, for instance), this assumption is violated, and your standard errors become unreliable.
- Normality of residuals: The residuals are approximately normally distributed with a mean of zero. This matters most for hypothesis tests and confidence intervals. With large samples, mild departures from normality are less of a concern.

Additional Model Considerations
- No autocorrelation: The residuals should not be correlated with each other. This is especially relevant for time-series data, where consecutive observations may share a pattern. The Durbin-Watson test can help detect this.
- No multicollinearity: With only one predictor, multicollinearity isn't an issue. It becomes important when you move to multiple regression with two or more predictors.
- Measurement error: The model assumes x is measured without error. Substantial measurement error in the predictor can bias the slope estimate toward zero (this is called attenuation bias).
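The Durbin-Watson statistic mentioned above is simple to compute by hand. Here is a sketch on illustrative residual series (in practice you would typically use a library such as statsmodels, which provides a `durbin_watson` function):

```python
# Hand-rolled Durbin-Watson statistic: DW = sum of squared successive
# differences of the residuals divided by the sum of squared residuals.
# Values near 2 suggest no first-order autocorrelation; values near 0 or 4
# suggest positive or negative autocorrelation, respectively.

def durbin_watson(residuals):
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Alternating residuals: strong negative autocorrelation -> DW near 4.
print(durbin_watson([1, -1, 1, -1, 1, -1]))

# Long runs of same-sign residuals: positive autocorrelation -> DW near 0.
print(durbin_watson([1, 1, 1, -1, -1, -1]))
```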
Assessing Linear Regression Appropriateness
Data Suitability
Before fitting a model, confirm that simple linear regression makes sense for your situation.
- Your research question should involve predicting one continuous variable from another continuous variable.
- Create a scatterplot of y against x. Look for a roughly linear pattern. If you see a clear curve, simple linear regression isn't the right tool without a transformation.
- After fitting the model, examine the residual plot (residuals vs. fitted values). The points should scatter randomly around zero with no obvious pattern. A funnel shape suggests heteroscedasticity; a curve suggests nonlinearity.
- Check normality of residuals with a Q-Q plot or histogram. Points on a Q-Q plot should fall close to the diagonal line. A formal test like the Shapiro-Wilk test can supplement visual inspection.
- Verify independence based on your study design. If data were collected over time or within clusters, independence may not hold.
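Plots remain the primary diagnostic, but a couple of these checks can be roughed out numerically. The sketch below uses made-up data and a crude Goldfeld-Quandt-style spread comparison (my labeling, not a formal test) to flag gross heteroscedasticity:

```python
# Crude numeric stand-ins for the plot-based checks above (made-up data).
# A residual-vs-fitted plot is still the primary tool; these numbers only
# flag gross violations.

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.2, 2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar
resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# With an intercept in the model, least-squares residuals sum to ~0.
print(abs(sum(resid)) < 1e-9)

# Spread check: compare mean squared residuals in the lower and upper
# halves of x. A ratio far from 1 hints at a funnel shape.
def spread(es):
    return sum(e ** 2 for e in es) / len(es)

ratio = spread(resid[n // 2:]) / spread(resid[:n // 2])
print(ratio)  # near 1 for this data: no strong sign of heteroscedasticity
```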

Practical Considerations
- Sample size: Simple linear regression needs enough data to produce stable coefficient estimates. A common rule of thumb is at least 15–20 observations, though more is better for precise estimates and adequate statistical power.
- Outliers and influential points: A single extreme observation can pull the regression line substantially. Examine leverage and Cook's distance values. Depending on the situation, you might remove the point, use a robust regression method (like median regression), or report results with and without it.
- Practical significance: A statistically significant slope doesn't always mean the relationship is meaningful. Consider whether the effect size matters in context. A model predicting housing prices from square footage might explain a lot of variation; a model predicting GPA from shoe size probably won't, even if the slope is technically significant.
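Leverage and Cook's distance have simple closed forms in the one-predictor case; the sketch below (made-up data with a deliberate outlier in x) shows how an extreme point stands out:

```python
# Leverage for simple regression: h_i = 1/n + (x_i - x_bar)^2 / Sxx.
# Cook's distance: D_i = (e_i^2 / (p * MSE)) * h_i / (1 - h_i)^2, with
# p = 2 estimated coefficients (intercept + slope). Data are made up;
# the last point is a deliberate x-outlier.

xs = [1, 2, 3, 4, 5, 15]
ys = [1.1, 2.0, 2.9, 4.1, 5.2, 8.0]

n, p = len(xs), 2
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar
resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
mse = sum(e ** 2 for e in resid) / (n - p)

leverage = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]
cooks = [e ** 2 / (p * mse) * h / (1 - h) ** 2
         for e, h in zip(resid, leverage)]

# The x-outlier at x = 15 has by far the largest leverage.
print(max(range(n), key=lambda i: leverage[i]))  # index 5
```

A useful sanity check on the leverage formula: the leverages always sum to p, the number of estimated coefficients.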
Modeling with Linear Regression
Model Formulation
Building a simple linear regression model follows a clear sequence:
- Identify your variables. Based on your research question, determine which variable is the response (y) and which is the predictor (x).
- Collect and examine your data. You need paired observations of x and y. Create a scatterplot to check for linearity and spot any obvious outliers.
- Estimate the coefficients using the least squares method:
  - Calculate the sample means x̄ and ȳ
  - Calculate Sxx = Σ(xᵢ − x̄)², the total squared deviation of x
  - Calculate Sxy = Σ(xᵢ − x̄)(yᵢ − ȳ), the cross-deviation of x and y
  - Compute the slope: b₁ = Sxy / Sxx
  - Compute the intercept: b₀ = ȳ − b₁x̄
- Check assumptions. Plot the residuals and verify the conditions described above before interpreting results.
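The estimation steps above can be carried out directly; here is a minimal worked example on a small made-up dataset:

```python
# The least-squares estimation steps, on made-up data.

xs = [1, 2, 3, 4, 5]          # predictor
ys = [2, 4, 5, 4, 5]          # response

# Step: sample means
n = len(xs)
x_bar = sum(xs) / n           # 3.0
y_bar = sum(ys) / n           # 4.0

# Step: Sxx and Sxy
sxx = sum((x - x_bar) ** 2 for x in xs)                       # 10.0
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # 6.0

# Step: slope and intercept
b1 = sxy / sxx           # 0.6
b0 = y_bar - b1 * x_bar  # 4.0 - 0.6 * 3.0 = 2.2

print(f"fitted line: y = {b0:.1f} + {b1:.1f}x")
```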
Model Interpretation
Once you have the fitted equation ŷ = b₀ + b₁x, interpretation involves three pieces:
Slope (b₁): This tells you the predicted change in y for each one-unit increase in x. For example, if you're modeling salary (in thousands of dollars) against years of experience and b₁ = 2.3, then each additional year of experience is associated with a $2,300 increase in predicted salary.
Intercept (b₀): This is the predicted value of y when x = 0. Sometimes this has a real-world meaning (e.g., a starting salary with zero experience). Other times x = 0 falls outside the range of your data, and the intercept is just a mathematical anchor for the line rather than something you'd interpret literally.
Coefficient of determination (R²): This measures the proportion of variance in y that the model explains. An R² of 0.72 means 72% of the variability in y is accounted for by its linear relationship with x. The remaining 28% is unexplained. Keep in mind that R² alone doesn't tell you whether the model is appropriate; always check the residual plots alongside it.
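R² follows directly from its definition, 1 − SSres/SStot; a short sketch on made-up data:

```python
# R^2 from its definition: 1 - (residual sum of squares / total sum of
# squares), using a small made-up dataset.

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))  # 0.6: 60% of the variance in y is explained
```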