
1.2 Simple Linear Regression: Concept and Assumptions


Written by the Fiveable Content Team • Last updated August 2025

Simple Linear Regression

Concept and Purpose

Simple linear regression models the relationship between two continuous variables by fitting a straight line through the data. You use one variable (the independent or predictor variable, $x$) to predict the other (the dependent or response variable, $y$). The goal is to find the line that best captures how $y$ changes as $x$ changes.

The regression line follows this equation:

$\hat{y} = \beta_0 + \beta_1 x$

  • $\beta_0$ is the y-intercept: the predicted value of $y$ when $x = 0$
  • $\beta_1$ is the slope: the predicted change in $y$ for a one-unit increase in $x$

These coefficients are estimated using the least squares method, which finds the line that minimizes the sum of squared residuals (the squared differences between observed $y$ values and the values the line predicts).

Simple linear regression serves several purposes:

  • Describing relationships: Quantifying the strength and direction of a linear association (e.g., the link between height and weight)
  • Prediction: Estimating $y$ for a given $x$ (e.g., forecasting sales based on advertising spend)
  • Identifying outliers: Flagging observations that fall far from the regression line
  • Assessing explanatory power: Determining how much of the variation in $y$ is accounted for by $x$
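As a quick sketch of these ideas, the snippet below fits a least-squares line with NumPy and uses it for prediction. The advertising/sales numbers are made up for illustration:

```python
import numpy as np

# Hypothetical data: advertising spend (x, in $1000s) vs. sales (y, in units)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# np.polyfit with degree 1 performs least-squares fitting of a straight line
slope, intercept = np.polyfit(x, y, 1)

# Predict sales for a spend of $2,500
y_pred = intercept + slope * 2.5
```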

Assumptions of Linear Regression

Core Assumptions

For the results of a simple linear regression to be trustworthy, four key conditions need to hold. These apply to the residuals (the differences between observed and predicted values), not to the raw data itself.

  1. Linearity: The true relationship between $x$ and $y$ is linear. A one-unit change in $x$ produces a constant change in $y$, regardless of where you are on the $x$ axis. If the relationship is curved, a straight line will systematically miss the pattern.

  2. Independence: Each observation is independent of the others. The value of one data point doesn't influence or predict another. This is largely determined by how the data were collected (e.g., random sampling supports independence; repeated measurements on the same subject may violate it).

  3. Homoscedasticity (constant variance): The spread of the residuals stays roughly the same across all values of $x$. If the residuals fan out (getting larger as $x$ increases, for instance), this assumption is violated, and your standard errors become unreliable.

  4. Normality of residuals: The residuals are approximately normally distributed with a mean of zero. This matters most for hypothesis tests and confidence intervals. With large samples, mild departures from normality are less of a concern.
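A minimal residual check, using simulated data, can confirm that least-squares residuals average to zero and compare their spread across the range of $x$ (a crude look at constant variance):

```python
import numpy as np

# Hypothetical, roughly linear data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.2, 3.1, 4.9, 6.2, 6.8, 8.1, 9.0, 10.2])

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Least squares guarantees the residuals sum to (numerically) zero
mean_resid = residuals.mean()

# Rough homoscedasticity check: residual spread in lower vs. upper half of x;
# wildly different spreads would suggest non-constant variance
low_spread = residuals[:4].std()
high_spread = residuals[4:].std()
```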


Additional Model Considerations

  • No autocorrelation: The residuals should not be correlated with each other. This is especially relevant for time-series data, where consecutive observations may share a pattern. The Durbin-Watson test can help detect this.
  • No multicollinearity: With only one predictor, multicollinearity isn't an issue. It becomes important when you move to multiple regression with two or more predictors.
  • Measurement error: The model assumes $x$ is measured without error. Substantial measurement error in the predictor can bias the slope estimate toward zero (this is called attenuation bias).
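The Durbin-Watson statistic mentioned above is simple enough to compute directly. The sketch below is a hand-rolled implementation (not a library call) applied to independent noise, where the value should land near 2:

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest no first-order
    autocorrelation; values toward 0 or 4 suggest positive or negative
    autocorrelation, respectively."""
    diff = np.diff(residuals)
    return np.sum(diff ** 2) / np.sum(residuals ** 2)

# Independent noise should give a statistic close to 2
rng = np.random.default_rng(0)
dw_independent = durbin_watson(rng.standard_normal(1000))
```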

Assessing Linear Regression Appropriateness

Data Suitability

Before fitting a model, confirm that simple linear regression makes sense for your situation.

  • Your research question should involve predicting one continuous variable from another continuous variable.
  • Create a scatterplot of $y$ against $x$. Look for a roughly linear pattern. If you see a clear curve, simple linear regression isn't the right tool without a transformation.
  • After fitting the model, examine the residual plot (residuals vs. fitted values). The points should scatter randomly around zero with no obvious pattern. A funnel shape suggests heteroscedasticity; a curve suggests nonlinearity.
  • Check normality of residuals with a Q-Q plot or histogram. Points on a Q-Q plot should fall close to the diagonal line. A formal test like the Shapiro-Wilk test can supplement visual inspection.
  • Verify independence based on your study design. If data were collected over time or within clusters, independence may not hold.
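The normality checks above can be sketched with SciPy: `scipy.stats.shapiro` runs the Shapiro-Wilk test, and `scipy.stats.probplot` computes the Q-Q plot coordinates. The residuals here are simulated for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical residuals from a fitted model (here: simulated normal noise)
rng = np.random.default_rng(42)
residuals = rng.standard_normal(150)

# Shapiro-Wilk test: small p-values indicate departure from normality
stat, p_value = stats.shapiro(residuals)

# Q-Q plot coordinates: osm holds theoretical quantiles, osr the ordered
# residuals; plotting osr against osm should track the fitted reference line
(osm, osr), (qq_slope, qq_intercept, r) = stats.probplot(residuals)
```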

Practical Considerations

  • Sample size: Simple linear regression needs enough data to produce stable coefficient estimates. A common rule of thumb is at least 15–20 observations, though more is better for precise estimates and adequate statistical power.
  • Outliers and influential points: A single extreme observation can pull the regression line substantially. Examine leverage and Cook's distance values. Depending on the situation, you might remove the point, use a robust regression method (like median regression), or report results with and without it.
  • Practical significance: A statistically significant slope doesn't always mean the relationship is meaningful. Consider whether the effect size matters in context. A model predicting housing prices from square footage might explain a lot of variation; a model predicting GPA from shoe size probably won't, even if the slope is technically significant.
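For simple linear regression, Cook's distance can be computed from the closed-form leverage values $h_i = 1/n + (x_i - \bar{x})^2 / S_{xx}$. The helper below is an illustrative implementation; the data are invented, with a deliberately high-leverage final point:

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance for simple linear regression, using the
    closed-form leverage h_i = 1/n + (x_i - xbar)^2 / Sxx."""
    n = len(x)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    h = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
    p = 2                                  # parameters estimated (slope, intercept)
    mse = np.sum(resid ** 2) / (n - p)
    return (resid ** 2 / (p * mse)) * (h / (1 - h) ** 2)

x = np.array([1, 2, 3, 4, 5, 20.0])        # last point sits far from the rest
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9, 15.0])
d = cooks_distance(x, y)                   # d[-1] dominates the others
```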

Modeling with Linear Regression

Model Formulation

Building a simple linear regression model follows a clear sequence:

  1. Identify your variables. Based on your research question, determine which variable is the response ($y$) and which is the predictor ($x$).

  2. Collect and examine your data. You need paired observations of $x$ and $y$. Create a scatterplot to check for linearity and spot any obvious outliers.

  3. Estimate the coefficients using the least squares method:

    • Calculate the sample means $\bar{x}$ and $\bar{y}$
    • Calculate $S_{xx} = \sum(x_i - \bar{x})^2$, the total squared deviation of $x$
    • Calculate $S_{xy} = \sum(x_i - \bar{x})(y_i - \bar{y})$, the cross-deviation of $x$ and $y$
    • Compute the slope: $\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$
    • Compute the intercept: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
  4. Check assumptions. Plot the residuals and verify the conditions described above before interpreting results.
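The estimation steps above translate directly to NumPy. The data here are hypothetical, and the hand-computed coefficients are cross-checked against `np.polyfit`:

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

# Step 3: least-squares estimates from the deviation sums
x_bar, y_bar = x.mean(), y.mean()
S_xx = np.sum((x - x_bar) ** 2)
S_xy = np.sum((x - x_bar) * (y - y_bar))

beta1_hat = S_xy / S_xx                  # slope
beta0_hat = y_bar - beta1_hat * x_bar    # intercept

# Cross-check against NumPy's built-in least-squares polynomial fit
slope_np, intercept_np = np.polyfit(x, y, 1)
```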

Model Interpretation

Once you have the fitted equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, interpretation involves three pieces:

Slope ($\hat{\beta}_1$): This tells you the predicted change in $y$ for each one-unit increase in $x$. For example, if you're modeling salary (in thousands of dollars) against years of experience and $\hat{\beta}_1 = 2.3$, then each additional year of experience is associated with a $2,300 increase in predicted salary.

Intercept ($\hat{\beta}_0$): This is the predicted value of $y$ when $x = 0$. Sometimes this has a real-world meaning (e.g., a starting salary with zero experience). Other times $x = 0$ falls outside the range of your data, and the intercept is just a mathematical anchor for the line rather than something you'd interpret literally.

Coefficient of determination ($R^2$): This measures the proportion of variance in $y$ that the model explains. An $R^2$ of 0.72 means 72% of the variability in $y$ is accounted for by its linear relationship with $x$. The remaining 28% is unexplained. Keep in mind that $R^2$ alone doesn't tell you whether the model is appropriate; always check the residual plots alongside it.
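Computing $R^2$ from its definition, $1 - SS_{res}/SS_{tot}$, is a short exercise (illustrative data again):

```python
import numpy as np

# Hypothetical, nearly linear data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

ss_res = np.sum((y - y_hat) ** 2)      # unexplained variation (residual sum of squares)
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in y about its mean
r_squared = 1 - ss_res / ss_tot        # close to 1 here: the data are nearly linear
```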