Regression Analysis: Distance from School and Academic Performance
Regression analysis lets you model how one variable (like distance from school) relates to another (like academic performance). In this optional activity, you'll apply the core regression tools from Unit 12 to a specific dataset, practicing how to build a model, interpret its output, and recognize its limitations.

Distance and Academic Performance Relationship
Simple linear regression analyzes the relationship between two variables. Here, distance from school is the predictor (independent) variable, and academic performance is the response (dependent) variable.
The correlation coefficient (r) measures the strength and direction of the linear relationship:
- r ranges from -1 to +1
- A positive r means performance tends to increase as distance increases (direct relationship)
- A negative r means performance tends to decrease as distance increases (inverse relationship)
- An r near 0 suggests little or no linear association
The coefficient of determination (R²) tells you the proportion of variance in performance that distance explains. It ranges from 0 to 1. For example, an R² of 0.36 means 36% of the variability in academic performance is accounted for by distance from school. The remaining 64% comes from other factors not in the model.
The regression equation takes the form ŷ = a + bx, where:
- ŷ is the predicted performance
- a is the y-intercept: the predicted performance when distance is 0 (i.e., a student who lives at the school)
- b is the slope: the predicted change in performance for each one-unit increase in distance

Building and Using the Regression Model
Least squares regression finds the values of a and b that minimize the sum of squared residuals. A residual is the difference between an observed value and the predicted value: e = y − ŷ. By squaring these differences and minimizing the total, the method produces the best-fitting line through the data.
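The closed-form least-squares estimates can be written out directly: b = Sxy / Sxx and a = ȳ − b·x̄. The sketch below implements those formulas by hand on a hypothetical dataset, so you can see the residuals the method is minimizing:

```python
# Minimal hand-rolled least squares on hypothetical data
x = [1, 3, 5, 7, 9]
y = [90, 84, 81, 75, 70]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form estimates: b = Sxy / Sxx, a = y_bar - b * x_bar
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residuals: observed minus predicted; least squares minimizes their squared sum
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
```

A useful sanity check: the residuals of a least-squares line always sum to zero, which you can verify with this output.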
Once you have the equation, making predictions is straightforward:
- Take a specific distance value (x).
- Plug it into the regression equation ŷ = a + bx.
- The result is the model's predicted performance for that distance.
Two types of intervals come up when making predictions:
- A confidence interval gives a range for the mean performance of all students at a given distance.
- A prediction interval gives a range for an individual student's performance at that distance. Prediction intervals are always wider because individual outcomes vary more than group averages.
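The claim that prediction intervals are always wider follows from their standard-error formulas: the interval for an individual response has an extra "+1" under the square root. A minimal sketch with hypothetical data (the coefficients match the made-up dataset, and no t-multiplier is applied, only the standard errors are compared):

```python
import math

# Hypothetical data and its least-squares fit
x = [1, 3, 5, 7, 9]
y = [90, 84, 81, 75, 70]
n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
a, b = 92.25, -2.45  # least-squares coefficients for this data

# Residual standard error s, with n - 2 degrees of freedom
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x0 = 6.0
# Standard error for the MEAN response at x0 (confidence interval)
se_mean = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)
# Standard error for an INDIVIDUAL response at x0 (prediction interval)
se_indiv = s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)

assert se_indiv > se_mean  # the prediction interval is always wider
```

The extra 1 under the square root represents the variability of a single student around the mean line, which never goes away no matter how large the sample.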
Be careful with extrapolation: predicting beyond the range of your observed data. If your dataset only includes students living 1 to 15 miles from school, predicting performance for a student 40 miles away is risky. The linear pattern may not hold outside the observed range.

Limitations of Distance as a Predictor
Regression results require careful interpretation. Several issues can undermine your conclusions:
- Omitted variable bias happens when factors left out of the model (socioeconomic status, parental education, school quality) are correlated with both distance and performance. If these are omitted, the slope may overstate or understate the true effect of distance.
- Measurement error in either variable reduces the accuracy of your estimates. Self-reported distances, for instance, may be imprecise.
- Non-linearity: the true relationship might not be a straight line. Performance could drop sharply for the first few miles and then level off. Transformations (logarithmic, quadratic) or non-linear models may fit better.
- Correlation is not causation. Even a strong r doesn't prove that distance causes lower performance. Establishing causation requires a different study design, such as a randomized controlled trial.
- Generalizability depends on how the data were collected. Results from one school district may not apply to another with different geography, transportation, or demographics.
Model Assessment and Diagnostics
After fitting a regression, you need to check whether the model's assumptions actually hold:
- Outliers can pull the regression line toward them, distorting a, b, and r. Always plot your data and investigate any points with unusually large residuals.
- Heteroscedasticity means the spread of residuals changes across levels of the predictor. If residuals fan out as distance increases, the model's standard errors (and therefore confidence/prediction intervals) may be unreliable. A residual plot makes this easy to spot.
- Residual plots are your primary diagnostic tool. Plot residuals against predicted values. You want to see a random scatter with no obvious pattern. Any curve or funnel shape signals a problem.
- Cross-validation tests how well the model predicts new data by fitting it on a subset and checking predictions on the rest. This guards against overfitting.
- Multicollinearity becomes a concern if you expand to multiple regression with predictors that are correlated with each other (e.g., distance and commute time). It inflates standard errors and makes individual coefficients hard to interpret.