Regression Analysis: Distance from School and Student Performance
Regression analysis gives us a way to quantify the relationship between two variables and make predictions based on that relationship. In this context, we're asking: does how far a student lives from school relate to their academic performance? To answer that, you'll use three connected tools: scatterplots to visualize the data, correlation coefficients to measure the relationship's strength, and linear regression equations to make predictions.
Scatterplot Interpretation for Student Performance
A scatterplot is a graph that shows the relationship between two quantitative variables. Each point on the plot represents one student, with their distance from school on the x-axis (independent variable) and their academic performance on the y-axis (dependent variable).
The overall pattern of the points tells you about the relationship:
- Positive linear relationship: Points trend upward left to right (farther distance, higher performance)
- Negative linear relationship: Points trend downward left to right (farther distance, lower performance)
- No linear relationship: Points show no clear straight-line pattern, or they follow a curve
You can also gauge the strength of the relationship visually. If the points cluster tightly around an imaginary straight line, the relationship is strong. If they're scattered widely, it's weak.
Two more things to look for on a scatterplot:
- Outliers are points that fall far from the overall pattern. These deserve attention because they can distort your analysis.
- Residuals are the vertical distances between each data point and the regression line. On a scatterplot with a regression line drawn, you can see residuals as the gap between a point and the line directly above or below it.
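To make the idea of a residual concrete, here is a minimal sketch. The five data points and the line $\hat{y} = 90 - 2x$ are hypothetical values chosen for illustration; they do not come from a real dataset.

```python
# Hypothetical data: distance from school (miles) and test score (points).
distances = [1, 3, 5, 8, 10]
scores = [89, 83, 82, 73, 71]


def predict(distance):
    """Assumed line for illustration only: y_hat = 90 - 2x."""
    return 90 - 2 * distance


# Residual = observed value minus predicted value (the vertical gap to the line).
residuals = [y - predict(x) for x, y in zip(distances, scores)]

for x, y, e in zip(distances, scores, residuals):
    position = "above" if e > 0 else "below" if e < 0 else "on"
    print(f"{x} mi: observed {y}, predicted {predict(x)}, "
          f"residual {e:+d} ({position} the line)")
```

A positive residual means the point sits above the line, and a negative residual means it sits below.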

Correlation Coefficient of Distance and Achievement
The correlation coefficient ($r$) puts a number on what the scatterplot shows you visually. It measures both the strength and direction of the linear relationship between two variables.
Key properties of $r$:
- It ranges from $-1$ to $+1$
- $r = +1$: perfect positive linear relationship
- $r = -1$: perfect negative linear relationship
- $r = 0$: no linear relationship
- The sign tells you direction (positive or negative trend)
- The absolute value tells you strength (closer to 1 means stronger)
The formula for the correlation coefficient is:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
Where $x_i$ and $y_i$ are individual data values, $\bar{x}$ and $\bar{y}$ are the means of each variable, and $n$ is the number of data points. You likely won't calculate this by hand on an exam, but understanding the formula helps: the numerator captures how the two variables move together, while the denominator standardizes it so the result always falls between $-1$ and $+1$.
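As a rough sketch of how the correlation coefficient can be computed from deviations about the means, the following uses the standard Pearson formula on a small hypothetical dataset. The data values and the function name `pearson_r` are assumptions made for illustration.

```python
import math

# Hypothetical data (not from a real study): distance in miles, score in points.
distances = [1, 2, 3, 4, 5]
scores = [90, 85, 85, 75, 70]


def pearson_r(xs, ys):
    """Correlation coefficient computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: how the two variables move together.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: standardizes the result so it falls between -1 and +1.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)


r = pearson_r(distances, scores)
print(f"r = {r:.3f}")
```

For this made-up data the result is about $-0.96$, which would indicate a strong negative linear relationship.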
A related value is the coefficient of determination ($r^2$). This tells you the proportion of variance in the dependent variable that's explained by the independent variable. For example, if $r = \pm 0.8$, then $r^2 = 0.64$, meaning 64% of the variation in student performance can be explained by distance from school.
Correlation does not imply causation. Even if $r$ shows a strong relationship, you can't conclude that distance causes changes in performance. Other factors (transportation stress, socioeconomic differences by neighborhood, etc.) could be driving the pattern.

Linear Regression for Performance Prediction
Linear regression takes the relationship one step further by giving you an equation to predict values. The regression equation takes this form:
$$\hat{y} = a + bx$$
- $\hat{y}$ = predicted value of the dependent variable (predicted performance)
- $a$ = y-intercept (the predicted performance when distance is 0)
- $b$ = slope (how much predicted performance changes for each one-unit increase in distance)
- $x$ = value of the independent variable (distance)
Calculating the slope and intercept:
- Find the slope: $b = r \cdot \frac{s_y}{s_x}$, where $r$ is the correlation coefficient, $s_y$ is the standard deviation of the dependent variable, and $s_x$ is the standard deviation of the independent variable.
- Find the y-intercept: $a = \bar{y} - b\bar{x}$
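For example, suppose you know $r$, $s_y$, $s_x$, the mean distance $\bar{x}$ in miles, and the mean performance $\bar{y}$ in points, and the slope formula gives $b = -1.5$. The intercept is then $a = \bar{y} - (-1.5)\bar{x} = \bar{y} + 1.5\bar{x}$, and the equation would be $\hat{y} = a - 1.5x$. This means for every additional mile from school, predicted performance drops by 1.5 points.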
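The slope and intercept calculation can be sketched in a few lines. The summary statistics below are hypothetical, picked only so that the slope comes out to the $-1.5$ points per mile mentioned above; they are not measured values.

```python
# Hypothetical summary statistics (assumed for illustration only).
r = -0.6        # correlation coefficient
s_y = 10.0      # standard deviation of performance (points)
s_x = 4.0       # standard deviation of distance (miles)
mean_x = 5.0    # mean distance (miles)
mean_y = 75.0   # mean performance (points)

# Slope: correlation times the ratio of standard deviations.
b = r * s_y / s_x
# Intercept: chosen so the line passes through (mean_x, mean_y).
a = mean_y - b * mean_x


def predict(distance):
    """Predicted performance (points) for a given distance (miles)."""
    return a + b * distance


print(f"slope b = {b}, intercept a = {a}")
print(f"predicted score at 3 miles: {predict(3)}")
```

With these assumed inputs the line is $\hat{y} = 82.5 - 1.5x$, so a student living 3 miles away would have a predicted score of 78 points.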
The regression line is found using the least squares method, which positions the line to minimize the sum of squared residuals. In other words, it finds the line that collectively gets as close as possible to all the data points.
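One way to see what "minimize the sum of squared residuals" means is to compare the least squares line against slightly different lines. The sketch below uses hypothetical data and computes the slope as $\sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$, which is algebraically the same as $r \cdot s_y / s_x$. The function names are illustrative.

```python
# Hypothetical data (not from a real study): distance in miles, score in points.
distances = [1, 2, 3, 4, 5]
scores = [90, 85, 85, 75, 70]


def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(intercept, slope, xs, ys):
    """Add up the squared vertical gaps between each point and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


a, b = least_squares_line(distances, scores)
best = sum_squared_residuals(a, b, distances, scores)
# Two alternative lines: one shifted up by 1 point, one with a flatter slope.
shifted = sum_squared_residuals(a + 1, b, distances, scores)
tilted = sum_squared_residuals(a, b + 0.5, distances, scores)

print(f"least squares line: y_hat = {a} + ({b})x, SSR = {best}")
print(f"shifted line SSR = {shifted}, tilted line SSR = {tilted}")
```

Both alternative lines produce a larger sum of squared residuals than the fitted line, which is the property the least squares method guarantees.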
Limitations to keep in mind:
- The equation only models linear relationships. If the scatterplot shows a curve, a straight line won't capture the pattern well.
- Outliers can pull the line toward them, distorting the slope and intercept.
- Predictions should stay within the range of your original data. Predicting performance for a student who lives 50 miles away when your data only goes up to 15 miles is extrapolation, and it's unreliable.
- The model doesn't account for other factors that affect performance (socioeconomic status, school quality, study habits).
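The extrapolation limitation can be built into a prediction helper so that out-of-range requests are refused instead of silently answered. This is only a sketch: the equation $\hat{y} = 82.5 - 1.5x$, the 0.5-mile lower bound, and the function name are assumptions; the 15-mile upper bound mirrors the example in the list above.

```python
# Hypothetical fitted equation and observed distance range (assumed values).
INTERCEPT = 82.5      # predicted points at distance 0
SLOPE = -1.5          # change in predicted points per mile
MIN_DISTANCE = 0.5    # smallest distance in the assumed original data (miles)
MAX_DISTANCE = 15.0   # largest distance in the assumed original data (miles)


def predict_within_range(distance):
    """Predict performance, refusing to extrapolate beyond the observed data."""
    if not MIN_DISTANCE <= distance <= MAX_DISTANCE:
        raise ValueError(
            f"{distance} miles is outside the observed range "
            f"({MIN_DISTANCE} to {MAX_DISTANCE} miles)"
        )
    return INTERCEPT + SLOPE * distance


print(predict_within_range(10))

try:
    predict_within_range(50)
except ValueError as error:
    print(f"refused: {error}")
```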
How do you know if the regression equation is useful? Check the strength of the linear relationship. A high $|r|$ value (and therefore a high $r^2$) means the equation explains a good portion of the variation and can give you more trustworthy predictions. A low $r^2$ means the line doesn't fit the data well, and predictions from it won't be very reliable.
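For a least squares line with one predictor, the share of variation explained can also be computed as one minus the ratio of leftover (residual) variation to total variation, and it matches $r^2$. The sketch below checks this on hypothetical data; all values and variable names are illustrative.

```python
# Hypothetical data (not from a real study): distance in miles, score in points.
distances = [1, 2, 3, 4, 5]
scores = [90, 85, 85, 75, 70]

n = len(distances)
mean_x = sum(distances) / n
mean_y = sum(scores) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances, scores))
sxx = sum((x - mean_x) ** 2 for x in distances)
syy = sum((y - mean_y) ** 2 for y in scores)

# Least squares slope and intercept.
b = sxy / sxx
a = mean_y - b * mean_x

# Variation left over after using the line (sum of squared residuals).
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(distances, scores))
# Total variation in performance around its mean.
total = syy

r_squared = 1 - unexplained / total
r = sxy / (sxx * syy) ** 0.5

print(f"explained share of variation: {r_squared:.3f}")
print(f"r squared from the correlation: {r ** 2:.3f}")
```

Both printed values come out to about 0.926 for this made-up data, so roughly 93% of the variation in these hypothetical scores would be accounted for by distance.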
Additional Regression Considerations
At the intro level, two issues are worth being aware of:
- Heteroscedasticity occurs when the spread of residuals isn't consistent across all values of the independent variable. For instance, if predictions are accurate for students living close to school but wildly off for students living far away, the residuals have unequal spread. This can make statistical tests less reliable.
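A rough, informal way to spot unequal spread is to compare the standard deviation of residuals for students living near the school against those living far away. This is not a formal statistical test, and the residuals and the 6-mile cutoff below are hypothetical.

```python
import statistics

# Hypothetical residuals (observed minus predicted), paired with distance in miles.
distances = [1, 2, 3, 4, 9, 10, 11, 12]
residuals = [1, -1, 2, -2, 8, -9, 10, -9]

CUTOFF_MILES = 6  # assumed split between "near" and "far" students

near = [e for x, e in zip(distances, residuals) if x < CUTOFF_MILES]
far = [e for x, e in zip(distances, residuals) if x >= CUTOFF_MILES]

# Sample standard deviation of the residuals in each group.
near_spread = statistics.stdev(near)
far_spread = statistics.stdev(far)

print(f"residual spread, near students: {near_spread:.2f}")
print(f"residual spread, far students:  {far_spread:.2f}")
if far_spread > 2 * near_spread or near_spread > 2 * far_spread:
    print("Spread differs a lot between groups: possible heteroscedasticity.")
```

The factor of 2 used in the comparison is an arbitrary threshold for this sketch, not an established rule.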
- Confounding variables are factors outside the model, related to both distance and performance, that could explain the relationship you're seeing. If wealthier families tend to live closer to school and their children perform better, wealth could be confounding the distance-performance relationship. This ties back to the key point: correlation does not imply causation.