Regression Analysis for Fuel Efficiency
Regression analysis helps you understand how one variable (like vehicle weight) affects another (like fuel efficiency). In this section, you'll learn to visualize that relationship with scatterplots, measure it with correlation coefficients, and use a regression equation to make predictions.
Scatterplots for Fuel Efficiency Relationships
A scatterplot is the first tool you should reach for when exploring the relationship between two quantitative variables. For fuel efficiency data, you'd plot vehicle weight (in lbs) on the x-axis and fuel efficiency (in mpg) on the y-axis. Each point represents a single vehicle.
Once the points are plotted, look at the overall pattern:
- A roughly linear pattern suggests a linear relationship between the two variables.
- A downward slope indicates a negative relationship: as weight increases, fuel efficiency decreases. This is what you'd expect here, since heavier cars generally need more fuel to move.
- The tightness of the points around an imaginary line tells you about strength. Points clustered closely around a line suggest a strong relationship, while widely scattered points suggest a weaker one.

Correlation Coefficients in Efficiency Data
The correlation coefficient ($r$) puts a number on what the scatterplot shows you visually. It measures the strength and direction of a linear relationship between two quantitative variables.
- $r$ ranges from $-1$ to $+1$.
- Values close to $-1$ or $+1$ indicate a strong linear relationship; values close to $0$ indicate a weak one.
- A positive $r$ means both variables tend to increase together. A negative $r$ means one tends to decrease as the other increases.
For fuel efficiency vs. vehicle weight, you'd expect a negative $r$. A typical dataset might give a value fairly close to $-1$, which would indicate a strong negative linear relationship.
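The formula for $r$ is:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
where $x_i$ and $y_i$ are individual data values, $\bar{x}$ and $\bar{y}$ are their respective means, and $n$ is the number of data points. In practice, you'll usually compute this with a calculator or software, but understanding the formula helps: the numerator captures how $x$ and $y$ move together, while the denominator standardizes everything so the result falls between $-1$ and $+1$.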
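The formula for the correlation coefficient can be sketched directly in Python. This is a minimal illustration, not a replacement for a statistics package. The function name `pearson_r` and the five data points are assumptions invented for this example; they are not real vehicle measurements.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient for paired samples xs and ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: how x and y move together around their means.
    co_movement = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: standardizes the result so it falls between -1 and +1.
    spread_x = sum((x - x_bar) ** 2 for x in xs)
    spread_y = sum((y - y_bar) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)


# Hypothetical data, invented for illustration only.
weights_lbs = [2000, 2500, 3000, 3500, 4000]
mpg = [36, 31, 28, 24, 21]

r = pearson_r(weights_lbs, mpg)
print(f"r = {r:.3f}")
```

For this made-up dataset the result is negative and close to $-1$, matching the downward, tightly clustered pattern you would see in its scatterplot. Note that the sketch divides by zero if every value of one variable is identical, since the correlation is undefined in that case.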

Linear Regression for Efficiency Predictions
Linear regression finds the line of best fit through your scatterplot. This line minimizes the sum of the squared differences between the observed values and the predicted values, which is why the method is called least squares regression.
The regression equation takes the form:
$$\hat{y} = a + bx$$
- $\hat{y}$ = predicted value of the dependent variable (fuel efficiency in mpg)
- $x$ = value of the independent variable (vehicle weight in lbs)
- $a$ = y-intercept (the predicted mpg when weight is zero, though this often isn't meaningful in context)
- $b$ = slope (the predicted change in mpg for each additional pound of weight)
Calculating the slope and intercept:
$$b = r \cdot \frac{s_y}{s_x} \qquad a = \bar{y} - b\bar{x}$$
where $r$ is the correlation coefficient, $s_y$ and $s_x$ are the standard deviations of the dependent and independent variables, and $\bar{y}$ and $\bar{x}$ are their means.
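The slope and intercept calculation can be sketched as follows, using only Python's standard library. The helper names `pearson_r` and `fit_line` and the small dataset are assumptions made up for illustration; the standard deviations come from `statistics.stdev` (the sample standard deviation).

```python
import math
import statistics


def pearson_r(xs, ys):
    """Correlation coefficient for paired samples xs and ys."""
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys)
    )
    return numerator / denominator


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    r = pearson_r(xs, ys)
    # Slope: correlation times the ratio of standard deviations (y over x).
    slope = r * statistics.stdev(ys) / statistics.stdev(xs)
    # Intercept: forces the line through the point of means.
    intercept = statistics.mean(ys) - slope * statistics.mean(xs)
    return intercept, slope


# Hypothetical data, invented for illustration only.
weights_lbs = [2000, 2500, 3000, 3500, 4000]
mpg = [36, 31, 28, 24, 21]

a, b = fit_line(weights_lbs, mpg)
print(f"predicted mpg = {a:.1f} + ({b:.4f}) * weight")
```

Because the intercept is built from the two means, the fitted line always passes through the point of means, which is a quick sanity check on any hand calculation.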
Making a prediction:
- Plug the vehicle's weight into the equation for $x$.
- Solve for $\hat{y}$ to get the predicted fuel efficiency.
For example, if you have a fitted regression equation and want to predict the mpg for a 3,000 lb car, substitute $x = 3000$: $\hat{y} = a + b(3000)$ mpg.
Residuals are the differences between observed and predicted values: $e = y - \hat{y}$. If a car actually gets 30 mpg but the model predicted 27.6, the residual is $30 - 27.6 = 2.4$ mpg. Residuals help you assess how well the model fits the data. A good model will have residuals that are small and show no obvious pattern.
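Prediction and residuals can be sketched in a few lines of Python. The first calculation uses the 30 mpg and 27.6 mpg figures from the paragraph above. Everything else is an assumption for illustration: the function names, the coefficients `a` and `b`, and the five data points are invented, not taken from a real dataset.

```python
def predict_mpg(a, b, weight_lbs):
    """Predicted mpg from the line: intercept a plus slope b times weight."""
    return a + b * weight_lbs


def residual(observed, predicted):
    """Observed value minus predicted value."""
    return observed - predicted


# The example from the text: observed 30 mpg, predicted 27.6 mpg.
example = residual(30, 27.6)
print("residual from the text's example:", round(example, 1))

# Hypothetical coefficients and data, invented for illustration only.
a, b = 50.2, -0.0074
weights_lbs = [2000, 2500, 3000, 3500, 4000]
observed_mpg = [36, 31, 28, 24, 21]

residuals = [
    residual(y, predict_mpg(a, b, x)) for x, y in zip(weights_lbs, observed_mpg)
]
for x, e in zip(weights_lbs, residuals):
    print(f"{x} lb: residual = {e:+.1f} mpg")
```

A positive residual means the car did better than the line predicted, and a negative one means it did worse. For a least squares line, the residuals sum to zero (up to rounding), so checking for patterns matters more than checking the total.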
Advanced Regression Techniques
Two concepts worth knowing at this level:
- Multiple regression extends simple linear regression by including more than one independent variable. For fuel efficiency, you might include both weight and engine size as predictors, which can improve the model's accuracy.
- Extrapolation means using the regression equation to predict values outside the range of your original data. This is risky. If your data only includes cars weighing 2,000 to 5,000 lbs, predicting mpg for a 500 lb vehicle could give a nonsensical result. Stick to predictions within (or very close to) the range of your data.
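One way to guard against accidental extrapolation is to refuse predictions outside the observed weight range. The sketch below assumes the 2,000 to 5,000 lb range mentioned above. The function name `predict_within_range` and the coefficients are hypothetical, chosen only to make the example runnable.

```python
def predict_within_range(a, b, weight_lbs, min_weight, max_weight):
    """Predict mpg only when the weight lies inside the observed data range."""
    if not (min_weight <= weight_lbs <= max_weight):
        raise ValueError(
            f"{weight_lbs} lb is outside the observed range "
            f"{min_weight}-{max_weight} lb; prediction would be extrapolation"
        )
    return a + b * weight_lbs


# Hypothetical coefficients, invented for illustration only.
a, b = 50.2, -0.0074

# Inside the range: a prediction is returned.
print(predict_within_range(a, b, 3000, 2000, 5000))

# Outside the range: the function raises instead of returning a number.
try:
    predict_within_range(a, b, 500, 2000, 5000)
except ValueError as err:
    print("refused:", err)
```

This version is strict about the boundaries. If predictions slightly outside the range are acceptable for your purposes, you could widen `min_weight` and `max_weight` by a small margin, but that margin is a judgment call rather than a rule.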