Fiveable

📊Honors Statistics Unit 12 Review


12.8 Regression (Fuel Efficiency) (Optional)

Written by the Fiveable Content Team • Last updated August 2025

Linear Regression and Fuel Efficiency

Linear regression lets you model how one variable (like vehicle weight) predicts another (like fuel efficiency in MPG). In this optional section, you'll apply regression concepts from Unit 12 to a concrete dataset: figuring out how much a car's weight drives its gas mileage, and how strongly other vehicle characteristics correlate with fuel consumption.


Fuel Efficiency vs. Vehicle Weight

The core idea here is straightforward: heavier cars generally get worse gas mileage. Linear regression gives you a precise way to quantify that relationship.

The regression equation is:

$$\hat{y} = \beta_0 + \beta_1 x + \epsilon$$

  • $\hat{y}$ is the predicted fuel efficiency (MPG)
  • $x$ is the vehicle weight (typically in pounds)
  • $\beta_0$ is the y-intercept: the predicted MPG when weight equals zero. This is a mathematical artifact, not a real-world scenario, since no car weighs zero pounds.
  • $\beta_1$ is the slope: the predicted change in MPG for each one-pound increase in weight
  • $\epsilon$ is the error term, capturing variability from factors the model doesn't include (engine type, aerodynamics, driving conditions, etc.)

How are the coefficients estimated? The least squares method finds the values of β0\beta_0 and β1\beta_1 that minimize the sum of squared residuals. A residual is the difference between an observed MPG value and the value the line predicts. Squaring these differences and minimizing the total gives you the "best fit" line.
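To make the least squares idea concrete, here is a minimal sketch in Python using NumPy. The weight and MPG values are hypothetical, chosen only to illustrate the fitting step:

```python
import numpy as np

# Hypothetical dataset (weights in pounds, fuel efficiency in MPG) -- illustrative values only
weight = np.array([2200, 2600, 3000, 3400, 3800, 4200, 4600], dtype=float)
mpg    = np.array([32.0, 29.5, 26.0, 24.5, 21.0, 19.5, 17.0])

# Least squares: np.polyfit(x, y, 1) returns the [slope, intercept] pair
# that minimizes the sum of squared residuals for a degree-1 polynomial.
slope, intercept = np.polyfit(weight, mpg, 1)

# Residuals: observed MPG minus the value the fitted line predicts
residuals = mpg - (intercept + slope * weight)

print(f"b0 (intercept): {intercept:.2f}")
print(f"b1 (slope):     {slope:.5f}")  # expect a negative slope for this data
```

Any dataset with heavier cars getting lower MPG will produce a negative slope here; the specific coefficient values depend entirely on the (made-up) numbers above.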

The coefficient of determination $R^2$ tells you what proportion of the variability in MPG is explained by weight alone. It ranges from 0 to 1. An $R^2$ of 0.75, for example, means 75% of the variation in fuel efficiency across the dataset can be accounted for by differences in vehicle weight. The remaining 25% comes from other factors.
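You can compute $R^2$ directly from its definition, $R^2 = 1 - SS_{res}/SS_{tot}$. A short sketch, again using hypothetical weight/MPG values:

```python
import numpy as np

# Hypothetical weight/MPG data (illustrative values only)
weight = np.array([2200, 2600, 3000, 3400, 3800, 4200, 4600], dtype=float)
mpg    = np.array([32.0, 29.5, 26.0, 24.5, 21.0, 19.5, 17.0])

slope, intercept = np.polyfit(weight, mpg, 1)
predicted = intercept + slope * weight

# R^2 = 1 - SS_res / SS_tot: the proportion of MPG variability explained by weight
ss_res = np.sum((mpg - predicted) ** 2)          # scatter around the fitted line
ss_tot = np.sum((mpg - np.mean(mpg)) ** 2)       # total scatter around the mean
r_squared = 1 - ss_res / ss_tot

print(f"R^2 = {r_squared:.3f}")
```

Because these made-up points lie close to a straight line, the resulting $R^2$ is high; real vehicle data would typically show more scatter.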

[Figure: fuel efficiency vs. vehicle weight with a fitted regression line — David Zelený]

Regression Line Interpretation for MPG

The slope $\beta_1$ is where the real insight lives. For vehicle weight vs. MPG, you should expect a negative slope, since heavier cars tend to be less fuel-efficient.

The magnitude matters too. A slope of $-0.008$ means that for every additional pound of weight, predicted MPG drops by 0.008. That sounds tiny, but it adds up: a car that's 500 pounds heavier than another would be predicted to get about 4 fewer MPG.

Making a prediction (step-by-step):

  1. Start with your fitted equation. Suppose you found $\hat{y} = 45 - 0.008x$.

  2. Plug in the car's weight. For a 3,500-pound vehicle: $\hat{y} = 45 - 0.008(3500)$.

  3. Calculate: $\hat{y} = 45 - 28 = 17$ MPG.

That 17 MPG is a point estimate. A prediction interval gives you a range (say, 13 to 21 MPG) that accounts for both uncertainty in the model's coefficients and the natural scatter of individual data points around the line. Prediction intervals are always wider than confidence intervals for the mean response, because they capture individual variability on top of estimation uncertainty.
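A sketch of both the point estimate and a 95% prediction interval, using the standard simple-regression formula $\hat{y} \pm t^* \, s \sqrt{1 + \tfrac{1}{n} + \tfrac{(x_0 - \bar{x})^2}{S_{xx}}}$. The dataset is hypothetical, so the interval endpoints are illustrative only:

```python
import numpy as np
from scipy import stats

# Hypothetical weight/MPG data (illustrative values only)
weight = np.array([2200, 2600, 3000, 3400, 3800, 4200, 4600], dtype=float)
mpg    = np.array([32.0, 29.5, 26.0, 24.5, 21.0, 19.5, 17.0])
n = len(weight)

slope, intercept = np.polyfit(weight, mpg, 1)

x0 = 3500.0                      # the new car's weight
point = intercept + slope * x0   # point estimate of its MPG

# 95% prediction interval for an *individual* response at x0
resid = mpg - (intercept + slope * weight)
s = np.sqrt(np.sum(resid**2) / (n - 2))               # residual standard error
sxx = np.sum((weight - weight.mean())**2)
se_pred = s * np.sqrt(1 + 1/n + (x0 - weight.mean())**2 / sxx)
t_crit = stats.t.ppf(0.975, df=n - 2)

lo, hi = point - t_crit * se_pred, point + t_crit * se_pred
print(f"point estimate: {point:.1f} MPG, 95% PI: ({lo:.1f}, {hi:.1f})")
```

Note the `1` inside the square root: that is the individual-variability term that makes a prediction interval wider than the corresponding confidence interval for the mean response (which omits it).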

One caution: don't extrapolate far beyond the range of weights in your data. If your dataset only includes cars from 2,000 to 5,000 pounds, predicting MPG for a 500-pound vehicle using this model would be unreliable.


Correlation Strength in Vehicle Characteristics

Beyond weight, you can examine how other vehicle characteristics relate to fuel consumption. The correlation coefficient $r$ measures the strength and direction of a linear relationship between two variables.

  • $r = -1$: perfect negative linear relationship
  • $r = 0$: no linear relationship
  • $r = 1$: perfect positive linear relationship

For interpreting strength, these general guidelines are commonly used:

| $|r|$ Value | Strength |
|---|---|
| 0.7 to 1.0 | Strong |
| 0.3 to 0.7 | Moderate |
| 0.0 to 0.3 | Weak |

Some typical correlations you'd find in vehicle data:

  • Weight and MPG: strong negative correlation. Heavier cars burn more fuel per mile.
  • Engine displacement and fuel consumption: strong positive correlation. Larger engines consume more fuel.
  • Horsepower and fuel consumption: typically a positive correlation. In most datasets, more powerful engines use more fuel, so higher horsepower is associated with higher fuel consumption, not lower.

Remember that $r$ only captures linear relationships. Two variables could have a strong curved relationship but a modest $r$ value.
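Both points can be checked numerically. The sketch below computes $r$ for hypothetical weight/MPG data, then shows the classic failure case: a perfect parabola whose Pearson correlation is zero.

```python
import numpy as np

# Hypothetical vehicle data (illustrative values only)
weight = np.array([2200, 2600, 3000, 3400, 3800, 4200, 4600], dtype=float)
mpg    = np.array([32.0, 29.5, 26.0, 24.5, 21.0, 19.5, 17.0])

# Pearson r: strength and direction of the *linear* relationship
r_weight_mpg = np.corrcoef(weight, mpg)[0, 1]
print(f"r(weight, MPG) = {r_weight_mpg:.3f}")   # strong negative for this data

# r misses curvature: y = x^2 over a symmetric range has r = 0
x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x**2
r_curve = np.corrcoef(x, y)[0, 1]
print(f"r(x, x^2)      = {r_curve:.3f}")        # zero despite a perfect curved relationship
```

The second result is why a scatterplot should always accompany a correlation coefficient.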

Model Diagnostics and Assumptions

Before trusting your regression results, you need to check whether the model's assumptions hold. There are four key things to look at:

  • Residual patterns: Plot residuals against predicted values. If the model fits well, residuals should scatter randomly around zero with no obvious pattern. A curved pattern suggests a linear model isn't the right fit.
  • Constant variance (homoscedasticity): The spread of residuals should stay roughly the same across all predicted values. If residuals fan out (getting wider as predicted values increase), that's heteroscedasticity, and it means your standard errors and prediction intervals may be unreliable.
  • Outliers and influential points: A single extreme data point can pull the regression line substantially. Check whether removing a suspicious point changes your slope or $R^2$ dramatically. If it does, that point is influential and deserves closer investigation.
  • Multicollinearity: If you're running a multiple regression with several predictors (weight, horsepower, engine size), these predictors are often correlated with each other. High multicollinearity makes individual coefficient estimates unstable, even if the overall model still predicts well. You can detect it by checking correlations among your predictors or computing variance inflation factors (VIF).
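The VIF check in the last bullet can be done from first principles: regress each predictor on the others and compute $\text{VIF} = 1/(1 - R^2)$. A sketch with made-up, deliberately collinear vehicle predictors (real data would give different values):

```python
import numpy as np

# Hypothetical predictors for a multiple regression (illustrative values only);
# they rise together, so they are strongly collinear by construction.
weight     = np.array([2200, 2600, 3000, 3400, 3800, 4200, 4600], dtype=float)
horsepower = np.array([110, 130, 150, 175, 200, 230, 260], dtype=float)
displ      = np.array([1.6, 1.8, 2.0, 2.4, 3.0, 3.5, 4.0])

def vif(target, others):
    """VIF = 1 / (1 - R^2) from regressing one predictor on the rest."""
    X = np.column_stack([np.ones(len(target))] + others)   # add an intercept column
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    fitted = X @ beta
    r2 = 1 - np.sum((target - fitted)**2) / np.sum((target - target.mean())**2)
    return 1 / (1 - r2)

print(f"VIF(weight)     = {vif(weight, [horsepower, displ]):.1f}")
print(f"VIF(horsepower) = {vif(horsepower, [weight, displ]):.1f}")
```

A common rule of thumb flags VIF values above 5 or 10 as problematic; the predictors above blow well past that, which is exactly the instability warning described in the bullet.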