Fiveable

📊Honors Statistics Unit 12 Review

12.4 Prediction (Optional)

Written by the Fiveable Content Team • Last updated August 2025

Prediction Using Least-Squares Regression Line

The least-squares regression line lets you take a known value of one variable and predict the value of another. This section covers how to make those predictions, how to interpret them, and when it's appropriate (or not) to use them.


The Regression Equation

The least-squares regression line takes the form:

\hat{y} = b_0 + b_1 x

  • \hat{y} is the predicted value of the response variable
  • b_0 is the y-intercept (the predicted value of y when x = 0)
  • b_1 is the slope (the predicted change in y for each one-unit increase in x)
  • x is the value of the explanatory variable you're plugging in

To make a prediction, substitute your known x value into the equation and solve for \hat{y}.
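
The substitute-and-solve step can be sketched in code. This is a minimal illustration, not part of the original guide: the data points are invented, and `numpy.polyfit` is used here as one common way to get the least-squares slope and intercept.

```python
# A sketch of prediction with a least-squares line: fit b0 and b1 to
# sample data, then plug a new x into y-hat = b0 + b1 * x.
# The (x, y) data below are made up for illustration.
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([3.1, 5.2, 5.9, 8.4, 10.1])

# polyfit with deg=1 returns the coefficients highest degree first:
# slope, then intercept.
b1, b0 = np.polyfit(x, y, deg=1)

def predict(x_new):
    """Plug x into the regression equation to get y-hat."""
    return b0 + b1 * x_new

print(predict(6.0))  # ≈ 7.15 for this invented data
```

Note that `predict` only applies the fitted equation; it knows nothing about whether 6.0 is inside the range of the original data, which is why the extrapolation caution later in this section matters.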

[Image: least-squares regression line equation (Least Squares/Rank Regression Equations, ReliaWiki)]

Interpreting Predicted Values

The predicted value \hat{y} is your best estimate of the response variable for a given x, based on the model. It won't usually match the actual observed value exactly.

Example: Suppose the regression equation for predicting final exam scores from midterm scores is \hat{y} = 25 + 0.7x. If a student scored 80 on the midterm:

\hat{y} = 25 + 0.7(80) = 25 + 56 = 81

You'd predict a final exam score of 81 for a student who earned an 80 on the midterm.

The difference between an actual value and its predicted value is called a residual:

\text{residual} = y - \hat{y}

If that student actually scored 85 on the final, the residual is 85 - 81 = 4. A positive residual means the model underpredicted; a negative residual means it overpredicted.

The standard error of the estimate (often written s_e) summarizes how far actual values tend to fall from the regression line on average. A smaller s_e means the predictions are generally more precise.
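
The residual and s_e definitions above can be computed directly. In this sketch the fitted line \hat{y} = 25 + 0.7x comes from the text's example, but the list of (midterm, final) pairs is invented for illustration, so s_e here is illustrative only.

```python
# Residuals and the standard error of the estimate:
#   residual = y - y-hat
#   s_e = sqrt(SSE / (n - 2)), where SSE is the sum of squared residuals.
# The line 25 + 0.7x is from the text; the data pairs are invented.
import math

b0, b1 = 25.0, 0.7
data = [(80, 85), (70, 72), (90, 88), (60, 69), (75, 77)]  # (midterm, final)

residuals = [y - (b0 + b1 * x) for x, y in data]
sse = sum(r ** 2 for r in residuals)
s_e = math.sqrt(sse / (len(data) - 2))

print(residuals[0])  # residual for the x = 80 student (about 4, as in the text)
```

The divisor n - 2 reflects that two quantities (slope and intercept) were estimated from the data before the residuals were computed.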

[Image: the regression equation (Introduction to Statistics, Gravina)]

When Regression Predictions Are Appropriate

Not every regression equation should be used for prediction. Before relying on a prediction, check these conditions:

  1. Linearity — The relationship between x and y should be roughly linear. Check the scatter plot and the residual plot for any curved patterns.
  2. Normally distributed residuals — The residuals should be approximately normal with a mean of zero. A histogram or normal probability plot of the residuals can help you verify this.
  3. Constant variability (homoscedasticity) — The spread of residuals should stay roughly the same across all values of x. If the residual plot fans out or funnels in, this condition is violated.
  4. No influential outliers — Points with high leverage or large residuals can distort the regression line and make predictions unreliable.
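
A first step toward checking these conditions is computing the residuals themselves. The sketch below, on invented data, confirms the residuals average to zero (a least-squares fit with an intercept forces this) and leaves the visual checks — curvature, fanning, outliers — to a residual plot.

```python
# Compute residuals from a fit as the raw material for condition checks.
# In practice you would plot residuals against x and look for curved
# patterns (linearity) or changing spread (constant variability).
# Data are invented for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 4.1, 5.8, 8.2, 9.9, 12.1])

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

# Least squares with an intercept makes the residuals sum to zero.
print(abs(residuals.mean()) < 1e-9)
# A residual plot would be e.g. plt.scatter(x, residuals) with matplotlib.
```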

Beyond these conditions, two more considerations matter:

  • Stay within the range of the data. Predictions should only be made for x values within (or very close to) the range of the original data. Predicting outside that range is called extrapolation, and it's risky because you have no evidence the linear pattern continues.
  • The model should fit well. A higher R^2 value means the regression line explains more of the variability in y, which generally makes predictions more trustworthy. A low R^2 means the line isn't capturing much of the pattern, so predictions will be imprecise even if all conditions are met.
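
R^2 can be computed as the fraction of total variability the line explains: 1 - SSE/SST. This sketch uses invented, nearly-linear data, so the resulting R^2 is high; real data would typically explain less.

```python
# R^2 = 1 - SSE / SST:
#   SSE = sum of squared residuals (variability the line leaves unexplained)
#   SST = sum of squared deviations of y from its mean (total variability)
# Data are invented for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)        # unexplained variability
sst = np.sum((y - y.mean()) ** 2)     # total variability
r_squared = 1 - sse / sst

print(round(r_squared, 4))  # close to 1 for this nearly linear data
```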

Additional Regression Analysis Tools

  • Scatter plot — Plot the raw data to visually assess whether the relationship looks linear and to spot potential outliers.
  • Confidence interval for the mean response — Gives a range of plausible values for the average y at a particular x. This interval is narrower than the corresponding prediction interval because it targets the population mean, not a single observation.
  • Prediction interval for an individual observation — Gives a range of plausible values for a single new y at a particular x. This interval is always wider than the confidence interval because individual observations carry extra variability beyond the mean.

Prediction intervals are wider than confidence intervals at the same x value. The confidence interval accounts for uncertainty in estimating the mean, while the prediction interval also accounts for the natural scatter of individual data points around that mean.
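
That width comparison can be seen in the standard margin-of-error formulas, where the prediction interval carries an extra "+1" under the square root for the scatter of a single observation. The sketch below uses invented data and a t critical value looked up from a t table (t* ≈ 3.182 for df = 3 at 95%); it is an illustration of the formulas, not a full inference procedure.

```python
# Margins of error at the same x0:
#   CI for mean response:  t* * s_e * sqrt(1/n + (x0 - x_bar)^2 / Sxx)
#   PI for a new y:        t* * s_e * sqrt(1 + 1/n + (x0 - x_bar)^2 / Sxx)
# Data and the t critical value are for illustration.
import math

x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [3.1, 5.2, 5.9, 8.4, 10.1]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))

t_star = 3.182  # t critical value for df = n - 2 = 3, 95% (from a t table)
x0 = 6.0
core = 1 / n + (x0 - x_bar) ** 2 / sxx

ci_margin = t_star * s_e * math.sqrt(core)       # mean response
pi_margin = t_star * s_e * math.sqrt(1 + core)   # single new observation

print(ci_margin < pi_margin)  # the PI is always wider
```

Because the "+1" term dominates the `1/n + (x0 - x_bar)^2/Sxx` piece for any reasonable sample, the prediction interval stays noticeably wider than the confidence interval even at x0 = x_bar, where both are at their narrowest.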