
📊Honors Statistics Unit 12 Review


12.2 The Regression Equation


Written by the Fiveable Content Team • Last updated August 2025

The Regression Equation

A regression equation finds the best-fitting straight line through a set of data points, letting you describe and predict how one variable responds to changes in another. In this section, you'll learn how to calculate that line, interpret what its parts mean, and evaluate how well it actually fits your data.


Least-Squares Regression Line Calculation

The goal of least-squares regression is to find the line that minimizes the sum of squared residuals. A residual is the vertical distance between an actual data point and the predicted value on the line: \text{residual} = y_i - \hat{y}_i. By squaring these distances and minimizing their total, we ensure the line sits as close to all the points as possible, and we prevent positive and negative errors from canceling each other out.

The regression equation takes the form:

\hat{y} = b_0 + b_1 x

  • \hat{y} is the predicted value of y for a given x
  • b_1 is the slope (the change in the predicted value of y for each one-unit increase in x)
  • b_0 is the y-intercept (the predicted value of y when x = 0)

How to calculate the slope and intercept:

  1. Find the means \bar{x} and \bar{y} of your x and y data.
  2. Calculate the slope using:

b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}

The numerator captures how x and y move together (their co-variation), and the denominator captures how spread out the x-values are.

  3. Calculate the y-intercept by plugging the slope and the means into:

b_0 = \bar{y} - b_1\bar{x}

This formula guarantees that the regression line always passes through the point (\bar{x}, \bar{y}), which is a useful fact to remember.
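The three steps above can be sketched in a few lines of Python. The data here are hypothetical (hours studied vs. exam score), chosen only to illustrate the arithmetic:

```python
# Hypothetical data: hours studied (x) and exam score (y)
x = [1, 2, 3, 4, 5]
y = [52, 60, 63, 71, 74]
n = len(x)

# Step 1: the means of x and y
x_bar = sum(x) / n
y_bar = sum(y) / n

# Step 2: slope = co-variation of x and y over the spread of x
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)

# Step 3: intercept, which forces the line through (x_bar, y_bar)
b0 = y_bar - b1 * x_bar

print(f"y_hat = {b0:.1f} + {b1:.1f}x")  # y_hat = 47.5 + 5.5x
```

Note that plugging x_bar back into the fitted equation returns exactly y_bar, confirming the line passes through the point of means.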


Interpretation of Slope and Y-Intercept

Knowing the numbers isn't enough; you need to explain what they mean in context.

Slope (b_1): For each one-unit increase in x, the predicted value of y changes by b_1 units.

  • A positive slope means y tends to increase as x increases (direct relationship).
  • A negative slope means y tends to decrease as x increases (inverse relationship).
  • Always include units. For example, if x is years of experience and y is salary in dollars, a slope of 2,400 means: "For each additional year of experience, predicted salary increases by $2,400."

Y-intercept (b_0): This is the predicted value of y when x = 0.

  • Sometimes this makes sense: if x is hours studied and y is exam score, b_0 is the predicted score with zero hours of studying.
  • Often it doesn't make sense. If x is height in inches and y is weight, then b_0 would predict the weight of a person with zero height. In cases like this, the y-intercept is just a mathematical anchor for the line, not a meaningful prediction. Recognizing this distinction is important on exams.

Strength of Linear Relationships

Correlation coefficient (r) measures both the strength and direction of a linear relationship between two variables.

  • r ranges from -1 to 1
    • r = 1: perfect positive linear relationship (all points fall exactly on an increasing line)
    • r = -1: perfect negative linear relationship (all points fall exactly on a decreasing line)
    • r = 0: no linear relationship (points show no linear pattern)
  • The closer |r| is to 1, the more tightly the points cluster around the line.

The formula for r is:

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

Notice the numerator is the same as in the slope formula. The denominator standardizes it so that r is always between -1 and 1.

Coefficient of determination (r^2) tells you the proportion of the variation in y that is explained by the linear relationship with x.

  • r^2 ranges from 0 to 1 and is typically reported as a percentage.
  • If r = 0.80, then r^2 = 0.64, meaning 64% of the variation in y is explained by x. The remaining 36% is due to other factors or randomness.
  • An r^2 close to 1 means the model captures most of the variability; close to 0 means it captures very little.
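A short sketch of computing r and r^2 directly from the formula, using the same hypothetical data as before:

```python
import math

# Hypothetical data: hours studied (x) and exam score (y)
x = [1, 2, 3, 4, 5]
y = [52, 60, 63, 71, 74]
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# The same numerator as the slope formula (co-variation of x and y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
# Spread of x and spread of y, used to standardize r into [-1, 1]
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

print(round(r, 3), round(r_squared, 3))  # 0.988 0.976
```

Here r^2 is about 0.976, so roughly 97.6% of the variation in these (made-up) scores is explained by hours studied.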

A scatter plot is always your first step for visualizing the relationship. It helps you confirm that the relationship is actually linear before you trust r or the regression equation.

Assessing Model Fit and Reliability

Calculating a regression line doesn't guarantee it's a good model. You need to check whether the line is trustworthy.

Residual analysis is the primary diagnostic tool:

  1. Plot the residuals (y_i - \hat{y}_i) against the predicted values or against x.

  2. Look for randomness. A good model produces residuals that scatter randomly around zero with no visible pattern.

  3. Watch for non-linear patterns (curves in the residual plot), which suggest a straight line isn't the right model.

  4. Check for homoscedasticity, meaning the spread of residuals stays roughly constant across all x-values. If the residuals fan out (get wider) as x increases, the model's predictions are less reliable for larger x-values.
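The residuals in step 1 can be computed in a few lines; a minimal sketch using the hypothetical data and fit from earlier (in practice you would plot these rather than print them):

```python
# Hypothetical data with its least-squares fit computed from scratch
x = [1, 2, 3, 4, 5]
y = [52, 60, 63, 71, 74]
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# Residual = actual y minus predicted y
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Least-squares residuals always sum to (essentially) zero, so the
# diagnostic question is not their total but whether they show a
# pattern when plotted against x or against the predicted values.
for xi, res in zip(x, residuals):
    print(xi, round(res, 2))
```

Plotting these (x, residual) pairs and checking for curves or fanning is the visual version of steps 2 to 4.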

Outliers and influential points deserve special attention. A single unusual point, especially one with an extreme x-value, can pull the entire regression line toward it. Always check whether removing a suspicious point substantially changes the slope or intercept.

Standard error of the estimate (s_e) measures the average size of the residuals. Think of it as the typical amount by which actual y-values deviate from the predicted values. A smaller s_e means predictions are more precise.
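As a sketch, s_e can be computed with the usual formula \sqrt{SSE/(n-2)}, where SSE is the sum of squared residuals and n - 2 accounts for the two estimated parameters (slope and intercept). The data and fitted coefficients below are the hypothetical ones from the earlier example:

```python
import math

# Hypothetical data and its least-squares fit (b0 = 47.5, b1 = 5.5)
x = [1, 2, 3, 4, 5]
y = [52, 60, 63, 71, 74]
b0, b1 = 47.5, 5.5
n = len(x)

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# s_e = sqrt(SSE / (n - 2)): the typical prediction error, in y-units
sse = sum(res ** 2 for res in residuals)
s_e = math.sqrt(sse / (n - 2))

print(round(s_e, 3))  # 1.581
```

So actual scores in this made-up dataset typically land within about 1.6 points of the line's predictions.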

Confidence intervals for the slope and intercept give you a range of plausible values rather than a single estimate. A confidence interval for b_1 that does not contain zero provides evidence that there is a real linear relationship between x and y.