Simple linear regression models the relationship between two numerical variables using a straight line. In this section, you'll see how to build that line, measure how well it fits, and use it to make predictions, all in the context of a textbook cost example.
The regression line is defined by its slope and y-intercept. The correlation coefficient tells you how strong the linear pattern is. And the regression equation lets you plug in a value of $x$ to predict $y$, though those predictions come with real limitations.
Simple Linear Regression
Slope and Y-Intercept Calculation
The regression line is written as $\hat{y} = b_0 + b_1 x$, where $b_1$ is the slope and $b_0$ is the y-intercept. These two values completely define the line.
- Slope ($b_1$) tells you how much $y$ changes for every one-unit increase in $x$. For example, if $b_1 = 3.50$ in a textbook cost model, then each additional credit hour is associated with a $3.50 increase in textbook cost.
- Y-intercept ($b_0$) is the predicted value of $y$ when $x = 0$. Sometimes this has a real-world meaning (a base cost before any credit hours), and sometimes it doesn't make practical sense. Either way, you need it to position the line correctly.
Calculating the slope:

$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

- $x_i$ and $y_i$ are individual data points (e.g., number of credit hours and textbook cost for each student)
- $\bar{x}$ and $\bar{y}$ are the means of the $x$ and $y$ variables
- $n$ is the number of data points

The numerator measures how $x$ and $y$ move together. The denominator measures how spread out the $x$ values are. Dividing gives you the rate of change.
Calculating the y-intercept:

$$b_0 = \bar{y} - b_1\bar{x}$$

This formula guarantees that the regression line passes through the point $(\bar{x}, \bar{y})$, which is the center of your data.
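The two formulas above can be sketched in a few lines of Python. The credit-hour and textbook-cost numbers below are hypothetical, made up purely for illustration:

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: how x and y move together; denominator: spread of the x values
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar  # forces the line through (x_bar, y_bar)
    return b0, b1

hours = [6, 9, 12, 15, 18]        # credit hours (hypothetical)
cost = [210, 240, 250, 285, 300]  # textbook cost in dollars (hypothetical)
b0, b1 = fit_line(hours, cost)
print(b0, b1)  # 167.0 7.5
```

Note that the fitted line passes through the mean point: plugging in the mean of `hours` (12) returns the mean of `cost` (257), exactly as the intercept formula guarantees.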
Correlation Coefficient Interpretation
The correlation coefficient ($r$) measures the strength and direction of the linear relationship between two variables. It ranges from $-1$ to $+1$.
- $r = +1$: perfect positive linear relationship (as $x$ goes up, $y$ goes up in a perfectly straight line)
- $r = -1$: perfect negative linear relationship (as $x$ goes up, $y$ goes down in a perfectly straight line)
- $r = 0$: no linear relationship
The sign tells you the direction, and the magnitude tells you the strength. An $r$ of $0.9$ is a strong positive relationship. An $r$ of $-0.5$ is a moderate negative relationship. Values close to zero mean the data points don't follow a linear pattern.
The formula for $r$ is:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Notice the numerator is the same as in the slope formula. The difference is that $r$ standardizes by the spread of both variables, which is why it always falls between $-1$ and $+1$.
Coefficient of determination ($r^2$): Square the correlation coefficient and you get the proportion of variance in $y$ that's explained by $x$. If $r = 0.9$, then $r^2 = 0.81$, meaning 81% of the variation in textbook cost can be accounted for by the number of credit hours. The remaining 19% is due to other factors the model doesn't capture.
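A minimal sketch of the $r$ formula, reusing the same kind of hypothetical credit-hour data (the function name is illustrative):

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Same numerator as the slope formula
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Standardize by the spread of BOTH variables
    den = (sqrt(sum((x - x_bar) ** 2 for x in xs))
           * sqrt(sum((y - y_bar) ** 2 for y in ys)))
    return num / den

hours = [6, 9, 12, 15, 18]        # hypothetical credit hours
cost = [210, 240, 250, 285, 300]  # hypothetical textbook costs
r = correlation(hours, cost)
r_squared = r ** 2  # proportion of variance in y explained by x
```

For this made-up data, `r` comes out close to 1, so nearly all of the cost variation tracks credit hours; real data is rarely this tidy.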

Regression Equation Predictions
To predict a value, plug your $x$ into the regression equation:

$$\hat{y} = b_0 + b_1 x$$
For example, if your textbook cost model is $\hat{y} = 50 + 3.50x$ and a student is taking 12 credit hours, the predicted textbook cost is $\hat{y} = 50 + 3.50(12) = 92$ dollars.
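As a sketch, prediction is one line of arithmetic; the model below assumes the $3.50-per-credit-hour slope from earlier and an illustrative intercept of 50:

```python
def predict(b0, b1, x):
    """Plug x into the regression equation: y_hat = b0 + b1 * x."""
    return b0 + b1 * x

# Assumed illustrative model: intercept 50, slope 3.50 dollars per credit hour
print(predict(50, 3.50, 12))  # 92.0
```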
Limitations you need to know:
- Extrapolation is risky. The equation is only reliable within the range of your observed data. If your data covers 6 to 18 credit hours, predicting the cost for 30 credit hours is extrapolation, and the linear pattern may not hold that far out.
- The model assumes linearity. If the true relationship curves (e.g., textbook costs might level off at high credit loads because of bundled pricing), a straight line won't capture that and your predictions will be off.
- Outliers can distort the line. A few unusual data points (say, one student who spent $500 on a single rare textbook) can pull the slope and intercept away from where they'd otherwise be, making the line less representative of most students.
- Omitted variables matter. Textbook cost probably depends on more than just credit hours. The subject area, whether students buy new or used, and the edition of the book all play a role. A simple linear regression with one predictor can't account for those factors.
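The extrapolation risk above can be made explicit in code with a range guard. The function name and bounds here are illustrative, not a standard API:

```python
def predict_with_guard(x, b0, b1, x_min, x_max):
    """Predict y_hat, refusing x values outside the observed data range."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x={x} lies outside the observed range [{x_min}, {x_max}]; "
            "predicting here would be extrapolation"
        )
    return b0 + b1 * x

# Data covered 6 to 18 credit hours: 12 is safe, 30 is not
predict_with_guard(12, 50, 3.50, 6, 18)    # returns 92.0
# predict_with_guard(30, 50, 3.50, 6, 18)  # raises ValueError
```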
Residual Analysis
After fitting a regression line, you should check whether the model's assumptions actually hold. You do this by examining residuals, which are the differences between observed and predicted values: $e_i = y_i - \hat{y}_i$.
- A residual plot graphs residuals against predicted values (or against $x$). If the model fits well, the residuals should scatter randomly around zero with no obvious pattern.
- If the residuals fan out (get wider) as $\hat{y}$ increases, that's called heteroscedasticity, and it means the variability of $y$ isn't constant across all values of $x$. This violates an assumption of linear regression and can make your predictions less trustworthy at certain ranges.
- If the residuals show a curve, the relationship probably isn't linear, and a straight line isn't the right model.
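Computing residuals is straightforward once the line is fit. This sketch reuses the hypothetical credit-hour data from earlier, refitting the line inline so the block stands alone:

```python
hours = [6, 9, 12, 15, 18]        # hypothetical credit hours
cost = [210, 240, 250, 285, 300]  # hypothetical textbook costs

# Fit the least-squares line inline so the example is self-contained
n = len(hours)
x_bar, y_bar = sum(hours) / n, sum(cost) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, cost))
      / sum((x - x_bar) ** 2 for x in hours))
b0 = y_bar - b1 * x_bar

# Residual for each point: observed minus predicted, e_i = y_i - y_hat_i
residuals = [y - (b0 + b1 * x) for x, y in zip(hours, cost)]

# For a least-squares fit the residuals always sum to (numerically) zero;
# the useful diagnostic is whether they show a pattern when plotted
print(residuals)
```

Plotting `residuals` against the predicted values is the usual next step: a random scatter around zero supports the linear model, while a fan or a curve points to the problems described above.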