Simple linear regression models the relationship between two numerical variables using a straight line. In this section, you'll see how to build that line, measure how well it fits, and use it to make predictions, all in the context of a textbook cost example.
The regression line is defined by its slope and y-intercept. The correlation coefficient tells you how strong the linear pattern is. And the regression equation lets you plug in a value of $x$ to predict $y$, though those predictions come with real limitations.
Simple Linear Regression
Slope and Y-Intercept Calculation
The regression line is written as $\hat{y} = b_0 + b_1 x$, where $b_1$ is the slope and $b_0$ is the y-intercept. These two values completely define the line.
- Slope ($b_1$) tells you how much $y$ changes for every one-unit increase in $x$. For example, if $b_1 = 3.50$ in a textbook cost model, then each additional credit hour is associated with a $3.50 increase in textbook cost.
- Y-intercept ($b_0$) is the predicted value of $y$ when $x = 0$. Sometimes this has a real-world meaning (a base cost before any credit hours), and sometimes it doesn't make practical sense. Either way, you need it to position the line correctly.
Calculating the slope:

$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

- $x_i$ and $y_i$ are individual data points (e.g., number of credit hours and textbook cost for each student)
- $\bar{x}$ and $\bar{y}$ are the means of the $x$ and $y$ variables
- $n$ is the number of data points

The numerator measures how $x$ and $y$ move together. The denominator measures how spread out the $x$ values are. Dividing gives you the rate of change.
Calculating the y-intercept:

$$b_0 = \bar{y} - b_1\bar{x}$$

This formula guarantees that the regression line passes through the point $(\bar{x}, \bar{y})$, which is the center of your data.
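The two formulas above can be sketched in a few lines of Python. The credit-hour and textbook-cost numbers below are hypothetical, made up purely for illustration:

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: how x and y move together; denominator: spread of the x values
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar  # forces the line through (x_bar, y_bar)
    return b0, b1

hours = [6, 9, 12, 15, 18]        # credit hours (hypothetical)
cost = [210, 240, 250, 285, 300]  # textbook cost in dollars (hypothetical)
b0, b1 = fit_line(hours, cost)
print(b0, b1)  # 167.0 7.5
```

Note that the fitted line passes through the mean point: plugging in the mean of `hours` (12) returns the mean of `cost` (257), exactly as the intercept formula guarantees.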
Correlation Coefficient Interpretation
The correlation coefficient ($r$) measures the strength and direction of the linear relationship between two variables. It ranges from $-1$ to $+1$.
- $r = +1$: perfect positive linear relationship (as $x$ goes up, $y$ goes up in a perfectly straight line)
- $r = -1$: perfect negative linear relationship (as $x$ goes up, $y$ goes down in a perfectly straight line)
- $r = 0$: no linear relationship
The sign tells you the direction, and the magnitude tells you the strength. An $r$ of $0.9$ is a strong positive relationship. An $r$ of $-0.5$ is a moderate negative relationship. Values close to zero mean the data points don't follow a linear pattern.
The formula for $r$ is:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Notice the numerator is the same as in the slope formula. The difference is that $r$ standardizes by the spread of both variables, which is why it always falls between $-1$ and $+1$.
Coefficient of determination ($r^2$): Square the correlation coefficient and you get the proportion of variance in $y$ that's explained by $x$. If $r = 0.9$, then $r^2 = 0.81$, meaning 81% of the variation in textbook cost can be accounted for by the number of credit hours. The remaining 19% is due to other factors the model doesn't capture.
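A minimal sketch of the $r$ formula, reusing the same kind of hypothetical credit-hour data (the function name is illustrative):

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Same numerator as the slope formula
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Standardize by the spread of BOTH variables
    den = (sqrt(sum((x - x_bar) ** 2 for x in xs))
           * sqrt(sum((y - y_bar) ** 2 for y in ys)))
    return num / den

hours = [6, 9, 12, 15, 18]        # hypothetical credit hours
cost = [210, 240, 250, 285, 300]  # hypothetical textbook costs
r = correlation(hours, cost)
r_squared = r ** 2  # proportion of variance in y explained by x
```

For this made-up data, `r` comes out close to 1, so nearly all of the cost variation tracks credit hours; real data is rarely this tidy.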

Regression Equation Predictions
To predict a value, plug your $x$ into the regression equation:

$$\hat{y} = b_0 + b_1 x$$
For example, if your textbook cost model is $\hat{y} = 50 + 3.50x$ and a student is taking 12 credit hours, the predicted textbook cost is $\hat{y} = 50 + 3.50(12) = 92$ dollars.
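As a sketch, prediction is one line of arithmetic; the model below assumes the $3.50-per-credit-hour slope from earlier and an illustrative intercept of 50:

```python
def predict(b0, b1, x):
    """Plug x into the regression equation: y_hat = b0 + b1 * x."""
    return b0 + b1 * x

# Assumed illustrative model: intercept 50, slope 3.50 dollars per credit hour
print(predict(50, 3.50, 12))  # 92.0
```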
Limitations you need to know:
- Extrapolation is risky. The equation is only reliable within the range of your observed data. If your data covers 6 to 18 credit hours, predicting the cost for 30 credit hours is extrapolation, and the linear pattern may not hold that far out.
- The model assumes linearity. If the true relationship curves (e.g., textbook costs might level off at high credit loads because of bundled pricing), a straight line won't capture that and your predictions will be off.
- Outliers can distort the line. A few unusual data points (say, one student who spent $500 on a single rare textbook) can pull the slope and intercept away from where they'd otherwise be, making the line less representative of most students.
- Omitted variables matter. Textbook cost probably depends on more than just credit hours. The subject area, whether students buy new or used, and the edition of the book all play a role. A simple linear regression with one predictor can't account for those factors.
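The extrapolation risk above can be made explicit in code with a range guard. The function name and bounds here are illustrative, not a standard API:

```python
def predict_with_guard(x, b0, b1, x_min, x_max):
    """Predict y_hat, refusing x values outside the observed data range."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x={x} lies outside the observed range [{x_min}, {x_max}]; "
            "predicting here would be extrapolation"
        )
    return b0 + b1 * x

# Data covered 6 to 18 credit hours: 12 is safe, 30 is not
predict_with_guard(12, 50, 3.50, 6, 18)    # returns 92.0
# predict_with_guard(30, 50, 3.50, 6, 18)  # raises ValueError
```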
Residual Analysis
After fitting a regression line, you should check whether the model's assumptions actually hold. You do this by examining residuals, which are the differences between observed and predicted values: $e_i = y_i - \hat{y}_i$.
- A residual plot graphs residuals against predicted values (or against $x$). If the model fits well, the residuals should scatter randomly around zero with no obvious pattern.
- If the residuals fan out (get wider) as $\hat{y}$ increases, that's called heteroscedasticity, and it means the variability of $y$ isn't constant across all values of $x$. This violates an assumption of linear regression and can make your predictions less trustworthy at certain ranges.
- If the residuals show a curve, the relationship probably isn't linear, and a straight line isn't the right model.
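Computing residuals is straightforward once the line is fit. This sketch reuses the hypothetical credit-hour data from earlier, refitting the line inline so the block stands alone:

```python
hours = [6, 9, 12, 15, 18]        # hypothetical credit hours
cost = [210, 240, 250, 285, 300]  # hypothetical textbook costs

# Fit the least-squares line inline so the example is self-contained
n = len(hours)
x_bar, y_bar = sum(hours) / n, sum(cost) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, cost))
      / sum((x - x_bar) ** 2 for x in hours))
b0 = y_bar - b1 * x_bar

# Residual for each point: observed minus predicted, e_i = y_i - y_hat_i
residuals = [y - (b0 + b1 * x) for x, y in zip(hours, cost)]

# For a least-squares fit the residuals always sum to (numerically) zero;
# the useful diagnostic is whether they show a pattern when plotted
print(residuals)
```

Plotting `residuals` against the predicted values is the usual next step: a random scatter around zero supports the linear model, while a fan or a curve points to the problems described above.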