🎲Intro to Statistics Unit 12 Review

12.8 Regression (Textbook Cost)

Written by the Fiveable Content Team • Last updated August 2025

Simple linear regression models the relationship between two numerical variables using a straight line. In this section, you'll see how to build that line, measure how well it fits, and use it to make predictions, all in the context of a textbook cost example.

The regression line is defined by its slope and y-intercept. The correlation coefficient tells you how strong the linear pattern is. And the regression equation lets you plug in a value of $x$ to predict $y$, though those predictions come with real limitations.

Simple Linear Regression

Slope and Y-Intercept Calculation

The regression line is written as $\hat{y} = b_0 + b_1 x$, where $b_1$ is the slope and $b_0$ is the y-intercept. These two values completely define the line.

  • Slope ($b_1$) tells you how much $y$ changes for every one-unit increase in $x$. For example, if $b_1 = 3.5$ in a textbook cost model, then each additional credit hour is associated with a $3.50 increase in textbook cost.
  • Y-intercept ($b_0$) is the predicted value of $y$ when $x = 0$. Sometimes this has a real-world meaning (a base cost before any credit hours), and sometimes it doesn't make practical sense. Either way, you need it to position the line correctly.

Calculating the slope:

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

  • $x_i$ and $y_i$ are individual data points (e.g., number of credit hours and textbook cost for each student)
  • $\bar{x}$ and $\bar{y}$ are the means of the $x$ and $y$ variables
  • $n$ is the number of data points

The numerator measures how $x$ and $y$ move together. The denominator measures how spread out the $x$ values are. Dividing gives you the rate of change.

Calculating the y-intercept:

$$b_0 = \bar{y} - b_1 \bar{x}$$

This formula guarantees that the regression line passes through the point $(\bar{x}, \bar{y})$, which is the center of your data.
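To make the formulas concrete, here's a minimal sketch that fits the line by hand. The (credit hours, textbook cost) numbers are made up for illustration, not taken from the example in this guide:

```python
# Fit the regression line using the slope and intercept formulas above,
# on a small hypothetical dataset (the numbers are illustrative only).
xs = [6, 9, 12, 15, 18]    # credit hours
ys = [45, 52, 61, 70, 83]  # textbook cost in dollars

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
den = sum((x - x_bar) ** 2 for x in xs)
b1 = num / den

# b0 = y_bar - b1 * x_bar, which forces the line through (x_bar, y_bar)
b0 = y_bar - b1 * x_bar

print(b1, b0)
```

Because of the intercept formula, plugging $\bar{x}$ back into the fitted line always returns exactly $\bar{y}$, which is an easy sanity check on any hand calculation.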

Correlation Coefficient Interpretation

The correlation coefficient ($r$) measures the strength and direction of the linear relationship between two variables. It ranges from $-1$ to $1$.

  • $r = 1$: perfect positive linear relationship (as $x$ goes up, $y$ goes up in a perfectly straight line)
  • $r = -1$: perfect negative linear relationship (as $x$ goes up, $y$ goes down in a perfectly straight line)
  • $r = 0$: no linear relationship

The sign tells you the direction, and the magnitude tells you the strength. An $r$ of $0.9$ is a strong positive relationship. An $r$ of $-0.4$ is a moderate negative relationship. Values close to zero mean the data points don't follow a linear pattern.

The formula for $r$ is:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Notice the numerator is the same as in the slope formula. The difference is that $r$ standardizes by the spread of both variables, which is why it always falls between $-1$ and $1$.

Coefficient of determination ($r^2$): Square the correlation coefficient and you get the proportion of variance in $y$ that's explained by $x$. If $r = 0.9$, then $r^2 = 0.81$, meaning 81% of the variation in textbook cost can be accounted for by the number of credit hours. The remaining 19% is due to other factors the model doesn't capture.
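The $r$ formula translates directly into code. This sketch uses the same kind of made-up (credit hours, cost) data as before; nearly linear data like this gives an $r$ close to $1$:

```python
import math

# Correlation coefficient r from the formula above, computed on
# hypothetical (credit hours, textbook cost) data -- illustrative only.
xs = [6, 9, 12, 15, 18]
ys = [45, 52, 61, 70, 83]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Same numerator as the slope formula...
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
# ...but the denominator standardizes by the spread of BOTH variables.
den = (math.sqrt(sum((x - x_bar) ** 2 for x in xs))
       * math.sqrt(sum((y - y_bar) ** 2 for y in ys)))

r = num / den
r_squared = r ** 2  # proportion of variance in y explained by x
```
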


Regression Equation Predictions

To predict a value, plug your $x$ into the regression equation:

$$\hat{y} = b_0 + b_1 x$$

For example, if your textbook cost model is $\hat{y} = 20 + 3.5x$ and a student is taking 12 credit hours, the predicted textbook cost is $\hat{y} = 20 + 3.5(12) = 62$ dollars.
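In code, that plug-and-chug step is a one-liner; the coefficients below are the ones from the example model above:

```python
def predict_cost(credit_hours, b0=20.0, b1=3.5):
    """Predicted textbook cost: y-hat = b0 + b1 * x."""
    return b0 + b1 * credit_hours

print(predict_cost(12))  # 62.0
```
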

Limitations you need to know:

  1. Extrapolation is risky. The equation is only reliable within the range of your observed data. If your data covers 6 to 18 credit hours, predicting the cost for 30 credit hours is extrapolation, and the linear pattern may not hold that far out.
  2. The model assumes linearity. If the true relationship curves (e.g., textbook costs might level off at high credit loads because of bundled pricing), a straight line won't capture that and your predictions will be off.
  3. Outliers can distort the line. A few unusual data points (say, one student who spent $500 on a single rare textbook) can pull the slope and intercept away from where they'd otherwise be, making the line less representative of most students.
  4. Omitted variables matter. Textbook cost probably depends on more than just credit hours. The subject area, whether students buy new or used, and the edition of the book all play a role. A simple linear regression with one predictor can't account for those factors.
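One practical guard against the extrapolation risk in point 1 is to flag any prediction outside the observed $x$ range. A hedged sketch, using the 6-to-18-credit-hour range from that example:

```python
import warnings

def predict_with_range_check(x, b0=20.0, b1=3.5, x_min=6, x_max=18):
    """Predict y-hat = b0 + b1*x, warning when x falls outside the
    observed data range (where the linear pattern may not hold)."""
    if not (x_min <= x <= x_max):
        warnings.warn(
            f"x={x} is outside the observed range [{x_min}, {x_max}]; "
            "this prediction is extrapolation and may be unreliable"
        )
    return b0 + b1 * x
```

Predicting at 12 credit hours returns quietly, while predicting at 30 still returns a number but emits a warning, which mirrors the statistical reality: the arithmetic always works, the inference may not.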

Residual Analysis

After fitting a regression line, you should check whether the model's assumptions actually hold. You do this by examining residuals, which are the differences between observed and predicted values: $y_i - \hat{y}_i$.

  • A residual plot graphs residuals against predicted values (or against $x$). If the model fits well, the residuals should scatter randomly around zero with no obvious pattern.
  • If the residuals fan out (get wider) as $x$ increases, that's called heteroscedasticity, and it means the variability of $y$ isn't constant across all values of $x$. This violates an assumption of linear regression and can make your predictions less trustworthy at certain ranges.
  • If the residuals show a curve, the relationship probably isn't linear, and a straight line isn't the right model.
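As a sketch, here's how the residuals look for the same made-up (credit hours, cost) dataset; the coefficients $b_0 = 24.6$ and $b_1 = 282/90$ come from applying the slope and intercept formulas to this data:

```python
# Residuals y_i - y_hat_i for a small hypothetical dataset.
xs = [6, 9, 12, 15, 18]    # credit hours
ys = [45, 52, 61, 70, 83]  # textbook cost in dollars
b0, b1 = 24.6, 282 / 90    # least-squares fit for this data

predicted = [b0 + b1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

# Least-squares residuals always sum to zero (up to rounding error);
# a residual plot would graph `residuals` against `predicted` and
# check for random scatter around zero.
print(residuals)
```

For this toy data the residuals are small and sum to zero, but a real check is visual: plot them and look for fanning or curvature as described above.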