Scatter Plots and Linear Relationships
Scatter plots let you visualize how two variables relate to each other, and linear regression gives you a way to model that relationship with a straight line. Together, these tools help you make predictions and measure how well your model actually fits the data.

Scatter plots for variable relationships
A scatter plot places data points on a coordinate plane, where each point represents a pair of values. The independent variable goes on the x-axis and the dependent variable on the y-axis.
Once the points are plotted, you're looking for three things:
- Direction of correlation:
  - Positive correlation: As x increases, y tends to increase
  - Negative correlation: As x increases, y tends to decrease
  - No correlation: No apparent pattern between x and y
- Strength of correlation:
  - Strong: Data points cluster tightly around a clear pattern
  - Weak: Data points are more scattered and deviate from the pattern
- Outliers: Individual points that fall far from the general trend. These can pull a regression line toward them, so it's worth noting when they appear.
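Direction and strength can also be quantified. A minimal sketch in Python using hypothetical study-hours data (all values invented for illustration): the sign of the correlation coefficient gives the direction, and its magnitude the strength.

```python
import numpy as np

# Hypothetical data: hours studied (x) vs. exam score (y)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([52, 55, 61, 68, 70, 77, 83, 88])

# The sign of r gives the direction of correlation;
# its magnitude (between 0 and 1) gives the strength.
r = np.corrcoef(hours, scores)[0, 1]
print(f"r = {r:.3f}")  # positive and near 1: strong positive correlation
```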
Linear vs nonlinear relationships
Not every scatter plot calls for a straight line. Before you run a regression, check whether the data actually looks linear.
Linear relationships show data points that roughly follow a straight line. The change in y is proportional to the change in x. A classic example: if you drive at a constant 60 mph, distance increases by 60 miles for every hour, so a plot of time vs. distance forms a line.
Nonlinear relationships show data points that curve or bend. The change in y is not proportional to the change in x. Some common types:
- Exponential: Population growth over time (slow start, then rapid increase)
- Quadratic: Height of a thrown ball over time (rises, peaks, then falls)
- Logarithmic: Perceived loudness vs. actual sound intensity (big jumps early, then levels off)
If the scatter plot shows curvature, fitting a straight line to it will give misleading results.
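One way to spot that curvature numerically, sketched below with hypothetical ball-flight data, is to fit a line anyway and inspect the residuals: a systematic sign pattern (negative, then positive, then negative) reveals a bend no line can capture.

```python
import numpy as np

# Height of a thrown ball over time: quadratic, not linear (hypothetical values).
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
h = 20 * t - 4.9 * t**2  # rises, peaks, then falls

# Force a straight line onto the curved data anyway.
m, b = np.polyfit(t, h, 1)
residuals = h - (m * t + b)

# Residuals run negative at the ends and positive in the middle:
# a systematic pattern that signals curvature the line misses.
print(np.round(residuals, 3))
```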

Linear Regression and Predictions
Line of best fit interpretation
The line of best fit (also called the least squares regression line) is the straight line that minimizes the sum of the squared vertical distances between each data point and the line. Its equation takes the familiar form:
ŷ = mx + b
- m (slope): the change in y for each one-unit increase in x
- b (y-intercept): the predicted value of y when x equals zero
To calculate the slope and intercept from raw data:
m = (n·Σxᵢyᵢ − Σxᵢ·Σyᵢ) / (n·Σxᵢ² − (Σxᵢ)²)
b = ȳ − m·x̄
where x̄ and ȳ are the means of the x- and y-values, xᵢ and yᵢ are individual data points, and n is the number of data points.
In practice, your calculator or software handles these computations. What matters is that you can interpret the result. For instance, if a regression for study hours vs. exam score gives ŷ = 5.2x + 48, the slope tells you that each additional hour of studying is associated with about a 5.2-point increase in score, and a student who studied zero hours would be predicted to score 48.
Residuals are the differences between observed y-values and predicted y-values: residual = yᵢ − ŷᵢ. A positive residual means the actual value was above the line; a negative residual means it was below. If your model is a good fit, residuals should be small and scattered randomly (no obvious pattern).
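The fit and its residuals can be reproduced numerically. A sketch in Python with NumPy's polyfit; the five data points are hypothetical, chosen so the least squares fit lands on the slope of 5.2 and intercept of 48 from the study-hours example.

```python
import numpy as np

# Hypothetical study-hours data, chosen so the least squares fit
# comes out to slope 5.2 and intercept 48 (matching the text's example).
x = np.array([1, 2, 3, 4, 5])
y = np.array([53, 58, 65, 68, 74])

m, b = np.polyfit(x, y, 1)  # degree-1 (straight line) least squares fit
predicted = m * x + b
residuals = y - predicted

print(f"slope = {m:.2f}, intercept = {b:.2f}")  # slope = 5.20, intercept = 48.00
# For a least squares line, the residuals always sum to essentially zero.
print("sum of residuals ~", round(float(residuals.sum()), 6))
```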

Linear models for predictions
Making a prediction with a linear model is straightforward:
- Find (or be given) the equation of the line of best fit, ŷ = mx + b.
- Plug in your x-value and solve for y.
The harder question is: how much should you trust that prediction?
Coefficient of determination (R²) tells you the proportion of the variance in y that your linear model explains. It ranges from 0 to 1:
- R² = 0.95 means 95% of the variation in y is accounted for by the model. That's a strong fit.
- R² = 0.40 means only 40% is explained. The model captures some trend but misses a lot.
The formula:
R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²
where yᵢ is the actual y-value, ŷᵢ is the predicted y-value, and ȳ is the mean of all y-values.
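The formula translates directly into code. A sketch using hypothetical study-hours data (values invented for illustration):

```python
import numpy as np

# Hypothetical study-hours (x) vs. exam-score (y) data.
x = np.array([1, 2, 3, 4, 5])
y = np.array([53, 58, 65, 68, 74])

m, b = np.polyfit(x, y, 1)
y_hat = m * x + b

# R^2 = 1 - (sum of squared residuals) / (total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"R^2 = {r_squared:.4f}")  # close to 1: the line explains most variation
```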
Limitations to keep in mind:
- A linear model forced onto a nonlinear relationship will give poor predictions regardless of how you calculate it.
- Extrapolation (predicting outside the range of your data) is risky. A model built on data from 0–10 hours of study may not hold at 50 hours.
- Interpolation (predicting within the range of your data) is generally more reliable.
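The difference is easy to see with the ŷ = 5.2x + 48 model from the study-hours example, fit on data covering roughly 0–10 hours (a sketch, not a real fitted model):

```python
# The text's example model: score = 5.2 * hours + 48,
# assumed to have been fit on data covering roughly 0-10 hours of study.
def predict(hours: float) -> float:
    return 5.2 * hours + 48

print(predict(6))   # interpolation: inside 0-10, reasonably trustworthy
print(predict(50))  # extrapolation: predicts ~308, an impossible exam score
```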
Measures of Model Fit and Correlation
Beyond R², two other measures come up frequently:
- Pearson correlation coefficient (r): Measures the strength and direction of a linear relationship on a scale from −1 to +1. A value near +1 indicates a strong positive linear relationship, near −1 a strong negative one, and near 0 a weak or nonexistent linear relationship. Note that R² = r², so a correlation of r = 0.9 corresponds to R² = 0.81.
- Standard error of the estimate: Measures the average amount that observed y-values deviate from the regression line's predicted values. A smaller standard error means your predictions tend to be closer to the actual data.
Together, R², r, and the standard error give you a fuller picture of how well your linear model captures the data.
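As a closing sketch, here are all three measures computed on one hypothetical dataset (the n − 2 divisor in the standard error is the conventional adjustment for the two fitted parameters):

```python
import numpy as np

# Hypothetical study-hours (x) vs. exam-score (y) data.
x = np.array([1, 2, 3, 4, 5])
y = np.array([53, 58, 65, 68, 74])

m, b = np.polyfit(x, y, 1)
y_hat = m * x + b

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
r_squared = r ** 2           # coefficient of determination

# Standard error of the estimate: typical deviation of observed y
# from the regression line (n - 2 accounts for the fitted m and b).
n = len(x)
se = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))

print(f"r = {r:.4f}, R^2 = {r_squared:.4f}, SE = {se:.4f}")
```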