Scatterplots for Data Visualization
Scatterplots are the foundation of graphical analysis in linear modeling. They let you see the relationship between two quantitative variables at a glance, revealing patterns, outliers, and trends that raw numbers alone can't communicate. Before fitting any regression model, you should always start by looking at the data.
Creating Scatterplots
To build a scatterplot, you plot one variable on the x-axis and the other on the y-axis. Each point represents a single observation, positioned according to its values on both variables.
A few conventions to follow:
- The explanatory variable (the one you think does the predicting) goes on the x-axis, and the response variable (the outcome you're interested in) goes on the y-axis
- You can create scatterplots by hand, but statistical software (R, Python, Excel) makes it faster and more precise
- Additional features like color-coding, point shapes, or labels can represent a third variable or group membership. For example, you might color points by gender or region to see whether the relationship differs across subgroups.
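The conventions above can be sketched in Python with matplotlib. The dataset (hours studied vs. exam score for two hypothetical groups) is made up for illustration; color and marker shape encode group membership as the third variable.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display window
import matplotlib.pyplot as plt

# Hypothetical data: hours studied (explanatory) and exam score (response)
hours_a = [1, 2, 3, 4, 5]
scores_a = [55, 62, 70, 74, 82]
hours_b = [1, 2, 3, 4, 5]
scores_b = [50, 58, 63, 69, 75]

# Color and marker shape represent group membership (a third variable)
plt.scatter(hours_a, scores_a, color="tab:blue", marker="o", label="Group A")
plt.scatter(hours_b, scores_b, color="tab:orange", marker="s", label="Group B")
plt.xlabel("Hours studied")  # explanatory variable on the x-axis
plt.ylabel("Exam score")     # response variable on the y-axis
plt.legend()
plt.savefig("scatter.png")
```

Overlaying the two groups on one set of axes makes it easy to see whether the upward trend holds for both.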
Uses of Scatterplots
Scatterplots serve several purposes in the modeling process:
- Visualizing relationships between two variables before running any formal analysis
- Identifying patterns and trends that suggest what type of model might be appropriate (if the points follow a straight-line pattern, linear regression is a reasonable choice)
- Spotting outliers that could distort your results
- Comparing subgroups by overlaying different groups on the same plot
- Communicating results to others in a clear, intuitive way
Interpreting Scatterplot Patterns
Types of Associations
The overall pattern of points tells you about the direction and nature of the relationship between two variables.
- Positive association: Higher values of one variable tend to go with higher values of the other. The points cluster along an upward-sloping direction. Example: hours studied and exam score.
- Negative association: Higher values of one variable tend to go with lower values of the other. The points cluster along a downward-sloping direction. Example: price of a product and quantity demanded.
- No association: The points appear randomly scattered with no clear direction or trend. Knowing the value of one variable doesn't help you predict the other.
Strength and Outliers
The strength of an association depends on how tightly the points cluster around an imaginary line through the data. A tight cluster means a strong association; a wide, diffuse cloud means a weak one.
Outliers are data points that fall far from the overall pattern. They matter because a single outlier can pull the regression line toward it, distorting your results. Always investigate outliers: they might reflect data entry errors, or they might represent genuinely unusual cases worth understanding.
Watch for nonlinear patterns too. If the points follow a curve or form distinct clusters, a simple linear model won't capture the relationship well. You'd need to consider transformations or alternative models.
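A quick numerical sketch shows how much a single outlier can move the fitted line. The data are hypothetical, constructed to follow roughly y = 2x + 1, and `numpy.polyfit` supplies the least squares fit:

```python
import numpy as np

# Hypothetical data following roughly y = 2x + 1
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1, 13.0])

slope, intercept = np.polyfit(x, y, 1)  # least squares line

# Add a single outlier far below the pattern and refit
x_out = np.append(x, 7.0)
y_out = np.append(y, 2.0)  # the trend would predict a value near 15 here
slope_out, intercept_out = np.polyfit(x_out, y_out, 1)

print(f"slope without outlier: {slope:.2f}, with outlier: {slope_out:.2f}")
```

One point out of seven is enough to drag the slope well below 2, which is why outliers deserve investigation before you trust a fitted line.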

Linear Regression Line Representation
Characteristics of a Regression Line
A simple linear regression line is a straight line superimposed on the scatterplot that best summarizes the relationship between the two variables. It passes through the center of the point cloud.
The equation takes the form:

ŷ = b0 + b1x

where:
- b0 is the y-intercept, the predicted value of ŷ when x = 0
- b1 is the slope, the predicted change in ŷ for every one-unit increase in x
A positive slope means the line tilts upward (positive association), and a negative slope means it tilts downward (negative association). Note that the y-intercept doesn't always have a meaningful real-world interpretation, especially if x = 0 falls outside the range of your data.
Determining the Regression Line
The regression line is found using the least squares method, which works through the following logic:
- For any candidate line, calculate the residual for each data point: the vertical distance between the observed y value and the value the line predicts at that x
- Square each residual (so that positive and negative misses don't cancel out)
- Sum all the squared residuals
- Choose the line (the specific b0 and b1) that makes this sum as small as possible
This gives you the "best fit" line in the least squares sense. Once you have the equation, you can plug in any value of x to get a predicted ŷ. Just be cautious about extrapolation: predicting beyond the range of your observed data is unreliable because you have no evidence the linear pattern continues.
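The least squares estimates have closed-form solutions, so the steps above can be carried out directly. A minimal sketch in plain Python, using a small made-up dataset:

```python
# Hypothetical data points (x, y)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares estimates:
#   b1 = sum((xi - x_bar) * (yi - y_bar)) / sum((xi - x_bar)^2)
#   b0 = y_bar - b1 * x_bar
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# Predict within the observed x range (1 to 5); extrapolating far
# beyond it would not be supported by the data
y_pred = b0 + b1 * 4
```

For these data the fit works out to ŷ = 2.2 + 0.6x; minimizing the sum of squared residuals over all candidate lines yields exactly these two formulas.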
Assessing Linear Model Fit
Residual Plots
After fitting a regression line, you need to check whether the linear model is actually appropriate. A residual plot graphs the residuals (observed y minus predicted ŷ) against the explanatory variable or the predicted values.
Here's what to look for:
- Good fit: The residuals are randomly and evenly scattered around the horizontal line at zero, with no obvious pattern. This suggests the linear model captures the relationship well.
- Curved pattern: The residuals form a U-shape or arc, indicating the true relationship is nonlinear. A straight line is missing systematic structure in the data.
- Funnel shape (residuals spread out or narrow as x increases): This signals heteroscedasticity, meaning the variability of y isn't constant across values of x. This violates a key assumption of linear regression.
- Isolated points far from zero may be outliers or influential observations worth investigating.
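The curved-pattern case above is easy to reproduce. The sketch below fits a straight line to deliberately nonlinear (quadratic) hypothetical data and plots the residuals, which come out positive at both ends and negative in the middle, the classic U-shape:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display window
import matplotlib.pyplot as plt

# Deliberately nonlinear hypothetical data: y = x^2
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = x ** 2

# Fit a straight line anyway, then compute the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

plt.axhline(0, color="gray", linewidth=1)
plt.scatter(x, residuals)
plt.xlabel("x")
plt.ylabel("Residual")
plt.savefig("residual_plot.png")
```

A random, even scatter around zero would instead indicate the linear model was adequate.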
Goodness-of-Fit Measures
The coefficient of determination (R²) quantifies how well the linear model explains variability in the response variable. It ranges from 0 to 1 and represents the proportion of variance in y that the model accounts for.
- An R² of 0.85 means the model explains 85% of the variability in the response variable. The remaining 15% is unexplained (captured in the residuals).
- A high R² (close to 1) indicates a strong linear fit. A low R² (close to 0) means the model explains very little of the variation.
Keep in mind that R² alone doesn't tell you whether the model is appropriate. You can have a decent R² with a clearly nonlinear relationship. Always check the residual plot alongside R².
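R² can be computed directly from its definition, R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of y. A sketch with made-up data:

```python
# Hypothetical data points (x, y)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares fit
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# R^2 = 1 - SS_res / SS_tot
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot
```

For these data R² comes out to 0.6: the line accounts for 60% of the variability in y, and the other 40% remains in the residuals.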
Confidence and prediction intervals appear as bands around the regression line on a graph:
- Confidence intervals (usually the narrower band) reflect uncertainty about the average value of y at a given x
- Prediction intervals (the wider band) reflect uncertainty about an individual observation's y value at a given x
Narrower bands mean more precise estimates; wider bands mean greater uncertainty. Both bands tend to widen as you move away from the center of the data, which is another reason extrapolation is risky.
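Both behaviors (prediction bands wider than confidence bands, and both widening away from the center) follow from the standard half-width formulas. Ignoring the t critical value and the residual standard deviation, which scale both bands equally, the geometric factors can be sketched like this (the data are hypothetical):

```python
import math

# Hypothetical x data
x = [1, 2, 3, 4, 5]
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

def confidence_factor(x0):
    """Shape factor for the mean-response (confidence) band at x0."""
    return math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)

def prediction_factor(x0):
    """Shape factor for the individual-observation (prediction) band at x0."""
    return math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
```

The extra 1 under the prediction-band square root accounts for the scatter of individual observations around the line, which is why that band is always wider; the (x0 − x̄)² term is what makes both bands flare out away from the center of the data.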