Scatterplots for Data Visualization
Scatterplots are the foundation of graphical analysis in linear modeling. They let you see the relationship between two quantitative variables at a glance, revealing patterns, outliers, and trends that raw numbers alone can't communicate. Before fitting any regression model, you should always start by looking at the data.
Creating Scatterplots
To build a scatterplot, you plot one variable on the x-axis and the other on the y-axis. Each point represents a single observation, positioned according to its values on both variables.
A few conventions to follow:
- The explanatory variable (the one you think does the predicting) goes on the x-axis, and the response variable (the outcome you're interested in) goes on the y-axis
- You can create scatterplots by hand, but statistical software (R, Python, Excel) makes it faster and more precise
- Additional features like color-coding, point shapes, or labels can represent a third variable or group membership. For example, you might color points by gender or region to see whether the relationship differs across subgroups.
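The conventions above can be sketched in Python with matplotlib. The dataset (hours studied vs. exam score for two hypothetical groups) is made up for illustration; color and marker shape encode group membership as the third variable.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display window
import matplotlib.pyplot as plt

# Hypothetical data: hours studied (explanatory) and exam score (response)
hours_a = [1, 2, 3, 4, 5]
scores_a = [55, 62, 70, 74, 82]
hours_b = [1, 2, 3, 4, 5]
scores_b = [50, 58, 63, 69, 75]

# Color and marker shape represent group membership (a third variable)
plt.scatter(hours_a, scores_a, color="tab:blue", marker="o", label="Group A")
plt.scatter(hours_b, scores_b, color="tab:orange", marker="s", label="Group B")
plt.xlabel("Hours studied")  # explanatory variable on the x-axis
plt.ylabel("Exam score")     # response variable on the y-axis
plt.legend()
plt.savefig("scatter.png")
```

Overlaying the two groups on one set of axes makes it easy to see whether the upward trend holds for both.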
Uses of Scatterplots
Scatterplots serve several purposes in the modeling process:
- Visualizing relationships between two variables before running any formal analysis
- Identifying patterns and trends that suggest what type of model might be appropriate (if the points follow a straight-line pattern, linear regression is a reasonable choice)
- Spotting outliers that could distort your results
- Comparing subgroups by overlaying different groups on the same plot
- Communicating results to others in a clear, intuitive way
Interpreting Scatterplot Patterns
Types of Associations
The overall pattern of points tells you about the direction and nature of the relationship between two variables.
- Positive association: Higher values of one variable tend to go with higher values of the other. The points cluster along an upward-sloping direction. Example: hours studied and exam score.
- Negative association: Higher values of one variable tend to go with lower values of the other. The points cluster along a downward-sloping direction. Example: price of a product and quantity demanded.
- No association: The points appear randomly scattered with no clear direction or trend. Knowing the value of one variable doesn't help you predict the other.
Strength and Outliers
The strength of an association depends on how tightly the points cluster around an imaginary line through the data. A tight cluster means a strong association; a wide, diffuse cloud means a weak one.
Outliers are data points that fall far from the overall pattern. They matter because a single outlier can pull the regression line toward it, distorting your results. Always investigate outliers: they might reflect data entry errors, or they might represent genuinely unusual cases worth understanding.
Watch for nonlinear patterns too. If the points follow a curve or form distinct clusters, a simple linear model won't capture the relationship well. You'd need to consider transformations or alternative models.
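A quick numerical sketch shows how much a single outlier can move the fitted line. The data are hypothetical, constructed to follow roughly y = 2x + 1, and `numpy.polyfit` supplies the least squares fit:

```python
import numpy as np

# Hypothetical data following roughly y = 2x + 1
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1, 13.0])

slope, intercept = np.polyfit(x, y, 1)  # least squares line

# Add a single outlier far below the pattern and refit
x_out = np.append(x, 7.0)
y_out = np.append(y, 2.0)  # the trend would predict a value near 15 here
slope_out, intercept_out = np.polyfit(x_out, y_out, 1)

print(f"slope without outlier: {slope:.2f}, with outlier: {slope_out:.2f}")
```

One point out of seven is enough to drag the slope well below 2, which is why outliers deserve investigation before you trust a fitted line.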

Linear Regression Line Representation
Characteristics of a Regression Line
A simple linear regression line is a straight line superimposed on the scatterplot that best summarizes the relationship between the two variables. It passes through the center of the point cloud.
The equation takes the form:

ŷ = b0 + b1x

where:
- b0 is the y-intercept, the predicted value of ŷ when x = 0
- b1 is the slope, the predicted change in ŷ for every one-unit increase in x
A positive slope means the line tilts upward (positive association), and a negative slope means it tilts downward (negative association). Note that the y-intercept doesn't always have a meaningful real-world interpretation, especially if x = 0 falls outside the range of your data.
Determining the Regression Line
The regression line is found using the least squares method, which works through the following logic:
- For any candidate line, calculate the residual for each data point: the vertical distance between the observed y value and the value the line predicts at that x
- Square each residual (so that positive and negative misses don't cancel out)
- Sum all the squared residuals
- Choose the line (the specific b0 and b1) that makes this sum as small as possible
This gives you the "best fit" line in the least squares sense. Once you have the equation, you can plug in any value of x to get a predicted ŷ. Just be cautious about extrapolation: predicting beyond the range of your observed data is unreliable because you have no evidence the linear pattern continues.
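The least squares estimates have closed-form solutions, so the steps above can be carried out directly. A minimal sketch in plain Python, using a small made-up dataset:

```python
# Hypothetical data points (x, y)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares estimates:
#   b1 = sum((xi - x_bar) * (yi - y_bar)) / sum((xi - x_bar)^2)
#   b0 = y_bar - b1 * x_bar
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# Predict within the observed x range (1 to 5); extrapolating far
# beyond it would not be supported by the data
y_pred = b0 + b1 * 4
```

For these data the fit works out to ŷ = 2.2 + 0.6x; minimizing the sum of squared residuals over all candidate lines yields exactly these two formulas.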
Assessing Linear Model Fit
Residual Plots
After fitting a regression line, you need to check whether the linear model is actually appropriate. A residual plot graphs the residuals (observed y minus predicted ŷ) against the explanatory variable or the predicted values.
Here's what to look for:
- Good fit: The residuals are randomly and evenly scattered around the horizontal line at zero, with no obvious pattern. This suggests the linear model captures the relationship well.
- Curved pattern: The residuals form a U-shape or arc, indicating the true relationship is nonlinear. A straight line is missing systematic structure in the data.
- Funnel shape (residuals spread out or narrow as x increases): This signals heteroscedasticity, meaning the variability of y isn't constant across values of x. This violates a key assumption of linear regression.
- Isolated points far from zero may be outliers or influential observations worth investigating.
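The curved-pattern case above is easy to reproduce. The sketch below fits a straight line to deliberately nonlinear (quadratic) hypothetical data and plots the residuals, which come out positive at both ends and negative in the middle, the classic U-shape:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display window
import matplotlib.pyplot as plt

# Deliberately nonlinear hypothetical data: y = x^2
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = x ** 2

# Fit a straight line anyway, then compute the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

plt.axhline(0, color="gray", linewidth=1)
plt.scatter(x, residuals)
plt.xlabel("x")
plt.ylabel("Residual")
plt.savefig("residual_plot.png")
```

A random, even scatter around zero would instead indicate the linear model was adequate.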
Goodness-of-Fit Measures
The coefficient of determination (R²) quantifies how well the linear model explains variability in the response variable. It ranges from 0 to 1 and represents the proportion of variance in y that the model accounts for.
- An R² of 0.85 means the model explains 85% of the variability in the response variable. The remaining 15% is unexplained (captured in the residuals).
- A high R² (close to 1) indicates a strong linear fit. A low R² (close to 0) means the model explains very little of the variation.
Keep in mind that R² alone doesn't tell you whether the model is appropriate. You can have a decent R² with a clearly nonlinear relationship. Always check the residual plot alongside R².
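R² can be computed directly from its definition, R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of y. A sketch with made-up data:

```python
# Hypothetical data points (x, y)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares fit
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# R^2 = 1 - SS_res / SS_tot
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot
```

For these data R² comes out to 0.6: the line accounts for 60% of the variability in y, and the other 40% remains in the residuals.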
Confidence and prediction intervals appear as bands around the regression line on a graph:
- Confidence intervals (usually the narrower band) reflect uncertainty about the average value of y at a given x
- Prediction intervals (the wider band) reflect uncertainty about an individual observation's y value at a given x
Narrower bands mean more precise estimates; wider bands mean greater uncertainty. Both bands tend to widen as you move away from the center of the data, which is another reason extrapolation is risky.
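Both behaviors (prediction bands wider than confidence bands, and both widening away from the center) follow from the standard half-width formulas. Ignoring the t critical value and the residual standard deviation, which scale both bands equally, the geometric factors can be sketched like this (the data are hypothetical):

```python
import math

# Hypothetical x data
x = [1, 2, 3, 4, 5]
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

def confidence_factor(x0):
    """Shape factor for the mean-response (confidence) band at x0."""
    return math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)

def prediction_factor(x0):
    """Shape factor for the individual-observation (prediction) band at x0."""
    return math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
```

The extra 1 under the prediction-band square root accounts for the scatter of individual observations around the line, which is why that band is always wider; the (x0 − x̄)² term is what makes both bands flare out away from the center of the data.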