Why This Matters
Correlation is one of the most useful tools in AP Statistics for understanding how two quantitative variables relate to each other. The correlation coefficient r tells you both the direction (positive or negative) and strength (weak, moderate, or strong) of a linear relationship. This concept connects directly to scatterplot interpretation, least-squares regression, residual analysis, and the coefficient of determination r², all of which are heavily tested on the AP exam.
What you're really being tested on: Can you interpret what r means in context? Can you recognize when correlation is misleading (outliers, nonlinear patterns, lurking variables)? Can you connect r to the regression slope formula b=r(s_y/s_x)? Don't just memorize that r ranges from −1 to 1. Know why certain values matter, what affects them, and what correlation can and cannot tell you about causation.
The Pearson correlation coefficient r is the standard measure of linear association on the AP exam. It quantifies how closely data points cluster around a straight line by comparing the standardized values (z-scores) of both variables.
Pearson Correlation Coefficient (r)
- Measures the strength and direction of linear relationships between two quantitative variables
- Formula uses z-scores: r = (1/(n−1)) Σ[((x_i−x̄)/s_x)((y_i−ȳ)/s_y)], though you'll typically use technology to compute it
- Unit-free and bounded: ranges from −1 to 1, making it comparable across different contexts regardless of the original units
Because r is calculated from z-scores, changing the units of either variable (say, converting inches to centimeters) has no effect on its value. This is a common multiple-choice trap.
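The unit-invariance claim is easy to verify numerically. Below is a minimal sketch with made-up height/weight data (the helper `pearson_r` is hypothetical, not an AP-provided function); it implements the z-score formula for r and shows that converting inches to centimeters leaves r unchanged:

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """r = (1/(n-1)) * sum of z_x * z_y, using sample standard deviations."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

heights_in = [60, 62, 65, 68, 70]        # inches (hypothetical data)
weights_lb = [110, 120, 140, 155, 165]   # pounds (hypothetical data)

r_inches = pearson_r(heights_in, weights_lb)
r_cm = pearson_r([h * 2.54 for h in heights_in], weights_lb)  # change units
print(round(r_inches, 4))  # identical to r_cm: z-scores absorb the unit change
```

The key design point: because every deviation (x_i − x̄) and the standard deviation s_x get multiplied by the same conversion factor, the factor cancels in each z-score.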
Coefficient of Determination (r²)
- Represents the proportion of variation explained. If r=0.8, then r²=0.64, meaning 64% of the variation in y is explained by the linear relationship with x.
- Always between 0 and 1: since it's a squared value, r² is never negative and gives you an intuitive percentage interpretation.
- Key for FRQs: when asked to interpret a regression, a standard response is "[r² as a percent] of the variability in [response variable] is explained by the linear relationship with [explanatory variable]."
Compare r vs. r²: both describe linear relationships, but r gives direction and strength while r² gives the proportion of variation explained. On FRQs, you'll often need to interpret both: use r for "strong positive/negative" language and r² for "percent of variability explained."
Interpreting Strength and Direction
Understanding what correlation values actually mean is essential for multiple-choice questions and FRQ interpretations. The magnitude (absolute value) indicates strength, while the sign indicates direction.
Strength Categories
- Strong correlation: |r|≥0.8 indicates data points cluster tightly around the regression line with little scatter
- Moderate correlation: |r| between approximately 0.5 and 0.8 shows a clear trend but with noticeable spread
- Weak correlation: |r|<0.5 means the linear pattern is hard to see and data points are widely scattered
These cutoffs are rough guidelines, not rigid rules. Context matters. In some fields (like psychology), r=0.5 is considered quite strong, while in physics you might expect r values very close to 1.
Direction of Association
- Positive correlation (r>0): as x increases, y tends to increase. The scatterplot slopes upward left to right.
- Negative correlation (r<0): as x increases, y tends to decrease. The scatterplot slopes downward left to right.
- No linear correlation (r≈0): no consistent linear pattern exists, though a nonlinear relationship might still be present.
Perfect Correlation
- r=1 or r=−1: all data points fall exactly on the regression line with zero residuals. This is rare in real data.
- Indicates a deterministic relationship: knowing x perfectly predicts y with no error.
- Exam context: perfect correlations typically appear in theoretical questions or as benchmarks for comparison.
Compare strong positive (r=0.9) vs. strong negative (r=−0.9): both indicate equally strong linear relationships with the same r²=0.81. The difference is only in direction. Don't assume negative correlations are "weaker" than positive ones.
Connecting Correlation to Regression
The correlation coefficient isn't just a standalone statistic. It's mathematically linked to the least-squares regression line, and understanding this connection helps you move between correlation and regression problems.
Slope Formula Using r
- The slope formula b=r(s_y/s_x) directly connects correlation to the regression line. Know this relationship cold.
- The sign of the slope always matches the sign of r: a positive correlation produces a positive slope, and vice versa. (This makes sense because s_y and s_x are always positive.)
- Standardized interpretation: when both variables are converted to z-scores, the slope of the regression line equals r.
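These three facts can be checked with a few lines of code. A sketch with made-up, roughly linear data (all names are ad hoc): it compares the least-squares slope computed from the raw sums against b=r(s_y/s_x), then regresses z-scores on z-scores to show that slope equals r.

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]   # hypothetical, roughly linear data

n = len(x)
xbar, ybar = mean(x), mean(y)
sx, sy = stdev(x), stdev(y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = sxy / ((n - 1) * sx * sy)
b_least_squares = sxy / sum((xi - xbar) ** 2 for xi in x)  # direct LSRL slope
b_from_r = r * sy / sx                                     # AP slope formula

# On standardized data, the LSRL slope collapses to r itself:
zx = [(xi - xbar) / sx for xi in x]
zy = [(yi - ybar) / sy for yi in y]
b_z = sum(u * v for u, v in zip(zx, zy)) / sum(u * u for u in zx)

print(round(b_least_squares, 4), round(b_from_r, 4), round(b_z, 4), round(r, 4))
```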
Regression Line Properties
- Always passes through (x̄, ȳ): the point of averages lies exactly on the least-squares regression line.
- Minimizes the sum of squared residuals: this is the "least squares" criterion that defines the LSRL.
- Y-intercept formula: a=ȳ−bx̄, derived from the fact that the line passes through the mean point.
Residual Connection
- Residuals sum to zero: Σ(y_i−ŷ_i)=0 for any least-squares regression line.
- Residual plots reveal fit quality: a patternless scatter of residuals indicates the linear model is appropriate; curves suggest nonlinearity.
- r² relates to residual variation: higher r² means smaller residuals relative to total variation in y.
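Both residual facts can be verified directly. A minimal sketch with made-up data: fit the LSRL, confirm the residuals sum to zero, and confirm the standard identity r² = 1 − SSE/SST (the explained share of the variation in y).

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]            # hypothetical, roughly linear data

xbar, ybar = mean(x), mean(y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = sxy / sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar                      # line passes through (x̄, ȳ)

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)               # unexplained variation
sst = sum((yi - ybar) ** 2 for yi in y)            # total variation in y
r = sxy / ((len(x) - 1) * stdev(x) * stdev(y))

print(round(sum(residuals), 10), round(r ** 2, 4), round(1 - sse / sst, 4))
```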
Compare correlation r vs. slope b: both indicate direction, but r is unit-free (always between −1 and 1) while b has units of (units of y)/(units of x). You can have a strong correlation with a small slope if s_y is much smaller than s_x.
What Affects Correlation: Outliers and Influential Points
One of the most tested concepts is how outliers and influential points can distort correlation. A single unusual observation can dramatically change r, which is why visual inspection of scatterplots is so important.
Outliers in Bivariate Data
- Can inflate or deflate r depending on their position. An outlier far from the overall pattern typically weakens correlation, while an outlier that happens to fall along the trend can strengthen it.
- Identified through scatterplots and residual plots: look for points with unusually large residuals or unusual x-values.
- Always investigate before removing: outliers may represent data errors, but they might also be legitimate and important observations.
High-Leverage Points
- Have substantially larger or smaller x-values than other observations. They "pull" the regression line toward themselves.
- May or may not be influential: a high-leverage point that follows the existing pattern has little effect on the line; one that deviates from the pattern is highly influential.
- Critical for interpretation: if removing a point substantially changes r, slope, or intercept, that point is influential.
Influential Points
- Change the regression results substantially when removed, including changes to slope, intercept, and/or correlation.
- Often are both outliers and high-leverage: the combination of unusual x and unusual y creates maximum influence.
- Exam strategy: if asked about the effect of removing a point, consider whether it's pulling the line toward or away from the overall pattern of the remaining data.
Compare high-leverage point vs. influential point: influential points typically have high leverage, but not all high-leverage points are influential. A point with an extreme x-value that falls exactly on the regression line has high leverage but isn't influential because removing it wouldn't change the line much.
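The leverage-vs-influence distinction is easy to see numerically. A sketch with made-up data (roughly y = x; the helper `lsrl_slope` is ad hoc): adding a far-out x point that follows the trend barely moves the slope, while one that breaks the trend changes it drastically.

```python
def lsrl_slope(x, y):
    """Least-squares slope from the raw deviation sums."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
            / sum((xi - xbar) ** 2 for xi in x))

x = [1, 2, 3, 4, 5]
y = [1.1, 2.0, 2.9, 4.1, 5.0]                 # hypothetical, roughly y = x

b_base = lsrl_slope(x, y)
b_follows = lsrl_slope(x + [10], y + [10.0])  # high leverage, on the trend
b_breaks = lsrl_slope(x + [10], y + [1.0])    # high leverage, off the trend

print(round(b_base, 3), round(b_follows, 3), round(b_breaks, 3))
# The on-trend point leaves the slope near 1; the off-trend point drags it
# toward 0, which is what "influential" means in practice.
```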
Limitations and Common Misconceptions
Understanding what correlation cannot tell you is just as important as knowing what it measures. These limitations appear frequently in multiple-choice questions designed to test conceptual understanding.
Correlation Does Not Imply Causation
This is perhaps the most important statistical principle to internalize. Two variables can be strongly correlated without one causing the other.
- Lurking (confounding) variables may drive the relationship between both variables, creating a spurious correlation. For example, ice cream sales and drowning deaths are correlated because hot weather (the lurking variable) increases both.
- Establishing causation requires experiments: only randomized controlled experiments can demonstrate cause-and-effect relationships. Observational studies, no matter how strong the correlation, cannot establish causation on their own.
Nonlinear Relationships
- r only measures linear association: a perfect curved relationship (like a parabola) can have r≈0 because Pearson correlation doesn't detect nonlinear patterns.
- Always examine scatterplots: before interpreting r, verify that a linear model is appropriate for the data.
- Transformations may help: if data show curvature, applying log, square root, or power transformations might linearize the relationship so that r becomes meaningful.
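The parabola case can be demonstrated exactly. A sketch using x-values chosen symmetric about 0 so the linear part of the pattern cancels: y is a deterministic function of x, yet r comes out essentially zero.

```python
from statistics import mean, stdev

x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]                # perfect U-shape: y = x^2

n = len(x)
xbar, ybar = mean(x), mean(y)
r = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / ((n - 1) * stdev(x) * stdev(y)))

print(round(r, 10))  # essentially 0, despite a perfect nonlinear relationship
```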
Other Assumptions and Limitations
- Requires quantitative, paired data: both variables must be numerical and measured on the same individuals or cases.
- Sensitive to outliers: a single extreme point can shift r dramatically, as discussed above.
- Switching x and y doesn't change r: correlation is symmetric, but regression is not. The regression of y on x gives a different line than the regression of x on y.
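The symmetry point can be sketched numerically (made-up data; `pearson_r` is an ad hoc helper): swapping the variables leaves r unchanged, but the y-on-x and x-on-y regression slopes differ, and their product equals r².

```python
from statistics import mean, stdev

def pearson_r(x, y):
    n, mx, my = len(x), mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / ((n - 1) * stdev(x) * stdev(y)))

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]            # hypothetical paired data

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)                   # swapped: correlation is symmetric

b_y_on_x = r_xy * stdev(y) / stdev(x)    # slope of the y-on-x LSRL
b_x_on_y = r_xy * stdev(x) / stdev(y)    # slope of the x-on-y LSRL

print(round(r_xy, 4), round(b_y_on_x, 4), round(b_x_on_y, 4))
# Different slopes (regression is not symmetric); their product is r^2.
```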
Compare correlation with a lurking variable vs. true association: if ice cream sales and sunburn rates are correlated, temperature is the lurking variable driving both. Recognizing this distinction is essential for FRQs asking you to critique causal claims from observational data.
Visualizing Correlation with Scatterplots
Scatterplots are your primary tool for assessing relationships before calculating correlation. Visual inspection reveals patterns, outliers, and potential problems that numbers alone cannot show.
Scatterplot Interpretation
When describing a scatterplot, cover these four features in order:
- Direction: positive, negative, or neither
- Form: linear, curved, or no clear pattern
- Strength: how tightly the points follow the form
- Unusual features: outliers, clusters, or gaps
The explanatory variable goes on the x-axis and the response on the y-axis. This convention matters for regression, but not for correlation itself (since r is symmetric).
Residual Plots
- Plot residuals vs. fitted values (or vs. x): this diagnostic tool reveals whether a linear model is appropriate.
- Random scatter indicates good fit: if residuals show no pattern, the linear model captures the relationship well.
- Curved patterns indicate nonlinearity: a U-shape or other systematic pattern in the residual plot means the linear model is inadequate, even if r looks decent.
Using Technology
- Graphing calculators compute r automatically: on the TI-84, use LinReg(ax+b) after entering data in lists. Make sure "DiagnosticOn" is enabled, or r and r² won't display.
- Always plot first: technology will calculate r for any paired data, even when correlation is meaningless (nonlinear data, categorical data miscoded as numbers).
- Report r in context: stating "r=0.85" alone is incomplete. Interpret what this means for the specific variables in the problem.
Quick Reference Table
| Concept | Key Points |
| --- | --- |
| Measuring linear strength | Pearson r, coefficient of determination r² |
| Interpreting strength | Strong (\|r\| ≥ 0.8), Moderate (0.5–0.8), Weak (< 0.5) |
| Connecting to regression | Slope formula b=r(s_y/s_x), line through (x̄, ȳ) |
| Proportion of variation | r² interpretation: "X% of variability in y explained by x" |
| Factors affecting r | Outliers, influential points, high-leverage points |
| Limitations of r | Only detects linear relationships, sensitive to outliers |
| Correlation ≠ causation | Lurking variables, confounding, spurious correlation |
| Visual assessment | Scatterplots (direction, form, strength, unusual features), residual plots |
Self-Check Questions
- If r=−0.92 for the relationship between hours of TV watched and GPA, what does this tell you about the strength and direction of the relationship? What does r² tell you that r alone doesn't?
- Two datasets both have r²=0.64. One has r=0.8 and the other has r=−0.8. How do these relationships differ, and how are they similar?
- A scatterplot shows a clear U-shaped pattern, but the calculated correlation is r=0.05. Explain why this happens and what it reveals about the limitations of correlation.
- Compare how an outlier in the middle of the x-range versus an outlier at an extreme x-value would affect the correlation coefficient and regression line.
- A study finds a strong positive correlation (r=0.78) between ice cream sales and sunburn rates. A newspaper headline claims "Eating ice cream causes sunburns." Using the concept of lurking variables, explain why this causal claim is flawed and what study design would be needed to establish causation.