Why This Matters
Correlation is one of the most powerful tools you'll use in AP Statistics—it's the gateway to understanding how two quantitative variables relate to each other. When you're analyzing bivariate data, the correlation coefficient r tells you both the direction (positive or negative) and strength (weak, moderate, or strong) of a linear relationship. This concept connects directly to scatterplot interpretation, least-squares regression, residual analysis, and the coefficient of determination r²—all of which are heavily tested on the AP exam.
Here's what you're really being tested on: Can you interpret what r means in context? Can you recognize when correlation is misleading (outliers, nonlinear patterns, lurking variables)? Can you connect r to the regression slope formula b = r(s_y/s_x)? Don't just memorize that r ranges from −1 to 1—know why certain values matter, what affects them, and what correlation can and cannot tell you about causation.
The Pearson correlation coefficient r is the standard measure of linear association on the AP exam. It quantifies how closely data points cluster around a straight line by comparing standardized values (z-scores) of both variables.
Pearson Correlation Coefficient (r)
- Measures the strength and direction of linear relationships between two quantitative variables—this is the correlation you'll calculate and interpret most often
- Formula uses z-scores: r = (1/(n−1)) ∑ [(x_i − x̄)/s_x][(y_i − ȳ)/s_y], though you'll typically use technology to compute it (see the sketch below)
- Unit-free and bounded: ranges from −1 to 1, making it comparable across different contexts regardless of the original units
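The z-score formula is easy to check with technology. Below is a minimal Python sketch (beyond anything the AP exam asks you to code; the small data set is invented) that computes r by averaging products of z-scores and compares the result with NumPy's built-in corrcoef:

```python
import numpy as np

# invented example data: five paired observations
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])   # explanatory variable
y = np.array([3.1, 4.4, 5.2, 7.9, 8.8])   # response variable

n = len(x)
z_x = (x - x.mean()) / x.std(ddof=1)       # z-scores using the sample SD (ddof=1)
z_y = (y - y.mean()) / y.std(ddof=1)

r = (z_x * z_y).sum() / (n - 1)            # r = (1/(n-1)) * sum of z_x * z_y
print(r)                                   # hand-rolled value
print(np.corrcoef(x, y)[0, 1])             # NumPy's value -- they match
```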
Coefficient of Determination (r²)
- Represents the proportion of variation explained—if r = 0.8, then r² = 0.64, meaning 64% of the variation in y is explained by the linear relationship with x
- Always between 0 and 1: since it's a squared value, r² is never negative and provides an intuitive percentage interpretation
- Key for FRQs: when asked to interpret a regression, stating that "about [100·r²]% of the variability in [response] is explained by the linear relationship with [explanatory]" is the standard response
Compare: r vs. r²—both describe linear relationships, but r gives direction and strength while r² gives the proportion of variation explained. On FRQs, you'll often need to interpret both: use r for "strong positive/negative" language and r² for "percent of variability explained."
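Here is that contrast in a tiny Python sketch (the hours/score data are hypothetical): r supplies the direction-and-strength language, and squaring it gives the percent-of-variability statement.

```python
import numpy as np

hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])         # hypothetical study hours
score = np.array([58.0, 66.0, 69.0, 75.0, 74.0, 85.0])    # hypothetical quiz scores

r = np.corrcoef(hours, score)[0, 1]
r_sq = r ** 2

print(f"r   = {r:.3f} -> 'strong positive linear relationship'")
print(f"r^2 = {r_sq:.3f} -> about {100 * r_sq:.0f}% of the variability in score "
      "is explained by the linear relationship with hours")
```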
Interpreting Strength and Direction
Understanding what correlation values actually mean is essential for multiple-choice questions and FRQ interpretations. The magnitude (absolute value) indicates strength, while the sign indicates direction.
Strength Categories
- Strong correlation: ∣r∣ ≥ 0.8 indicates data points cluster tightly around the regression line with little scatter (the 0.8 and 0.5 cutoffs here are common rules of thumb, not official AP boundaries)
- Moderate correlation: ∣r∣ between approximately 0.5 and 0.8 shows a clear trend but with noticeable spread around the line
- Weak correlation: ∣r∣ < 0.5 means the linear pattern is present but data points are widely scattered—other factors likely influence y
Direction of Association
- Positive correlation (r>0): as x increases, y tends to increase—the scatterplot slopes upward from left to right
- Negative correlation (r<0): as x increases, y tends to decrease—the scatterplot slopes downward from left to right
- No linear correlation (r≈0): no consistent linear pattern exists, though a nonlinear relationship might still be present
Perfect Correlation
- r=1 or r=−1: all data points fall exactly on the regression line with zero residuals—rare in real data
- Indicates deterministic relationship: knowing x perfectly predicts y with no error
- Exam context: perfect correlations typically appear in theoretical questions or as benchmarks for comparison
Compare: Strong positive (r = 0.9) vs. strong negative (r = −0.9)—both indicate equally strong linear relationships with the same r² = 0.81. The difference is only in direction. Don't assume negative correlations are "weaker" than positive ones.
Connecting Correlation to Regression
The correlation coefficient isn't just a standalone statistic—it's mathematically linked to the least-squares regression line. Understanding this connection helps you move between correlation and regression problems seamlessly.
Slope Formula Using r
- The slope formula b = r(s_y/s_x) directly connects correlation to the regression line—know this relationship cold (it's verified in the sketch after these lists)
- Sign of slope matches sign of r: a positive correlation always produces a positive slope, and vice versa
- Standardized interpretation: when both variables are converted to z-scores, the slope of the regression line equals r
Regression Line Properties
- Always passes through (x̄, ȳ): the point of averages lies exactly on the least-squares regression line
- Minimizes sum of squared residuals: this is the "least squares" criterion that defines the LSRL
- Y-intercept formula: a = ȳ − b·x̄, derived from the fact that the line passes through the mean point
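These identities are quick to verify numerically. A Python sketch on invented data: the slope from b = r(s_y/s_x) and the intercept from a = ȳ − b·x̄ match NumPy's least-squares fit, the line passes through (x̄, ȳ), and regressing z-scores on z-scores gives slope r.

```python
import numpy as np

x = np.array([1.0, 3.0, 4.0, 6.0, 8.0])   # invented explanatory values
y = np.array([2.0, 4.5, 4.0, 7.0, 9.5])   # invented response values

r = np.corrcoef(x, y)[0, 1]
b = r * (y.std(ddof=1) / x.std(ddof=1))   # slope from r and the two SDs
a = y.mean() - b * x.mean()               # intercept from the point of averages

b_fit, a_fit = np.polyfit(x, y, 1)        # least-squares fit for comparison
print(b, b_fit)                           # slopes agree
print(a, a_fit)                           # intercepts agree
print(a + b * x.mean(), y.mean())         # LSRL passes through (x̄, ȳ)

# regress z-scores on z-scores: the slope is r itself
z_x = (x - x.mean()) / x.std(ddof=1)
z_y = (y - y.mean()) / y.std(ddof=1)
print(np.polyfit(z_x, z_y, 1)[0], r)      # standardized slope equals r
```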
Residual Connection
- Residuals sum to zero: ∑(y_i − ŷ_i) = 0 is a property of least-squares regression
- Residual plots reveal fit quality: a patternless scatter indicates the linear model is appropriate; curves suggest nonlinearity
- r² relates to residual variation: higher r² means smaller residuals relative to total variation in y, since r² = 1 − SSE/SST (checked in the sketch below)
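Both residual facts can be confirmed directly. A small Python sketch (data invented): the least-squares residuals sum to zero up to rounding, and 1 − SSE/SST reproduces r².

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 7.0])
y = np.array([2.2, 2.8, 4.9, 5.1, 7.6])

b, a = np.polyfit(x, y, 1)                 # least-squares slope and intercept
y_hat = a + b * x                          # fitted values
residuals = y - y_hat

sse = np.sum(residuals ** 2)               # residual (unexplained) variation
sst = np.sum((y - y.mean()) ** 2)          # total variation in y

print(residuals.sum())                     # ~0, up to floating-point error
print(1 - sse / sst)                       # proportion of variation explained...
print(np.corrcoef(x, y)[0, 1] ** 2)        # ...equals r^2
```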
Compare: Correlation r vs. slope b—both indicate direction, but r is unit-free (always between −1 and 1) while b carries units (units of y per unit of x). You can have a strong correlation with a small slope if s_y is much smaller than s_x.
What Affects Correlation: Outliers and Influential Points
One of the most tested concepts is how outliers and influential points can distort correlation. A single unusual observation can dramatically change r, making visual inspection of scatterplots essential.
Outliers in Bivariate Data
- Can inflate or deflate r depending on their position—an outlier that breaks from the overall pattern typically weakens the correlation, while one that extends the pattern can artificially strengthen it
- Identified through scatterplots and residual plots: look for points with unusually large residuals or unusual x-values
- Always investigate before removing: outliers may represent data errors, but they might also be legitimate and important observations
High-Leverage Points
- Have substantially larger or smaller x-values than other observations—they "pull" the regression line toward themselves
- May or may not be influential: a high-leverage point that follows the existing pattern has little effect; one that deviates is highly influential
- Critical for interpretation: if removing a point substantially changes r, slope, or intercept, it's influential
Influential Points
- Change the relationship substantially when removed—this includes changes to slope, intercept, and/or correlation
- Often are both outliers and high-leverage: the combination of unusual x and unusual y creates maximum influence
- Exam strategy: if asked about the effect of removing a point, consider whether it's pulling the line toward or away from the overall pattern
Compare: High-leverage point vs. influential point—influence usually comes with leverage, but not all high-leverage points are influential. A point with extreme x that falls exactly on the regression line has high leverage but isn't influential, because removing it wouldn't change the line (see the sketch below).
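A short Python sketch of that distinction (all numbers invented): start with points lying exactly on y = 2x + 1, then append a far-out x-value that either follows the pattern (high leverage, not influential) or breaks it (influential).

```python
import numpy as np

def fit_stats(x, y):
    """Return (r, slope) for a least-squares fit."""
    r = np.corrcoef(x, y)[0, 1]
    slope = np.polyfit(x, y, 1)[0]
    return r, slope

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1                              # base data lie exactly on a line
print(fit_stats(x, y))                     # r = 1.0, slope = 2.0

# high-leverage point that FOLLOWS the pattern: almost no effect
print(fit_stats(np.append(x, 15.0), np.append(y, 31.0)))  # still r = 1, slope = 2

# high-leverage point that BREAKS the pattern: influential
print(fit_stats(np.append(x, 15.0), np.append(y, 5.0)))   # r and slope drop sharply
```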
Limitations and Common Misconceptions
Understanding what correlation cannot tell you is just as important as knowing what it measures. These limitations appear frequently in multiple-choice questions designed to test conceptual understanding.
Correlation Does Not Imply Causation
- Two variables can be strongly correlated without one causing the other—this is perhaps the most important statistical principle to internalize
- Lurking (confounding) variables may drive the relationship between both variables, creating a spurious correlation
- Establishing causation requires experiments: only randomized controlled experiments can demonstrate cause-and-effect relationships
Nonlinear Relationships
- r only measures linear association: a perfect curved relationship can have r ≈ 0 because Pearson correlation doesn't detect nonlinear patterns (see the sketch below)
- Always examine scatterplots: before calculating r, verify that a linear model is appropriate for the data
- Transformations may help: if data show curvature, applying log, square root, or other transformations might linearize the relationship
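A sketch of both points (data invented): a perfectly symmetric U-shape produces r of exactly zero, and exponential growth is straightened out by taking logs.

```python
import numpy as np

# perfect quadratic pattern: strong relationship, zero LINEAR association
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2
print(np.corrcoef(x, y)[0, 1])             # 0.0: r misses the curve entirely

# exponential pattern: log-transforming y linearizes it
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y2 = 3.0 * np.exp(0.8 * x2)
print(np.corrcoef(x2, y2)[0, 1])           # high r, but the form is curved
print(np.corrcoef(x2, np.log(y2))[0, 1])   # 1.0: perfectly linear after the log
```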
Assumptions for Pearson Correlation
- Requires quantitative, paired data: both variables must be numerical and measured on the same individuals/cases
- Assumes linearity: the relationship should be approximately linear for r to be meaningful
- Sensitive to outliers: unlike rank-based alternatives, Pearson correlation can be heavily influenced by extreme values
Compare: Correlation with lurking variable vs. true association—if ice cream sales and drowning deaths are correlated, temperature is the lurking variable causing both. Recognizing this distinction is essential for FRQs asking you to critique causal claims from observational data.
Visualizing Correlation with Scatterplots
Scatterplots are your primary tool for assessing relationships before calculating correlation. Visual inspection reveals patterns, outliers, and potential problems that numbers alone cannot show.
Scatterplot Interpretation
- Explanatory variable on x-axis, response on y-axis: this convention matters for regression but not for correlation (r is symmetric)
- Assess form, direction, strength, and unusual features: describe the overall pattern before summarizing with statistics
- Look for clusters, gaps, and outliers: these features affect interpretation and may suggest subgroups in the data
Residual Plots
- Plot residuals vs. fitted values (or vs. x): this diagnostic tool reveals whether a linear model is appropriate
- Random scatter indicates good fit: if residuals show no pattern, the linear model captures the relationship well
- Curved patterns indicate nonlinearity: a U-shape or other systematic pattern suggests the linear model is inadequate (see the sketch below)
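You can see the U-shaped warning numerically even without drawing the plot. In this Python sketch (data invented), a line forced onto quadratic data leaves residuals that are positive at both ends and negative in the middle—the systematic pattern a residual plot would display:

```python
import numpy as np

x = np.arange(1.0, 9.0)                    # x = 1, 2, ..., 8
y = x ** 2                                 # clearly curved relationship

b, a = np.polyfit(x, y, 1)                 # force a straight-line fit anyway
residuals = y - (a + b * x)

for xi, res in zip(x, residuals):
    print(f"x = {xi:.0f}  residual = {res:6.2f}")
# signs run + + - - - - + +: the U-shape that flags a bad linear model
```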
Using Technology
- Graphing calculators and software compute r automatically: on the TI-84, use LinReg(ax+b) after entering data in lists (turn on DiagnosticOn once so r and r² are displayed)
- Always plot first: technology will calculate r for any paired data, even when correlation is meaningless (nonlinear data, categorical data miscoded as numbers)
- Report r in context: stating "r=0.85" alone is incomplete—interpret what this means for the specific variables
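For comparison with the calculator workflow above, here is the same analysis in Python (assuming SciPy is available; the temperature/sales data are invented): scipy.stats.linregress plays the role of LinReg(ax+b) and reports r directly.

```python
import numpy as np
from scipy.stats import linregress

temps = np.array([60.0, 65.0, 70.0, 75.0, 80.0, 85.0])        # explanatory (invented)
sales = np.array([180.0, 210.0, 240.0, 280.0, 300.0, 340.0])  # response (invented)

result = linregress(temps, sales)          # Python analogue of LinReg(ax+b)
print(f"slope b     = {result.slope:.2f}")
print(f"intercept a = {result.intercept:.2f}")
print(f"r           = {result.rvalue:.3f}")
print(f"r^2         = {result.rvalue ** 2:.3f}")
# then interpret in context: e.g. a strong positive linear relationship
# between temperature and sales for these (invented) data -- and plot first!
```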
Quick Reference Table
| Topic | Key concepts |
| --- | --- |
| Measuring linear strength | Pearson r, coefficient of determination r² |
| Interpreting strength | Strong (∣r∣ ≥ 0.8), moderate (0.5 ≤ ∣r∣ < 0.8), weak (∣r∣ < 0.5) |
| Connecting to regression | Slope formula b = r(s_y/s_x); line through (x̄, ȳ) |
| Proportion of variation | r² interpretation ("X% of variability in y explained by x") |
| Factors affecting r | Outliers, influential points, high-leverage points |
| Limitations of r | Only detects linear relationships; sensitive to outliers |
| Correlation ≠ causation | Lurking variables, confounding, spurious correlation |
| Visual assessment | Scatterplots, residual plots, identifying patterns |
Self-Check Questions
- If r = −0.92 for the relationship between hours of TV watched and GPA, what does this tell you about the strength and direction of the relationship? What does r² tell you that r alone doesn't?
- Two datasets both have r² = 0.64. One has r = 0.8 and the other has r = −0.8. How do these relationships differ, and how are they similar?
- A scatterplot shows a clear U-shaped pattern, but the calculated correlation is r = 0.05. Explain why this happens and what it reveals about the limitations of correlation.
- Compare and contrast how an outlier in the middle of the x-range versus an outlier at an extreme x-value would affect the correlation coefficient and regression line.
- A study finds a strong positive correlation (r = 0.78) between ice cream sales and sunburn rates. A newspaper headline claims "Eating ice cream causes sunburns." Using the concept of lurking variables, explain why this causal claim is flawed and what additional study design would be needed to establish causation.