📊 AP Statistics

Correlation Coefficients


Why This Matters

Correlation is one of the most useful tools in AP Statistics for understanding how two quantitative variables relate to each other. The correlation coefficient r tells you both the direction (positive or negative) and strength (weak, moderate, or strong) of a linear relationship. This concept connects directly to scatterplot interpretation, least-squares regression, residual analysis, and the coefficient of determination r², all of which are heavily tested on the AP exam.

What you're really being tested on: Can you interpret what r means in context? Can you recognize when correlation is misleading (outliers, nonlinear patterns, lurking variables)? Can you connect r to the regression slope formula b = r · (s_y/s_x)? Don't just memorize that r ranges from −1 to 1. Know why certain values matter, what affects them, and what correlation can and cannot tell you about causation.


The Pearson Correlation Coefficient: Your Primary Tool

The Pearson correlation coefficient r is the standard measure of linear association on the AP exam. It quantifies how closely data points cluster around a straight line by comparing the standardized values (z-scores) of both variables.

Pearson Correlation Coefficient (r)

  • Measures the strength and direction of linear relationships between two quantitative variables
  • Formula uses z-scores: r = [1/(n−1)] Σ [(x_i − x̄)/s_x][(y_i − ȳ)/s_y], though you'll typically use technology to compute it
  • Unit-free and bounded: ranges from −1 to 1, making it comparable across different contexts regardless of the original units

Because r is calculated from z-scores, changing the units of either variable (say, converting inches to centimeters) has no effect on its value. This is a common multiple-choice trap.
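This unit invariance is easy to verify numerically. The sketch below is a minimal check using made-up height/weight data (hypothetical values, not from the text): it computes r from z-scores with the z-score formula, then recomputes it after converting inches to centimeters.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson r via the z-score formula: r = (1/(n-1)) * sum of z_x * z_y."""
    n, mx, my, sx, sy = len(x), mean(x), mean(y), stdev(x), stdev(y)
    return sum((xi - mx) / sx * (yi - my) / sy for xi, yi in zip(x, y)) / (n - 1)

heights_in = [60, 62, 65, 68, 70, 72]        # heights in inches (made-up data)
weights_lb = [115, 120, 140, 155, 160, 175]  # weights in pounds (made-up data)

r_inches = pearson_r(heights_in, weights_lb)
r_cm = pearson_r([h * 2.54 for h in heights_in], weights_lb)  # inches -> cm

# The two values are identical: r is built from z-scores, so units cancel.
print(round(r_inches, 4) == round(r_cm, 4))  # True
```

Rescaling x multiplies both (x_i − x̄) and s_x by the same constant, so each z-score, and therefore r, is unchanged.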

Coefficient of Determination (r²)

  • Represents the proportion of variation explained. If r = 0.8, then r² = 0.64, meaning 64% of the variation in y is explained by the linear relationship with x.
  • Always between 0 and 1: since it's a squared value, r² is never negative and gives you an intuitive percentage interpretation.
  • Key for FRQs: when asked to interpret a regression, a standard response is "[r² expressed as a percent] of the variability in [response variable] is explained by the linear relationship with [explanatory variable]."

Compare: r vs. r². Both describe linear relationships, but r gives direction and strength while r² gives the proportion of variation explained. On FRQs, you'll often need to interpret both: use r for "strong positive/negative" language and r² for "percent of variability explained."


Interpreting Strength and Direction

Understanding what correlation values actually mean is essential for multiple-choice questions and FRQ interpretations. The magnitude (absolute value) indicates strength, while the sign indicates direction.

Strength Categories

  • Strong correlation: |r| ≥ 0.8 indicates data points cluster tightly around the regression line with little scatter
  • Moderate correlation: |r| between approximately 0.5 and 0.8 shows a clear trend but with noticeable spread
  • Weak correlation: |r| < 0.5 means the linear pattern is hard to see and data points are widely scattered

These cutoffs are rough guidelines, not rigid rules. Context matters. In some fields (like psychology), r = 0.5 is considered quite strong, while in physics you might expect r values very close to 1.

Direction of Association

  • Positive correlation (r > 0): as x increases, y tends to increase. The scatterplot slopes upward left to right.
  • Negative correlation (r < 0): as x increases, y tends to decrease. The scatterplot slopes downward left to right.
  • No linear correlation (r ≈ 0): no consistent linear pattern exists, though a nonlinear relationship might still be present.

Perfect Correlation

  • r = 1 or r = −1: all data points fall exactly on the regression line with zero residuals. This is rare in real data.
  • Indicates a deterministic relationship: knowing x perfectly predicts y with no error.
  • Exam context: perfect correlations typically appear in theoretical questions or as benchmarks for comparison.

Compare: Strong positive (r = 0.9) vs. strong negative (r = −0.9). Both indicate equally strong linear relationships with the same r² = 0.81. The difference is only in direction. Don't assume negative correlations are "weaker" than positive ones.


Connecting Correlation to Regression

The correlation coefficient isn't just a standalone statistic. It's mathematically linked to the least-squares regression line, and understanding this connection helps you move between correlation and regression problems.

Slope Formula Using r

  • The slope formula b = r · (s_y/s_x) directly connects correlation to the regression line. Know this relationship cold.
  • The sign of the slope always matches the sign of r: a positive correlation produces a positive slope, and vice versa. (This makes sense because s_y and s_x are always positive.)
  • Standardized interpretation: when both variables are converted to z-scores, the slope of the regression line equals r.
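The shortcut b = r · (s_y/s_x) is algebraically identical to the least-squares definition of the slope. A quick check with made-up data (hypothetical values) compares the two computations:

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]              # hypothetical explanatory values
y = [2.0, 2.9, 4.2, 4.8, 6.1]    # hypothetical responses

mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)

# Pearson r from z-scores
r = sum((xi - mx) / sx * (yi - my) / sy for xi, yi in zip(x, y)) / (len(x) - 1)

# Least-squares slope from its definition: b = sum((x-x̄)(y-ȳ)) / sum((x-x̄)^2)
b_direct = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))

b_from_r = r * sy / sx  # the correlation shortcut

print(abs(b_direct - b_from_r) < 1e-9)  # True: the two formulas agree
```

Substituting the z-score formula for r into r · (s_y/s_x) cancels the standard deviations and leaves exactly the least-squares ratio, which is why the agreement is exact, not approximate.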

Regression Line Properties

  • Always passes through (x̄, ȳ): the point of averages lies exactly on the least-squares regression line.
  • Minimizes the sum of squared residuals: this is the "least squares" criterion that defines the LSRL.
  • Y-intercept formula: a = ȳ − b·x̄, derived from the fact that the line passes through the mean point.

Residual Connection

  • Residuals sum to zero: Σ(y_i − ŷ_i) = 0 for any least-squares regression line.
  • Residual plots reveal fit quality: a patternless scatter of residuals indicates the linear model is appropriate; curves suggest nonlinearity.
  • r² relates to residual variation: higher r² means smaller residuals relative to total variation in y.
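Both residual facts can be verified directly. The sketch below fits an LSRL to made-up, roughly linear data (hypothetical values), checks that the residuals sum to zero, and computes r² via the standard identity r² = 1 − SSE/SST:

```python
from statistics import mean

x = [1, 2, 3, 4, 5, 6]
y = [3.1, 4.0, 5.8, 6.9, 8.2, 9.1]   # made-up, roughly linear data

mx, my = mean(x), mean(y)
b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
     / sum((xi - mx) ** 2 for xi in x))
a = my - b * mx                       # LSRL passes through (x̄, ȳ)

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)       # variation left unexplained
sst = sum((yi - my) ** 2 for yi in y)      # total variation in y

print(abs(sum(residuals)) < 1e-9)   # True: residuals sum to zero
r_squared = 1 - sse / sst           # smaller residuals -> r² closer to 1
```

Here r_squared comes out close to 1 because the data were chosen to be nearly linear; larger residuals would push SSE up and r² down.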

Compare: Correlation r vs. slope b. Both indicate direction, but r is unit-free (always between −1 and 1) while b has units of (units of y)/(units of x). You can have a strong correlation with a small slope if s_y is much smaller than s_x.


What Affects Correlation: Outliers and Influential Points

One of the most tested concepts is how outliers and influential points can distort correlation. A single unusual observation can dramatically change r, which is why visual inspection of scatterplots is so important.

Outliers in Bivariate Data

  • Can inflate or deflate r depending on their position. An outlier far from the overall pattern typically weakens correlation, while an outlier that happens to fall along the trend can strengthen it.
  • Identified through scatterplots and residual plots: look for points with unusually large residuals or unusual x-values.
  • Always investigate before removing: outliers may represent data errors, but they might also be legitimate and important observations.

High-Leverage Points

  • Have substantially larger or smaller x-values than other observations. They "pull" the regression line toward themselves.
  • May or may not be influential: a high-leverage point that follows the existing pattern has little effect on the line; one that deviates from the pattern is highly influential.
  • Critical for interpretation: if removing a point substantially changes r, slope, or intercept, that point is influential.

Influential Points

  • Change the regression results substantially when removed, including changes to slope, intercept, and/or correlation.
  • Often are both outliers and high-leverage: the combination of unusual x and unusual y creates maximum influence.
  • Exam strategy: if asked about the effect of removing a point, consider whether it's pulling the line toward or away from the overall pattern of the remaining data.

Compare: High-leverage point vs. influential point. Influential points typically have high leverage, but not all high-leverage points are influential. A point with an extreme x-value that falls exactly on the regression line has high leverage but isn't influential because removing it wouldn't change the line much.
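A quick numerical experiment makes the leverage/influence distinction concrete. The data here are made up (roughly y = 2x): adding a far-out point that follows the trend barely moves r, while adding one that breaks the trend collapses it.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson r via z-scores."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((xi - mx) / sx * (yi - my) / sy for xi, yi in zip(x, y)) / (len(x) - 1)

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]        # made-up data, roughly y = 2x

r_base = pearson_r(x, y)

# High-leverage point that FOLLOWS the pattern: barely changes r
r_on_trend = pearson_r(x + [15], y + [30.0])

# High-leverage point that BREAKS the pattern: highly influential
r_off_trend = pearson_r(x + [15], y + [5.0])

print(round(r_base, 3), round(r_on_trend, 3), round(r_off_trend, 3))
```

The off-trend point drags both the line and r toward itself precisely because its extreme x-value gives it leverage; the on-trend point has the same leverage but nothing to pull against.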


Limitations and Common Misconceptions

Understanding what correlation cannot tell you is just as important as knowing what it measures. These limitations appear frequently in multiple-choice questions designed to test conceptual understanding.

Correlation Does Not Imply Causation

This is perhaps the most important statistical principle to internalize. Two variables can be strongly correlated without one causing the other.

  • Lurking (confounding) variables may drive the relationship between both variables, creating a spurious correlation. For example, ice cream sales and drowning deaths are correlated because hot weather (the lurking variable) increases both.
  • Establishing causation requires experiments: only randomized controlled experiments can demonstrate cause-and-effect relationships. Observational studies, no matter how strong the correlation, cannot establish causation on their own.

Nonlinear Relationships

  • r only measures linear association: a perfect curved relationship (like a parabola) can have r ≈ 0 because Pearson correlation doesn't detect nonlinear patterns.
  • Always examine scatterplots: before interpreting r, verify that a linear model is appropriate for the data.
  • Transformations may help: if data show curvature, applying log, square root, or power transformations might linearize the relationship so that r becomes meaningful.
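Both points are easy to demonstrate with made-up data: a perfect parabola has r ≈ 0 even though the relationship is deterministic, while exponential data become exactly linear after a log transform.

```python
import math
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson r via z-scores."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((xi - mx) / sx * (yi - my) / sy for xi, yi in zip(x, y)) / (len(x) - 1)

# A perfect parabola: y is completely determined by x, yet r = 0
para_x = [-3, -2, -1, 0, 1, 2, 3]
para_y = [xi ** 2 for xi in para_x]
print(abs(pearson_r(para_x, para_y)) < 1e-9)  # True: r misses the curve entirely

# Exponential growth: log-transforming y linearizes it exactly
exp_x = [1, 2, 3, 4, 5, 6]
exp_y = [2 ** xi for xi in exp_x]
r_log = pearson_r(exp_x, [math.log(yi) for yi in exp_y])
print(abs(r_log - 1) < 1e-9)  # True: log(2^x) = x·ln 2 is perfectly linear
```

The parabola's r is zero because the positive products on the right half of the scatterplot exactly cancel the negative products on the left half.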

Other Assumptions and Limitations

  • Requires quantitative, paired data: both variables must be numerical and measured on the same individuals or cases.
  • Sensitive to outliers: a single extreme point can shift r dramatically, as discussed above.
  • Switching x and y doesn't change r: correlation is symmetric, but regression is not. The regression of y on x gives a different line than the regression of x on y.
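The symmetry of r, and the asymmetry of regression, can be checked in a few lines (made-up paired data):

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson r via z-scores."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((xi - mx) / sx * (yi - my) / sy for xi, yi in zip(x, y)) / (len(x) - 1)

x = [1, 2, 3, 4, 5]
y = [2, 5, 4, 8, 7]                    # made-up paired data

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)
print(abs(r_xy - r_yx) < 1e-12)        # True: correlation is symmetric

# Regression is NOT symmetric: the two slopes b = r·(s_y/s_x) differ
b_y_on_x = r_xy * stdev(y) / stdev(x)  # slope for predicting y from x
b_x_on_y = r_xy * stdev(x) / stdev(y)  # slope for predicting x from y
print(abs(b_y_on_x - 1 / b_x_on_y) > 0.01)  # True: not simple reciprocals
```

A useful side fact that falls out of the slope formula: the product of the two slopes is always r², since the s_x and s_y factors cancel.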

Compare: Correlation with a lurking variable vs. true association. If ice cream sales and sunburn rates are correlated, temperature is the lurking variable driving both. Recognizing this distinction is essential for FRQs asking you to critique causal claims from observational data.


Visualizing Correlation with Scatterplots

Scatterplots are your primary tool for assessing relationships before calculating correlation. Visual inspection reveals patterns, outliers, and potential problems that numbers alone cannot show.

Scatterplot Interpretation

When describing a scatterplot, cover these four features in order:

  1. Direction: positive, negative, or neither
  2. Form: linear, curved, or no clear pattern
  3. Strength: how tightly the points follow the form
  4. Unusual features: outliers, clusters, or gaps

The explanatory variable goes on the x-axis and the response on the y-axis. This convention matters for regression, but not for correlation itself (since r is symmetric).

Residual Plots

  • Plot residuals vs. fitted values (or vs. x): this diagnostic tool reveals whether a linear model is appropriate.
  • Random scatter indicates good fit: if residuals show no pattern, the linear model captures the relationship well.
  • Curved patterns indicate nonlinearity: a U-shape or other systematic pattern in the residual plot means the linear model is inadequate, even if r looks decent.

Using Technology

  • Graphing calculators compute r automatically: on the TI-84, use LinReg(ax+b) after entering data in lists. Make sure "DiagnosticOn" is enabled, or r and r² won't display.
  • Always plot first: technology will calculate r for any paired data, even when correlation is meaningless (nonlinear data, categorical data miscoded as numbers).
  • Report r in context: stating "r = 0.85" alone is incomplete. Interpret what this means for the specific variables in the problem.

Quick Reference Table

  • Measuring linear strength: Pearson r, coefficient of determination r²
  • Interpreting strength: strong (|r| ≥ 0.8), moderate (|r| from 0.5 to 0.8), weak (|r| < 0.5)
  • Connecting to regression: slope formula b = r(s_y/s_x), line through (x̄, ȳ)
  • Proportion of variation: r² interpretation, "X% of variability in y is explained by x"
  • Factors affecting r: outliers, influential points, high-leverage points
  • Limitations of r: only detects linear relationships, sensitive to outliers
  • Correlation ≠ causation: lurking variables, confounding, spurious correlation
  • Visual assessment: scatterplots (direction, form, strength, unusual features), residual plots

Self-Check Questions

  1. If r = −0.92 for the relationship between hours of TV watched and GPA, what does this tell you about the strength and direction of the relationship? What does r² tell you that r alone doesn't?

  2. Two datasets both have r² = 0.64. One has r = 0.8 and the other has r = −0.8. How do these relationships differ, and how are they similar?

  3. A scatterplot shows a clear U-shaped pattern, but the calculated correlation is r = 0.05. Explain why this happens and what it reveals about the limitations of correlation.

  4. Compare how an outlier in the middle of the x-range versus an outlier at an extreme x-value would affect the correlation coefficient and regression line.

  5. A study finds a strong positive correlation (r = 0.78) between ice cream sales and sunburn rates. A newspaper headline claims "Eating ice cream causes sunburns." Using the concept of lurking variables, explain why this causal claim is flawed and what study design would be needed to establish causation.