
Correlation Coefficients


Why This Matters

Correlation is one of the most powerful tools you'll use in AP Statistics—it's the gateway to understanding how two quantitative variables relate to each other. When you're analyzing bivariate data, the correlation coefficient $r$ tells you both the direction (positive or negative) and strength (weak, moderate, or strong) of a linear relationship. This concept connects directly to scatterplot interpretation, least-squares regression, residual analysis, and the coefficient of determination $r^2$—all of which are heavily tested on the AP exam.

Here's what you're really being tested on: Can you interpret what $r$ means in context? Can you recognize when correlation is misleading (outliers, nonlinear patterns, lurking variables)? Can you connect $r$ to the regression slope formula $b = r \cdot \frac{s_y}{s_x}$? Don't just memorize that $r$ ranges from $-1$ to $1$—know why certain values matter, what affects them, and what correlation can and cannot tell you about causation.


The Pearson Correlation Coefficient: Your Primary Tool

The Pearson correlation coefficient $r$ is the standard measure of linear association on the AP exam. It quantifies how closely data points cluster around a straight line by comparing standardized values (z-scores) of both variables.

Pearson Correlation Coefficient ($r$)

  • Measures the strength and direction of linear relationships between two quantitative variables—this is the correlation you'll calculate and interpret most often
  • Formula uses z-scores: $r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$, though you'll typically use technology to compute it (a worked sketch follows this list)
  • Unit-free and bounded: ranges from $-1$ to $1$, making it comparable across different contexts regardless of the original units
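
If you want to see the z-score formula in action, here's a minimal Python sketch. The `pearson_r` helper and the study-hours data are invented for illustration; on the exam you'd use your calculator.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson r via the z-score formula: the average product of
    paired z-scores, using n - 1 in the denominator."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical study-hours vs. exam-score data (illustrative only)
hours = [1, 2, 3, 4, 5, 6]
scores = [55, 61, 68, 70, 78, 83]
r = pearson_r(hours, scores)
print(round(r, 3), round(r ** 2, 3))  # r and the coefficient of determination
```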

Coefficient of Determination ($r^2$)

  • Represents the proportion of variation explained—if $r = 0.8$, then $r^2 = 0.64$, meaning 64% of the variation in $y$ is explained by the linear relationship with $x$
  • Always between 0 and 1: since it's a squared value, $r^2$ is never negative and provides an intuitive percentage interpretation
  • Key for FRQs: when asked to interpret a regression, stating that "$r^2$ (as a percent) of the variability in [response] is explained by [explanatory]" is a standard response

Compare: $r$ vs. $r^2$—both describe linear relationships, but $r$ gives direction and strength while $r^2$ gives the proportion of variation explained. On FRQs, you'll often need to interpret both: use $r$ for "strong positive/negative" language and $r^2$ for "percent of variability explained."


Interpreting Strength and Direction

Understanding what correlation values actually mean is essential for multiple-choice questions and FRQ interpretations. The magnitude (absolute value) indicates strength, while the sign indicates direction.

Strength Categories

  • Strong correlation: $|r| \geq 0.8$ indicates data points cluster tightly around the regression line with little scatter
  • Moderate correlation: $|r|$ between approximately 0.5 and 0.8 shows a clear trend but with noticeable spread around the line
  • Weak correlation: $|r| < 0.5$ means the linear pattern is present but data points are widely scattered—other factors likely influence $y$ (these cutoffs are encoded in the sketch below)
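
These cutoffs are conventions rather than hard rules (different textbooks draw the lines slightly differently), but one way to encode them is a small hypothetical helper like this:

```python
def strength(r):
    """Label |r| using the rough AP-style cutoffs above; the
    boundaries are conventions, not hard rules."""
    magnitude = abs(r)
    if magnitude >= 0.8:
        return "strong"
    if magnitude >= 0.5:
        return "moderate"
    return "weak"

print(strength(-0.92))  # strong: the sign affects direction, not strength
print(strength(0.35))   # weak
```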

Direction of Association

  • Positive correlation ($r > 0$): as $x$ increases, $y$ tends to increase—the scatterplot slopes upward from left to right
  • Negative correlation ($r < 0$): as $x$ increases, $y$ tends to decrease—the scatterplot slopes downward from left to right
  • No linear correlation ($r \approx 0$): no consistent linear pattern exists, though a nonlinear relationship might still be present

Perfect Correlation

  • $r = 1$ or $r = -1$: all data points fall exactly on the regression line with zero residuals—rare in real data
  • Indicates a deterministic linear relationship: knowing $x$ perfectly predicts $y$ with no error
  • Exam context: perfect correlations typically appear in theoretical questions or as benchmarks for comparison

Compare: Strong positive ($r = 0.9$) vs. strong negative ($r = -0.9$)—both indicate equally strong linear relationships with the same $r^2 = 0.81$. The difference is only in direction. Don't assume negative correlations are "weaker" than positive ones.


Connecting Correlation to Regression

The correlation coefficient isn't just a standalone statistic—it's mathematically linked to the least-squares regression line. Understanding this connection helps you move between correlation and regression problems seamlessly.

Slope Formula Using $r$

  • The slope formula $b = r \cdot \frac{s_y}{s_x}$ directly connects correlation to the regression line—know this relationship cold
  • Sign of slope matches sign of $r$: a positive correlation always produces a positive slope, and vice versa
  • Standardized interpretation: when both variables are converted to z-scores, the slope of the regression line equals $r$

Regression Line Properties

  • Always passes through $(\bar{x}, \bar{y})$: the point of averages lies exactly on the least-squares regression line
  • Minimizes sum of squared residuals: this is the "least squares" criterion that defines the LSRL
  • Y-intercept formula: $a = \bar{y} - b\bar{x}$, derived from the fact that the line passes through the mean point (see the sketch below)
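
A quick numeric sketch, using made-up summary statistics, shows how the slope $b$ and intercept $a$ follow from $r$, $s_x$, $s_y$, and the two means:

```python
# Hypothetical summary statistics (made up for illustration)
r, s_x, s_y = 0.8, 2.0, 10.0
x_bar, y_bar = 5.0, 70.0

b = r * s_y / s_x       # slope: b = r * (s_y / s_x) -> 4.0
a = y_bar - b * x_bar   # intercept: a = y_bar - b * x_bar -> 50.0
print(b, a)

# The LSRL passes through the point of averages (x_bar, y_bar):
print(a + b * x_bar == y_bar)  # True
```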

Residual Connection

  • Residuals sum to zero: $\sum(y_i - \hat{y}_i) = 0$ is a property of least-squares regression
  • Residual plots reveal fit quality: a patternless scatter indicates the linear model is appropriate; curves suggest nonlinearity
  • $r^2$ relates to residual variation: higher $r^2$ means smaller residuals relative to total variation in $y$ (both properties are verified in the sketch below)
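
Both residual properties can be checked directly. This sketch reuses the hypothetical data from earlier and verifies that the residuals sum to (essentially) zero and that $r^2 = 1 - \text{SSE}/\text{SST}$, the identity connecting $r^2$ to residual variation:

```python
from statistics import mean, stdev

xs = [1, 2, 3, 4, 5, 6]            # hypothetical explanatory values
ys = [55, 61, 68, 70, 78, 83]      # hypothetical response values

x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)
r = sum((x - x_bar) / s_x * (y - y_bar) / s_y
        for x, y in zip(xs, ys)) / (len(xs) - 1)

b = r * s_y / s_x                  # slope from r
a = y_bar - b * x_bar              # intercept through the point of averages
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(round(sum(residuals), 10))   # 0.0, up to floating-point error

sse = sum(e ** 2 for e in residuals)     # unexplained (residual) variation
sst = sum((y - y_bar) ** 2 for y in ys)  # total variation in y
print(round(1 - sse / sst, 6), round(r ** 2, 6))  # match: r^2 = 1 - SSE/SST
```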

Compare: Correlation $r$ vs. slope $b$—both indicate direction, but $r$ is unit-free (always between $-1$ and $1$) while $b$ has units of $\frac{\text{units of } y}{\text{units of } x}$. You can have a strong correlation with a small slope if $s_y$ is much smaller than $s_x$.


What Affects Correlation: Outliers and Influential Points

One of the most tested concepts is how outliers and influential points can distort correlation. A single unusual observation can dramatically change $r$, making visual inspection of scatterplots essential.

Outliers in Bivariate Data

  • Can inflate or deflate $r$ depending on their position—an outlier far from the overall pattern typically weakens correlation
  • Identified through scatterplots and residual plots: look for points with unusually large residuals or unusual $x$-values
  • Always investigate before removing: outliers may represent data errors, but they might also be legitimate and important observations

High-Leverage Points

  • Have substantially larger or smaller $x$-values than other observations—they "pull" the regression line toward themselves
  • May or may not be influential: a high-leverage point that follows the existing pattern has little effect; one that deviates is highly influential
  • Critical for interpretation: if removing a point substantially changes $r$, slope, or intercept, it's influential

Influential Points

  • Change the relationship substantially when removed—this includes changes to slope, intercept, and/or correlation
  • Often are both outliers and high-leverage: the combination of unusual $x$ and unusual $y$ creates maximum influence
  • Exam strategy: if asked about the effect of removing a point, consider whether it's pulling the line toward or away from the overall pattern

Compare: High-leverage point vs. influential point—all influential points have leverage, but not all high-leverage points are influential. A point with extreme $x$ that falls exactly on the regression line has high leverage but isn't influential because removing it wouldn't change the line.
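
To see influence in action, here's a small sketch (data invented for illustration) in which a single high-leverage point that breaks the pattern flips $r$ from strongly positive to negative:

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    # Same z-score computation as the earlier sketch
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Five points with a clear positive linear trend (invented data)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 7, 9]
print(round(pearson_r(xs, ys), 3))               # about 0.995

# Add one high-leverage point that breaks the pattern
print(round(pearson_r(xs + [15], ys + [0]), 3))  # about -0.444: direction reversed
```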


Limitations and Common Misconceptions

Understanding what correlation cannot tell you is just as important as knowing what it measures. These limitations appear frequently in multiple-choice questions designed to test conceptual understanding.

Correlation Does Not Imply Causation

  • Two variables can be strongly correlated without one causing the other—this is perhaps the most important statistical principle to internalize
  • Lurking (confounding) variables may influence both variables at once, creating a spurious correlation between them
  • Establishing causation requires experiments: only randomized controlled experiments can demonstrate cause-and-effect relationships

Nonlinear Relationships

  • $r$ only measures linear association: a perfect curved relationship can have $r \approx 0$ because Pearson correlation doesn't detect nonlinear patterns (demonstrated in the sketch below)
  • Always examine scatterplots: before calculating rr, verify that a linear model is appropriate for the data
  • Transformations may help: if data show curvature, applying log, square root, or other transformations might linearize the relationship
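
Here's a sketch of the classic example: a perfect quadratic relationship $y = x^2$ over a symmetric range of $x$-values produces $r = 0$ exactly, even though $y$ is completely determined by $x$:

```python
from statistics import mean, stdev

# A perfect U-shaped relationship: y = x^2 over a symmetric range of x
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)
r = sum((x - x_bar) / s_x * (y - y_bar) / s_y
        for x, y in zip(xs, ys)) / (len(xs) - 1)
print(r)  # 0.0: y is completely determined by x, yet r detects no line
```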

Assumptions for Pearson Correlation

  • Requires quantitative, paired data: both variables must be numerical and measured on the same individuals/cases
  • Assumes linearity: the relationship should be approximately linear for $r$ to be meaningful
  • Sensitive to outliers: unlike rank-based alternatives, Pearson correlation can be heavily influenced by extreme values

Compare: Correlation with lurking variable vs. true association—if ice cream sales and drowning deaths are correlated, temperature is the lurking variable causing both. Recognizing this distinction is essential for FRQs asking you to critique causal claims from observational data.


Visualizing Correlation with Scatterplots

Scatterplots are your primary tool for assessing relationships before calculating correlation. Visual inspection reveals patterns, outliers, and potential problems that numbers alone cannot show.

Scatterplot Interpretation

  • Explanatory variable on x-axis, response on y-axis: this convention matters for regression but not for correlation ($r$ is symmetric)
  • Assess form, direction, strength, and unusual features: describe the overall pattern before summarizing with statistics
  • Look for clusters, gaps, and outliers: these features affect interpretation and may suggest subgroups in the data

Residual Plots

  • Plot residuals vs. fitted values (or vs. $x$): this diagnostic tool reveals whether a linear model is appropriate (a plotting sketch follows this list)
  • Random scatter indicates good fit: if residuals show no pattern, the linear model captures the relationship well
  • Curved patterns indicate nonlinearity: a U-shape or other systematic pattern suggests the linear model is inadequate
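
Here's a minimal sketch of a residual plot, assuming matplotlib is available. The fitted values and residuals are made-up numbers; in practice you'd compute them from a least-squares fit as in the earlier sketches.

```python
import matplotlib.pyplot as plt

# Made-up fitted values and residuals for illustration only
fitted = [52, 58, 63, 69, 74, 80]
residuals = [3.0, -1.5, 0.8, -2.1, 1.4, -1.6]

plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")   # reference line at residual = 0
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Look for random scatter with no pattern")
plt.show()
```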

Using Technology

  • Graphing calculators and software compute $r$ automatically: on the TI-84, use LinReg(ax+b) after entering data in lists (a Python analogue appears after this list)
  • Always plot first: technology will calculate $r$ for any paired data, even when correlation is meaningless (nonlinear data, categorical data miscoded as numbers)
  • Report $r$ in context: stating "$r = 0.85$" alone is incomplete—interpret what this means for the specific variables
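
If you're working outside the calculator, SciPy's linregress reports the same slope, intercept, and $r$ as LinReg(ax+b). A minimal sketch, assuming SciPy is installed and reusing the same hypothetical data as before:

```python
from scipy.stats import linregress

hours = [1, 2, 3, 4, 5, 6]         # hypothetical explanatory variable
scores = [55, 61, 68, 70, 78, 83]  # hypothetical response variable

result = linregress(hours, scores)
print(result.slope, result.intercept)    # slope and intercept of the fitted line
print(result.rvalue, result.rvalue**2)   # r and r^2
```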

Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Measuring linear strength | Pearson $r$, coefficient of determination $r^2$ |
| Interpreting strength | Strong ($\lvert r \rvert \geq 0.8$), moderate ($0.5 \leq \lvert r \rvert < 0.8$), weak ($\lvert r \rvert < 0.5$) |
| Connecting to regression | Slope formula $b = r(s_y/s_x)$, line through $(\bar{x}, \bar{y})$ |
| Proportion of variation | $r^2$ interpretation ("X% of variability in $y$ explained by $x$") |
| Factors affecting $r$ | Outliers, influential points, high-leverage points |
| Limitations of $r$ | Only detects linear relationships, sensitive to outliers |
| Correlation ≠ causation | Lurking variables, confounding, spurious correlation |
| Visual assessment | Scatterplots, residual plots, identifying patterns |

Self-Check Questions

  1. If $r = -0.92$ for the relationship between hours of TV watched and GPA, what does this tell you about the strength and direction of the relationship? What does $r^2$ tell you that $r$ alone doesn't?

  2. Two datasets both have $r^2 = 0.64$. One has $r = 0.8$ and the other has $r = -0.8$. How do these relationships differ, and how are they similar?

  3. A scatterplot shows a clear U-shaped pattern, but the calculated correlation is $r = 0.05$. Explain why this happens and what it reveals about the limitations of correlation.

  4. Compare and contrast how an outlier in the middle of the $x$-range versus an outlier at an extreme $x$-value would affect the correlation coefficient and regression line.

  5. A study finds a strong positive correlation ($r = 0.78$) between ice cream sales and sunburn rates. A newspaper headline claims "Eating ice cream causes sunburns." Using the concept of lurking variables, explain why this causal claim is flawed and what additional study design would be needed to establish causation.