Why This Matters
Correlation is one of the most useful tools in AP Statistics for understanding how two quantitative variables relate to each other. The correlation coefficient r tells you both the direction (positive or negative) and strength (weak, moderate, or strong) of a linear relationship. This concept connects directly to scatterplot interpretation, least-squares regression, residual analysis, and the coefficient of determination r², all of which are heavily tested on the AP exam.
What you're really being tested on: Can you interpret what r means in context? Can you recognize when correlation is misleading (outliers, nonlinear patterns, lurking variables)? Can you connect r to the regression slope formula b=r(s_y/s_x)? Don't just memorize that r ranges from −1 to 1. Know why certain values matter, what affects them, and what correlation can and cannot tell you about causation.
The Pearson correlation coefficient r is the standard measure of linear association on the AP exam. It quantifies how closely data points cluster around a straight line by comparing the standardized values (z-scores) of both variables.
Pearson Correlation Coefficient (r)
- Measures the strength and direction of linear relationships between two quantitative variables
- Formula uses z-scores: r = (1/(n−1)) Σ[((x_i−x̄)/s_x)((y_i−ȳ)/s_y)], though you'll typically use technology to compute it
- Unit-free and bounded: ranges from −1 to 1, making it comparable across different contexts regardless of the original units
Because r is calculated from z-scores, changing the units of either variable (say, converting inches to centimeters) has no effect on its value. This is a common multiple-choice trap.
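The unit-invariance claim is easy to verify numerically. Below is a minimal sketch with made-up height/weight data (the helper `pearson_r` is hypothetical, not an AP-provided function); it implements the z-score formula for r and shows that converting inches to centimeters leaves r unchanged:

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """r = (1/(n-1)) * sum of z_x * z_y, using sample standard deviations."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

heights_in = [60, 62, 65, 68, 70]        # inches (hypothetical data)
weights_lb = [110, 120, 140, 155, 165]   # pounds (hypothetical data)

r_inches = pearson_r(heights_in, weights_lb)
r_cm = pearson_r([h * 2.54 for h in heights_in], weights_lb)  # change units
print(round(r_inches, 4))  # identical to r_cm: z-scores absorb the unit change
```

The key design point: because every deviation (x_i − x̄) and the standard deviation s_x get multiplied by the same conversion factor, the factor cancels in each z-score.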
Coefficient of Determination (r²)
- Represents the proportion of variation explained. If r=0.8, then r²=0.64, meaning 64% of the variation in y is explained by the linear relationship with x.
- Always between 0 and 1: since it's a squared value, r² is never negative and gives you an intuitive percentage interpretation.
- Key for FRQs: when asked to interpret a regression, a standard response is "[r² as a percent] of the variability in [response variable] is explained by the linear relationship with [explanatory variable]."
Compare r vs. r²: both describe linear relationships, but r gives direction and strength while r² gives the proportion of variation explained. On FRQs, you'll often need to interpret both: use r for "strong positive/negative" language and r² for "percent of variability explained."
Interpreting Strength and Direction
Understanding what correlation values actually mean is essential for multiple-choice questions and FRQ interpretations. The magnitude (absolute value) indicates strength, while the sign indicates direction.
Strength Categories
- Strong correlation: |r|≥0.8 indicates data points cluster tightly around the regression line with little scatter
- Moderate correlation: |r| between approximately 0.5 and 0.8 shows a clear trend but with noticeable spread
- Weak correlation: |r|<0.5 means the linear pattern is hard to see and data points are widely scattered
These cutoffs are rough guidelines, not rigid rules. Context matters. In some fields (like psychology), r=0.5 is considered quite strong, while in physics you might expect r values very close to 1.
Direction of Association
- Positive correlation (r>0): as x increases, y tends to increase. The scatterplot slopes upward left to right.
- Negative correlation (r<0): as x increases, y tends to decrease. The scatterplot slopes downward left to right.
- No linear correlation (r≈0): no consistent linear pattern exists, though a nonlinear relationship might still be present.
Perfect Correlation
- r=1 or r=−1: all data points fall exactly on the regression line with zero residuals. This is rare in real data.
- Indicates a deterministic relationship: knowing x perfectly predicts y with no error.
- Exam context: perfect correlations typically appear in theoretical questions or as benchmarks for comparison.
Compare strong positive (r=0.9) vs. strong negative (r=−0.9): both indicate equally strong linear relationships with the same r²=0.81. The difference is only in direction. Don't assume negative correlations are "weaker" than positive ones.
Connecting Correlation to Regression
The correlation coefficient isn't just a standalone statistic. It's mathematically linked to the least-squares regression line, and understanding this connection helps you move between correlation and regression problems.
Slope Formula Using r
- The slope formula b=r(s_y/s_x) directly connects correlation to the regression line. Know this relationship cold.
- The sign of the slope always matches the sign of r: a positive correlation produces a positive slope, and vice versa. (This makes sense because s_y and s_x are always positive.)
- Standardized interpretation: when both variables are converted to z-scores, the slope of the regression line equals r.
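These three facts can be checked with a few lines of code. A sketch with made-up, roughly linear data (all names are ad hoc): it compares the least-squares slope computed from the raw sums against b=r(s_y/s_x), then regresses z-scores on z-scores to show that slope equals r.

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]   # hypothetical, roughly linear data

n = len(x)
xbar, ybar = mean(x), mean(y)
sx, sy = stdev(x), stdev(y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = sxy / ((n - 1) * sx * sy)
b_least_squares = sxy / sum((xi - xbar) ** 2 for xi in x)  # direct LSRL slope
b_from_r = r * sy / sx                                     # AP slope formula

# On standardized data, the LSRL slope collapses to r itself:
zx = [(xi - xbar) / sx for xi in x]
zy = [(yi - ybar) / sy for yi in y]
b_z = sum(u * v for u, v in zip(zx, zy)) / sum(u * u for u in zx)

print(round(b_least_squares, 4), round(b_from_r, 4), round(b_z, 4), round(r, 4))
```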
Regression Line Properties
- Always passes through (x̄, ȳ): the point of averages lies exactly on the least-squares regression line.
- Minimizes the sum of squared residuals: this is the "least squares" criterion that defines the LSRL.
- Y-intercept formula: a=ȳ−bx̄, derived from the fact that the line passes through the mean point.
Residual Connection
- Residuals sum to zero: Σ(y_i−ŷ_i)=0 for any least-squares regression line.
- Residual plots reveal fit quality: a patternless scatter of residuals indicates the linear model is appropriate; curves suggest nonlinearity.
- r² relates to residual variation: higher r² means smaller residuals relative to total variation in y.
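Both residual facts can be verified directly. A minimal sketch with made-up data: fit the LSRL, confirm the residuals sum to zero, and confirm the standard identity r² = 1 − SSE/SST (the explained share of the variation in y).

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]            # hypothetical, roughly linear data

xbar, ybar = mean(x), mean(y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = sxy / sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar                      # line passes through (x̄, ȳ)

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)               # unexplained variation
sst = sum((yi - ybar) ** 2 for yi in y)            # total variation in y
r = sxy / ((len(x) - 1) * stdev(x) * stdev(y))

print(round(sum(residuals), 10), round(r ** 2, 4), round(1 - sse / sst, 4))
```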
Compare correlation r vs. slope b: both indicate direction, but r is unit-free (always between −1 and 1) while b has units of (units of y)/(units of x). You can have a strong correlation with a small slope if s_y is much smaller than s_x.
What Affects Correlation: Outliers and Influential Points
One of the most tested concepts is how outliers and influential points can distort correlation. A single unusual observation can dramatically change r, which is why visual inspection of scatterplots is so important.
Outliers in Bivariate Data
- Can inflate or deflate r depending on their position. An outlier far from the overall pattern typically weakens correlation, while an outlier that happens to fall along the trend can strengthen it.
- Identified through scatterplots and residual plots: look for points with unusually large residuals or unusual x-values.
- Always investigate before removing: outliers may represent data errors, but they might also be legitimate and important observations.
High-Leverage Points
- Have substantially larger or smaller x-values than other observations. They "pull" the regression line toward themselves.
- May or may not be influential: a high-leverage point that follows the existing pattern has little effect on the line; one that deviates from the pattern is highly influential.
- Critical for interpretation: if removing a point substantially changes r, slope, or intercept, that point is influential.
Influential Points
- Change the regression results substantially when removed, including changes to slope, intercept, and/or correlation.
- Often are both outliers and high-leverage: the combination of unusual x and unusual y creates maximum influence.
- Exam strategy: if asked about the effect of removing a point, consider whether it's pulling the line toward or away from the overall pattern of the remaining data.
Compare high-leverage point vs. influential point: influential points typically have high leverage, but not all high-leverage points are influential. A point with an extreme x-value that falls exactly on the regression line has high leverage but isn't influential because removing it wouldn't change the line much.
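The leverage-vs-influence distinction is easy to see numerically. A sketch with made-up data (roughly y = x; the helper `lsrl_slope` is ad hoc): adding a far-out x point that follows the trend barely moves the slope, while one that breaks the trend changes it drastically.

```python
def lsrl_slope(x, y):
    """Least-squares slope from the raw deviation sums."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
            / sum((xi - xbar) ** 2 for xi in x))

x = [1, 2, 3, 4, 5]
y = [1.1, 2.0, 2.9, 4.1, 5.0]                 # hypothetical, roughly y = x

b_base = lsrl_slope(x, y)
b_follows = lsrl_slope(x + [10], y + [10.0])  # high leverage, on the trend
b_breaks = lsrl_slope(x + [10], y + [1.0])    # high leverage, off the trend

print(round(b_base, 3), round(b_follows, 3), round(b_breaks, 3))
# The on-trend point leaves the slope near 1; the off-trend point drags it
# toward 0, which is what "influential" means in practice.
```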
Limitations and Common Misconceptions
Understanding what correlation cannot tell you is just as important as knowing what it measures. These limitations appear frequently in multiple-choice questions designed to test conceptual understanding.
Correlation Does Not Imply Causation
This is perhaps the most important statistical principle to internalize. Two variables can be strongly correlated without one causing the other.
- Lurking (confounding) variables may drive the relationship between both variables, creating a spurious correlation. For example, ice cream sales and drowning deaths are correlated because hot weather (the lurking variable) increases both.
- Establishing causation requires experiments: only randomized controlled experiments can demonstrate cause-and-effect relationships. Observational studies, no matter how strong the correlation, cannot establish causation on their own.
Nonlinear Relationships
- r only measures linear association: a perfect curved relationship (like a parabola) can have r≈0 because Pearson correlation doesn't detect nonlinear patterns.
- Always examine scatterplots: before interpreting r, verify that a linear model is appropriate for the data.
- Transformations may help: if data show curvature, applying log, square root, or power transformations might linearize the relationship so that r becomes meaningful.
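The parabola case can be demonstrated exactly. A sketch using x-values chosen symmetric about 0 so the linear part of the pattern cancels: y is a deterministic function of x, yet r comes out essentially zero.

```python
from statistics import mean, stdev

x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]                # perfect U-shape: y = x^2

n = len(x)
xbar, ybar = mean(x), mean(y)
r = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / ((n - 1) * stdev(x) * stdev(y)))

print(round(r, 10))  # essentially 0, despite a perfect nonlinear relationship
```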
Other Assumptions and Limitations
- Requires quantitative, paired data: both variables must be numerical and measured on the same individuals or cases.
- Sensitive to outliers: a single extreme point can shift r dramatically, as discussed above.
- Switching x and y doesn't change r: correlation is symmetric, but regression is not. The regression of y on x gives a different line than the regression of x on y.
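The symmetry point can be sketched numerically (made-up data; `pearson_r` is an ad hoc helper): swapping the variables leaves r unchanged, but the y-on-x and x-on-y regression slopes differ, and their product equals r².

```python
from statistics import mean, stdev

def pearson_r(x, y):
    n, mx, my = len(x), mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / ((n - 1) * stdev(x) * stdev(y)))

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]            # hypothetical paired data

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)                   # swapped: correlation is symmetric

b_y_on_x = r_xy * stdev(y) / stdev(x)    # slope of the y-on-x LSRL
b_x_on_y = r_xy * stdev(x) / stdev(y)    # slope of the x-on-y LSRL

print(round(r_xy, 4), round(b_y_on_x, 4), round(b_x_on_y, 4))
# Different slopes (regression is not symmetric); their product is r^2.
```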
Compare correlation with a lurking variable vs. true association: if ice cream sales and sunburn rates are correlated, temperature is the lurking variable driving both. Recognizing this distinction is essential for FRQs asking you to critique causal claims from observational data.
Visualizing Correlation with Scatterplots
Scatterplots are your primary tool for assessing relationships before calculating correlation. Visual inspection reveals patterns, outliers, and potential problems that numbers alone cannot show.
Scatterplot Interpretation
When describing a scatterplot, cover these four features in order:
- Direction: positive, negative, or neither
- Form: linear, curved, or no clear pattern
- Strength: how tightly the points follow the form
- Unusual features: outliers, clusters, or gaps
The explanatory variable goes on the x-axis and the response on the y-axis. This convention matters for regression, but not for correlation itself (since r is symmetric).
Residual Plots
- Plot residuals vs. fitted values (or vs. x): this diagnostic tool reveals whether a linear model is appropriate.
- Random scatter indicates good fit: if residuals show no pattern, the linear model captures the relationship well.
- Curved patterns indicate nonlinearity: a U-shape or other systematic pattern in the residual plot means the linear model is inadequate, even if r looks decent.
Using Technology
- Graphing calculators compute r automatically: on the TI-84, use LinReg(ax+b) after entering data in lists. Make sure "DiagnosticOn" is enabled, or r and r² won't display.
- Always plot first: technology will calculate r for any paired data, even when correlation is meaningless (nonlinear data, categorical data miscoded as numbers).
- Report r in context: stating "r=0.85" alone is incomplete. Interpret what this means for the specific variables in the problem.
Quick Reference Table
| Concept | Key Points |
| --- | --- |
| Measuring linear strength | Pearson r, coefficient of determination r² |
| Interpreting strength | Strong (\|r\| ≥ 0.8), Moderate (0.5–0.8), Weak (< 0.5) |
| Connecting to regression | Slope formula b=r(s_y/s_x), line through (x̄, ȳ) |
| Proportion of variation | r² interpretation: "X% of variability in y explained by x" |
| Factors affecting r | Outliers, influential points, high-leverage points |
| Limitations of r | Only detects linear relationships, sensitive to outliers |
| Correlation ≠ causation | Lurking variables, confounding, spurious correlation |
| Visual assessment | Scatterplots (direction, form, strength, unusual features), residual plots |
Self-Check Questions
- If r=−0.92 for the relationship between hours of TV watched and GPA, what does this tell you about the strength and direction of the relationship? What does r² tell you that r alone doesn't?
- Two datasets both have r²=0.64. One has r=0.8 and the other has r=−0.8. How do these relationships differ, and how are they similar?
- A scatterplot shows a clear U-shaped pattern, but the calculated correlation is r=0.05. Explain why this happens and what it reveals about the limitations of correlation.
- Compare how an outlier in the middle of the x-range versus an outlier at an extreme x-value would affect the correlation coefficient and regression line.
- A study finds a strong positive correlation (r=0.78) between ice cream sales and sunburn rates. A newspaper headline claims "Eating ice cream causes sunburns." Using the concept of lurking variables, explain why this causal claim is flawed and what study design would be needed to establish causation.