Correlation measures the strength and direction of the relationship between two variables. It is represented numerically by the correlation coefficient, which in statistics we denote as r.
The correlation coefficient shows the degree to which the two variables are linearly related, that is, how close the points are to forming a line. Its sign, positive or negative, matches the direction of the scatterplot. The coefficient takes a value between -1 and 1: r = -1 means that the points fall exactly on a decreasing line, while r = 1 means that the points fall exactly on an increasing line. A correlation coefficient of 0 means that there is no linear correlation between the two variables.
It's important to note that the correlation coefficient only measures the linear relationship between two variables. It does not indicate the strength or nature of any nonlinear relationships that may exist. Additionally, the correlation coefficient does not indicate the cause-and-effect relationship between the two variables. As mentioned before, correlation =/= causation!
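To make the linear-only caveat concrete, here is a minimal Python sketch. The data are made up for illustration (y is exactly x squared), and `pearson_r` is a hypothetical helper name, not something from the course materials.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sum of z-score products divided by n - 1 (sample SDs)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Made-up data: y depends perfectly on x, but along a parabola, not a line.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = pearson_r(xs, ys)
# The positive and negative z-score products cancel, so r is essentially 0.
print(abs(r_curved) < 1e-9)
```

Here y is completely determined by x, yet r comes out as zero (up to rounding), because the relationship is not linear.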
Examples
Here are some scatterplots and their values of r:

Also, there are a few things to keep in mind about correlation.
- Even if r has a high magnitude, the relationship may not be linear; instead, it may be curved. We'll discuss this more in later sections.
- A high magnitude of correlation does not imply causation.
- The correlation coefficient is not resistant to outliers, which makes sense, given that the formula that we shall learn uses the mean and standard deviation, which by themselves are not resistant.
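The last point, that r is not resistant to outliers, can be checked with a small sketch. The six base points and the single outlier are invented for illustration, and `pearson_r` is a hypothetical helper.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sum of z-score products divided by n - 1 (sample SDs)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Made-up data with only a weak upward trend.
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 1, 4, 2, 5, 3]
r_without = pearson_r(xs, ys)

# Add one extreme point far from the rest of the data.
r_with = pearson_r(xs + [20], ys + [20])

print(f"without outlier: r = {r_without:.2f}")
print(f"with outlier:    r = {r_with:.2f}")
```

One added point is enough to move r from weak to very strong, which is why it helps to look at the scatterplot before trusting the number.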
Calculating the Correlation Coefficient
To find the value of r, we have this formula, which is found on the formula sheet:
r = (1/(n−1)) Σ [(xi − x̄)/sx]·[(yi − ȳ)/sy]
Although this may seem like a complicated formula, it’s not that bad to understand (but harder to compute). To find r, first find the mean and standard deviations of both the x and y variables. Then, for each data point, multiply the x and y z-scores for that point. Finally, add all the individual products up and divide by the number of data points minus 1.
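The steps just described can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `correlation` and the small data set (`hours`, `score`) are made up, and only the standard library is used.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of z-score products divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)   # step 1: means
    s_x, s_y = stdev(xs), stdev(ys)     # step 1: sample SDs (n - 1 denominator)
    if s_x == 0 or s_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    # step 2: multiply the x and y z-scores for each point
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    # step 3: add the products and divide by n - 1
    return sum(products) / (len(xs) - 1)

# Made-up illustrative data
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
r = correlation(hours, score)
print(round(r, 3))  # 0.775
```

For these five points the deviations give Σ(x − x̄)(y − ȳ) = 6, with Σ(x − x̄)² = 10 and Σ(y − ȳ)² = 6, so r = 6/√60 ≈ 0.775, matching the code.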
You will seldom need to do this by hand, since most graphing calculators can find r easily. On the TI-84, the most common graphing calculator in AP Stats, enter your data into L1 and L2, then go to STAT > CALC > LinReg(ax+b).
To be sure that you get the r-value, verify that "Stat Diagnostics" is on via MODE.
🎥 Watch: AP Stats - Scatterplots and Association
Practice Problems
(1) A study was conducted to examine the relationship between hours of exercise per week and body mass index (BMI). The following scatterplot shows the results of the study for a sample of 25 individuals.
Based on the scatterplot, which of the following statements is true?
(A) There is a strong positive correlation between hours of exercise per week and BMI.
(B) There is a strong negative correlation between hours of exercise per week and BMI.
(C) There is a moderate positive correlation between hours of exercise per week and BMI.
(D) There is a moderate negative correlation between hours of exercise per week and BMI.
(E) There is no correlation between hours of exercise per week and BMI.
(2) TRUE or FALSE
- A scatterplot is a graphical representation of the relationship between two variables.
- A correlation coefficient of 1 indicates a strong positive correlation between two variables.
- A correlation coefficient of -1 indicates a strong positive correlation between two variables.
- A correlation coefficient of 0 indicates no linear correlation between two variables.
- The correlation coefficient only measures linear relationships between two variables.
- The correlation coefficient indicates the strength and direction of the relationship between two variables.
- The correlation coefficient indicates the cause and effect relationship between two variables.
- Correlation implies causation, meaning that if two variables are correlated, one variable must cause the other.
- A scatterplot can show nonlinear relationships between two variables.
- A scatterplot can be used to predict the value of one variable based on the value of the other variable.
Answers
(1) The correct answer is (D): there is a moderate negative correlation between hours of exercise per week and BMI. The scatterplot shows that as hours of exercise per week increase, BMI tends to decrease. The relationship is not perfectly linear, but there is a clear trend in the data. The correlation coefficient could be calculated to quantify the strength of this relationship.
(2) T, T, F, T, T, T, F, F, T, T
Vocabulary
The following words are mentioned explicitly in the College Board Course and Exam Description for this topic.
| Term | Definition |
|---|---|
| causation | A relationship where changes in one variable directly cause changes in another variable. |
| correlation | A numerical measure (r) that describes the strength and direction of a linear relationship between two variables, ranging from -1 to 1. |
| linear model | A mathematical representation of the linear relationship between two variables. |
| linear relationship | A relationship between two variables that can be described by a straight line. |
| quantitative variable | A variable that is measured numerically and can take on a range of values, allowing for mathematical operations and statistical analysis. |
Frequently Asked Questions
How do I calculate the correlation coefficient r using the formula?
Use the Pearson formula: r = (1/(n−1)) Σ [(xi − x̄)/sx]·[(yi − ȳ)/sy]. Steps:
1) Compute x̄ and ȳ (the means) and sx and sy (the sample SDs, with denominator n−1).
2) For each pair (xi, yi), compute the standardized scores (xi−x̄)/sx and (yi−ȳ)/sy.
3) Multiply those two standardized scores for each i and sum the products.
4) Divide the sum by (n−1).
The result is unit-free and lies between −1 and 1; its sign gives the direction and its magnitude gives the strength of the linear association. AP tip: the formula appears in the CED, but you’ll usually compute r with technology on the exam (bring a graphing calculator). Also, r near ±1 doesn’t guarantee that a linear model fits, so check scatterplots and outliers. For more review, see the Topic 2.5 study guide (https://library.fiveable.me/ap-statistics/unit-2/correlation/study-guide/LlS81pC6QricXgIKNuFM) and Unit 2 overview (https://library.fiveable.me/ap-statistics/unit-2). For lots of practice problems, try (https://library.fiveable.me/practice/ap-statistics).
What's the difference between positive and negative correlation?
Positive correlation: as one quantitative variable increases, the other tends to increase. On a scatterplot points slope upward; the Pearson correlation r is > 0 (up to +1). A value near +1 means a strong positive linear association. Negative correlation: as one variable increases, the other tends to decrease. On a scatterplot points slope downward; r is < 0 (down to −1). A value near −1 means a strong negative linear association. Key AP points: r is unit-free and always between −1 and 1; r = 0 means no linear association, r = ±1 is a perfect linear relationship (but a high |r| doesn’t prove a linear model is appropriate). Outliers or influential points can change r a lot, and correlation does not imply causation. You’ll usually calculate r with technology on the exam (bring an approved calculator). For a quick review, see the Topic 2.5 study guide (https://library.fiveable.me/ap-statistics/unit-2/correlation/study-guide/LlS81pC6QricXgIKNuFM) and more Unit 2 resources (https://library.fiveable.me/ap-statistics/unit-2). For practice, try the AP problems (https://library.fiveable.me/practice/ap-statistics).
When do I use correlation vs causation and how do I know which is which?
Use correlation (r) when you want to measure direction and strength of a linear association between two quantitative variables—you’ll usually check a scatterplot and compute r (−1 ≤ r ≤ 1). Remember: r ≈ 0 means no linear association; r = ±1 means a perfect linear association. But correlation ≠ causation (CED DAT-1.C.2). When can you claim causation? Only when data come from a well-designed randomized experiment (random assignment of treatments) or very strong, well-justified causal inference with controls for confounding. If your data are observational (surveys, retrospective data), you can describe association but not claim one variable causes the other because of possible confounders, reverse causation, or lurking variables. On the AP exam, show this thinking: display the scatterplot, report r, discuss linearity/outliers (CED DAT-1.B.3), and explicitly state whether the study design supports causal claims. For a quick review, see the Topic 2.5 study guide (https://library.fiveable.me/ap-statistics/unit-2/correlation/study-guide/LlS81pC6QricXgIKNuFM) and practice problems (https://library.fiveable.me/practice/ap-statistics).
I don't understand why correlation doesn't mean causation - can someone explain this with examples?
Correlation measures the direction and strength of a linear association (r) between two quantitative variables, but it doesn’t prove one causes the other (CED: DAT-1.B, DAT-1.C). Why? There are three main reasons: a lurking variable, reverse causation, or coincidence. Examples: ice-cream sales and drownings can show a strong positive correlation, but hotter temperature (a lurking variable) drives both; ice cream does not cause drownings. Larger shoe size correlates with better reading scores in kids, but age is the confounder (older kids have bigger feet and read better). You can also get a strong correlation from coincidence in small samples or from noncausal mechanisms. To claim causation you need a well-designed experiment (random assignment) or strong causal reasoning, not just r (CED: correlation ≠ causation). Review Topic 2.5 for more (study guide: https://library.fiveable.me/ap-statistics/unit-2/correlation/study-guide/LlS81pC6QricXgIKNuFM) and practice problems (https://library.fiveable.me/practice/ap-statistics).
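The ice-cream example can be imitated with a small simulation. Everything here is an assumption made for illustration: the numbers are simulated, not real data, the coefficients are arbitrary, and `pearson_r` is a hypothetical helper. Temperature is built to drive both variables, and neither variable affects the other.

```python
import random
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sum of z-score products divided by n - 1 (sample SDs)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Simulated days: temperature (the lurking variable) drives BOTH quantities.
temps = [rng.uniform(15, 35) for _ in range(2000)]
ice_cream = [10 * t + rng.gauss(0, 20) for t in temps]
drownings = [0.3 * t + rng.gauss(0, 1) for t in temps]

r_all = pearson_r(ice_cream, drownings)

# Hold the lurking variable roughly constant: keep only days from 24 to 26 degrees.
band = [(i, d) for t, i, d in zip(temps, ice_cream, drownings) if 24 <= t <= 26]
r_band = pearson_r([i for i, _ in band], [d for _, d in band])

print(f"all days: r = {r_all:.2f}")
print(f"similar-temperature days: r = {r_band:.2f}")
```

Across all days the two variables are strongly correlated, but among days with nearly the same temperature the correlation mostly disappears, which is what a lurking variable looks like in data.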
What does it mean when r is close to 0 vs close to 1 or -1?
If r is close to 1 or -1: that tells you a strong linear association and the direction. r ≈ 1 means points lie close to an upward-sloping line (strong positive); r ≈ −1 means they lie close to a downward-sloping line (strong negative). r = 1 or −1 is a perfect linear relationship. If r is close to 0: there’s little or no linear association—the variables don’t follow a straight-line pattern. Keep these AP-specific cautions in mind (CED DAT-1.B / DAT-1.C): r is unit-free and always between −1 and 1; r = 0 means no linear association, but a nonzero r doesn’t prove causation. Also, a high |r| doesn’t automatically mean a linear model is appropriate—outliers or nonlinear patterns can mislead you. Check the scatterplot and residuals before fitting regression; use technology to compute r on the exam when allowed. For a quick review, see the Topic 2.5 study guide (https://library.fiveable.me/ap-statistics/unit-2/correlation/study-guide/LlS81pC6QricXgIKNuFM) and try practice problems (https://library.fiveable.me/practice/ap-statistics).
How do I find correlation on my calculator step by step?
Use a graphing calculator (the steps below are for the TI-83/84 family; AP allows graphing calculators on the exam). Quick checklist first: make a scatterplot, then compute r and interpret.
1) Enter data: STAT → 1:Edit → put x’s in L1 and y’s in L2.
2) Turn on diagnostics so r shows: 2nd → 0 (CATALOG), press D to jump, select DiagnosticOn, ENTER twice.
3) Make a scatterplot (optional check): 2nd → Y= (STAT PLOT) → Plot1: ON, choose scatter, Xlist L1, Ylist L2 → ZOOM → 9:ZoomStat to view.
4) Compute correlation: STAT → CALC → 4:LinReg(ax+b) → L1, L2 → ENTER. The output gives a (slope), b (intercept), and r (Pearson correlation). If r does not appear, make sure DiagnosticOn is set, or use LinRegTTest, which also reports r.
5) Interpret r: direction (sign), strength (|r| near 1 strong, near 0 weak), and remember correlation only measures linear association (CED DAT-1.B/C).
For extra review and AP practice problems, see the Topic 2.5 study guide (https://library.fiveable.me/ap-statistics/unit-2/correlation/study-guide/LlS81pC6QricXgIKNuFM) and Unit 2 overview (https://library.fiveable.me/ap-statistics/unit-2).
What's the formula for correlation coefficient and do I need to memorize it for the AP exam?
Formula: r = (1/(n − 1)) Σ [ (xi − x̄)/sx ] [ (yi − ȳ)/sy ]. That is, r is essentially the average product of the standardized x and y values, dividing by n − 1 rather than n (Pearson correlation). Remember: r is unit-free and always between −1 and 1; r near 0 means little or no linear association, and r close to ±1 means a strong linear association but doesn’t guarantee a linear model is appropriate (outliers/nonlinearity can fool you). Do you need to memorize it for the AP exam? Not really. The AP supplies formula sheets and tables during the exam, and you’re expected to use technology for computing r in practice (CED DAT-1.B.2). But you should understand what the formula means, how r relates to slope (b = r·(sy/sx)), and how to interpret strength/direction and limits of correlation. For a quick review, see the Topic 2.5 study guide (https://library.fiveable.me/ap-statistics/unit-2/correlation/study-guide/LlS81pC6QricXgIKNuFM) and try practice problems (https://library.fiveable.me/practice/ap-statistics).
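The link between r and the slope, b = r·(sy/sx), can be checked numerically. The sketch below uses made-up data and hypothetical names (`pearson_r`, `hours`, `score`); it compares the slope from the usual least-squares formula with the slope rebuilt from r.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sum of z-score products divided by n - 1 (sample SDs)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Made-up illustrative data
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

# Least-squares slope computed directly from deviations.
x_bar, y_bar = mean(hours), mean(score)
slope_ls = (sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, score))
            / sum((x - x_bar) ** 2 for x in hours))

# The same slope rebuilt from r and the two standard deviations.
slope_from_r = pearson_r(hours, score) * stdev(score) / stdev(hours)

print(round(slope_ls, 4), round(slope_from_r, 4))  # 0.6 0.6
```

Both routes give a slope of 0.6 for this data, so r can be read as the slope of the least-squares line when both variables are measured in standard deviations.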
Can correlation be greater than 1 or less than -1 and what happens if I get that answer?
No—the Pearson correlation r must always be between −1 and 1 (inclusive). The CED explicitly says r is unit-free and always between −1 and 1; r = ±1 means a perfect linear association and r = 0 means no linear association (DAT-1.C.1, DAT-1.B.1). If you get a value > 1 or < −1, you’ve got a calculation or data issue. Common fixes: recheck your formula or calculator command, make sure you used the correct lists (no extra values), confirm you used sample standard deviations (sx, sy) and n−1 in the denominator, and check that neither sx nor sy is zero (constant data makes r undefined). If you’re using tech, re-enter the data or rerun the stats routine. For a quick refresher on correlation and how to compute it with technology, see the Topic 2.5 study guide (https://library.fiveable.me/ap-statistics/unit-2/correlation/study-guide/LlS81pC6QricXgIKNuFM). For more practice, try problems at https://library.fiveable.me/practice/ap-statistics.
I'm confused about when a high correlation doesn't mean the linear model is good - how is that possible?
A high correlation (r near 1 or −1) only measures the strength of a *linear* association, so it can be misleading if the relationship isn’t actually linear. Common ways a high |r| can still mean a bad linear model:
- Nonlinear pattern: data follow a curve (e.g., quadratic); r can be large but the straight-line fit has big systematic residuals. Check the scatterplot and residual plot.
- Outlier or influential point: one extreme point can inflate r and the slope. That single point can make r big while the line doesn’t fit the bulk of the data.
- Restricted range: if x values cover a narrow interval, a curved relationship can look nearly linear there, so r can be high even though predictions are poor outside that range.
- Heteroscedasticity or clusters: varying spread or separate groups can produce a strong r but poor predictive fit.
On the AP exam remember DAT-1.B.1–.3: always plot the data and check residuals (r is unit-free and only for linear association). For a quick review, see the Topic 2.5 study guide (https://library.fiveable.me/ap-statistics/unit-2/correlation/study-guide/LlS81pC6QricXgIKNuFM). For practice spotting these issues, try problems at Fiveable’s practice page (https://library.fiveable.me/practice/ap-statistics).
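The nonlinear-pattern case can be sketched in a few lines of Python. The data are invented (y is exactly x squared for x from 1 to 10), and `pearson_r` is a hypothetical helper. The line is fitted using slope = r·sy/sx and the fact that the least-squares line passes through (x̄, ȳ).

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sum of z-score products divided by n - 1 (sample SDs)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Made-up curved data: a perfect parabola on x = 1..10.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]
r_quad = pearson_r(xs, ys)

# Least-squares line through (x_bar, y_bar) with slope r * s_y / s_x.
slope = r_quad * stdev(ys) / stdev(xs)
intercept = mean(ys) - slope * mean(xs)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(round(r_quad, 3))
print([round(e, 1) for e in residuals])
```

Even though r is above 0.97, the residuals are positive at both ends and negative in the middle. That U-shaped pattern is the sign that a straight line is the wrong model.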
What does "unit-free" mean when talking about correlation?
“Unit-free” means the correlation r doesn’t carry the original units of x or y (like kg, cm, dollars). Look at the formula: each term uses (xi − x̄)/sx and (yi − ȳ)/sy, which are standardized (z) scores, so the units cancel out. Because of that, r is a pure number between −1 and 1, and it won’t change if you add or subtract a constant or multiply a variable by a positive constant (e.g., convert pounds to kilograms). Practical consequences for AP Stats: you can compare strength/direction of linear association across different variable pairs even if they use different units. But remember: “unit-free” doesn’t make r perfect. It only measures linear association (CED DAT-1.B and DAT-1.C), is sensitive to outliers/influential points, and r near ±1 doesn’t automatically mean a linear model is appropriate. Review the Topic 2.5 study guide (https://library.fiveable.me/ap-statistics/unit-2/correlation/study-guide/LlS81pC6QricXgIKNuFM) and practice problems (https://library.fiveable.me/practice/ap-statistics) for more examples.
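A quick way to see the unit-free property is to compute r before and after a unit conversion. The heights and weights below are made up, and `pearson_r` is a hypothetical helper.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sum of z-score products divided by n - 1 (sample SDs)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Made-up data in inches and pounds.
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [115, 120, 140, 155, 160, 180]

# Same data converted to centimeters and kilograms.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_imperial = pearson_r(heights_in, weights_lb)
r_metric = pearson_r(heights_cm, weights_kg)

# Adding a constant to every x value does not change r either.
r_shifted = pearson_r([h + 100 for h in heights_in], weights_lb)

print(abs(r_imperial - r_metric) < 1e-9, abs(r_imperial - r_shifted) < 1e-9)
```

Both comparisons print True: rescaling or shifting a variable changes its mean and standard deviation together, so the z-scores, and therefore r, stay the same.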
How do I interpret correlation values like r = 0.85 vs r = -0.32 in context problems?
r tells you direction and strength of the linear relationship (unit-free, between −1 and 1). r = 0.85 is a strong, positive linear association: as x increases, y tends to increase and about r^2 = 0.85^2 ≈ 0.72 (72%) of the variation in y is explained by the linear relationship. r = −0.32 is a weak/low negative linear association: y tends to decrease a bit as x increases and r^2 ≈ 0.10 (only ~10% explained). Important AP stuff: always check a scatterplot and residuals (CED: r alone can mislead). Outliers or a curved relationship can make r small or give a misleading large |r|—a high |r| doesn’t guarantee a good linear model (DAT-1.B.3). Also remember correlation ≠ causation (DAT-1.C.2). Use technology to compute r on the exam and practice interpreting in context. For a quick review, see the Topic 2.5 study guide (https://library.fiveable.me/ap-statistics/unit-2/correlation/study-guide/LlS81pC6QricXgIKNuFM) and try practice questions (https://library.fiveable.me/practice/ap-statistics).
When I calculate correlation, do I always need to check if the relationship is actually linear first?
Short answer: yes—before you treat r as meaningful, check that the relationship looks linear. Why: r (the Pearson correlation) measures the strength and direction of a linear association only (CED DAT-1.B). Always make a scatterplot first and look for a straight-line pattern and for outliers or influential points. If the pattern is clearly curved, r can be near 0 even when there’s a strong nonlinear relationship, or it can be misleading if one outlier inflates |r| (CED DAT-1.B.3, keywords: nonlinearity, outlier, influential point). If the scatter looks linear and no problem points exist, r (−1 to 1) is appropriate to report and use with least-squares regression; otherwise use a different summary or transform the data (or fit a nonlinear model) and report why r isn’t appropriate. For AP prep, the exam expects you to determine/interpret correlation for linear relationships—so practice making and checking scatterplots (see the Topic 2.5 study guide: https://library.fiveable.me/ap-statistics/unit-2/correlation/study-guide/LlS81pC6QricXgIKNuFM and Unit 2 overview: https://library.fiveable.me/ap-statistics/unit-2). For extra practice try the problems at: https://library.fiveable.me/practice/ap-statistics.
What's the difference between correlation and association and are they the same thing?
Short answer: they’re related but not the same. Association means any relationship between two variables; it can be linear, curved, monotonic, or just a pattern on a scatterplot. Correlation (the Pearson r) is a specific number that quantifies the direction and strength of a linear association between two quantitative variables (CED DAT-1.B and DAT-1.C). Key things to remember for AP Stats:
- r is unit-free and lies between −1 and 1. r near 1 or −1 = strong linear, r = 0 = no linear association.
- A strong |r| doesn’t guarantee a linear model is appropriate (check scatterplot and residuals).
- Correlation ≠ causation.
- You’ll usually calculate r with technology (CED DAT-1.B.2).
If you want a quick review, check the Topic 2.5 study guide (https://library.fiveable.me/ap-statistics/unit-2/correlation/study-guide/LlS81pC6QricXgIKNuFM), the whole Unit 2 overview (https://library.fiveable.me/ap-statistics/unit-2), and practice problems (https://library.fiveable.me/practice/ap-statistics).
I keep getting confused about causation vs correlation on FRQs - what should I write to get full points?
Short answer you can write on FRQs: describe the association (direction and strength using r or words), then explicitly say correlation ≠ causation and give what would be needed to claim causation. Example phrasing to get full points:
- “The two variables have a [direction] linear association (r ≈ __), which indicates a [weak/moderate/strong] relationship.”
- “This is a correlation, not evidence that changes in X cause changes in Y. A perceived relationship could be due to a lurking/confounding variable, reverse causation, or coincidence.”
- “To claim causation you’d need a randomized experiment (random assignment), temporal order and a plausible mechanism, and control of confounders.”
Also mention any problems that weaken inference (outliers, nonlinearity, small n, sampling bias). Use AP terms (correlation r, linear association, confounding, randomized experiment). For a quick review, see the Topic 2.5 study guide (https://library.fiveable.me/ap-statistics/unit-2/correlation/study-guide/LlS81pC6QricXgIKNuFM). For extra practice, try problems at (https://library.fiveable.me/practice/ap-statistics).



