
1.3 Correlation and its Relationship to Regression


Written by the Fiveable Content Team • Last updated August 2025

Correlation and its Measures

Understanding Correlation

Correlation is a statistical measure that describes the strength and direction of the linear relationship between two quantitative variables. It tells you how closely changes in one variable track with changes in another.

Think of it this way: if you plotted two variables on a graph, correlation captures how well the data points follow a straight-line pattern. Height and weight, study time and exam scores, temperature and ice cream sales are all pairs of variables that tend to be correlated.

Correlation is useful for identifying patterns and making predictions, but it has a hard limit: it only captures linear relationships. Two variables can have a strong curved relationship and still show a correlation near zero.
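A quick numerical sketch of this limitation (the data here are made up for illustration): a perfectly deterministic quadratic relationship can still produce a Pearson r of essentially zero.

```python
import numpy as np

# x symmetric around 0; y depends on x exactly, but not linearly
x = np.linspace(-1, 1, 101)
y = x ** 2  # perfect (deterministic) quadratic relationship

# Pearson r is the off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(x, y)[0, 1]
print(round(r, 6))  # essentially 0: a linear measure misses the curve
```

This is why a scatterplot should accompany any reported correlation: r near zero can mean "no relationship" or "a strong relationship that just isn't linear."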

Correlation Coefficients

The correlation coefficient, denoted r, quantifies the linear relationship between two variables on a scale from -1 to +1:

  • r = +1: A perfect positive linear relationship. Every increase in one variable corresponds to a perfectly proportional increase in the other.
  • r = -1: A perfect negative linear relationship. Every increase in one variable corresponds to a perfectly proportional decrease in the other.
  • r = 0: No linear relationship. Changes in one variable tell you nothing about changes in the other (at least not in a straight-line pattern).

The two most common types of correlation coefficients are:

  • Pearson's r (product-moment correlation): Used for continuous variables. It assumes both variables are roughly normally distributed and that the relationship is linear. This is the default "correlation coefficient" in most linear modeling contexts.
  • Spearman's rank correlation: Based on the ranks of the data rather than raw values. It's more robust to outliers and works for ordinal variables or relationships that are monotonic but not strictly linear.
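The contrast between the two coefficients can be seen with a few lines of NumPy (function names and data are illustrative; Spearman's r is just Pearson's r computed on ranks, and this simple ranking ignores ties):

```python
import numpy as np

def pearson_r(x, y):
    # Product-moment correlation on the raw values
    return np.corrcoef(x, y)[0, 1]

def spearman_r(x, y):
    # Rank the data (double argsort; assumes no ties), then
    # take Pearson's r of the ranks
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

# A monotonic but nonlinear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

print(pearson_r(x, y))   # strong but less than 1 (relationship is curved)
print(spearman_r(x, y))  # 1.0: the ranks agree perfectly
```

Because y = x³ is monotonic, Spearman's coefficient is exactly 1 while Pearson's falls short of 1, reflecting the curvature.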

Interpreting Correlation

  • Positive correlation: As one variable increases, the other tends to increase (e.g., height and weight).
  • Negative correlation: As one variable increases, the other tends to decrease (e.g., price and quantity demanded).

A critical point for this entire course: correlation does not imply causation. It measures association only, with no information about why two variables move together. A classic example: ice cream sales and drowning incidents are positively correlated, but ice cream doesn't cause drowning. Both are driven by a third factor (hot weather).

Correlation Strength and Direction

Determining Correlation Strength

Strength is determined by the absolute value of r. The closer |r| is to 1, the tighter the data points cluster around a straight line.

A common set of rough guidelines:

  • |r| > 0.7: strong linear relationship
  • 0.3 < |r| < 0.7: moderate linear relationship
  • |r| < 0.3: weak linear relationship

These thresholds aren't rigid rules. In some fields (like psychology), r = 0.5 might be considered quite strong because human behavior is inherently variable. In physics or engineering, r = 0.9 might be considered mediocre. Always interpret strength relative to the context of your data.
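The rough guidelines above can be wrapped in a small helper (the cutoffs and function name are just this section's rules of thumb, not a standard):

```python
def describe_strength(r):
    """Rough verbal label for a correlation, using |r| only."""
    a = abs(r)
    if a > 0.7:
        return "strong"
    elif a > 0.3:
        return "moderate"
    else:
        return "weak"

print(describe_strength(0.85))  # strong
print(describe_strength(-0.5))  # moderate (the sign is direction, not strength)
print(describe_strength(0.1))   # weak
```

Note that the function takes the absolute value first: r = -0.5 and r = +0.5 are equally strong, just opposite in direction.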


Assessing Correlation Direction

The sign of r tells you the direction:

  • Positive r: upward trend. Study time and exam scores tend to move in the same direction.
  • Negative r: downward trend. Age and reaction time tend to move in opposite directions.
  • r = 0: no linear trend. Changes in one variable are not linearly associated with changes in the other.

Visualizing Correlation with Scatterplots

Scatterplots are the go-to tool for visually assessing correlation before you compute anything.

  • Data points forming a tight, upward-sloping band suggest a strong positive correlation.
  • Data points forming a tight, downward-sloping band suggest a strong negative correlation.
  • Data points scattered with no discernible pattern suggest weak or no correlation.

Scatterplots also reveal things that r alone cannot: outliers that may be inflating or deflating the correlation, curved relationships that a linear measure would miss, and clusters in the data that might warrant separate analysis.

Correlation vs. Causation

Understanding the Difference

Correlation measures association between two variables. Causation means that changes in one variable directly produce changes in another. These are fundamentally different claims, and confusing them is one of the most common errors in data analysis.

Two variables can be correlated for several reasons that have nothing to do with one causing the other:

  • A common cause (third variable) drives both. Hot weather increases both ice cream sales and crime rates.
  • Reverse causation: the direction of influence is opposite to what you assumed.
  • Coincidence: with enough variables, some will correlate by pure chance.

Establishing Causation

Moving from correlation to causation requires additional evidence beyond observing an association:

  1. Controlled experiments: Manipulate one variable while holding others constant. A randomized controlled trial comparing a medication to a placebo is the gold standard.
  2. Temporal precedence: The proposed cause must come before the effect in time.
  3. Elimination of alternative explanations: Rule out confounding variables and reverse causation.

No single criterion is sufficient on its own. Strong causal claims typically require all three.


Confounding Variables and Spurious Correlations

A confounding variable is related to both the predictor and the response, creating a misleading association between them. For example, a correlation between coffee consumption and heart disease might be confounded by smoking: smokers tend to drink more coffee and have higher heart disease risk. The coffee-heart disease link could partly (or entirely) reflect the influence of smoking.

Spurious correlations are associations that don't represent any real underlying relationship. They can arise from confounders, measurement error, or sheer chance. This is why you should never make causal claims based on correlation alone. Always ask: what else could explain this pattern?
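A simulation sketch of confounding (all variable names and parameters are illustrative): x and y are each driven by a lurking variable z, with no direct link between them, yet they correlate strongly until z is controlled for.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

z = rng.normal(size=n)                  # confounder (think: hot weather)
x = z + rng.normal(scale=0.5, size=n)   # think: ice cream sales
y = z + rng.normal(scale=0.5, size=n)   # think: drowning incidents

r_xy = np.corrcoef(x, y)[0, 1]
print(round(r_xy, 2))  # strongly positive, despite no causal x -> y link

# Controlling for z: correlate the residuals after regressing each
# variable on z. The spurious association disappears.
x_res = x - np.polyval(np.polyfit(z, x, 1), z)
y_res = y - np.polyval(np.polyfit(z, y, 1), z)
print(round(np.corrcoef(x_res, y_res)[0, 1], 2))  # near 0
```

With these noise levels the raw correlation comes out around 0.8 purely through z; the residual (partial) correlation is near zero, which is the numerical signature of a common cause.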

Correlation and Linear Regression

Simple Linear Regression

Simple linear regression models the linear relationship between a predictor variable (x) and a response variable (y). The goal is to find the best-fitting straight line through the data.

The regression equation is:

ŷ = β₀ + β₁x

where β₀ is the y-intercept (the predicted value of y when x = 0), β₁ is the slope (the predicted change in y for a one-unit increase in x), and ŷ is the predicted value. The full model also includes a random error term: y = β₀ + β₁x + ε.
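Fitting this model takes one call to NumPy's least-squares polynomial fit (the data values here are made up for illustration, chosen to lie roughly on y = 2x):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])  # roughly y = 2x

# degree-1 polyfit returns coefficients highest power first: (slope, intercept)
beta1, beta0 = np.polyfit(x, y, deg=1)
y_hat = beta0 + beta1 * x  # fitted (predicted) values

print(round(beta0, 3), round(beta1, 3))  # intercept 0.27, slope 1.93
```

So the fitted line is ŷ ≈ 0.27 + 1.93x: each one-unit increase in x predicts about a 1.93-unit increase in y.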

Relationship Between Correlation and Regression

The correlation coefficient r and the regression slope β1\beta_1 are directly linked, but they measure different things. Here's the connection:

β₁ = r · (s_y / s_x)

where s_y and s_x are the standard deviations of y and x, respectively. This means:

  • The sign of r determines the direction of the slope. Positive r gives an upward-sloping line; negative r gives a downward-sloping line.
  • A stronger correlation (|r| closer to 1) produces a slope that more closely matches the ratio s_y / s_x, meaning predictions are more precise.
  • A weaker correlation (|r| closer to 0) pulls the slope toward zero, meaning x doesn't help much in predicting y.

Note that r is unitless (always between -1 and +1), while β₁ carries the units of y per unit of x. Correlation tells you how tightly the data follow a line; the slope tells you how steeply that line rises or falls.
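The identity linking r and the slope is easy to verify numerically (sample data illustrative; note the same ddof must be used for both standard deviations so the ratio is unaffected):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

r = np.corrcoef(x, y)[0, 1]

# slope via the identity: beta1 = r * s_y / s_x
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

# slope via direct least-squares fit
slope_from_fit = np.polyfit(x, y, deg=1)[0]

print(np.isclose(slope_from_r, slope_from_fit))  # True
```

The two computations agree to floating-point precision, which is the point of the identity: rescaling r by s_y/s_x converts a unitless measure of fit into a slope in the data's units.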

Coefficient of Determination

The coefficient of determination, r², is simply the square of the correlation coefficient. It represents the proportion of variance in the response variable that is explained by the predictor variable.

  • r² ranges from 0 to 1.
  • If r² = 0.64 (from r = 0.8), that means 64% of the variation in y can be accounted for by the linear relationship with x. The remaining 36% is unexplained variation.
  • Higher r² values indicate a better fit, but "good" depends on context. An r² of 0.30 might be excellent in social science research and terrible in a physics experiment.

Because r² is never negative, it tells you nothing about direction. You need r (or the slope) for that.
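For simple linear regression, squaring r gives the same number as the variance-explained definition of r², which can be checked directly (data illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

r = np.corrcoef(x, y)[0, 1]

beta1, beta0 = np.polyfit(x, y, deg=1)
y_hat = beta0 + beta1 * x
ss_res = np.sum((y - y_hat) ** 2)     # unexplained (residual) variation
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y
r_squared = 1 - ss_res / ss_tot       # proportion of variance explained

print(np.isclose(r ** 2, r_squared))  # True
```

This equivalence holds for simple linear regression with an intercept; with multiple predictors, the analogous quantity is the squared multiple correlation.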

Assumptions and Limitations

Correlation is a necessary condition for simple linear regression to be useful, but it's not sufficient. The regression model also requires:

  • Linearity: The relationship between x and y is actually linear (check with a scatterplot or residual plot).
  • Homoscedasticity: The spread of errors is roughly constant across all values of x. If errors fan out or narrow, your standard errors and confidence intervals become unreliable.
  • Independence of errors: Each observation's error doesn't depend on any other observation's error. This is especially relevant with time-series data.
  • Normality of errors: For inference (hypothesis tests, confidence intervals), the error term ε should be approximately normally distributed.

Violations of these assumptions can produce biased or inefficient estimates of β₀ and β₁, making your predictions and inferences unreliable. Always check these assumptions before trusting your regression output.
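A minimal residual-check sketch, assuming simulated data where the assumptions hold (in practice you would also plot residuals against x to look for curvature or fanning):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 1.5 + 2.0 * x + rng.normal(scale=1.0, size=x.size)  # linear + constant-variance noise

beta1, beta0 = np.polyfit(x, y, deg=1)
residuals = y - (beta0 + beta1 * x)

# Least-squares residuals always average to ~0 by construction; the
# informative checks are their spread and pattern across x.
print(round(residuals.mean(), 10))

# Crude homoscedasticity check: residual spread in the lower vs upper half of x
spread_low = residuals[x < 5].std()
spread_high = residuals[x >= 5].std()
print(round(spread_low, 2), round(spread_high, 2))  # similar -> roughly constant spread
```

If the two spreads differed markedly (residuals fanning out as x grows), that would flag heteroscedasticity, and the standard errors from ordinary least squares would not be trustworthy.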