Regression analysis is the backbone of AP Statistics—it's how you'll demonstrate your understanding of relationships between variables, which appears heavily in both multiple-choice and free-response questions. You're being tested on your ability to interpret slopes and intercepts in context, analyze residual plots for model validity, and construct confidence intervals for population parameters. The concepts here connect directly to Unit 2's exploration of two-variable data and Unit 9's inference for slopes.
Don't just memorize formulas like $b = r \cdot \frac{s_y}{s_x}$—understand why the least squares method works, how residuals reveal model problems, and when inference conditions are met. The AP exam rewards students who can explain what a slope means in context, identify when a linear model is inappropriate, and distinguish between different types of unusual points. Master these concepts, and you'll be ready for any regression question they throw at you.
Building the Regression Model
The foundation of regression analysis starts with understanding how we create a line that best represents the relationship between two quantitative variables. The least squares criterion minimizes the sum of squared residuals, ensuring our predictions are as close as possible to observed values.
Least Squares Regression Line (LSRL)
Minimizes the sum of squared residuals—this criterion ensures the line $\hat{y} = a + bx$ provides the best linear fit to the data
Always passes through the point $(\bar{x}, \bar{y})$—the means of both variables anchor the regression line
Slope formula $b = r \cdot \frac{s_y}{s_x}$ connects correlation to the rate of change; a one-unit increase in x predicts a b-unit change in y (see the sketch after this list)
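To make these formulas concrete, here's a minimal Python sketch that computes the slope and intercept directly from the definitions above. It assumes NumPy is available, and the data values are invented for illustration only:

```python
import numpy as np

# Invented example data: x is the explanatory variable, y the response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]             # correlation coefficient
b = r * y.std(ddof=1) / x.std(ddof=1)   # slope: b = r * (s_y / s_x)
a = y.mean() - b * x.mean()             # intercept: a = y-bar - b * x-bar

y_hat = a + b * x                       # predicted values from the LSRL
print(f"y-hat = {a:.3f} + {b:.3f}x")
```

Notice that the intercept formula forces the line through $(\bar{x}, \bar{y})$: plugging `x.mean()` into the fitted equation returns exactly `y.mean()`.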
Interpreting Slope and Intercept
Slope interpretation requires context—always state "for each additional [x-unit], the predicted [y] changes by [b units]"
Y-intercept $a = \bar{y} - b\bar{x}$ represents the predicted y when $x = 0$; often has no logical interpretation if $x = 0$ is outside the data range
Extrapolation beyond the data range is dangerous—predictions outside observed x-values assume the linear pattern continues indefinitely
Coefficient of Determination ($r^2$)
Measures the proportion of variation in y explained by x—expressed as a percentage for interpretation
Calculated as the square of the correlation coefficient r—always between 0 and 1 (or 0% to 100%)
Interpretation template: "[$r^2$]% of the variability in [response variable] can be explained by the linear relationship with [explanatory variable]" (see the sketch below)
Compare: Correlation r vs. $r^2$—both measure strength of linear relationships, but r shows direction (positive/negative) while $r^2$ shows explanatory power. If an FRQ asks "what percent of variation is explained," you need $r^2$, not r.
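A quick way to see the connection is to compute $r^2$ two ways: from the variation decomposition and by squaring r. This sketch repeats the invented data from the LSRL example so it runs on its own:

```python
import numpy as np

# Same invented data as the LSRL sketch above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
b = r * y.std(ddof=1) / x.std(ddof=1)
a = y.mean() - b * x.mean()
y_hat = a + b * x

ss_res = np.sum((y - y_hat) ** 2)       # variation left unexplained (residuals)
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation in y
r_squared = 1 - ss_res / ss_tot         # proportion of variation explained

print(round(r_squared, 4), round(r**2, 4))  # the two values agree for an LSRL
```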
Analyzing Residuals
Residuals are your diagnostic tool for determining whether a linear model is appropriate. A residual is simply the difference between what you observed and what the model predicted: $e_i = y_i - \hat{y}_i$.
Sum of residuals equals zero for any LSRL—this is a mathematical property, not something to check
Large residuals indicate outliers in the regression context—points that don't follow the general trend
Residual Plots
Plot residuals against x (or fitted values $\hat{y}$)—this reveals patterns invisible in the original scatterplot (a plotting sketch follows this list)
Random scatter with no pattern indicates appropriate linear fit—the model captures the relationship well
Curved patterns suggest nonlinearity—consider transformations or acknowledge the linear model is inadequate
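Here is a hedged sketch of building a residual plot, assuming NumPy and matplotlib are available; the data are the same invented values as before:

```python
import numpy as np
import matplotlib.pyplot as plt

# Same invented data as the LSRL sketch above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b, a = np.polyfit(x, y, 1)              # slope and intercept of the LSRL
residuals = y - (a + b * x)             # e_i = y_i - y-hat_i
print(residuals.sum())                  # approximately zero for any LSRL

plt.scatter(x, residuals)
plt.axhline(0, color="gray", linestyle="--")  # reference line at e = 0
plt.xlabel("x")
plt.ylabel("Residual")
plt.title("Look for random scatter, not curves or funnels")
plt.show()
```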
Checking Model Conditions
Homoscedasticity (constant variance) appears as consistent vertical spread across the residual plot; funnel shapes indicate heteroscedasticity
Normality of residuals can be checked with a histogram or QQ-plot—important for inference procedures
Independence requires random sampling or randomized experiment; check the 10% condition ($n \le 0.10N$) for sampling without replacement
Compare: Residual plot patterns—a curved pattern means the relationship isn't linear (wrong model form), while a funnel shape means variance isn't constant (violates conditions for inference). Both are problems, but they indicate different issues.
Identifying Unusual Points
Not all points influence the regression line equally. Understanding the difference between outliers, leverage points, and influential points is crucial for exam success. These distinctions appear frequently in multiple-choice questions.
Outliers in Regression
A point with a large residual that doesn't follow the general trend shown by other data
Falls far from the regression line vertically—the y-value is unusual given the x-value
May or may not affect the slope significantly depending on its x-location
High-Leverage Points
Has a substantially larger or smaller x-value than other observations—sits far from $\bar{x}$ horizontally
Has potential to strongly influence the regression line because of its extreme x-position
Not necessarily an outlier—a high-leverage point may fall exactly on the trend line
Influential Points
Actually changes the slope or intercept substantially when removed from the analysis
Often both high-leverage AND an outlier—the combination of extreme x and unusual y creates influence
Test by removing the point and recalculating—if the line changes dramatically, the point is influential (see the sketch below)
Compare: High-leverage vs. influential—a high-leverage point could affect the line (it has the potential), while an influential point does affect the line (it actually changes the results). An FRQ might ask you to explain why removing a specific point changed the slope.
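The remove-and-refit test is easy to demonstrate. In this sketch the data are invented: the last point has an extreme x-value and falls far from the trend $y = 2x$, so dropping it changes the slope substantially:

```python
import numpy as np

def lsrl_slope(x, y):
    """Slope of the least squares line, b = r * (s_y / s_x)."""
    r = np.corrcoef(x, y)[0, 1]
    return r * y.std(ddof=1) / x.std(ddof=1)

# Invented data: the last point is high-leverage (x far from x-bar)
# AND an outlier (y far below the trend y = 2x), so it is influential.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 5.0])

print(lsrl_slope(x, y))            # slope with the point included
print(lsrl_slope(x[:-1], y[:-1]))  # slope without it: exactly 2.0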
Inference for Regression Slopes
Unit 9 extends regression to inference, allowing you to make claims about the population regression line based on sample data. The sample slope b estimates the population slope β, and we can build confidence intervals to capture β.
Population vs. Sample Regression
Sample regression line $\hat{y} = a + bx$ estimates the population regression line $\mu_y = \alpha + \beta x$
The slope b is a statistic with a sampling distribution centered at the true population slope β
Standard error of the slope $SE_b = \frac{s}{s_x\sqrt{n-1}}$, where $s$ is the residual standard deviation, measures variability in the sampling distribution of b
Confidence Intervals for Slope
Formula: $b \pm t^* \cdot SE_b$, where $t^*$ comes from the t-distribution with $df = n - 2$ (see the sketch after this list)
Interpretation requires context: "We are [C]% confident that the true slope of the relationship between [x] and [y] is between [lower] and [upper]"
If the interval contains zero, the data do not provide convincing evidence of a linear relationship between x and y in the population
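A sketch of the interval computation, using scipy.stats for the critical value; the summary statistics (n, b, $SE_b$) below are made-up numbers, not from any real study:

```python
from scipy import stats

n = 25       # hypothetical sample size
b = 0.35     # hypothetical sample slope
se_b = 0.11  # hypothetical standard error of the slope
df = n - 2   # degrees of freedom for slope inference

t_star = stats.t.ppf(0.975, df)   # critical value for 95% confidence
lower = b - t_star * se_b
upper = b + t_star * se_b
print(f"95% CI for beta: ({lower:.3f}, {upper:.3f})")
```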
Conditions for Inference
Linearity—check the residual plot for random scatter (no curved pattern)
Independence—data from random sample or randomized experiment; verify 10% condition
Normality of residuals—check histogram or QQ-plot of residuals; less critical with large samples
Equal variance (homoscedasticity)—residual plot should show consistent spread across all x-values
Compare: Confidence interval for slope vs. interpreting r2—both describe the relationship, but the CI tells you about the rate of change in the population (with uncertainty), while r2 tells you about explanatory power in your sample. FRQs often ask for both.
Transformations and Departures from Linearity
When data show a curved pattern, transformations can linearize the relationship. Applying logarithmic, square root, or power transformations to x, y, or both can make a nonlinear relationship suitable for linear regression.
Common Transformations
Logarithmic transformation ($\ln y$ or $\log x$)—useful for exponential or power relationships; common for growth data
Square root transformation ($\sqrt{y}$)—helps stabilize variance when spread increases with the mean
Reciprocal transformation (1/x)—can linearize certain curved relationships
Interpreting Transformed Models
Slope interpretation changes with transformation—for $\ln y$ vs. x, the slope represents multiplicative change, not additive
Back-transformation required for predictions—if you modeled $\ln y$, use $e^{\hat{y}}$ to get predictions in original units (see the sketch after this list)
Check residual plot of transformed data—transformation is successful if residuals now show random scatter
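Here is a sketch of fitting on the log scale and back-transforming, with invented data that grow roughly exponentially:

```python
import numpy as np

# Invented data that grow roughly like e^x (exponential pattern).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.7, 7.4, 20.1, 54.6, 148.4])

ln_y = np.log(y)                 # transform the response
b, a = np.polyfit(x, ln_y, 1)    # fit ln(y-hat) = a + b*x on the log scale

x_new = 3.5                      # stay inside the observed x-range
pred = np.exp(a + b * x_new)     # back-transform e^(y-hat) to original units
print(f"ln(y-hat) = {a:.3f} + {b:.3f}x; predicted y at x=3.5: {pred:.1f}")
```

On the log scale the slope b is an additive change in $\ln y$, which corresponds to multiplying y by $e^b$ for each one-unit increase in x—the multiplicative interpretation noted above.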
When to Transform
Curved pattern in original residual plot suggests trying a transformation
Funnel-shaped residuals may be corrected by transforming the response variable
Compare $r^2$ values before and after transformation to assess improvement—higher $r^2$ indicates better fit
Compare: Transforming y vs. transforming x—transforming y (like $\ln y$) changes the interpretation of predictions and requires back-transformation; transforming x (like $x^2$) keeps y in original units but changes slope interpretation. Know which approach fits the data pattern.
Quick Reference Table
| Concept | Best Examples |
| --- | --- |
| Slope interpretation | LSRL slope b, contextual meaning of rate of change |
| Residual analysis | Residual plots, checking for patterns, identifying outliers |
| Model fit measures | $r^2$ (variation explained), residual standard deviation s |
| Transformations | Log, square root, reciprocal to linearize curved data |
| Key formulas | $b = r \cdot \frac{s_y}{s_x}$, $a = \bar{y} - b\bar{x}$, $df = n - 2$ |
Self-Check Questions
A residual plot shows a clear U-shaped curve. What does this indicate about the linear model, and what should you consider doing?
Compare and contrast high-leverage points and influential points. Can a point be one without being the other? Provide an example scenario.
If a 95% confidence interval for the slope is (0.12,0.58), what can you conclude about the relationship between the variables? What if the interval were (−0.15,0.42)?
Which two conditions for regression inference are checked using residual plots, and what specific patterns would indicate each condition is violated?
A student calculates $r = 0.85$ and $r^2 = 0.72$. Explain what each value tells you about the relationship, and write a complete interpretation of $r^2$ in context for a study of hours studied (x) and exam score (y).