Sometimes a least-squares regression line is not the best model for a data set, either because a few unusual points pull it around or because the underlying pattern is not linear. In AP Statistics, you need to identify outliers, high-leverage points, and influential points, and you need to use transformations (like taking the natural log) to make curved data more linear so a regression line fits better.

Why This Matters for the AP Statistics Exam

This topic builds on residuals and least-squares regression by asking you to judge whether a linear model is actually appropriate. On the exam you may need to spot influential points in a scatterplot, explain how removing a point changes the slope, intercept, or correlation, and decide whether a transformation improves the fit. You may also calculate a predicted response using a regression line built on transformed data, then convert that prediction back to the original units. Strong work here means reading scatterplots and residual plots carefully, choosing the right procedure, and writing interpretations that connect to context.

Key Takeaways

An influential point changes the regression model substantially when removed (different slope, intercept, or correlation). Outliers and high-leverage points are often influential.
An outlier in regression has a large residual, meaning its y-value is far from what the line predicts.
A high-leverage point has an x-value far from the other x-values and can pull the slope of the line.
A residual plot that looks more random and an r squared closer to 1 are evidence that a transformed model fits better.
Common transformations include taking the natural log of the response, taking the log of the explanatory variable, or squaring the explanatory variable.
After predicting on the transformed scale, back-transform (often using exp) to get the prediction in the original units.

Influential Points

An influential point is any point that, when removed, changes the regression relationship substantially. That change might show up as a much different slope, y-intercept, or correlation. Two common types of points are often influential: outliers and high-leverage points.

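Figure (not shown): scatterplot referenced below, including the points labeled Child 18 and Child 19. Courtesy of Starnes, Daren S. and Tabor, Josh. The Practice of Statistics-For the AP Exam, 5th Edition. Cengage Publishing.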

Outliers

An outlier in regression is a point that does not follow the general trend of the rest of the data and has a large residual when the LSRL is calculated. In other words, its y-value sits far from the predicted value. Outliers tend to lower the correlation and can sometimes shift the y-intercept. Child 19 on the scatterplot above is an outlier.

High-Leverage Points

A high-leverage point has an x-value much larger or smaller than the other observations. Because it sits far out along the x-axis, it can pull the regression line toward itself and noticeably change the slope. It can also occasionally change the y-intercept. Child 18 on the scatterplot above is a high-leverage point.

Identifying influential points matters because they can have a large impact on the model's slope, intercept, correlation, and overall fit. If an influential point is an outlier, it may be worth checking whether it represents the underlying pattern in the data or whether it should be examined separately. If an influential point is a high-leverage point, it is worth asking whether a linear model is appropriate for the data or whether a different approach would fit better. Either way, do not just delete points without a reason; investigate why the point is unusual first.
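To see how an unusual point can change a regression, here is a minimal Python sketch. The data, the added points, and the helper name fit_line are all made up for illustration; they are not from any exam or textbook. It fits a least-squares line three times: to the base data, to the base data plus an outlier placed at the mean x-value, and to the base data plus a high-leverage point.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope, r) for the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r


# Made-up data with a strong positive linear trend (roughly y = 2x).
base_x = [1, 2, 3, 4, 5, 6]
base_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

# Hypothetical outlier: x equals the mean of the other x-values, y far above the trend.
outlier = (3.5, 15.0)
# Hypothetical high-leverage point: x far to the right, y well below the trend.
leverage = (20.0, 10.0)

base_fit = fit_line(base_x, base_y)
outlier_fit = fit_line(base_x + [outlier[0]], base_y + [outlier[1]])
leverage_fit = fit_line(base_x + [leverage[0]], base_y + [leverage[1]])

fits = [
    ("base data", base_fit),
    ("with outlier", outlier_fit),
    ("with high leverage", leverage_fit),
]
for label, (intercept, slope, r) in fits:
    print(f"{label:<20} intercept={intercept:7.3f}  slope={slope:6.3f}  r={r:6.3f}")
```

In this made-up example the outlier leaves the slope unchanged (its x-value sits at the mean) but raises the intercept and lowers r, while the high-leverage point drags the slope far below its original value. Comparing the fit with and without the point is the same check you would describe in words on the exam.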

Transforming Data and Nonlinear Patterns

Sometimes a linear model is simply not a good fit, and the scatterplot shows a clear curve. In that case you can transform the data to make the pattern more linear, then fit a least-squares regression line to the transformed values. A transformation applies a function (like the natural logarithm or a power) to the explanatory variable, the response variable, or both, so the relationship between the transformed variables is closer to linear.

Most calculators can run these transformations and the regression automatically, so the key skills are deciding what to transform and interpreting the result.

Exponential Patterns

Exponential models have the form ŷ = ab^x, where a and b are constants and x is the explanatory variable. To fit this kind of model with linear regression, take the natural logarithm of both sides:

ln(ŷ) = ln(a) + ln(b)x

Now the relationship between ln(ŷ) and x is linear, so you can fit an LSRL to the transformed data. The y-intercept of that line equals ln(a) and the slope equals ln(b). If the transformed line has y-intercept a* and slope b*, you recover the original constants with:

a = e^(a*) and b = e^(b*)
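The steps for an exponential pattern can be sketched in Python as follows. This is an illustration under stated assumptions: the data are made up to follow ŷ = 3(1.5)^x exactly, and fit_line is a hypothetical helper written here, not a calculator or library function. The sketch takes ln of the y-values, fits a line, and exponentiates the intercept and slope to recover a and b.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) for the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up data that follow y = 3 * 1.5^x exactly (assumed for illustration).
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [3 * 1.5 ** x for x in xs]

# Transform only the response, then fit ln(y) = a_star + b_star * x.
ln_ys = [math.log(y) for y in ys]
a_star, b_star = fit_line(xs, ln_ys)

# Recover the constants of the exponential model y = a * b^x.
a = math.exp(a_star)
b = math.exp(b_star)

# Predict on the log scale, then back-transform to the original units.
x_new = 4
ln_prediction = a_star + b_star * x_new
prediction = math.exp(ln_prediction)

print(f"a = {a:.4f}, b = {b:.4f}")
print(f"predicted ln(y) at x = {x_new}: {ln_prediction:.4f}")
print(f"predicted y at x = {x_new}: {prediction:.4f}")
```

Because the made-up data contain no noise, the recovered constants match the ones used to generate them. With real data the fit would only be approximate.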

Power Patterns

Power models have the form ŷ = ax^b. Here you take the natural logarithm of both sides and simplify to get:

ln(ŷ) = ln(a) + b·ln(x)

This time the relationship between ln(ŷ) and ln(x) is linear, so you log both the x-values and the y-values before fitting the line. With the transformed line again having y-intercept a* and slope b*, you recover the original constants with:

a = e^(a*) and b = b* (the slope is already the exponent, so it needs no back-transformation)
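A power pattern can be handled the same way, except that both variables are logged. The sketch below is illustrative only: the data are made up to follow ŷ = 2x^1.5 exactly, and fit_line is a hypothetical helper defined here. Only the intercept is exponentiated; the slope of the log-log line is used directly as the exponent.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) for the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up data that follow y = 2 * x^1.5 exactly (assumed for illustration).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 * x ** 1.5 for x in xs]

# Transform both variables, then fit ln(y) = a_star + b_star * ln(x).
ln_xs = [math.log(x) for x in xs]
ln_ys = [math.log(y) for y in ys]
a_star, b_star = fit_line(ln_xs, ln_ys)

# Recover the constants of the power model y = a * x^b.
a = math.exp(a_star)  # the intercept is ln(a), so exponentiate it
b = b_star            # the slope is the exponent itself

print(f"a = {a:.4f}, b = {b:.4f}")
```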

How Do I Know Which Transformation Fits?

To decide whether a transformation worked, look at two things: the residual plot of the transformed data and the r squared value. A transformation is more appropriate when the residual plot becomes more randomly scattered (no leftover curve) and r squared moves closer to 1.

Here, r squared is still the proportion of variation in the response variable of the fitted model (for example, ln(y)) that is explained by its linear relationship with the explanatory variable, just like with ordinary linear regression. If neither the residual plot nor r squared improves, the data may follow a pattern you have not modeled yet, or influential points may be distorting the fit.

To summarize: if the data look exponential, take the natural log (or a log with another base) of the y-values. If the data look like a power pattern, take the log of both the x-values and the y-values.
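The comparison between an original and a transformed model can be sketched in Python. Everything here is assumed for illustration: the data are made up to follow an exponential pattern exactly, and fit_line and residuals are hypothetical helpers. The sketch fits a line to (x, y) and to (x, ln y), then compares r squared and the sign pattern of the residuals.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope, r) for the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, sxy / math.sqrt(sxx * syy)


def residuals(xs, ys, intercept, slope):
    """Return the list of residuals y - y_hat for the given line."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Made-up curved data that follow y = 3 * 1.5^x exactly (assumed for illustration).
xs = list(range(9))
ys = [3 * 1.5 ** x for x in xs]
ln_ys = [math.log(y) for y in ys]

# Model 1: line fit to the original data.
int_lin, slope_lin, r_lin = fit_line(xs, ys)
res_lin = residuals(xs, ys, int_lin, slope_lin)

# Model 2: line fit to (x, ln y).
int_log, slope_log, r_log = fit_line(xs, ln_ys)
res_log = residuals(xs, ln_ys, int_log, slope_log)

r2_lin = r_lin ** 2
r2_log = r_log ** 2

print(f"original:    r^2 = {r2_lin:.3f}")
print("  residuals:", [round(e, 2) for e in res_lin])
print(f"transformed: r^2 = {r2_log:.3f}")
print("  residuals:", [round(e, 2) for e in res_log])
```

In this made-up example the residuals of the original fit are positive at both ends and negative in the middle, which is the leftover curve a residual plot would show. The transformed fit has residuals that are essentially zero and an r squared of 1, which only happens because the data contain no noise.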

Source: Real Statistics

How to Use This on the AP Statistics Exam

MCQ

Read scatterplots and residual plots to identify whether a point is an outlier, a high-leverage point, or both. Outliers have large residuals (unusual y); high-leverage points have unusual x-values.
Predict how removing an influential point would change the slope, intercept, or correlation.
Compare two residual plots or two r squared values to decide which model (original or transformed) fits better.

Free Response

Show your work when calculating a predicted response from a transformed model, then back-transform to the original units. For exponential and power models this usually means using exp to undo the natural log.
Justify whether a transformation improved the model by pointing to a more random residual plot and an r squared closer to 1. State both, not just one.
Write interpretations in context. Use phrasing like "tend to" and "on average," and connect predictions to the real variables and units.

Common Trap

Do not call a point influential just because it is an outlier. Confirm that removing it actually changes the slope, intercept, or correlation.
Do not report a prediction on the log scale and call it the answer. Always back-transform to the original units.

Practice Problem

You are a statistician working for a company that manufactures and sells a certain type of light bulb. The company wants to understand how the price of the light bulbs affects the number of units sold. To do this, you collect data on the number of units sold and the price of the light bulbs for a sample of 50 different stores.

You begin by performing a linear regression on the data and find that the model fits poorly, with a low r squared value. You decide to transform the data by taking the natural logarithm of both the number of units sold and the price, and then perform a linear regression on the transformed data.

You find that the transformed model fits better, with a higher r squared value. The equation of the transformed model is:

ln(units sold) = 0.5 * ln(price) + 2

You want to transform the model back to its original form so that you can make predictions in terms of the original variables. To do this, you can use the following formula:

units sold = b * price^a, where a and b are constants.

Using the equation of the transformed model, find the values of a and b in the original model.

Hint: Remember that the natural logarithm of a number is the exponent to which the base e must be raised to get that number. For example, ln(2) ≈ 0.69, because e^0.69 ≈ 2.

Answer

Start from the transformed equation:

ln(units sold) = 0.5 * ln(price) + 2

Using the property ln(a^b) = b * ln(a), rewrite the first term:

ln(units sold) = ln(price^0.5) + 2

Now exponentiate both sides to undo the natural log:

units sold = e^(ln(price^0.5) + 2) = e^2 * price^0.5

Matching this to the original form units sold = b * price^a, you get a = 0.5 and b = e^2.

Since e^2 ≈ 7.39, the original model is approximately units sold = 7.39 * price^0.5.
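As a quick check on the algebra, the Python sketch below evaluates the practice-problem model both ways. The function names and the sample prices are made up for illustration. The first function predicts on the log scale and back-transforms with exp, the second uses the power form with a = 0.5 and b = e^2, and the two should agree.

```python
import math


def predict_from_log_model(price):
    """Predict ln(units sold), then back-transform to units sold."""
    ln_units = 0.5 * math.log(price) + 2
    return math.exp(ln_units)


def predict_from_power_model(price):
    """Predict units sold directly from units sold = b * price^a."""
    a = 0.5
    b = math.exp(2)
    return b * price ** a


# Hypothetical prices chosen only to compare the two forms.
for price in [1, 4, 9]:
    from_log = predict_from_log_model(price)
    from_power = predict_from_power_model(price)
    print(f"price = {price}: log model -> {from_log:.4f}, power model -> {from_power:.4f}")
```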

Common Misconceptions

"Outlier and high-leverage point mean the same thing." They do not. An outlier has a large residual (unusual y-value). A high-leverage point has an unusual x-value. A point can be one, both, or neither.
"Every outlier is influential." Not always. A point is influential only if removing it substantially changes the slope, intercept, or correlation. Check before you decide.
"A high r or r squared means the linear model is the right one." A value close to 1 does not guarantee linearity. Always check the residual plot for leftover curve.
"Transforming the data changes the actual data points." Transformations apply a function (like the natural log) to create a new, transformed data set. You fit the line to those transformed values.
"The prediction from a transformed model is the final answer." If you predicted ln(y), you still need to back-transform with exp to get y in the original units.
"You should always delete an unusual point." Do not remove points just because they look odd. Investigate why a point is unusual before deciding how to handle it.

Vocabulary

The following words are mentioned explicitly in the College Board Course and Exam Description for this topic.

Term	Definition
coefficient of determination	The value r², which represents the proportion of variation in the response variable that is explained by the explanatory variable in the regression model.
correlation	A numerical measure (r) that describes the strength and direction of a linear relationship between two variables, ranging from -1 to 1.
high-leverage point	A point in regression that has a substantially larger or smaller x-value than other observations in the dataset.
influential points	Points in a regression that, when removed, substantially change the relationship between variables, such as the slope, y-intercept, or correlation.
least-squares regression line	A linear model that minimizes the sum of squared residuals to find the best-fitting line through a set of data points.
natural logarithm	A mathematical transformation using the logarithm with base e, often applied to response or explanatory variables to linearize relationships.
outlier	In regression, a point that does not follow the general trend of the rest of the data and has a large residual when the least-squares regression line is calculated.
residual	The difference between the actual observed value and the predicted value in a regression model, calculated as residual = y - ŷ.
residual plot	A scatter plot that displays residuals on the vertical axis versus either the explanatory variable values or predicted response values on the horizontal axis, used to assess the fit of a regression model.
slope	The value b in the regression equation ŷ = a + bx, representing the rate of change in the predicted response for each unit increase in the explanatory variable.
transformed data set	A dataset created by applying mathematical transformations (such as logarithms or powers) to the original variables to achieve a more linear relationship.
y-intercept	The value a in the regression equation ŷ = a + bx, representing the predicted response value when the explanatory variable equals zero.

Frequently Asked Questions

What are departures from linearity in AP Statistics?

Departures from linearity happen when a straight-line regression model is not appropriate for the data. In AP Statistics Topic 2.9, you check scatterplots, residual plots, influential points, and transformed data to decide whether a linear model fits.

What is an outlier in regression?

An outlier in regression is a point that does not follow the general trend of the rest of the data and has a large residual from the least-squares regression line. It is unusual in the y-direction for its x-value.

What is a high-leverage point?

A high-leverage point has an x-value that is much larger or smaller than the other x-values. Because it sits far from the rest of the data horizontally, it can have a strong effect on the slope of the regression line.

What is an influential point in regression?

An influential point is a point that substantially changes the regression relationship when removed. You may see a different slope, y-intercept, or correlation after taking it out.

How do transformations help with nonlinear data?

Transformations such as taking the natural log of the response variable or squaring the explanatory variable can make a curved pattern more linear. A more random residual plot and an r-squared closer to 1 can support using the transformed model.

What is a common AP Stats mistake with transformed regression?

A common mistake is predicting on the transformed scale and forgetting to convert back to the original units. If you used a log transformation, the final answer often needs a back-transformation before it matches the real context.