Data Transformations for Linear Regression
Purpose and Application of Data Transformations
Data transformations apply mathematical operations to your variables so the data better satisfies the assumptions of linear regression. When residual plots reveal non-normality, heteroscedasticity, or a nonlinear mean function, a well-chosen transformation can often fix the problem without abandoning the linear modeling framework.
Common transformations include logarithmic, square root, reciprocal, and power transformations. You can apply them to the response variable y, the predictor x, or both, depending on which assumption is violated and how the data behave.
A transformation can do one or more of the following:
- Stabilize the variance of residuals (address heteroscedasticity)
- Linearize a curved relationship between x and y
- Pull in long tails to make residuals more normally distributed
The choice of transformation depends on the nature of the data, the pattern you see in diagnostic plots, and how you want to interpret the resulting model coefficients. Keep in mind that once you transform a variable, your regression coefficients describe the relationship on the transformed scale, so interpretation requires extra care.
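To make the linearization idea concrete, here is a small sketch (using NumPy, on simulated data chosen for illustration) in which an exponential relationship between x and y becomes nearly linear after taking logs of y:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 400)
# multiplicative noise: y grows exponentially in x
y = np.exp(0.8 * x + rng.normal(0, 0.5, x.size))

# linear correlation with x on the raw scale vs. the log scale
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(x, np.log(y))[0, 1]
# log(y) = 0.8x + noise is linear in x, so r_log is far closer to 1
```

On the raw scale the largest y values dominate and the relationship is strongly curved; on the log scale the same data satisfy the linear-mean assumption.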
Types and Selection of Data Transformations
- Logarithmic (log(y) or log(x)): Best for positively skewed data or when variance increases with the mean. Very common for financial data, biological measurements, and other strictly positive quantities.
- Square root (√y): Often used for count data, where the variance tends to be proportional to the mean.
- Reciprocal (1/y): Useful for inverse relationships between x and y, or when variance decreases as y increases.
- Box-Cox power transformation (y(λ) = (y^λ − 1)/λ for λ ≠ 0, log(y) for λ = 0): A family of transformations indexed by λ. The procedure estimates the value of λ that best achieves normality and constant variance simultaneously. Special cases include λ = 1 (no transformation), λ = 1/2 (square root), λ = 0 (log, by convention), and λ = −1 (reciprocal).
To choose among these, use exploratory tools: residual-vs-fitted plots reveal variance patterns, Q-Q plots show departures from normality, and scatterplots of y against x expose nonlinearity. For a more systematic approach, the Box-Cox procedure profiles the likelihood over λ and provides a confidence interval for the best power.
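As an illustration of the Box-Cox procedure, SciPy's `scipy.stats.boxcox` returns the transformed data, the maximum-likelihood estimate of λ, and (when `alpha` is given) a confidence interval. The data below are simulated; for lognormal data the estimate should land near λ = 0, the log transformation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# right-skewed, strictly positive data (Box-Cox requires y > 0)
y = rng.lognormal(mean=0.0, sigma=0.7, size=500)

# boxcox profiles the likelihood over lambda and returns the MLE;
# passing alpha also returns a confidence interval for lambda
y_t, lam, ci = stats.boxcox(y, alpha=0.05)
```

In practice you would round the estimated λ to a nearby interpretable value (0, 1/2, 1, −1) rather than use the raw MLE.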
Addressing Non-normality and Heteroscedasticity
Identifying and Addressing Non-normality
Non-normality means the residuals deviate from a normal distribution. This matters because confidence intervals, prediction intervals, and hypothesis tests all rely on the normality assumption. With non-normal residuals, your p-values and intervals can be misleading.
How to detect it:
- Q-Q plot: Plot residuals against theoretical normal quantiles. Systematic curvature or heavy tails indicate non-normality.
- Histogram of residuals: Look for strong skewness or multiple modes.
- Formal tests: The Shapiro-Wilk test or Kolmogorov-Smirnov test provides a p-value, though with large samples these tests can flag trivially small departures.
How to fix it:
Apply a transformation to y. If the residual distribution is right-skewed, a log or square root transformation often helps. The Box-Cox procedure can guide you to the best power. After transforming, re-examine the Q-Q plot to confirm improvement.
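A quick before-and-after check with the Shapiro-Wilk test (via SciPy, on simulated right-skewed data) shows how a log transformation of y can restore normality:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# right-skewed, residual-like data
y = rng.lognormal(mean=0.0, sigma=1.0, size=200)

stat_raw, p_raw = stats.shapiro(y)          # tiny p-value: clearly non-normal
stat_log, p_log = stats.shapiro(np.log(y))  # log(y) is exactly normal here
```

As noted above, with large samples the test will flag even trivial departures, so pair the p-value with a Q-Q plot before deciding to transform.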
Identifying and Addressing Heteroscedasticity
Heteroscedasticity means the spread of residuals changes across the range of fitted values. For example, residuals might fan out as the fitted value ŷ increases. This violates the constant-variance assumption, making OLS standard errors unreliable.
How to detect it:
- Residuals vs. fitted values plot: Look for a funnel shape or any systematic change in spread.
- Formal tests: The Breusch-Pagan test regresses squared residuals on the predictors; a significant result indicates heteroscedasticity. The White test is a more general alternative.
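To sketch the Breusch-Pagan mechanics just described, the following (a hand-rolled version using only NumPy and SciPy, on simulated data) regresses the squared OLS residuals on the predictors and forms the LM statistic as n times the auxiliary R²:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(1, 10, n)
y = 2.0 + 3.0 * x + rng.normal(0, x)  # error sd grows with x: heteroscedastic

# initial OLS fit
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Breusch-Pagan: regress squared residuals on the predictors, LM = n * R^2
sq = resid**2
g, *_ = np.linalg.lstsq(X, sq, rcond=None)
ss_res = np.sum((sq - X @ g) ** 2)
ss_tot = np.sum((sq - sq.mean()) ** 2)
lm = n * (1 - ss_res / ss_tot)
p_value = stats.chi2.sf(lm, df=1)  # df = number of non-constant predictors
```

A small p-value leads you to reject constant variance; statistical packages wrap exactly this computation.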
How to fix it:
- If variance increases with the mean, try log(y).
- If variance is proportional to the mean (common with counts), try √y.
- You can also transform x, or both x and y.
- If transformations distort the model's interpretability or don't fully resolve the issue, weighted least squares (below) is the natural alternative.

Weighted Least Squares Regression
Concept and Rationale
Weighted least squares (WLS) is a modification of OLS designed specifically for heteroscedastic data. Instead of treating every observation equally, WLS assigns a weight to each observation that reflects how precise it is.
The core idea: observations with smaller variance carry more information, so they should count more in the fit. Observations with larger variance are noisier, so they should count less.
Formally, OLS minimizes Σᵢ (yᵢ − ŷᵢ)², while WLS minimizes:
Σᵢ wᵢ (yᵢ − ŷᵢ)²
where wᵢ is the weight for observation i and yᵢ − ŷᵢ is the residual. The standard choice is to set weights inversely proportional to the variance of each observation:
wᵢ = 1/σᵢ²
This gives high-precision points large weights and low-precision points small weights. When the weights are correctly specified, WLS produces estimates that are unbiased and more efficient (lower variance) than OLS under heteroscedasticity. It can also reduce the influence of outliers that happen to fall in high-variance regions.
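In matrix form the WLS estimator is β̂ = (XᵀWX)⁻¹ XᵀWy, with W the diagonal matrix of weights. A minimal NumPy sketch, assuming (for illustration) that the variance structure is known exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(1, 10, n)
sigma = 0.5 * x                      # known structure: error sd proportional to x
y = 1.0 + 2.0 * x + rng.normal(0, sigma)

X = np.column_stack([np.ones(n), x])
w = 1.0 / sigma**2                   # weights = inverse variance

# WLS normal equations: beta = (X' W X)^{-1} X' W y
W = np.diag(w)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

Building the full diagonal W is fine at this size; for large n you would scale the rows of X and y by √wᵢ instead, which is algebraically equivalent.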
Implementation of Weighted Least Squares Regression
Fitting a WLS model involves several steps:
1. Fit an initial OLS regression and examine residual plots or run a Breusch-Pagan/White test to confirm heteroscedasticity.
2. Identify the variance structure. Determine how the residual variance relates to the predictors or fitted values. For example, you might find that σᵢ² ∝ xᵢ or σᵢ² ∝ ŷᵢ².
3. Define the weights. Set wᵢ as the inverse of the estimated variance function, wᵢ = 1/σ̂ᵢ²:
   - If variance is proportional to xᵢ, use wᵢ = 1/xᵢ.
   - If variance is proportional to ŷᵢ², use wᵢ = 1/ŷᵢ² (using fitted values from the initial OLS).
4. Fit the weighted regression. Most statistical software (R, Python, SAS) has built-in WLS options where you supply the weight vector directly. Conceptually, the procedure multiplies each observation's yᵢ and predictors (including the intercept column) by √wᵢ and then runs OLS on the transformed data.
5. Check diagnostics on the weighted residuals. Plot weighted residuals vs. fitted values to verify that the variance is now approximately constant. Also check normality with a Q-Q plot.
6. Interpret coefficients carefully. The regression coefficients from WLS are on the original scale of x and y (unlike transformation approaches), which can make interpretation more straightforward. However, if you combined WLS with a transformation, remember that coefficients reflect the transformed scale.
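The steps above can be sketched end to end in NumPy. This is a simplified illustration on simulated data that assumes the variance structure Var(eᵢ) ∝ xᵢ is known, so the weights are wᵢ = 1/xᵢ:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
x = rng.uniform(1, 10, n)
y = 4.0 + 1.5 * x + rng.normal(0, np.sqrt(x))   # Var(e_i) proportional to x_i

X = np.column_stack([np.ones(n), x])

# Step 1: initial OLS fit
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Steps 2-3: assume Var(e_i) is proportional to x_i, so w_i = 1/x_i
w = 1.0 / x

# Step 4: multiply each row of X and y by sqrt(w_i), then run OLS on the result
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

# Step 5: weighted residuals should now have roughly constant spread
wres = (y - X @ beta_wls) * sw
```

If the weights are right, the weighted residuals behave like homoscedastic errors, which is what the diagnostic plots in step 5 should confirm.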
Weighted Least Squares vs. Ordinary Least Squares
Assumptions and Limitations of Ordinary Least Squares
OLS relies on four key assumptions: linearity, independence of observations, normality of residuals, and homoscedasticity (constant variance). When homoscedasticity is violated:
- The coefficient estimates themselves remain unbiased, but they are no longer the most efficient (minimum-variance) estimates available.
- The standard errors computed by OLS are wrong, which means confidence intervals have incorrect coverage and hypothesis tests have incorrect Type I error rates.
- OLS gives equal weight to every observation, so noisy data points pull the fitted line just as much as precise ones.
- Outliers in high-variance regions can have a disproportionate effect on the fit.
Advantages of Weighted Least Squares over Ordinary Least Squares
WLS directly addresses heteroscedasticity by incorporating variance information into the estimation. Here's what you gain:
- More efficient estimates. By down-weighting noisy observations and up-weighting precise ones, WLS achieves lower variance for the estimated coefficients than OLS.
- Valid inference. Standard errors, confidence intervals, and hypothesis tests from WLS are trustworthy when the weight function is correctly specified, whereas OLS inference is distorted under heteroscedasticity.
- Narrower confidence intervals and more powerful tests compared to OLS, because the estimator makes better use of the available information.
- Reduced outlier influence. Observations in high-variance regions naturally receive smaller weights, limiting their pull on the regression line.
- Flexibility. WLS lets you incorporate external knowledge about measurement precision. For instance, if some observations are averages of many measurements and others are single readings, you can weight accordingly.
The main caveat is that WLS requires you to correctly specify the weight function. If the assumed variance structure is wrong, WLS can actually perform worse than OLS. In practice, you often estimate the variance function from the data, which introduces some uncertainty. Always check the weighted residual plots after fitting to confirm the weights are doing their job.
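The efficiency gain is easy to see in a small Monte Carlo sketch (simulated data, correctly specified weights): across repeated samples, the WLS slope estimates scatter less around the truth than the OLS ones.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 100, 500
x = rng.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x])
sigma = x                              # strong heteroscedasticity: sd = x
sw = np.sqrt(1.0 / sigma**2)           # sqrt of inverse-variance weights

slopes_ols, slopes_wls = [], []
for _ in range(reps):
    y = 1.0 + 2.0 * x + rng.normal(0, sigma)
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    b_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    slopes_ols.append(b_ols[1])
    slopes_wls.append(b_wls[1])

# sampling variance of the slope estimator under each method
var_ols = np.var(slopes_ols)
var_wls = np.var(slopes_wls)
```

Note this comparison assumes the weights are correct; with a badly misspecified variance function the ordering can reverse, which is exactly the caveat above.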