Multiple Linear Regression Model and Assumptions
Multiple linear regression extends simple linear regression by using several predictor variables to model a response. Where simple regression fits a line in two dimensions, multiple regression fits a hyperplane across many. Getting the model right depends on understanding both its structure and the assumptions that make ordinary least squares (OLS) estimation valid.
Multiple Linear Regression Model
Structure and Components
The general form of a multiple linear regression model is:

y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₖxₖ + ε

where:
- y is the response (dependent) variable
- x₁, x₂, …, xₖ are the predictor (independent) variables
- β₀, β₁, …, βₖ are the regression coefficients (population parameters)
- ε is the random error term
Each coefficient βⱼ represents the expected change in y for a one-unit increase in xⱼ, holding all other predictors constant. That "holding constant" part is critical. It's what separates multiple regression from running several simple regressions. For example, if you're modeling sales revenue from advertising spend (x₁) and product price (x₂), β₁ tells you the effect of an additional dollar of advertising at a fixed price level.
The intercept β₀ is the expected value of y when every predictor equals zero. Depending on the context, this may or may not have a meaningful real-world interpretation. If no product can realistically have a price of $0 and zero advertising, β₀ is just an anchor for the regression plane rather than a quantity you'd interpret directly.
Error Term and Unexplained Variability
The error term ε captures everything the model doesn't explain: measurement error, omitted variables, and inherent randomness. Sales revenue might fluctuate due to consumer sentiment, competitor behavior, or seasonal effects that aren't in your model. All of that lands in ε.
OLS estimation finds the coefficient values that minimize the sum of squared residuals (SSE), which is the total squared difference between observed and predicted values. In other words, OLS picks the hyperplane that sits as close as possible to the data points in a least-squares sense.
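The least-squares idea can be sketched numerically. The snippet below simulates a small revenue-from-advertising-and-price dataset (the variable names and numbers are invented for illustration) and solves the normal equations with numpy:

```python
# Sketch: OLS coefficient estimates that minimize the sum of squared
# residuals. The simulated data and coefficient values are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 200
advertising = rng.uniform(0, 100, n)
price = rng.uniform(5, 50, n)
# True model: revenue = 10 + 2.0*advertising - 1.5*price + noise
revenue = 10 + 2.0 * advertising - 1.5 * price + rng.normal(0, 5, n)

# Design matrix with an intercept column of ones.
X = np.column_stack([np.ones(n), advertising, price])
# lstsq solves for the beta that minimizes ||y - X @ beta||^2 (the SSE).
beta_hat, *_ = np.linalg.lstsq(X, revenue, rcond=None)
residuals = revenue - X @ beta_hat
sse = np.sum(residuals**2)
```

With enough data, `beta_hat` recovers values close to the true coefficients (10, 2.0, −1.5) used to generate the sample.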
Assumptions of Multiple Regression
The classical assumptions are often summarized with the acronym LINE (Linearity, Independence, Normality, Equal variance), plus the additional concern of multicollinearity. Each assumption supports a specific part of what makes OLS work.

Linearity and Independence
Linearity means the relationship between each predictor and the response is linear in the parameters, holding other predictors constant. If advertising spend has a diminishing-returns effect on revenue, a straight-line term won't capture that, and your coefficient estimates will be systematically off.
Independence means the observations don't influence each other. The residual for one data point should carry no information about the residual for another. This assumption is most commonly violated with time-series data (where consecutive observations are correlated) or clustered data (e.g., multiple products from the same company).
Homoscedasticity and Normality
Homoscedasticity (constant variance) means the spread of the residuals stays the same across all fitted values and all levels of the predictors. If residuals fan out as predicted values increase, you have heteroscedasticity, and your standard errors will be unreliable.
Normality means the residuals follow a normal distribution with mean zero: ε ~ N(0, σ²). This assumption matters most for inference (hypothesis tests, confidence intervals). With large samples, the Central Limit Theorem provides some protection, but with small samples, non-normal errors can seriously distort your p-values.
Multicollinearity and Outliers
The absence of perfect multicollinearity is required for OLS to produce estimates at all. In practice, the concern is high (but not perfect) multicollinearity, where predictors are strongly correlated with each other. When two predictors move together, the model struggles to separate their individual effects, leading to inflated standard errors and unstable coefficients.
No unduly influential observations means that no single data point should dominate the regression results. An extreme outlier in can pull the fitted plane toward itself, distorting coefficient estimates for the entire model.
Assessing Assumptions in Regression

Evaluating Linearity and Independence
To check linearity:
- Plot residuals against each predictor variable individually.
- Plot residuals against the fitted values.
- Look for any systematic curvature or patterns. Random scatter around zero is what you want. A U-shape or other curve signals that a linear term isn't adequate for that predictor.
Added-variable plots (partial regression plots) are especially useful in multiple regression because they show the relationship between a predictor and the response after accounting for the other predictors.
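A numeric version of the residual check can be sketched as follows. The data are simulated with a deliberate diminishing-returns (quadratic) effect so the linear fit is misspecified; the names and numbers are invented for illustration:

```python
# Sketch: detecting curvature left in the residuals of a linear fit.
# The simulated quadratic relationship is invented for this example.
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.uniform(0, 10, n)
y = 1 + 2 * x - 0.3 * x**2 + rng.normal(0, 1, n)  # diminishing returns

# Fit a straight line (intercept + x only), ignoring the curvature.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Residuals are orthogonal to x by construction, but if a quadratic term
# was omitted they still correlate with x^2 -- the numeric analogue of a
# visible curve in the residual plot.
curvature_corr = np.corrcoef(x**2, resid)[0, 1]
```

A correlation of `resid` with the squared term that is clearly away from zero is the numeric counterpart of the U-shape you would see in the plot.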
To check independence:
- Consider the study design. Were observations sampled independently, or is there a natural grouping or time ordering?
- Plot residuals against the order of data collection (if an ordering exists).
- Look for runs, cycles, or trends. A systematic pattern suggests autocorrelation.
The Durbin-Watson test provides a formal check for first-order autocorrelation, with values near 2 indicating no autocorrelation.
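The Durbin-Watson statistic is simple enough to compute directly from the residuals. The sketch below contrasts an AR(1)-autocorrelated series with an independent one (both simulated for illustration):

```python
# Sketch: Durbin-Watson statistic computed from residuals.
# DW = sum of squared successive differences / sum of squared residuals;
# values near 2 indicate no first-order autocorrelation.
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Autocorrelated errors: e_t = 0.7 * e_{t-1} + noise (invented example).
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.normal(0, 1)

def durbin_watson(resid):
    return np.sum(np.diff(resid) ** 2) / np.sum(resid**2)

dw_autocorr = durbin_watson(e)                       # well below 2
dw_independent = durbin_watson(rng.normal(0, 1, n))  # near 2
```

For an AR(1) process with coefficient ρ, DW is approximately 2(1 − ρ), so strong positive autocorrelation pushes the statistic toward 0.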
Checking Homoscedasticity and Normality
To check homoscedasticity:
- Plot residuals (or standardized residuals) against fitted values.
- The vertical spread of points should look roughly constant from left to right.
- A fan shape (wider spread at higher fitted values) or a funnel shape indicates heteroscedasticity.
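The fan shape can also be quantified with a crude spread comparison. The sketch below simulates data whose error variance grows with the predictor (all names and numbers invented for illustration):

```python
# Sketch: comparing residual spread across fitted values to flag
# heteroscedasticity. The simulated data are invented for this example.
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(1, 10, n)
y = 3 + 2 * x + rng.normal(0, 0.5 * x, n)  # noise sd grows with x

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Compare residual spread in the lower vs upper half of fitted values.
lo = resid[fitted <= np.median(fitted)]
hi = resid[fitted > np.median(fitted)]
spread_ratio = hi.std() / lo.std()  # >> 1 signals a fan shape
```

Under homoscedasticity the ratio hovers near 1; here it comes out well above 1 because the error standard deviation was built to grow with x.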
To check normality:
- Create a Q-Q (quantile-quantile) plot of the residuals. Points should fall approximately along a straight diagonal line.
- Examine a histogram of the residuals. It should be roughly symmetric and bell-shaped.
- Formal tests like Shapiro-Wilk can supplement visual checks, but with large samples they can flag trivial departures, so the plots are usually more informative.
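The Q-Q idea can be checked numerically as well as visually. This sketch uses only numpy and the standard library's `NormalDist` (the simulated residuals are invented for illustration):

```python
# Sketch: numeric Q-Q check of residual normality. Points hugging the
# 45-degree line correspond to a quantile-quantile correlation near 1.
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(4)
resid = rng.normal(0, 2, 300)  # roughly normal residuals (simulated)
resid_std = (resid - resid.mean()) / resid.std()

n = len(resid_std)
sample_q = np.sort(resid_std)
# Theoretical standard-normal quantiles at plotting positions (i + 0.5)/n.
theo_q = np.array([NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)])

# For normal residuals this correlation should be very close to 1;
# heavy tails or skew pull it down.
qq_corr = np.corrcoef(theo_q, sample_q)[0, 1]
```

The correlation is a blunt summary; the plot itself shows *where* the departure happens (tails vs center), which the single number hides.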
Detecting Multicollinearity and Influential Points
To detect multicollinearity:
- Examine the correlation matrix of the predictors. Pairwise correlations above roughly 0.8 deserve attention.
- Calculate the Variance Inflation Factor (VIF) for each predictor. VIF measures how much the variance of a coefficient is inflated due to correlation with other predictors.
- VIFⱼ = 1 / (1 − Rⱼ²), where Rⱼ² is the R² from regressing predictor xⱼ on all other predictors.
- A common rule of thumb: VIF > 5 warrants concern; VIF > 10 indicates serious multicollinearity.
- Note that pairwise correlations can miss multicollinearity involving three or more variables, so VIF is the more reliable diagnostic.
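The VIF formula translates directly into code: regress each predictor on the others and invert 1 − R². The correlated predictors below are simulated for illustration:

```python
# Sketch: VIF via auxiliary regressions. x1 and x2 are built to be
# strongly correlated (~0.9); x3 is independent. Data are invented.
import numpy as np

rng = np.random.default_rng(5)
n = 300
x1 = rng.normal(0, 1, n)
x2 = 0.9 * x1 + rng.normal(0, np.sqrt(1 - 0.9**2), n)
x3 = rng.normal(0, 1, n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), R_j^2 from regressing column j on the rest."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r2 = 1 - np.sum((y - Z @ coef) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
```

With a 0.9 correlation, Rⱼ² for x1 and x2 is around 0.8, giving VIFs near 5, while the independent x3 sits near 1.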
To detect outliers and influential points:
- Studentized residuals identify observations with unusually large residuals. Values beyond ±2 (or, more conservatively, ±3) flag potential outliers.
- Leverage values (hᵢᵢ) measure how far an observation's predictor values are from the center of the predictor space. High leverage means the point could be influential.
- Cook's distance combines leverage and residual size into a single measure of overall influence. A Cook's distance greater than 1 (or greater than 4/n as a more sensitive threshold) suggests the observation substantially shifts the fitted model.
An observation with high leverage but a small residual may not be a problem. An observation with both high leverage and a large residual almost certainly is.
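These diagnostics come straight from the hat matrix. The sketch below plants one extreme point in otherwise well-behaved simulated data (all values invented for illustration) and shows it dominating Cook's distance:

```python
# Sketch: leverage (hat-matrix diagonal) and Cook's distance, with one
# planted high-leverage outlier. The dataset is invented for illustration.
import numpy as np

rng = np.random.default_rng(6)
n = 100
x = rng.normal(0, 1, n)
y = 1 + 2 * x + rng.normal(0, 1, n)
x[-1], y[-1] = 8.0, -20.0  # plant a point far from the data and the line

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix
h = np.diag(H)                        # leverage values h_ii

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
p = X.shape[1]
mse = np.sum(resid**2) / (n - p)
# Cook's distance: D_i = e_i^2 * h_ii / (p * MSE * (1 - h_ii)^2)
cooks_d = resid**2 * h / (p * mse * (1 - h) ** 2)
```

The planted point combines high leverage (x = 8 is far from the bulk of the predictor values) with a huge residual, so its Cook's distance is far above the D > 1 rule of thumb while every legitimate point stays small.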
Consequences of Violated Assumptions
Impact on Coefficient Estimates and Predictions
Linearity violated: Coefficient estimates become biased because the model is systematically misspecified. Predictions will be poor, especially when extrapolating. If the true relationship between advertising and revenue is curved, a linear model will overpredict in some regions and underpredict in others.
Multicollinearity present: Coefficients remain unbiased but become highly variable. Small changes in the data can produce large swings in the estimated coefficients. For instance, the estimated effect of advertising on revenue might flip sign depending on whether price is included in the model. The model's overall predictions may still be reasonable, but interpreting individual coefficients becomes unreliable.
Effect on Inference and Model Fit
Independence violated: Standard errors are typically underestimated, which makes confidence intervals too narrow and p-values too small. You end up declaring effects "significant" that may not be. This is one of the more dangerous violations because the coefficient estimates themselves may look fine while the inferential machinery quietly breaks down.
Homoscedasticity violated: OLS estimates are still unbiased but no longer the most efficient (minimum variance). Standard errors are biased, so hypothesis tests and confidence intervals lose their validity. Weighted least squares or heteroscedasticity-robust standard errors (e.g., White's robust standard errors) are common remedies.
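The contrast between classical and robust standard errors can be sketched with the HC0 "sandwich" estimator on simulated heteroscedastic data (names and numbers invented for illustration):

```python
# Sketch: classical vs White/HC0 heteroscedasticity-robust standard
# errors. The simulated data have error variance growing with x.
import numpy as np

rng = np.random.default_rng(7)
n = 500
x = rng.uniform(1, 10, n)
y = 2 + 3 * x + rng.normal(0, x, n)  # error sd grows with x

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

XtX_inv = np.linalg.inv(X.T @ X)
# Classical OLS covariance: sigma^2 (X'X)^{-1}, assumes constant variance.
classical_cov = np.sum(resid**2) / (n - 2) * XtX_inv
# HC0 sandwich: (X'X)^{-1} X' diag(e_i^2) X (X'X)^{-1}.
meat = X.T @ (resid[:, None] ** 2 * X)
robust_cov = XtX_inv @ meat @ XtX_inv

se_classical = np.sqrt(np.diag(classical_cov))
se_robust = np.sqrt(np.diag(robust_cov))
```

Because the noisiest observations also carry the most weight for the slope, the classical formula understates the slope's uncertainty here and the robust standard error comes out larger.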
Normality violated: For large samples, this is often the least consequential violation because the sampling distribution of the coefficients approaches normality regardless (via the CLT). For small samples, non-normal errors distort t-tests and F-tests, potentially leading to incorrect conclusions about which predictors matter.
Influential points present: A single extreme observation can pull the regression plane toward itself, inflating or deflating coefficient estimates and artificially raising or lowering R². Always investigate flagged points. Sometimes they represent data entry errors; other times they reveal a subpopulation your model doesn't account for.