
6.1 Multiple Linear Regression Model and Assumptions

Written by the Fiveable Content Team • Last updated August 2025

Multiple Linear Regression Model and Assumptions

Multiple linear regression extends simple linear regression by using several predictor variables to model a response. Where simple regression fits a line in two dimensions, multiple regression fits a hyperplane across many. Getting the model right depends on understanding both its structure and the assumptions that make ordinary least squares (OLS) estimation valid.

Multiple Linear Regression Model

Structure and Components

The general form of a multiple linear regression model is:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$

  • $y$ is the response (dependent) variable
  • $x_1, x_2, \ldots, x_p$ are the $p$ predictor (independent) variables
  • $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are the regression coefficients (population parameters)
  • $\varepsilon$ is the random error term

Each coefficient $\beta_j$ represents the expected change in $y$ for a one-unit increase in $x_j$, holding all other predictors constant. That "holding constant" part is critical. It's what separates multiple regression from running several simple regressions. For example, if you're modeling sales revenue from advertising spend and product price, $\beta_1$ tells you the effect of an additional dollar of advertising at a fixed price level.

The intercept $\beta_0$ is the expected value of $y$ when every predictor equals zero. Depending on the context, this may or may not have a meaningful real-world interpretation. If no product can realistically have a price of $0 and zero advertising, $\beta_0$ is just an anchor for the regression plane rather than a quantity you'd interpret directly.
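
As a minimal sketch of this interpretation, the snippet below fits a multiple regression to synthetic data built from the running sales example (the variable names `advertising` and `price` and all numeric values are illustrative assumptions, not from the text). With enough data, OLS recovers coefficients close to the ones used to generate it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors: advertising spend and product price
advertising = rng.uniform(0, 10, n)
price = rng.uniform(5, 15, n)

# Data generated from a known model: y = 50 + 3*advertising - 2*price + noise
y = 50 + 3 * advertising - 2 * price + rng.normal(0, 1, n)

# Design matrix with a leading column of ones for the intercept beta_0
X = np.column_stack([np.ones(n), advertising, price])

# OLS estimates via least squares
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [50, 3, -2]
```

Here `beta_hat[1]` estimates the effect of one more dollar of advertising with price held fixed, which is exactly the "holding constant" reading of $\beta_1$.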

Error Term and Unexplained Variability

The error term $\varepsilon$ captures everything the model doesn't explain: measurement error, omitted variables, and inherent randomness. Sales revenue might fluctuate due to consumer sentiment, competitor behavior, or seasonal effects that aren't in your model. All of that lands in $\varepsilon$.

OLS estimation finds the coefficient values that minimize the sum of squared residuals (SSE), the sum of the squared differences between observed and predicted values. In other words, OLS picks the hyperplane that sits as close as possible to the data points in a least-squares sense.
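
A quick numeric check of the minimization claim (using synthetic data, since the text gives none): the closed-form OLS solution from the normal equations has a smaller SSE than any perturbed coefficient vector.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.5, n)

# Closed-form OLS solution from the normal equations: (X'X) b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

def sse(b):
    """Sum of squared residuals for coefficient vector b."""
    resid = y - X @ b
    return resid @ resid

# Nudging beta_hat in any direction increases the SSE -- OLS is the minimizer
print(sse(beta_hat) < sse(beta_hat + 0.1))  # True
```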

Assumptions of Multiple Regression

The classical assumptions are often summarized with the acronym LINE (Linearity, Independence, Normality, Equal variance), plus the additional concern of multicollinearity. Each assumption supports a specific part of what makes OLS work.


Linearity and Independence

Linearity means the relationship between each predictor and the response is linear in the parameters, holding other predictors constant. If advertising spend has a diminishing-returns effect on revenue, a straight-line term won't capture that, and your coefficient estimates will be systematically off.

Independence means the observations don't influence each other. The residual for one data point should carry no information about the residual for another. This assumption is most commonly violated with time-series data (where consecutive observations are correlated) or clustered data (e.g., multiple products from the same company).

Homoscedasticity and Normality

Homoscedasticity (constant variance) means the spread of the residuals stays the same across all fitted values and all levels of the predictors. If residuals fan out as predicted values increase, you have heteroscedasticity, and your standard errors will be unreliable.

Normality means the residuals follow a normal distribution with mean zero: $\varepsilon \sim N(0, \sigma^2)$. This assumption matters most for inference (hypothesis tests, confidence intervals). With large samples, the Central Limit Theorem provides some protection, but with small samples, non-normal errors can seriously distort your p-values.

Multicollinearity and Outliers

No perfect multicollinearity is required for OLS to even produce estimates: if one predictor is an exact linear combination of the others, the coefficients are not uniquely determined. In practice, the concern is high (but not perfect) multicollinearity, where predictors are strongly correlated with each other. When two predictors move together, the model struggles to separate their individual effects, leading to inflated standard errors and unstable coefficients.

No unduly influential observations means that no single data point should dominate the regression results. An extreme outlier in yy can pull the fitted plane toward itself, distorting coefficient estimates for the entire model.

Assessing Assumptions in Regression


Evaluating Linearity and Independence

To check linearity:

  1. Plot residuals against each predictor variable individually.
  2. Plot residuals against the fitted values.
  3. Look for any systematic curvature or patterns. Random scatter around zero is what you want. A U-shape or other curve signals that a linear term isn't adequate for that predictor.

Added-variable plots (partial regression plots) are especially useful in multiple regression because they show the relationship between a predictor and the response after accounting for the other predictors.
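
A numeric version of the linearity check can stand in for the plots: fit a straight line to data with a deliberately curved (diminishing-returns) relationship and look for structure in the residuals. This is a sketch on synthetic data; the specific diagnostic (correlating residuals with a curved function of the predictor) is one simple way to quantify the pattern you'd see by eye.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 300)
# True relationship is curved (diminishing returns), but we fit a line
y = 4 * np.sqrt(x) + rng.normal(0, 0.3, 300)

X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Random scatter would give a near-zero correlation between the residuals
# and a curved function of the predictor; systematic curvature does not.
curvature_signal = abs(np.corrcoef(resid, (x - x.mean()) ** 2)[0, 1])
print(curvature_signal)  # clearly nonzero here, flagging the misfit
```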

To check independence:

  1. Consider the study design. Were observations sampled independently, or is there a natural grouping or time ordering?
  2. Plot residuals against the order of data collection (if an ordering exists).
  3. Look for runs, cycles, or trends. A systematic pattern suggests autocorrelation.

The Durbin-Watson test provides a formal check for first-order autocorrelation, with values near 2 indicating no autocorrelation.
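
The Durbin-Watson statistic is simple enough to compute directly: the sum of squared successive differences of the residuals divided by their sum of squares. The sketch below (on synthetic residual series) shows independent residuals landing near 2 and strongly autocorrelated AR(1) residuals landing well below it.

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic. Near 2: no first-order autocorrelation;
    near 0: positive autocorrelation; near 4: negative autocorrelation."""
    resid = np.asarray(resid)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(3)
independent = rng.normal(size=500)
print(durbin_watson(independent))  # close to 2

# AR(1) residuals with strong positive autocorrelation (rho = 0.8)
ar = np.empty(500)
ar[0] = rng.normal()
for t in range(1, 500):
    ar[t] = 0.8 * ar[t - 1] + rng.normal()
print(durbin_watson(ar))  # well below 2
```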

Checking Homoscedasticity and Normality

To check homoscedasticity:

  1. Plot residuals (or standardized residuals) against fitted values.
  2. The vertical spread of points should look roughly constant from left to right.
  3. A fan shape (wider spread at higher fitted values) or a funnel shape indicates heteroscedasticity.
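
A crude numeric analogue of the fan-shape check (a sketch on synthetic data, not a formal test): under homoscedasticity, the size of the residuals should be unrelated to the fitted values, so a clear positive correlation between $|e_i|$ and the fitted values signals trouble.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(1, 10, 400)
# Heteroscedastic errors: the spread grows with x
y = 1 + 2 * x + rng.normal(0, 0.5 * x, 400)

X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted

# Positive correlation between |residual| and fitted value = fan shape
fan_signal = np.corrcoef(np.abs(resid), fitted)[0, 1]
print(fan_signal)  # clearly positive here
```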

To check normality:

  1. Create a Q-Q (quantile-quantile) plot of the residuals. Points should fall approximately along a straight diagonal line.
  2. Examine a histogram of the residuals. It should be roughly symmetric and bell-shaped.
  3. Formal tests like Shapiro-Wilk can supplement visual checks, but with large samples they can flag trivial departures, so the plots are usually more informative.
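
The Q-Q idea can also be checked numerically: sort the residuals and compare them to theoretical normal quantiles; points on a straight line mean the correlation between the two is close to 1. The sketch below uses synthetic normal residuals and supplements the check with `scipy.stats.shapiro`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
resid = rng.normal(0, 1.5, 400)  # stand-in for model residuals

# Q-Q plot coordinates: sorted residuals vs. theoretical normal quantiles
n = len(resid)
theoretical = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)
sample = np.sort(resid)

# Normal residuals fall on a straight line, so the correlation is near 1
qq_corr = np.corrcoef(theoretical, sample)[0, 1]
print(qq_corr)

# Shapiro-Wilk as a formal supplement to the visual check
stat, p_value = stats.shapiro(resid)
print(stat, p_value)
```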

Detecting Multicollinearity and Influential Points

To detect multicollinearity:

  1. Examine the correlation matrix of the predictors. Pairwise correlations above roughly 0.8 deserve attention.

  2. Calculate the Variance Inflation Factor (VIF) for each predictor. VIF measures how much the variance of a coefficient is inflated due to correlation with other predictors.

    • $VIF_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ is the $R^2$ from regressing predictor $x_j$ on all other predictors.
    • A common rule of thumb: VIF > 5 warrants concern; VIF > 10 indicates serious multicollinearity.
  3. Note that pairwise correlations can miss multicollinearity involving three or more variables, so VIF is the more reliable diagnostic.
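
The VIF formula translates directly into code: regress each predictor on the others, take the resulting $R_j^2$, and compute $1/(1 - R_j^2)$. The example below plants a near-collinear predictor in synthetic data so the inflation is visible.

```python
import numpy as np

def vif(X):
    """VIF for each column of predictor matrix X (no intercept column):
    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on the remaining columns plus an intercept."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        y_j = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y_j, rcond=None)
        resid = y_j - others @ beta
        r2 = 1 - (resid @ resid) / np.sum((y_j - y_j.mean()) ** 2)
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

rng = np.random.default_rng(5)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)            # independent of x1 -> VIF near 1
x3 = x1 + rng.normal(0, 0.1, 300)    # nearly collinear with x1 -> large VIF
v = vif(np.column_stack([x1, x2, x3]))
print(v)
```

Note that `x1` and `x3` both show large VIFs even though `x2` is harmless: multicollinearity is a property of the predictor set, not of any one variable in isolation.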

To detect outliers and influential points:

  • Studentized residuals identify observations with unusually large residuals. Values beyond $\pm 2$ or $\pm 3$ flag potential outliers.
  • Leverage values ($h_{ii}$) measure how far an observation's predictor values are from the center of the predictor space. High leverage means the point could be influential.
  • Cook's distance combines leverage and residual size into a single measure of overall influence. A Cook's distance greater than 1 (or greater than $4/n$ as a more sensitive threshold) suggests the observation substantially shifts the fitted model.

An observation with high leverage but a small residual may not be a problem. An observation with both high leverage and a large residual almost certainly is.
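
These diagnostics can be computed from the hat matrix $H = X(X'X)^{-1}X'$, whose diagonal gives the leverages $h_{ii}$. The sketch below plants one point with both extreme predictor values (high leverage) and an extreme response (large residual) in synthetic data, and Cook's distance singles it out.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x = rng.normal(0, 1, n)
y = 2 + 3 * x + rng.normal(0, 1, n)

# Plant one influential point: high leverage (extreme x) AND large residual
x[0], y[0] = 8.0, -20.0

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix; leverage = diagonal
leverage = np.diag(H)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
p = X.shape[1]
sigma2 = resid @ resid / (n - p)

# Cook's distance combines residual size and leverage into one measure
cooks_d = (resid ** 2 / (p * sigma2)) * leverage / (1 - leverage) ** 2
print(np.argmax(cooks_d))  # 0: the planted point dominates
```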

Consequences of Violated Assumptions

Impact on Coefficient Estimates and Predictions

Linearity violated: Coefficient estimates become biased because the model is systematically misspecified. Predictions will be poor, especially when extrapolating. If the true relationship between advertising and revenue is curved, a linear model will overpredict in some regions and underpredict in others.

Multicollinearity present: Coefficients remain unbiased but become highly variable. Small changes in the data can produce large swings in the estimated coefficients. For instance, the estimated effect of advertising on revenue might flip sign depending on whether price is included in the model. The model's overall predictions may still be reasonable, but interpreting individual coefficients becomes unreliable.

Effect on Inference and Model Fit

Independence violated: Standard errors are typically underestimated, which makes confidence intervals too narrow and p-values too small. You end up declaring effects "significant" that may not be. This is one of the more dangerous violations because the coefficient estimates themselves may look fine while the inferential machinery quietly breaks down.

Homoscedasticity violated: OLS estimates are still unbiased but no longer the most efficient (minimum variance). Standard errors are biased, so hypothesis tests and confidence intervals lose their validity. Weighted least squares or heteroscedasticity-robust standard errors (e.g., White's robust standard errors) are common remedies.
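
To illustrate the robust-standard-error remedy, the sketch below computes White's sandwich (HC0) standard errors by hand on synthetic heteroscedastic data and compares them with the classical OLS standard errors, which here understate the slope's uncertainty.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
x = rng.uniform(1, 10, n)
# Heteroscedastic errors: spread grows with x
y = 1 + 2 * x + rng.normal(0, 0.5 * x, n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Classical OLS standard errors (assume constant error variance)
sigma2 = resid @ resid / (n - 2)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# White's HC0 robust standard errors: the sandwich estimator
# (X'X)^-1 [ sum_i e_i^2 x_i x_i' ] (X'X)^-1
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
print(se_classical, se_robust)  # robust slope SE exceeds the classical one
```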

Normality violated: For large samples, this is often the least consequential violation because the sampling distribution of the coefficients approaches normality regardless (via the CLT). For small samples, non-normal errors distort t-tests and F-tests, potentially leading to incorrect conclusions about which predictors matter.

Influential points present: A single extreme observation can pull the regression plane toward itself, inflating or deflating coefficient estimates and artificially raising or lowering $R^2$. Always investigate flagged points. Sometimes they represent data entry errors; other times they reveal a subpopulation your model doesn't account for.