Fiveable
🥖Linear Modeling Theory Unit 6 Review

6.1 Multiple Linear Regression Model and Assumptions

Written by the Fiveable Content Team • Last updated September 2025
Multiple linear regression models predict a response using several predictors, like sales revenue based on advertising and price. The model's structure includes coefficients for each predictor and an error term. Understanding these components helps interpret the relationships between variables.

The model relies on key assumptions: linearity, independence, homoscedasticity, normality, and no multicollinearity. Violating these can lead to biased estimates, inaccurate predictions, and incorrect inferences. Checking these assumptions is crucial for reliable results and meaningful interpretations.

Multiple Linear Regression Model

Structure and Components

  • A multiple linear regression model predicts a quantitative response variable using multiple predictor variables (age, income, education level)
  • The general form is $y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε$
    • $y$ is the response variable (sales revenue)
    • $x₁, x₂, ..., xₚ$ are the predictor variables (advertising expenditure, product price)
    • $β₀, β₁, β₂, ..., βₚ$ are the regression coefficients
    • $ε$ is the random error term
  • The regression coefficients $βᵢ$ represent the change in the mean of the response variable for a one-unit increase in the corresponding predictor variable, holding all other predictors constant (a one-dollar increase in advertising expenditure changes mean sales revenue by $β₁$ dollars, keeping product price constant)
  • The intercept $β₀$ is the expected value of the response variable when all predictor variables are zero (baseline sales revenue with no advertising and a product price of $0)
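The model form above can be made concrete with a small numeric sketch. The example below uses simulated data (the variable names, sample size, and "true" coefficient values are all assumptions for illustration, not from the text) and fits $y = β₀ + β₁x₁ + β₂x₂ + ε$ by ordinary least squares with NumPy:

```python
import numpy as np

# Hypothetical example: predict sales revenue from advertising spend and price.
rng = np.random.default_rng(0)
n = 200
advertising = rng.uniform(0, 100, n)
price = rng.uniform(5, 20, n)

# Assumed true model for this sketch: y = 50 + 2.0*advertising - 3.0*price + ε
y = 50 + 2.0 * advertising - 3.0 * price + rng.normal(0, 5, n)

# Design matrix with a leading column of ones for the intercept β₀.
X = np.column_stack([np.ones(n), advertising, price])

# Ordinary least squares fit: β = (β₀, β₁, β₂).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = beta
```

With enough data, the fitted coefficients land close to the assumed true values, and each slope is read as the change in mean revenue per one-unit change in that predictor, holding the other fixed.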

Error Term and Unexplained Variability

  • The error term $ε$ accounts for the variability in the response variable that cannot be explained by the linear relationship with the predictor variables
  • This unexplained variability can be due to measurement error, omitted variables, or inherent randomness in the data (sales revenue fluctuations due to factors not included in the model, such as consumer preferences or economic conditions)
  • The goal of multiple linear regression is to minimize the sum of squared errors, which represents the total unexplained variability in the model (finding the regression coefficients that produce the smallest overall difference between the observed and predicted values of the response variable)
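The "minimize the sum of squared errors" goal can be verified directly. This sketch (simulated data, all names assumed) solves the normal equations $(XᵀX)β = Xᵀy$ and checks that perturbing the OLS solution can only increase the SSE:

```python
import numpy as np

# Simulated data for illustration.
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(0, 1, n)

# Normal equations: (XᵀX) β = Xᵀy gives the least-squares coefficients.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

def sse(beta):
    """Sum of squared errors: total squared gap between observed and predicted y."""
    resid = y - X @ beta
    return float(resid @ resid)

# Nudging any coefficient away from the OLS solution increases the SSE.
sse_hat = sse(beta_hat)
sse_perturbed = sse(beta_hat + np.array([0.1, 0.0, 0.0]))
```

Because the SSE is a convex function of the coefficients, the normal-equation solution is the unique minimizer whenever the design matrix has full column rank.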

Assumptions of Multiple Regression

Linearity and Independence

  • Linearity assumes that the relationship between each predictor variable and the response variable is linear, holding all other predictors constant (a straight-line relationship between advertising expenditure and sales revenue, at a fixed product price)
  • Independence assumes that the observations are independently sampled, meaning that the value of one observation does not influence or depend on the value of another observation (sales revenue for one product does not affect the sales revenue for another product)

Homoscedasticity and Normality

  • Homoscedasticity assumes that the variability of the residuals (differences between observed and predicted values) is constant across all levels of the predictor variables (the spread of the residuals is the same for low and high levels of advertising expenditure)
  • Normality assumes that the residuals follow a normal distribution with a mean of zero (the distribution of the differences between observed and predicted sales revenue is symmetric and bell-shaped)

Multicollinearity and Outliers

  • No multicollinearity assumes that the predictor variables are not highly correlated with each other (advertising expenditure and product price are not strongly related)
    • Multicollinearity can lead to unstable estimates of the regression coefficients and difficulty in interpreting their individual effects
  • The absence of outliers or influential points assumes that no extreme observations have a disproportionate impact on the regression results (an unusually high sales revenue value that substantially shifts the estimated relationship between the predictors and the response)


Assessing Assumptions in Regression

Evaluating Linearity and Independence

  • Linearity can be assessed by examining scatterplots of the response variable against each predictor variable, looking for any non-linear patterns or curvature (a curved relationship between advertising expenditure and sales revenue suggests a violation of linearity)
  • Independence can be evaluated by considering the sampling method and the nature of the data
    • Plotting residuals against the order of data collection can help identify any time-dependent patterns or autocorrelation (residuals that exhibit a systematic pattern when plotted in the order the data was collected indicate a violation of independence)
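Beyond eyeballing a residuals-vs-order plot, a quick numeric stand-in is the lag-1 autocorrelation of the residuals in collection order: near zero for independent residuals, strongly positive when successive residuals track each other. A sketch with simulated residuals (both series and the AR(1) coefficient are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Independent residuals vs. autocorrelated AR(1) residuals, in collection order.
resid_indep = rng.normal(size=n)
resid_ar = np.zeros(n)
for t in range(1, n):
    resid_ar[t] = 0.8 * resid_ar[t - 1] + rng.normal()

def lag1_autocorr(r):
    """Correlation between each residual and the one collected just before it."""
    r = r - r.mean()
    return float((r[:-1] @ r[1:]) / (r @ r))
```

A lag-1 autocorrelation far from zero is the numeric counterpart of the systematic pattern described above and signals a violation of independence.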

Checking Homoscedasticity and Normality

  • Homoscedasticity can be checked by plotting the residuals against the predicted values
    • The spread of the residuals should be consistent across the range of predicted values (a fan-shaped pattern of residuals suggests a violation of homoscedasticity)
  • Normality of the residuals can be assessed using a normal probability plot (Q-Q plot) or a histogram of the residuals
    • The points in the Q-Q plot should fall close to a straight line, and the histogram should resemble a bell-shaped curve (significant deviations from a straight line in the Q-Q plot or a skewed histogram indicate a violation of normality)
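A simple numeric companion to the residuals-vs-fitted plot is to compare residual spread between the low and high halves of the fitted values: a ratio near 1 is consistent with homoscedasticity, while a large ratio reflects the fan shape described above. This sketch uses simulated residuals (the fitted values and the fan-shaped scale function are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
fitted = rng.uniform(0, 10, n)

# Homoscedastic residuals: constant spread at every fitted value.
resid_const = rng.normal(0, 1, n)
# Heteroscedastic residuals: spread grows with the fitted value (fan shape).
resid_fan = rng.normal(0, 1, n) * (0.2 + 0.3 * fitted)

def spread_ratio(fitted, resid):
    """Residual std dev in the top half of fitted values over the bottom half."""
    cut = np.median(fitted)
    return float(resid[fitted > cut].std() / resid[fitted <= cut].std())
```

Formal tests (such as Breusch-Pagan) refine this idea, but the half-vs-half spread ratio captures the same visual cue as the fan-shaped residual plot.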

Detecting Multicollinearity and Influential Points

  • Multicollinearity can be detected by examining the correlation matrix of the predictor variables or by calculating the variance inflation factor (VIF) for each predictor
    • High correlations or VIF values (e.g., > 5 or 10) suggest the presence of multicollinearity (a correlation of 0.9 between advertising expenditure and product price indicates a high degree of multicollinearity)
  • Outliers and influential points can be identified using diagnostic measures such as studentized residuals, leverage values, and Cook's distance
    • These measures quantify the impact of each observation on the regression results (a Cook's distance value greater than 1 suggests an influential point that substantially affects the estimated coefficients)
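Both diagnostics above can be computed by hand. The sketch below (simulated data; variable names and the 0.9 correlation structure are assumptions for illustration) computes the VIF for a predictor by regressing it on the other predictors, and Cook's distance from the residuals and the hat-matrix diagonal:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # x2 nearly collinear with x1
y = 1 + 2 * x1 - x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])  # column 0 is the intercept

def vif(X, j):
    """VIF = 1 / (1 - R²) from regressing predictor column j (j >= 1) on the rest."""
    others = np.delete(X, j, axis=1)
    xj = X[:, j]
    fitted = others @ np.linalg.lstsq(others, xj, rcond=None)[0]
    ss_res = ((xj - fitted) ** 2).sum()
    ss_tot = ((xj - xj.mean()) ** 2).sum()
    return float(ss_tot / ss_res)          # equals 1 / (1 - R²)

# Cook's distance: residual size combined with leverage h from the hat matrix.
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
p = X.shape[1]
H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix H = X (XᵀX)⁻¹ Xᵀ
h = np.diag(H)                             # leverage of each observation
mse = float(resid @ resid) / (n - p)
cooks_d = (resid ** 2 / (p * mse)) * (h / (1 - h) ** 2)
```

With x2 built almost entirely from x1, the VIF far exceeds the rule-of-thumb cutoffs of 5 or 10, and any observation with `cooks_d` above roughly 1 would be flagged as influential.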

Consequences of Violated Assumptions

Impact on Coefficient Estimates and Predictions

  • Violation of linearity can lead to biased estimates of the regression coefficients and inaccurate predictions, especially when extrapolating beyond the range of the observed data (using a linear model to predict sales revenue for advertising expenditure levels much higher than those observed in the data may result in unrealistic estimates)
  • The presence of multicollinearity can lead to unstable and unreliable estimates of the regression coefficients, making it difficult to interpret the individual effects of the predictor variables (the estimated effect of advertising expenditure on sales revenue may change drastically when product price is included or excluded from the model)

Effect on Inference and Model Fit

  • Violation of independence can result in underestimated standard errors and overly narrow confidence intervals, leading to incorrect inferences about the significance of the regression coefficients (declaring a significant effect of advertising expenditure on sales revenue when the apparent precision is an artifact of correlated observations)
  • Violation of homoscedasticity can lead to inefficient estimates of the regression coefficients and biased standard errors, affecting the validity of hypothesis tests and confidence intervals (the true variability in the effect of advertising expenditure on sales revenue may be underestimated or overestimated, depending on the pattern of heteroscedasticity)
  • Outliers and influential points can have a substantial impact on the regression results, potentially distorting the estimates of the coefficients and reducing the overall fit of the model (a single observation with an extremely high sales revenue value can make the relationship between advertising expenditure and sales revenue appear stronger than it actually is)