🧮 Data Science Numerical Analysis

Linear Regression Assumptions


Why This Matters

Linear regression is the workhorse of predictive modeling, but its power comes with strings attached. When you fit a regression model, you're not just drawing a line through points—you're making implicit mathematical promises about how your data behaves. Violate these assumptions, and your coefficient estimates become biased, your standard errors mislead you, and your p-values lie. In data science interviews and applied statistics exams, you're being tested on whether you understand why these assumptions matter, not just what they are.

The assumptions fall into distinct categories: some protect the validity of your estimates (linearity, no multicollinearity), others ensure your inference is trustworthy (independence, homoscedasticity, normality), and still others guard against unstable or misleading results (outliers, sample size). Don't just memorize the list—know which assumption you'd check first for a given problem, what diagnostic tool you'd use, and what remediation you'd apply when things go wrong.


Assumptions That Protect Your Estimates

These assumptions ensure that your regression coefficients actually represent the true relationships in your data. Violate them, and your $\hat{\beta}$ values become systematically wrong.

Linearity

  • The relationship between predictors and response must be linear in parameters—this means $E[Y \mid X] = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$ must hold
  • Scatter plots and residual-vs-fitted plots are your first diagnostic tools; look for curves or patterns that suggest missed nonlinearity (a sketch of this check follows the list)
  • Violation leads to biased estimates—your model systematically over- or under-predicts in certain regions, which no amount of data can fix
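
A quick way to run this check is to plot residuals against fitted values after the fit. The sketch below, assuming Python with statsmodels and matplotlib, fits a straight line to deliberately curved synthetic data (the quadratic true relationship and all variable names are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
# The true relationship is quadratic, but we will fit a straight line anyway
y = 2.0 + 1.5 * x + 0.3 * x**2 + rng.normal(scale=2.0, size=200)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color="red", lw=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("A curved band of residuals signals missed nonlinearity")
plt.show()
```

If the points scatter randomly around zero, linearity looks reasonable; a systematic arch or U-shape means the mean function is misspecified.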

No Multicollinearity

  • Independent variables should not be highly correlated with each other—when they are, the model can't distinguish their individual effects
  • Variance Inflation Factor (VIF) quantifies the severity; VIF > 10 is a common red flag, though some use VIF > 5 (a computation sketch follows this list)
  • Inflated standard errors make coefficients appear insignificant even when predictors matter—you lose statistical power without losing predictive accuracy
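
The VIF check itself is only a few lines. Below is a minimal sketch, assuming Python with pandas and statsmodels; the predictors are synthetic, with x2 deliberately built as a near-copy of x1 so the collinearity shows up:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)   # nearly a copy of x1
x3 = rng.normal(size=300)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
).drop("const")                              # the intercept's VIF is not meaningful

print(vif)   # x1 and x2 should show very large VIFs; x3 should stay near 1
```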

Compare: Linearity vs. Multicollinearity—both corrupt your coefficient estimates, but linearity violations cause bias (systematically wrong), while multicollinearity causes variance (unstable and imprecise). If asked to prioritize diagnostics, check linearity first since bias can't be fixed by collecting more data.


Assumptions That Validate Your Inference

These assumptions ensure that your confidence intervals, hypothesis tests, and p-values are mathematically valid. Your estimates might still be unbiased without them, but you can't trust your uncertainty quantification.

Independence of Errors

  • Residuals must be uncorrelated with each other—formally, $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$
  • Autocorrelation in time series data is the classic violation; the Durbin-Watson test detects first-order autocorrelation (sketched in code after this list)
  • Consequences include invalid standard errors—your confidence intervals become too narrow, and you'll reject null hypotheses too often
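
A small sketch of the Durbin-Watson check, assuming Python with statsmodels; the AR(1) error process below is constructed specifically to trigger positive autocorrelation:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
n = 200
t = np.arange(n)

# AR(1) errors: each error carries over 80% of the previous one
e = np.zeros(n)
for i in range(1, n):
    e[i] = 0.8 * e[i - 1] + rng.normal()
y = 1.0 + 0.05 * t + e

fit = sm.OLS(y, sm.add_constant(t)).fit()
print(f"Durbin-Watson: {durbin_watson(fit.resid):.2f}")
```

The statistic ranges from 0 to 4; values near 2 suggest no first-order autocorrelation, while values well below 2 (as here) indicate positive autocorrelation.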

Homoscedasticity

  • Residual variance must be constant across all fitted values—mathematically, $\mathrm{Var}(\epsilon_i) = \sigma^2$ for all observations
  • Heteroscedasticity (non-constant variance) often appears as a "funnel shape" in residual plots; Breusch-Pagan and White tests provide formal detection (a Breusch-Pagan sketch follows this list)
  • Standard errors become inefficient, meaning ordinary least squares no longer gives you the best linear unbiased estimator (BLUE)
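
A sketch of the Breusch-Pagan test, assuming Python with statsmodels; the errors are generated with a spread that grows with x, so the funnel shape is built in:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=300)
# Error standard deviation grows with x -> heteroscedastic by construction
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)

fit = sm.OLS(y, sm.add_constant(x)).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, fit.model.exog)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")  # small p-value -> reject constant variance
```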

Normality of Residuals

  • Residuals should follow $\epsilon \sim N(0, \sigma^2)$ for valid hypothesis tests and confidence intervals
  • Q-Q plots and Shapiro-Wilk tests are standard diagnostics (both sketched after this list); this assumption matters most for small samples, since with large samples the Central Limit Theorem makes coefficient inference robust to non-normal residuals
  • Large samples (n > 30-50) are forgiving—coefficient estimates remain valid even with non-normal residuals, though prediction intervals suffer
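
A sketch of both normality diagnostics, assuming Python with statsmodels, SciPy, and matplotlib; the small sample and skewed (exponential) errors are chosen on purpose so the violation is visible:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.uniform(0, 5, size=40)                            # small sample on purpose
y = 1.0 + 2.0 * x + rng.exponential(scale=1.0, size=40)   # right-skewed errors

fit = sm.OLS(y, sm.add_constant(x)).fit()

w_stat, p_value = stats.shapiro(fit.resid)
print(f"Shapiro-Wilk p-value: {p_value:.4f}")   # small p -> residuals look non-normal

sm.qqplot(fit.resid, line="45", fit=True)       # points should hug the 45-degree line
plt.show()
```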

Compare: Independence vs. Homoscedasticity—both affect your standard errors, but independence violations (autocorrelation) typically require model restructuring or time-series methods, while homoscedasticity violations can often be addressed with weighted least squares or robust standard errors. Know which remediation matches which violation.
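
To make the homoscedasticity side of that contrast concrete, here is a sketch (assuming Python with statsmodels, on synthetic heteroscedastic data) of the two remediations named above: heteroscedasticity-robust (HC3) standard errors, which keep the OLS coefficients but correct the uncertainty, and weighted least squares, which reweights observations by the inverse of their error variance:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=300)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)   # error spread grows with x

X = sm.add_constant(x)

# Option 1: keep the OLS fit, swap in heteroscedasticity-robust (HC3) standard errors
robust_fit = sm.OLS(y, X).fit(cov_type="HC3")

# Option 2: weighted least squares, weighting each point by 1 / variance
# (the variance structure is known here by construction; in practice it is estimated)
wls_fit = sm.WLS(y, X, weights=1.0 / (0.5 * x) ** 2).fit()

print("Robust (HC3) standard errors:", robust_fit.bse)
print("WLS standard errors:         ", wls_fit.bse)
```

Neither fix addresses autocorrelation; that violation calls for the model restructuring or time-series methods noted above.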


Assumptions That Ensure Stability

These assumptions protect against results that are technically valid but practically unreliable—models that would look completely different with slightly different data.

No Outliers or Influential Points

  • Outliers disproportionately pull the regression line—a single extreme point can flip the sign of a coefficient
  • Leverage measures how far an observation's predictors are from the mean; Cook's distance combines leverage with residual size to identify influential points (see the sketch after this list)
  • High-leverage points aren't automatically problems—only those with large residuals (influential points) distort your model
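
A sketch of flagging influential points with leverage and Cook's distance, assuming Python with statsmodels; one high-leverage, large-residual point is planted on purpose, and the 4/n cutoff is a common rule of thumb rather than a hard rule:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)
x[0], y[0] = 6.0, -10.0          # plant one high-leverage point with a large residual

fit = sm.OLS(y, sm.add_constant(x)).fit()
influence = fit.get_influence()

leverage = influence.hat_matrix_diag
cooks_d = influence.cooks_distance[0]        # first element holds the distance values

flagged = np.where(cooks_d > 4 / len(x))[0]  # rule of thumb: D_i > 4/n
print("Influential observations:", flagged)  # should include index 0
```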

Large Sample Size Relative to Predictors

  • The rule of thumb is 10-15 observations per predictor variable—fewer observations lead to overfitting and unstable estimates
  • Small samples inflate variance in coefficient estimates, making your model highly sensitive to which specific data points you happened to collect
  • Generalization suffers because the model memorizes noise rather than learning true patterns; this is why regularization methods exist (a small demonstration follows this list)
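
A small demonstration of the sample-size point, assuming Python with scikit-learn; all sizes and names below are invented for illustration. With only about 3 observations per predictor, plain OLS tends to chase noise, and a ridge penalty (one of the regularization methods referenced above) will usually, though not always, generalize better on held-out data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)
n_train, n_test, p = 30, 200, 10      # ~3 observations per predictor
X_train = rng.normal(size=(n_train, p))
X_test = rng.normal(size=(n_test, p))

beta = np.zeros(p)
beta[0] = 2.0                         # only one predictor truly matters
y_train = X_train @ beta + rng.normal(size=n_train)
y_test = X_test @ beta + rng.normal(size=n_test)

ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=5.0).fit(X_train, y_train)

print("OLS   test R^2:", r2_score(y_test, ols.predict(X_test)))
print("Ridge test R^2:", r2_score(y_test, ridge.predict(X_test)))
```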

Compare: Outliers vs. Sample Size—both threaten model stability, but outliers are a data quality issue (identify and decide whether to remove), while insufficient sample size is a study design issue (collect more data or reduce model complexity). In practice, you often face both simultaneously.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Bias in estimates | Linearity violation, omitted variable bias |
| Variance inflation | Multicollinearity, small sample size |
| Invalid inference | Independence violation, heteroscedasticity |
| Diagnostic plots | Residual vs. fitted (linearity, homoscedasticity), Q-Q plot (normality) |
| Formal tests | Durbin-Watson (independence), Breusch-Pagan (homoscedasticity), Shapiro-Wilk (normality) |
| Influence metrics | Leverage, Cook's distance, DFBETAS |
| Remediation tools | Transformations, weighted least squares, robust standard errors, regularization |
| Sample size rules | 10-15 observations per predictor; Central Limit Theorem kicks in around n = 30 |

Self-Check Questions

  1. Which two assumptions, if violated, cause your coefficient estimates to be biased rather than just imprecise? What's the key difference between them?

  2. You plot residuals against fitted values and see a clear funnel shape (variance increasing with fitted values). Which assumption is violated, what test would you run to confirm, and what's your primary remediation strategy?

  3. Compare and contrast the consequences of violating independence versus violating homoscedasticity. Both affect standard errors—how do their remediation strategies differ?

  4. A colleague argues that normality of residuals doesn't matter for their dataset of 10,000 observations. Are they correct? Explain the role of sample size in this assumption.

  5. You compute VIF values and find that two predictors have VIF > 15. Your model's $R^2$ is high and predictions are accurate. Should you still be concerned? What specific problem does multicollinearity cause even when predictions are good?