Linear regression is the workhorse of predictive modeling, but its power comes with strings attached. When you fit a regression model, you're not just drawing a line through points—you're making implicit mathematical promises about how your data behaves. Violate these assumptions, and your coefficient estimates become biased, your standard errors mislead you, and your p-values lie. In data science interviews and applied statistics exams, you're being tested on whether you understand why these assumptions matter, not just what they are.
The assumptions fall into distinct categories: some protect the validity of your estimates (linearity, no multicollinearity), others ensure your inference is trustworthy (independence, homoscedasticity, normality), and still others guard against unstable or misleading results (outliers, sample size). Don't just memorize the list—know which assumption you'd check first for a given problem, what diagnostic tool you'd use, and what remediation you'd apply when things go wrong.
The first group of assumptions (linearity and no multicollinearity) ensures that your regression coefficients actually represent the true relationships in your data. Violate them, and your coefficient estimates become systematically wrong.
Compare: Linearity vs. Multicollinearity—both corrupt your coefficient estimates, but linearity violations introduce bias (systematically wrong estimates), while multicollinearity inflates variance (unstable, imprecise estimates). If asked to prioritize diagnostics, check linearity first, since bias can't be fixed by collecting more data.
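A minimal sketch of the two diagnostics side by side, using synthetic data and hypothetical column names (`x1`, `x2`, `y`): a residuals-vs-fitted check screens for linearity violations, and variance inflation factors (VIF) screen for multicollinearity.

```python
# Sketch only: synthetic data, hypothetical variable names.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 2 + 1.5 * df["x1"] - 0.8 * df["x2"] + rng.normal(size=200)

X = sm.add_constant(df[["x1", "x2"]])
model = sm.OLS(df["y"], X).fit()

# Linearity check: residuals vs. fitted values should show no systematic curve.
residuals, fitted = model.resid, model.fittedvalues
# (Plot residuals against fitted with your plotting library of choice.)

# Multicollinearity check: VIF for each predictor (skip the constant column).
vif = {col: variance_inflation_factor(X.values, i)
       for i, col in enumerate(X.columns) if col != "const"}
print(vif)  # values above ~5-10 are a common cause for concern
```

A curved pattern in the residual plot signals bias from a misspecified functional form; large VIFs signal inflated coefficient variance even when the overall fit looks good.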
The second group (independence, homoscedasticity, and normality) ensures that your confidence intervals, hypothesis tests, and p-values are mathematically valid. Your estimates might still be unbiased without them, but you can't trust your uncertainty quantification.
Compare: Independence vs. Homoscedasticity—both affect your standard errors, but independence violations (autocorrelation) typically require model restructuring or time-series methods, while homoscedasticity violations can often be addressed with weighted least squares or robust standard errors. Know which remediation matches which violation.
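As a rough illustration of how these remediations pair with their violations, the sketch below (synthetic data engineered to be heteroscedastic; names are illustrative) runs Durbin-Watson for independence, Breusch-Pagan for constant variance, and then swaps in HC3 robust standard errors as one fix for heteroscedasticity.

```python
# Sketch only: synthetic data that deliberately violates homoscedasticity.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=300)
y = 1.0 + 0.5 * x + rng.normal(scale=0.2 * x + 0.1, size=300)  # noise grows with x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()

# Independence: a Durbin-Watson statistic near 2 suggests little first-order autocorrelation.
print("Durbin-Watson:", durbin_watson(ols.resid))

# Homoscedasticity: a small Breusch-Pagan p-value indicates heteroscedasticity.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols.resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)

# One remediation for heteroscedasticity: robust (HC3) standard errors.
robust = ols.get_robustcov_results(cov_type="HC3")
print("Robust SEs:", robust.bse)
```

If the Durbin-Watson statistic had been far from 2 instead, robust standard errors alone would not be enough; you'd restructure the model with time-series methods rather than applying a variance fix.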
The remaining checks (outliers and adequate sample size) protect against results that are technically valid but practically unreliable: models that would look completely different with slightly different data.
Compare: Outliers vs. Sample Size—both threaten model stability, but outliers are a data quality issue (identify and decide whether to remove), while insufficient sample size is a study design issue (collect more data or reduce model complexity). In practice, you often face both simultaneously.
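For the outlier side of that trade-off, here is an illustrative sketch of the standard influence metrics (leverage and Cook's distance) with one planted influential point; the flagging thresholds are common rules of thumb, not hard cutoffs.

```python
# Sketch only: synthetic data with one deliberately influential observation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 3 + 2 * x + rng.normal(size=100)
x[0], y[0] = 8.0, -20.0  # high-leverage point far from the trend

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
influence = fit.get_influence()

leverage = influence.hat_matrix_diag   # hat values h_ii
cooks_d = influence.cooks_distance[0]  # Cook's distance per observation

n, p = X.shape
flagged = np.where((leverage > 2 * p / n) | (cooks_d > 4 / n))[0]
print("Potentially influential observations:", flagged)
```

Flagging a point is the easy part; deciding whether to remove, downweight, or keep it is the data-quality judgment described above.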
| Concept | Best Examples |
|---|---|
| Bias in estimates | Linearity violation, omitted variable bias |
| Variance inflation | Multicollinearity, small sample size |
| Invalid inference | Independence violation, heteroscedasticity |
| Diagnostic plots | Residual vs. fitted (linearity, homoscedasticity), Q-Q plot (normality) |
| Formal tests | Durbin-Watson (independence), Breusch-Pagan (homoscedasticity), Shapiro-Wilk (normality) |
| Influence metrics | Leverage, Cook's distance, DFBETAS |
| Remediation tools | Transformations, weighted least squares (see the sketch after this table), robust standard errors, regularization |
| Sample size rules | 10-15 observations per predictor, Central Limit Theorem kicks in around n=30 |
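To make the remediation row concrete, the sketch below contrasts ordinary least squares with weighted least squares when the error variance grows with a predictor. The data are synthetic, and the 1/x² weighting is an assumption chosen to match how the noise was generated, not a general recipe.

```python
# Sketch only: weights are assumed proportional to 1 / Var(error).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=200)
y = 2 + 0.7 * x + rng.normal(scale=0.3 * x, size=200)  # error SD proportional to x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()  # weight each point by 1 / variance

print("OLS standard errors:", ols.bse.round(3))
print("WLS standard errors:", wls.bse.round(3))  # typically smaller when the weights are right
```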
Which two assumptions, if violated, cause your coefficient estimates to be biased rather than just imprecise? What's the key difference between them?
You plot residuals against fitted values and see a clear funnel shape (variance increasing with fitted values). Which assumption is violated, what test would you run to confirm, and what's your primary remediation strategy?
Compare and contrast the consequences of violating independence versus violating homoscedasticity. Both affect standard errors—how do their remediation strategies differ?
A colleague argues that normality of residuals doesn't matter for their dataset of 10,000 observations. Are they correct? Explain the role of sample size in this assumption.
You compute VIF values and find that two predictors have VIF > 15. Your model's R² is high and predictions are accurate. Should you still be concerned? What specific problem does multicollinearity cause even when predictions are good?