Fiveable

🧠Thinking Like a Mathematician Unit 5 Review

5.1 Linear models

Written by the Fiveable Content Team • Last updated August 2025

Fundamentals of Linear Models

Linear models describe relationships between variables using straight-line equations. They're the starting point for most statistical analysis because they let you quantify patterns in data, make predictions, and test whether relationships are real or just noise. Once you understand linear models, more advanced techniques build directly on top of them.

Definition and Purpose

A linear model is a mathematical equation that expresses how a dependent variable (the thing you're trying to predict) relates to one or more independent variables (the factors you think influence it). The relationship is "linear" because the dependent variable is modeled as a weighted sum of the independent variables.

These models show up everywhere: economists use them to forecast GDP, biologists use them to study how drug dosage affects response, and social scientists use them to analyze survey data. Their power comes from turning messy real-world relationships into interpretable, testable equations.

Components of a Linear Model

Every linear model has the same core parts:

  • Dependent variable (Y): the outcome you're studying
  • Independent variables (X): the predictors or explanatory factors
  • Coefficients (β): numbers that quantify how much each independent variable affects Y
  • Intercept (β₀): the predicted value of Y when all X variables equal zero
  • Error term (ε): captures the variation your model can't explain (random noise, missing variables, measurement error)

Types of Linear Models

  • Simple linear regression: one independent variable predicts one dependent variable (e.g., predicting test scores from hours studied)
  • Multiple linear regression: two or more independent variables predict a single dependent variable (e.g., predicting house price from square footage, number of bedrooms, and neighborhood)
  • Analysis of variance (ANOVA): compares means across different groups or categories
  • Analysis of covariance (ANCOVA): combines regression and ANOVA by adjusting group comparisons for continuous covariates
  • Hierarchical linear models: handle nested or clustered data, like students within schools

Mathematical Representation

Writing a linear model as an equation lets you move from a vague idea ("these variables are related") to a precise, testable statement. This is abstraction at work: you're representing a complex real-world phenomenon in symbolic form so you can manipulate it mathematically.

Equation of a Linear Model

The general form is:

Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε

  • Y is the dependent variable
  • X₁, X₂, ..., Xₖ are the independent variables
  • β₀ is the intercept
  • β₁, β₂, ..., βₖ are the slope coefficients
  • ε is the error term

For simple linear regression with one predictor, this simplifies to Y = β₀ + β₁X + ε.
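To make this concrete, here's a minimal sketch that computes the simple-regression slope and intercept from the usual closed-form formulas. The hours/scores data are made up for illustration:

```python
import numpy as np

# Hypothetical data: hours studied (X) and test scores (Y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 54.0, 56.0, 58.0, 60.0])

# Closed-form estimates for simple linear regression:
# slope = cov(X, Y) / var(X), intercept = mean(Y) - slope * mean(X)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(slope, intercept)  # -> 2.0 50.0
```

Because the made-up data lie exactly on a line, the estimates recover the slope (2) and intercept (50) exactly.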

Slope and Intercept Interpretation

The slope (βⱼ) tells you: for every one-unit increase in Xⱼ, the predicted value of Y changes by βⱼ units, holding all other variables constant.

  • A positive slope means Y increases as X increases (direct relationship)
  • A negative slope means Y decreases as X increases (inverse relationship)

The intercept (β₀) is the predicted value of Y when every X equals zero. Sometimes this has a meaningful interpretation (e.g., baseline salary with zero years of experience), and sometimes it doesn't (e.g., predicted weight at zero height). Context matters.

When you have multiple predictors, comparing slopes (after standardizing variables to a common scale) helps you gauge which ones have the strongest effect on Y.

Matrix Notation for Linear Models

With many variables, writing out every term gets unwieldy. Matrix notation compresses the entire model into:

Y = Xβ + ε

  • Y is an n × 1 vector of observed outcomes
  • X is an n × (k+1) matrix where each row is one observation and the first column is all ones (for the intercept)
  • β is a (k+1) × 1 vector of coefficients
  • ε is an n × 1 vector of errors

This compact form makes computation much more efficient, especially when fitting models with many predictors.

Model Assumptions

Linear models rely on several assumptions. If these assumptions hold, your estimates and hypothesis tests are trustworthy. If they're violated, your results can be misleading. Checking assumptions isn't optional; it's a core part of responsible modeling.

Linearity Assumption

The relationship between Y and each X must be linear. You can check this by looking at scatter plots of Y against each predictor, or by examining residual plots for curved patterns.

If the relationship is actually curved, you might need to transform your variables (e.g., take the log or add polynomial terms). Ignoring non-linearity leads to biased coefficient estimates and poor predictions.

Independence of Errors

The residuals should be uncorrelated with each other. This assumption is especially important with time series data or clustered observations, where nearby data points tend to be similar.

  • The Durbin-Watson test can detect autocorrelation in residuals
  • Violations understate standard errors and inflate significance, making effects look real when they might not be
  • Fixes include generalized least squares or mixed-effects models
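The Durbin-Watson statistic itself is easy to compute from the residual series. A minimal sketch with made-up residuals (values near 2 suggest no first-order autocorrelation; near 0 means strong positive, near 4 strong negative):

```python
import numpy as np

# Durbin-Watson statistic: DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)
def durbin_watson(resid):
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Alternating signs push DW above 2 (toward 4): negative autocorrelation.
print(durbin_watson([1, -1, 1, -1, 1, -1]))
# A constant residual series has no first differences at all, so DW = 0.
print(durbin_watson([1, 1, 1, 1]))
```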

Homoscedasticity

The spread of residuals should stay roughly constant across all levels of the independent variables. When the spread changes (called heteroscedasticity), your coefficient estimates become inefficient and standard errors become unreliable.

  • Check by plotting residuals against fitted values; look for a "fan" or "funnel" shape
  • Formal tests include the Breusch-Pagan test and White's test
  • Weighted least squares or robust standard errors can address the problem

Normality of Residuals

The residuals should follow a normal distribution. This matters most for hypothesis tests and confidence intervals, especially with small samples.

  • Assess with Q-Q plots, histograms, or formal tests like Shapiro-Wilk
  • With large samples, mild non-normality is less of a concern thanks to the Central Limit Theorem
  • Severe non-normality may require variable transformations or robust regression methods
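As a quick sketch of a formal normality check, assuming SciPy is available (the "residuals" here are simulated, so this only demonstrates the call, not a real diagnosis):

```python
import numpy as np
from scipy import stats

# Simulated residuals standing in for a fitted model's residuals.
rng = np.random.default_rng(0)
resid = rng.normal(size=200)

# Shapiro-Wilk tests H0: the data are normally distributed.
# A small p-value (< 0.05) is evidence against normality.
stat, p = stats.shapiro(resid)
print(stat, p)
```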

Estimation Methods

Estimation is about finding the coefficient values that make your model fit the data as well as possible. Different methods make different trade-offs between simplicity, efficiency, and flexibility.

Ordinary Least Squares (OLS)

OLS is the most common estimation method. It finds the coefficients that minimize the sum of squared residuals, the total squared distance between observed and predicted values.

The closed-form solution is:

β̂ = (X′X)⁻¹X′Y

OLS produces unbiased estimates when the model assumptions are met, and it's computationally efficient for moderately sized datasets. It's the default starting point for linear regression.
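The closed-form OLS solution is a few lines of NumPy. This sketch builds Y from known coefficients with no noise, so OLS recovers them exactly (the data and coefficients are invented for illustration):

```python
import numpy as np

# Design matrix: first column of ones for the intercept, then two predictors.
X = np.array([[1, 1, 2],
              [1, 2, 1],
              [1, 3, 4],
              [1, 4, 3],
              [1, 5, 5]], dtype=float)

# Generate Y from known coefficients (intercept 1, slopes 2 and 3).
beta_true = np.array([1.0, 2.0, 3.0])
Y = X @ beta_true

# Solve the normal equations (X'X) beta = X'Y; solving is numerically
# preferable to forming the explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)  # -> [1. 2. 3.]
```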


Maximum Likelihood Estimation (MLE)

MLE finds the parameter values that make the observed data most probable under a specified distribution (usually normal for linear models).

Under the normality assumption, MLE gives the same estimates as OLS. Where MLE becomes more useful is with generalized linear models and other settings where OLS doesn't directly apply. MLE also provides a natural framework for hypothesis testing and confidence intervals.

Weighted Least Squares (WLS)

WLS is an extension of OLS designed for situations where the variance of errors isn't constant (heteroscedasticity). Instead of treating every observation equally, WLS assigns weights to observations: more reliable observations (lower variance) get higher weight.

This improves the efficiency of your estimates, but you need to know or estimate the variance structure of the errors to assign appropriate weights.
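A bare-bones WLS sketch, with hypothetical weights (each weight plays the role of 1/variance of that observation's error, assumed known here):

```python
import numpy as np

# WLS estimate: beta_hat = (X'WX)^{-1} X'WY, W = diag(weights).
def wls(X, Y, w):
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

X = np.column_stack([np.ones(4), [1.0, 2.0, 3.0, 4.0]])
Y = np.array([2.0, 4.1, 5.9, 8.2])
w = np.array([4.0, 1.0, 1.0, 0.25])   # hypothetical: early observations more precise

print(wls(X, Y, w))
# With all weights equal, WLS reduces to ordinary least squares.
print(wls(X, Y, np.ones(4)))
```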

Model Evaluation

After fitting a model, you need to assess how well it actually works. These metrics help you judge model quality and compare competing models.

Coefficient of Determination (R-squared)

R² measures the proportion of variance in Y that your model explains:

R² = 1 − SS_res / SS_tot

  • SS_res is the sum of squared residuals (unexplained variation)
  • SS_tot is the total sum of squares (total variation in Y)

R² ranges from 0 to 1. An R² of 0.85 means your model explains 85% of the variation in Y. Be cautious, though: R² always increases when you add more predictors, even useless ones.

Adjusted R-squared

Adjusted R² fixes the problem above by penalizing for extra predictors:

R²_adj = 1 − (1 − R²)(n − 1) / (n − k − 1)

Here n is the number of observations and k is the number of predictors. Unlike regular R², adjusted R² can decrease if you add a variable that doesn't improve the model enough to justify its inclusion. This makes it better for comparing models with different numbers of predictors.
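Both quantities fall out directly from a fitted model's residuals; a short sketch on invented, nearly linear data:

```python
import numpy as np

# Hypothetical data: intercept column plus one predictor.
X = np.column_stack([np.ones(6), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]])
Y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.9])

beta = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta
ss_res = np.sum(resid ** 2)                    # unexplained variation
ss_tot = np.sum((Y - Y.mean()) ** 2)           # total variation in Y

n, k = len(Y), X.shape[1] - 1                  # k excludes the intercept
r2 = 1 - ss_res / ss_tot
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2, 4), round(r2_adj, 4))
```

Note that adjusted R² is never larger than R²; the gap widens as you add predictors that don't pull their weight.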

F-statistic and p-value

The F-statistic tests whether your model as a whole explains significantly more variance than a model with no predictors (intercept only). It's calculated as the ratio of explained variance to unexplained variance.

  • A large F-statistic means the model explains much more than expected by chance
  • The associated p-value tells you the probability of getting that F-statistic if the model had no real explanatory power
  • A p-value below 0.05 is the conventional threshold for concluding the model is statistically significant

Hypothesis Testing

Hypothesis testing lets you make formal claims about whether relationships in your model are real or could have arisen by chance.

T-tests for Coefficients

To test whether a single predictor has a significant effect, you use a t-test. The null hypothesis is that the coefficient equals zero (no effect).

The t-statistic is:

t = β̂ / SE(β̂)

where β̂ is the estimated coefficient and SE(β̂) is its standard error. A large absolute t-value (and correspondingly small p-value) suggests the predictor has a real effect on Y.

Confidence Intervals

A confidence interval gives you a range of plausible values for a coefficient, not just a single point estimate. At the 95% level:

CI = β̂ ± t(α/2, n−k−1) × SE(β̂)

where t(α/2, n−k−1) is the critical t-value for your chosen confidence level and degrees of freedom.

  • Wider intervals mean less precision
  • If a 95% CI for a coefficient doesn't include zero, that coefficient is significant at the 0.05 level
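A sketch that computes the coefficient standard errors, t-statistics, and 95% intervals by hand (assuming SciPy for the critical t-value; the data are invented, roughly Y ≈ 2X):

```python
import numpy as np
from scipy import stats

X = np.column_stack([np.ones(8), [1.0, 2, 3, 4, 5, 6, 7, 8]])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

n, p = X.shape                             # p = k + 1 parameters
beta = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta
mse = resid @ resid / (n - p)              # estimate of the error variance
se = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))

t_stats = beta / se                        # H0: coefficient = 0
t_crit = stats.t.ppf(0.975, df=n - p)      # two-sided 95% critical value
ci_low = beta - t_crit * se
ci_high = beta + t_crit * se
print(t_stats)
print(list(zip(ci_low, ci_high)))
```

Since the data were built around a slope of about 2, the slope's 95% interval comfortably contains 2 and its t-statistic is large.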

ANOVA for Linear Models

ANOVA decomposes total variability into explained (by the model) and unexplained (residual) components:

F = MS_model / MS_residual

where MS stands for mean square (sum of squares divided by degrees of freedom). ANOVA is particularly useful for testing whether groups of predictors jointly matter, or for comparing nested models. It's the standard tool in experimental designs with multiple treatment groups.

Diagnostics and Residual Analysis

Fitting a model is only half the job. Diagnostics tell you whether the model is trustworthy and where it might be going wrong.

Residual Plots

Residual plots are your primary diagnostic tool. The most useful ones:

  • Residuals vs. fitted values: checks linearity and homoscedasticity. You want a random scatter with no pattern.
  • Normal Q-Q plot: checks normality. Points should fall roughly along the diagonal line.
  • Residuals vs. leverage: helps spot influential observations.
  • Scale-location plot: plots √|standardized residuals| against fitted values to check for changing variance.

Any systematic pattern in these plots signals a potential assumption violation.

Leverage and Influence

Leverage measures how far an observation's predictor values are from the average. High-leverage points have unusual X values and can pull the regression line toward them.

Leverage for observation ii comes from the hat matrix diagonal:

hᵢᵢ = Xᵢ(X′X)⁻¹Xᵢ′

A common rule of thumb: flag points where hᵢᵢ > 2(k+1)/n.

Influence combines leverage with the size of the residual. Cook's distance is the standard measure:

Dᵢ = (Ŷᵢ − Yᵢ)² / ((k+1) · MSE) × hᵢᵢ / (1 − hᵢᵢ)²

A single high-influence point can substantially change your coefficient estimates, so these observations deserve close inspection.
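Both quantities are easy to compute by hand. This sketch uses an invented dataset whose last observation has a deliberately unusual X value, so the leverage rule of thumb flags it:

```python
import numpy as np

# Hypothetical data; the last observation sits far out in X.
X = np.column_stack([np.ones(6), [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]])
Y = np.array([1.1, 2.0, 2.9, 4.2, 5.0, 8.0])

n, p = X.shape                                  # p = k + 1 parameters
H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
h = np.diag(H)                                  # leverages h_ii

beta = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta
mse = resid @ resid / (n - p)
cooks_d = (resid ** 2 / (p * mse)) * (h / (1 - h) ** 2)

flagged = h > 2 * p / n                         # rule of thumb: h_ii > 2(k+1)/n
print(np.round(h, 3), flagged)
```

A useful sanity check: the leverages always sum to the number of parameters (the trace of the hat matrix), so no dataset can make every point low-leverage.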

Outlier Detection

Outliers are observations that don't fit the overall pattern. Several approaches exist:

  • Standardized residuals: residuals divided by their standard deviation. Values beyond ±2 or ±3 are suspicious.
  • Studentized residuals: adjust for the fact that each observation's predicted value has different precision. These are generally more reliable than standardized residuals.
  • Bonferroni outlier test: adjusts for multiple comparisons when you're checking many observations at once.

Before removing an outlier, consider whether it reflects a data entry error, a genuinely unusual case, or a sign that your model is misspecified.


Model Selection

The goal of model selection is parsimony: find the simplest model that adequately explains the data. Adding more predictors always improves in-sample fit, but it can hurt prediction on new data (overfitting).

Stepwise Regression

Stepwise regression automates predictor selection by iteratively adding or removing variables based on statistical criteria.

  1. Forward stepwise: start with no predictors. At each step, add the variable that improves fit the most. Stop when no remaining variable meets the entry criterion.
  2. Backward stepwise: start with all predictors. At each step, remove the least significant variable. Stop when all remaining variables meet the retention criterion.
  3. Bidirectional stepwise: combines both approaches, allowing variables to be added or removed at each step.

Forward vs. Backward Selection

Forward and backward selection can produce different final models because of their sequential nature. Forward selection might miss a variable that's only useful in combination with another, while backward selection starts with every predictor present, so jointly useful variables get evaluated together.

Neither method guarantees finding the globally best subset of predictors. They're useful heuristics, but for small numbers of candidate variables, best-subsets regression (testing all possible combinations) is more thorough.

Information Criteria (AIC, BIC)

Information criteria balance fit against complexity using a single number. Lower values are better.

  • AIC (Akaike Information Criterion): AIC = 2k − 2·ln(L)
    • k is the number of parameters, L is the maximized likelihood
    • Penalizes complexity moderately
  • BIC (Bayesian Information Criterion): BIC = k·ln(n) − 2·ln(L)
    • n is the number of observations
    • Penalizes complexity more heavily than AIC, so it tends to select simpler models

Both criteria can compare non-nested models (unlike the F-test), making them versatile tools for model selection.
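For a Gaussian linear model, both criteria can be computed from the residual sum of squares, since the maximized log-likelihood is −(n/2)(ln(2π·RSS/n) + 1). A sketch with invented numbers, where a second model adds a parameter but barely improves fit:

```python
import numpy as np

# AIC/BIC for a Gaussian linear model from its residual sum of squares.
# Here k counts the regression parameters (some texts also count the variance).
def gaussian_ic(rss, n, k):
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    aic = 2 * k - 2 * loglik
    bic = k * np.log(n) - 2 * loglik
    return aic, bic

# Model B adds one parameter but only trims RSS from 10.0 to 9.9,
# so both criteria should prefer the simpler model A (lower is better).
aic_a, bic_a = gaussian_ic(rss=10.0, n=50, k=2)
aic_b, bic_b = gaussian_ic(rss=9.9, n=50, k=3)
print(aic_a < aic_b, bic_a < bic_b)
```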

Multicollinearity

Multicollinearity occurs when independent variables are highly correlated with each other. It doesn't ruin your model's overall predictions, but it makes individual coefficient estimates unreliable and hard to interpret.

Causes and Consequences

Multicollinearity is common in observational studies or when predictors measure overlapping constructs (e.g., including both "years of education" and "highest degree earned").

The consequences:

  • Coefficient estimates become unstable: small changes in the data produce large swings in estimated values
  • Standard errors inflate, making it harder to detect significant effects
  • Coefficients may have counterintuitive signs (e.g., a variable you expect to be positive shows up negative)
  • Overall model fit (R², predictions) is largely unaffected

Variance Inflation Factor (VIF)

VIF quantifies how much a predictor's variance is inflated due to correlation with other predictors:

VIFⱼ = 1 / (1 − R²ⱼ)

Here R²ⱼ comes from regressing the j-th predictor on all the other predictors. A VIF of 1 means no collinearity. VIF above 5 is a warning sign; above 10 is generally considered serious.

Use VIF to identify which variables are causing the problem, then decide whether to drop one, combine correlated predictors, or use a regularization technique.
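The definition translates directly into code: run the auxiliary regression for each predictor and invert 1 − R²ⱼ. This sketch uses an invented design where one column is almost an exact copy of another, so both get huge VIFs:

```python
import numpy as np

# VIF_j = 1 / (1 - R^2_j), where R^2_j regresses predictor j on the others.
def vif(X):
    """X: n x k matrix of predictors (no intercept column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

x1 = np.array([1.0, 2, 3, 4, 5, 6])
x2 = 2 * x1 + np.array([0.01, -0.02, 0.02, -0.01, 0.01, -0.01])  # near-copy of x1
x3 = np.array([3.0, 1, 4, 1, 5, 9])                              # a third predictor

v = vif(np.column_stack([x1, x2, x3]))
print(np.round(v, 1))   # first two VIFs are enormous
```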

Ridge Regression

Ridge regression addresses multicollinearity by adding a penalty that shrinks coefficients toward zero. It minimizes:

Σᵢ₌₁ⁿ (yᵢ − β₀ − Σⱼ₌₁ᵖ βⱼxᵢⱼ)² + λ Σⱼ₌₁ᵖ βⱼ²

The parameter λ controls the penalty strength. As λ increases, coefficients shrink more. This introduces a small amount of bias but can dramatically reduce variance, improving prediction accuracy.

Choosing the right λ is typically done through cross-validation: you test different values and pick the one that minimizes prediction error on held-out data.
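Ridge also has a closed form: β̂ = (X′X + λI)⁻¹X′Y. A bare-bones sketch (invented data with two highly correlated columns; in practice you would center the predictors and leave the intercept unpenalized):

```python
import numpy as np

# Ridge closed form: beta_hat = (X'X + lambda * I)^{-1} X'Y.
def ridge(X, Y, lam):
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ Y)

X = np.array([[1.0, 1.0], [2.0, 2.1], [3.0, 2.9], [4.0, 4.2]])  # nearly collinear
Y = np.array([2.0, 4.1, 6.1, 8.0])

for lam in [0.0, 0.1, 10.0]:
    print(lam, np.round(ridge(X, Y, lam), 3))
# As lambda grows, the coefficient vector shrinks toward zero.
```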

Generalized Linear Models

Generalized linear models (GLMs) extend the linear modeling framework to handle response variables that aren't normally distributed. They're essential for modeling binary outcomes, counts, and other non-continuous data.

Logistic Regression

Logistic regression models binary outcomes (yes/no, success/failure) by predicting the probability of an event. The logit link function transforms the probability to a linear scale:

log(p / (1 − p)) = β₀ + β₁X₁ + ... + βₖXₖ

Here p is the probability of the event occurring, and p / (1 − p) is the odds. Coefficients are interpreted as changes in log-odds for a one-unit increase in the predictor. Exponentiating a coefficient gives you the odds ratio.

Logistic regression is widely used for classification problems in medicine (disease diagnosis), marketing (purchase prediction), and finance (credit risk).
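A teaching sketch of a logistic fit, using plain gradient ascent on the log-likelihood (real libraries use Newton/IRLS and converge much faster; the pass/fail data are invented):

```python
import numpy as np

# Fit logistic regression by gradient ascent on the log-likelihood.
def fit_logistic(X, y, lr=0.1, steps=5000):
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ beta))       # predicted probabilities
        beta += lr * X.T @ (y - p) / len(y)   # log-likelihood gradient step
    return beta

# Hypothetical data: pass/fail vs hours studied, with an intercept column.
hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
X = np.column_stack([np.ones_like(hours), hours])

beta = fit_logistic(X, passed)
odds_ratio = np.exp(beta[1])   # multiplicative change in odds per extra hour
print(np.round(beta, 2), round(odds_ratio, 2))
```

More hours studied should raise the odds of passing, so the fitted slope is positive and the odds ratio exceeds 1.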

Poisson Regression

Poisson regression models count data (number of accidents, number of species observed, number of customer complaints). It uses the log link:

log(μ) = β₀ + β₁X₁ + ... + βₖXₖ

where μ is the expected count. The Poisson distribution assumes the mean equals the variance. When this assumption is violated (overdispersion), you may need a negative binomial model instead.

Exponentiating a coefficient gives you the multiplicative change in the expected count for a one-unit increase in the predictor.

Link Functions

The link function connects the linear predictor to the expected value of the response. Different response types call for different links:

  • Identity link (linear regression): g(μ) = μ
  • Logit link (logistic regression): g(μ) = log(μ / (1 − μ))
  • Log link (Poisson regression): g(μ) = log(μ)
  • Probit link: g(μ) = Φ⁻¹(μ), where Φ is the standard normal CDF

The choice depends on the nature of your response variable and the assumptions you're willing to make.

Applications and Extensions

Linear models form the basis for a wide range of specialized techniques. These extensions adapt the core framework to handle more complex data structures and relationships.

Time Series Regression

Time series regression analyzes data collected over time to identify trends, seasonality, and other temporal patterns. Standard linear regression assumes independent errors, but time series data typically has serial correlation (today's value depends on yesterday's).

  • ARIMA models incorporate autoregressive (AR) and moving average (MA) components to handle this dependence
  • Differencing can address non-stationarity (when the mean or variance changes over time)
  • Lag structures let you model how past values of predictors affect the current outcome

These models are used for economic forecasting, demand planning, and studying how effects unfold over time.

Panel Data Models

Panel data combines cross-sectional and time series dimensions (e.g., tracking 50 countries over 20 years). This structure lets you control for factors that are hard to measure directly.

  • Fixed effects models account for unobserved, time-invariant differences between units (e.g., cultural factors that differ across countries but don't change over time)
  • Random effects models assume unit-specific effects are uncorrelated with the predictors, which is more efficient but a stronger assumption
  • The Hausman test helps you choose between the two

Panel models are standard in economics and sociology for studying how policies or conditions affect outcomes over time.

Nonlinear Transformations

When the true relationship between variables is curved, you can often still use the linear modeling framework by transforming variables:

  • Polynomial regression: add X², X³, etc. as predictors to capture curvature
  • Log transformations: linearize exponential or multiplicative relationships (e.g., log(Y) = β₀ + β₁X)
  • Box-Cox transformations: a family of power transformations that includes log as a special case
  • Spline functions: fit piecewise polynomials that join smoothly at specified points (knots)
  • Generalized additive models (GAMs): replace linear terms with smooth, flexible functions of each predictor

These approaches let you model complex, nonlinear patterns while keeping the interpretability advantages of the linear framework.
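Polynomial regression in particular is just OLS on an expanded design matrix. A minimal sketch, using an exact quadratic so the known coefficients are recovered:

```python
import numpy as np

# Fit a quadratic by adding an X^2 column and running ordinary OLS.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1 + 2 * x + 3 * x ** 2        # exact quadratic: coefficients (1, 2, 3)

X = np.column_stack([np.ones_like(x), x, x ** 2])   # [1, X, X^2]
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)   # -> approximately [1. 2. 3.]
```

The model is still "linear" in the statistical sense because it is linear in the coefficients, even though the fitted curve is not a straight line.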