Fiveable

🧠Thinking Like a Mathematician Unit 5 Review

5.1 Linear models

Written by the Fiveable Content Team • Last updated August 2025

Fundamentals of Linear Models

Linear models describe relationships between variables using straight-line equations. They're the starting point for most statistical analysis because they let you quantify patterns in data, make predictions, and test whether relationships are real or just noise. Once you understand linear models, more advanced techniques build directly on top of them.

Definition and Purpose

A linear model is a mathematical equation that expresses how a dependent variable (the thing you're trying to predict) relates to one or more independent variables (the factors you think influence it). The relationship is "linear" because the dependent variable is modeled as a weighted sum of the independent variables.

These models show up everywhere: economists use them to forecast GDP, biologists use them to study how drug dosage affects response, and social scientists use them to analyze survey data. Their power comes from turning messy real-world relationships into interpretable, testable equations.

Components of a Linear Model

Every linear model has the same core parts:

  • Dependent variable (Y): the outcome you're studying
  • Independent variables (X): the predictors or explanatory factors
  • Coefficients (β): numbers that quantify how much each independent variable affects Y
  • Intercept (β₀): the predicted value of Y when all X variables equal zero
  • Error term (ε): captures the variation your model can't explain (random noise, missing variables, measurement error)

Types of Linear Models

  • Simple linear regression: one independent variable predicts one dependent variable (e.g., predicting test scores from hours studied)
  • Multiple linear regression: two or more independent variables predict a single dependent variable (e.g., predicting house price from square footage, number of bedrooms, and neighborhood)
  • Analysis of variance (ANOVA): compares means across different groups or categories
  • Analysis of covariance (ANCOVA): combines regression and ANOVA by adjusting group comparisons for continuous covariates
  • Hierarchical linear models: handle nested or clustered data, like students within schools

Mathematical Representation

Writing a linear model as an equation lets you move from a vague idea ("these variables are related") to a precise, testable statement. This is abstraction at work: you're representing a complex real-world phenomenon in symbolic form so you can manipulate it mathematically.

Equation of a Linear Model

The general form is:

Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε

  • Y is the dependent variable
  • X₁, X₂, ..., Xₖ are the independent variables
  • β₀ is the intercept
  • β₁, β₂, ..., βₖ are the slope coefficients
  • ε is the error term

For simple linear regression with one predictor, this simplifies to Y = β₀ + β₁X + ε.
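To make this concrete, here's a minimal sketch that computes the simple-regression slope and intercept from the usual closed-form formulas. The hours/scores data are made up for illustration:

```python
import numpy as np

# Hypothetical data: hours studied (X) and test scores (Y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 54.0, 56.0, 58.0, 60.0])

# Closed-form estimates for simple linear regression:
# slope = cov(X, Y) / var(X), intercept = mean(Y) - slope * mean(X)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(slope, intercept)  # -> 2.0 50.0
```

Because the made-up data lie exactly on a line, the estimates recover the slope (2) and intercept (50) exactly.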

Slope and Intercept Interpretation

The slope (βⱼ) tells you: for every one-unit increase in Xⱼ, the predicted value of Y changes by βⱼ units, holding all other variables constant.

  • A positive slope means Y increases as X increases (direct relationship)
  • A negative slope means Y decreases as X increases (inverse relationship)

The intercept (β₀) is the predicted value of Y when every X equals zero. Sometimes this has a meaningful interpretation (e.g., baseline salary with zero years of experience), and sometimes it doesn't (e.g., predicted weight at zero height). Context matters.

When you have multiple predictors, comparing slopes (after standardizing variables to a common scale) helps you gauge which ones have the strongest effect on Y.

Matrix Notation for Linear Models

With many variables, writing out every term gets unwieldy. Matrix notation compresses the entire model into:

Y = Xβ + ε

  • Y is an n × 1 vector of observed outcomes
  • X is an n × (k+1) matrix where each row is one observation and the first column is all ones (for the intercept)
  • β is a (k+1) × 1 vector of coefficients
  • ε is an n × 1 vector of errors

This compact form makes computation much more efficient, especially when fitting models with many predictors.

Model Assumptions

Linear models rely on several assumptions. If these assumptions hold, your estimates and hypothesis tests are trustworthy. If they're violated, your results can be misleading. Checking assumptions isn't optional; it's a core part of responsible modeling.

Linearity Assumption

The relationship between Y and each X must be linear. You can check this by looking at scatter plots of Y against each predictor, or by examining residual plots for curved patterns.

If the relationship is actually curved, you might need to transform your variables (e.g., take the log or add polynomial terms). Ignoring non-linearity leads to biased coefficient estimates and poor predictions.

Independence of Errors

The residuals should be uncorrelated with each other. This assumption is especially important with time series data or clustered observations, where nearby data points tend to be similar.

  • The Durbin-Watson test can detect autocorrelation in residuals
  • Violations understate standard errors and inflate significance, making effects look real when they might not be
  • Fixes include generalized least squares or mixed-effects models
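The Durbin-Watson statistic itself is easy to compute from the residual series. A minimal sketch with made-up residuals (values near 2 suggest no first-order autocorrelation; near 0 means strong positive, near 4 strong negative):

```python
import numpy as np

# Durbin-Watson statistic: DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)
def durbin_watson(resid):
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Alternating signs push DW above 2 (toward 4): negative autocorrelation.
print(durbin_watson([1, -1, 1, -1, 1, -1]))
# A constant residual series has no first differences at all, so DW = 0.
print(durbin_watson([1, 1, 1, 1]))
```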

Homoscedasticity

The spread of residuals should stay roughly constant across all levels of the independent variables. When the spread changes (called heteroscedasticity), your coefficient estimates become inefficient and standard errors become unreliable.

  • Check by plotting residuals against fitted values; look for a "fan" or "funnel" shape
  • Formal tests include the Breusch-Pagan test and White's test
  • Weighted least squares or robust standard errors can address the problem

Normality of Residuals

The residuals should follow a normal distribution. This matters most for hypothesis tests and confidence intervals, especially with small samples.

  • Assess with Q-Q plots, histograms, or formal tests like Shapiro-Wilk
  • With large samples, mild non-normality is less of a concern thanks to the Central Limit Theorem
  • Severe non-normality may require variable transformations or robust regression methods
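As a quick sketch of a formal normality check, assuming SciPy is available (the "residuals" here are simulated, so this only demonstrates the call, not a real diagnosis):

```python
import numpy as np
from scipy import stats

# Simulated residuals standing in for a fitted model's residuals.
rng = np.random.default_rng(0)
resid = rng.normal(size=200)

# Shapiro-Wilk tests H0: the data are normally distributed.
# A small p-value (< 0.05) is evidence against normality.
stat, p = stats.shapiro(resid)
print(stat, p)
```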

Estimation Methods

Estimation is about finding the coefficient values that make your model fit the data as well as possible. Different methods make different trade-offs between simplicity, efficiency, and flexibility.

Ordinary Least Squares (OLS)

OLS is the most common estimation method. It finds the coefficients that minimize the sum of squared residuals, the total squared distance between observed and predicted values.

The closed-form solution is:

β̂ = (X′X)⁻¹X′Y

OLS produces unbiased estimates when the model assumptions are met, and it's computationally efficient for moderately sized datasets. It's the default starting point for linear regression.
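The closed-form OLS solution is a few lines of NumPy. This sketch builds Y from known coefficients with no noise, so OLS recovers them exactly (the data and coefficients are invented for illustration):

```python
import numpy as np

# Design matrix: first column of ones for the intercept, then two predictors.
X = np.array([[1, 1, 2],
              [1, 2, 1],
              [1, 3, 4],
              [1, 4, 3],
              [1, 5, 5]], dtype=float)

# Generate Y from known coefficients (intercept 1, slopes 2 and 3).
beta_true = np.array([1.0, 2.0, 3.0])
Y = X @ beta_true

# Solve the normal equations (X'X) beta = X'Y; solving is numerically
# preferable to forming the explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)  # -> [1. 2. 3.]
```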


Maximum Likelihood Estimation (MLE)

MLE finds the parameter values that make the observed data most probable under a specified distribution (usually normal for linear models).

Under the normality assumption, MLE gives the same estimates as OLS. Where MLE becomes more useful is with generalized linear models and other settings where OLS doesn't directly apply. MLE also provides a natural framework for hypothesis testing and confidence intervals.

Weighted Least Squares (WLS)

WLS is an extension of OLS designed for situations where the variance of errors isn't constant (heteroscedasticity). Instead of treating every observation equally, WLS assigns weights to observations: more reliable observations (lower variance) get higher weight.

This improves the efficiency of your estimates, but you need to know or estimate the variance structure of the errors to assign appropriate weights.
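A bare-bones WLS sketch, with hypothetical weights (each weight plays the role of 1/variance of that observation's error, assumed known here):

```python
import numpy as np

# WLS estimate: beta_hat = (X'WX)^{-1} X'WY, W = diag(weights).
def wls(X, Y, w):
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

X = np.column_stack([np.ones(4), [1.0, 2.0, 3.0, 4.0]])
Y = np.array([2.0, 4.1, 5.9, 8.2])
w = np.array([4.0, 1.0, 1.0, 0.25])   # hypothetical: early observations more precise

print(wls(X, Y, w))
# With all weights equal, WLS reduces to ordinary least squares.
print(wls(X, Y, np.ones(4)))
```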

Model Evaluation

After fitting a model, you need to assess how well it actually works. These metrics help you judge model quality and compare competing models.

Coefficient of Determination (R-squared)

R² measures the proportion of variance in Y that your model explains:

R² = 1 − SS_res / SS_tot

  • SS_res is the sum of squared residuals (unexplained variation)
  • SS_tot is the total sum of squares (total variation in Y)

R² ranges from 0 to 1. An R² of 0.85 means your model explains 85% of the variation in Y. Be cautious, though: R² always increases when you add more predictors, even useless ones.

Adjusted R-squared

Adjusted R² fixes the problem above by penalizing for extra predictors:

R²_adj = 1 − (1 − R²)(n − 1) / (n − k − 1)

Here n is the number of observations and k is the number of predictors. Unlike regular R², adjusted R² can decrease if you add a variable that doesn't improve the model enough to justify its inclusion. This makes it better for comparing models with different numbers of predictors.
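Both quantities fall out directly from a fitted model's residuals; a short sketch on invented, nearly linear data:

```python
import numpy as np

# Hypothetical data: intercept column plus one predictor.
X = np.column_stack([np.ones(6), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]])
Y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.9])

beta = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta
ss_res = np.sum(resid ** 2)                    # unexplained variation
ss_tot = np.sum((Y - Y.mean()) ** 2)           # total variation in Y

n, k = len(Y), X.shape[1] - 1                  # k excludes the intercept
r2 = 1 - ss_res / ss_tot
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2, 4), round(r2_adj, 4))
```

Note that adjusted R² is never larger than R²; the gap widens as you add predictors that don't pull their weight.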

F-statistic and p-value

The F-statistic tests whether your model as a whole explains significantly more variance than a model with no predictors (intercept only). It's calculated as the ratio of explained variance to unexplained variance.

  • A large F-statistic means the model explains much more than expected by chance
  • The associated p-value tells you the probability of getting that F-statistic if the model had no real explanatory power
  • A p-value below 0.05 is the conventional threshold for concluding the model is statistically significant

Hypothesis Testing

Hypothesis testing lets you make formal claims about whether relationships in your model are real or could have arisen by chance.

T-tests for Coefficients

To test whether a single predictor has a significant effect, you use a t-test. The null hypothesis is that the coefficient equals zero (no effect).

The t-statistic is:

t = β̂ / SE(β̂)

where β̂ is the estimated coefficient and SE(β̂) is its standard error. A large absolute t-value (and correspondingly small p-value) suggests the predictor has a real effect on Y.

Confidence Intervals

A confidence interval gives you a range of plausible values for a coefficient, not just a single point estimate. At the 95% level:

CI = β̂ ± t(α/2, n−k−1) × SE(β̂)

where t(α/2, n−k−1) is the critical t-value for your chosen confidence level and degrees of freedom.

  • Wider intervals mean less precision
  • If a 95% CI for a coefficient doesn't include zero, that coefficient is significant at the 0.05 level
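A sketch that computes the coefficient standard errors, t-statistics, and 95% intervals by hand (assuming SciPy for the critical t-value; the data are invented, roughly Y ≈ 2X):

```python
import numpy as np
from scipy import stats

X = np.column_stack([np.ones(8), [1.0, 2, 3, 4, 5, 6, 7, 8]])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

n, p = X.shape                             # p = k + 1 parameters
beta = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta
mse = resid @ resid / (n - p)              # estimate of the error variance
se = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))

t_stats = beta / se                        # H0: coefficient = 0
t_crit = stats.t.ppf(0.975, df=n - p)      # two-sided 95% critical value
ci_low = beta - t_crit * se
ci_high = beta + t_crit * se
print(t_stats)
print(list(zip(ci_low, ci_high)))
```

Since the data were built around a slope of about 2, the slope's 95% interval comfortably contains 2 and its t-statistic is large.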

ANOVA for Linear Models

ANOVA decomposes total variability into explained (by the model) and unexplained (residual) components:

F = MS_model / MS_residual

where MS stands for mean square (sum of squares divided by degrees of freedom). ANOVA is particularly useful for testing whether groups of predictors jointly matter, or for comparing nested models. It's the standard tool in experimental designs with multiple treatment groups.

Diagnostics and Residual Analysis

Fitting a model is only half the job. Diagnostics tell you whether the model is trustworthy and where it might be going wrong.

Residual Plots

Residual plots are your primary diagnostic tool. The most useful ones:

  • Residuals vs. fitted values: checks linearity and homoscedasticity. You want a random scatter with no pattern.
  • Normal Q-Q plot: checks normality. Points should fall roughly along the diagonal line.
  • Residuals vs. leverage: helps spot influential observations.
  • Scale-location plot: plots √|standardized residuals| against fitted values to check for changing variance.

Any systematic pattern in these plots signals a potential assumption violation.

Leverage and Influence

Leverage measures how far an observation's predictor values are from the average. High-leverage points have unusual X values and can pull the regression line toward them.

Leverage for observation ii comes from the hat matrix diagonal:

hᵢᵢ = Xᵢ(X′X)⁻¹Xᵢ′

A common rule of thumb: flag points where hᵢᵢ > 2(k+1)/n.

Influence combines leverage with the size of the residual. Cook's distance is the standard measure:

Dᵢ = (Ŷᵢ − Yᵢ)² / ((k+1) · MSE) × hᵢᵢ / (1 − hᵢᵢ)²

A single high-influence point can substantially change your coefficient estimates, so these observations deserve close inspection.
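Both quantities are easy to compute by hand. This sketch uses an invented dataset whose last observation has a deliberately unusual X value, so the leverage rule of thumb flags it:

```python
import numpy as np

# Hypothetical data; the last observation sits far out in X.
X = np.column_stack([np.ones(6), [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]])
Y = np.array([1.1, 2.0, 2.9, 4.2, 5.0, 8.0])

n, p = X.shape                                  # p = k + 1 parameters
H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
h = np.diag(H)                                  # leverages h_ii

beta = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta
mse = resid @ resid / (n - p)
cooks_d = (resid ** 2 / (p * mse)) * (h / (1 - h) ** 2)

flagged = h > 2 * p / n                         # rule of thumb: h_ii > 2(k+1)/n
print(np.round(h, 3), flagged)
```

A useful sanity check: the leverages always sum to the number of parameters (the trace of the hat matrix), so no dataset can make every point low-leverage.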

Outlier Detection

Outliers are observations that don't fit the overall pattern. Several approaches exist:

  • Standardized residuals: residuals divided by their standard deviation. Values beyond ±2 or ±3 are suspicious.
  • Studentized residuals: adjust for the fact that each observation's predicted value has different precision. These are generally more reliable than standardized residuals.
  • Bonferroni outlier test: adjusts for multiple comparisons when you're checking many observations at once.

Before removing an outlier, consider whether it reflects a data entry error, a genuinely unusual case, or a sign that your model is misspecified.


Model Selection

The goal of model selection is parsimony: find the simplest model that adequately explains the data. Adding more predictors always improves in-sample fit, but it can hurt prediction on new data (overfitting).

Stepwise Regression

Stepwise regression automates predictor selection by iteratively adding or removing variables based on statistical criteria.

  1. Forward stepwise: start with no predictors. At each step, add the variable that improves fit the most. Stop when no remaining variable meets the entry criterion.
  2. Backward stepwise: start with all predictors. At each step, remove the least significant variable. Stop when all remaining variables meet the retention criterion.
  3. Bidirectional stepwise: combines both approaches, allowing variables to be added or removed at each step.

Forward vs. Backward Selection

Forward and backward selection can produce different final models because of their sequential nature. Forward selection might miss a variable that's only useful in combination with another, while backward selection starts with every predictor present, so jointly useful variables get evaluated together.

Neither method guarantees finding the globally best subset of predictors. They're useful heuristics, but for small numbers of candidate variables, best-subsets regression (testing all possible combinations) is more thorough.

Information Criteria (AIC, BIC)

Information criteria balance fit against complexity using a single number. Lower values are better.

  • AIC (Akaike Information Criterion): AIC = 2k − 2·ln(L)
    • k is the number of parameters, L is the maximized likelihood
    • Penalizes complexity moderately
  • BIC (Bayesian Information Criterion): BIC = k·ln(n) − 2·ln(L)
    • n is the number of observations
    • Penalizes complexity more heavily than AIC, so it tends to select simpler models

Both criteria can compare non-nested models (unlike the F-test), making them versatile tools for model selection.
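For a Gaussian linear model, both criteria can be computed from the residual sum of squares, since the maximized log-likelihood is −(n/2)(ln(2π·RSS/n) + 1). A sketch with invented numbers, where a second model adds a parameter but barely improves fit:

```python
import numpy as np

# AIC/BIC for a Gaussian linear model from its residual sum of squares.
# Here k counts the regression parameters (some texts also count the variance).
def gaussian_ic(rss, n, k):
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    aic = 2 * k - 2 * loglik
    bic = k * np.log(n) - 2 * loglik
    return aic, bic

# Model B adds one parameter but only trims RSS from 10.0 to 9.9,
# so both criteria should prefer the simpler model A (lower is better).
aic_a, bic_a = gaussian_ic(rss=10.0, n=50, k=2)
aic_b, bic_b = gaussian_ic(rss=9.9, n=50, k=3)
print(aic_a < aic_b, bic_a < bic_b)
```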

Multicollinearity

Multicollinearity occurs when independent variables are highly correlated with each other. It doesn't ruin your model's overall predictions, but it makes individual coefficient estimates unreliable and hard to interpret.

Causes and Consequences

Multicollinearity is common in observational studies or when predictors measure overlapping constructs (e.g., including both "years of education" and "highest degree earned").

The consequences:

  • Coefficient estimates become unstable: small changes in the data produce large swings in estimated values
  • Standard errors inflate, making it harder to detect significant effects
  • Coefficients may have counterintuitive signs (e.g., a variable you expect to be positive shows up negative)
  • Overall model fit (R², predictions) is largely unaffected

Variance Inflation Factor (VIF)

VIF quantifies how much a predictor's variance is inflated due to correlation with other predictors:

VIFⱼ = 1 / (1 − R²ⱼ)

Here R²ⱼ comes from regressing the j-th predictor on all the other predictors. A VIF of 1 means no collinearity. VIF above 5 is a warning sign; above 10 is generally considered serious.

Use VIF to identify which variables are causing the problem, then decide whether to drop one, combine correlated predictors, or use a regularization technique.
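The definition translates directly into code: run the auxiliary regression for each predictor and invert 1 − R²ⱼ. This sketch uses an invented design where one column is almost an exact copy of another, so both get huge VIFs:

```python
import numpy as np

# VIF_j = 1 / (1 - R^2_j), where R^2_j regresses predictor j on the others.
def vif(X):
    """X: n x k matrix of predictors (no intercept column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

x1 = np.array([1.0, 2, 3, 4, 5, 6])
x2 = 2 * x1 + np.array([0.01, -0.02, 0.02, -0.01, 0.01, -0.01])  # near-copy of x1
x3 = np.array([3.0, 1, 4, 1, 5, 9])                              # a third predictor

v = vif(np.column_stack([x1, x2, x3]))
print(np.round(v, 1))   # first two VIFs are enormous
```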

Ridge Regression

Ridge regression addresses multicollinearity by adding a penalty that shrinks coefficients toward zero. It minimizes:

Σᵢ₌₁ⁿ (yᵢ − β₀ − Σⱼ₌₁ᵖ βⱼxᵢⱼ)² + λ Σⱼ₌₁ᵖ βⱼ²

The parameter λ controls the penalty strength. As λ increases, coefficients shrink more. This introduces a small amount of bias but can dramatically reduce variance, improving prediction accuracy.

Choosing the right λ is typically done through cross-validation: you test different values and pick the one that minimizes prediction error on held-out data.
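Ridge also has a closed form: β̂ = (X′X + λI)⁻¹X′Y. A bare-bones sketch (invented data with two highly correlated columns; in practice you would center the predictors and leave the intercept unpenalized):

```python
import numpy as np

# Ridge closed form: beta_hat = (X'X + lambda * I)^{-1} X'Y.
def ridge(X, Y, lam):
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ Y)

X = np.array([[1.0, 1.0], [2.0, 2.1], [3.0, 2.9], [4.0, 4.2]])  # nearly collinear
Y = np.array([2.0, 4.1, 6.1, 8.0])

for lam in [0.0, 0.1, 10.0]:
    print(lam, np.round(ridge(X, Y, lam), 3))
# As lambda grows, the coefficient vector shrinks toward zero.
```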

Generalized Linear Models

Generalized linear models (GLMs) extend the linear modeling framework to handle response variables that aren't normally distributed. They're essential for modeling binary outcomes, counts, and other non-continuous data.

Logistic Regression

Logistic regression models binary outcomes (yes/no, success/failure) by predicting the probability of an event. The logit link function transforms the probability to a linear scale:

log(p / (1 − p)) = β₀ + β₁X₁ + ... + βₖXₖ

Here p is the probability of the event occurring, and p / (1 − p) is the odds. Coefficients are interpreted as changes in log-odds for a one-unit increase in the predictor. Exponentiating a coefficient gives you the odds ratio.

Logistic regression is widely used for classification problems in medicine (disease diagnosis), marketing (purchase prediction), and finance (credit risk).
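A teaching sketch of a logistic fit, using plain gradient ascent on the log-likelihood (real libraries use Newton/IRLS and converge much faster; the pass/fail data are invented):

```python
import numpy as np

# Fit logistic regression by gradient ascent on the log-likelihood.
def fit_logistic(X, y, lr=0.1, steps=5000):
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ beta))       # predicted probabilities
        beta += lr * X.T @ (y - p) / len(y)   # log-likelihood gradient step
    return beta

# Hypothetical data: pass/fail vs hours studied, with an intercept column.
hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
X = np.column_stack([np.ones_like(hours), hours])

beta = fit_logistic(X, passed)
odds_ratio = np.exp(beta[1])   # multiplicative change in odds per extra hour
print(np.round(beta, 2), round(odds_ratio, 2))
```

More hours studied should raise the odds of passing, so the fitted slope is positive and the odds ratio exceeds 1.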

Poisson Regression

Poisson regression models count data (number of accidents, number of species observed, number of customer complaints). It uses the log link:

log(μ) = β₀ + β₁X₁ + ... + βₖXₖ

where μ is the expected count. The Poisson distribution assumes the mean equals the variance. When this assumption is violated (overdispersion), you may need a negative binomial model instead.

Exponentiating a coefficient gives you the multiplicative change in the expected count for a one-unit increase in the predictor.

Link Functions

The link function connects the linear predictor to the expected value of the response. Different response types call for different links:

  • Identity link (linear regression): g(μ) = μ
  • Logit link (logistic regression): g(μ) = log(μ / (1 − μ))
  • Log link (Poisson regression): g(μ) = log(μ)
  • Probit link: g(μ) = Φ⁻¹(μ), where Φ is the standard normal CDF

The choice depends on the nature of your response variable and the assumptions you're willing to make.

Applications and Extensions

Linear models form the basis for a wide range of specialized techniques. These extensions adapt the core framework to handle more complex data structures and relationships.

Time Series Regression

Time series regression analyzes data collected over time to identify trends, seasonality, and other temporal patterns. Standard linear regression assumes independent errors, but time series data typically has serial correlation (today's value depends on yesterday's).

  • ARIMA models incorporate autoregressive (AR) and moving average (MA) components to handle this dependence
  • Differencing can address non-stationarity (when the mean or variance changes over time)
  • Lag structures let you model how past values of predictors affect the current outcome

These models are used for economic forecasting, demand planning, and studying how effects unfold over time.

Panel Data Models

Panel data combines cross-sectional and time series dimensions (e.g., tracking 50 countries over 20 years). This structure lets you control for factors that are hard to measure directly.

  • Fixed effects models account for unobserved, time-invariant differences between units (e.g., cultural factors that differ across countries but don't change over time)
  • Random effects models assume unit-specific effects are uncorrelated with the predictors, which is more efficient but a stronger assumption
  • The Hausman test helps you choose between the two

Panel models are standard in economics and sociology for studying how policies or conditions affect outcomes over time.

Nonlinear Transformations

When the true relationship between variables is curved, you can often still use the linear modeling framework by transforming variables:

  • Polynomial regression: add X², X³, etc. as predictors to capture curvature
  • Log transformations: linearize exponential or multiplicative relationships (e.g., log(Y) = β₀ + β₁X)
  • Box-Cox transformations: a family of power transformations that includes log as a special case
  • Spline functions: fit piecewise polynomials that join smoothly at specified points (knots)
  • Generalized additive models (GAMs): replace linear terms with smooth, flexible functions of each predictor

These approaches let you model complex, nonlinear patterns while keeping the interpretability advantages of the linear framework.
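Polynomial regression in particular is just OLS on an expanded design matrix. A minimal sketch, using an exact quadratic so the known coefficients are recovered:

```python
import numpy as np

# Fit a quadratic by adding an X^2 column and running ordinary OLS.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1 + 2 * x + 3 * x ** 2        # exact quadratic: coefficients (1, 2, 3)

X = np.column_stack([np.ones_like(x), x, x ** 2])   # [1, X, X^2]
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)   # -> approximately [1. 2. 3.]
```

The model is still "linear" in the statistical sense because it is linear in the coefficients, even though the fitted curve is not a straight line.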