🥖Linear Modeling Theory Unit 2 Review

2.1 Ordinary Least Squares (OLS) Method

Written by the Fiveable Content Team • Last updated August 2025

Least Squares Principle in Regression

Minimizing the Sum of Squared Residuals

The core idea behind OLS is straightforward: find the line (or hyperplane, in multiple regression) that makes the residuals as small as possible, in aggregate. A residual is the difference between an observed value y_i and its predicted value \hat{y}_i:

e_i = y_i - \hat{y}_i

Rather than minimizing the raw residuals (which could cancel each other out, since some are positive and some negative), OLS minimizes the sum of squared residuals (SSR):

SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Squaring serves two purposes: it penalizes large deviations more heavily than small ones, and it eliminates the sign-cancellation problem.
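As a concrete illustration, here is a minimal numpy sketch (the data values are made up) that evaluates the SSR for candidate lines; OLS picks the intercept and slope that drive this quantity as low as possible:

```python
import numpy as np

# Toy data (hypothetical values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def ssr(beta0, beta1, x, y):
    """Sum of squared residuals for a candidate line y = beta0 + beta1 * x."""
    residuals = y - (beta0 + beta1 * x)
    return float(np.sum(residuals ** 2))

# A poor line vs. a better line: OLS seeks the (beta0, beta1) minimizing SSR
print(ssr(0.0, 1.0, x, y))  # slope too shallow: large SSR
print(ssr(0.0, 2.0, x, y))  # close to the data: small SSR
```

Note how squaring means the far-off line is punished disproportionately by its largest residuals.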

Note that the least squares principle is purely an optimization criterion. The classical assumptions about the errors (normality, zero mean, constant variance / homoscedasticity, independence) are separate conditions that determine how well the resulting estimates behave statistically.

Unique Solution for Regression Coefficients

OLS yields a unique set of coefficient estimates as long as the columns of the design matrix X are linearly independent (i.e., X^T X is invertible). When that condition holds, the sum-of-squared-residuals surface is a strictly convex function of \beta, so it has exactly one minimum.

This uniqueness is one reason OLS is so widely used. Under the classical linear model assumptions (linearity, independence, homoscedasticity, and non-stochastic or exogenous regressors), the Gauss-Markov theorem guarantees that OLS estimators are the Best Linear Unbiased Estimators (BLUE):

  • Unbiased: E(\hat{\beta}) = \beta
  • Minimum variance: Among all estimators that are both linear in y and unbiased, OLS has the smallest variance.

The Gauss-Markov result does not require normality of errors. Normality becomes important when you want to do inference (confidence intervals, hypothesis tests) in finite samples.
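This can be checked empirically. The sketch below (a hypothetical simulation setup, not from the text) fits OLS on many datasets generated with deliberately non-normal, zero-mean errors; the average slope estimate lands near the true value, illustrating unbiasedness without normality:

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_b0, true_b1 = 50, 1.0, 2.0
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x])  # design matrix with intercept column

# Average the OLS slope over many simulated datasets
slopes = []
for _ in range(2000):
    eps = rng.uniform(-1, 1, n)              # zero-mean, non-normal errors
    y = true_b0 + true_b1 * x + eps
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    slopes.append(beta_hat[1])

print(np.mean(slopes))  # very close to the true slope 2.0
```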

Normal Equations for OLS Estimators

Deriving the Normal Equations

The normal equations come from setting the gradient of the SSR to zero. Here's the process for simple linear regression with intercept \beta_0 and slope \beta_1:

  1. Write the objective function: SSR = \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2

  2. Take the partial derivative with respect to \beta_0, set it to zero: \frac{\partial \, SSR}{\partial \beta_0} = -2\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i) = 0

  3. Take the partial derivative with respect to \beta_1, set it to zero: \frac{\partial \, SSR}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i(y_i - \beta_0 - \beta_1 x_i) = 0

  4. Simplify to get the two normal equations:

\sum y_i = n\beta_0 + \beta_1 \sum x_i

\sum x_i y_i = \beta_0 \sum x_i + \beta_1 \sum x_i^2

These are two linear equations in two unknowns (\beta_0 and \beta_1), so they can be solved by substitution or elimination.
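For instance, the two normal equations can be assembled as a 2×2 linear system and solved directly (the toy data below are made up and chosen so the answer is exact):

```python
import numpy as np

# Small hypothetical dataset: exactly y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
n = len(x)

# Normal equations as a 2x2 system A @ [b0, b1] = rhs:
#   sum(y)   = n*b0      + b1*sum(x)
#   sum(x*y) = b0*sum(x) + b1*sum(x**2)
A = np.array([[n, x.sum()],
              [x.sum(), (x**2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])

b0, b1 = np.linalg.solve(A, rhs)
print(b0, b1)  # recovers intercept 1 and slope 2
```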


Normal Equations in Matrix Form

For multiple linear regression with p predictor variables, writing out individual partial derivatives becomes impractical. The matrix formulation handles any number of predictors at once:

X^T X \, \hat{\beta} = X^T y

where:

  • X is the n \times (p+1) design matrix (the first column is typically all 1s for the intercept)
  • X^T is the transpose of X
  • y is the n \times 1 vector of observed responses
  • \hat{\beta} is the (p+1) \times 1 vector of coefficient estimates

This is the matrix equivalent of setting all partial derivatives to zero simultaneously. Software functions like lm() in R or LinearRegression in scikit-learn solve this system (often via numerically stable decompositions like QR rather than direct inversion).
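Here is a minimal numpy sketch of both routes, using made-up data: solving the normal-equation system directly, and calling a general least-squares routine (which uses stable factorizations internally), roughly as lm()-style software does:

```python
import numpy as np

# Hypothetical data: 6 observations, 2 predictors
X = np.array([[1.0, 2.0, 0.5],
              [1.0, 1.0, 1.5],
              [1.0, 3.0, 2.0],
              [1.0, 4.0, 0.0],
              [1.0, 2.5, 1.0],
              [1.0, 0.5, 2.5]])      # first column of 1s for the intercept
y = np.array([5.0, 4.0, 8.0, 7.0, 6.0, 5.5])

# Route 1: solve the normal equations X^T X beta = X^T y as a linear system
beta_direct = np.linalg.solve(X.T @ X, X.T @ y)

# Route 2: a least-squares solver, avoiding explicit formation of X^T X
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_direct)
print(beta_lstsq)  # the two routes agree for this well-conditioned problem
```

A defining property of the solution is that the residual vector is orthogonal to every column of X, which is exactly what the normal equations state.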

Calculating OLS Estimates

Simple Linear Regression

Solving the two normal equations from above gives closed-form formulas. Start with the slope, then use it to get the intercept:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

The second form of \hat{\beta}_1 (using deviations from means) is worth remembering because it shows that the slope is the ratio of the sample covariance of x and y to the sample variance of x:

\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}

The intercept formula also tells you something useful: the fitted line always passes through the point (\bar{x}, \bar{y}).
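A quick numeric check of these formulas, using small hypothetical data, computes the slope as S_xy / S_xx, backs out the intercept, and confirms the fitted line passes through the point of means:

```python
import numpy as np

# Hypothetical data for illustration
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 2.9, 4.4, 6.1, 7.4])

xbar, ybar = x.mean(), y.mean()
Sxy = np.sum((x - xbar) * (y - ybar))   # sum of cross-deviations
Sxx = np.sum((x - xbar) ** 2)           # sum of squared deviations of x

b1 = Sxy / Sxx                          # slope = Sxy / Sxx
b0 = ybar - b1 * xbar                   # intercept from the means

# The fitted line passes through (xbar, ybar)
fitted_at_mean = b0 + b1 * xbar
print(b1, b0, fitted_at_mean)
```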

Multiple Linear Regression

In the general case, the OLS solution is:

\hat{\beta} = (X^T X)^{-1} X^T y

This requires X^T X to be invertible, which fails when predictors are perfectly collinear (e.g., one column is an exact linear combination of others). In practice, near-collinearity (multicollinearity) doesn't prevent inversion but inflates the variance of the estimates, making them unstable.

You'll rarely compute (X^T X)^{-1} by hand. Software handles the matrix algebra and also returns standard errors, t-values, and p-values alongside the coefficient estimates.
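Perfect collinearity can be seen numerically. In this sketch (simulated data), one predictor is an exact multiple of another, so distinct coefficient vectors reproduce identical fitted values and X^T X is singular, leaving no unique minimizer:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = 2.0 * x1                       # exact linear combination: perfect collinearity
X = np.column_stack([np.ones(n), x1, x2])

# Two different coefficient vectors with identical fitted values:
# 1*x1 equals 0.5*x2, so the data cannot distinguish them.
beta_a = np.array([0.0, 1.0, 0.0])
beta_b = np.array([0.0, 0.0, 0.5])
print(np.allclose(X @ beta_a, X @ beta_b))  # True

# Numerically, the condition number of X^T X blows up
print(np.linalg.cond(X.T @ X))
```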


Interpreting OLS Estimates

Regression Coefficients

Each slope coefficient \hat{\beta}_j estimates the expected change in y for a one-unit increase in x_j, holding all other predictors constant (ceteris paribus). This "holding constant" part is critical in multiple regression because it means each coefficient reflects the partial effect of that predictor.

The intercept \hat{\beta}_0 is the predicted value of y when every predictor equals zero. Sometimes this is meaningful (e.g., baseline test score with zero hours of study). Other times it's not (e.g., predicting weight when height is zero). Either way, the intercept is necessary for the model to fit correctly; just be cautious about interpreting it substantively.

Sign and Magnitude of Coefficients

  • Sign: A positive \hat{\beta}_j means y tends to increase as x_j increases; a negative \hat{\beta}_j means y tends to decrease.
  • Magnitude: Larger absolute values indicate a steeper relationship per unit of x_j. A coefficient of 2.5 means each one-unit increase in x_j is associated with a 2.5-unit increase in y, compared to only 0.5 for a coefficient of 0.5.

However, you cannot directly compare magnitudes across predictors measured on different scales. A coefficient of 50 on a variable measured in meters is not "stronger" than a coefficient of 0.05 on a variable measured in kilometers. If you want to compare relative importance, you'd need to look at standardized coefficients or other measures.
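One common approach, sketched below with simulated data (all variable names and values are made up), is to refit after z-scoring the predictors and the response; the standardized slopes become comparable even though the raw slopes differ by orders of magnitude:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(0.0, 1.0, n)      # small-scale predictor
x2 = rng.normal(0.0, 100.0, n)    # large-scale predictor
y = 2.0 * x1 + 0.02 * x2 + rng.normal(0.0, 0.5, n)

def ols(X, y):
    """OLS fit with an intercept column prepended."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

def zscore(v):
    return (v - v.mean()) / v.std()

raw = ols(np.column_stack([x1, x2]), y)
std = ols(np.column_stack([zscore(x1), zscore(x2)]), zscore(y))

print(raw[1:])   # raw slopes: wildly different magnitudes (about 2.0 vs 0.02)
print(std[1:])   # standardized slopes: directly comparable, nearly equal here
```

By construction both predictors contribute equally per standard deviation, and the standardized fit reveals exactly that.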

Considerations for Interpretation

Units matter. The coefficient's units are always (units of y) per (units of x_j). If y is income in thousands of dollars and x is education in years, then \hat{\beta} = 3.2 means each additional year of education is associated with $3,200 more income, on average.

Assumptions matter. OLS estimates are only BLUE when the Gauss-Markov conditions hold. Heteroscedasticity makes the estimates inefficient (and the usual standard errors unreliable), while omitted variable bias makes them biased. Always check diagnostics before trusting your interpretation.

Inference tools for assessing estimates:

  • Confidence intervals: A 95% CI for \beta_j gives a range of plausible values for the true parameter. If the interval excludes zero, the predictor is statistically significant at the 5% level.
  • t-tests: The test statistic t = \hat{\beta}_j / SE(\hat{\beta}_j) tests whether the coefficient differs from zero. Large absolute t-values (and correspondingly small p-values) indicate statistical significance.

These inference procedures rely on the normality assumption (or large-sample approximations), which goes beyond what Gauss-Markov alone provides.
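These quantities can be computed by hand from the OLS algebra. The sketch below uses simulated data and substitutes the large-sample normal critical value 1.96 for the exact t quantile:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta_hat
p = X.shape[1]
sigma2 = resid @ resid / (n - p)           # unbiased estimate of error variance
cov = sigma2 * np.linalg.inv(X.T @ X)      # estimated Var(beta_hat)
se = np.sqrt(np.diag(cov))

t = beta_hat / se                          # t statistics for H0: beta_j = 0
ci_low = beta_hat - 1.96 * se              # large-sample 95% CI
ci_high = beta_hat + 1.96 * se
print(t)
print(np.column_stack([ci_low, ci_high]))  # slope interval excludes zero
```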