Fiveable

🥖Linear Modeling Theory Unit 5 Review


5.2 Matrix Formulation of Simple Linear Regression


Written by the Fiveable Content Team • Last updated August 2025

Linear Regression in Matrix Form

Matrix formulation recasts simple linear regression as a system of matrix equations, giving you a compact way to represent the model, estimate parameters, and derive their properties. This matters because the same matrix framework scales directly to multiple regression and more complex linear models. If you nail the simple case here, the jump to p predictors later is mostly just making the matrices bigger.

Matrix Notation

The simple linear regression model with n observations can be written as:

y = X\beta + \varepsilon

Each piece of this equation is a matrix or vector:

  • y is an n \times 1 vector of response values (the thing you're trying to predict).
  • X is an n \times 2 design matrix. The first column is all ones (accounting for the intercept), and the second column holds the predictor values x_1, x_2, \dots, x_n.
  • \beta is a 2 \times 1 parameter vector containing the intercept \beta_0 and the slope \beta_1.
  • \varepsilon is an n \times 1 vector of errors.

The error vector is assumed to follow \varepsilon \sim N(0,\, \sigma^2 I), where I is the n \times n identity matrix. Writing the covariance as \sigma^2 I encodes two things at once: every error has the same variance \sigma^2, and errors are uncorrelated with each other.

Model Assumptions

  • The true relationship between the predictor and the response is linear in the parameters.
  • Errors are independently and identically distributed (i.i.d.) normal: \varepsilon_i \sim N(0, \sigma^2).
  • Error variance \sigma^2 is constant across all observations (homoscedasticity).
  • The predictor values are treated as fixed (non-random) and measured without error.

These are the same assumptions you've seen in scalar form. The matrix version just packages them more efficiently.

Design Matrix and Parameter Vector

Design Matrix Structure

For simple linear regression with n observations and one predictor, X is n \times 2:

  • Column 1: a vector of ones (1, 1, \dots, 1)^T, which multiplies the intercept \beta_0.
  • Column 2: the predictor values (x_1, x_2, \dots, x_n)^T, which multiply the slope \beta_1.

Example. With 5 observations and predictor values (2, 4, 6, 8, 10):

X = \begin{bmatrix} 1 & 2 \\ 1 & 4 \\ 1 & 6 \\ 1 & 8 \\ 1 & 10 \end{bmatrix}

Each row corresponds to one observation. The matrix product X\beta produces the vector of fitted values \hat{y}_i = \beta_0 + \beta_1 x_i for every observation simultaneously.
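As a quick sketch in NumPy (using the example values above), the design matrix and the product X\beta can be formed directly; the beta vector here is illustrative, not yet estimated:

```python
import numpy as np

# Design matrix for the 5-observation example: a column of ones
# for the intercept, plus the predictor values (2, 4, 6, 8, 10).
x = np.array([2, 4, 6, 8, 10], dtype=float)
X = np.column_stack([np.ones_like(x), x])

# With an illustrative beta = (1, 1), X @ beta computes every
# fitted value beta0 + beta1 * x_i in one matrix product.
beta = np.array([1.0, 1.0])
fitted = X @ beta
print(fitted)  # [ 3.  5.  7.  9. 11.]
```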

Parameter Vector

The parameter vector collects the intercept and slope into a single object:

\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}

This is what makes the notation compact. Instead of writing out y_i = \beta_0 + \beta_1 x_i + \varepsilon_i for each observation, the single equation y = X\beta + \varepsilon captures all n equations at once.


Least Squares Estimation with Matrices

Objective Function

The goal is the same as in scalar regression: minimize the sum of squared residuals. In matrix form, the residual vector is (y - X\beta), and the sum of squared residuals is:

S(\beta) = (y - X\beta)^T(y - X\beta)

This is just the dot product of the residual vector with itself, which equals \sum_{i=1}^n (y_i - \hat{y}_i)^2.

To minimize, you differentiate S(\beta) with respect to \beta and set the result to zero. That yields the normal equation:

X^T(y - X\beta) = 0

The intuition here: X^T(y - X\beta) = 0 says the residual vector is orthogonal to every column of X. The residuals have no linear relationship left with the predictors, which is exactly what a good fit should achieve.
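This orthogonality is easy to verify numerically. A NumPy sketch, reusing the example predictor values with an illustrative response vector:

```python
import numpy as np

# Fit by least squares, then verify X^T (y - X beta_hat) = 0:
# the residual vector is orthogonal to both columns of X.
x = np.array([2, 4, 6, 8, 10], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)
X = np.column_stack([np.ones_like(x), x])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat
print(X.T @ residuals)  # both entries are zero up to floating-point error
```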

Solving the Normal Equations

Rearranging the normal equation gives X^TX\hat{\beta} = X^Ty. If X^TX is invertible, you can solve directly:

\hat{\beta} = (X^TX)^{-1}X^Ty

Here's the step-by-step process:

  1. Compute X^TX (a 2 \times 2 matrix).
  2. Compute X^Ty (a 2 \times 1 vector).
  3. Find (X^TX)^{-1}.
  4. Multiply: \hat{\beta} = (X^TX)^{-1}X^Ty.

Example. Using the design matrix from above and response vector y = (3, 5, 7, 9, 11)^T:

X^TX = \begin{bmatrix} 5 & 30 \\ 30 & 220 \end{bmatrix}, \qquad X^Ty = \begin{bmatrix} 35 \\ 250 \end{bmatrix}

The determinant of X^TX is 5(220) - 30(30) = 1100 - 900 = 200, so:

(X^TX)^{-1} = \frac{1}{200}\begin{bmatrix} 220 & -30 \\ -30 & 5 \end{bmatrix}

Then:

\hat{\beta} = \frac{1}{200}\begin{bmatrix} 220 & -30 \\ -30 & 5 \end{bmatrix}\begin{bmatrix} 35 \\ 250 \end{bmatrix} = \frac{1}{200}\begin{bmatrix} 200 \\ 200 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}

So \hat{\beta}_0 = 1 and \hat{\beta}_1 = 1, giving the fitted line \hat{y} = 1 + 1 \cdot x. You can verify: when x = 2, \hat{y} = 3; when x = 10, \hat{y} = 11. The fit is perfect here because the data are exactly linear.
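The hand calculation can be reproduced step by step in NumPy. This is a sketch mirroring the worked example; the explicit inverse matches the derivation above, though for real data you'd prefer np.linalg.solve or lstsq for numerical stability:

```python
import numpy as np

# Reproduce the worked example: X^T X, X^T y, and beta_hat.
x = np.array([2, 4, 6, 8, 10], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)
X = np.column_stack([np.ones_like(x), x])

XtX = X.T @ X  # [[5, 30], [30, 220]]
Xty = X.T @ y  # [35, 250]

# Explicit inverse mirrors the hand calculation; for real data,
# prefer np.linalg.solve(XtX, Xty).
beta_hat = np.linalg.inv(XtX) @ Xty
print(beta_hat)  # [1. 1.]
```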

Note: The original guide states that (X^TX)^{-1} is the variance-covariance matrix of \hat{\beta}. More precisely, \text{Var}(\hat{\beta}) = \sigma^2(X^TX)^{-1}. The matrix (X^TX)^{-1} still needs to be scaled by \sigma^2 (or its estimate \hat{\sigma}^2) to give actual variances and covariances.

Normal Equations Derivation

Expanding the Least Squares Problem

Starting from X^T(y - X\beta) = 0 and distributing:

X^TX\beta = X^Ty

For simple linear regression, X^TX is 2 \times 2 and X^Ty is 2 \times 1. Writing these out element by element:

\begin{bmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_i y_i \end{bmatrix}

This gives you two simultaneous equations (the classical normal equations you may have seen in scalar form):

  1. n\beta_0 + \beta_1 \sum x_i = \sum y_i
  2. \beta_0 \sum x_i + \beta_1 \sum x_i^2 = \sum x_i y_i

The matrix formulation and the scalar normal equations are the same system, just written differently.
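To see the equivalence concretely, the entries of X^TX and X^Ty can be compared against the classical sums. A NumPy sketch using the running example:

```python
import numpy as np

x = np.array([2, 4, 6, 8, 10], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)
X = np.column_stack([np.ones_like(x), x])

# The matrix entries are exactly n, sum x_i, sum x_i^2 on the left
# and sum y_i, sum x_i * y_i on the right.
n = len(x)
lhs = np.array([[n, x.sum()], [x.sum(), (x**2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])

print(np.allclose(X.T @ X, lhs))  # True
print(np.allclose(X.T @ y, rhs))  # True
```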

Solving for Parameter Estimates

Premultiplying both sides by (X^TX)^{-1}:

\hat{\beta} = (X^TX)^{-1}X^Ty

The variance-covariance matrix of the parameter estimates is:

\text{Var}(\hat{\beta}) = \sigma^2(X^TX)^{-1}

  • The diagonal elements of \sigma^2(X^TX)^{-1} give \text{Var}(\hat{\beta}_0) and \text{Var}(\hat{\beta}_1). Taking square roots of these gives you the standard errors you'd use for hypothesis tests and confidence intervals.
  • The off-diagonal elements give \text{Cov}(\hat{\beta}_0, \hat{\beta}_1). In simple linear regression, the intercept and slope estimates are generally correlated unless \bar{x} = 0.
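A sketch of how standard errors fall out of this formula. The data here are the running example with a little simulated noise added (illustrative values, not from the guide), since the exact-fit example has zero residuals and would give a standard error of zero:

```python
import numpy as np

# Simulated data: the exact-fit example plus small noise so that
# sigma_hat^2 is nonzero.
rng = np.random.default_rng(0)
x = np.array([2, 4, 6, 8, 10], dtype=float)
y = 1.0 + x + rng.normal(scale=0.1, size=x.size)
X = np.column_stack([np.ones_like(x), x])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (len(x) - 2)  # unbiased estimate of sigma^2

cov = sigma2_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov))  # standard errors of beta0_hat and beta1_hat
print(se)
print(cov[0, 1])  # Cov(beta0_hat, beta1_hat) is negative here since x-bar > 0
```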

This is one of the big payoffs of the matrix approach: the formula \hat{\beta} = (X^TX)^{-1}X^Ty and the variance expression \sigma^2(X^TX)^{-1} work regardless of whether you have 1 predictor or 100. The structure stays the same.