Linear Regression in Matrix Form
Matrix formulation recasts simple linear regression as a system of matrix equations, giving you a compact way to represent the model, estimate parameters, and derive their properties. This matters because the same matrix framework scales directly to multiple regression and more complex linear models. If you nail the simple case here, the jump to multiple predictors later is mostly just making the matrices bigger.
Matrix Notation
The simple linear regression model with $n$ observations can be written as:

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

Each piece of this equation is a matrix or vector:
- $\mathbf{Y}$ is an $n \times 1$ vector of response values (the thing you're trying to predict).
- $\mathbf{X}$ is an $n \times 2$ design matrix. The first column is all ones (accounting for the intercept), and the second column holds the predictor values $x_1, \dots, x_n$.
- $\boldsymbol{\beta}$ is a $2 \times 1$ parameter vector containing the intercept $\beta_0$ and the slope $\beta_1$.
- $\boldsymbol{\varepsilon}$ is an $n \times 1$ vector of errors.
The error vector is assumed to follow $\boldsymbol{\varepsilon} \sim N(\mathbf{0},\, \sigma^2 \mathbf{I})$, where $\mathbf{I}$ is the $n \times n$ identity matrix. Writing the covariance as $\sigma^2 \mathbf{I}$ encodes two things at once: every error has the same variance $\sigma^2$, and errors are uncorrelated with each other.
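As a minimal sketch of this setup, the model can be assembled directly with numpy. The predictor values, true parameters, and error scale below are all assumed for illustration; they are not from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 5 observations of one predictor (assumed values).
n = 5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # assumed predictor values
beta = np.array([2.0, 0.5])               # assumed true (beta_0, beta_1)

# Design matrix: a column of ones for the intercept, then the predictor.
X = np.column_stack([np.ones(n), x])      # shape (n, 2)

# Errors drawn from N(0, sigma^2 I) with an assumed sigma = 1.
eps = rng.normal(loc=0.0, scale=1.0, size=n)

# The model Y = X beta + eps, computed for all n observations at once.
Y = X @ beta + eps

print(X.shape, Y.shape)  # (5, 2) (5,)
```

The single matrix product `X @ beta` replaces the $n$ scalar equations, which is exactly the compactness the notation is buying.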
Model Assumptions
- The true relationship between the predictor and the response is linear in the parameters.
- Errors are independently and identically distributed (i.i.d.) normal: $\varepsilon_i \sim N(0, \sigma^2)$.
- Error variance is constant across all observations (homoscedasticity).
- The predictor values are treated as fixed (non-random) and measured without error.
These are the same assumptions you've seen in scalar form. The matrix version just packages them more efficiently.
Design Matrix and Parameter Vector
Design Matrix Structure
For simple linear regression with $n$ observations and one predictor, $\mathbf{X}$ is $n \times 2$:

$$\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}$$

- Column 1: a vector of ones $(1, 1, \dots, 1)^T$, which multiplies the intercept $\beta_0$.
- Column 2: the predictor values $(x_1, x_2, \dots, x_n)^T$, which multiply the slope $\beta_1$.
Example. With 5 observations and predictor values $x_1, \dots, x_5$:

$$\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & x_4 \\ 1 & x_5 \end{pmatrix}$$

Each row corresponds to one observation. The matrix product $\mathbf{X}\boldsymbol{\beta}$ produces the vector of fitted values for every observation simultaneously.
Parameter Vector
The parameter vector collects the intercept and slope into a single object:

$$\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$$

This is what makes the notation compact. Instead of writing out $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ for each observation, the single equation $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ captures all $n$ equations at once.

Least Squares Estimation with Matrices
Objective Function
The goal is the same as in scalar regression: minimize the sum of squared residuals. In matrix form, the residual vector is $\mathbf{e} = \mathbf{Y} - \mathbf{X}\boldsymbol{\beta}$, and the sum of squared residuals is:

$$Q(\boldsymbol{\beta}) = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})$$

This is just the dot product of the residual vector with itself, which equals $\sum_{i=1}^{n} e_i^2$.
To minimize, you differentiate with respect to $\boldsymbol{\beta}$ and set the result to zero. That yields the normal equation:

$$\mathbf{X}^T \mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}^T \mathbf{Y}$$

The intuition here: rearranged as $\mathbf{X}^T(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$, the normal equation says the residual vector is orthogonal to every column of $\mathbf{X}$. The residuals have no linear relationship left with the predictors, which is exactly what a good fit should achieve.
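The orthogonality claim is easy to check numerically. A small sketch, using assumed data (the original example's numbers are not reproduced here) and numpy's least-squares solver:

```python
import numpy as np

# Assumed illustrative data; not the values from the original example.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
X = np.column_stack([np.ones_like(x), x])

# Least-squares fit and the residual vector e = Y - X beta_hat.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
e = Y - X @ beta_hat

# X^T e should be the zero vector: residuals are orthogonal
# to every column of X (intercept column and predictor column).
print(X.T @ e)  # ~ [0, 0] up to floating-point error
```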
Solving the Normal Equations
Rearranging the normal equation gives $\mathbf{X}^T \mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}^T \mathbf{Y}$. If $\mathbf{X}^T \mathbf{X}$ is invertible, you can solve directly:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$$

Here's the step-by-step process:
- Compute $\mathbf{X}^T \mathbf{X}$ (a $2 \times 2$ matrix).
- Compute $\mathbf{X}^T \mathbf{Y}$ (a $2 \times 1$ vector).
- Find $(\mathbf{X}^T \mathbf{X})^{-1}$.
- Multiply: $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$.
Example. Using the design matrix from above and a response vector $\mathbf{Y} = (y_1, \dots, y_5)^T$:

$$\mathbf{X}^T \mathbf{X} = \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}, \qquad \mathbf{X}^T \mathbf{Y} = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix}$$

The determinant of $\mathbf{X}^T \mathbf{X}$ is $n \sum x_i^2 - \left(\sum x_i\right)^2$, so:

$$(\mathbf{X}^T \mathbf{X})^{-1} = \frac{1}{n \sum x_i^2 - \left(\sum x_i\right)^2} \begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix}$$

Then:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$$

This yields $\hat{\beta}_0$ and $\hat{\beta}_1$, giving the fitted line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$. You can verify the fit by plugging each $x_i$ back in. When the data lie exactly on a line, the fit is perfect and every residual is zero.
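The four steps above can be run end to end. The data below are assumed for illustration (chosen to be exactly linear, $y = 1 + 2x$, so the recovered coefficients are easy to check); they are not the original example's values.

```python
import numpy as np

# Assumed example data: exactly linear, y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
X = np.column_stack([np.ones_like(x), x])

XtX = X.T @ X                 # step 1: the 2x2 matrix X^T X
XtY = X.T @ Y                 # step 2: the 2x1 vector X^T Y
XtX_inv = np.linalg.inv(XtX)  # step 3 (fine for 2x2; prefer solve() in general)
beta_hat = XtX_inv @ XtY      # step 4: beta_hat = (X^T X)^{-1} X^T Y

print(beta_hat)  # ~ [1. 2.]
```

Because the data are exactly linear, the residuals are all zero and the fit is perfect, matching the text's observation.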
Note: The original guide states that $(\mathbf{X}^T \mathbf{X})^{-1}$ is the variance-covariance matrix of $\hat{\boldsymbol{\beta}}$. More precisely, $\operatorname{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^T \mathbf{X})^{-1}$. The matrix $(\mathbf{X}^T \mathbf{X})^{-1}$ still needs to be scaled by $\sigma^2$ (or its estimate $s^2$) to give actual variances and covariances.
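A sketch of that scaling in practice, with assumed noisy data: estimate $\sigma^2$ by $s^2 = \mathrm{SSE}/(n - p)$ and multiply it into $(\mathbf{X}^T\mathbf{X})^{-1}$ to get estimated variances and standard errors.

```python
import numpy as np

# Assumed data with noise so the residual variance is nonzero.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 30)
Y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=x.size)
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape  # p = 2 parameters (intercept and slope)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat

# Estimate sigma^2 with the mean squared error: s^2 = SSE / (n - p).
s2 = (e @ e) / (n - p)

# Var(beta_hat) = sigma^2 (X^T X)^{-1}; substitute the estimate s^2.
cov_beta = s2 * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov_beta))  # standard errors of beta_0_hat, beta_1_hat
print(se)
```

The square roots of the diagonal are exactly the standard errors used for tests and intervals, as the derivation section below notes.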
Normal Equations Derivation
Expanding the Least Squares Problem
Starting from $Q(\boldsymbol{\beta}) = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})$ and distributing:

$$Q(\boldsymbol{\beta}) = \mathbf{Y}^T \mathbf{Y} - 2\boldsymbol{\beta}^T \mathbf{X}^T \mathbf{Y} + \boldsymbol{\beta}^T \mathbf{X}^T \mathbf{X} \boldsymbol{\beta}$$

Differentiating with respect to $\boldsymbol{\beta}$ and setting the result to zero gives $\mathbf{X}^T \mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}^T \mathbf{Y}$.
For simple linear regression, $\mathbf{X}^T \mathbf{X}$ is $\begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}$ and $\mathbf{X}^T \mathbf{Y}$ is $\begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix}$. Writing these out element by element:

$$\begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix} \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix}$$

This gives you two simultaneous equations (the classical normal equations you may have seen in scalar form):

$$n\hat{\beta}_0 + \hat{\beta}_1 \sum x_i = \sum y_i$$
$$\hat{\beta}_0 \sum x_i + \hat{\beta}_1 \sum x_i^2 = \sum x_i y_i$$

The matrix formulation and the scalar normal equations are the same system, just written differently.
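That equivalence can be verified directly: solve the scalar normal equations via the classical closed forms for $\hat{\beta}_0$ and $\hat{\beta}_1$, then compare against the matrix solution. The data are assumed for illustration.

```python
import numpy as np

# Assumed illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
n = x.size

# Scalar route: closed-form solutions of the two normal equations.
# b1 = (sum x_i y_i - n xbar ybar) / (sum x_i^2 - n xbar^2), b0 = ybar - b1 xbar.
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()

# Matrix route: beta_hat = (X^T X)^{-1} X^T y.
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(beta_hat, [b0, b1]))  # True
```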
Solving for Parameter Estimates
Premultiplying both sides of the normal equation by $(\mathbf{X}^T \mathbf{X})^{-1}$:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$$

The variance-covariance matrix of the parameter estimates is:

$$\operatorname{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^T \mathbf{X})^{-1}$$

- The diagonal elements of $\sigma^2 (\mathbf{X}^T \mathbf{X})^{-1}$ give $\operatorname{Var}(\hat{\beta}_0)$ and $\operatorname{Var}(\hat{\beta}_1)$. Taking square roots of these gives you the standard errors you'd use for hypothesis tests and confidence intervals.
- The off-diagonal elements give $\operatorname{Cov}(\hat{\beta}_0, \hat{\beta}_1)$. In simple linear regression, the intercept and slope estimates are generally correlated unless $\bar{x} = 0$.
This is one of the big payoffs of the matrix approach: the formula $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$ and the variance expression $\sigma^2 (\mathbf{X}^T \mathbf{X})^{-1}$ work regardless of whether you have 1 predictor or 100. The structure stays the same.
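To see that payoff concretely, here is the identical formula applied to a bigger design matrix. The number of predictors, true coefficients, and noise level are all assumed for illustration.

```python
import numpy as np

# Assumed setup: p = 3 predictors instead of 1; nothing else changes.
rng = np.random.default_rng(1)
n, p = 50, 3
Xp = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), Xp])        # n x (p+1) design matrix
beta_true = np.array([1.0, 2.0, -0.5, 0.3])  # assumed true coefficients
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Exactly the same estimator as in the simple case: (X^T X)^{-1} X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat.round(2))
```

Only the shape of `X` changed; the estimator and the variance formula are untouched, which is the point of working in matrix form from the start.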