Linear Algebra and Differential Equations Unit 6 Review

6.3 Least Squares Approximations

Written by the Fiveable Content Team • Last updated August 2025

Least Squares Approximations

When a system of equations has no exact solution (more equations than unknowns), least squares gives you the next best thing: the solution that minimizes the total squared error. This technique connects directly to orthogonal projection, since the best approximation is literally the projection of your target vector onto a subspace.

Least Squares Problems with Inner Products

Fundamentals of the Least Squares Method

The core idea: you have a vector $b$ and a subspace $W$, and you want to find the vector in $W$ that's closest to $b$. "Closest" means minimizing the squared distance, which is where inner products come in.

The orthogonality principle drives everything here. The error vector (the difference between $b$ and your approximation $\hat{b}$) must be orthogonal to the subspace $W$. If it weren't, you could reduce the error further by adjusting your approximation within $W$.

This principle leads to the normal equations:

$$A^T A \hat{x} = A^T b$$

where $A$ is the matrix whose columns span $W$. The matrix $A^T A$ is called the Gram matrix, built from inner products of the column vectors of $A$. You solve this system to get the coefficient vector $\hat{x}$.
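To make this concrete, here is a minimal NumPy sketch (with made-up data) that solves the normal equations and checks the orthogonality principle:

```python
import numpy as np

# Hypothetical overdetermined system: 3 equations, 2 unknowns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Normal equations: A^T A x_hat = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# The error b - A x_hat is orthogonal to every column of A.
residual = b - A @ x_hat
```

Here `A.T @ residual` comes out as the zero vector (up to rounding), which is the orthogonality principle in matrix form.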

Function Approximation and Continuous Least Squares

Least squares isn't limited to discrete data points. For functions, you define the inner product as an integral:

$$\langle f, g \rangle = \int_a^b f(x)\,g(x)\,dx$$

This turns function approximation into the same kind of problem. You pick a set of basis functions (polynomials, trigonometric functions, etc.) and find the linear combination that best approximates your target function in the least squares sense.

The Gram-Schmidt process is especially useful here. Orthogonalizing your basis functions before solving makes the Gram matrix diagonal, which dramatically simplifies computation and improves numerical stability.
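As an illustration of why an orthogonal basis helps, here is a hedged NumPy sketch of continuous least squares on $[-1, 1]$: the Legendre polynomials are orthogonal there, so each coefficient is a single inner product divided by $\langle P_k, P_k \rangle = 2/(2k+1)$, with the integral evaluated by Gauss-Legendre quadrature. The function name `legendre_coeffs` is our own, not a library routine:

```python
import numpy as np

def legendre_coeffs(f, degree, nquad=50):
    """Continuous least-squares coefficients of f on [-1, 1] in the
    orthogonal Legendre basis: c_k = (2k + 1)/2 * <f, P_k>."""
    x, w = np.polynomial.legendre.leggauss(nquad)  # quadrature nodes, weights
    c = np.empty(degree + 1)
    for k in range(degree + 1):
        Pk = np.polynomial.legendre.legval(x, [0.0] * k + [1.0])  # P_k at nodes
        c[k] = (2 * k + 1) / 2 * np.sum(w * f(x) * Pk)
    return c

# x^2 = (1/3) P_0 + (2/3) P_2, so the coefficients are [1/3, 0, 2/3].
c = legendre_coeffs(lambda x: x**2, degree=2)
```

Because the basis is orthogonal, no linear system is solved at all: the "Gram matrix" is diagonal, and each coefficient is computed independently.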

Best Approximating Vectors in Subspaces

Orthogonal Projection and Error Analysis

The best approximation of $b$ in subspace $W$ is the orthogonal projection of $b$ onto $W$. Here's how to think about it step by step:

  1. You have a target vector $b$ and a subspace $W$ spanned by the columns of $A$.

  2. The projection $\hat{b}$ is the unique vector in $W$ closest to $b$.

  3. The error vector $e = b - \hat{b}$ is orthogonal to every vector in $W$.

  4. The coefficients $\hat{x}$ satisfy the normal equations $A^T A \hat{x} = A^T b$.

Uniqueness: The best approximation $\hat{b}$ is always unique. The coefficient vector $\hat{x}$ is unique when the columns of $A$ are linearly independent (i.e., $A^T A$ is invertible).

If you have an orthonormal basis $\{u_1, u_2, \ldots, u_k\}$ for $W$, the projection simplifies to:

$$\hat{b} = \langle b, u_1 \rangle u_1 + \langle b, u_2 \rangle u_2 + \cdots + \langle b, u_k \rangle u_k$$
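With an orthonormal basis in hand, the projection is just a sum of inner products; a small NumPy sketch with hypothetical vectors:

```python
import numpy as np

b = np.array([1.0, 2.0, 3.0])

# Orthonormal basis for the xy-plane in R^3 (rotated 45 degrees).
u1 = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
u2 = np.array([1.0, -1.0, 0.0]) / np.sqrt(2)

# b_hat = <b, u1> u1 + <b, u2> u2
b_hat = (b @ u1) * u1 + (b @ u2) * u2
# Projecting onto the xy-plane drops the z-component: b_hat = [1, 2, 0].
```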

The Residual Sum of Squares (RSS) measures how good the approximation is:

$$\mathrm{RSS} = \|b - \hat{b}\|^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

The least squares solution is precisely the one that minimizes this quantity.

Applications to Overdetermined Systems

An overdetermined system $Ax = b$ has more equations than unknowns, so typically no exact solution exists. Least squares finds the $\hat{x}$ that makes $A\hat{x}$ as close to $b$ as possible.

A common example: fitting a line $y = c_0 + c_1 x$ to data points $(x_1, y_1), \ldots, (x_n, y_n)$ where $n > 2$. Each data point gives one equation, producing an overdetermined system. The least squares solution gives the line of best fit, minimizing the sum of squared vertical distances from the data points to the line.
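A sketch of that line fit with NumPy's built-in least squares solver, on made-up data points:

```python
import numpy as np

# Hypothetical data: four points lying roughly on a line.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.0])

# One equation per data point: c0 + c1 * x_i = y_i.
A = np.column_stack([np.ones_like(x), x])
(c0, c1), *_ = np.linalg.lstsq(A, y, rcond=None)
# Best-fit line: y = 0.97 + 2.02 x
```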

This same framework extends to fitting polynomials, exponentials, or any model that's linear in its parameters.

Solving Least Squares with Orthogonal Projections

Direct Methods and Matrix Decompositions

There are several ways to solve least squares problems, each with trade-offs in stability and efficiency.

Method 1: Normal Equations

Solve $A^T A \hat{x} = A^T b$ directly. This is conceptually simple, and you can use Cholesky decomposition (since $A^T A$ is symmetric positive definite when $A$ has full column rank) to solve efficiently. The downside: forming $A^T A$ can amplify numerical errors, roughly squaring the condition number of $A$.
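A NumPy sketch of the Cholesky route on made-up data: factor the Gram matrix as $LL^T$, then do a forward and a back substitution. (For brevity `np.linalg.solve` handles the triangular systems; a dedicated triangular solver would be faster.)

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

G = A.T @ A                          # Gram matrix, symmetric positive definite here
L = np.linalg.cholesky(G)            # G = L L^T with L lower triangular
y = np.linalg.solve(L, A.T @ b)      # forward substitution: L y = A^T b
x_hat = np.linalg.solve(L.T, y)      # back substitution: L^T x_hat = y
```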

Method 2: QR Decomposition

Factor $A = QR$, where $Q$ has orthonormal columns and $R$ is upper triangular. Then the least squares solution satisfies:

$$R\hat{x} = Q^T b$$

This avoids forming $A^T A$ entirely and is more numerically stable. For most practical problems, QR is the preferred approach.
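A minimal NumPy sketch of the QR route on made-up data (`np.linalg.qr` returns the reduced factorization by default):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

Q, R = np.linalg.qr(A)               # Q: 3x2 with orthonormal columns, R: 2x2 upper triangular
x_hat = np.linalg.solve(R, Q.T @ b)  # R x_hat = Q^T b
```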

Method 3: Projection Matrix

The orthogonal projection matrix $P = A(A^T A)^{-1}A^T$ projects any vector directly onto the column space of $A$, so $\hat{b} = Pb$. This is useful conceptually but less efficient computationally than QR for solving individual problems.

For rank-deficient cases (columns of $A$ are linearly dependent), the pseudoinverse (Moore-Penrose inverse) $A^+$ gives the minimum-norm least squares solution: $\hat{x} = A^+ b$.
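Both routes can be sketched in NumPy on made-up data; when $A$ has full column rank they agree exactly (`np.linalg.pinv` computes the Moore-Penrose inverse via the SVD):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Projection matrix onto the column space of A (A has full column rank here).
P = A @ np.linalg.inv(A.T @ A) @ A.T
b_hat = P @ b

# Pseudoinverse route: also handles rank-deficient A (minimum-norm solution).
x_hat = np.linalg.pinv(A) @ b
```

Note that $P$ is symmetric and idempotent ($P^2 = P$), as any orthogonal projection matrix must be.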

Iterative and Regularization Techniques

For large-scale problems where direct methods are too expensive, iterative methods like the conjugate gradient algorithm solve the normal equations without explicitly forming $A^T A$.
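A minimal sketch of that idea (conjugate gradient applied to the normal equations, sometimes called CGNR): each iteration touches $A$ only through two matrix-vector products, so $A^T A$ is never formed. The function name and data here are our own, for illustration:

```python
import numpy as np

def cg_normal_equations(A, b, max_iter=100, tol=1e-12):
    """Conjugate gradient on A^T A x = A^T b, applying A^T (A p)
    as two matrix-vector products so A^T A is never formed."""
    x = np.zeros(A.shape[1])
    r = A.T @ b                    # normal-equations residual at x = 0
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A.T @ (A @ p)         # the only place A is used
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])
x_hat = cg_normal_equations(A, b)
```

In exact arithmetic conjugate gradient converges in at most as many iterations as there are unknowns; in practice it is stopped early once the residual is small enough.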

When the problem is ill-conditioned (small changes in data cause large changes in the solution), regularization helps:

  • Tikhonov regularization (ridge regression) adds a penalty: minimize $\|Ax - b\|^2 + \lambda\|x\|^2$. This biases the solution toward smaller coefficients, trading a bit of fit quality for much better stability.
  • L1 regularization (Lasso) penalizes $\|x\|_1$ instead, which tends to push some coefficients to exactly zero, producing sparse solutions.
  • Truncated SVD keeps only the largest singular values, discarding components associated with near-zero singular values that amplify noise.
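Tikhonov regularization has a closed form via the regularized normal equations, $(A^T A + \lambda I)\hat{x} = A^T b$; a short NumPy sketch on made-up data (the function name is ours):

```python
import numpy as np

def ridge(A, b, lam):
    """Minimize ||Ax - b||^2 + lam * ||x||^2 via (A^T A + lam I) x = A^T b."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

x_ols = ridge(A, b, 0.0)    # lam = 0 recovers ordinary least squares
x_reg = ridge(A, b, 1.0)    # lam > 0 shrinks the coefficients
```

Increasing $\lambda$ shrinks $\|\hat{x}\|$ toward zero at the cost of a larger residual, which is exactly the stability-for-fit trade described above.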

Cross-validation is the standard technique for choosing the regularization parameter $\lambda$.

Geometric Interpretation of Least Squares Approximations

Euclidean Geometry of Least Squares

The geometry here is clean and worth internalizing. Picture $b$ as a point in $\mathbb{R}^n$ and the column space of $A$ as a subspace (a plane, for instance, if $A$ has two columns). The projection $\hat{b}$ is the point on that plane closest to $b$.

The error vector $e = b - \hat{b}$ points straight "up" from the plane to $b$, forming a right angle with the subspace. This is exactly the Pythagorean relationship:

$$\|b\|^2 = \|\hat{b}\|^2 + \|e\|^2$$
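This right-angle split is easy to check numerically (made-up data, using `np.linalg.lstsq` for the projection):

```python
import numpy as np

A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
b_hat = A @ x_hat          # orthogonal projection of b onto col(A)
e = b - b_hat              # error vector, perpendicular to col(A)

# Pythagorean relationship: ||b||^2 = ||b_hat||^2 + ||e||^2
lhs = b @ b
rhs = b_hat @ b_hat + e @ e
```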

The coefficient of determination $R^2$ has a geometric meaning too: for centered data, it equals the squared cosine of the angle between $b$ and its projection $\hat{b}$. An $R^2$ near 1 means $b$ nearly lies in the subspace; near 0 means the subspace captures almost none of the variation in $b$.

Geometric Concepts in Regression Analysis

Several regression diagnostics connect back to this geometry:

  • The hat matrix $H = A(A^T A)^{-1}A^T$ is the projection matrix onto the column space of $A$. It maps observed values to fitted values: $\hat{y} = Hy$.
  • Leverage values are the diagonal entries of $H$. A high leverage value means that data point has a strong geometric pull on the fitted subspace.
  • Mahalanobis distance generalizes Euclidean distance by accounting for correlations in the data. It measures how far a point is from the center of a distribution, scaled by the covariance structure. This is useful for detecting outliers in multivariate settings.
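The hat matrix and leverage values from the list above can be sketched directly (hypothetical data; note the deliberately outlying last x-value):

```python
import numpy as np

# Simple-regression design matrix; the last x-value sits far from the rest.
x = np.array([0.0, 1.0, 2.0, 10.0])
A = np.column_stack([np.ones_like(x), x])

H = A @ np.linalg.inv(A.T @ A) @ A.T   # hat matrix: y_hat = H y
leverage = np.diag(H)
# The outlying point has by far the largest leverage, and trace(H)
# equals the number of fitted parameters (2 here).
```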

Principal Component Analysis (PCA) also connects here: it identifies the orthogonal directions along which data varies most, effectively finding the best low-dimensional subspace for the data in a least squares sense.