Least Squares Approximations
When a system of equations has no exact solution (more equations than unknowns), least squares gives you the next best thing: the solution that minimizes the total squared error. This technique connects directly to orthogonal projection, since the best approximation is literally the projection of your target vector onto a subspace.
Least Squares Problems with Inner Products
Fundamentals of the Least Squares Method
The core idea: you have a vector $b$ and a subspace $W$, and you want to find the vector $\hat{b}$ in $W$ that's closest to $b$. "Closest" means minimizing the squared distance $\|b - \hat{b}\|^2$, which is where inner products come in.
The orthogonality principle drives everything here. The error vector $e = b - \hat{b}$ (the difference between $b$ and your approximation $\hat{b}$) must be orthogonal to the subspace $W$. If it weren't, you could reduce the error further by adjusting your approximation within $W$.
This principle leads to the normal equations:

$$A^T A \hat{x} = A^T b$$

where $A$ is the matrix whose columns span $W$. The matrix $A^T A$ is called the Gram matrix, built from inner products of the column vectors of $A$. You solve this system to get the coefficient vector $\hat{x}$.
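As a minimal numpy sketch (the matrix and target below are made-up values for illustration), forming and solving the normal equations looks like this; note that the residual comes out orthogonal to every column of $A$, exactly as the orthogonality principle demands:

```python
import numpy as np

# Hypothetical example: find the vector in the column space of A closest to b
# by solving the normal equations A^T A x = A^T b.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

gram = A.T @ A                       # Gram matrix of inner products of A's columns
x_hat = np.linalg.solve(gram, A.T @ b)

# The error b - A x_hat is orthogonal to every column of A.
residual = b - A @ x_hat
```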
Function Approximation and Continuous Least Squares
Least squares isn't limited to discrete data points. For functions, you define the inner product as an integral:

$$\langle f, g \rangle = \int_a^b f(x)\, g(x)\, dx$$
This turns function approximation into the same kind of problem. You pick a set of basis functions (polynomials, trigonometric functions, etc.) and find the linear combination that best approximates your target function in the least squares sense.
The Gram-Schmidt process is especially useful here. Orthogonalizing your basis functions before solving makes the Gram matrix diagonal, which dramatically simplifies computation and improves numerical stability.
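To illustrate why an orthogonal basis helps, here is a sketch (assuming numpy; the target function $e^x$ and the interval $[-1, 1]$ are chosen for illustration) that fits the first three Legendre polynomials, which are already orthogonal under this inner product, so the Gram matrix is diagonal and each coefficient decouples into a single ratio of inner products:

```python
import numpy as np

# Continuous least squares fit of f(x) = e^x on [-1, 1] using the
# orthogonal Legendre basis P_0, P_1, P_2: <P_i, P_j> = 0 for i != j,
# so the Gram matrix is diagonal and the coefficients decouple.
nodes, weights = np.polynomial.legendre.leggauss(20)  # quadrature on [-1, 1]

def inner(f, g):
    """Integral inner product <f, g> = ∫ f(x) g(x) dx via Gauss quadrature."""
    return np.sum(weights * f(nodes) * g(nodes))

basis = [np.polynomial.Legendre.basis(k) for k in range(3)]
f = np.exp

# Diagonal Gram matrix: c_k = <f, P_k> / <P_k, P_k>, no linear solve needed.
coeffs = [inner(f, p) / inner(p, p) for p in basis]

def approx(x):
    return sum(c * p(x) for c, p in zip(coeffs, basis))
```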
Best Approximating Vectors in Subspaces

Orthogonal Projection and Error Analysis
The best approximation of $b$ in subspace $W$ is the orthogonal projection of $b$ onto $W$. Here's how to think about it step by step:
- You have a target vector $b$ and a subspace $W$ spanned by the columns of $A$.
- The projection $\hat{b} = A\hat{x}$ is the unique vector in $W$ closest to $b$.
- The error vector $b - \hat{b}$ is orthogonal to every vector in $W$.
- The coefficients $\hat{x}$ satisfy the normal equations $A^T A \hat{x} = A^T b$.
Uniqueness: The best approximation $\hat{b}$ is always unique. The coefficient vector $\hat{x}$ is unique when the columns of $A$ are linearly independent (i.e., $A^T A$ is invertible).
If you have an orthonormal basis $q_1, \dots, q_k$ for $W$, the projection simplifies to:

$$\hat{b} = \langle b, q_1 \rangle q_1 + \cdots + \langle b, q_k \rangle q_k$$

The Residual Sum of Squares (RSS) measures how good the approximation is:

$$\text{RSS} = \|b - \hat{b}\|^2 = \sum_i \big(b_i - \hat{b}_i\big)^2$$
The least squares solution is precisely the one that minimizes this quantity.
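A short numpy sketch (data values are made up for illustration) of both ideas at once: QR gives an orthonormal basis for the column space, the projection reduces to $Q(Q^T b)$, and the RSS is just the squared norm of the error:

```python
import numpy as np

# With an orthonormal basis Q for the subspace, the projection of b is
# simply Q @ (Q.T @ b); the RSS is the squared length of the error.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.0, 3.0, 2.0, 5.0])

Q, _ = np.linalg.qr(A)           # orthonormal basis for the column space of A
b_hat = Q @ (Q.T @ b)            # orthogonal projection of b onto that space
rss = np.sum((b - b_hat) ** 2)   # residual sum of squares
```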
Applications to Overdetermined Systems
An overdetermined system $Ax = b$ has more equations than unknowns, so typically no exact solution exists. Least squares finds the $\hat{x}$ that makes $A\hat{x}$ as close to $b$ as possible.
A common example: fitting a line $y = c_0 + c_1 x$ to data points $(x_1, y_1), \dots, (x_m, y_m)$ where $m > 2$. Each data point gives one equation, producing an overdetermined system. The least squares solution gives the line of best fit, minimizing the sum of squared vertical distances from the data points to the line.
This same framework extends to fitting polynomials, exponentials, or any model that's linear in its parameters.
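A sketch of the line-fitting setup in numpy (the data points are invented for illustration); each row of the design matrix is one equation $c_0 + c_1 x_i = y_i$, and swapping in extra columns handles any model linear in its parameters:

```python
import numpy as np

# Fit y = c0 + c1*x to four data points: four equations, two unknowns.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.0])

A = np.column_stack([np.ones_like(x), x])       # one row per data point
coef, res, rank, sv = np.linalg.lstsq(A, y, rcond=None)
c0, c1 = coef

# Same framework for a quadratic: just add a column,
# A = np.column_stack([np.ones_like(x), x, x**2])
```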
Solving Least Squares with Orthogonal Projections
Direct Methods and Matrix Decompositions
There are several ways to solve least squares problems, each with trade-offs in stability and efficiency.
Method 1: Normal Equations
Solve $A^T A \hat{x} = A^T b$ directly. This is conceptually simple, and you can use Cholesky decomposition (since $A^T A$ is symmetric positive definite when $A$ has full column rank) to solve efficiently. The downside: forming $A^T A$ can amplify numerical errors, roughly squaring the condition number of $A$.
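A minimal sketch of Method 1 (example values are made up; numpy only, though scipy's `cho_factor`/`cho_solve` would exploit the triangular factor more directly):

```python
import numpy as np

# Normal equations via Cholesky: G = L L^T, then two triangular solves.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

G = A.T @ A                        # Gram matrix, SPD since A has full column rank
L = np.linalg.cholesky(G)          # G = L L^T
y = np.linalg.solve(L, A.T @ b)    # forward solve  L y = A^T b
x_hat = np.linalg.solve(L.T, y)    # back solve     L^T x = y
```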
Method 2: QR Decomposition
Factor $A = QR$, where $Q$ has orthonormal columns and $R$ is upper triangular. Then the least squares solution satisfies:

$$R\hat{x} = Q^T b$$

This avoids forming $A^T A$ entirely and is more numerically stable. For most practical problems, QR is the preferred approach.
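Method 2 as a numpy sketch (same illustrative data as above; `np.linalg.solve` stands in for a dedicated back-substitution routine):

```python
import numpy as np

# Least squares via thin QR: never form A^T A.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

Q, R = np.linalg.qr(A)                # Q: orthonormal columns, R: upper triangular
x_hat = np.linalg.solve(R, Q.T @ b)   # solve the triangular system R x = Q^T b
```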
Method 3: Projection Matrix
The orthogonal projection matrix $P = A(A^T A)^{-1} A^T$ projects any vector directly onto the column space of $A$. So $\hat{b} = Pb$. This is useful conceptually but less efficient computationally than QR for solving individual problems.
For rank-deficient cases (columns of $A$ are linearly dependent), the pseudoinverse (Moore-Penrose inverse) $A^+$ gives the minimum-norm least squares solution: $\hat{x} = A^+ b$.
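A sketch of the rank-deficient case (the matrix below is constructed for illustration): the third column is the sum of the first two, so $A^T A$ is singular and infinitely many coefficient vectors achieve the same minimal error; `np.linalg.pinv` picks the one with the smallest norm, which is orthogonal to the null space of $A$:

```python
import numpy as np

# Rank-deficient A: column 3 = column 1 + column 2, so A^T A is singular.
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0]])
b = np.array([6.0, 0.0, 0.0])

x_min = np.linalg.pinv(A) @ b    # minimum-norm solution among all minimizers
b_hat = A @ x_min                # projection of b onto the column space
```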

Iterative and Regularization Techniques
For large-scale problems where direct methods are too expensive, iterative methods like the conjugate gradient algorithm solve the normal equations without explicitly forming $A^T A$.
When the problem is ill-conditioned (small changes in data cause large changes in the solution), regularization helps:
- Tikhonov regularization (ridge regression) adds a penalty: minimize $\|Ax - b\|^2 + \lambda \|x\|^2$. This biases the solution toward smaller coefficients, trading a bit of fit quality for much better stability.
- L1 regularization (Lasso) penalizes $\lambda \|x\|_1$ instead, which tends to push some coefficients to exactly zero, producing sparse solutions.
- Truncated SVD keeps only the largest singular values, discarding components associated with near-zero singular values that amplify noise.
Cross-validation is the standard technique for choosing the regularization parameter $\lambda$.
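A sketch of the Tikhonov case (example data invented for illustration), using the regularized normal equations $(A^T A + \lambda I)\,x = A^T b$; setting $\lambda = 0$ recovers ordinary least squares, and increasing $\lambda$ shrinks the coefficients:

```python
import numpy as np

def ridge(A, b, lam):
    """Tikhonov-regularized least squares: (A^T A + lam*I) x = A^T b."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

x_ls = ridge(A, b, 0.0)   # lam = 0: ordinary least squares solution
x_rg = ridge(A, b, 1.0)   # lam > 0: coefficients shrunk toward zero
```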
Geometric Interpretation of Least Squares Approximations
Euclidean Geometry of Least Squares
The geometry here is clean and worth internalizing. Picture $b$ as a point in $\mathbb{R}^m$ and the column space of $A$ as a subspace (a plane, for instance, if $A$ has two independent columns). The least squares solution $\hat{b} = A\hat{x}$ is the point on that plane closest to $b$.
The error vector $e = b - \hat{b}$ points straight "up" from the plane to $b$, forming a right angle with the subspace. This is exactly the Pythagorean relationship:

$$\|b\|^2 = \|\hat{b}\|^2 + \|e\|^2$$
The coefficient of determination $R^2$ has a geometric meaning too: it equals the squared cosine of the angle between $b$ and its projection $\hat{b}$ (for centered data). An $R^2$ near 1 means $b$ nearly lies in the subspace; near 0 means the subspace captures almost none of the variation in $b$.
Geometric Concepts in Regression Analysis
Several regression diagnostics connect back to this geometry:
- The hat matrix $H = X(X^T X)^{-1} X^T$ is the projection matrix onto the column space of the design matrix $X$. It maps observed values to fitted values: $\hat{y} = Hy$.
- Leverage values $h_{ii}$ are the diagonal entries of $H$. A high leverage value means that data point has a strong geometric pull on the fitted subspace.
- Mahalanobis distance generalizes Euclidean distance by accounting for correlations in the data. It measures how far a point is from the center of a distribution, scaled by the covariance structure. This is useful for detecting outliers in multivariate settings.
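The hat matrix and leverages can be sketched in numpy (the design matrix below is hypothetical; its last observation sits far from the others in $x$, so it carries most of the leverage):

```python
import numpy as np

# Hat matrix H = X (X^T X)^{-1} X^T and its diagonal leverage values.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 10.0]])   # x = 10 is far from the other points

H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)
# H is an orthogonal projection: symmetric, idempotent,
# and trace(H) equals the number of fitted parameters (here 2).
```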
Principal Component Analysis (PCA) also connects here: it identifies the orthogonal directions along which data varies most, effectively finding the best low-dimensional subspace for the data in a least squares sense.