Least Squares Principle in Regression
Minimizing the Sum of Squared Residuals
The core idea behind OLS is straightforward: find the line (or hyperplane, in multiple regression) that makes the residuals as small as possible, in aggregate. A residual is the difference between an observed value $y_i$ and its predicted value $\hat{y}_i$:

$$e_i = y_i - \hat{y}_i$$

Rather than minimizing the raw residuals (which could cancel each other out, since some are positive and some negative), OLS minimizes the sum of squared residuals (SSR):

$$\text{SSR} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Squaring serves two purposes: it penalizes large deviations more heavily than small ones, and it eliminates the sign-cancellation problem.
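To make the sign-cancellation point concrete, here is a small numeric sketch with made-up data and a hypothetical candidate line: the raw residuals sum to zero even though the fit is imperfect, while the squared residuals do not.

```python
import numpy as np

# Made-up observations.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.5, 4.5, 7.5, 8.5])

# Fitted values from a hypothetical candidate line y_hat = 1 + 2x.
y_hat = 1.0 + 2.0 * x
residuals = y - y_hat  # [0.5, -0.5, 0.5, -0.5]

# Raw residuals cancel to zero even though the fit is imperfect;
# squaring eliminates the cancellation.
print(residuals.sum())         # 0.0
print((residuals ** 2).sum())  # 1.0
```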
Note that the least squares principle is purely an optimization criterion. The classical assumptions about the errors (normality, zero mean, constant variance / homoscedasticity, independence) are separate conditions that determine how well the resulting estimates behave statistically.
Unique Solution for Regression Coefficients
OLS yields a unique set of coefficient estimates as long as the columns of the design matrix $X$ are linearly independent (i.e., $X^\top X$ is invertible). When that condition holds, the sum-of-squared-residuals surface is a strictly convex function of $\boldsymbol{\beta}$, so it has exactly one minimum.
This uniqueness is one reason OLS is so widely used. Under the classical linear model assumptions (linearity, independence, homoscedasticity, and non-stochastic or exogenous regressors), the Gauss-Markov theorem guarantees that OLS estimators are the Best Linear Unbiased Estimators (BLUE):
- Unbiased: $E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$, so on average the estimates equal the true parameters.
- Minimum variance: Among all estimators that are both linear in $\mathbf{y}$ and unbiased, OLS has the smallest variance.
The Gauss-Markov result does not require normality of errors. Normality becomes important when you want to do inference (confidence intervals, hypothesis tests) in finite samples.
Normal Equations for OLS Estimators
Deriving the Normal Equations
The normal equations come from setting the gradient of the SSR to zero. Here's the process for simple linear regression with intercept $\beta_0$ and slope $\beta_1$:

- Write the objective function:

  $$S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

- Take the partial derivative with respect to $\beta_0$, set it to zero:

  $$\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0$$

- Take the partial derivative with respect to $\beta_1$, set it to zero:

  $$\frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0$$

- Simplify to get the two normal equations:

  $$\sum_{i=1}^{n} y_i = n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i$$

  $$\sum_{i=1}^{n} x_i y_i = \hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2$$

These are two linear equations in two unknowns ($\hat{\beta}_0$ and $\hat{\beta}_1$), so they can be solved by substitution or elimination.
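As a sketch, the two normal equations can also be solved numerically as a 2×2 linear system; the data below are made up for illustration.

```python
import numpy as np

# Made-up sample data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

n = len(x)
# The normal equations as a 2x2 linear system:
#   n * b0      + sum(x) * b1    = sum(y)
#   sum(x) * b0 + sum(x^2) * b1  = sum(x*y)
A = np.array([[n,       x.sum()],
              [x.sum(), (x ** 2).sum()]])
b = np.array([y.sum(), (x * y).sum()])

b0, b1 = np.linalg.solve(A, b)
print(b0, b1)  # intercept 0.05, slope 1.99 for this sample
```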

Normal Equations in Matrix Form
For multiple linear regression with $k$ predictor variables, writing out individual partial derivatives becomes impractical. The matrix formulation handles any number of predictors at once:

$$X^\top X \hat{\boldsymbol{\beta}} = X^\top \mathbf{y}$$

where:
- $X$ is the $n \times (k+1)$ design matrix (the first column is typically all 1s for the intercept)
- $X^\top$ is the transpose of $X$
- $\mathbf{y}$ is the $n \times 1$ vector of observed responses
- $\hat{\boldsymbol{\beta}}$ is the $(k+1) \times 1$ vector of coefficient estimates
This is the matrix equivalent of setting all partial derivatives to zero simultaneously. Software functions like lm() in R or LinearRegression in scikit-learn solve this system (often via numerically stable decompositions like QR rather than direct inversion).
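A minimal NumPy sketch of this, using made-up data: a direct solve of the normal equations next to NumPy's QR-based least-squares routine, which is closer to what lm() and scikit-learn do internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 2
# Made-up design matrix: intercept column of 1s plus k random predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Direct solve of the normal equations X'X b = X'y
# (fine for small, well-conditioned problems):
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# QR-based least squares, the numerically stabler route most software takes:
beta_qr, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_qr))  # True
```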
Calculating OLS Estimates
Simple Linear Regression
Solving the two normal equations from above gives closed-form formulas. Start with the slope, then use it to get the intercept:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

The second form of $\hat{\beta}_1$ (using deviations from means) is worth remembering because it shows that the slope is the ratio of the sample covariance of $x$ and $y$ to the sample variance of $x$.

The intercept formula also tells you something useful: the fitted line always passes through the point $(\bar{x}, \bar{y})$.
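A short sketch of the closed-form formulas on a made-up sample, including a check that the fitted line passes through the point of means:

```python
import numpy as np

# Made-up sample.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 4.0, 6.0])

# Slope: sample covariance of x and y over sample variance of x
# (the deviations-from-means form).
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
# Intercept from the means.
b0 = y.mean() - b1 * x.mean()

print(b1, b0)  # 0.8 and -0.5 for this sample
# The fitted line passes through (x-bar, y-bar).
print(np.isclose(b0 + b1 * x.mean(), y.mean()))  # True
```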
Multiple Linear Regression
In the general case, the OLS solution is:

$$\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$$

This requires $X^\top X$ to be invertible, which fails when predictors are perfectly collinear (e.g., one column is an exact linear combination of others). In practice, near-collinearity (multicollinearity) doesn't prevent inversion but inflates the variance of the estimates, making them unstable.

You'll rarely compute $\hat{\boldsymbol{\beta}}$ by hand. Software handles the matrix algebra and also returns standard errors, t-values, and p-values alongside the coefficient estimates.
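A quick sketch of the perfect-collinearity failure mode, with a made-up predictor duplicated as an exact multiple of another:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
x1 = rng.normal(size=n)
x2 = 2.0 * x1  # exact linear combination: perfect collinearity
X = np.column_stack([np.ones(n), x1, x2])

# X has 3 columns but only rank 2, so X'X is singular and (X'X)^{-1}
# does not exist; the closed-form OLS solution breaks down.
print(np.linalg.matrix_rank(X.T @ X))  # 2
```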

Interpreting OLS Estimates
Regression Coefficients
Each slope coefficient $\hat{\beta}_j$ estimates the expected change in $y$ for a one-unit increase in $x_j$, holding all other predictors constant (ceteris paribus). This "holding constant" part is critical in multiple regression because it means each coefficient reflects the partial effect of that predictor.

The intercept $\hat{\beta}_0$ is the predicted value of $y$ when every predictor equals zero. Sometimes this is meaningful (e.g., baseline test score with zero hours of study). Other times it's not (e.g., predicting weight when height is zero). Either way, the intercept is necessary for the model to fit correctly; just be cautious about interpreting it substantively.
Sign and Magnitude of Coefficients
- Sign: A positive $\hat{\beta}_j$ means $y$ tends to increase as $x_j$ increases; a negative $\hat{\beta}_j$ means $y$ tends to decrease.
- Magnitude: Larger absolute values indicate a steeper relationship per unit of $x_j$. A coefficient of 2.5 means each one-unit increase in $x_j$ is associated with a 2.5-unit increase in $y$, compared to only a 0.5-unit increase for a coefficient of 0.5.
However, you cannot directly compare magnitudes across predictors measured on different scales. A coefficient of 0.05 on a variable measured in meters describes exactly the same relationship as a coefficient of 50 on that variable measured in kilometers. If you want to compare relative importance, you'd need to look at standardized coefficients or other measures.
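A sketch of why scale matters, using made-up predictors whose raw slopes differ by a factor of 60 purely because of units; z-scoring every variable before fitting puts the coefficients on a comparable scale.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
# Made-up predictors on very different scales.
x1 = rng.normal(scale=1.0, size=n)
x2 = rng.normal(scale=100.0, size=n)
y = 3.0 * x1 + 0.05 * x2 + rng.normal(size=n)

def ols(X, y):
    """OLS with an intercept, via numerically stable least squares."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

raw = ols(np.column_stack([x1, x2]), y)

# Standardized coefficients: z-score every variable first, then refit.
z = lambda v: (v - v.mean()) / v.std()
std = ols(np.column_stack([z(x1), z(x2)]), z(y))

print(raw[1:])  # raw slopes near 3.0 and 0.05: a 60-fold gap from units alone
print(std[1:])  # standardized slopes land on a comparable scale
```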
Considerations for Interpretation
Units matter. The coefficient's units are always (units of $y$) per (units of $x$). If $y$ is income in thousands of dollars and $x$ is education in years, then $\hat{\beta}_1 = 3.2$ means each additional year of education is associated with $3,200 more income, on average.
Assumptions matter. OLS estimates are only BLUE when the Gauss-Markov conditions hold. Violations like heteroscedasticity or omitted variable bias can make the estimates biased, inefficient, or both. Always check diagnostics before trusting your interpretation.
Inference tools for assessing estimates:
- Confidence intervals: A 95% CI for $\beta_j$ gives a range of plausible values for the true parameter. If the interval excludes zero, the predictor is statistically significant at the 5% level.
- t-tests: The test statistic $t = \hat{\beta}_j / \text{SE}(\hat{\beta}_j)$ tests whether the coefficient differs from zero. Large absolute $t$-values (and correspondingly small p-values) indicate statistical significance.
These inference procedures rely on the normality assumption (or large-sample approximations), which goes beyond what Gauss-Markov alone provides.
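As a sketch of how these quantities are computed (assuming SciPy is available for the t distribution), the standard errors come from the diagonal of $\hat{\sigma}^2 (X^\top X)^{-1}$; the data are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)  # true intercept 1, true slope 2

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Unbiased error-variance estimate: SSR / (n - number of coefficients).
df = n - X.shape[1]
sigma2 = resid @ resid / df

# Var(beta_hat) = sigma^2 * (X'X)^{-1}; standard errors from its diagonal.
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

t_vals = beta / se                           # t = beta_hat / SE(beta_hat)
p_vals = 2 * stats.t.sf(np.abs(t_vals), df)  # two-sided p-values
t_crit = stats.t.ppf(0.975, df)
ci = np.column_stack([beta - t_crit * se, beta + t_crit * se])  # 95% CIs

print(t_vals)
print(ci)
```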