🥖Linear Modeling Theory Unit 2 Review

2.1 Ordinary Least Squares (OLS) Method

Written by the Fiveable Content Team • Last updated August 2025

Least Squares Principle in Regression

Minimizing the Sum of Squared Residuals

The core idea behind OLS is straightforward: find the line (or hyperplane, in multiple regression) that makes the residuals as small as possible, in aggregate. A residual is the difference between an observed value y_i and its predicted value \hat{y}_i:

e_i = y_i - \hat{y}_i

Rather than minimizing the raw residuals (which could cancel each other out, since some are positive and some negative), OLS minimizes the sum of squared residuals (SSR):

SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Squaring serves two purposes: it penalizes large deviations more heavily than small ones, and it eliminates the sign-cancellation problem.
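As a concrete illustration, here is a minimal numpy sketch (the data values are made up) that evaluates the SSR for candidate lines; OLS picks the intercept and slope that drive this quantity as low as possible:

```python
import numpy as np

# Toy data (hypothetical values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def ssr(beta0, beta1, x, y):
    """Sum of squared residuals for a candidate line y = beta0 + beta1 * x."""
    residuals = y - (beta0 + beta1 * x)
    return float(np.sum(residuals ** 2))

# A poor line vs. a better line: OLS seeks the (beta0, beta1) minimizing SSR
print(ssr(0.0, 1.0, x, y))  # slope too shallow: large SSR
print(ssr(0.0, 2.0, x, y))  # close to the data: small SSR
```

Note how squaring means the far-off line is punished disproportionately by its largest residuals.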

Note that the least squares principle is purely an optimization criterion. The classical assumptions about the errors (normality, zero mean, constant variance / homoscedasticity, independence) are separate conditions that determine how well the resulting estimates behave statistically.

Unique Solution for Regression Coefficients

OLS yields a unique set of coefficient estimates as long as the columns of the design matrix X are linearly independent (i.e., X^T X is invertible). When that condition holds, the sum-of-squared-residuals surface is a strictly convex function of \beta, so it has exactly one minimum.

This uniqueness is one reason OLS is so widely used. Under the classical linear model assumptions (linearity, independence, homoscedasticity, and non-stochastic or exogenous regressors), the Gauss-Markov theorem guarantees that OLS estimators are the Best Linear Unbiased Estimators (BLUE):

  • Unbiased: E(\hat{\beta}) = \beta
  • Minimum variance: Among all estimators that are both linear in y and unbiased, OLS has the smallest variance.

The Gauss-Markov result does not require normality of errors. Normality becomes important when you want to do inference (confidence intervals, hypothesis tests) in finite samples.
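This can be checked empirically. The sketch below (a hypothetical simulation setup, not from the text) fits OLS on many datasets generated with deliberately non-normal, zero-mean errors; the average slope estimate lands near the true value, illustrating unbiasedness without normality:

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_b0, true_b1 = 50, 1.0, 2.0
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x])  # design matrix with intercept column

# Average the OLS slope over many simulated datasets
slopes = []
for _ in range(2000):
    eps = rng.uniform(-1, 1, n)              # zero-mean, non-normal errors
    y = true_b0 + true_b1 * x + eps
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    slopes.append(beta_hat[1])

print(np.mean(slopes))  # very close to the true slope 2.0
```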

Normal Equations for OLS Estimators

Deriving the Normal Equations

The normal equations come from setting the gradient of the SSR to zero. Here's the process for simple linear regression with intercept \beta_0 and slope \beta_1:

  1. Write the objective function: SSR = \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2

  2. Take the partial derivative with respect to \beta_0, set it to zero: \frac{\partial \, SSR}{\partial \beta_0} = -2\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i) = 0

  3. Take the partial derivative with respect to \beta_1, set it to zero: \frac{\partial \, SSR}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i(y_i - \beta_0 - \beta_1 x_i) = 0

  4. Simplify to get the two normal equations:

\sum y_i = n\beta_0 + \beta_1 \sum x_i

\sum x_i y_i = \beta_0 \sum x_i + \beta_1 \sum x_i^2

These are two linear equations in two unknowns (\beta_0 and \beta_1), so they can be solved by substitution or elimination.
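For instance, the two normal equations can be assembled as a 2×2 linear system and solved directly (the toy data below are made up and chosen so the answer is exact):

```python
import numpy as np

# Small hypothetical dataset: exactly y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
n = len(x)

# Normal equations as a 2x2 system A @ [b0, b1] = rhs:
#   sum(y)   = n*b0      + b1*sum(x)
#   sum(x*y) = b0*sum(x) + b1*sum(x**2)
A = np.array([[n, x.sum()],
              [x.sum(), (x**2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])

b0, b1 = np.linalg.solve(A, rhs)
print(b0, b1)  # recovers intercept 1 and slope 2
```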


Normal Equations in Matrix Form

For multiple linear regression with p predictor variables, writing out individual partial derivatives becomes impractical. The matrix formulation handles any number of predictors at once:

X^T X \, \hat{\beta} = X^T y

where:

  • X is the n \times (p+1) design matrix (the first column is typically all 1s for the intercept)
  • X^T is the transpose of X
  • y is the n \times 1 vector of observed responses
  • \hat{\beta} is the (p+1) \times 1 vector of coefficient estimates

This is the matrix equivalent of setting all partial derivatives to zero simultaneously. Software functions like lm() in R or LinearRegression in scikit-learn solve this system (often via numerically stable decompositions like QR rather than direct inversion).
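Here is a minimal numpy sketch of both routes, using made-up data: solving the normal-equation system directly, and calling a general least-squares routine (which uses stable factorizations internally), roughly as lm()-style software does:

```python
import numpy as np

# Hypothetical data: 6 observations, 2 predictors
X = np.array([[1.0, 2.0, 0.5],
              [1.0, 1.0, 1.5],
              [1.0, 3.0, 2.0],
              [1.0, 4.0, 0.0],
              [1.0, 2.5, 1.0],
              [1.0, 0.5, 2.5]])      # first column of 1s for the intercept
y = np.array([5.0, 4.0, 8.0, 7.0, 6.0, 5.5])

# Route 1: solve the normal equations X^T X beta = X^T y as a linear system
beta_direct = np.linalg.solve(X.T @ X, X.T @ y)

# Route 2: a least-squares solver, avoiding explicit formation of X^T X
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_direct)
print(beta_lstsq)  # the two routes agree for this well-conditioned problem
```

A defining property of the solution is that the residual vector is orthogonal to every column of X, which is exactly what the normal equations state.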

Calculating OLS Estimates

Simple Linear Regression

Solving the two normal equations from above gives closed-form formulas. Start with the slope, then use it to get the intercept:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

The second form of \hat{\beta}_1 (using deviations from means) is worth remembering because it shows that the slope is the ratio of the sample covariance of x and y to the sample variance of x:

\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}

The intercept formula also tells you something useful: the fitted line always passes through the point (\bar{x}, \bar{y}).
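A quick numeric check of these formulas, using small hypothetical data, computes the slope as S_xy / S_xx, backs out the intercept, and confirms the fitted line passes through the point of means:

```python
import numpy as np

# Hypothetical data for illustration
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 2.9, 4.4, 6.1, 7.4])

xbar, ybar = x.mean(), y.mean()
Sxy = np.sum((x - xbar) * (y - ybar))   # sum of cross-deviations
Sxx = np.sum((x - xbar) ** 2)           # sum of squared deviations of x

b1 = Sxy / Sxx                          # slope = Sxy / Sxx
b0 = ybar - b1 * xbar                   # intercept from the means

# The fitted line passes through (xbar, ybar)
fitted_at_mean = b0 + b1 * xbar
print(b1, b0, fitted_at_mean)
```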

Multiple Linear Regression

In the general case, the OLS solution is:

\hat{\beta} = (X^T X)^{-1} X^T y

This requires X^T X to be invertible, which fails when predictors are perfectly collinear (e.g., one column is an exact linear combination of others). In practice, near-collinearity (multicollinearity) doesn't prevent inversion but inflates the variance of the estimates, making them unstable.

You'll rarely compute (X^T X)^{-1} by hand. Software handles the matrix algebra and also returns standard errors, t-values, and p-values alongside the coefficient estimates.
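Perfect collinearity can be seen numerically. In this sketch (simulated data), one predictor is an exact multiple of another, so distinct coefficient vectors reproduce identical fitted values and X^T X is singular, leaving no unique minimizer:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = 2.0 * x1                       # exact linear combination: perfect collinearity
X = np.column_stack([np.ones(n), x1, x2])

# Two different coefficient vectors with identical fitted values:
# 1*x1 equals 0.5*x2, so the data cannot distinguish them.
beta_a = np.array([0.0, 1.0, 0.0])
beta_b = np.array([0.0, 0.0, 0.5])
print(np.allclose(X @ beta_a, X @ beta_b))  # True

# Numerically, the condition number of X^T X blows up
print(np.linalg.cond(X.T @ X))
```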


Interpreting OLS Estimates

Regression Coefficients

Each slope coefficient \hat{\beta}_j estimates the expected change in y for a one-unit increase in x_j, holding all other predictors constant (ceteris paribus). This "holding constant" part is critical in multiple regression because it means each coefficient reflects the partial effect of that predictor.

The intercept \hat{\beta}_0 is the predicted value of y when every predictor equals zero. Sometimes this is meaningful (e.g., baseline test score with zero hours of study). Other times it's not (e.g., predicting weight when height is zero). Either way, the intercept is necessary for the model to fit correctly; just be cautious about interpreting it substantively.

Sign and Magnitude of Coefficients

  • Sign: A positive \hat{\beta}_j means y tends to increase as x_j increases; a negative \hat{\beta}_j means y tends to decrease.
  • Magnitude: Larger absolute values indicate a steeper relationship per unit of x_j. A coefficient of 2.5 means each one-unit increase in x_j is associated with a 2.5-unit increase in y, compared to only 0.5 for a coefficient of 0.5.

However, you cannot directly compare magnitudes across predictors measured on different scales. A coefficient of 50 on a variable measured in meters is not "stronger" than a coefficient of 0.05 on a variable measured in kilometers. If you want to compare relative importance, you'd need to look at standardized coefficients or other measures.
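One common approach, sketched below with simulated data (all variable names and values are made up), is to refit after z-scoring the predictors and the response; the standardized slopes become comparable even though the raw slopes differ by orders of magnitude:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(0.0, 1.0, n)      # small-scale predictor
x2 = rng.normal(0.0, 100.0, n)    # large-scale predictor
y = 2.0 * x1 + 0.02 * x2 + rng.normal(0.0, 0.5, n)

def ols(X, y):
    """OLS fit with an intercept column prepended."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

def zscore(v):
    return (v - v.mean()) / v.std()

raw = ols(np.column_stack([x1, x2]), y)
std = ols(np.column_stack([zscore(x1), zscore(x2)]), zscore(y))

print(raw[1:])   # raw slopes: wildly different magnitudes (about 2.0 vs 0.02)
print(std[1:])   # standardized slopes: directly comparable, nearly equal here
```

By construction both predictors contribute equally per standard deviation, and the standardized fit reveals exactly that.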

Considerations for Interpretation

Units matter. The coefficient's units are always (units of y) per (units of x_j). If y is income in thousands of dollars and x is education in years, then \hat{\beta} = 3.2 means each additional year of education is associated with $3,200 more income, on average.

Assumptions matter. OLS estimates are only BLUE when the Gauss-Markov conditions hold. Heteroscedasticity makes the estimates inefficient (and the usual standard errors unreliable), while omitted variable bias makes them biased. Always check diagnostics before trusting your interpretation.

Inference tools for assessing estimates:

  • Confidence intervals: A 95% CI for \beta_j gives a range of plausible values for the true parameter. If the interval excludes zero, the predictor is statistically significant at the 5% level.
  • t-tests: The test statistic t = \hat{\beta}_j / SE(\hat{\beta}_j) tests whether the coefficient differs from zero. Large absolute t-values (and correspondingly small p-values) indicate statistical significance.

These inference procedures rely on the normality assumption (or large-sample approximations), which goes beyond what Gauss-Markov alone provides.
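These quantities can be computed by hand from the OLS algebra. The sketch below uses simulated data and substitutes the large-sample normal critical value 1.96 for the exact t quantile:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta_hat
p = X.shape[1]
sigma2 = resid @ resid / (n - p)           # unbiased estimate of error variance
cov = sigma2 * np.linalg.inv(X.T @ X)      # estimated Var(beta_hat)
se = np.sqrt(np.diag(cov))

t = beta_hat / se                          # t statistics for H0: beta_j = 0
ci_low = beta_hat - 1.96 * se              # large-sample 95% CI
ci_high = beta_hat + 1.96 * se
print(t)
print(np.column_stack([ci_low, ci_high]))  # slope interval excludes zero
```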