Fiveable

🥖Linear Modeling Theory Unit 2 Review

2.2 Properties of Least Squares Estimators

Written by the Fiveable Content Team • Last updated August 2025

Gauss-Markov Assumptions and Implications

Conditions for Desirable OLS Estimator Properties

The Gauss-Markov assumptions are a set of conditions that, when satisfied, guarantee the OLS estimators have two key properties: unbiasedness and efficiency. Each assumption rules out a specific problem that would undermine your estimates.

Assumption 1: Correct model specification. The relationship between the dependent variable and the independent variables is linear in parameters, and no important variables are omitted. This doesn't mean the relationship between y and x has to be a straight line; it means the model is linear in the coefficients β. A model like y = β₀ + β₁x² + ε is still "linear" in this sense.
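The "linear in parameters" point is easy to check numerically: ordinary least squares fits y = β₀ + β₁x² directly, as long as x² is supplied as a column of the design matrix. A minimal sketch (the coefficients and sample size are made up for illustration):

```python
import numpy as np

# "Linear in parameters": OLS handles y = b0 + b1*x^2 because the design
# matrix columns (here a constant and x^2) enter the model linearly.
# The true coefficients below are illustrative, not from any real dataset.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = 2.0 + 0.5 * x**2 + rng.normal(scale=0.1, size=200)

X = np.column_stack([np.ones_like(x), x**2])   # regressors: intercept, x^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # close to [2.0, 0.5]
```

The nonlinearity lives entirely in how the regressor was constructed; the estimation problem itself is still linear.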

Assumption 2: Zero conditional mean of errors. E[ε | X] = 0. This rules out any systematic relationship between the error term and the independent variables. If this fails, your coefficient estimates will be biased because the errors are "leaking" information that should have been captured by the model.

Assumption 3: Homoscedasticity. The error terms have constant variance across all observations: Var(εᵢ | X) = σ² for all i. When this is violated (heteroscedasticity), the OLS estimators remain unbiased but lose their efficiency, and your standard errors become unreliable.

Assumption 4: No autocorrelation. The error terms are uncorrelated with each other: Cov(εᵢ, εⱼ | X) = 0 for i ≠ j. Autocorrelation biases your standard errors and makes the estimators inefficient, even though the coefficient estimates themselves stay unbiased.

Assumption 5: No perfect multicollinearity. No independent variable is an exact linear combination of the others. If one variable is perfectly determined by the others, the matrix X'X becomes singular and (X'X)⁻¹ doesn't exist, so you literally cannot compute the OLS estimates.

Note that Assumptions 1 and 2 are needed for unbiasedness. Assumptions 3 and 4 are additionally needed for efficiency (the "best" part of BLUE). Assumption 5 is needed for the estimator to exist at all.

Consequences of Violating Gauss-Markov Assumptions

Different violations cause different problems, and it's worth keeping them straight:

  • Omitted variables or wrong functional form (violates Assumptions 1–2): Produces biased and inconsistent estimates. This is the most damaging type of violation because the coefficients themselves are wrong on average.
  • Heteroscedasticity (violates Assumption 3): OLS estimates remain unbiased, but they're no longer efficient. Standard errors computed the usual way are incorrect, which distorts hypothesis tests and confidence intervals.
  • Autocorrelation (violates Assumption 4): Same story as heteroscedasticity: unbiased but inefficient estimates, with biased standard errors leading to unreliable inference.
  • Perfect multicollinearity (violates Assumption 5): Estimation fails entirely. Near-perfect multicollinearity doesn't prevent estimation, but it inflates variances dramatically, making individual coefficient estimates unstable.
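The last bullet can be demonstrated directly: build a design matrix whose third column is exactly twice the second, and X'X loses full rank. A sketch with illustrative data:

```python
import numpy as np

# Perfect multicollinearity in action: the third regressor is an exact
# linear combination of the second (x2 = 2*x1), so X'X is singular and
# the OLS formula (X'X)^{-1} X'y breaks down. Data are illustrative.
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
X = np.column_stack([np.ones(50), x1, 2 * x1])   # third column = 2 * second

XtX = X.T @ X
rank = np.linalg.matrix_rank(XtX)
print(rank)   # 2, not 3: rank-deficient, so (X'X)^{-1} does not exist
```

Note that np.linalg.lstsq would still return *a* solution here (the minimum-norm one, via the pseudoinverse), but the individual coefficients on x1 and 2·x1 are not separately identified.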

Unbiasedness of OLS Estimators


Definition and Importance of Unbiasedness

An estimator is unbiased if its expected value equals the true population parameter: E[β̂] = β. In practical terms, this means that if you could repeatedly draw new samples and compute β̂ each time, the average of all those estimates would converge to the true β. It doesn't mean any single estimate is exactly right, but there's no systematic tendency to overshoot or undershoot.

Unbiasedness matters because it's a minimum standard for trusting your estimates. If an estimator is biased, every inference you build on top of it (predictions, hypothesis tests, confidence intervals) inherits that systematic error.

Proving the Unbiasedness of OLS Estimators

The proof is short and worth understanding step by step, since it shows exactly where the Gauss-Markov assumptions enter.

  1. Start with the linear model in matrix form: y = Xβ + ε

  2. Write the OLS estimator: β̂ = (X'X)⁻¹X'y

  3. Substitute the model expression for y: β̂ = (X'X)⁻¹X'(Xβ + ε)

  4. Distribute and simplify. Since (X'X)⁻¹X'X = I: β̂ = β + (X'X)⁻¹X'ε

  5. Take the conditional expectation given X: E[β̂ | X] = β + (X'X)⁻¹X' E[ε | X]

  6. Apply Assumption 2 (E[ε | X] = 0): E[β̂ | X] = β

The key move is step 6. The entire proof hinges on the zero conditional mean assumption. If E[ε | X] ≠ 0 (for instance, because of omitted variable bias), the extra term doesn't vanish and β̂ is biased.
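The repeated-sampling thought experiment behind unbiasedness is easy to simulate: hold X fixed, redraw the errors many times, and average the resulting estimates. A sketch (the true β, sample size, and replication count are arbitrary illustrative choices):

```python
import numpy as np

# Simulate unbiasedness: X is held fixed across replications; only the
# errors are redrawn. Averaging the OLS estimates over many replications
# should recover the true beta. All numbers are illustrative.
rng = np.random.default_rng(42)
n, reps = 100, 5000
beta_true = np.array([1.0, -2.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])

estimates = np.empty((reps, 2))
for r in range(reps):
    eps = rng.normal(size=n)                           # satisfies E[eps | X] = 0
    y = X @ beta_true + eps
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)   # OLS: (X'X)^{-1} X'y

print(estimates.mean(axis=0))   # close to [1.0, -2.0]
```

Any single replication misses the true β, but there is no systematic drift in either direction, which is exactly what the proof above guarantees.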

Variance-Covariance Matrix of OLS Estimators


Definition and Interpretation of the Variance-Covariance Matrix

The variance-covariance matrix Var(β̂) tells you how precise your coefficient estimates are and how they relate to each other.

  • Diagonal elements are the variances Var(β̂ⱼ). Smaller variance means a more precise estimate of that coefficient. The square root of a diagonal element gives you the standard error of that coefficient.
  • Off-diagonal elements are the covariances Cov(β̂ⱼ, β̂ₖ). These indicate how estimation errors in one coefficient relate to errors in another. High covariance between two coefficients often signals that the corresponding variables are correlated in the data, making it hard to separate their individual effects.

Derivation of the Variance-Covariance Matrix

Starting from the expression derived in the unbiasedness proof:

  1. Recall that β̂ − β = (X'X)⁻¹X'ε

  2. The variance-covariance matrix is defined as: Var(β̂ | X) = E[(β̂ − β)(β̂ − β)' | X]

  3. Substitute the expression from step 1: Var(β̂ | X) = (X'X)⁻¹X' E[εε' | X] X(X'X)⁻¹

  4. Apply Assumptions 3 and 4 (homoscedasticity and no autocorrelation). Together these give E[εε' | X] = σ²I: Var(β̂ | X) = σ²(X'X)⁻¹X'X(X'X)⁻¹

  5. Simplify, since X'X(X'X)⁻¹ = I: Var(β̂ | X) = σ²(X'X)⁻¹

This result has a clean interpretation. The precision of your estimates depends on two things: the noise level in the data (σ²) and the information content of your regressors (through (X'X)⁻¹). More variation in X makes X'X "larger," which makes its inverse smaller, reducing the variance of β̂. More noise (larger σ²) inflates the variance.

If homoscedasticity or no-autocorrelation fails, E[εε' | X] ≠ σ²I, and the simple formula σ²(X'X)⁻¹ no longer holds. In that case, using it anyway gives you wrong standard errors.
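The formula σ²(X'X)⁻¹ can be checked against simulation: compute it once from a fixed X, then compare it with the empirical covariance of OLS estimates across many error draws. A sketch with illustrative numbers:

```python
import numpy as np

# Check Var(beta_hat | X) = sigma^2 (X'X)^{-1} by simulation: compare the
# theoretical matrix with the empirical covariance of OLS estimates over
# many redraws of the errors. n, sigma, and beta are illustrative.
rng = np.random.default_rng(7)
n, reps, sigma = 200, 20000, 1.5
beta = np.array([0.5, 2.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])

theoretical = sigma**2 * np.linalg.inv(X.T @ X)

est = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    est[r] = np.linalg.solve(X.T @ X, X.T @ y)

empirical = np.cov(est, rowvar=False)
print(np.round(theoretical, 5))
print(np.round(empirical, 5))   # the two matrices nearly match
```

Doubling σ quadruples every entry of the theoretical matrix, while doubling the spread of the non-constant column of X shrinks the slope variance, matching the interpretation above.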

Efficiency of OLS Estimators

Definition and Importance of Efficiency

An estimator is efficient within a class if it has the smallest variance among all estimators in that class. For OLS, the relevant class is all linear unbiased estimators. An efficient estimator squeezes the most information out of your data, giving you the tightest possible standard errors and therefore the most powerful hypothesis tests and narrowest confidence intervals.

Gauss-Markov Theorem and the Efficiency of OLS Estimators

The Gauss-Markov theorem is the central result of this section. It states:

Under Assumptions 1–5, the OLS estimator β̂ is the Best Linear Unbiased Estimator (BLUE) of β.

"Best" means minimum variance. "Linear" means the estimator is a linear function of y. So among every possible estimator that is both linear in y and unbiased, OLS has the smallest variance.

The proof works by considering any alternative linear unbiased estimator β̃ = Cy (where C is some matrix) and showing that Var(β̃) − Var(β̂) is a positive semi-definite matrix. This means the variance of β̃ is at least as large as that of β̂ in every direction.

The efficiency result depends critically on Assumptions 3 and 4. If either homoscedasticity or no-autocorrelation fails:

  • OLS is still unbiased (as long as Assumptions 1–2 hold), but it's no longer the most efficient linear unbiased estimator.
  • Generalized least squares (GLS) can produce more efficient estimates by accounting for the actual structure of Var(ε).
  • Alternatively, you can stick with OLS but use robust (heteroscedasticity-consistent) standard errors to fix the inference problem without changing the point estimates.
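The robust-standard-error route in the last bullet can be sketched by hand: keep the OLS point estimates, but replace σ²(X'X)⁻¹ with the White/HC0 "sandwich" estimator (X'X)⁻¹X'diag(e²)X(X'X)⁻¹, where e is the residual vector. A sketch on simulated heteroscedastic data (all numbers illustrative):

```python
import numpy as np

# Heteroscedasticity-consistent (White/HC0) standard errors computed by
# hand. The point estimates are plain OLS; only the variance estimate
# changes. The data-generating numbers are illustrative.
rng = np.random.default_rng(3)
n = 500
x = rng.uniform(0, 2, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 3.0 * x + rng.normal(scale=0.5 + x, size=n)   # error variance grows with x

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Classical SEs assume sigma^2 I; the sandwich allows heteroscedasticity.
sigma2 = resid @ resid / (n - 2)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))
meat = X.T @ (X * resid[:, None] ** 2)                  # X' diag(e^2) X
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
print(se_classical, se_robust)
```

The gap between the two sets of standard errors is the inference error you would make by using σ²(X'X)⁻¹ when homoscedasticity fails; note the coefficient estimates themselves are untouched.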

The Gauss-Markov theorem also has a limitation worth noting: it only compares OLS to other linear unbiased estimators. There may exist nonlinear estimators with even smaller variance. To rule those out, you'd need the stronger assumption that ε is normally distributed, which gives you the Cramér-Rao lower bound result.