
🥖Linear Modeling Theory Unit 13 Review


13.3 Maximum Likelihood Estimation for GLMs


Written by the Fiveable Content Team • Last updated August 2025

Likelihood Function for GLMs

Maximum Likelihood Estimation (MLE) finds the parameter values that make your observed data most probable under the model you've specified. In ordinary linear regression, you can solve for coefficients directly with closed-form equations. GLMs don't have that luxury for most distributions, so MLE combined with iterative algorithms becomes the standard fitting approach.

Formulation and General Form

The likelihood function for a GLM is built from one core assumption: observations are independent. Because of that independence, the joint probability of all your data is just the product of each individual observation's probability:

$$L(\beta; y) = \prod_{i=1}^{n} f(y_i; \theta_i, \phi)$$

where:

  • $f(y_i; \theta_i, \phi)$ is the probability density (continuous) or mass (discrete) function for observation $i$
  • $\theta_i$ is the natural (canonical) parameter for observation $i$, which encodes information about the mean
  • $\phi$ is the dispersion parameter, which controls how spread out the distribution is

The specific form of $f$ depends on which exponential family distribution you've chosen for the response (Gaussian, Bernoulli, Poisson, Gamma, etc.).
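As a concrete illustration, the sketch below builds the likelihood of four independent Poisson observations as a product of individual probability mass functions. The counts and fitted means are made up for the example.

```python
import math

# Hypothetical data: four independent Poisson counts and their fitted means.
y = [2, 0, 3, 1]
mu = [1.5, 0.8, 2.2, 1.1]

def poisson_pmf(y_i, mu_i):
    """Poisson probability mass function f(y_i; mu_i)."""
    return math.exp(-mu_i) * mu_i**y_i / math.factorial(y_i)

# Independence turns the joint probability into a product of PMFs.
likelihood = 1.0
for y_i, mu_i in zip(y, mu):
    likelihood *= poisson_pmf(y_i, mu_i)
```

Note that the product shrinks rapidly as $n$ grows, which is one practical reason to work on the log scale instead.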

Relationship between Parameters and Predictors

The natural parameter $\theta_i$ connects to your predictors through two steps. First, the link function $g$ maps the mean response $\mu_i$ to the linear predictor:

$$g(\mu_i) = \eta_i = x_i^\top \beta$$

  • $\mu_i$ is the mean of the response for observation $i$
  • $x_i$ is the vector of predictor values for observation $i$
  • $\beta$ is the vector of regression coefficients you're estimating

Second, the relationship between $\mu_i$ and $\theta_i$ is determined by the specific exponential family distribution. When you use the canonical link, $\theta_i = \eta_i$ directly, which simplifies the math considerably.

The dispersion parameter $\phi$ is assumed constant across all observations. For some distributions (Bernoulli, Poisson), $\phi$ is fixed at 1 by definition. For others (Gaussian, Gamma), it must be estimated.
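For the Bernoulli case with its canonical logit link, the two-step mapping is easy to sketch; the linear predictor value below is illustrative.

```python
import math

def logit(mu):
    """Canonical Bernoulli link: maps a mean mu in (0, 1) to eta on the real line."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse link: maps the linear predictor back to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

eta = 0.7            # a made-up linear predictor value x_i^T beta
mu = inv_logit(eta)  # with the canonical link, theta_i equals eta_i
```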

Log-Likelihood Function for GLMs

Derivation and Decomposition

Working with products is algebraically painful, so you take the natural log to convert the product into a sum:

$$\ell(\beta; y) = \ln L(\beta; y) = \sum_{i=1}^{n} \ln f(y_i; \theta_i, \phi)$$

This transformation doesn't change where the maximum occurs (the log is monotonically increasing), but it makes differentiation and computation far more tractable.
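A small numeric check of that claim: for a toy Poisson sample, scanning a grid of candidate means shows the likelihood and log-likelihood peak at the same value (the sample mean).

```python
import math

y = [2, 3, 1]  # toy Poisson sample

def loglik(mu):
    """Poisson log-likelihood of the sample at a common mean mu."""
    return sum(-mu + y_i * math.log(mu) - math.log(math.factorial(y_i))
               for y_i in y)

# Scan candidate means; both criteria pick the same maximizer.
grid = [0.5 + 0.01 * k for k in range(300)]
best_lik = max(grid, key=lambda mu: math.exp(loglik(mu)))  # likelihood
best_loglik = max(grid, key=loglik)                        # log-likelihood
```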

Because GLMs use exponential family distributions, the log-likelihood has a standard decomposition:

$$\ell(\beta; y) = \sum_{i=1}^{n} \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right]$$

Each piece has a specific role:

  • $b(\theta_i)$ is the cumulant function. It determines the mean and variance of the distribution. Specifically, $\mu_i = b'(\theta_i)$ and $\text{Var}(Y_i) = b''(\theta_i) \cdot a(\phi)$.
  • $a(\phi)$ is typically $\phi / w_i$ where $w_i$ is a known prior weight. For many standard cases, $a(\phi) = \phi$.
  • $c(y_i, \phi)$ is a normalizing term that depends on the data and dispersion but not on $\beta$. It drops out when you differentiate with respect to $\beta$, so it doesn't affect the MLE of the regression coefficients.
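These identities are easy to verify numerically. For the Poisson case, $b(\theta) = e^{\theta}$ with $a(\phi) = 1$, so finite differences of $b$ should recover the mean and variance; $\theta = 0.4$ below is an arbitrary test value.

```python
import math

def b(theta):
    """Poisson cumulant function b(theta) = e^theta."""
    return math.exp(theta)

theta, h = 0.4, 1e-5
# Central differences approximate b'(theta) and b''(theta).
b_prime = (b(theta + h) - b(theta - h)) / (2 * h)
b_double_prime = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2
```

For Poisson, both derivatives equal $e^{\theta}$, matching the well-known fact that its mean and variance coincide.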

Distribution-Specific Functions

Different distributions plug different functions into this framework. For example:

| Distribution | $\theta_i$ | $b(\theta_i)$ | $a(\phi)$ |
| --- | --- | --- | --- |
| Gaussian | $\mu_i$ | $\theta_i^2 / 2$ | $\sigma^2$ |
| Bernoulli | $\ln(\mu_i / (1-\mu_i))$ | $\ln(1 + e^{\theta_i})$ | $1$ |
| Poisson | $\ln(\mu_i)$ | $e^{\theta_i}$ | $1$ |

Notice that for Bernoulli and Poisson, $a(\phi) = 1$, meaning there's no separate dispersion parameter to estimate.

The score function is the gradient of the log-likelihood with respect to $\beta$:

$$U(\beta) = \frac{\partial \ell(\beta; y)}{\partial \beta}$$

Setting this equal to zero gives the equations you need to solve for the MLE.

Maximum Likelihood Estimation for GLMs

Estimation Process

The MLE of $\beta$ satisfies:

$$\frac{\partial \ell(\beta; y)}{\partial \beta} = \sum_{i=1}^{n} \frac{(y_i - \mu_i)}{a(\phi) \cdot v(\mu_i)} \cdot \frac{\partial \mu_i}{\partial \eta_i} \cdot x_i = 0$$

where $v(\mu_i) = b''(\theta_i)$ is the variance function. This system of equations is nonlinear in $\beta$ for most GLMs (the Gaussian with identity link being the notable exception), so you can't solve it in closed form.
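One toy case where the score equation does solve in closed form: an intercept-only Poisson model with log link. There $\partial\mu/\partial\eta = \mu$ and $v(\mu) = \mu$ cancel, leaving $\sum_i (y_i - \mu) = 0$, so $\hat{\mu} = \bar{y}$. The counts below are made up.

```python
y = [2, 0, 3, 1, 4]                     # toy counts
mu_hat = sum(y) / len(y)                # closed-form solution: the sample mean
score = sum(y_i - mu_hat for y_i in y)  # score equation evaluated at mu_hat
```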

Iterative Optimization and Convergence

Two closely related algorithms handle this:

  1. Newton-Raphson: Uses the observed second derivatives (the Hessian) to update parameter estimates at each step.
  2. Fisher scoring: Replaces the observed Hessian with its expected value (the Fisher information matrix). This is the more common choice for GLMs because the expected information has a cleaner form.

Fisher scoring is equivalent to Iteratively Reweighted Least Squares (IRLS), which recasts each iteration as a weighted least squares problem. The update at iteration $t+1$ is:

$$\beta^{(t+1)} = (X^\top W^{(t)} X)^{-1} X^\top W^{(t)} z^{(t)}$$

where $W^{(t)}$ is a diagonal weight matrix and $z^{(t)}$ is a vector of "working responses," both computed from the current parameter estimates.

The algorithm proceeds as follows:

  1. Choose starting values for $\beta$ (often from a simpler model or set to zero)
  2. Compute $\eta_i$, $\mu_i$, the weights $W$, and working responses $z$ from current $\beta$
  3. Solve the weighted least squares problem to get updated $\beta$
  4. Check convergence: has the change in $\beta$ or in $\ell(\beta; y)$ dropped below a specified tolerance?
  5. If not converged, return to step 2
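The loop above can be sketched for logistic regression, where the canonical logit link gives weights $\mu_i(1-\mu_i)$. The data, tolerance, and iteration cap below are illustrative.

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=25):
    """Fit a Bernoulli GLM with logit link by IRLS / Fisher scoring."""
    beta = np.zeros(X.shape[1])                 # step 1: start at zero
    for _ in range(max_iter):
        eta = X @ beta                          # step 2: linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))         # inverse logit
        w = mu * (1.0 - mu)                     # IRLS weights
        z = eta + (y - mu) / w                  # working responses
        # step 3: weighted least squares update
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:  # step 4: converged?
            return beta_new
        beta = beta_new                         # step 5: iterate
    return beta

# Made-up data: intercept plus one predictor, with overlapping classes
# so the MLE is finite (perfectly separated data would diverge).
X = np.column_stack([np.ones(6), [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta_hat = irls_logistic(X, y)
```

At convergence the score $X^\top(y - \hat{\mu})$ is numerically zero, which is exactly the MLE condition for the canonical link.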

For well-specified models with canonical links, Fisher scoring typically converges quickly (often in under 10 iterations). Non-canonical links or sparse data can slow convergence or cause numerical issues.

If $\phi$ is unknown (as in Gaussian or Gamma GLMs), it's estimated after $\beta$ converges, often using the Pearson statistic:

$$\hat{\phi} = \frac{1}{n - p} \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{v(\hat{\mu}_i)}$$

where $p$ is the number of estimated regression coefficients.
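A sketch of this estimate for the Gaussian case, where $v(\mu) = 1$ and $\hat{\phi}$ reduces to the usual residual variance; the responses and fitted values are made up.

```python
import numpy as np

y = np.array([2.1, 3.9, 6.2, 7.8])        # observed responses (toy data)
mu_hat = np.array([2.0, 4.0, 6.0, 8.0])   # fitted means from some model
p = 2                                     # number of estimated coefficients
v = np.ones_like(mu_hat)                  # Gaussian variance function v(mu) = 1
phi_hat = np.sum((y - mu_hat) ** 2 / v) / (len(y) - p)
```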


Standard Errors and the Information Matrix

Once you have the MLE $\hat{\beta}$, you need standard errors to do inference. These come from the Fisher information matrix:

$$\mathcal{I}(\beta) = X^\top W X$$

The covariance matrix of $\hat{\beta}$ is estimated as:

$$\widehat{\text{Cov}}(\hat{\beta}) = (X^\top \hat{W} X)^{-1}$$

The standard error of any individual coefficient $\hat{\beta}_j$ is the square root of the $j$th diagonal element of this matrix. Larger information (more data, less variance) means smaller standard errors.
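A sketch of that computation with a made-up design matrix and fitted Bernoulli probabilities; for the logit link the weights are $\hat{\mu}_i(1-\hat{\mu}_i)$.

```python
import numpy as np

X = np.column_stack([np.ones(4), [0.0, 1.0, 2.0, 3.0]])  # toy design matrix
mu_hat = np.array([0.3, 0.45, 0.6, 0.75])                # fitted probabilities
W = np.diag(mu_hat * (1.0 - mu_hat))    # weight matrix for the logit link
cov = np.linalg.inv(X.T @ W @ X)        # estimated covariance of beta-hat
se = np.sqrt(np.diag(cov))              # standard error of each coefficient
```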

Interpreting GLM Coefficients

Each estimated coefficient $\hat{\beta}_j$ represents the change in the linear predictor $\eta$ for a one-unit increase in predictor $x_j$, holding all other predictors constant. But because the link function sits between $\eta$ and $\mu$, interpretation on the original response scale depends on which link you're using:

  • Identity link (Gaussian): $\hat{\beta}_j$ is the change in the mean response itself. This is the familiar OLS interpretation.
  • Log link (Poisson, Gamma): $\hat{\beta}_j$ is the change in $\ln(\mu)$. Exponentiating gives $e^{\hat{\beta}_j}$, which is a multiplicative factor on the mean. For example, $\hat{\beta}_j = 0.3$ means a one-unit increase in $x_j$ multiplies the expected count by $e^{0.3} \approx 1.35$, a 35% increase.
  • Logit link (Bernoulli/Binomial): $\hat{\beta}_j$ is the change in the log-odds. Exponentiating gives $e^{\hat{\beta}_j}$, the odds ratio. For example, $\hat{\beta}_j = -0.5$ corresponds to an odds ratio of $e^{-0.5} \approx 0.61$, meaning the odds decrease by about 39%.
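The two worked numbers above, checked in code:

```python
import math

rate_ratio = math.exp(0.3)             # log link: multiplicative effect on the mean
odds_ratio = math.exp(-0.5)            # logit link: odds ratio
pct_increase = 100 * (rate_ratio - 1)  # about a 35% increase in the expected count
pct_decrease = 100 * (1 - odds_ratio)  # about a 39% decrease in the odds
```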

Hypothesis Tests and Significance

Two main tests assess whether a coefficient differs from zero:

  • Wald test: Computes $z = \hat{\beta}_j / \text{SE}(\hat{\beta}_j)$. Under $H_0: \beta_j = 0$, this follows approximately a standard normal distribution for large samples. It's quick to compute since it only requires the fitted model.
  • Likelihood ratio test (LRT): Compares the log-likelihood of the full model to a reduced model without predictor $j$. The test statistic is $\Lambda = 2[\ell(\hat{\beta}_{\text{full}}) - \ell(\hat{\beta}_{\text{reduced}})]$, which follows a $\chi^2$ distribution with degrees of freedom equal to the number of parameters dropped. The LRT is generally more reliable than the Wald test, especially when estimates are far from zero or sample sizes are moderate.
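Both statistics sketched with illustrative numbers (the coefficient, standard error, and log-likelihoods below are made up):

```python
import math

beta_hat, se = 0.8, 0.25
z = beta_hat / se                            # Wald statistic
p_wald = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value

ll_full, ll_reduced = -142.3, -148.1         # made-up log-likelihoods
lrt = 2.0 * (ll_full - ll_reduced)           # compare to chi^2 with 1 df
```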

Confidence Intervals and Exponentiated Coefficients

A Wald-based confidence interval for $\beta_j$ is:

$$\hat{\beta}_j \pm z_{\alpha/2} \cdot \text{SE}(\hat{\beta}_j)$$

For log or logit links, you can exponentiate the endpoints to get a confidence interval on the multiplicative or odds ratio scale. Profile likelihood confidence intervals are an alternative that tends to perform better in small samples or when the likelihood surface is asymmetric.
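Endpoints for a 95% interval with illustrative estimates, exponentiated onto the ratio scale:

```python
import math

beta_hat, se = 0.8, 0.25
z = 1.96                                   # normal quantile for 95% coverage
lo = beta_hat - z * se                     # lower Wald limit on the log scale
hi = beta_hat + z * se                     # upper Wald limit
or_lo, or_hi = math.exp(lo), math.exp(hi)  # exponentiated (ratio-scale) CI
```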

The exponentiated coefficients ($e^{\hat{\beta}_j}$) are often the quantities you'll report in practice. For logistic regression, these are odds ratios. For Poisson or log-linear models, they're rate ratios (or incidence rate ratios). These are more interpretable than raw log-scale coefficients because they describe relative changes rather than changes on a transformed scale.