Likelihood Function for GLMs
Maximum Likelihood Estimation (MLE) finds the parameter values that make your observed data most probable under the model you've specified. In ordinary linear regression, you can solve for coefficients directly with closed-form equations. GLMs don't have that luxury for most distributions, so MLE combined with iterative algorithms becomes the standard fitting approach.
Formulation and General Form
The likelihood function for a GLM is built from one core assumption: observations are independent. Because of that independence, the joint probability of all your data is just the product of each individual observation's probability:

$$L(\beta, \phi; y) = \prod_{i=1}^{n} f(y_i; \theta_i, \phi)$$

where:
- $f(y_i; \theta_i, \phi)$ is the probability density (continuous) or mass (discrete) function for observation $i$
- $\theta_i$ is the natural (canonical) parameter for observation $i$, which encodes information about the mean
- $\phi$ is the dispersion parameter, which controls how spread out the distribution is

The specific form of $f$ depends on which exponential family distribution you've chosen for the response (Gaussian, Bernoulli, Poisson, Gamma, etc.).
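To make the product concrete, here is a minimal sketch for a Poisson response. The counts and fitted means are made-up illustrative numbers, not values from the text:

```python
import numpy as np
from scipy import stats

# Illustrative counts and hypothetical fitted Poisson means (made-up numbers)
y = np.array([2, 0, 3, 1])
mu = np.array([1.5, 0.8, 2.2, 1.1])

# Independence lets us multiply the per-observation probabilities
likelihood = np.prod(stats.poisson.pmf(y, mu))
```

With only four observations the product is already a small number, which hints at why the log scale is used in practice.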
Relationship between Parameters and Predictors
The natural parameter $\theta_i$ connects to your predictors through two steps. First, the link function $g$ maps the mean response to the linear predictor:

$$g(\mu_i) = \eta_i = x_i^\top \beta$$

- $\mu_i = E[y_i]$ is the mean of the response for observation $i$
- $x_i$ is the vector of predictor values for observation $i$
- $\beta$ is the vector of regression coefficients you're estimating

Second, the relationship between $\theta_i$ and $\mu_i$ is determined by the specific exponential family distribution. When you use the canonical link, $\theta_i = \eta_i$ directly, which simplifies the math considerably.
The dispersion parameter $\phi$ is assumed constant across all observations. For some distributions (Bernoulli, Poisson), $\phi$ is fixed at 1 by definition. For others (Gaussian, Gamma), it must be estimated.
Log-Likelihood Function for GLMs
Derivation and Decomposition
Working with products is algebraically painful, so you take the natural log to convert the product into a sum:

$$\ell(\beta, \phi; y) = \log L(\beta, \phi; y) = \sum_{i=1}^{n} \log f(y_i; \theta_i, \phi)$$

This transformation doesn't change where the maximum occurs (the log is monotonically increasing), but it makes differentiation and computation far more tractable.
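A quick numerical check of the equivalence, reusing the same illustrative Poisson numbers as above:

```python
import numpy as np
from scipy import stats

# Same made-up counts and fitted means as before
y = np.array([2, 0, 3, 1])
mu = np.array([1.5, 0.8, 2.2, 1.1])

# The log of the product equals the sum of the logs,
# so both forms share the same maximizer
log_lik = stats.poisson.logpmf(y, mu).sum()
```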
Because GLMs use exponential family distributions, the log-likelihood has a standard decomposition:

$$\ell(\beta, \phi; y) = \sum_{i=1}^{n} \left[ \frac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi) \right]$$

Each piece has a specific role:
- $b(\theta)$ is the cumulant function. It determines the mean and variance of the distribution. Specifically, $E[y_i] = b'(\theta_i)$ and $\operatorname{Var}(y_i) = b''(\theta_i)\, a_i(\phi)$.
- $a_i(\phi)$ is typically $\phi / w_i$, where $w_i$ is a known prior weight. For many standard cases, $a_i(\phi) = \phi$.
- $c(y_i, \phi)$ is a normalizing term that depends on the data and dispersion but not on $\theta_i$. It drops out when you differentiate with respect to $\beta$, so it doesn't affect the MLE of the regression coefficients.
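For the Poisson case the decomposition can be verified numerically. This sketch assumes the canonical parameterization $\theta = \log \mu$, with $b(\theta) = e^\theta$, $a(\phi) = 1$, and $c(y, \phi) = -\log(y!)$:

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

y, mu = 3.0, 2.2          # one illustrative observation and fitted mean
theta = np.log(mu)        # canonical parameter for Poisson
b = np.exp(theta)         # cumulant function b(theta) = e^theta
c = -gammaln(y + 1)       # c(y, phi) = -log(y!); phi = 1 so a(phi) = 1
ll = y * theta - b + c    # exponential-family decomposition
```

The decomposed form reproduces the log-PMF exactly, which is the point of the decomposition: the distribution-specific details live entirely in $b$, $a$, and $c$.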

Distribution-Specific Functions
Different distributions plug different functions into this framework. For example:

| Distribution | $b(\theta)$ | $\mu = b'(\theta)$ | $\phi$ |
|---|---|---|---|
| Gaussian | $\theta^2 / 2$ | $\theta$ | $\sigma^2$ |
| Bernoulli | $\log(1 + e^\theta)$ | $e^\theta / (1 + e^\theta)$ | 1 |
| Poisson | $e^\theta$ | $e^\theta$ | 1 |

Notice that for Bernoulli and Poisson, $\phi = 1$, meaning there's no separate dispersion parameter to estimate.
The score function is the gradient of the log-likelihood with respect to $\beta$:

$$U(\beta) = \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{n} \frac{(y_i - \mu_i)}{a_i(\phi)\, V(\mu_i)} \frac{\partial \mu_i}{\partial \eta_i}\, x_i$$

Setting this equal to zero gives the equations you need to solve for the MLE.
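As a tiny illustration: for an intercept-only Poisson model with the canonical log link, the score collapses to $\sum_i (y_i - \mu_i)$, which vanishes when the fitted mean equals the sample mean (illustrative data below):

```python
import numpy as np

y = np.array([2, 0, 3, 1])

# Intercept-only Poisson with canonical log link: mu_i = exp(beta0) for all i.
# Score in beta0 is sum(y_i - mu_i), which is zero when mu = mean(y).
beta0_hat = np.log(y.mean())
score = np.sum(y - np.exp(beta0_hat))
```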
Maximum Likelihood Estimation for GLMs
Estimation Process
The MLE of $\beta$ satisfies:

$$\sum_{i=1}^{n} \frac{(y_i - \mu_i)}{a_i(\phi)\, V(\mu_i)} \frac{\partial \mu_i}{\partial \eta_i}\, x_i = 0$$

where $V(\mu)$ is the variance function. This system of equations is nonlinear in $\beta$ for most GLMs (the Gaussian with identity link being the notable exception), so you can't solve it in closed form.
Iterative Optimization and Convergence
Two closely related algorithms handle this:
- Newton-Raphson: Uses the observed second derivatives (the Hessian) to update parameter estimates at each step.
- Fisher scoring: Replaces the observed Hessian with its expected value (the Fisher information matrix). This is the more common choice for GLMs because the expected information has a cleaner form.
Fisher scoring is equivalent to Iteratively Reweighted Least Squares (IRLS), which recasts each iteration as a weighted least squares problem. The update at iteration $t$ is:

$$\beta^{(t+1)} = \left( X^\top W^{(t)} X \right)^{-1} X^\top W^{(t)} z^{(t)}$$

where $W^{(t)}$ is a diagonal weight matrix with entries $w_i = \left[ V(\mu_i)\, g'(\mu_i)^2 \right]^{-1}$ and $z^{(t)}$ is a vector of "working responses" with entries $z_i = \eta_i + (y_i - \mu_i)\, g'(\mu_i)$, both computed from the current parameter estimates.
The algorithm proceeds as follows:
1. Choose starting values for $\beta$ (often from a simpler model or set to zero)
2. Compute $\eta_i$, $\mu_i$, the weights $W$, and working responses $z$ from the current $\beta$
3. Solve the weighted least squares problem to get an updated $\beta$
4. Check convergence: has the change in $\beta$ or in the log-likelihood dropped below a specified tolerance?
5. If not converged, return to step 2
For well-specified models with canonical links, Fisher scoring typically converges quickly (often in under 10 iterations). Non-canonical links or sparse data can slow convergence or cause numerical issues.
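The steps above can be sketched as a minimal IRLS loop for a Poisson GLM with canonical log link. This is a teaching sketch, not production code (no step-halving or rank checks), and the name `irls_poisson` is chosen here for illustration:

```python
import numpy as np

def irls_poisson(X, y, tol=1e-8, max_iter=25):
    """Fisher scoring / IRLS sketch for a Poisson GLM, canonical log link."""
    beta = np.zeros(X.shape[1])                    # step 1: start at zero
    for _ in range(max_iter):
        eta = X @ beta                             # step 2: linear predictor
        mu = np.exp(eta)                           # inverse of the log link
        w = mu                                     # 1/(V(mu) g'(mu)^2) = mu here
        z = eta + (y - mu) / mu                    # working responses
        XtW = X.T * w                              # step 3: weighted least squares
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:  # step 4: convergence check
            return beta_new
        beta = beta_new                            # step 5: iterate again
    return beta

# Intercept-only model: the IRLS fixed point must satisfy exp(beta0) = mean(y)
X = np.ones((4, 1))
y = np.array([2.0, 0.0, 3.0, 1.0])
beta_hat = irls_poisson(X, y)
```

Note how the canonical link makes the weights collapse to $\mu_i$ itself, one example of the simplification mentioned earlier.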
If $\phi$ is unknown (as in Gaussian or Gamma GLMs), it's estimated after $\hat{\beta}$ converges, often using the Pearson statistic:

$$\hat{\phi} = \frac{1}{n - p} \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)}$$

where $p$ is the number of estimated regression coefficients.
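A sketch of the Pearson estimate for a Gamma-style model, using made-up observations and fitted means (the Gamma variance function is $V(\mu) = \mu^2$):

```python
import numpy as np

# Illustrative data: hypothetical observations and fitted means, not real output
y = np.array([1.2, 0.7, 2.5, 1.9, 0.4])
mu_hat = np.array([1.0, 0.9, 2.0, 1.8, 0.6])
p = 2                                # assumed number of estimated coefficients
V = mu_hat ** 2                      # Gamma variance function V(mu) = mu^2
phi_hat = np.sum((y - mu_hat) ** 2 / V) / (len(y) - p)
```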

Standard Errors and the Information Matrix
Once you have the MLE $\hat{\beta}$, you need standard errors to do inference. These come from the Fisher information matrix:

$$\mathcal{I}(\beta) = \frac{1}{\phi}\, X^\top W X$$

The covariance matrix of $\hat{\beta}$ is estimated as:

$$\widehat{\operatorname{Cov}}(\hat{\beta}) = \mathcal{I}(\hat{\beta})^{-1} = \hat{\phi} \left( X^\top \hat{W} X \right)^{-1}$$

The standard error of any individual coefficient $\hat{\beta}_j$ is the square root of the $j$th diagonal element of this matrix. Larger information (more data, less variance) means smaller standard errors.
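A sketch of this computation for a hypothetical Poisson fit (the design matrix and coefficients are invented for illustration; $\phi = 1$ for Poisson):

```python
import numpy as np

# Hypothetical Poisson fit: standard errors from the inverse information matrix
X = np.column_stack([np.ones(6), np.arange(6.0)])
beta_hat = np.array([0.1, 0.2])                 # assumed fitted coefficients
mu_hat = np.exp(X @ beta_hat)                   # fitted means (log link)
W = mu_hat                                      # Poisson/log-link weights
cov = np.linalg.inv(X.T @ (X * W[:, None]))     # phi * (X^T W X)^{-1}, phi = 1
se = np.sqrt(np.diag(cov))                      # per-coefficient standard errors
```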
Interpreting GLM Coefficients
Interpretation and Link Functions
Each estimated coefficient $\hat{\beta}_j$ represents the change in the linear predictor $\eta$ for a one-unit increase in predictor $x_j$, holding all other predictors constant. But because the link function sits between $\eta$ and $\mu$, interpretation on the original response scale depends on which link you're using:
- Identity link (Gaussian): $\beta_j$ is the change in the mean response itself. This is the familiar OLS interpretation.
- Log link (Poisson, Gamma): $\beta_j$ is the change in $\log(\mu)$. Exponentiating gives $e^{\beta_j}$, which is a multiplicative factor on the mean. For example, $\beta_j = 0.3$ means a one-unit increase in $x_j$ multiplies the expected count by $e^{0.3} \approx 1.35$, a 35% increase.
- Logit link (Bernoulli/Binomial): $\beta_j$ is the change in the log-odds. Exponentiating gives $e^{\beta_j}$, the odds ratio. For example, $\beta_j = -0.5$ corresponds to an odds ratio of $e^{-0.5} \approx 0.61$, meaning the odds decrease by about 39%.
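The two worked examples are a one-liner each to verify:

```python
import numpy as np

# Log link: beta_j = 0.3 scales the expected count by e^0.3, a ~35% increase
rate_ratio = np.exp(0.3)
# Logit link: beta_j = -0.5 gives an odds ratio of e^-0.5, a ~39% decrease
odds_ratio = np.exp(-0.5)
```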
Hypothesis Tests and Significance
Two main tests assess whether a coefficient differs from zero:
- Wald test: Computes $z = \hat{\beta}_j / \operatorname{SE}(\hat{\beta}_j)$. Under $H_0\!: \beta_j = 0$, this follows approximately a standard normal distribution for large samples. It's quick to compute since it only requires the fitted model.
- Likelihood ratio test (LRT): Compares the log-likelihood of the full model to a reduced model without predictor $x_j$. The test statistic is $2(\ell_{\text{full}} - \ell_{\text{reduced}})$, which follows a $\chi^2$ distribution with degrees of freedom equal to the number of parameters dropped. The LRT is generally more reliable than the Wald test, especially when estimates are far from zero or sample sizes are moderate.
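Both tests are a few lines once the models are fitted. The estimate, standard error, and log-likelihoods below are hypothetical numbers chosen for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical fitted values, chosen for illustration only
beta_hat, se = 0.42, 0.15
z = beta_hat / se                        # Wald statistic
p_wald = 2 * stats.norm.sf(abs(z))       # two-sided p-value vs. standard normal

ll_full, ll_reduced = -103.2, -107.9     # hypothetical log-likelihoods
lrt = 2 * (ll_full - ll_reduced)         # LRT statistic
p_lrt = stats.chi2.sf(lrt, df=1)         # one parameter dropped
```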
Confidence Intervals and Exponentiated Coefficients
A Wald-based $100(1-\alpha)\%$ confidence interval for $\beta_j$ is:

$$\hat{\beta}_j \pm z_{1-\alpha/2}\, \operatorname{SE}(\hat{\beta}_j)$$

For log or logit links, you can exponentiate the endpoints to get a confidence interval on the multiplicative or odds ratio scale. Profile likelihood confidence intervals are an alternative that tends to perform better in small samples or when the likelihood surface is asymmetric.
The exponentiated coefficients ($e^{\hat{\beta}_j}$) are often the quantities you'll report in practice. For logistic regression, these are odds ratios. For Poisson or log-linear models, they're rate ratios (or incidence rate ratios). These are more interpretable than raw log-scale coefficients because they describe relative changes rather than changes on a transformed scale.
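Putting the interval and the exponentiation together, with a hypothetical logistic coefficient and standard error:

```python
import numpy as np
from scipy import stats

# Hypothetical logistic coefficient and standard error (illustration only)
beta_hat, se = 0.3, 0.12
zcrit = stats.norm.ppf(0.975)                  # ~1.96 for a 95% interval
lo, hi = beta_hat - zcrit * se, beta_hat + zcrit * se
or_lo, or_hi = np.exp(lo), np.exp(hi)          # endpoints on the odds-ratio scale
```

Exponentiating the endpoints works because $e^x$ is monotone, so the interval's coverage is preserved on the odds-ratio scale.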