Goodness-of-fit for GLMs
Deviance and residual deviance
Deviance is the primary goodness-of-fit measure for GLMs. It quantifies how far your fitted model falls short of a perfect (saturated) model that fits every data point exactly.
Formally, deviance equals twice the difference between the log-likelihood of the saturated model and the log-likelihood of your fitted model:

$$D = 2\,(\ell_{\text{sat}} - \ell_{\text{model}})$$
A saturated model has one parameter per observation, so it reproduces the data perfectly. The larger the deviance, the worse your model fits relative to that ideal.
Residual deviance is the deviance of your current model. Under a correctly specified model, it approximately follows a chi-squared distribution:

$$D_{\text{res}} \sim \chi^2_{n-p}$$

where $n$ is the number of observations and $p$ is the number of estimated parameters.
A quick rule of thumb: if the residual deviance is roughly equal to its degrees of freedom ($D \approx n - p$), the model fits reasonably well. If the residual deviance is much larger, the model may be missing important structure. This check is most straightforward for Poisson regression; for logistic regression with binary (0/1) responses, individual deviance contributions don't aggregate as neatly, so this rule of thumb is less reliable.
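A minimal sketch of this check, assuming hypothetical count data and an intercept-only Poisson model (whose fitted mean is just the sample mean):

```python
import numpy as np

# Hypothetical count data; the model is an intercept-only Poisson fit,
# so the fitted mean for every observation is just the sample mean.
y = np.array([2, 1, 4, 0, 3, 2, 5, 1, 2, 3], dtype=float)
mu_hat = np.full_like(y, y.mean())
p = 1  # one estimated parameter (the intercept)

# Poisson residual deviance: 2 * sum[y*log(y/mu) - (y - mu)],
# with the convention y*log(y/mu) = 0 when y = 0.
term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu_hat), 0.0)
deviance = 2 * np.sum(term - (y - mu_hat))

df = len(y) - p
print(f"residual deviance = {deviance:.2f} on {df} df "
      f"(ratio = {deviance / df:.2f}; near 1 suggests an adequate fit)")
```

A ratio well above 1 would instead point to overdispersion or missing structure.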
Residuals in GLMs
GLMs use specialized residuals because raw residuals (observed minus predicted) don't behave the same way they do in ordinary linear regression.
- Pearson residuals scale the raw residual by the estimated standard deviation of the observation under the model:

$$r_i^{P} = \frac{y_i - \hat\mu_i}{\sqrt{V(\hat\mu_i)}}$$

where $V(\hat\mu_i)$ is the variance function evaluated at the fitted value. These are the GLM analog of standardized residuals in OLS.
- Deviance residuals measure each observation's contribution to the total deviance. They're calculated as the signed square root of the individual deviance contribution:

$$r_i^{D} = \operatorname{sign}(y_i - \hat\mu_i)\,\sqrt{d_i}$$

where $d_i$ is the $i$th observation's contribution to the overall deviance $D = \sum_i d_i$.
Deviance residuals tend to be more symmetrically distributed than Pearson residuals, especially for count and binary data. This makes them generally more useful for diagnostic plots. In either case, observations with large residuals (roughly beyond $\pm 2$ or $\pm 3$) deserve closer inspection.
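The two residual types above can be computed side by side; here is a sketch using hypothetical fitted values from a Poisson model, where the variance function is $V(\mu) = \mu$:

```python
import numpy as np

# Hypothetical observations and fitted means from a Poisson model (V(mu) = mu).
y = np.array([2.0, 0.0, 5.0, 3.0, 1.0])
mu_hat = np.array([1.8, 0.9, 4.2, 2.6, 1.5])

# Pearson residual: raw residual scaled by sqrt of the variance function.
pearson = (y - mu_hat) / np.sqrt(mu_hat)

# Unit deviance d_i for the Poisson family (y*log(y/mu) = 0 when y = 0).
d_i = 2 * (np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu_hat), 0.0)
           - (y - mu_hat))

# Deviance residual: signed square root of the unit deviance.
deviance_resid = np.sign(y - mu_hat) * np.sqrt(d_i)

print("Pearson :", np.round(pearson, 3))
print("Deviance:", np.round(deviance_resid, 3))
```

Both residuals carry the same sign as the raw residual; they differ only in how the magnitude is scaled.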
Residual analysis for GLMs

Residual plots for model diagnostics
Residual plots are your main visual tool for spotting problems with a fitted GLM. The logic is similar to linear regression diagnostics, but you'll typically plot deviance residuals rather than raw residuals.
Residuals vs. fitted values: Plot deviance (or Pearson) residuals on the y-axis against fitted values (on the link scale or response scale) on the x-axis. You want to see a random scatter around zero with no obvious pattern. Curvature suggests a missing nonlinear term or the wrong link function. A fan or funnel shape (increasing spread) suggests the variance function isn't capturing the true mean-variance relationship.
Q-Q plot: This compares the quantiles of your standardized residuals against theoretical quantiles (typically standard normal). Points should fall roughly along a straight line. Systematic departures, especially in the tails, indicate that the distributional assumption may be wrong or that outliers are present. For binary logistic regression, Q-Q plots of raw residuals are hard to interpret because the residuals are inherently discrete; randomized quantile residuals can help in that setting.
Scale-location plot: This displays $\sqrt{|r_i|}$ (the square root of the absolute standardized residuals) against fitted values. A roughly flat trend line means the variance is stable across fitted values. An upward trend signals that variability increases with the mean, which might mean you need a different family or variance structure.
Identifying influential observations and outliers
Not all observations contribute equally to the fitted model. A few unusual points can shift your coefficient estimates substantially.
- Leverage measures how far an observation's predictor values are from the center of the predictor space. High-leverage points have an outsized role in determining the fitted surface. You can read leverage values from the diagonal of the hat matrix. In a residuals vs. leverage plot, points in the upper-right or lower-right corners are both unusual and influential.
- Cook's distance combines leverage and residual size into a single measure of each observation's overall influence on the fitted coefficients. A common guideline is that Cook's distance values greater than 1 (or greater than $4/n$ as a more sensitive threshold) warrant investigation. It is computed as

$$D_i = \frac{r_i^2}{p}\cdot\frac{h_i}{1-h_i}$$

where $r_i$ is the standardized residual, $h_i$ is the leverage, and $p$ is the number of parameters.
- Outliers show up as observations with large standardized residuals (absolute value above 2 or 3). Before removing any outlier, investigate whether it reflects a data entry error, a measurement problem, or a genuinely unusual case. Removing real data points without justification can bias your results.
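As a sketch of how leverage and Cook's distance flag an influential point, here is a small linear fit with one observation far from the rest in predictor space (in a GLM the hat matrix is built from the IRLS working weights, but the mechanics are the same):

```python
import numpy as np

# Leverage and Cook's distance for a small linear fit. (In a GLM the hat
# matrix uses the IRLS working weights, but the mechanics are the same.)
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])   # last point sits far from the rest
y = np.array([1.1, 1.9, 3.2, 3.9, 12.0])
X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept
p = X.shape[1]

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Diagonal of the hat matrix H = X (X'X)^{-1} X'
h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

s2 = np.sum(resid**2) / (len(y) - p)       # residual variance estimate
r = resid / np.sqrt(s2 * (1 - h))          # internally standardized residuals
cooks = (r**2 / p) * h / (1 - h)           # Cook's distance

print("leverage:", np.round(h, 3))
print("Cook's D:", np.round(cooks, 3))
```

The leverages sum to $p$, and the isolated point at $x = 10$ dominates both diagnostics.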
Inference for GLM parameters

Hypothesis tests for GLM coefficients
Two main testing approaches are used for GLM coefficients: Wald tests and likelihood ratio tests. They answer slightly different questions and have different strengths.
Wald test. This tests whether a single coefficient equals zero. The test statistic is:

$$z = \frac{\hat\beta_j}{\widehat{SE}(\hat\beta_j)}$$

Under the null hypothesis $H_0\!: \beta_j = 0$, this statistic follows a standard normal distribution (or, equivalently, $z^2$ follows a $\chi^2_1$ distribution). Wald tests are computationally cheap because they only require fitting one model. However, they can be unreliable when sample sizes are small or when the true parameter is far from zero, because they rely only on the local curvature of the log-likelihood at the MLE.
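Given a coefficient estimate and its standard error (the values below are hypothetical), the Wald test needs only a couple of lines:

```python
import math

# Hypothetical logistic-regression output: a coefficient and its SE.
beta_hat, se = 0.84, 0.31

z = beta_hat / se                            # Wald z for H0: beta_j = 0
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided standard-normal p

print(f"z = {z:.3f}, two-sided p = {p_value:.4f}")
```

Here `erfc(|z| / sqrt(2))` is exactly $2\,(1 - \Phi(|z|))$, the two-sided normal tail probability.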
Likelihood ratio test (LRT). This compares two nested models: a reduced model (without certain predictors) and a full model (with them). The test statistic is:

$$\Lambda = 2\,(\ell_{\text{full}} - \ell_{\text{reduced}})$$

Under the null hypothesis that the extra parameters are all zero, $\Lambda \sim \chi^2_q$, where $q$ is the difference in the number of parameters between the two models. LRTs require fitting both models but are generally more reliable than Wald tests, especially in smaller samples.
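A worked sketch with made-up data: testing whether two groups share one success probability. Both the full and reduced binomial models have closed-form MLEs, so no iterative fitting is needed:

```python
import math

# Do two groups share one success probability? Reduced model: a single
# pooled probability. Full model: a separate probability per group.
successes, trials = [30, 45], [100, 100]

def binom_loglik(y, n, p):
    # Log-likelihood up to the binomial coefficient, which cancels in the ratio.
    return y * math.log(p) + (n - y) * math.log(1 - p)

# Full model: each group's MLE is y_i / n_i.
ll_full = sum(binom_loglik(y, n, y / n) for y, n in zip(successes, trials))
# Reduced model: pooled MLE.
p_pool = sum(successes) / sum(trials)
ll_reduced = sum(binom_loglik(y, n, p_pool) for y, n in zip(successes, trials))

lam = 2 * (ll_full - ll_reduced)           # chi-squared with q = 1 df under H0
p_value = math.erfc(math.sqrt(lam / 2))    # chi2_1 tail probability
print(f"LRT statistic = {lam:.3f}, p = {p_value:.4f}")
```

The identity $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$ avoids needing a chi-squared table for the one-degree-of-freedom case.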
Confidence intervals for GLM parameters
Confidence intervals give you a range of plausible values for each coefficient, which is often more informative than a simple reject/fail-to-reject decision.
Wald confidence intervals use the asymptotic normality of the MLE:

$$\hat\beta_j \pm z_{1-\alpha/2}\,\widehat{SE}(\hat\beta_j)$$
These are fast to compute and are what most software reports by default. The downside is that they can perform poorly when the sample size is small or when the parameter is near a boundary (e.g., a probability near 0 or 1 in logistic regression).
Profile likelihood confidence intervals are constructed by inverting the likelihood ratio test. Instead of assuming a symmetric, normal-shaped likelihood, they trace the actual shape of the log-likelihood curve and find the set of parameter values not rejected at the desired significance level. This makes them more accurate when the likelihood is asymmetric or when sample sizes are modest. The tradeoff is that they require more computation (refitting the model repeatedly).
For practical work: use Wald intervals as a quick default, but switch to profile likelihood intervals when you have a small sample or when Wald intervals give implausible results (like a confidence interval for a probability that extends below 0).
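A quick sketch of the Wald interval for a hypothetical logistic-regression coefficient, also transformed to the odds-ratio scale:

```python
import math

# Hypothetical logistic-regression coefficient with its standard error.
beta_hat, se = 0.84, 0.31
z_crit = 1.959964                 # 97.5th percentile of the standard normal

lo, hi = beta_hat - z_crit * se, beta_hat + z_crit * se
print(f"95% CI for beta:       ({lo:.3f}, {hi:.3f})")
# Exponentiating gives a CI on the odds-ratio scale (note the asymmetry).
print(f"95% CI for odds ratio: ({math.exp(lo):.3f}, {math.exp(hi):.3f})")
```

The interval is symmetric on the coefficient scale but asymmetric after exponentiating, which is one reason profile likelihood intervals often look different on transformed scales.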
Comparing and selecting GLMs
Information criteria for model selection
When you have several candidate models, you need a principled way to choose among them. Information criteria balance fit against complexity.
Akaike Information Criterion (AIC):

$$\text{AIC} = -2\hat\ell + 2p$$

where $\hat\ell$ is the maximized log-likelihood and $p$ is the number of estimated parameters. Lower AIC means a better balance of fit and simplicity. AIC estimates the expected out-of-sample prediction error (specifically, it approximates the Kullback-Leibler divergence), so it favors models that predict well, not just models that fit the training data well.
Bayesian Information Criterion (BIC):

$$\text{BIC} = -2\hat\ell + p\log n$$

BIC replaces AIC's penalty $2p$ with $p\log n$. Because $\log n > 2$ for any $n \ge 8$ (since $e^2 \approx 7.4$), BIC penalizes complexity more heavily than AIC for most real datasets. This means BIC tends to select simpler models, especially as sample size grows.
AIC and BIC can disagree. AIC is better suited when the goal is prediction; BIC is better suited when you believe the true model is among your candidates and you want to identify it consistently. Both criteria can compare non-nested models, which is a major advantage over likelihood ratio tests.
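A small sketch of such a disagreement, using two hypothetical candidate models (made-up log-likelihoods and parameter counts):

```python
import math

# Two hypothetical candidates: (maximized log-likelihood, number of parameters).
n = 200
models = {"A": (-312.4, 3), "B": (-306.8, 6)}

aic = {k: -2 * ll + 2 * p for k, (ll, p) in models.items()}
bic = {k: -2 * ll + p * math.log(n) for k, (ll, p) in models.items()}

print("AIC picks:", min(aic, key=aic.get))  # rewards the better-fitting model
print("BIC picks:", min(bic, key=bic.get))  # heavier penalty favors simplicity
```

With these numbers AIC prefers the larger model B, while BIC's $p\log n$ penalty tips the choice to the simpler model A.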
Likelihood ratio tests and the principle of parsimony
For nested models (where the simpler model is a special case of the more complex one), the likelihood ratio test provides a formal hypothesis test of whether the extra parameters are needed. The procedure is the same as described in the inference section above: compute $\Lambda = 2\,(\ell_{\text{full}} - \ell_{\text{reduced}})$ and compare it to a $\chi^2_q$ distribution.
The principle of parsimony guides model selection more broadly: prefer the simplest model that adequately explains the data. Overly complex models tend to overfit, capturing noise in the training data rather than the true underlying pattern. This leads to poor performance on new data.
Cross-validation offers a direct way to assess predictive performance:
- Split the data into $K$ roughly equal subsets (folds). Common choices are $K = 5$ or $K = 10$.
- For each fold, fit the model on the remaining $K - 1$ folds and predict the held-out fold.
- Compute a performance metric (e.g., deviance, classification accuracy, or mean squared error) for each fold.
- Average the metric across all $K$ folds.
The model with the best average cross-validated performance is the one most likely to generalize well. Cross-validation is especially useful when you can't rely on asymptotic approximations (small samples) or when comparing models that aren't nested.
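The cross-validation procedure above can be sketched as follows, using simulated data and a simple linear model scored by mean squared error:

```python
import numpy as np

# K-fold cross-validation sketch: a simple linear model scored by MSE.
rng = np.random.default_rng(42)
n, K = 100, 5
x = rng.uniform(0, 10, n)
y = 2.0 + 1.5 * x + rng.normal(0, 1, n)    # simulated data with unit noise

folds = np.array_split(rng.permutation(n), K)
fold_mse = []
for k in range(K):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    # Fit on the K-1 training folds.
    X_train = np.column_stack([np.ones(train_idx.size), x[train_idx]])
    beta, *_ = np.linalg.lstsq(X_train, y[train_idx], rcond=None)
    # Predict and score the held-out fold.
    pred = beta[0] + beta[1] * x[test_idx]
    fold_mse.append(float(np.mean((y[test_idx] - pred) ** 2)))

cv_mse = float(np.mean(fold_mse))
print(f"cross-validated MSE: {cv_mse:.3f}")
```

Because the noise has unit variance, the cross-validated MSE should land near 1 for a well-specified model; comparing this average across candidate models implements the selection rule described above.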