Fiveable

🥖Linear Modeling Theory Unit 15 Review

15.4 Model Selection in the Presence of Overdispersion

Written by the Fiveable Content Team • Last updated August 2025

Model Selection for Overdispersed Data

Overdispersion complicates model selection because the standard tools we rely on (like AIC or likelihood ratio tests) assume the variance structure is correctly specified. When that assumption breaks down, those tools can steer you toward overly complex models or give you false confidence in a poor fit. This section covers how to choose among competing models when overdispersion is present, using criteria and diagnostics designed for that situation.

Overdispersion and Its Consequences

Overdispersion occurs when the observed variance in your data exceeds what the model assumes. In a standard Poisson regression, for example, the model assumes the variance equals the mean. Real count data frequently violate this assumption.

Common sources of overdispersion:

  • Unobserved heterogeneity: important predictors are missing from the model, so unexplained variation inflates the residual variance
  • Clustering: observations within groups are correlated, but the model treats them as independent
  • Excess zeros: more zero counts than the Poisson distribution can accommodate

The practical damage is serious. Standard errors get underestimated, which makes test statistics too large and p-values too small. You end up declaring effects "significant" that may not be real. Model selection criteria that depend on the likelihood (AIC, BIC) can also be distorted, since the likelihood itself is misspecified.
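
A quick way to confirm the problem is to compare the Pearson chi-square statistic to its residual degrees of freedom. A minimal sketch in plain Python, using a hypothetical overdispersed count sample and an intercept-only Poisson fit (whose MLE is just the sample mean):

```python
import statistics

def dispersion_ratio(y, mu, n_params):
    """Pearson chi-square divided by residual degrees of freedom.

    Under a correctly specified Poisson model this ratio should be
    near 1; values well above 1 signal overdispersion.
    """
    pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return pearson / (len(y) - n_params)

# Hypothetical counts whose variance far exceeds their mean.
y = [0, 0, 1, 0, 2, 9, 0, 1, 12, 0, 3, 0]
mu = [statistics.mean(y)] * len(y)    # intercept-only Poisson MLE
ratio = dispersion_ratio(y, mu, n_params=1)
```

For this sample the ratio comes out near 6.8, far above 1, so a plain Poisson model is untenable.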

Criteria for Model Selection

The goal of model selection is always to balance fit against complexity. With overdispersed data, you need criteria that account for the extra variability the standard model misses.

Standard criteria and their limitations:

  • AIC ($AIC = -2\ell + 2p$) and BIC ($BIC = -2\ell + p \ln n$) both rely on the log-likelihood $\ell$. When the variance function is wrong, the log-likelihood values are unreliable, so AIC and BIC comparisons across models with different variance assumptions can be misleading.

Quasi-likelihood-based alternative:

  • QAIC adjusts AIC for overdispersion: $QAIC = \frac{-2\ell}{\hat{c}} + 2p$, where $\hat{c}$ is the estimated dispersion parameter (often the Pearson chi-square statistic divided by the residual degrees of freedom) and $p$ is the number of parameters. Use QAIC when comparing models fit with the same quasi-likelihood family (e.g., several quasi-Poisson models with different predictors). You cannot use QAIC to compare across fundamentally different model families (e.g., quasi-Poisson vs. negative binomial), because the likelihoods aren't on the same scale.
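
Given a fitted model's log-likelihood and an estimate of the dispersion, QAIC is simple to compute. A small sketch with hypothetical log-likelihoods from two nested quasi-Poisson fits; following common practice, the dispersion is estimated once (from the richest candidate) and reused for every model in the comparison:

```python
def qaic(loglik, c_hat, n_params):
    """QAIC = -2*loglik / c_hat + 2*n_params (lower is better)."""
    return -2.0 * loglik / c_hat + 2.0 * n_params

c_hat = 3.2   # hypothetical dispersion from the richest candidate model
full = qaic(loglik=-412.7, c_hat=c_hat, n_params=5)
reduced = qaic(loglik=-415.1, c_hat=c_hat, n_params=3)
```

Here the reduced model wins despite its lower log-likelihood: once the improvement is scaled down by $\hat{c}$, it is too small to justify two extra parameters.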

When full likelihoods are available:

  • If you fit models that explicitly specify the variance structure (negative binomial, zero-inflated Poisson, hurdle models), each has a proper likelihood. You can compare these models using standard AIC or BIC directly, since the likelihood already reflects the overdispersion mechanism.
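
To illustrate, here is a minimal pure-Python comparison of an intercept-only Poisson fit against an intercept-only negative binomial fit (NB2, $\text{Var}(Y) = \mu + \mu^2/\theta$) on a hypothetical overdispersed sample. The mean MLE is the sample mean in both cases; $\theta$ is found by a crude grid search rather than a proper optimizer:

```python
from math import lgamma, log

def poisson_ll(y, mu):
    """Poisson log-likelihood with a common mean mu."""
    return sum(yi * log(mu) - mu - lgamma(yi + 1) for yi in y)

def negbin_ll(y, mu, theta):
    """NB2 log-likelihood: Var(Y) = mu + mu^2 / theta."""
    return sum(
        lgamma(yi + theta) - lgamma(theta) - lgamma(yi + 1)
        + theta * log(theta / (theta + mu))
        + yi * log(mu / (theta + mu))
        for yi in y
    )

# Hypothetical overdispersed counts; intercept-only mean MLE = ybar.
y = [0, 0, 1, 0, 2, 9, 0, 1, 12, 0, 3, 0]
mu = sum(y) / len(y)

aic_pois = -2 * poisson_ll(y, mu) + 2 * 1      # p = 1 (intercept)
theta = max((t / 10 for t in range(1, 200)),   # crude grid search
            key=lambda t: negbin_ll(y, mu, t))
aic_nb = -2 * negbin_ll(y, mu, theta) + 2 * 2  # p = 2 (intercept + theta)
```

Because the negative binomial likelihood absorbs the extra variance, its AIC comes out substantially lower here, and the comparison is legitimate since both models have proper likelihoods.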

Bayesian alternatives:

  • DIC (Deviance Information Criterion) and WAIC (Watanabe-Akaike Information Criterion) serve similar roles in a Bayesian framework. WAIC is generally preferred because it's fully Bayesian and more stable with complex models.

Comparing Competing Models

Developing Candidate Models

Don't just fit one model and call it done. Build a set of candidate models that represent plausible data-generating mechanisms:

  1. Start with a baseline. Fit the standard model (e.g., Poisson GLM) to establish a reference point and confirm overdispersion exists. Check whether the residual deviance or Pearson statistic divided by degrees of freedom is substantially greater than 1.

  2. Specify alternatives that handle overdispersion differently. Common choices include:

    • Quasi-Poisson: adjusts standard errors via a dispersion parameter but doesn't change the mean model
    • Negative binomial: adds a parameter for the variance ($\text{Var}(Y) = \mu + \mu^2 / \theta$), producing a proper likelihood
    • Zero-inflated models (ZIP, ZINB): model excess zeros as a separate process
    • Hurdle models: separate the zero/non-zero decision from the positive-count distribution
  3. Vary the predictor sets within each model family if your research question involves variable selection.

  4. Compute the appropriate selection criterion for each fitted model (QAIC for quasi-likelihood models, AIC/BIC for models with full likelihoods).

Selecting the Best Model

Rank models by your chosen criterion, with lower values indicating a better balance of fit and parsimony. But don't treat small differences as meaningful.

  • A difference of less than about 2 in AIC (or QAIC) between two models suggests they fit roughly equally well. In that case, prefer the simpler model or the one with a clearer scientific interpretation.
  • Differences greater than 10 represent strong evidence favoring the lower-scoring model.
  • For values in between, the evidence is moderate, and you should weigh other considerations.
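
These rules of thumb are easy to mechanize when ranking a candidate set. A small sketch (model names and scores are made up):

```python
def rank_models(scores):
    """Sort candidate models by an information criterion (lower is
    better) and attach each model's difference from the best score."""
    best = min(scores.values())
    return [(name, round(score - best, 1))
            for name, score in sorted(scores.items(), key=lambda kv: kv[1])]

# Hypothetical QAIC scores for three candidate models.
ranked = rank_models({"poisson": 281.4,
                      "quasi-poisson": 269.8,
                      "quasi-poisson + x2": 268.9})
```

The two quasi-Poisson fits differ by less than 2, so they are effectively tied (prefer the simpler or more interpretable one); the plain Poisson trails by more than 10 and can be dropped.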

Beyond the numbers, check that the selected model aligns with what you know about the data. A zero-inflated model only makes sense if there's a plausible reason for "structural" zeros (e.g., some subjects could never experience the event). A negative binomial may be more appropriate when the overdispersion is general rather than zero-driven.

Evaluating Model Performance

Goodness-of-Fit Assessment

Selecting the "best" model from your candidate set doesn't guarantee it actually fits well. You still need diagnostics.

  • Residual analysis: Plot Pearson or deviance residuals against fitted values and against each predictor. Look for systematic patterns (curves, fans, clusters) that signal misspecification. For count models, randomized quantile residuals (Dunn-Smyth residuals) are especially useful because they should look approximately standard normal if the model is correct.
  • Overdispersion check on the selected model: Compute the ratio of the Pearson chi-square statistic to residual degrees of freedom. If this ratio is still well above 1 in your chosen model, the overdispersion hasn't been adequately addressed.
  • Formal tests: The Pearson chi-square test and the deviance goodness-of-fit test can flag poor fit, though they have limited power with sparse data.
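
Randomized quantile residuals are straightforward to compute for a Poisson fit using only the standard library. A sketch with a hypothetical overdispersed sample and a constant fitted mean:

```python
import random
from math import exp, factorial
from statistics import NormalDist

def poisson_cdf(k, mu):
    """P(Y <= k) for Poisson(mu); k = -1 returns 0 by convention."""
    return sum(mu ** i * exp(-mu) / factorial(i) for i in range(k + 1))

def quantile_residuals(y, mu, rng=None):
    """Randomized quantile (Dunn-Smyth) residuals for a Poisson fit.

    Each residual is Phi^{-1}(u) with u drawn uniformly between
    F(y_i - 1) and F(y_i); if the model is correct, the residuals
    are approximately standard normal.
    """
    rng = rng or random.Random(0)
    norm = NormalDist()
    res = []
    for yi, mi in zip(y, mu):
        u = rng.uniform(poisson_cdf(yi - 1, mi), poisson_cdf(yi, mi))
        u = min(max(u, 1e-10), 1 - 1e-10)   # keep inv_cdf in range
        res.append(norm.inv_cdf(u))
    return res

# Hypothetical overdispersed counts against a constant fitted mean.
y = [0, 0, 1, 0, 2, 9, 0, 1, 12, 0, 3, 0]
mu = [sum(y) / len(y)] * len(y)
res = quantile_residuals(y, mu)
```

Plotted against fitted values or checked with a normal Q-Q plot, residuals far outside roughly (-2, 2) for the large counts point to variance the Poisson fit cannot accommodate.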

Predictive Performance Evaluation

Diagnostics based on the training data can be overly optimistic. Predictive checks give you a more honest picture.

  • Rootograms compare observed and expected frequencies across count values. A "hanging" rootogram makes it easy to spot where the model over- or under-predicts specific counts (especially zeros).
  • Cross-validation: k-fold or leave-one-out cross-validation estimates how well the model predicts new data. Compare models on a proper scoring rule (e.g., log-score or ranked probability score) rather than just raw prediction error.
  • Probability integral transform (PIT): If the model is well-calibrated, the PIT values should be approximately uniform. Deviations reveal specific ways the model fails (e.g., underdispersion in the predictive distribution, poor tail behavior).
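
A hanging rootogram can be tabulated directly from the fitted means. This sketch (hypothetical data, Poisson fit) returns, for each count value, the square root of the expected frequency and how far the hanging observed bar misses the zero baseline:

```python
from collections import Counter
from math import exp, factorial, sqrt

def hanging_rootogram(y, mu):
    """For each count value k, return (k, sqrt(expected frequency),
    gap), where gap = sqrt(expected) - sqrt(observed). A strongly
    negative gap at k means the model under-predicts that count."""
    obs = Counter(y)
    rows = []
    for k in range(max(y) + 1):
        expected = sum(mi ** k * exp(-mi) / factorial(k) for mi in mu)
        rows.append((k, sqrt(expected), sqrt(expected) - sqrt(obs.get(k, 0))))
    return rows

# Hypothetical zero-heavy counts against a constant Poisson mean.
y = [0, 0, 1, 0, 2, 9, 0, 1, 12, 0, 3, 0]
mu = [sum(y) / len(y)] * len(y)
rows = hanging_rootogram(y, mu)
```

Here the gap at k = 0 is strongly negative: the fitted Poisson expects roughly one zero but the data contain six, the classic excess-zero signature.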

Pay particular attention to whether the model captures excess zeros and the overall spread of the data, since these are exactly the features overdispersion distorts.

Interpreting Model Selection Results

Coefficient Interpretation

Once you've selected a model, interpret its coefficients carefully.

  • Standard errors from quasi-likelihood models are inflated by $\sqrt{\hat{c}}$ relative to the naive Poisson model. This means confidence intervals will be wider and some previously "significant" effects may no longer be.
  • For negative binomial or zero-inflated models, the coefficients in the count component have the same interpretation as in Poisson regression (log-rate ratios), but the zero-inflation component has its own set of coefficients with a different interpretation (typically on a logit scale, modeling the probability of a structural zero).
  • Always consider the link function. In a log-link model, exponentiating a coefficient gives you a multiplicative effect on the expected count: $e^{\beta}$ is the rate ratio for a one-unit increase in the predictor.
  • Report effect sizes alongside p-values. A statistically significant coefficient with a tiny rate ratio may not matter in practice.
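
Putting the first and third points together, here is a sketch (hypothetical coefficient, naive standard error, and dispersion estimate) that reports the rate ratio with a dispersion-corrected confidence interval:

```python
from math import exp, sqrt

def rate_ratio_ci(beta, se_naive, c_hat, z=1.96):
    """Rate ratio exp(beta) with a 95% CI whose standard error is
    widened by sqrt(c_hat) to account for overdispersion."""
    se = se_naive * sqrt(c_hat)
    return exp(beta), (exp(beta - z * se), exp(beta + z * se))

# Hypothetical log-link coefficient from a quasi-Poisson fit.
rr, (lo, hi) = rate_ratio_ci(beta=0.262, se_naive=0.11, c_hat=3.2)
```

Here the rate ratio is about 1.30 (a 30% increase per unit of the predictor), and the naive z-statistic (0.262 / 0.11 ≈ 2.4) looks significant, but the corrected interval spans 1: the effect does not survive the dispersion adjustment.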

Drawing Valid Conclusions

  • Use confidence intervals rather than relying solely on p-values. With overdispersed data, the corrected intervals give a more honest picture of uncertainty.
  • If you've compared many predictors or many models, be aware of multiple comparisons. The more tests you run, the higher the chance of spurious findings.
  • Acknowledge limitations: unmeasured confounders, sensitivity to the assumed variance structure, and the possibility that none of your candidate models is truly correct.
  • Translate results back to the research question. A rate ratio of 1.3 means a 30% increase in the expected count per unit change in the predictor. State this in substantive terms your audience can act on.
  • Document the model selection process (which candidates you considered, which criterion you used, how large the differences were) so readers can evaluate your choices.