🥖Linear Modeling Theory Unit 13 Review

13.1 Exponential Family of Distributions

Written by the Fiveable Content Team • Last updated August 2025

Exponential family of distributions

Definition and properties

The exponential family is a broad class of probability distributions that share a common mathematical form. This shared structure is what makes GLMs possible: because these distributions all follow the same template, you can build a single modeling framework that handles normal, binomial, Poisson, and many other response types.

A distribution belongs to the exponential family if its density (or mass) function can be written as:

$$f(x; \theta) = h(x) \exp\bigl(\eta(\theta)\, T(x) - A(\theta)\bigr)$$

Each piece of this formula has a specific role:

  • $\eta(\theta)$ is the natural (canonical) parameter, a reparameterization of the original distribution parameter(s) that puts the density into this standard form.
  • $T(x)$ is the sufficient statistic, a function of the data that captures all the information the data contain about $\theta$.
  • $h(x)$ is the base measure, a term that depends only on the data and acts as a normalizing weight.
  • $A(\theta)$ is the log-partition function (also called the cumulant function). It ensures the density integrates (or sums) to 1, and it turns out to be the key to deriving moments.

Two properties worth highlighting:

  • Sufficiency. Because $T(x)$ is sufficient, you don't lose any information about $\theta$ by reducing your entire dataset to $T(x)$. This connects directly to the factorization theorem: a statistic is sufficient if and only if the joint density factors into one part that depends on $\theta$ and on the data only through that statistic, and another part that depends only on the data.
  • Moments from derivatives. The mean and variance of $T(X)$ can be read off from derivatives of $A(\theta)$, with no integration required. (More on this below.)

The family includes both discrete and continuous distributions. Which specific distribution you get depends on the choices of $\eta$, $T$, $h$, and $A$.
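
To make the template concrete, here is a minimal Python check, using the Poisson components listed later in this guide ($h(x) = 1/x!$, $T(x) = x$, $\theta = \log\lambda$, $A(\theta) = e^\theta$), that assembling those pieces reproduces the ordinary Poisson pmf:

```python
import math

def poisson_canonical(x, lam):
    """Poisson pmf assembled from its exponential-family components."""
    theta = math.log(lam)              # natural parameter: log(lambda)
    h = 1.0 / math.factorial(x)        # base measure h(x) = 1/x!
    A = math.exp(theta)                # log-partition function A(theta) = e^theta
    return h * math.exp(theta * x - A) # T(x) = x

def poisson_pmf(x, lam):
    """Textbook Poisson pmf for comparison."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

# The two expressions agree term by term:
for x in range(8):
    assert abs(poisson_canonical(x, 2.5) - poisson_pmf(x, 2.5)) < 1e-12
```

The same exercise works for any member of the family: pick the component mappings, plug them into the template, and you recover the familiar density.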

Versatility and applicability

Many of the distributions you already know belong to this family:

  • Normal (Gaussian)
  • Binomial
  • Poisson
  • Gamma
  • Beta, Exponential, Geometric, Negative Binomial

Because they all share the same canonical form, a single estimation and inference pipeline (maximum likelihood, score equations, Fisher information) works across all of them. That's exactly the idea behind GLMs.

Common distributions in the exponential family

For each distribution below, the components refer to the canonical form $f(x;\theta) = h(x)\exp\bigl(\eta\, T(x) - A(\theta)\bigr)$. Working through these mappings is the best way to build intuition for the general formula.

Normal (Gaussian) distribution

The normal distribution has two natural parameters because it has two unknown quantities ($\mu$ and $\sigma^2$):

  • Natural parameter: $\theta = \bigl(\mu/\sigma^2,\; -1/(2\sigma^2)\bigr)$
  • Sufficient statistic: $T(x) = (x,\; x^2)$
  • Base measure: $h(x) = (2\pi)^{-1/2}$
  • Log-partition function: $A(\theta) = -\dfrac{\theta_1^2}{4\theta_2} - \dfrac{1}{2}\log(-2\theta_2)$

Notice how the sufficient statistic is a vector here. For a sample of size $n$, you only need $\sum x_i$ and $\sum x_i^2$ to estimate both parameters.
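
As a sketch of that compression (simulated data with an assumed seed and assumed true parameters), the MLEs of both parameters can be computed from $n$, $\sum x_i$, and $\sum x_i^2$ alone:

```python
import random

random.seed(1)  # assumed seed, for reproducibility
data = [random.gauss(3.0, 2.0) for _ in range(100_000)]  # true mu = 3, sigma^2 = 4

# The whole sample collapses to two sufficient statistics (plus n):
n = len(data)
s1 = sum(data)                   # sum of x_i
s2 = sum(x * x for x in data)    # sum of x_i^2

mu_hat = s1 / n                  # MLE of mu
var_hat = s2 / n - mu_hat ** 2   # MLE of sigma^2

# Both estimates land near the true values using only (n, s1, s2):
assert abs(mu_hat - 3.0) < 0.1 and abs(var_hat - 4.0) < 0.1
```

No other function of the raw observations is needed; any two datasets with the same $(n, \sum x_i, \sum x_i^2)$ yield identical estimates.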


Binomial and Poisson distributions

Binomial (with $n$ fixed and success probability $p$):

  • Natural parameter: $\theta = \log\bigl(p/(1-p)\bigr)$ (the log-odds)
  • Sufficient statistic: $T(x) = x$
  • Base measure: $h(x) = \binom{n}{x}$
  • Log-partition function: $A(\theta) = n\log(1 + e^\theta)$

The natural parameter here is the logit of $p$. This is why logistic regression uses a logit link: it connects the linear predictor directly to the canonical parameter.
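
A quick sketch of that link and its inverse (the helper names are hypothetical, not from any particular library):

```python
import math

def logit(p):
    """Canonical link for the binomial family: the log-odds."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Inverse link: maps a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

p = 0.8
eta = logit(p)                          # the natural parameter for this p
assert abs(inv_logit(eta) - p) < 1e-12  # the round trip recovers p
```

Because the inverse logit always lands in $(0, 1)$, any real-valued linear predictor maps to a valid probability.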

Poisson (with rate $\lambda$):

  • Natural parameter: $\theta = \log(\lambda)$
  • Sufficient statistic: $T(x) = x$
  • Base measure: $h(x) = 1/x!$
  • Log-partition function: $A(\theta) = e^\theta$

Similarly, the natural parameter is $\log(\lambda)$, which is why Poisson regression defaults to a log link.
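
The same round trip for the Poisson (hypothetical helper names again): the log link sends the mean to the natural parameter, and exponentiating sends it back, so any real-valued linear predictor maps to a positive mean.

```python
import math

def log_link(mu):
    """Canonical link for the Poisson family: theta = log(mu)."""
    return math.log(mu)

def mean_from_natural(theta):
    """Inverse link: e^theta recovers the mean."""
    return math.exp(theta)

lam = 4.2
theta = log_link(lam)
assert abs(mean_from_natural(theta) - lam) < 1e-12
```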

Gamma and other distributions

Gamma (with shape $\alpha$ and rate $\beta$):

  • Natural parameter: $\theta = (\alpha - 1,\; -\beta)$
  • Sufficient statistic: $T(x) = (\log x,\; x)$
  • Base measure: $h(x) = 1$
  • Log-partition function: $A(\theta) = -\alpha\log(-\theta_2) + \log\Gamma(\alpha)$, where $\alpha = \theta_1 + 1$

Other exponential-family members (Beta, Exponential, Geometric, Negative Binomial) each have their own specific mappings for $\eta$, $T$, $h$, and $A$. The procedure for deriving them is always the same: start from the standard density, algebraically rearrange it into the canonical form, and read off the components.
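
To illustrate that procedure for the Gamma case, here is a small pure-Python check (shape–rate parameterization assumed) that the components above reassemble the standard density:

```python
import math

def gamma_canonical(x, alpha, beta):
    """Gamma pdf assembled from its exponential-family components."""
    theta1, theta2 = alpha - 1.0, -beta       # natural parameters
    T1, T2 = math.log(x), x                   # sufficient statistics
    A = -alpha * math.log(-theta2) + math.lgamma(alpha)  # log-partition
    return math.exp(theta1 * T1 + theta2 * T2 - A)       # h(x) = 1

def gamma_pdf(x, alpha, beta):
    """Textbook shape-rate Gamma density for comparison."""
    return beta ** alpha * x ** (alpha - 1) * math.exp(-beta * x) / math.gamma(alpha)

for x in (0.5, 1.0, 3.7):
    assert abs(gamma_canonical(x, 2.5, 1.3) - gamma_pdf(x, 2.5, 1.3)) < 1e-12
```

The exponent $\theta_1 \log x + \theta_2 x$ becomes $x^{\alpha-1} e^{-\beta x}$ after exponentiation, and $e^{-A} = \beta^\alpha / \Gamma(\alpha)$ supplies the normalizing constant.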

Natural parameters and sufficient statistics

Role in exponential family distributions

Natural parameters are not just an arbitrary reparameterization. They're chosen so that the distribution takes the clean canonical form shown above. In that form, the natural parameter and the sufficient statistic appear together as a dot product $\eta(\theta)\, T(x)$ in the exponent. This pairing is what gives the exponential family its analytical tractability.

Sufficient statistics compress the data without losing information about $\theta$. For example, if you have $n$ observations from a Poisson distribution, the single number $\sum x_i$ is sufficient for $\lambda$. You could throw away the individual observations and still estimate $\lambda$ just as well.
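
A sketch of that claim with made-up data: two Poisson samples that disagree observation by observation but share the same sum produce the same estimate, and their log-likelihoods differ only by a $\lambda$-free constant coming from $h(x) = 1/x!$:

```python
import math

# Two Poisson samples that differ element-wise but share the sufficient statistic.
a = [1, 2, 3, 4]
b = [0, 0, 5, 5]
assert sum(a) == sum(b)

def log_lik(data, lam):
    """Poisson log-likelihood of a sample."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in data)

# Identical MLEs, since the MLE depends on the data only through the sum:
mle_a = sum(a) / len(a)
mle_b = sum(b) / len(b)
assert mle_a == mle_b == 2.5

# The gap between the two log-likelihoods does not depend on lambda:
diff_low = log_lik(a, 1.0) - log_lik(b, 1.0)
diff_high = log_lik(a, 7.5) - log_lik(b, 7.5)
assert abs(diff_low - diff_high) < 1e-12
```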


Relationship between natural parameters and sufficient statistics

The natural parameter and sufficient statistic are tightly coupled:

  • They always appear multiplied together in the exponent of the canonical form.
  • Changing the natural parameter changes which member of the exponential family you're working with; the sufficient statistic tells you what function of the data is relevant for that parameter.

For the normal distribution, the natural parameters are functions of $\mu$ and $\sigma^2$, while the sufficient statistics are $\sum x_i$ and $\sum x_i^2$. For the Poisson, the natural parameter is $\log \lambda$ and the sufficient statistic is simply $\sum x_i$.

Importance in inference and modeling

These properties make estimation straightforward:

  • Maximum likelihood estimation: The MLE for exponential family distributions reduces to matching the expected sufficient statistic to the observed sufficient statistic. That is, you solve $\mathbb{E}_{\hat{\theta}}[T(X)] = T_{\text{obs}}$.
  • Bayesian inference: Conjugate priors exist naturally for exponential family likelihoods, which simplifies posterior computation.
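
The moment-matching recipe in the first bullet is one line of arithmetic in the canonical cases (a sketch with made-up data):

```python
# Solve E_theta[T(X)] = T_obs for theta-hat in two canonical cases.

# Poisson: T = sum of counts, E_lambda[T] = n * lambda  =>  lambda_hat = sample mean.
counts = [3, 1, 4, 1, 5, 9, 2, 6]
lam_hat = sum(counts) / len(counts)
assert lam_hat == 3.875

# Binomial (n fixed): T = number of successes, E_p[T] = n * p  =>  p_hat = x / n.
x, n = 7, 20
p_hat = x / n
assert p_hat == 0.35
```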

This is a big part of why GLMs work so well in practice. The exponential family structure guarantees that score equations are well-behaved and that iterative fitting algorithms (like IRLS) converge reliably.

Mean and variance of exponential family distributions

Deriving the mean

One of the most useful results: the mean of the sufficient statistic equals the first derivative of the log-partition function with respect to the natural parameter.

$$\mathbb{E}[T(X)] = \frac{\partial A(\theta)}{\partial \theta}$$

To see this in action:

  • Poisson: $A(\theta) = e^\theta$, so $\mathbb{E}[X] = \frac{\partial}{\partial \theta} e^\theta = e^\theta = \lambda$. You recover the rate parameter directly.
  • Normal: $\mathbb{E}[X] = \frac{\partial A}{\partial \theta_1} = -\frac{\theta_1}{2\theta_2} = \mu$. You recover the location parameter.
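
You can sanity-check the Poisson case numerically: differentiate $A(\theta) = e^\theta$ with a central difference and compare against $\lambda$ (a quick sketch):

```python
import math

def A(theta):
    """Poisson log-partition function."""
    return math.exp(theta)

lam = 3.0
theta = math.log(lam)          # natural parameter
eps = 1e-6
# Central difference approximates A'(theta):
mean = (A(theta + eps) - A(theta - eps)) / (2 * eps)
assert abs(mean - lam) < 1e-5  # matches E[X] = lambda
```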

Deriving the variance

Take one more derivative and you get the variance:

$$\operatorname{Var}[T(X)] = \frac{\partial^2 A(\theta)}{\partial \theta^2}$$

  • Poisson: $\operatorname{Var}[X] = \frac{\partial^2}{\partial \theta^2} e^\theta = e^\theta = \lambda$. The variance equals the mean, which is the defining equidispersion property of the Poisson.
  • Normal: $\operatorname{Var}[X] = \frac{\partial^2 A}{\partial \theta_1^2} = -\frac{1}{2\theta_2} = \sigma^2$.
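
The same numerical check works for the second derivative (a sketch, using a second central difference):

```python
import math

def A(theta):
    """Poisson log-partition function."""
    return math.exp(theta)

lam = 3.0
theta = math.log(lam)
eps = 1e-4
# Second central difference approximates A''(theta) = Var[T(X)]:
var = (A(theta + eps) - 2.0 * A(theta) + A(theta - eps)) / eps ** 2
assert abs(var - lam) < 1e-4   # matches Var[X] = lambda
```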

Because $A(\theta)$ is always convex (its second derivative is a variance, which is non-negative), this also guarantees that the variance is non-negative for any member of the family.

Power of the exponential family representation

The log-partition function $A(\theta)$ acts as a moment-generating device. Differentiating it once gives the mean; differentiating it twice gives the variance. Higher cumulants follow from higher derivatives.

This eliminates the need for explicit integration or summation to compute moments. For GLM theory specifically, the relationship $\operatorname{Var}[T(X)] = A''(\theta)$ is what defines the variance function, which in turn determines how the variance of the response relates to its mean. That connection is central to how GLMs handle non-constant variance across different distribution families.