Fiveable

📊Bayesian Statistics Unit 4 Review


4.2 Maximum likelihood estimation

Written by the Fiveable Content Team • Last updated August 2025

Concept of Maximum Likelihood

Maximum likelihood estimation (MLE) is a method for estimating the parameters of a statistical model by finding the values that make your observed data most probable. It's one of the most widely used tools in frequentist statistics, but it also connects directly to Bayesian analysis since the likelihood function sits at the heart of both approaches.

Definition and Purpose

MLE answers a straightforward question: given the data you observed, which parameter values would have been most likely to produce it?

You start with a probability model (say, a normal distribution), observe some data, and then search for the parameter values (like the mean and variance) that maximize the probability of seeing that exact data. The resulting estimates have strong statistical properties:

  • Consistency: as your sample size grows, the MLE converges to the true parameter value
  • Asymptotic efficiency: among unbiased estimators, MLEs achieve the lowest possible variance (in large samples)

Historical Background

R.A. Fisher developed MLE in the 1920s as a principled alternative to older techniques like the method of moments and least squares. The approach gained traction in the mid-20th century as computational power made it practical to maximize complex likelihood functions. It also spawned related tools like likelihood ratio tests and information criteria (AIC, BIC).

Relationship to Bayesian Inference

This is where MLE becomes especially relevant for a Bayesian statistics course. The connection is direct:

  • The likelihood function used in MLE is the same likelihood that appears in Bayes' theorem
  • MLE is equivalent to finding the mode of the posterior when you use a flat (uniform) prior
  • Bayesian methods treat parameters as random variables with distributions, while MLE treats them as fixed unknown quantities
  • In practice, MLE often serves as a starting point or comparison benchmark for Bayesian models

The key difference: MLE gives you a single point estimate, while Bayesian inference gives you an entire posterior distribution over parameter values.

Likelihood Function

Mathematical Formulation

The likelihood function expresses how probable your observed data is for a given set of parameter values. It's written as:

L(\theta | x) = f(x | \theta)

where θ represents the parameters and x represents the observed data. Numerically, this is the same as the probability (or density) function, but the perspective flips: the data is fixed, and you vary θ.

For independent and identically distributed (i.i.d.) observations, the likelihood factors into a product:

L(\theta | x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i | \theta)

You use the probability density function (pdf) for continuous data or the probability mass function (pmf) for discrete data.
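To make the product form concrete, here is a minimal sketch (the helper names `normal_pdf` and `likelihood` are illustrative, not from any library) that evaluates the likelihood of a small i.i.d. sample under a normal model:

```python
import math

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) evaluated at x
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def likelihood(data, mu, sigma):
    # L(theta | x_1, ..., x_n) = product of f(x_i | theta) for i.i.d. data
    prod = 1.0
    for x in data:
        prod *= normal_pdf(x, mu, sigma)
    return prod

data = [1.2, 0.8, 1.5, 0.9]
# Parameter values near the sample mean (1.1) make the data more probable
print(likelihood(data, 1.1, 0.5) > likelihood(data, 3.0, 0.5))  # True
```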

Properties of Likelihood

  • The likelihood is not a probability distribution over θ. It doesn't integrate to 1 over the parameter space. This is a common point of confusion.
  • It's invariant under one-to-one parameter transformations (if you reparameterize, the MLE transforms accordingly).
  • The likelihood principle states that all the information the data provides about θ is contained in the likelihood function.

Log-Likelihood Function

In practice, you almost always work with the log-likelihood instead:

\ell(\theta | x) = \log L(\theta | x)

Why? Two reasons:

  1. Products turn into sums, which are much easier to differentiate and compute: \ell(\theta | x_1, \ldots, x_n) = \sum_{i=1}^{n} \log f(x_i | \theta)

  2. Multiplying many small probabilities together causes numerical underflow on computers. Summing log-probabilities avoids this.

Because the logarithm is a strictly increasing function, maximizing the log-likelihood gives you the same answer as maximizing the likelihood itself.
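The underflow problem is easy to demonstrate directly (a toy sketch, not tied to any particular model):

```python
import math

# 1000 observations, each contributing a probability of 1e-5
n, p = 1000, 1e-5

product = 1.0
for _ in range(n):
    product *= p          # repeated multiplication of small probabilities
print(product)            # 0.0 -- the true value 1e-5000 is below float range

log_lik = n * math.log(p) # summing log-probabilities stays in range
print(log_lik)            # about -11512.93
```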

Maximum Likelihood Estimators

Definition and Characteristics

The MLE is the parameter value that maximizes the likelihood (or equivalently, the log-likelihood):

\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \, L(\theta | x)

A few things to keep in mind:

  • MLEs are invariant under reparameterization: if θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ) for any one-to-one function g
  • MLEs don't always exist (the likelihood might not have a finite maximum), and they aren't always unique

Asymptotic Properties

As the sample size n grows large, MLEs have three key properties under standard regularity conditions:

  1. Consistency: θ̂_MLE → θ_true as n → ∞
  2. Asymptotic normality: the distribution of θ̂_MLE approaches a normal distribution centered at the true value
  3. Asymptotic efficiency: the variance approaches the Cramér-Rao lower bound, meaning no unbiased estimator can do better

Formally, for large nn:

\hat{\theta}_{\text{MLE}} \approx N\!\left(\theta_{\text{true}}, \, \frac{1}{I(\theta)}\right)

where I(θ) is the Fisher information. The convergence rate is O(1/√n). These asymptotic results are the basis for constructing confidence intervals and hypothesis tests from MLEs.
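The O(1/√n) rate can be checked by simulation. The sketch below (the helper `mle_sampling_sd` is illustrative) estimates the spread of the sample-mean MLE for a normal model, where theory predicts a standard deviation of σ/√n:

```python
import random

def mle_sampling_sd(n, reps=2000, mu=0.0, sigma=2.0, seed=42):
    # Empirical sd of the MLE (the sample mean) across repeated samples;
    # asymptotic theory predicts sd = sigma / sqrt(n)
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(sum(sample) / n)
    m = sum(means) / reps
    var = sum((x - m) ** 2 for x in means) / reps
    return var ** 0.5

# Quadrupling n should roughly halve the spread: O(1/sqrt(n)) convergence
sd25, sd100 = mle_sampling_sd(25), mle_sampling_sd(100)
print(sd25 / sd100)  # close to 2
```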

Methods for Finding MLEs


Analytical Solutions

For some distributions, you can solve for the MLE by hand:

  1. Write down the log-likelihood \ell(\theta | x)
  2. Take the derivative with respect to θ and set it equal to zero: d\ell/d\theta = 0
  3. Solve for θ
  4. Verify it's a maximum (check the second derivative is negative)

Example (Normal distribution): For data x₁, …, xₙ drawn from N(μ, σ²), the MLEs are the sample mean μ̂ = x̄ and the sample variance σ̂² = (1/n) Σ(xᵢ − x̄)². (Note: this MLE for variance divides by n, not n − 1, so it's slightly biased in small samples.)
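These closed forms are easy to verify numerically; `normal_mle` below is an illustrative helper, not a library function:

```python
def normal_mle(data):
    # Closed-form MLEs for N(mu, sigma^2): the sample mean and the
    # variance that divides by n (not n - 1)
    n = len(data)
    mu_hat = sum(data) / n
    var_hat = sum((x - mu_hat) ** 2 for x in data) / n
    return mu_hat, var_hat

data = [2.0, 4.0, 4.0, 6.0]
mu_hat, var_hat = normal_mle(data)
print(mu_hat)   # 4.0
print(var_hat)  # 2.0 -- the n-1 (unbiased) version would give 8/3 instead
```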

Closed-form solutions also exist for exponential and Poisson distributions, among others.

Numerical Optimization Techniques

Most real-world models don't have neat closed-form MLEs. In those cases, you use iterative algorithms:

  • Newton-Raphson: uses both the gradient and the Hessian (second derivative matrix) to take smart steps toward the maximum. Fast convergence near the solution.
  • Gradient descent/ascent: follows the slope of the log-likelihood uphill. Simpler but can be slower.
  • Quasi-Newton methods (like BFGS): approximate the Hessian to reduce computation per step.

These methods require you to choose starting values carefully. Poor starting points can lead to convergence at a local maximum rather than the global one, or failure to converge at all.
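As a concrete sketch of Newton-Raphson, the example below uses the exponential distribution, whose MLE λ̂ = 1/x̄ is known in closed form so the answer can be checked (the function name is illustrative):

```python
def exp_mle_newton(data, lam0=0.1, tol=1e-10, max_iter=100):
    # Newton-Raphson for the Exponential(lambda) MLE.
    # Log-likelihood: l(lam) = n*log(lam) - lam*sum(x)
    # Closed-form answer for comparison: lam_hat = n / sum(x) = 1 / mean(x)
    n, s = len(data), sum(data)
    lam = lam0
    for _ in range(max_iter):
        score = n / lam - s            # first derivative  l'(lam)
        hessian = -n / lam ** 2        # second derivative l''(lam)
        step = score / hessian
        lam -= step                    # Newton update: lam - l'(lam)/l''(lam)
        if abs(step) < tol:
            break
    return lam

data = [0.5, 1.5, 2.0, 4.0]   # mean 2.0, so the MLE is 0.5
# The starting value matters: for this model the iteration only converges
# from lam0 in (0, 2/mean) -- echoing the warning about poor starting points
print(exp_mle_newton(data))  # 0.5
```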

EM Algorithm

The Expectation-Maximization (EM) algorithm handles models with latent (hidden) variables or missing data:

  1. E-step: Compute the expected value of the log-likelihood, averaging over the latent variables using current parameter estimates
  2. M-step: Maximize that expected log-likelihood to get updated parameter estimates
  3. Repeat until the estimates stabilize (converge)

The EM algorithm guarantees the likelihood increases at every iteration, but it can get stuck at local maxima. It's widely used for mixture models, hidden Markov models, and factor analysis.
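The E- and M-steps can be sketched for a deliberately simplified two-component Gaussian mixture (unit variances and equal mixing weights assumed, so only the means are estimated):

```python
import math

def em_gaussian_mixture(data, mu1, mu2, n_iter=50):
    # EM for a two-component Gaussian mixture with known unit variances
    # and equal mixing weights -- a simplified sketch, not a general mixture fit
    def phi(x, mu):
        return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

    for _ in range(n_iter):
        # E-step: posterior responsibility of component 1 for each point
        r = []
        for x in data:
            p1, p2 = phi(x, mu1), phi(x, mu2)
            r.append(p1 / (p1 + p2))
        # M-step: responsibility-weighted means maximize the expected log-likelihood
        mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
    return mu1, mu2

data = [-2.1, -1.9, -2.0, 1.9, 2.1, 2.0]
mu1, mu2 = em_gaussian_mixture(data, mu1=-1.0, mu2=1.0)
print(mu1, mu2)  # roughly -2.0 and 2.0, the centers of the two clusters
```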

Applications in Statistical Models

Linear Regression

Under the assumption of normally distributed errors with constant variance, the MLEs for regression coefficients are identical to the ordinary least squares (OLS) estimates:

\hat{\beta} = (X^T X)^{-1} X^T y

This is one of the rare cases with a clean closed-form solution. The MLE framework adds a probabilistic interpretation to what's otherwise a purely algebraic fitting procedure.
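For simple (one-predictor) regression, the matrix formula reduces to a closed form that is easy to compute by hand; `ols_mle` below is an illustrative helper:

```python
def ols_mle(x, y):
    # For y = b0 + b1*x + noise with normal errors, the MLE equals the
    # least-squares solution (X^T X)^{-1} X^T y, which reduces to the
    # familiar slope/intercept formulas below
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]   # exactly y = 1 + 2x
print(ols_mle(x, y))  # (1.0, 2.0)
```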

Logistic Regression

For binary outcomes (0/1), logistic regression models the probability of success using the logistic function. Each observation follows a Bernoulli distribution, so the log-likelihood is:

\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]

where pᵢ = 1 / (1 + e^(−Xᵢβ)). There's no closed-form solution here because of the nonlinearity, so MLEs are found through iterative methods like Newton-Raphson or iteratively reweighted least squares (IRLS).
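The log-likelihood itself is straightforward to evaluate. In the sketch below (single predictor, no intercept), a crude grid search stands in for Newton-Raphson or IRLS just to locate the maximum; real implementations use the iterative methods, but the objective is the same:

```python
import math

def logistic_loglik(beta, xs, ys):
    # Bernoulli log-likelihood: sum of y*log(p) + (1-y)*log(1-p),
    # with p_i = 1 / (1 + exp(-x_i * beta))
    ll = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-x * beta))
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 1, 0, 1]
# No closed form: scan a grid of beta values and keep the best one
betas = [i / 100 for i in range(1, 500)]
beta_hat = max(betas, key=lambda b: logistic_loglik(b, xs, ys))
print(beta_hat)  # around 0.4 for this toy data
```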

Poisson Regression

Poisson regression models count data (number of events) using a log-link function to keep predicted values non-negative. MLEs are found numerically. Common applications include modeling accident rates, disease counts, or species abundance.

Likelihood Ratio Tests

Test Statistic Formulation

Likelihood ratio tests compare how well two models explain the data. The test statistic is:

\Lambda = -2 \log \frac{L(\theta_0)}{L(\hat{\theta})}

where θ₀ is the parameter value under the null hypothesis and θ̂ is the unrestricted MLE. Larger values of Λ mean the data are much more probable under the alternative than under the null.

Asymptotic Distribution

Under the null hypothesis and standard regularity conditions, Λ follows a chi-squared distribution as the sample size grows:

\Lambda \sim \chi^2_k

where k is the difference in the number of free parameters between the two models. This result lets you compute p-values and critical values for hypothesis testing.
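A worked example for a single proportion (k = 1 free parameter): testing H₀: p = 0.5 after observing 7 successes in 10 Bernoulli trials. Since a chi-squared variable with 1 df is the square of a standard normal, the p-value can be computed with the error function from the standard library (the helper names are illustrative):

```python
import math

def bernoulli_loglik(p, successes, n):
    # log L(p) for 'successes' out of n trials (binomial coefficient dropped,
    # since it cancels in the likelihood ratio)
    return successes * math.log(p) + (n - successes) * math.log(1 - p)

def lr_test(successes, n, p0):
    # Lambda = -2 * log[ L(p0) / L(p_hat) ], with p_hat the unrestricted MLE
    p_hat = successes / n
    lam = -2 * (bernoulli_loglik(p0, successes, n)
                - bernoulli_loglik(p_hat, successes, n))
    # Under H0, Lambda ~ chi-squared(1); chi2 with 1 df is the square of a
    # standard normal, so Phi via math.erf gives the p-value
    phi = 0.5 * (1 + math.erf(math.sqrt(lam) / math.sqrt(2)))
    p_value = 2 * (1 - phi)
    return lam, p_value

lam, p_value = lr_test(7, 10, 0.5)   # 7 successes in 10 trials, H0: p = 0.5
print(lam, p_value)  # about 1.65 and 0.20 -- no rejection at the 5% level
```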

Power of Likelihood Ratio Tests

The power of a test is the probability of correctly rejecting a false null hypothesis. For likelihood ratio tests:

  • Power increases with sample size and effect size
  • In finite samples, likelihood ratio tests are often more powerful than alternatives like the Wald test or score test
  • For complex models, power can be estimated through simulation

Limitations and Alternatives

Small Sample Issues

The appealing asymptotic properties of MLEs (consistency, efficiency, normality) rely on large samples. In small samples:

  • MLEs can be biased (recall the n vs. n − 1 issue for variance)
  • Confidence intervals based on asymptotic normality may have poor coverage
  • Bayesian methods with informative priors often produce more stable and reliable estimates when data is scarce

Regularization Techniques

In high-dimensional settings (many parameters relative to observations), MLEs tend to overfit. Regularization adds a penalty to the likelihood to constrain parameter estimates:

  • Ridge regression (L2 penalty): shrinks coefficients toward zero
  • Lasso (L1 penalty): shrinks some coefficients exactly to zero, performing variable selection
  • Elastic net: combines L1 and L2 penalties

These methods trade a small amount of bias for a large reduction in variance.

Bayesian vs. Maximum Likelihood

|                   | MLE                                     | Bayesian                            |
| ----------------- | --------------------------------------- | ----------------------------------- |
| Parameters        | Fixed unknown quantities                | Random variables with distributions |
| Prior information | Not used                                | Incorporated via prior              |
| Output            | Point estimate (+ confidence intervals) | Full posterior distribution         |
| Small samples     | Can be unreliable                       | Priors stabilize estimates          |
| Computation       | Often simpler                           | Can be more demanding (MCMC)        |

MLE can be viewed as the Bayesian MAP (maximum a posteriori) estimate with a uniform prior. In large samples, the two approaches typically agree closely because the likelihood dominates the prior.

Computational Aspects

Software Implementations

  • R: optim(), nlm(), glm(), and the bbmle package
  • Python: scipy.optimize, statsmodels, scikit-learn
  • Bayesian tools like Stan and PyMC often use MLE as a starting point for MCMC sampling

Computational Complexity

Complexity depends heavily on the model. Simple distributions with closed-form MLEs are trivial to compute. Iterative methods for complex models may require many evaluations of the likelihood and its derivatives, and high-dimensional problems can become computationally expensive.

Parallel Processing

Because the likelihood for i.i.d. data is a product (or sum of logs) over independent observations, likelihood evaluations are naturally parallelizable. This is especially useful for bootstrap resampling and cross-validation on large datasets.

Advanced Topics

Profile Likelihood

When your model has nuisance parameters (parameters you need but don't care about), the profile likelihood lets you focus on the parameter of interest. For each candidate value of that parameter, you maximize the likelihood over all the nuisance parameters. The resulting curve gives you a likelihood that depends only on the parameter you care about, which is useful for constructing confidence intervals in complex models.

Penalized Maximum Likelihood

This generalizes regularization within the likelihood framework. You maximize:

\ell(\theta) - \lambda \cdot P(\theta)

where P(θ) is a penalty function (L1, L2, or something else) and λ controls the strength of penalization. This connects naturally to Bayesian inference: an L2 penalty corresponds to a Gaussian prior on the parameters, and an L1 penalty corresponds to a Laplace prior.
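For a penalized normal-mean problem the maximum has a closed form, which makes the shrinkage effect easy to see (the helper below is illustrative; unit variance assumed):

```python
def penalized_mean_mle(data, lam):
    # Maximize l(mu) - lam * mu^2 for N(mu, 1) data, where
    # l(mu) = -0.5 * sum((x - mu)^2).  Setting the derivative
    # sum(x) - n*mu - 2*lam*mu to zero gives the closed form below:
    n = len(data)
    return sum(data) / (n + 2 * lam)

data = [2.0, 3.0, 4.0]  # sample mean 3.0
print(penalized_mean_mle(data, 0.0))  # 3.0 -- zero penalty recovers the plain MLE
print(penalized_mean_mle(data, 1.5))  # 1.5 -- the L2 penalty shrinks toward zero
```

Consistent with the Bayesian correspondence above, this penalized estimate is exactly the posterior mode under a zero-mean Gaussian prior with variance 1/(2λ).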

Empirical Likelihood Methods

Empirical likelihood is a nonparametric approach that constructs a likelihood function without assuming a specific distributional form. It combines the flexibility of nonparametric statistics with the inferential power of likelihood-based methods, and it's used for constructing confidence regions and hypothesis tests in semiparametric models.