
🎣Statistical Inference

Key Concepts of Maximum Likelihood Estimation


Why This Matters

Maximum Likelihood Estimation sits at the heart of statistical inference—it's the workhorse method you'll encounter again and again when fitting models to data. Understanding MLE isn't just about memorizing formulas; you're being tested on your ability to connect the likelihood function, optimization techniques, and asymptotic properties into a coherent framework for parameter estimation. These concepts form the foundation for everything from simple proportion estimates to complex regression models.

When exam questions ask about MLE, they're probing whether you understand why this method works, not just how to compute it. Can you explain why we take logarithms? Do you know what Fisher Information tells us about precision? Can you connect the Cramér-Rao bound to efficiency? Don't just memorize the steps—know what statistical principle each concept illustrates and how they fit together.


The Foundation: Likelihood and Log-Likelihood

The likelihood function is your starting point for MLE. It flips the usual probability question: instead of asking "what's the probability of this data given these parameters," we ask "which parameters make this observed data most probable?" This reframing is the conceptual key to understanding MLE.

Likelihood Function

  • Measures how well parameters explain observed data—defined as the product of probability density (or mass) functions across all data points: $L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$
  • Not a probability distribution over parameters—it's a function of $\theta$ with the data held fixed, which is why it doesn't need to integrate to 1
  • Foundation for all MLE calculations—every derivation starts here, so understanding this concept unlocks everything else

Log-Likelihood Function

  • Transforms products into sums—taking $\ell(\theta) = \ln L(\theta)$ converts $\prod f(x_i \mid \theta)$ into $\sum \ln f(x_i \mid \theta)$, making differentiation far simpler
  • Preserves the maximum—since $\ln$ is monotonically increasing, maximizing the log-likelihood is mathematically equivalent to maximizing the likelihood
  • Computationally essential—prevents numerical underflow when multiplying many small probabilities together

Compare: Likelihood vs. Log-Likelihood—both encode the same information about parameter plausibility, but log-likelihood converts products to sums. If an FRQ asks you to derive an MLE, always convert to log-likelihood first—it's expected and saves time.
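
To see the computational point concretely, here is a minimal Python sketch (an illustration added here, not part of the original guide) that evaluates both forms on simulated Normal data; the sample size and parameter values are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=2000)  # simulated data, arbitrary true values

def likelihood(mu, data):
    # Product of Normal densities: underflows to 0.0 once n is in the hundreds
    return np.prod(stats.norm.pdf(data, loc=mu, scale=1.0))

def log_likelihood(mu, data):
    # Sum of log-densities: numerically stable and has the same maximizer
    return np.sum(stats.norm.logpdf(data, loc=mu, scale=1.0))

print(likelihood(2.0, x))      # 0.0 -- the product has underflowed
print(log_likelihood(2.0, x))  # a large negative but finite number
```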


The Optimization Framework: Finding the Maximum

Once you have the log-likelihood, MLE becomes an optimization problem. The tools here—the score function and systematic derivation steps—give you a roadmap for finding parameter estimates analytically or numerically.

Definition of Maximum Likelihood Estimation

  • Selects parameters that maximize the likelihood of observed data—formally, $\hat{\theta}_{MLE} = \arg\max_\theta L(\theta \mid x)$
  • Widely used due to strong large-sample properties—consistency, asymptotic normality, and efficiency make MLE the default choice in most applications
  • Requires a specified probability model—you must assume a distributional form before applying MLE

Score Function

  • The gradient of the log-likelihood—defined as $U(\theta) = \frac{\partial}{\partial \theta} \ell(\theta)$, it points in the direction of steepest likelihood increase
  • Setting $U(\theta) = 0$ yields the MLE—this first-order condition identifies critical points where the likelihood is maximized (or minimized)
  • Expected value equals zero at the true parameter—this property, $E[U(\theta_0)] = 0$, is fundamental to proving MLE's asymptotic properties
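
The zero-mean property is easy to check numerically. The short sketch below (illustrative settings, not from the original) averages the Poisson score $U(\lambda) = \sum_i x_i/\lambda - n$ over many simulated samples evaluated at the true rate.

```python
import numpy as np

rng = np.random.default_rng(6)
lam0, n, reps = 2.5, 100, 50_000  # arbitrary illustration settings

# Poisson score evaluated at the true rate: U(lam0) = sum(x)/lam0 - n
samples = rng.poisson(lam=lam0, size=(reps, n))
scores = samples.sum(axis=1) / lam0 - n

print(scores.mean())  # close to 0, consistent with E[U(theta_0)] = 0
```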

Steps to Derive MLE

  • Write the likelihood, then take the log—start with $L(\theta) = \prod f(x_i \mid \theta)$, then convert to $\ell(\theta) = \sum \ln f(x_i \mid \theta)$
  • Differentiate and set equal to zero—solve $\frac{\partial \ell}{\partial \theta} = 0$ for $\theta$; verify it's a maximum using the second derivative
  • Check boundary conditions if necessary—some problems have constrained parameter spaces where the maximum occurs at a boundary, not at an interior critical point
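
As a concrete walk-through of these steps, here is the standard derivation for the Poisson rate, written out as a math block (added here for illustration; it matches the closed-form result quoted later in this guide).

```latex
% Worked example: MLE of the Poisson rate \lambda from observations x_1, \dots, x_n
\begin{align*}
L(\lambda) &= \prod_{i=1}^{n} \frac{e^{-\lambda} \lambda^{x_i}}{x_i!}
  && \text{write the likelihood} \\
\ell(\lambda) &= -n\lambda + \Big(\sum_{i=1}^{n} x_i\Big) \ln \lambda - \sum_{i=1}^{n} \ln(x_i!)
  && \text{take the log} \\
\frac{\partial \ell}{\partial \lambda} &= -n + \frac{\sum_i x_i}{\lambda} = 0
  \;\Longrightarrow\; \hat{\lambda} = \bar{x}
  && \text{differentiate and solve} \\
\frac{\partial^2 \ell}{\partial \lambda^2} &= -\frac{\sum_i x_i}{\lambda^2} < 0
  && \text{second derivative confirms a maximum}
\end{align*}
% Boundary check: if every x_i = 0, the maximum sits at the boundary \hat{\lambda} = 0.
```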

Compare: Analytical vs. Numerical MLE—simple distributions (Normal, Binomial, Poisson) yield closed-form solutions, while complex models require iterative methods. Know which distributions have clean MLEs for quick exam answers.
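
The sketch below (an illustration, not from the original) checks the closed-form Poisson MLE against a general-purpose numerical optimizer; both should land on the sample mean.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.poisson(lam=3.5, size=500)  # simulated counts, arbitrary true rate

# Analytical MLE: the sample mean, as derived above
lam_closed_form = x.mean()

def neg_loglik(lam):
    # Negative Poisson log-likelihood as a function of the rate
    return -np.sum(stats.poisson.logpmf(x, lam))

# Numerical MLE: minimize the negative log-likelihood over a bounded interval
result = minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded")

print(lam_closed_form, result.x)  # the two estimates agree to optimizer tolerance
```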


Precision and Information: Quantifying Uncertainty

MLE doesn't just give you point estimates—it comes with a built-in framework for understanding how precise those estimates are. Fisher Information and the Cramér-Rao bound tell you the theoretical limits of estimation precision.

Fisher Information

  • Quantifies information data carries about parameters—defined as $I(\theta) = E\left[\left(\frac{\partial \ln f}{\partial \theta}\right)^2\right]$ or, equivalently, $I(\theta) = -E\left[\frac{\partial^2 \ln f}{\partial \theta^2}\right]$
  • Higher information means more precise estimates—intuitively, if the likelihood curve is sharply peaked, small parameter changes produce large likelihood changes
  • Scales linearly with sample size—for $n$ i.i.d. observations, total Fisher Information is $nI(\theta)$, explaining why larger samples yield better estimates

Cramér-Rao Lower Bound

  • Sets the floor for estimator variance—for any unbiased estimator $\hat{\theta}$, we have $\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$
  • MLE achieves this bound asymptotically—this is what "efficient" means: no unbiased estimator can do better in large samples
  • Connects information to precision—the bound shows exactly how Fisher Information translates into the best possible variance

Compare: Fisher Information vs. Cramér-Rao Bound—Fisher Information measures how much you can learn; the CR bound tells you the best precision achievable with that information. FRQs often ask you to compute one to find the other.
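
A quick simulation (illustrative values only, added here) makes the connection concrete for a Bernoulli parameter: the per-observation Fisher Information is $1/\big(p(1-p)\big)$, so the Cramér-Rao floor for $n$ observations is $p(1-p)/n$, and the sample proportion essentially sits on that floor.

```python
import numpy as np

rng = np.random.default_rng(2)
p_true, n, reps = 0.3, 400, 20_000  # arbitrary illustration settings

# Fisher Information per Bernoulli observation and the CR bound for n observations
fisher_info = 1.0 / (p_true * (1.0 - p_true))
cr_bound = 1.0 / (n * fisher_info)          # = p(1-p)/n

# Monte Carlo variance of the MLE p_hat = x/n across many simulated samples
p_hats = rng.binomial(n, p_true, size=reps) / n
print(cr_bound, p_hats.var())               # the two numbers are very close
```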


Asymptotic Properties: Why MLE Works

The theoretical justification for MLE comes from its behavior in large samples. These three properties—consistency, asymptotic normality, and efficiency—are why statisticians trust MLE as their go-to estimation method.

Properties of MLE

  • Consistency—as $n \to \infty$, $\hat{\theta}_{MLE} \xrightarrow{p} \theta_0$, meaning estimates converge in probability to the true parameter value
  • Asymptotic normality—for large $n$, $\sqrt{n}(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1})$, enabling confidence intervals and hypothesis tests
  • Efficiency—the MLE attains the Cramér-Rao lower bound asymptotically, so in large samples no unbiased estimator achieves smaller variance
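
These properties can be checked empirically. The sketch below (illustrative parameter choices, added here) simulates many Poisson samples and compares the spread of $\sqrt{n}(\hat{\lambda} - \lambda_0)$ with the theoretical limit $N(0, I(\lambda_0)^{-1})$; since $I(\lambda) = 1/\lambda$, the limiting variance is $\lambda_0$.

```python
import numpy as np

rng = np.random.default_rng(3)
lam0, n, reps = 4.0, 200, 10_000  # arbitrary illustration settings

# The MLE for each simulated sample is the sample mean
lam_hats = rng.poisson(lam=lam0, size=(reps, n)).mean(axis=1)

# Standardized estimates should look approximately N(0, I(lam0)^{-1}) = N(0, lam0)
z = np.sqrt(n) * (lam_hats - lam0)
print(z.mean(), z.var())  # close to 0 and to lam0 = 4.0
```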

Applications: MLE for Common Distributions

Knowing the MLEs for standard distributions saves time on exams and builds intuition. Notice how each MLE has an intuitive interpretation as a sample analog of the population parameter.

MLE for the Normal Distribution

  • Sample mean estimates $\mu$—$\hat{\mu} = \bar{x} = \frac{1}{n}\sum x_i$, the natural "center" of the data
  • Sample variance estimates $\sigma^2$—$\hat{\sigma}^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$; note this uses $n$, not $n-1$, making it biased but still the MLE
  • Derivation requires two partial derivatives—differentiating the log-likelihood with respect to both $\mu$ and $\sigma^2$ and solving the resulting equations is a common exam exercise that tests your calculus and algebra skills together

MLE for the Binomial Distribution

  • Sample proportion estimates $p$—$\hat{p} = \frac{x}{n}$, where $x$ is the number of successes in $n$ trials
  • Intuitive interpretation—the best estimate of the success probability is simply the observed success rate
  • Foundation for proportion inference—confidence intervals and hypothesis tests for proportions build directly on this MLE

MLE for the Poisson Distribution

  • Sample mean estimates $\lambda$—$\hat{\lambda} = \bar{x}$, the average count per interval
  • Makes sense given Poisson properties—since $E[X] = \lambda$ for the Poisson, the sample mean is the natural estimator
  • Used in count data applications—from disease incidence to website traffic, this MLE appears constantly in practice

Compare: Normal vs. Poisson MLE—both use the sample mean, but for different parameters ($\mu$ vs. $\lambda$). The Normal also requires estimating the variance separately, while the Poisson's single parameter controls both center and spread.
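
A short sketch (simulated data with arbitrary true values, added here) computes all three closed-form MLEs; note that `np.var` with its default `ddof=0` is exactly the $n$-denominator MLE of the Normal variance.

```python
import numpy as np

rng = np.random.default_rng(4)

# Normal: MLEs are the sample mean and the n-denominator sample variance
x_norm = rng.normal(loc=10.0, scale=2.0, size=1_000)
mu_hat, sigma2_hat = x_norm.mean(), x_norm.var()   # ddof=0 by default

# Binomial: MLE is the observed success proportion
successes, trials = rng.binomial(n=50, p=0.3), 50
p_hat = successes / trials

# Poisson: MLE is the average count
x_pois = rng.poisson(lam=6.0, size=1_000)
lam_hat = x_pois.mean()

print(mu_hat, sigma2_hat, p_hat, lam_hat)
```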


Computational Methods: When Calculus Isn't Enough

Many realistic models don't yield closed-form MLEs. Numerical optimization methods become essential tools for finding estimates when you can't solve $\frac{\partial \ell}{\partial \theta} = 0$ analytically.

Numerical Methods for Finding MLE

  • Newton-Raphson iterates toward the maximum—updates via $\theta^{(t+1)} = \theta^{(t)} - \frac{U(\theta^{(t)})}{\ell''(\theta^{(t)})}$, using the score and the curvature (second derivative) of the log-likelihood
  • Requires good starting values—convergence is fast near the solution but can fail with poor initialization or multimodal likelihoods
  • Fisher scoring variant uses expected information—replaces observed second derivative with Fisher Information, often improving stability
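
For a parameter with no closed-form MLE, here is a minimal Fisher-scoring sketch for the shape $\alpha$ of a Gamma distribution with known scale 1 (an illustrative choice added here; for this model the observed and expected information coincide, so Newton-Raphson and Fisher scoring give the same updates).

```python
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(5)
x = rng.gamma(shape=3.0, scale=1.0, size=2_000)  # simulated data, arbitrary true shape

n, sum_log_x = x.size, np.log(x).sum()

def score(alpha):
    # U(alpha): derivative of the Gamma(shape=alpha, scale=1) log-likelihood
    return sum_log_x - n * digamma(alpha)

def info(alpha):
    # Fisher Information for the full sample: n * trigamma(alpha)
    return n * polygamma(1, alpha)

alpha = x.mean()  # starting value (method-of-moments guess, since E[X] = alpha here)
for _ in range(25):
    step = score(alpha) / info(alpha)
    alpha += step                      # Fisher-scoring update: alpha + U/I
    if abs(step) < 1e-10:
        break

print(alpha)  # numerical MLE of the shape parameter, close to the true value 3.0
```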

Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Likelihood construction | Product of PDFs/PMFs; a function of the parameters, not the data |
| Log-likelihood benefits | Converts products to sums, prevents underflow, same maximum |
| Score function | Gradient of the log-likelihood; set to zero for the MLE |
| Fisher Information | Curvature of the log-likelihood; precision measure; scales with $n$ |
| Cramér-Rao bound | Variance floor $\geq 1/I(\theta)$; achieved by efficient estimators |
| Asymptotic properties | Consistency, normality, efficiency—all require large $n$ |
| Closed-form MLEs | Normal ($\bar{x}$, $\hat{\sigma}^2$), Binomial ($\hat{p}$), Poisson ($\bar{x}$) |
| Numerical methods | Newton-Raphson, Fisher scoring; require starting values |

Self-Check Questions

  1. Why do we maximize the log-likelihood instead of the likelihood directly? What two computational advantages does this provide?

  2. Compare Fisher Information and the Cramér-Rao lower bound: how does one determine the other, and what does it mean for MLE to be "efficient"?

  3. The MLE for Normal variance uses $n$ in the denominator, not $n-1$. Is this estimator biased? How does this relate to MLE's asymptotic properties?

  4. Which two distributions covered here have MLEs equal to the sample mean? Why does this make intuitive sense given the parameters being estimated?

  5. If you're asked to derive the MLE for a new distribution on an FRQ, what four steps should you follow, and at which step do most computational errors occur?