Maximum likelihood estimation (MLE) is a fundamental technique in statistical inference, used to estimate the parameters of probability distributions. It finds the parameter values that make the observed data most probable, providing a bridge between frequentist and Bayesian approaches to statistics.

This method plays a crucial role in developing and evaluating statistical models. By maximizing the likelihood function, it yields point estimates with desirable properties like consistency and efficiency, forming the basis for many statistical techniques used in data analysis and modeling.

Concept of maximum likelihood

  • Maximum likelihood estimation forms a cornerstone of frequentist statistical inference used to estimate parameters of statistical models
  • Connects to Bayesian statistics through its role in parameter estimation and model selection, providing a foundation for comparing frequentist and Bayesian approaches
  • Serves as a crucial tool in developing and evaluating statistical models, including those used in Bayesian analysis

Definition and purpose

  • Statistical method for estimating parameters of a probability distribution by maximizing the likelihood function
  • Aims to find parameter values that make the observed data most probable
  • Provides a principled approach to parameter estimation in various statistical models
  • Yields point estimates that possess desirable statistical properties (consistency, efficiency)

Historical background

  • Developed by R.A. Fisher in the 1920s as part of his work on statistical inference
  • Evolved from earlier methods of moment matching and least squares estimation
  • Gained widespread adoption in the mid-20th century with advances in computational power
  • Influenced the development of other statistical techniques (likelihood ratio tests, information criteria)

Relationship to Bayesian inference

  • Serves as a special case of maximum a posteriori (MAP) estimation when using uniform prior distributions
  • Provides the basis for constructing likelihood functions used in Bayesian analysis
  • Differs from Bayesian methods in its treatment of parameters as fixed unknown quantities rather than random variables
  • Often used as a starting point for more complex Bayesian models and analyses

Likelihood function

Mathematical formulation

  • Expresses the probability of observing the data given specific parameter values
  • Defined as $L(\theta|x) = f(x|\theta)$, where $\theta$ represents the parameters and $x$ the observed data
  • For independent and identically distributed (i.i.d.) observations, the likelihood factorizes as $L(\theta|x_1, \ldots, x_n) = \prod_{i=1}^n f(x_i|\theta)$
  • Incorporates the probability density function (continuous data) or probability mass function (discrete data)
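
As a minimal sketch of how the factorized likelihood can be evaluated in practice, the snippet below assumes a small hypothetical i.i.d. sample and a normal model; the specific values and the choice of `scipy.stats.norm` are illustrative assumptions, not part of the original text.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical i.i.d. sample, assumed to come from a normal distribution
data = np.array([2.1, 1.9, 2.5, 2.3, 1.8])

def likelihood(mu, sigma, x):
    """L(theta | x) = product of densities f(x_i | theta) for i.i.d. data."""
    return np.prod(norm.pdf(x, loc=mu, scale=sigma))

# Compare two candidate parameter values for the same observed data
print(likelihood(2.0, 0.3, data))   # higher likelihood near the sample mean
print(likelihood(5.0, 0.3, data))   # much lower likelihood far from the data
```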

Properties of likelihood

  • Not a probability distribution over parameters, but a function of parameters given fixed data
  • Invariant under one-to-one transformations of parameters
  • Allows for comparison of different parameter values within the same model
  • Central to the likelihood principle, which states that all relevant information about the parameters is contained in the likelihood function

Log-likelihood function

  • Logarithm of the likelihood function, often denoted as $\ell(\theta|x) = \log L(\theta|x)$
  • Simplifies calculations by converting products to sums
  • Preserves the location of maxima and minima due to monotonicity of the logarithm
  • Improves numerical stability in computations, especially for large datasets
  • Often used in optimization algorithms for finding maximum likelihood estimates
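
The numerical advantage of working on the log scale can be seen in a short sketch: for a large simulated normal sample, the raw product of densities underflows to zero while the sum of log-densities remains finite. The simulated data and parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=10_000)   # large simulated dataset
mu, sigma = 2.0, 1.0

# Raw likelihood underflows to 0.0 for large n ...
raw_likelihood = np.prod(norm.pdf(x, mu, sigma))

# ... while the log-likelihood (a sum of log-densities) stays finite
log_likelihood = np.sum(norm.logpdf(x, mu, sigma))

print(raw_likelihood)   # 0.0 due to floating-point underflow
print(log_likelihood)   # a finite negative number
```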

Maximum likelihood estimators

Definition and characteristics

  • Parameter values that maximize the likelihood (or log-likelihood) function
  • Formally defined as $\hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta|x)$
  • Invariant under one-to-one transformations of parameters
  • Possess desirable asymptotic properties (consistency, efficiency, normality)
  • May not always exist or may not be unique in some cases

Consistency and efficiency

  • Consistency ensures convergence to true parameter values as sample size increases
  • Efficiency refers to achieving the lowest possible variance among unbiased estimators
  • MLEs are asymptotically efficient under certain regularity conditions
  • Attain the Cramér-Rao lower bound asymptotically, making them asymptotically minimum-variance estimators (they may still be biased in finite samples)

Asymptotic properties

  • Asymptotically normally distributed with mean equal to the true parameter value
  • Variance approaches the inverse of the Fisher information as sample size increases
  • Convergence rate typically of order $O(1/\sqrt{n})$, where $n$ is the sample size
  • Asymptotic properties form the basis for constructing confidence intervals and hypothesis tests
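
As a hedged illustration of these asymptotic results, the sketch below uses an exponential model, whose Fisher information per observation is $1/\lambda^2$, to form an approximate standard error and a Wald-style 95% interval for the MLE; the simulated data and true rate are assumptions made for the example.

```python
import numpy as np

# For Exp(lambda), the MLE is 1 / sample mean and the Fisher information per
# observation is 1 / lambda^2, so Var(lambda_hat) is approximately lambda^2 / n.
rng = np.random.default_rng(1)
true_rate = 2.0
x = rng.exponential(scale=1 / true_rate, size=500)

n = len(x)
rate_hat = 1.0 / x.mean()                 # MLE of the rate
se_hat = rate_hat / np.sqrt(n)            # sqrt of inverse (estimated) Fisher information
ci = (rate_hat - 1.96 * se_hat, rate_hat + 1.96 * se_hat)

print(rate_hat, se_hat, ci)               # approximate 95% confidence interval
```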

Methods for finding MLEs

Analytical solutions

  • Closed-form solutions exist for some simple distributions (normal, exponential, Poisson)
  • Involve setting the derivative of the log-likelihood to zero and solving for parameters
  • Example for the normal distribution: the sample mean and the (1/n) sample variance are the MLEs of the population mean and variance (a short sketch follows this list)
  • Analytical solutions provide exact results but are not available for many complex models
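
A brief sketch of the normal-distribution example mentioned above, using simulated data; the true mean and variance are assumed values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=1_000)

# Setting the derivatives of the normal log-likelihood to zero gives
# closed-form MLEs: the sample mean and the 1/n sample variance.
mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat) ** 2)   # note: divides by n, not n - 1, so it is biased

print(mu_hat, sigma2_hat)
```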

Numerical optimization techniques

  • Used when analytical solutions are not available or difficult to derive
  • Include methods like Newton-Raphson, gradient descent, and conjugate gradient
  • Iteratively update parameter estimates to maximize the likelihood function
  • Require careful selection of starting values and convergence criteria
  • May encounter issues with local maxima or flat likelihood surfaces
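
A minimal sketch of numerical maximization, assuming a gamma model (whose shape parameter has no closed-form MLE) and simulated data; the log-parameterization and the BFGS method are illustrative choices rather than prescriptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

rng = np.random.default_rng(3)
x = rng.gamma(shape=2.0, scale=1.5, size=500)

def neg_log_likelihood(params):
    """Negative log-likelihood of a gamma model; parameters live on the log
    scale so shape and scale stay positive during unconstrained optimization."""
    shape, scale = np.exp(params)
    return -np.sum(gamma.logpdf(x, a=shape, scale=scale))

# Starting values and convergence criteria both matter for iterative methods
result = minimize(neg_log_likelihood, x0=np.log([1.0, 1.0]), method="BFGS")
shape_hat, scale_hat = np.exp(result.x)
print(shape_hat, scale_hat, result.success)
```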

EM algorithm

  • Iterative algorithm used for finding MLEs in models with latent variables or missing data
  • Consists of two steps: E-step (compute expected log-likelihood) and M-step (maximize expected log-likelihood)
  • Iteratively refines parameter estimates until convergence
  • Widely used in mixture models, hidden Markov models, and factor analysis
  • Guarantees increase in likelihood at each iteration but may converge to local maxima
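
The following is a compact EM sketch for a two-component Gaussian mixture with unit variances, fitted to simulated data; the initial guesses, fixed variances, and iteration count are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Two-component Gaussian mixture with known unit variances; unknowns are the
# component means and the mixing weight.
rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

pi, mu1, mu2 = 0.5, -1.0, 1.0            # initial guesses
for _ in range(100):
    # E-step: posterior responsibility of component 1 for each observation
    d1 = pi * norm.pdf(x, mu1, 1.0)
    d2 = (1 - pi) * norm.pdf(x, mu2, 1.0)
    r1 = d1 / (d1 + d2)

    # M-step: maximize the expected complete-data log-likelihood
    pi = r1.mean()
    mu1 = np.sum(r1 * x) / np.sum(r1)
    mu2 = np.sum((1 - r1) * x) / np.sum(1 - r1)

print(pi, mu1, mu2)   # should end up near 0.3, -2, 3 for this simulated data
```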

Applications in statistical models

Linear regression

  • MLEs for coefficients in linear regression equivalent to ordinary least squares estimates
  • Assumes normally distributed errors with constant variance
  • Closed-form solution exists: $\hat{\beta} = (X^TX)^{-1}X^Ty$ (see the sketch after this list)
  • MLE approach provides a probabilistic interpretation of linear regression parameters
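
A brief sketch of the closed-form solution on simulated data with normal errors; the design matrix and true coefficients are assumed for illustration, and `lstsq` is used as a numerically stable way to compute $(X^TX)^{-1}X^Ty$.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one predictor
beta_true = np.array([1.0, 2.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)        # normally distributed errors

# Under normal errors the MLE equals the OLS solution (X'X)^{-1} X'y
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2_hat = np.mean((y - X @ beta_hat) ** 2)            # MLE of the error variance

print(beta_hat, sigma2_hat)
```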

Logistic regression

  • Used for binary outcome data, models probability of success as a function of predictors
  • MLEs found through iterative numerical optimization (Newton-Raphson, iteratively reweighted least squares)
  • No closed-form solution exists due to non-linearity of logistic function
  • Likelihood function based on Bernoulli distribution for each observation
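
A minimal sketch of fitting logistic regression by numerically maximizing the Bernoulli log-likelihood; statistical packages typically use Newton-Raphson or iteratively reweighted least squares, but a generic optimizer illustrates the same idea. The simulated data and true coefficients are assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit          # numerically stable logistic function

rng = np.random.default_rng(6)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.5])
y = rng.binomial(1, expit(X @ beta_true))          # Bernoulli outcomes

def neg_log_likelihood(beta):
    """Bernoulli log-likelihood: sum of y*log(p) + (1-y)*log(1-p)."""
    p = expit(X @ beta)
    eps = 1e-12                                    # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

result = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
print(result.x)    # MLEs for the intercept and slope
```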

Poisson regression

  • Models count data assuming Poisson-distributed response variable
  • MLEs typically found through iterative numerical methods
  • Log-link function used to ensure non-negative predicted values
  • Applications include modeling rare events, accident frequencies, or species counts in ecology
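
A short sketch of Poisson regression with a log link, again by minimizing the negative log-likelihood with a generic optimizer; the simulated counts and coefficient values are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))        # counts generated with a log link

def neg_log_likelihood(beta):
    """Poisson log-likelihood with log link: sum of y*eta - exp(eta), up to a constant."""
    eta = X @ beta
    return -np.sum(y * eta - np.exp(eta))     # the log(y!) term is constant and dropped

result = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
print(result.x)
```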

Likelihood ratio tests

Test statistic formulation

  • Compares likelihood of data under null hypothesis to likelihood under alternative hypothesis
  • Test statistic defined as $\Lambda = -2 \log \frac{L(\theta_0)}{L(\hat{\theta})}$, where $\theta_0$ is the parameter value under the null hypothesis
  • Measures the evidence against the null hypothesis relative to the alternative
  • Large values of Λ\Lambda indicate strong evidence against the null hypothesis

Asymptotic distribution

  • Under certain regularity conditions, $\Lambda$ follows a chi-square distribution asymptotically (illustrated in the sketch after this list)
  • Degrees of freedom equal to the difference in dimensionality between alternative and null models
  • Asymptotic result holds as sample size approaches infinity
  • Provides basis for calculating p-values and critical values for hypothesis testing
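
A worked sketch of a likelihood ratio test for an exponential rate, comparing a fixed null value to the unrestricted MLE and using the asymptotic chi-square distribution with one degree of freedom; the simulated data and the null value are assumptions made for the example.

```python
import numpy as np
from scipy.stats import chi2, expon

rng = np.random.default_rng(8)
x = rng.exponential(scale=1 / 1.3, size=200)    # simulated data, true rate 1.3

# H0: rate = 1 versus the unrestricted alternative (rate = MLE)
rate_null = 1.0
rate_hat = 1.0 / x.mean()

def loglik(rate):
    return np.sum(expon.logpdf(x, scale=1 / rate))

lrt_stat = -2 * (loglik(rate_null) - loglik(rate_hat))   # Lambda in the text

# One parameter is restricted under H0, so 1 degree of freedom asymptotically
p_value = chi2.sf(lrt_stat, df=1)
print(lrt_stat, p_value)
```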

Power of likelihood ratio tests

  • Probability of correctly rejecting the null hypothesis when it is false
  • Increases with sample size and effect size
  • Often more powerful than other test statistics (Wald test, score test) in finite samples
  • Can be calculated analytically for simple hypotheses or through simulation for complex models

Limitations and alternatives

Small sample issues

  • MLEs may be biased or inefficient in small samples
  • Asymptotic properties may not hold, leading to unreliable inference
  • Alternatives include bias-corrected estimators or exact methods for specific distributions
  • Bayesian methods with informative priors can provide more stable estimates in small samples

Regularization techniques

  • Address overfitting and improve generalization in high-dimensional settings
  • Include methods like ridge regression, lasso, and elastic net
  • Modify likelihood function by adding penalty terms on parameter magnitudes
  • Balance between model fit and complexity, often leading to sparse solutions

Bayesian vs maximum likelihood

  • Bayesian methods incorporate prior information, while MLE uses only the likelihood
  • Bayesian inference provides full posterior distributions rather than point estimates
  • MLE can be viewed as a special case of Bayesian inference with uniform priors
  • Bayesian methods handle uncertainty more naturally but require specification of priors

Computational aspects

Software implementations

  • Statistical software packages (R, SAS, STATA) provide built-in functions for MLE
  • Machine learning libraries (scikit-learn, TensorFlow) implement MLE for various models
  • Specialized optimization libraries (scipy.optimize, nlopt) offer flexible tools for custom likelihood functions
  • Bayesian software (Stan, PyMC) often include MLE as a special case or starting point

Computational complexity

  • Varies widely depending on model complexity and dataset size
  • Simple models with closed-form solutions have low computational cost
  • Iterative methods for complex models may require many function evaluations
  • High-dimensional problems can become computationally intensive or intractable

Parallel processing for MLEs

  • Embarrassingly parallel nature of likelihood calculations for independent observations
  • Enables efficient use of multi-core processors and distributed computing systems
  • Particularly useful for bootstrap resampling and cross-validation procedures
  • Implemented in modern statistical software to handle large-scale data analysis

Advanced topics

Profile likelihood

  • Technique for dealing with nuisance parameters in likelihood-based inference
  • Involves maximizing likelihood over nuisance parameters for each value of parameter of interest
  • Useful for constructing confidence intervals and hypothesis tests in complex models
  • Provides more accurate inference than methods based on asymptotic normality in some cases
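
A minimal sketch of a profile-likelihood interval for a normal mean, with the variance treated as a nuisance parameter and maximized out at each candidate mean; the grid, sample size, and simulated data are illustrative choices.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(9)
x = rng.normal(loc=10.0, scale=3.0, size=50)
n = len(x)

def profile_loglik(mu):
    """Profile out the nuisance variance: for each mu, plug in its conditional MLE."""
    sigma2_hat = np.mean((x - mu) ** 2)
    return -0.5 * n * np.log(sigma2_hat) - 0.5 * n   # additive constants dropped

mu_grid = np.linspace(x.mean() - 3, x.mean() + 3, 1_000)
pl = np.array([profile_loglik(m) for m in mu_grid])

# Profile-likelihood 95% interval: values of mu not rejected by a likelihood ratio test
cutoff = pl.max() - 0.5 * chi2.ppf(0.95, df=1)
ci = mu_grid[pl >= cutoff]
print(ci.min(), ci.max())
```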

Penalized maximum likelihood

  • Incorporates penalty terms into likelihood function to address overfitting or enforce constraints
  • Examples include L1 (lasso) and L2 (ridge) penalties in regression models
  • Balances model fit with complexity or prior beliefs about parameter values
  • Often results in sparse solutions, facilitating variable selection in high-dimensional settings
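
A hedged sketch of penalized maximum likelihood: a Bernoulli negative log-likelihood with an added L2 (ridge) penalty, minimized with a generic optimizer. The penalty weight, simulated data, and problem dimensions are assumptions for illustration; an L1 penalty would require a solver that handles non-smooth objectives.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(10)
n, p = 100, 20                                    # few observations, many predictors
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]                  # only a few truly nonzero effects
y = rng.binomial(1, expit(X @ beta_true))

def penalized_nll(beta, lam=1.0):
    """Bernoulli negative log-likelihood plus an L2 (ridge) penalty."""
    p_hat = expit(X @ beta)
    eps = 1e-12
    nll = -np.sum(y * np.log(p_hat + eps) + (1 - y) * np.log(1 - p_hat + eps))
    return nll + lam * np.sum(beta ** 2)          # penalty shrinks estimates toward 0

result = minimize(penalized_nll, x0=np.zeros(p), method="BFGS")
print(np.round(result.x, 2))                      # shrunken coefficient estimates
```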

Empirical likelihood methods

  • Nonparametric approach to likelihood-based inference
  • Constructs a likelihood function without assuming a specific parametric form
  • Combines flexibility of nonparametric methods with efficiency of likelihood-based inference
  • Applications include constructing confidence regions and hypothesis testing in semiparametric models

Key Terms to Review (17)

AIC: AIC, or Akaike Information Criterion, is a measure used to compare the relative quality of statistical models for a given dataset. It helps in identifying the model that best explains the data while penalizing for complexity to avoid overfitting. A lower AIC value indicates a better-fitting model, making it a valuable tool in model selection, particularly in maximum likelihood estimation.
Bayesian inference: Bayesian inference is a statistical method that utilizes Bayes' theorem to update the probability for a hypothesis as more evidence or information becomes available. This approach allows for the incorporation of prior knowledge, making it particularly useful in contexts where data may be limited or uncertain, and it connects to various statistical concepts and techniques that help improve decision-making under uncertainty.
BIC: The Bayesian Information Criterion (BIC) is a statistical tool used for model selection among a finite set of models. It provides a way to assess the trade-off between the goodness of fit of the model and its complexity, allowing for a balance between underfitting and overfitting. BIC is particularly useful when comparing models with different numbers of parameters, as it penalizes more complex models to prevent them from being favored solely due to their ability to fit the data closely.
Binomial Distribution: The binomial distribution is a probability distribution that models the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success. This distribution is crucial for understanding the behavior of random variables that have two possible outcomes, like flipping a coin or passing a test, and plays a key role in probability distributions and maximum likelihood estimation.
Confidence interval: A confidence interval is a range of values used to estimate the true parameter of a population, with a specified level of confidence. It provides an interval estimate, indicating how much uncertainty exists around the sample estimate. The width of the confidence interval can give insight into the precision of the estimate and is influenced by sample size and variability in the data.
Convex Optimization: Convex optimization is a subfield of optimization that deals with minimizing convex functions over convex sets. This approach is crucial because it guarantees that any local minimum is also a global minimum, which simplifies the problem significantly. In many statistical methods, including maximum likelihood estimation, convex optimization provides efficient algorithms to find parameter estimates that best fit the data.
Expectation-Maximization Algorithm: The expectation-maximization (EM) algorithm is a statistical method used for finding maximum likelihood estimates of parameters in models with latent variables. It works iteratively, alternating between estimating the expected value of the log-likelihood function (the E-step) and maximizing this expected value to update the parameter estimates (the M-step). This process continues until convergence is reached, making it especially useful for handling incomplete data or data with missing values.
Gradient ascent: Gradient ascent is an optimization algorithm used to maximize a function by iteratively moving in the direction of the steepest increase of that function. This technique is especially useful in maximum likelihood estimation, where the goal is to find the parameter values that maximize the likelihood function. By calculating the gradient, or the slope of the function, and taking steps proportional to that slope, gradient ascent efficiently zeroes in on the optimal parameters.
Identifiability: Identifiability refers to the property of a statistical model that allows unique estimation of model parameters based on the observed data. If a model is identifiable, it means that different parameter values will lead to different distributions of the data, ensuring that the true parameter values can be determined without ambiguity. This concept is crucial when performing maximum likelihood estimation because it directly affects the reliability of the estimated parameters.
Independence Assumption: The independence assumption is the notion that the occurrences of events or variables are not influenced by each other within a given model. This concept is crucial in statistical modeling, as it simplifies the analysis and interpretation of data by allowing researchers to treat different levels of data or parameters as separate entities without worrying about interdependencies.
Likelihood Function: The likelihood function measures the plausibility of a statistical model given observed data. It expresses how likely different parameter values would produce the observed outcomes, playing a crucial role in both Bayesian and frequentist statistics, particularly in the context of random variables, probabilities, and model inference.
Maximum Likelihood Estimate: The maximum likelihood estimate (MLE) is a statistical method used to determine the parameters of a statistical model by maximizing the likelihood function. This technique helps identify the parameter values that make the observed data most probable under the specified model, thereby providing a point estimate of the parameters based on the available data.
Maximum Likelihood Estimation: Maximum likelihood estimation (MLE) is a statistical method for estimating the parameters of a statistical model by maximizing the likelihood function. This approach provides estimates that make the observed data most probable under the assumed model, connecting closely with concepts like prior distributions in Bayesian statistics and the selection of optimal models based on fit and complexity.
Newton-Raphson Method: The Newton-Raphson method is an iterative numerical technique used to find approximate solutions to equations, particularly in optimization problems like maximum likelihood estimation. This method employs the use of derivatives to refine guesses for the root of a function, rapidly converging towards a solution. It is especially useful when dealing with complex functions where analytical solutions may be difficult or impossible to obtain.
Normal Distribution: Normal distribution is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. This bell-shaped curve is fundamental in statistics because it describes how variables are distributed and plays a crucial role in many statistical methods and theories.
Point Estimation: Point estimation refers to the process of providing a single value, or point estimate, as the best guess for an unknown parameter in a statistical model. This method is essential for making inferences about populations based on sample data, and it connects to various concepts such as the likelihood principle, loss functions, and optimal decision rules, which further guide how point estimates can be derived and evaluated.
Posterior Distribution: The posterior distribution is the probability distribution that represents the updated beliefs about a parameter after observing data, combining prior knowledge and the likelihood of the observed data. It plays a crucial role in Bayesian statistics by allowing for inference about parameters and models after incorporating evidence from new observations.