Foundations of Bayesian estimation
Bayesian estimation provides a framework for combining what you already know (or assume) about an unknown quantity with observed data to produce a refined estimate. Unlike classical methods that treat parameters as fixed unknowns, the Bayesian approach treats them as random variables with probability distributions. This distinction is what makes the framework so useful in signal processing: you can systematically fold in prior knowledge and quantify your remaining uncertainty through a posterior distribution.
Bayes' theorem
Bayes' theorem is the engine behind every Bayesian estimator. It tells you how to update a prior belief once you observe data:

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$$

where:
- $\theta$ is the unknown parameter (or vector of parameters)
- $x$ is the observed data
- $p(\theta)$ is the prior — your belief about $\theta$ before seeing data
- $p(x \mid \theta)$ is the likelihood — the probability of the data given $\theta$
- $p(x)$ is the evidence (or marginal likelihood) — a normalizing constant ensuring the posterior integrates to 1
- $p(\theta \mid x)$ is the posterior — your updated belief about $\theta$ after observing $x$
The evidence term is often the hardest part to compute, since it requires integrating over the entire parameter space: $p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$. Much of the computational machinery in Bayesian estimation exists to deal with this integral.
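To make the pieces concrete, here is a minimal numerical sketch (the standard-normal prior, single observation, and noise variance are all hypothetical choices) that evaluates Bayes' theorem on a grid, so the evidence integral becomes a simple sum:

```python
import numpy as np

# Discretize the parameter space so integrals become sums.
theta_grid = np.linspace(-5.0, 5.0, 2001)
d_theta = theta_grid[1] - theta_grid[0]

prior = np.exp(-0.5 * theta_grid**2) / np.sqrt(2 * np.pi)  # N(0, 1) prior on theta

x = 1.2       # one noisy observation (made-up value)
sigma2 = 0.5  # known noise variance (made-up value)
likelihood = np.exp(-0.5 * (x - theta_grid)**2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

evidence = np.sum(likelihood * prior) * d_theta   # p(x) ~ sum of p(x|theta) p(theta) d_theta
posterior = likelihood * prior / evidence         # Bayes' theorem, normalized

# Posterior mean; the conjugate closed form for this model gives 0.8.
posterior_mean = np.sum(theta_grid * posterior) * d_theta
```

Dividing by the evidence is exactly what makes the posterior integrate to 1 over the grid.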
Prior and posterior distributions
The prior distribution encodes what you know (or assume) about $\theta$ before any data arrives. Priors fall on a spectrum:
- Uninformative (diffuse) priors express minimal assumptions — for example, a uniform distribution over a wide range. These let the data dominate the posterior.
- Informative priors encode specific domain knowledge — for instance, if you know a channel gain is typically near 1.0, you might use a Gaussian centered there.
The posterior distribution is the result of combining the prior with the likelihood via Bayes' theorem. It captures everything you know about $\theta$ after observing the data. The shape of the posterior depends heavily on the prior when data is scarce, but as more data accumulates, the likelihood tends to dominate and the influence of the prior diminishes.
Likelihood functions
The likelihood function specifies the statistical model connecting your observations to the unknown parameter. It answers: given a particular value of $\theta$, how probable is the data you actually observed?
For example, if you observe samples corrupted by additive white Gaussian noise with known variance $\sigma^2$, the likelihood for a single observation $x$ given parameter $\theta$ is:

$$p(x \mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \theta)^2}{2\sigma^2}\right)$$

For independent observations, the joint likelihood is the product of the individual likelihoods. The likelihood function doesn't need to integrate to 1 over $\theta$ — it's a function of $\theta$, not a probability distribution over it.
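In practice one works with the log-likelihood, which turns the product over independent observations into a sum. A small sketch (with simulated data and made-up parameter values) showing that it peaks near the sample mean:

```python
import numpy as np

def log_likelihood(theta, x, sigma2):
    """Joint log-likelihood of i.i.d. Gaussian observations x with mean theta."""
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - theta) ** 2) / (2 * sigma2)

rng = np.random.default_rng(0)
x = 2.0 + 0.5 * rng.standard_normal(100)  # simulated data: true theta = 2.0, sigma = 0.5

# The likelihood is a function of theta; for Gaussian noise it peaks at the sample mean.
thetas = np.linspace(0.0, 4.0, 401)
ll = np.array([log_likelihood(t, x, sigma2=0.25) for t in thetas])
theta_ml = thetas[np.argmax(ll)]
```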
Conjugate priors
A prior is conjugate to a given likelihood if the resulting posterior belongs to the same distributional family as the prior. This is valuable because it gives you a closed-form posterior, avoiding expensive numerical integration.
Common conjugate pairs:
| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Binomial | Beta | Beta |
| Poisson | Gamma | Gamma |
| Gaussian (known variance) | Gaussian | Gaussian |
| Multinomial | Dirichlet | Dirichlet |
For the Gaussian-Gaussian case: if your prior on $\theta$ is $\mathcal{N}(\mu_0, \sigma_0^2)$ and your likelihood is Gaussian with known variance $\sigma^2$, the posterior is also Gaussian. The posterior mean turns out to be a precision-weighted average of the prior mean and the sample mean — a clean, interpretable result.
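The Gaussian-Gaussian update is short enough to write out directly. A sketch of the closed-form posterior (the numeric inputs are arbitrary):

```python
import numpy as np

def gaussian_posterior(mu0, var0, x, var_noise):
    """Conjugate update: N(mu0, var0) prior, Gaussian likelihood with known
    noise variance var_noise, i.i.d. observations x. Returns (mean, variance)."""
    n = len(x)
    prec_post = 1.0 / var0 + n / var_noise  # precisions (inverse variances) add
    var_post = 1.0 / prec_post
    # Posterior mean: precision-weighted average of prior mean and sample mean.
    mu_post = var_post * (mu0 / var0 + n * np.mean(x) / var_noise)
    return mu_post, var_post

mu_post, var_post = gaussian_posterior(
    mu0=0.0, var0=1.0, x=np.array([1.0, 1.2, 0.8]), var_noise=0.5
)
```

As the prior variance grows (a diffuse prior), the posterior mean approaches the sample mean, matching the "data dominates" behavior described above.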
Bayesian estimators
Once you have the posterior distribution, you need to extract a point estimate from it. Different estimators optimize different criteria, and the right choice depends on what cost you're trying to minimize.
Minimum mean square error (MMSE) estimator
The MMSE estimator minimizes the expected squared error $E[(\theta - \hat{\theta})^2 \mid x]$. The solution is the posterior mean:

$$\hat{\theta}_{\text{MMSE}} = E[\theta \mid x] = \int \theta\, p(\theta \mid x)\, d\theta$$
This is the optimal estimator under squared-error loss. In the Gaussian-Gaussian conjugate case, it reduces to a weighted combination of the prior mean and the maximum likelihood estimate, with weights determined by the relative precisions (inverse variances).
Maximum a posteriori (MAP) estimator
The MAP estimator picks the value of $\theta$ at which the posterior is maximized:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\, p(\theta \mid x)$$

Since $p(x)$ doesn't depend on $\theta$, this is equivalent to:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\, p(x \mid \theta)\, p(\theta)$$
Taking the log converts the product into a sum, which is often easier to optimize. Note that for a uniform prior, the MAP estimator reduces to the maximum likelihood estimator (MLE). For a symmetric, unimodal posterior (like a Gaussian), MAP and MMSE coincide. They diverge for skewed or multimodal posteriors.
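A small grid-search sketch (with made-up data and priors) illustrates both facts: under a flat prior the MAP estimate matches the MLE, while an informative zero-mean prior shrinks it toward zero:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.0
x = 3.0 + rng.standard_normal(20)  # simulated observations, true theta = 3.0

theta_grid = np.linspace(-10.0, 10.0, 4001)

# Log-likelihood up to an additive constant (p(x) is ignored, as in the text).
log_lik = np.array([-np.sum((x - t) ** 2) / (2 * sigma2) for t in theta_grid])

# Flat (uniform) prior: log-prior is constant, so MAP coincides with the MLE.
theta_map_flat = theta_grid[np.argmax(log_lik)]

# Informative N(0, 0.5) prior pulls the MAP estimate toward 0.
log_prior = -theta_grid**2 / (2 * 0.5)
theta_map = theta_grid[np.argmax(log_lik + log_prior)]
```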
Linear MMSE estimator
The linear MMSE (LMMSE) estimator restricts the estimate to be an affine function of the data: $\hat{\theta} = \mathbf{a}^T \mathbf{x} + b$. It minimizes mean squared error within this restricted class.
The LMMSE estimate is:

$$\hat{\theta} = E[\theta] + \mathbf{C}_{\theta x} \mathbf{C}_{xx}^{-1} \left(\mathbf{x} - E[\mathbf{x}]\right)$$

where $\mathbf{C}_{\theta x}$ is the cross-covariance between $\theta$ and $\mathbf{x}$, and $\mathbf{C}_{xx}$ is the auto-covariance of $\mathbf{x}$.
The LMMSE estimator only requires knowledge of first- and second-order statistics (means and covariances), not the full posterior. This makes it computationally attractive and robust when the full distribution is unknown. For jointly Gaussian and , the LMMSE estimator is identical to the MMSE estimator.
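A sketch of the LMMSE estimator for a hypothetical setup: a scalar parameter observed through two independently noisy channels, with all required means and covariances estimated from simulation rather than assumed known:

```python
import numpy as np

rng = np.random.default_rng(2)

n_trials = 200_000
theta = rng.normal(1.0, 2.0, n_trials)               # scalar parameter, prior N(1, 4)
x = np.stack([theta + rng.normal(0, 1.0, n_trials),  # two noisy observations of theta
              theta + rng.normal(0, 1.0, n_trials)])

# First- and second-order statistics are all the LMMSE estimator needs.
mean_theta = theta.mean()
mean_x = x.mean(axis=1)
C_tx = np.array([np.cov(theta, x[i])[0, 1] for i in range(2)])  # cross-covariance
C_xx = np.cov(x)                                                # auto-covariance of x

# theta_hat = E[theta] + C_tx C_xx^{-1} (x - E[x])
w = np.linalg.solve(C_xx, C_tx)
theta_hat = mean_theta + w @ (x - mean_x[:, None])
mse = np.mean((theta_hat - theta) ** 2)
```

Since $\theta$ and $\mathbf{x}$ are jointly Gaussian here, the achieved MSE matches the full MMSE posterior variance (4/9 for these numbers).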
Recursive Bayesian estimation
In dynamic systems, the unknown state evolves over time and new measurements arrive sequentially. Recursive Bayesian estimation handles this by cycling through two steps at each time index $k$:
- Prediction: Use the system model to propagate the posterior from time $k-1$ forward to a prior at time $k$
- Update: Incorporate the new measurement at time $k$ to compute the posterior at time $k$
This predict-update cycle avoids reprocessing all past data at every step.
Kalman filter
The Kalman filter is the exact recursive Bayesian solution for linear Gaussian systems. The state-space model is:
- State equation: $\mathbf{x}_k = \mathbf{A} \mathbf{x}_{k-1} + \mathbf{w}_k$, where $\mathbf{w}_k \sim \mathcal{N}(\mathbf{0}, \mathbf{Q})$
- Measurement equation: $\mathbf{z}_k = \mathbf{H} \mathbf{x}_k + \mathbf{v}_k$, where $\mathbf{v}_k \sim \mathcal{N}(\mathbf{0}, \mathbf{R})$
The filter alternates between:
- Predict:
  - State prediction: $\hat{\mathbf{x}}_{k|k-1} = \mathbf{A} \hat{\mathbf{x}}_{k-1|k-1}$
  - Covariance prediction: $\mathbf{P}_{k|k-1} = \mathbf{A} \mathbf{P}_{k-1|k-1} \mathbf{A}^T + \mathbf{Q}$
- Update:
  - Kalman gain: $\mathbf{K}_k = \mathbf{P}_{k|k-1} \mathbf{H}^T (\mathbf{H} \mathbf{P}_{k|k-1} \mathbf{H}^T + \mathbf{R})^{-1}$
  - State update: $\hat{\mathbf{x}}_{k|k} = \hat{\mathbf{x}}_{k|k-1} + \mathbf{K}_k (\mathbf{z}_k - \mathbf{H} \hat{\mathbf{x}}_{k|k-1})$
  - Covariance update: $\mathbf{P}_{k|k} = (\mathbf{I} - \mathbf{K}_k \mathbf{H}) \mathbf{P}_{k|k-1}$
The Kalman gain controls how much you trust the new measurement versus the prediction. When measurement noise is low (small $\mathbf{R}$), the gain is large and the filter leans heavily on the measurement. When process noise is low (small $\mathbf{Q}$), the filter trusts its prediction more.
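The predict-update cycle collapses to a few lines in the scalar case. A sketch for a random-walk state (so $A = H = 1$), with made-up noise levels:

```python
import numpy as np

def kalman_1d(z, Q, R, x0=0.0, P0=1.0):
    """Scalar Kalman filter for x_k = x_{k-1} + w_k, z_k = x_k + v_k (A = H = 1)."""
    x_hat, P = x0, P0
    estimates = []
    for zk in z:
        # Predict
        x_pred = x_hat           # A x_hat with A = 1
        P_pred = P + Q           # A P A^T + Q
        # Update
        K = P_pred / (P_pred + R)                # Kalman gain
        x_hat = x_pred + K * (zk - x_pred)       # correct with the innovation
        P = (1.0 - K) * P_pred                   # covariance update
        estimates.append(x_hat)
    return np.array(estimates), P

rng = np.random.default_rng(3)
truth = np.cumsum(rng.normal(0, 0.1, 300))  # random-walk state, Q = 0.01
z = truth + rng.normal(0, 1.0, 300)         # noisy measurements, R = 1
est, P_final = kalman_1d(z, Q=0.01, R=1.0)
```

With $R \gg Q$ the gain settles at a small value, so the filtered track is much smoother (and closer to the truth) than the raw measurements.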

Extended Kalman filter
The extended Kalman filter (EKF) handles nonlinear state and measurement models by linearizing them around the current estimate. If the system is:

- State equation: $\mathbf{x}_k = f(\mathbf{x}_{k-1}) + \mathbf{w}_k$
- Measurement equation: $\mathbf{z}_k = h(\mathbf{x}_k) + \mathbf{v}_k$

the EKF computes the Jacobian matrices $\mathbf{F}_k = \partial f / \partial \mathbf{x}$ and $\mathbf{H}_k = \partial h / \partial \mathbf{x}$, evaluated at the current estimate, then applies the standard Kalman filter equations using these Jacobians. The EKF is a first-order approximation, so it can perform poorly when the nonlinearities are severe or when the state uncertainty is large (making the linearization point unreliable).
Unscented Kalman filter
The unscented Kalman filter (UKF) avoids linearization entirely. Instead, it uses the unscented transform:
- Select a set of deterministic sigma points around the current mean, spread according to the covariance
- Propagate each sigma point through the actual nonlinear function $f$ or $h$
- Compute the predicted mean and covariance from the transformed sigma points using weighted averages
The UKF captures the posterior mean and covariance accurately to at least second order for any nonlinearity, compared to the EKF's first-order accuracy. It also avoids the need to compute Jacobians, which can be analytically difficult or numerically unstable.
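The steps above can be sketched for a scalar state. This toy unscented transform (the spreading parameter `kappa` and the quadratic test function are illustrative choices) pushes $x \sim \mathcal{N}(1, 0.5)$ through $f(x) = x^2$:

```python
import numpy as np

def unscented_transform_1d(m, P, f, kappa=2.0):
    """Unscented transform for a scalar Gaussian N(m, P) pushed through f."""
    n = 1
    spread = np.sqrt((n + kappa) * P)
    sigma_pts = np.array([m, m + spread, m - spread])  # deterministic sigma points
    w = np.array([kappa / (n + kappa),
                  0.5 / (n + kappa),
                  0.5 / (n + kappa)])                  # weights sum to 1
    y = f(sigma_pts)                                   # propagate through the nonlinearity
    mean_y = np.sum(w * y)
    var_y = np.sum(w * (y - mean_y) ** 2)
    return mean_y, var_y

# True mean of x^2 for x ~ N(1, 0.5) is m^2 + P = 1.5; a first-order (EKF-style)
# linearization would predict f(m) = 1.0 instead.
mean_y, var_y = unscented_transform_1d(1.0, 0.5, lambda x: x**2)
```

For this quadratic nonlinearity the transformed mean is exact, consistent with the UKF's second-order accuracy claim.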
Particle filters
Particle filters (sequential Monte Carlo methods) represent the posterior distribution using a set of weighted samples (particles) $\{\mathbf{x}_k^{(i)}, w_k^{(i)}\}_{i=1}^{N}$. They handle arbitrary nonlinearities and non-Gaussian distributions.
The algorithm at each time step:
- Propagate each particle through the state model (with added process noise) to generate predicted particles
- Weight each particle by the likelihood of the current measurement given that particle's state
- Normalize the weights so they sum to 1
- Resample particles according to their weights to prevent weight degeneracy (most weight concentrating on a few particles)
Particle filters converge to the true posterior as $N \to \infty$, but computational cost scales linearly with $N$. In practice, the required number of particles can grow exponentially with the state dimension, which limits particle filters to low- or moderate-dimensional problems unless specialized techniques (like Rao-Blackwellization) are used.
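The four steps above map directly to code. A bootstrap particle filter sketch for a made-up scalar model with a mildly nonlinear measurement (where a plain Kalman filter doesn't apply):

```python
import numpy as np

rng = np.random.default_rng(4)

# Model (hypothetical): x_k = x_{k-1} + w_k, z_k = x_k^2 / 20 + v_k
N = 2000
Q, R = 0.02, 0.5

truth = 5.0 + np.cumsum(rng.normal(0, np.sqrt(Q), 100))  # random-walk state
z = truth**2 / 20.0 + rng.normal(0, np.sqrt(R), 100)     # nonlinear, noisy measurements

particles = rng.normal(5.0, 1.0, N)  # samples from the initial prior
estimates = []
for zk in z:
    particles = particles + rng.normal(0, np.sqrt(Q), N)   # 1) propagate through state model
    w = np.exp(-0.5 * (zk - particles**2 / 20.0)**2 / R)   # 2) weight by measurement likelihood
    w /= w.sum()                                           # 3) normalize weights
    estimates.append(np.sum(w * particles))                # posterior-mean estimate
    particles = particles[rng.choice(N, size=N, p=w)]      # 4) resample to fight degeneracy
estimates = np.array(estimates)
```

Resampling after every step (rather than only when the effective sample size drops) keeps the sketch simple at the cost of some extra Monte Carlo variance.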
Applications of Bayesian estimation
Parameter estimation
Bayesian parameter estimation infers unknown model parameters from data while maintaining a full posterior distribution over the parameter space. This is more informative than a single point estimate because it quantifies uncertainty directly.
In signal processing, this includes estimating signal amplitudes, frequencies, or noise variances. In machine learning, Bayesian approaches are used for model fitting, hyperparameter tuning, and model comparison. The posterior distribution naturally penalizes overly complex models through the evidence term, providing a built-in form of regularization.
State estimation
State estimation targets the hidden state of a dynamic system from noisy observations. Typical applications include:
- Object tracking: estimating the position and velocity of a target from radar or camera measurements
- Robot localization: determining a robot's pose from sensor data (SLAM problems)
- Sensor fusion: combining measurements from multiple sensors (e.g., GPS, IMU, lidar) into a single coherent state estimate
The Kalman filter, EKF, UKF, and particle filters are all tools for this task, chosen based on the linearity and distribution assumptions of the specific problem.
Bayesian inference in signal processing
Beyond parameter and state estimation, Bayesian inference supports signal detection, classification, and denoising. For detection, you can compute the posterior probability that a signal is present versus absent and apply a decision rule. For classification, the posterior over class labels given observed features leads directly to optimal classifiers under various loss functions. Bayesian denoising uses the posterior mean (MMSE estimate) of the clean signal given the noisy observation, which naturally adapts to the signal and noise statistics.
Bayesian vs. classical estimation
Philosophical differences
The core distinction: Bayesian estimation treats unknown parameters as random variables with probability distributions, while classical (frequentist) estimation treats them as fixed but unknown constants.
- In the Bayesian view, probability represents a degree of belief, and it's meaningful to say "there's a 95% probability that $\theta$ lies in this interval."
- In the frequentist view, probability refers to long-run frequencies. A 95% confidence interval means that if you repeated the experiment many times, 95% of such intervals would contain the true $\theta$. The parameter itself is not random.
This philosophical difference has practical consequences: Bayesian methods require specifying a prior, while frequentist methods do not. Bayesian methods produce a full posterior distribution; frequentist methods produce point estimates and confidence intervals based on sampling distributions.
Advantages and disadvantages
Bayesian advantages: Principled incorporation of prior knowledge. Full uncertainty quantification via the posterior. Natural handling of small sample sizes and complex models. Coherent framework for sequential updating.
Bayesian disadvantages: Computational cost can be high (especially for high-dimensional posteriors). Results can be sensitive to the choice of prior, particularly with limited data. Specifying a prior introduces subjectivity.
Frequentist advantages: No prior specification needed. Often computationally simpler. Well-established theoretical properties (consistency, efficiency, unbiasedness).
Frequentist disadvantages: Cannot incorporate prior knowledge. Confidence intervals are often misinterpreted. Can produce poor estimates with small samples or complex models.

Performance comparison
When the sample size is large, Bayesian and frequentist estimates typically converge to similar values — the data overwhelms the prior. The differences become most pronounced with:
- Small samples: An informative, accurate prior gives Bayesian methods a clear edge
- Misspecified priors: A poor prior can hurt Bayesian performance, especially with limited data
- Complex models: Bayesian methods handle hierarchical and latent variable models more naturally
The choice between approaches should be driven by the problem: how much reliable prior information is available, how much data you have, and what computational resources you can afford.
Computational aspects
The central computational challenge in Bayesian estimation is evaluating the posterior, which typically requires integrating over high-dimensional parameter spaces. Analytical solutions exist only for conjugate models. Everything else requires approximation.
Numerical integration techniques
Standard quadrature methods (trapezoidal rule, Simpson's rule, Gaussian quadrature) work well in low dimensions but scale poorly. For a $d$-dimensional parameter space with $n$ grid points per dimension, the cost is $O(n^d)$. This "curse of dimensionality" makes quadrature impractical beyond roughly 4-5 dimensions.
Monte Carlo methods
Monte Carlo methods estimate integrals by drawing random samples and computing sample averages. Their convergence rate of $O(1/\sqrt{N})$ in the number of samples $N$ is independent of dimension, making them the go-to approach for high-dimensional problems.
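A minimal Monte Carlo sketch, using self-normalized importance sampling on a hypothetical conjugate Gaussian model: draw from the prior, weight by the likelihood, and average, which approximates posterior expectations without ever computing the evidence $p(x)$:

```python
import numpy as np

rng = np.random.default_rng(5)

x_obs, sigma2 = 1.2, 0.5                          # made-up observation and noise variance
samples = rng.normal(0.0, 1.0, 100_000)           # draws from the N(0, 1) prior
w = np.exp(-0.5 * (x_obs - samples)**2 / sigma2)  # unnormalized likelihood weights

# Self-normalized estimate of E[theta | x]; the evidence cancels in the ratio.
posterior_mean_mc = np.sum(w * samples) / np.sum(w)
# The conjugate closed form for this model gives 1.2 * 1 / (1 + 0.5) = 0.8.
```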
Markov Chain Monte Carlo (MCMC) methods construct a Markov chain whose stationary distribution is the target posterior. Key algorithms include:
- Metropolis-Hastings: Proposes a candidate sample, then accepts or rejects it based on an acceptance ratio involving the posterior. Simple to implement but can be slow to explore the space if the proposal distribution is poorly chosen.
- Gibbs sampling: Iteratively samples each parameter from its conditional distribution given all other parameters. Efficient when conditionals are easy to sample from (common in conjugate models), but can mix slowly when parameters are highly correlated.
MCMC methods require careful diagnostics (burn-in period, convergence checks, effective sample size) to ensure the samples are representative of the posterior.
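A random-walk Metropolis sketch (a special case of Metropolis-Hastings with a symmetric proposal), targeting the unnormalized posterior of the same hypothetical conjugate Gaussian model used earlier in this section:

```python
import numpy as np

rng = np.random.default_rng(6)

def log_post(theta, x=1.2, sigma2=0.5):
    """Log posterior up to a constant: log prior N(0,1) + Gaussian log likelihood."""
    return -0.5 * theta**2 - 0.5 * (x - theta)**2 / sigma2

theta = 0.0
chain = []
for _ in range(50_000):
    prop = theta + rng.normal(0, 1.0)  # symmetric random-walk proposal
    # Accept with probability min(1, p(prop|x) / p(theta|x)); the evidence cancels.
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    chain.append(theta)

chain = np.array(chain[5_000:])  # discard burn-in samples
```

For this model the exact posterior is $\mathcal{N}(0.8, 1/3)$, so the chain's sample mean and variance should land close to those values.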
Variational Bayesian methods
Variational inference (VI) recasts the integration problem as an optimization problem. You choose a family $\mathcal{Q}$ of tractable distributions and find the member that is closest to the true posterior by minimizing the Kullback-Leibler (KL) divergence:

$$q^*(\theta) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q(\theta) \,\|\, p(\theta \mid x)\big)$$

This is equivalent to maximizing the evidence lower bound (ELBO). The most common approach is the mean-field approximation, which assumes the variational distribution factorizes across parameters: $q(\theta) = \prod_i q_i(\theta_i)$. This ignores posterior correlations but makes optimization tractable.
VI is generally faster than MCMC but provides a biased approximation (it tends to underestimate posterior variance). It's widely used in large-scale machine learning applications where MCMC would be too slow.
Advanced topics in Bayesian estimation
Hierarchical Bayesian models
Hierarchical models introduce multiple levels of parameters, with priors on priors (hyperpriors). For example, if you're estimating signal parameters across multiple channels, a hierarchical model lets each channel have its own parameter while sharing a common hyperprior that captures cross-channel structure.
This "partial pooling" effect is powerful: channels with little data borrow strength from channels with more data, producing better estimates overall than either fully pooled (one parameter for all) or fully independent (separate parameters, no sharing) approaches.
Nonparametric Bayesian methods
Nonparametric Bayesian methods don't assume a fixed number of parameters. Instead, the model complexity grows with the data. Key examples:
- Gaussian processes: Place a prior directly over functions, useful for regression and classification when you don't want to commit to a parametric model
- Dirichlet processes: Used for clustering when the number of clusters is unknown; the model can create new clusters as more data arrives
- Infinite mixture models: Extend finite mixture models to allow an unbounded number of components
"Nonparametric" is somewhat misleading — these models have infinitely many parameters, not zero. The term means the model doesn't have a fixed, finite parameterization.
Bayesian model selection
When you have competing models, Bayesian model selection compares them through their marginal likelihoods (evidence):

$$p(x \mid M_i) = \int p(x \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, d\theta_i$$

The Bayes factor between two models is the ratio of their marginal likelihoods: $B_{12} = p(x \mid M_1) / p(x \mid M_2)$. A Bayes factor much greater than 1 favors model 1.
The marginal likelihood naturally implements Occam's razor: complex models spread their prior probability over a larger parameter space, so they're penalized unless the data strongly supports the added complexity. This avoids the need for separate regularization or cross-validation.
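A worked sketch of this Occam effect for a hypothetical coin-flip comparison: M1 says the coin is fair ($p = 0.5$); M2 places a Uniform(0, 1), i.e. Beta(1, 1), prior on $p$, whose Beta-Binomial marginal likelihood is $1/(n+1)$:

```python
from math import comb

def bayes_factor(k, n):
    """Bayes factor B_12 for k heads in n flips: M1 fair coin vs M2 uniform prior."""
    m1 = comb(n, k) * 0.5**n  # evidence under the fair-coin model
    m2 = 1.0 / (n + 1)        # Beta-Binomial evidence with a Beta(1, 1) prior
    return m1 / m2

# Balanced data favors the simpler fair-coin model; lopsided data favors
# the flexible model, even though M2 can fit the balanced data too.
bf_balanced = bayes_factor(10, 20)
bf_skewed = bayes_factor(19, 20)
```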
Bayesian decision theory
Bayesian decision theory connects estimation to action. Given a posterior and a loss function $L(\theta, a)$ that quantifies the cost of taking action $a$ when the true parameter is $\theta$, the optimal action minimizes the expected posterior loss (Bayes risk):

$$a^* = \arg\min_{a} \int L(\theta, a)\, p(\theta \mid x)\, d\theta$$
Different loss functions lead to different optimal actions:
- Squared-error loss → optimal action is the posterior mean (MMSE)
- Absolute-error loss → optimal action is the posterior median
- 0-1 loss (for discrete hypotheses) → optimal action is the MAP estimate
This framework unifies estimation, detection, and classification under a single principled theory. In signal processing, it's used for designing optimal detectors, classifiers, and decision rules that account for both uncertainty and the costs of different errors.
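For a skewed posterior the three losses genuinely disagree. A sketch using a made-up Gamma-shaped posterior, represented by samples:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative skewed posterior: Gamma(shape=2, scale=1), via mock posterior samples.
samples = rng.gamma(shape=2.0, scale=1.0, size=200_000)

mmse_estimate = samples.mean()        # squared-error loss -> posterior mean (2.0 in theory)
median_estimate = np.median(samples)  # absolute-error loss -> posterior median (about 1.68)
# 0-1 loss -> posterior mode (MAP); approximated here by the histogram peak (about 1.0).
counts, edges = np.histogram(samples, bins=200)
i = np.argmax(counts)
map_estimate = 0.5 * (edges[i] + edges[i + 1])
```

For a symmetric unimodal posterior all three estimates would coincide, which is why the distinctions only matter for skewed or multimodal cases.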