Fiveable

📊Bayesian Statistics Unit 6 Review


6.5 Prediction

Written by the Fiveable Content Team • Last updated August 2025

Fundamentals of Bayesian prediction

Bayesian prediction uses posterior distributions to forecast future, unobserved data. Instead of producing a single best guess, it gives you a full probability distribution over possible outcomes, which means uncertainty from both the parameters and the data-generating process is baked right into your forecasts.

Predictive distributions

A predictive distribution is the probability distribution of a future observation given what you currently know. Rather than collapsing your forecast to one number, it tells you the relative likelihood of every possible future value.

Predictive distributions are calculated by integrating over the parameter space, weighted by whatever distribution you have on the parameters (prior or posterior). This integration is what "averages out" parameter uncertainty, so the result reflects not just your best parameter guess but the full range of plausible parameter values.

Posterior predictive distribution

The posterior predictive distribution is the distribution of future observations after you've seen data. It's the workhorse of Bayesian prediction.

p(y_{new}|y) = \int p(y_{new}|\theta)\,p(\theta|y)\,d\theta

  • y_{new} = the new observation you want to predict
  • y = the data you've already observed
  • \theta = model parameters
  • p(\theta|y) = the posterior distribution of parameters

What this integral does: for every possible value of \theta, it asks "how likely is y_{new} under this \theta?" and then weights that by how plausible \theta is given the data. The result captures two sources of uncertainty: parameter uncertainty (we don't know \theta exactly) and sampling variability (even if we knew \theta, future data would still be random).
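The weighting-and-averaging logic above can be sketched by simulation. A minimal sketch, assuming a hypothetical Beta-Binomial setup (7 successes in 10 trials, flat Beta(1, 1) prior, so the posterior is Beta(8, 4) by conjugacy):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 7 successes in 10 trials with a Beta(1, 1) prior,
# so the posterior is Beta(8, 4) by conjugacy.
a_post, b_post = 1 + 7, 1 + 3

# Step 1: draw plausible parameter values from p(theta | y)
theta = rng.beta(a_post, b_post, size=100_000)
# Step 2: for each theta, draw a future outcome from p(y_new | theta)
y_new = rng.binomial(n=10, p=theta)      # successes in 10 new trials

# The samples approximate p(y_new | y): parameter uncertainty (spread of
# theta) and sampling variability (binomial noise) are both included.
print(y_new.mean())                       # close to 10 * 8/12
```

Each simulated `y_new` pairs a plausible parameter with a random outcome under it, which is exactly what the integral averages over.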

Prior predictive distribution

The prior predictive distribution describes what data you'd expect to see before observing anything, based only on your prior beliefs:

p(y) = \int p(y|\theta)\,p(\theta)\,d\theta

This is useful for two things:

  • Prior elicitation: if the prior predictive puts most of its mass on absurd data values, your prior is probably poorly chosen
  • Model checking: comparing the prior predictive to the kind of data you actually expect helps you sanity-check your model before collecting data
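A prior predictive check is easy to run by simulation. A sketch, assuming a hypothetical Poisson model for counts with a deliberately vague Gamma prior on the rate (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Draws from the prior predictive p(y) = ∫ p(y|λ) p(λ) dλ:
lam = rng.gamma(shape=0.5, scale=100.0, size=50_000)  # λ ~ Gamma(0.5, scale 100)
y_sim = rng.poisson(lam)                              # y | λ ~ Poisson(λ)

# If most of the mass sits on absurd counts for the application (say,
# daily arrivals at a small shop), the prior needs tightening before
# any data are collected.
print(np.quantile(y_sim, [0.5, 0.99]))
```

The vague prior here puts nontrivial mass on counts in the hundreds, which is the kind of red flag this check is designed to surface.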

Point predictions

Sometimes you need a single number rather than a full distribution. Bayesian point predictions are summaries extracted from the predictive distribution, and the "right" summary depends on how you define prediction error.

Bayesian point estimators

The three most common point estimators from the posterior (or posterior predictive) are:

  • Posterior mean: minimizes expected squared error loss
  • Posterior median: minimizes expected absolute error loss
  • Posterior mode (MAP): the single most probable value

Your choice depends on the loss function that best matches the cost of being wrong in your application.

Posterior mean vs. median

  • The mean is optimal when you care about squared error. It's sensitive to the tails of the distribution, so in a heavily skewed posterior predictive, the mean can get pulled toward extreme values.
  • The median is optimal when you care about absolute error. It's more robust to skew and outliers, and it represents the "typical" predicted value.

For symmetric distributions, mean and median coincide, so the choice doesn't matter. For skewed distributions, the median often gives a more representative single prediction, while the mean better reflects the average outcome.
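The mean-median gap under skew is easy to see from samples. A sketch, assuming a hypothetical right-skewed posterior predictive represented by lognormal draws:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical right-skewed posterior predictive: lognormal(0, 1) draws
draws = rng.lognormal(mean=0.0, sigma=1.0, size=200_000)

post_mean = draws.mean()         # optimal under squared-error loss
post_median = np.median(draws)   # optimal under absolute-error loss

# The heavy right tail pulls the mean above the median:
# exp(0.5) ≈ 1.65 for the mean vs exp(0) = 1 for the median.
print(post_mean, post_median)
```

Both summaries come from the same distribution; which one you report depends on the loss function, exactly as described above.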

Prediction intervals

A prediction interval gives a range of values that a future observation is expected to fall within at a specified probability level. These come directly from quantiles of the posterior predictive distribution.

For example, a 95% prediction interval uses the 2.5th and 97.5th percentiles of p(y_{new}|y). Prediction intervals are wider than credible intervals for \theta because they account for both parameter uncertainty and the inherent randomness in future data.
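The width difference is visible directly in simulation. A sketch, assuming a toy posterior mu | y ~ N(5, 0.2^2) with known observation sd of 1 (numbers chosen for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)

mu = rng.normal(5.0, 0.2, size=100_000)   # posterior draws of the mean
y_new = rng.normal(mu, 1.0)               # posterior predictive draws

lo, hi = np.quantile(y_new, [0.025, 0.975])      # 95% prediction interval
mu_lo, mu_hi = np.quantile(mu, [0.025, 0.975])   # 95% credible interval

# The prediction interval is much wider: it stacks sampling noise on top
# of parameter uncertainty (total sd = sqrt(0.2^2 + 1^2) ≈ 1.02 vs 0.2).
print((lo, hi), (mu_lo, mu_hi))
```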

Bayesian model averaging

When you're unsure which model is correct, Bayesian model averaging (BMA) avoids committing to a single model by combining predictions across multiple models, weighted by how well each model is supported by the data.

Ensemble predictions

BMA is a principled form of ensemble prediction. Instead of picking one "best" model, you let several models contribute to the forecast. Models that explain the observed data well get more influence; models that don't get less. This tends to produce better-calibrated predictions than any single model, especially when no single model clearly dominates.

Weighted model combinations

The BMA predictive distribution is:

p(y_{new}|y) = \sum_{k=1}^{K} p(y_{new}|M_k, y)\,p(M_k|y)

  • M_k is the k-th candidate model
  • p(M_k|y) is the posterior probability of model k, which serves as its weight
  • p(y_{new}|M_k, y) is the posterior predictive under model k

The posterior model probabilities p(M_k|y) are typically computed using Bayes factors or marginal likelihoods. Models with stronger evidence get higher weights.
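Sampling from the BMA mixture is straightforward once you have each model's predictive and its weight. A sketch with two hypothetical candidate models whose posterior probabilities are assumed to be 0.7 and 0.3:

```python
import numpy as np

rng = np.random.default_rng(4)

weights = np.array([0.7, 0.3])     # assumed p(M_k | y)
S = 100_000

# Pick a model for each draw in proportion to its weight, then sample
# from that model's posterior predictive (toy normals for illustration).
k = rng.choice(2, size=S, p=weights)
y_new = np.where(k == 0,
                 rng.normal(0.0, 1.0, S),   # predictive under M_1
                 rng.normal(3.0, 2.0, S))   # predictive under M_2

# BMA predictive mean = 0.7*0 + 0.3*3 = 0.9
print(y_new.mean())
```

The resulting sample approximates the weighted sum in the formula above: each model contributes in proportion to its posterior probability.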

Uncertainty in model selection

BMA explicitly acknowledges that the "true" model may not be in your candidate set. By averaging over models rather than selecting one, you avoid the overconfidence that comes from conditioning on a single model choice. This is especially valuable when different models yield substantially different forecasts.

Predictive model assessment

Predictive assessment asks: how well does your model actually predict new data? Bayesian approaches focus on evaluating the full predictive distribution, not just point predictions.

Posterior predictive checks

Posterior predictive checks compare your observed data to data simulated from the posterior predictive distribution. The logic: if your model is reasonable, simulated datasets should look similar to the real data.

  • Pick a test statistic (e.g., mean, variance, max value, or a domain-specific quantity)
  • Compute it on the observed data
  • Compute it on many simulated datasets from p(y_{rep}|y)
  • A posterior predictive p-value measures how extreme the observed statistic is relative to the simulated distribution

Systematic discrepancies point to model misspecification. Graphical checks (overlaying observed and simulated data) are often more informative than any single p-value.
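The four steps above can be sketched for a normal model, using the sample variance as the test statistic. The "posterior" draws below use the standard conjugate results purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

n = 50
y_obs = rng.normal(0.0, 1.0, size=n)   # stand-in for real observed data
t_obs = y_obs.var(ddof=1)              # test statistic on observed data

# Draw (mu, sigma^2) from a conjugate-style posterior, then simulate
# replicated datasets y_rep from the posterior predictive.
S = 4000
s2 = y_obs.var(ddof=1)
sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=S)   # sigma^2 | y
mu = rng.normal(y_obs.mean(), np.sqrt(sigma2 / n))     # mu | sigma^2, y
y_rep = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None], size=(S, n))

t_rep = y_rep.var(axis=1, ddof=1)      # same statistic on each replicate
ppp = (t_rep >= t_obs).mean()          # posterior predictive p-value
print(ppp)                              # mid-range values suggest no misfit
```

Here the model is correct by construction, so the p-value lands near 0.5; values near 0 or 1 would flag a systematic discrepancy.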


Cross-validation techniques

Cross-validation estimates out-of-sample predictive accuracy by repeatedly holding out portions of the data:

  • Leave-one-out (LOO) cross-validation holds out one observation at a time and predicts it from the rest
  • k-fold cross-validation partitions data into k subsets and rotates the held-out set
  • The log predictive density (lpd) is the standard scoring rule: higher values mean better predictions
  • LOO can be approximated efficiently using Pareto-smoothed importance sampling (PSIS-LOO), avoiding the need to refit the model n times

Information criteria

Information criteria balance model fit against complexity:

  • DIC (Deviance Information Criterion): estimates expected predictive error using an effective number of parameters. Commonly used for hierarchical models, though it has known limitations with non-normal posteriors.
  • WAIC (Widely Applicable Information Criterion): a fully Bayesian alternative that approximates out-of-sample predictive accuracy. It's computed from the pointwise log predictive density and is generally preferred over DIC.

Lower values indicate better predictive performance for both criteria.
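WAIC can be computed directly from an S × n matrix of pointwise log-likelihoods, log p(y_i | \theta^{(s)}). A sketch using toy "posterior" draws for a normal mean (illustrative, not from a real sampler):

```python
import numpy as np

rng = np.random.default_rng(6)

n, S = 40, 2000
y = rng.normal(0.0, 1.0, size=n)
mu = rng.normal(y.mean(), 1.0 / np.sqrt(n), size=S)   # toy posterior draws
# loglik[s, i] = log p(y_i | mu^(s)) for a unit-variance normal model
loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - mu[:, None]) ** 2

# lppd: log of the posterior-averaged likelihood at each observation
# (stabilized log-mean-exp over the S draws)
m = loglik.max(axis=0)
lppd = np.sum(m + np.log(np.mean(np.exp(loglik - m), axis=0)))
# p_waic: summed pointwise posterior variance of the log-likelihood,
# the effective number of parameters
p_waic = np.sum(loglik.var(axis=0, ddof=1))

waic = -2 * (lppd - p_waic)
print(waic, p_waic)
```

With a single unknown mean, p_waic comes out close to 1, matching its interpretation as an effective parameter count.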

Prediction in regression models

Bayesian linear regression predictions

In Bayesian linear regression, the coefficients \beta and error variance \sigma^2 are treated as random variables with prior distributions. After observing data, you get posterior distributions for these parameters, and predictions for a new input x_{new} come from:

p(y_{new}|x_{new}, y, X) = \int p(y_{new}|x_{new}, \beta, \sigma^2)\,p(\beta, \sigma^2|y, X)\,d\beta\,d\sigma^2

This integral averages the regression prediction over all plausible parameter values, so the resulting predictive distribution is wider (more honest about uncertainty) than a plug-in prediction from classical regression.
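By simulation, this averaging is one line per parameter draw. A sketch that assumes posterior draws of (\beta_0, \beta_1, \sigma) are already available; here they are faked as simple distributions for illustration, not taken from a real fit:

```python
import numpy as np

rng = np.random.default_rng(7)

S = 50_000
beta0 = rng.normal(2.0, 0.1, S)            # posterior draws of intercept
beta1 = rng.normal(0.5, 0.05, S)           # posterior draws of slope
sigma = np.abs(rng.normal(1.0, 0.05, S))   # posterior draws of noise sd

x_new = 4.0
# One y_new per parameter draw integrates over (beta, sigma) by simulation
y_new = rng.normal(beta0 + beta1 * x_new, sigma)

# Predictive mean ≈ 2 + 0.5*4 = 4; predictive sd exceeds the plug-in
# noise sd of 1 because coefficient uncertainty is folded in.
print(y_new.mean(), y_new.std())
```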

Hierarchical model predictions

Hierarchical (multilevel) models are designed for grouped or nested data, such as students within schools or patients within hospitals.

  • Predictions incorporate both within-group and between-group variability
  • Groups with limited data get partial pooling: their predictions are pulled toward the overall population estimate, which reduces overfitting
  • You can generate predictions at the individual level, the group level, or the population level
  • This "borrowing strength" across groups is one of the biggest practical advantages of Bayesian hierarchical models

Non-linear prediction methods

When relationships between predictors and outcomes are non-linear, two common Bayesian approaches are:

  • Gaussian process (GP) regression: defines a prior directly over functions using a kernel function. The kernel controls assumptions about smoothness and length-scale. GPs are flexible but scale poorly to very large datasets.
  • Bayesian neural networks (BNNs): place prior distributions on network weights, producing a distribution over predictions rather than a single output. They can capture complex patterns but are computationally demanding.

Both methods are most useful when linear models clearly fail to capture the underlying structure in the data.
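A GP posterior predictive can be written in a few lines of linear algebra. A minimal sketch with an RBF kernel and illustrative hyperparameters (length-scale 1, noise sd 0.1), fitting noisy samples of a sine curve:

```python
import numpy as np

rng = np.random.default_rng(12)

def rbf(a, b, ls=1.0):
    """Squared-exponential kernel; ls controls assumed smoothness."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

x = np.linspace(0, 5, 20)
y = np.sin(x) + rng.normal(0, 0.1, x.size)   # noisy training data
x_new = np.array([2.5])

K = rbf(x, x) + 0.01 * np.eye(x.size)   # training cov + noise variance
k_star = rbf(x_new, x)                  # cross-covariance to x_new
alpha = np.linalg.solve(K, y)

mean = k_star @ alpha                   # GP predictive mean at x_new
var = rbf(x_new, x_new) - k_star @ np.linalg.solve(K, k_star.T)
print(mean[0], var[0, 0])               # mean tracks sin(2.5) ≈ 0.6
```

The kernel choice encodes the smoothness assumption; swapping it changes what "plausible functions" means. The O(n^3) solve is why GPs scale poorly to very large datasets.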

Time series prediction

Bayesian forecasting methods

Bayesian time series forecasting applies the same integrate-over-uncertainty logic to sequential data. Prior knowledge about trend, seasonality, and other components can be encoded directly into the model. The output is a full probabilistic forecast, including prediction intervals, rather than a single trajectory.

Common examples include Bayesian structural time series (BSTS) models and Bayesian ARIMA models.

State space models

A state space model represents a time series through two equations:

  • State equation: describes how latent (unobserved) states evolve over time
  • Observation equation: links the latent states to the data you actually see

Bayesian inference provides full posterior distributions for both the latent states and model parameters. These models are widely used in tracking, signal processing, and econometrics because they handle complex dynamics and external covariates naturally.

Dynamic linear models

Dynamic linear models (DLMs) are a special case of state space models where both the state and observation equations are linear. They're useful for modeling time-varying regression coefficients, trends, and seasonal patterns.

Bayesian inference for DLMs is typically performed using the Kalman filter (for forward filtering) and Kalman smoother (for backward smoothing). These algorithms allow efficient sequential updating of state estimates as new data arrives, and they handle missing data and irregular time spacing gracefully.
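The forward filter for the simplest DLM (a 1-D local level model: state mu_t = mu_{t-1} + w_t, observation y_t = mu_t + v_t) fits in a short loop. A sketch with illustrative noise variances:

```python
import numpy as np

def kalman_filter(y, q=0.1, r=1.0, m0=0.0, c0=10.0):
    """Forward filtering: posterior mean/variance of each latent state."""
    m, c = m0, c0
    means, variances = [], []
    for yt in y:
        # predict: the state evolves, accumulating process noise q
        m_pred, c_pred = m, c + q
        # update: fold in the new observation with noise variance r
        k = c_pred / (c_pred + r)          # Kalman gain
        m = m_pred + k * (yt - m_pred)
        c = (1 - k) * c_pred
        means.append(m)
        variances.append(c)
    return np.array(means), np.array(variances)

rng = np.random.default_rng(8)
true_state = np.cumsum(rng.normal(0, np.sqrt(0.1), 200)) + 5.0
y = true_state + rng.normal(0, 1.0, 200)   # noisy observations
m, c = kalman_filter(y)
print(m[-1], c[-1])
```

Each new observation updates the state estimate sequentially, which is what makes these models natural for streaming data; the state variance settles to a steady value once the filter has seen enough data.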

Prediction with missing data

Multiple imputation techniques

Multiple imputation generates several plausible completed datasets, runs the analysis on each, and then combines the results. In a Bayesian framework:

  1. Specify a model for the missing values given the observed data
  2. Draw multiple sets of imputed values from the posterior predictive distribution of the missing data
  3. Analyze each completed dataset separately
  4. Combine results using rules that properly propagate imputation uncertainty (e.g., Rubin's rules)

This is often implemented via MCMC and produces more honest uncertainty estimates than filling in missing values once.
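Step 4's combining rules are simple arithmetic. A sketch of Rubin's rules, where `q_hat[m]` is the point estimate and `u[m]` its variance from the m-th completed dataset (illustrative numbers, not from a real analysis):

```python
import numpy as np

q_hat = np.array([2.1, 2.4, 1.9, 2.2, 2.3])    # estimates per imputation
u = np.array([0.25, 0.30, 0.28, 0.26, 0.27])   # variances per imputation
M = len(q_hat)

q_bar = q_hat.mean()                   # pooled point estimate
u_bar = u.mean()                       # within-imputation variance
b = q_hat.var(ddof=1)                  # between-imputation variance
total_var = u_bar + (1 + 1 / M) * b    # Rubin's total variance

print(q_bar, total_var)
```

The between-imputation term is what single imputation throws away: it is the extra uncertainty contributed by not knowing the missing values.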

Bayesian approaches to missingness

Bayesian models can handle different missing data mechanisms explicitly:

  • MCAR (Missing Completely at Random): missingness is unrelated to any data
  • MAR (Missing at Random): missingness depends on observed data but not on the missing values themselves
  • MNAR (Missing Not at Random): missingness depends on the unobserved values

A key advantage is that you can model the data and the missingness mechanism jointly, estimating parameters and imputing missing values simultaneously within a single posterior.


Sensitivity analysis for predictions

Because MNAR assumptions are untestable from the data alone, sensitivity analysis is important. You vary your assumptions about the missingness mechanism and check whether your predictions change substantially. If predictions are stable across plausible assumptions, you can be more confident in them. Bayesian model averaging over different missingness models is one systematic way to do this.

Computational methods for prediction

Monte Carlo methods

Monte Carlo methods approximate intractable integrals by drawing random samples. For Bayesian prediction, this means:

  1. Draw parameter samples \theta^{(1)}, \theta^{(2)}, \dots, \theta^{(S)} from the posterior p(\theta|y)
  2. For each \theta^{(s)}, simulate a future observation y_{new}^{(s)} \sim p(y_{new}|\theta^{(s)})
  3. The collection \{y_{new}^{(s)}\} approximates the posterior predictive distribution

Basic techniques include simple Monte Carlo, importance sampling, and rejection sampling. These work well for low- to moderate-dimensional problems.
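Importance sampling, one of the basic techniques listed above, reweights draws from a convenient proposal toward the target. A sketch with toy densities chosen purely for illustration (target "posterior" N(2, 0.5^2), proposal N(0, 2^2)):

```python
import numpy as np

rng = np.random.default_rng(9)

def npdf(x, m, s):
    """Normal density, used to form the importance ratios below."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

S = 200_000
theta = rng.normal(0.0, 2.0, S)                      # proposal draws
w = npdf(theta, 2.0, 0.5) / npdf(theta, 0.0, 2.0)    # target/proposal ratios
w /= w.sum()                                         # self-normalize

# Predictive draws y_new | theta ~ N(theta, 1), weighted toward the target
y_new = rng.normal(theta, 1.0)
est = np.sum(w * y_new)      # estimates E[y_new | y] = 2 under the target
print(est)
```

The proposal must cover the target's support; weights blow up when it doesn't, which is the failure mode PSIS-style diagnostics are designed to catch.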

Markov Chain Monte Carlo

MCMC methods generate correlated samples from the posterior by constructing a Markov chain whose stationary distribution is the target posterior. Key algorithms:

  • Metropolis-Hastings: proposes new parameter values and accepts/rejects based on the posterior ratio
  • Gibbs sampling: cycles through parameters one at a time, sampling each from its full conditional distribution
  • Hamiltonian Monte Carlo (HMC): uses gradient information to propose distant, high-probability states, making it efficient for high-dimensional problems

MCMC samples are asymptotically exact, meaning they converge to the true posterior given enough iterations. These methods are especially effective for hierarchical and non-conjugate models.
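A minimal random-walk Metropolis-Hastings sketch, targeting a stand-in N(3, 1) "posterior" (a real problem would use the model's log p(\theta|y) up to a constant):

```python
import numpy as np

rng = np.random.default_rng(10)

def log_post(theta):
    """Unnormalized log density of the toy N(3, 1) target."""
    return -0.5 * (theta - 3.0) ** 2

theta = 0.0
samples = []
for _ in range(20_000):
    prop = theta + rng.normal(0, 1.0)      # symmetric random-walk proposal
    # accept with probability min(1, posterior ratio)
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    samples.append(theta)

samples = np.array(samples[2_000:])        # discard burn-in
print(samples.mean(), samples.std())       # near the target's 3 and 1
```

Only the *ratio* of posterior densities appears in the accept step, which is why the normalizing constant never needs to be computed.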

Approximate Bayesian Computation

ABC methods are designed for models where the likelihood p(y|\theta) is intractable or too expensive to evaluate. The basic idea:

  1. Sample \theta from the prior
  2. Simulate a dataset y_{sim} from the model given \theta
  3. Accept \theta if y_{sim} is "close enough" to the observed data (measured by summary statistics and a tolerance threshold)

Variants include rejection ABC, MCMC-ABC, and sequential Monte Carlo ABC. These are commonly used in population genetics, ecology, and systems biology where models are simulation-based.
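A rejection-ABC sketch for a normal mean with known sd (pretending the likelihood is unavailable and we can only simulate; prior, tolerance, and summary are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(11)

y_obs = rng.normal(2.0, 1.0, size=30)    # stand-in observed data
s_obs = y_obs.mean()                     # summary statistic

accepted = []
for _ in range(50_000):
    theta = rng.uniform(-5, 5)           # 1. sample theta from a flat prior
    y_sim = rng.normal(theta, 1.0, size=30)   # 2. simulate a dataset
    if abs(y_sim.mean() - s_obs) < 0.1:  # 3. keep theta if summaries match
        accepted.append(theta)

accepted = np.array(accepted)
# accepted draws approximate the (ABC) posterior for theta
print(accepted.mean(), len(accepted))
```

Tightening the tolerance makes the approximation closer to the true posterior at the cost of a lower acceptance rate; the choice of summary statistic matters just as much.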

Bayesian vs. frequentist prediction

Philosophical differences

The core distinction: Bayesian statistics treats parameters as random variables with probability distributions, while frequentist statistics treats them as fixed, unknown constants.

  • Bayesian inference uses prior distributions to encode existing knowledge, then updates to a posterior via Bayes' theorem
  • Frequentist inference relies on sampling distributions and long-run frequency properties
  • Bayesian predictions naturally integrate over parameter uncertainty through the predictive distribution
  • Frequentist predictions typically plug in point estimates, requiring separate procedures (like bootstrap) to capture parameter uncertainty

Performance comparisons

  • Small samples: Bayesian methods tend to perform better because informative priors regularize estimates and prevent overfitting
  • Large samples: the prior's influence fades, and Bayesian and frequentist predictions often converge
  • Complex models: Bayesian approaches handle hierarchical and latent variable models more naturally
  • Computational cost: frequentist methods can be faster for simple models, but modern MCMC and variational methods have made Bayesian computation practical for many applications

Hybrid approaches

Several methods blend Bayesian and frequentist ideas:

  • Empirical Bayes: estimates prior hyperparameters from the data itself, using frequentist methods to set the prior. This is practical when you lack strong prior information but still want the benefits of a Bayesian framework.
  • Frequentist model averaging: applies Bayesian-style weighting to frequentist model predictions
  • These hybrid methods are common in applied settings where pure approaches have practical limitations

Applications of Bayesian prediction

Financial forecasting

Bayesian time series models predict stock prices, interest rates, and economic indicators while quantifying forecast uncertainty. Hierarchical models capture dependencies across financial instruments, and Bayesian portfolio optimization balances risk and return using full posterior distributions rather than point estimates.

Environmental modeling

Bayesian hierarchical and spatial-temporal models forecast climate impacts, species distributions, and air/water quality. These models are well-suited to environmental science because they can integrate multiple data sources (satellite, ground-station, expert judgment) and propagate uncertainty through complex physical systems.

Clinical trial predictions

Bayesian methods predict treatment effects and patient outcomes, and they enable adaptive trial designs that modify enrollment or dosing rules as data accumulates. Bayesian longitudinal models track disease progression, and subgroup analyses can inform personalized treatment decisions while properly accounting for multiplicity and uncertainty.