Fiveable

📊Bayesian Statistics Unit 6 Review


6.5 Prediction

Written by the Fiveable Content Team • Last updated August 2025

Fundamentals of Bayesian prediction

Bayesian prediction uses posterior distributions to forecast future, unobserved data. Instead of producing a single best guess, it gives you a full probability distribution over possible outcomes, which means uncertainty from both the parameters and the data-generating process is baked right into your forecasts.

Predictive distributions

A predictive distribution is the probability distribution of a future observation given what you currently know. Rather than collapsing your forecast to one number, it tells you the relative likelihood of every possible future value.

Predictive distributions are calculated by integrating over the parameter space, weighted by whatever distribution you have on the parameters (prior or posterior). This integration is what "averages out" parameter uncertainty, so the result reflects not just your best parameter guess but the full range of plausible parameter values.

Posterior predictive distribution

The posterior predictive distribution is the distribution of future observations after you've seen data. It's the workhorse of Bayesian prediction.

p(y_{new}|y) = \int p(y_{new}|\theta)\,p(\theta|y)\,d\theta

  • y_{new} = the new observation you want to predict
  • y = the data you've already observed
  • \theta = model parameters
  • p(\theta|y) = the posterior distribution of parameters

What this integral does: for every possible value of \theta, it asks "how likely is y_{new} under this \theta?" and then weights that by how plausible \theta is given the data. The result captures two sources of uncertainty: parameter uncertainty (we don't know \theta exactly) and sampling variability (even if we knew \theta, future data would still be random).
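The weighting-and-averaging logic above can be sketched by simulation. A minimal sketch, assuming a hypothetical Beta-Binomial setup (7 successes in 10 trials, flat Beta(1, 1) prior, so the posterior is Beta(8, 4) by conjugacy):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 7 successes in 10 trials with a Beta(1, 1) prior,
# so the posterior is Beta(8, 4) by conjugacy.
a_post, b_post = 1 + 7, 1 + 3

# Step 1: draw plausible parameter values from p(theta | y)
theta = rng.beta(a_post, b_post, size=100_000)
# Step 2: for each theta, draw a future outcome from p(y_new | theta)
y_new = rng.binomial(n=10, p=theta)      # successes in 10 new trials

# The samples approximate p(y_new | y): parameter uncertainty (spread of
# theta) and sampling variability (binomial noise) are both included.
print(y_new.mean())                       # close to 10 * 8/12
```

Each simulated `y_new` pairs a plausible parameter with a random outcome under it, which is exactly what the integral averages over.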

Prior predictive distribution

The prior predictive distribution describes what data you'd expect to see before observing anything, based only on your prior beliefs:

p(y) = \int p(y|\theta)\,p(\theta)\,d\theta

This is useful for two things:

  • Prior elicitation: if the prior predictive puts most of its mass on absurd data values, your prior is probably poorly chosen
  • Model checking: comparing the prior predictive to the kind of data you actually expect helps you sanity-check your model before collecting data
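A prior predictive check is easy to run by simulation. A sketch, assuming a hypothetical Poisson model for counts with a deliberately vague Gamma prior on the rate (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Draws from the prior predictive p(y) = ∫ p(y|λ) p(λ) dλ:
lam = rng.gamma(shape=0.5, scale=100.0, size=50_000)  # λ ~ Gamma(0.5, scale 100)
y_sim = rng.poisson(lam)                              # y | λ ~ Poisson(λ)

# If most of the mass sits on absurd counts for the application (say,
# daily arrivals at a small shop), the prior needs tightening before
# any data are collected.
print(np.quantile(y_sim, [0.5, 0.99]))
```

The vague prior here puts nontrivial mass on counts in the hundreds, which is the kind of red flag this check is designed to surface.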

Point predictions

Sometimes you need a single number rather than a full distribution. Bayesian point predictions are summaries extracted from the predictive distribution, and the "right" summary depends on how you define prediction error.

Bayesian point estimators

The three most common point estimators from the posterior (or posterior predictive) are:

  • Posterior mean: minimizes expected squared error loss
  • Posterior median: minimizes expected absolute error loss
  • Posterior mode (MAP): the single most probable value

Your choice depends on the loss function that best matches the cost of being wrong in your application.

Posterior mean vs. median

  • The mean is optimal when you care about squared error. It's sensitive to the tails of the distribution, so in a heavily skewed posterior predictive, the mean can get pulled toward extreme values.
  • The median is optimal when you care about absolute error. It's more robust to skew and outliers, and it represents the "typical" predicted value.

For symmetric distributions, mean and median coincide, so the choice doesn't matter. For skewed distributions, the median often gives a more representative single prediction, while the mean better reflects the average outcome.
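The mean-median gap under skew is easy to see from samples. A sketch, assuming a hypothetical right-skewed posterior predictive represented by lognormal draws:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical right-skewed posterior predictive: lognormal(0, 1) draws
draws = rng.lognormal(mean=0.0, sigma=1.0, size=200_000)

post_mean = draws.mean()         # optimal under squared-error loss
post_median = np.median(draws)   # optimal under absolute-error loss

# The heavy right tail pulls the mean above the median:
# exp(0.5) ≈ 1.65 for the mean vs exp(0) = 1 for the median.
print(post_mean, post_median)
```

Both summaries come from the same distribution; which one you report depends on the loss function, exactly as described above.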

Prediction intervals

A prediction interval gives a range of values that a future observation is expected to fall within at a specified probability level. These come directly from quantiles of the posterior predictive distribution.

For example, a 95% prediction interval uses the 2.5th and 97.5th percentiles of p(y_{new}|y). Prediction intervals are wider than credible intervals for \theta because they account for both parameter uncertainty and the inherent randomness in future data.
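The width difference is visible directly in simulation. A sketch, assuming a toy posterior mu | y ~ N(5, 0.2^2) with known observation sd of 1 (numbers chosen for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)

mu = rng.normal(5.0, 0.2, size=100_000)   # posterior draws of the mean
y_new = rng.normal(mu, 1.0)               # posterior predictive draws

lo, hi = np.quantile(y_new, [0.025, 0.975])      # 95% prediction interval
mu_lo, mu_hi = np.quantile(mu, [0.025, 0.975])   # 95% credible interval

# The prediction interval is much wider: it stacks sampling noise on top
# of parameter uncertainty (total sd = sqrt(0.2^2 + 1^2) ≈ 1.02 vs 0.2).
print((lo, hi), (mu_lo, mu_hi))
```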

Bayesian model averaging

When you're unsure which model is correct, Bayesian model averaging (BMA) avoids committing to a single model by combining predictions across multiple models, weighted by how well each model is supported by the data.

Ensemble predictions

BMA is a principled form of ensemble prediction. Instead of picking one "best" model, you let several models contribute to the forecast. Models that explain the observed data well get more influence; models that don't get less. This tends to produce better-calibrated predictions than any single model, especially when no single model clearly dominates.

Weighted model combinations

The BMA predictive distribution is:

p(y_{new}|y) = \sum_{k=1}^{K} p(y_{new}|M_k, y)\,p(M_k|y)

  • M_k is the k-th candidate model
  • p(M_k|y) is the posterior probability of model k, which serves as its weight
  • p(y_{new}|M_k, y) is the posterior predictive under model k

The posterior model probabilities p(M_k|y) are typically computed using Bayes factors or marginal likelihoods. Models with stronger evidence get higher weights.
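Sampling from the BMA mixture is straightforward once you have each model's predictive and its weight. A sketch with two hypothetical candidate models whose posterior probabilities are assumed to be 0.7 and 0.3:

```python
import numpy as np

rng = np.random.default_rng(4)

weights = np.array([0.7, 0.3])     # assumed p(M_k | y)
S = 100_000

# Pick a model for each draw in proportion to its weight, then sample
# from that model's posterior predictive (toy normals for illustration).
k = rng.choice(2, size=S, p=weights)
y_new = np.where(k == 0,
                 rng.normal(0.0, 1.0, S),   # predictive under M_1
                 rng.normal(3.0, 2.0, S))   # predictive under M_2

# BMA predictive mean = 0.7*0 + 0.3*3 = 0.9
print(y_new.mean())
```

The resulting sample approximates the weighted sum in the formula above: each model contributes in proportion to its posterior probability.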

Uncertainty in model selection

BMA explicitly acknowledges that the "true" model may not be in your candidate set. By averaging over models rather than selecting one, you avoid the overconfidence that comes from conditioning on a single model choice. This is especially valuable when different models yield substantially different forecasts.

Predictive model assessment

Predictive assessment asks: how well does your model actually predict new data? Bayesian approaches focus on evaluating the full predictive distribution, not just point predictions.

Posterior predictive checks

Posterior predictive checks compare your observed data to data simulated from the posterior predictive distribution. The logic: if your model is reasonable, simulated datasets should look similar to the real data.

  • Pick a test statistic (e.g., mean, variance, max value, or a domain-specific quantity)
  • Compute it on the observed data
  • Compute it on many simulated datasets from p(y_{rep}|y)
  • A posterior predictive p-value measures how extreme the observed statistic is relative to the simulated distribution

Systematic discrepancies point to model misspecification. Graphical checks (overlaying observed and simulated data) are often more informative than any single p-value.
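The four steps above can be sketched for a normal model, using the sample variance as the test statistic. The "posterior" draws below use the standard conjugate results purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

n = 50
y_obs = rng.normal(0.0, 1.0, size=n)   # stand-in for real observed data
t_obs = y_obs.var(ddof=1)              # test statistic on observed data

# Draw (mu, sigma^2) from a conjugate-style posterior, then simulate
# replicated datasets y_rep from the posterior predictive.
S = 4000
s2 = y_obs.var(ddof=1)
sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=S)   # sigma^2 | y
mu = rng.normal(y_obs.mean(), np.sqrt(sigma2 / n))     # mu | sigma^2, y
y_rep = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None], size=(S, n))

t_rep = y_rep.var(axis=1, ddof=1)      # same statistic on each replicate
ppp = (t_rep >= t_obs).mean()          # posterior predictive p-value
print(ppp)                              # mid-range values suggest no misfit
```

Here the model is correct by construction, so the p-value lands near 0.5; values near 0 or 1 would flag a systematic discrepancy.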


Cross-validation techniques

Cross-validation estimates out-of-sample predictive accuracy by repeatedly holding out portions of the data:

  • Leave-one-out (LOO) cross-validation holds out one observation at a time and predicts it from the rest
  • k-fold cross-validation partitions data into k subsets and rotates the held-out set
  • The log predictive density (lpd) is the standard scoring rule: higher values mean better predictions
  • LOO can be approximated efficiently using Pareto-smoothed importance sampling (PSIS-LOO), avoiding the need to refit the model n times

Information criteria

Information criteria balance model fit against complexity:

  • DIC (Deviance Information Criterion): estimates expected predictive error using an effective number of parameters. Commonly used for hierarchical models, though it has known limitations with non-normal posteriors.
  • WAIC (Widely Applicable Information Criterion): a fully Bayesian alternative that approximates out-of-sample predictive accuracy. It's computed from the pointwise log predictive density and is generally preferred over DIC.

Lower values indicate better predictive performance for both criteria.
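WAIC can be computed directly from an S × n matrix of pointwise log-likelihoods, log p(y_i | \theta^{(s)}). A sketch using toy "posterior" draws for a normal mean (illustrative, not from a real sampler):

```python
import numpy as np

rng = np.random.default_rng(6)

n, S = 40, 2000
y = rng.normal(0.0, 1.0, size=n)
mu = rng.normal(y.mean(), 1.0 / np.sqrt(n), size=S)   # toy posterior draws
# loglik[s, i] = log p(y_i | mu^(s)) for a unit-variance normal model
loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - mu[:, None]) ** 2

# lppd: log of the posterior-averaged likelihood at each observation
# (stabilized log-mean-exp over the S draws)
m = loglik.max(axis=0)
lppd = np.sum(m + np.log(np.mean(np.exp(loglik - m), axis=0)))
# p_waic: summed pointwise posterior variance of the log-likelihood,
# the effective number of parameters
p_waic = np.sum(loglik.var(axis=0, ddof=1))

waic = -2 * (lppd - p_waic)
print(waic, p_waic)
```

With a single unknown mean, p_waic comes out close to 1, matching its interpretation as an effective parameter count.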

Prediction in regression models

Bayesian linear regression predictions

In Bayesian linear regression, the coefficients \beta and error variance \sigma^2 are treated as random variables with prior distributions. After observing data, you get posterior distributions for these parameters, and predictions for a new input x_{new} come from:

p(y_{new}|x_{new}, y, X) = \int p(y_{new}|x_{new}, \beta, \sigma^2)\,p(\beta, \sigma^2|y, X)\,d\beta\,d\sigma^2

This integral averages the regression prediction over all plausible parameter values, so the resulting predictive distribution is wider (more honest about uncertainty) than a plug-in prediction from classical regression.
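By simulation, this averaging is one line per parameter draw. A sketch that assumes posterior draws of (\beta_0, \beta_1, \sigma) are already available; here they are faked as simple distributions for illustration, not taken from a real fit:

```python
import numpy as np

rng = np.random.default_rng(7)

S = 50_000
beta0 = rng.normal(2.0, 0.1, S)            # posterior draws of intercept
beta1 = rng.normal(0.5, 0.05, S)           # posterior draws of slope
sigma = np.abs(rng.normal(1.0, 0.05, S))   # posterior draws of noise sd

x_new = 4.0
# One y_new per parameter draw integrates over (beta, sigma) by simulation
y_new = rng.normal(beta0 + beta1 * x_new, sigma)

# Predictive mean ≈ 2 + 0.5*4 = 4; predictive sd exceeds the plug-in
# noise sd of 1 because coefficient uncertainty is folded in.
print(y_new.mean(), y_new.std())
```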

Hierarchical model predictions

Hierarchical (multilevel) models are designed for grouped or nested data, such as students within schools or patients within hospitals.

  • Predictions incorporate both within-group and between-group variability
  • Groups with limited data get partial pooling: their predictions are pulled toward the overall population estimate, which reduces overfitting
  • You can generate predictions at the individual level, the group level, or the population level
  • This "borrowing strength" across groups is one of the biggest practical advantages of Bayesian hierarchical models

Non-linear prediction methods

When relationships between predictors and outcomes are non-linear, two common Bayesian approaches are:

  • Gaussian process (GP) regression: defines a prior directly over functions using a kernel function. The kernel controls assumptions about smoothness and length-scale. GPs are flexible but scale poorly to very large datasets.
  • Bayesian neural networks (BNNs): place prior distributions on network weights, producing a distribution over predictions rather than a single output. They can capture complex patterns but are computationally demanding.

Both methods are most useful when linear models clearly fail to capture the underlying structure in the data.
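A GP posterior predictive can be written in a few lines of linear algebra. A minimal sketch with an RBF kernel and illustrative hyperparameters (length-scale 1, noise sd 0.1), fitting noisy samples of a sine curve:

```python
import numpy as np

rng = np.random.default_rng(12)

def rbf(a, b, ls=1.0):
    """Squared-exponential kernel; ls controls assumed smoothness."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

x = np.linspace(0, 5, 20)
y = np.sin(x) + rng.normal(0, 0.1, x.size)   # noisy training data
x_new = np.array([2.5])

K = rbf(x, x) + 0.01 * np.eye(x.size)   # training cov + noise variance
k_star = rbf(x_new, x)                  # cross-covariance to x_new
alpha = np.linalg.solve(K, y)

mean = k_star @ alpha                   # GP predictive mean at x_new
var = rbf(x_new, x_new) - k_star @ np.linalg.solve(K, k_star.T)
print(mean[0], var[0, 0])               # mean tracks sin(2.5) ≈ 0.6
```

The kernel choice encodes the smoothness assumption; swapping it changes what "plausible functions" means. The O(n^3) solve is why GPs scale poorly to very large datasets.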

Time series prediction

Bayesian forecasting methods

Bayesian time series forecasting applies the same integrate-over-uncertainty logic to sequential data. Prior knowledge about trend, seasonality, and other components can be encoded directly into the model. The output is a full probabilistic forecast, including prediction intervals, rather than a single trajectory.

Common examples include Bayesian structural time series (BSTS) models and Bayesian ARIMA models.

State space models

A state space model represents a time series through two equations:

  • State equation: describes how latent (unobserved) states evolve over time
  • Observation equation: links the latent states to the data you actually see

Bayesian inference provides full posterior distributions for both the latent states and model parameters. These models are widely used in tracking, signal processing, and econometrics because they handle complex dynamics and external covariates naturally.

Dynamic linear models

Dynamic linear models (DLMs) are a special case of state space models where both the state and observation equations are linear. They're useful for modeling time-varying regression coefficients, trends, and seasonal patterns.

Bayesian inference for DLMs is typically performed using the Kalman filter (for forward filtering) and Kalman smoother (for backward smoothing). These algorithms allow efficient sequential updating of state estimates as new data arrives, and they handle missing data and irregular time spacing gracefully.
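The forward filter for the simplest DLM (a 1-D local level model: state mu_t = mu_{t-1} + w_t, observation y_t = mu_t + v_t) fits in a short loop. A sketch with illustrative noise variances:

```python
import numpy as np

def kalman_filter(y, q=0.1, r=1.0, m0=0.0, c0=10.0):
    """Forward filtering: posterior mean/variance of each latent state."""
    m, c = m0, c0
    means, variances = [], []
    for yt in y:
        # predict: the state evolves, accumulating process noise q
        m_pred, c_pred = m, c + q
        # update: fold in the new observation with noise variance r
        k = c_pred / (c_pred + r)          # Kalman gain
        m = m_pred + k * (yt - m_pred)
        c = (1 - k) * c_pred
        means.append(m)
        variances.append(c)
    return np.array(means), np.array(variances)

rng = np.random.default_rng(8)
true_state = np.cumsum(rng.normal(0, np.sqrt(0.1), 200)) + 5.0
y = true_state + rng.normal(0, 1.0, 200)   # noisy observations
m, c = kalman_filter(y)
print(m[-1], c[-1])
```

Each new observation updates the state estimate sequentially, which is what makes these models natural for streaming data; the state variance settles to a steady value once the filter has seen enough data.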

Prediction with missing data

Multiple imputation techniques

Multiple imputation generates several plausible completed datasets, runs the analysis on each, and then combines the results. In a Bayesian framework:

  1. Specify a model for the missing values given the observed data
  2. Draw multiple sets of imputed values from the posterior predictive distribution of the missing data
  3. Analyze each completed dataset separately
  4. Combine results using rules that properly propagate imputation uncertainty (e.g., Rubin's rules)

This is often implemented via MCMC and produces more honest uncertainty estimates than filling in missing values once.
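Step 4's combining rules are simple arithmetic. A sketch of Rubin's rules, where `q_hat[m]` is the point estimate and `u[m]` its variance from the m-th completed dataset (illustrative numbers, not from a real analysis):

```python
import numpy as np

q_hat = np.array([2.1, 2.4, 1.9, 2.2, 2.3])    # estimates per imputation
u = np.array([0.25, 0.30, 0.28, 0.26, 0.27])   # variances per imputation
M = len(q_hat)

q_bar = q_hat.mean()                   # pooled point estimate
u_bar = u.mean()                       # within-imputation variance
b = q_hat.var(ddof=1)                  # between-imputation variance
total_var = u_bar + (1 + 1 / M) * b    # Rubin's total variance

print(q_bar, total_var)
```

The between-imputation term is what single imputation throws away: it is the extra uncertainty contributed by not knowing the missing values.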

Bayesian approaches to missingness

Bayesian models can handle different missing data mechanisms explicitly:

  • MCAR (Missing Completely at Random): missingness is unrelated to any data
  • MAR (Missing at Random): missingness depends on observed data but not on the missing values themselves
  • MNAR (Missing Not at Random): missingness depends on the unobserved values

A key advantage is that you can model the data and the missingness mechanism jointly, estimating parameters and imputing missing values simultaneously within a single posterior.


Sensitivity analysis for predictions

Because MNAR assumptions are untestable from the data alone, sensitivity analysis is important. You vary your assumptions about the missingness mechanism and check whether your predictions change substantially. If predictions are stable across plausible assumptions, you can be more confident in them. Bayesian model averaging over different missingness models is one systematic way to do this.

Computational methods for prediction

Monte Carlo methods

Monte Carlo methods approximate intractable integrals by drawing random samples. For Bayesian prediction, this means:

  1. Draw parameter samples \theta^{(1)}, \theta^{(2)}, \dots, \theta^{(S)} from the posterior p(\theta|y)
  2. For each \theta^{(s)}, simulate a future observation y_{new}^{(s)} \sim p(y_{new}|\theta^{(s)})
  3. The collection \{y_{new}^{(s)}\} approximates the posterior predictive distribution

Basic techniques include simple Monte Carlo, importance sampling, and rejection sampling. These work well for low- to moderate-dimensional problems.
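Importance sampling, one of the basic techniques listed above, reweights draws from a convenient proposal toward the target. A sketch with toy densities chosen purely for illustration (target "posterior" N(2, 0.5^2), proposal N(0, 2^2)):

```python
import numpy as np

rng = np.random.default_rng(9)

def npdf(x, m, s):
    """Normal density, used to form the importance ratios below."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

S = 200_000
theta = rng.normal(0.0, 2.0, S)                      # proposal draws
w = npdf(theta, 2.0, 0.5) / npdf(theta, 0.0, 2.0)    # target/proposal ratios
w /= w.sum()                                         # self-normalize

# Predictive draws y_new | theta ~ N(theta, 1), weighted toward the target
y_new = rng.normal(theta, 1.0)
est = np.sum(w * y_new)      # estimates E[y_new | y] = 2 under the target
print(est)
```

The proposal must cover the target's support; weights blow up when it doesn't, which is the failure mode PSIS-style diagnostics are designed to catch.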

Markov Chain Monte Carlo

MCMC methods generate correlated samples from the posterior by constructing a Markov chain whose stationary distribution is the target posterior. Key algorithms:

  • Metropolis-Hastings: proposes new parameter values and accepts/rejects based on the posterior ratio
  • Gibbs sampling: cycles through parameters one at a time, sampling each from its full conditional distribution
  • Hamiltonian Monte Carlo (HMC): uses gradient information to propose distant, high-probability states, making it efficient for high-dimensional problems

MCMC samples are asymptotically exact, meaning they converge to the true posterior given enough iterations. These methods are especially effective for hierarchical and non-conjugate models.
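A minimal random-walk Metropolis-Hastings sketch, targeting a stand-in N(3, 1) "posterior" (a real problem would use the model's log p(\theta|y) up to a constant):

```python
import numpy as np

rng = np.random.default_rng(10)

def log_post(theta):
    """Unnormalized log density of the toy N(3, 1) target."""
    return -0.5 * (theta - 3.0) ** 2

theta = 0.0
samples = []
for _ in range(20_000):
    prop = theta + rng.normal(0, 1.0)      # symmetric random-walk proposal
    # accept with probability min(1, posterior ratio)
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    samples.append(theta)

samples = np.array(samples[2_000:])        # discard burn-in
print(samples.mean(), samples.std())       # near the target's 3 and 1
```

Only the *ratio* of posterior densities appears in the accept step, which is why the normalizing constant never needs to be computed.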

Approximate Bayesian Computation

ABC methods are designed for models where the likelihood p(y|\theta) is intractable or too expensive to evaluate. The basic idea:

  1. Sample \theta from the prior
  2. Simulate a dataset y_{sim} from the model given \theta
  3. Accept \theta if y_{sim} is "close enough" to the observed data (measured by summary statistics and a tolerance threshold)

Variants include rejection ABC, MCMC-ABC, and sequential Monte Carlo ABC. These are commonly used in population genetics, ecology, and systems biology where models are simulation-based.
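A rejection-ABC sketch for a normal mean with known sd (pretending the likelihood is unavailable and we can only simulate; prior, tolerance, and summary are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(11)

y_obs = rng.normal(2.0, 1.0, size=30)    # stand-in observed data
s_obs = y_obs.mean()                     # summary statistic

accepted = []
for _ in range(50_000):
    theta = rng.uniform(-5, 5)           # 1. sample theta from a flat prior
    y_sim = rng.normal(theta, 1.0, size=30)   # 2. simulate a dataset
    if abs(y_sim.mean() - s_obs) < 0.1:  # 3. keep theta if summaries match
        accepted.append(theta)

accepted = np.array(accepted)
# accepted draws approximate the (ABC) posterior for theta
print(accepted.mean(), len(accepted))
```

Tightening the tolerance makes the approximation closer to the true posterior at the cost of a lower acceptance rate; the choice of summary statistic matters just as much.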

Bayesian vs. frequentist prediction

Philosophical differences

The core distinction: Bayesian statistics treats parameters as random variables with probability distributions, while frequentist statistics treats them as fixed, unknown constants.

  • Bayesian inference uses prior distributions to encode existing knowledge, then updates to a posterior via Bayes' theorem
  • Frequentist inference relies on sampling distributions and long-run frequency properties
  • Bayesian predictions naturally integrate over parameter uncertainty through the predictive distribution
  • Frequentist predictions typically plug in point estimates, requiring separate procedures (like bootstrap) to capture parameter uncertainty

Performance comparisons

  • Small samples: Bayesian methods tend to perform better because informative priors regularize estimates and prevent overfitting
  • Large samples: the prior's influence fades, and Bayesian and frequentist predictions often converge
  • Complex models: Bayesian approaches handle hierarchical and latent variable models more naturally
  • Computational cost: frequentist methods can be faster for simple models, but modern MCMC and variational methods have made Bayesian computation practical for many applications

Hybrid approaches

Several methods blend Bayesian and frequentist ideas:

  • Empirical Bayes: estimates prior hyperparameters from the data itself, using frequentist methods to set the prior. This is practical when you lack strong prior information but still want the benefits of a Bayesian framework.
  • Frequentist model averaging: applies Bayesian-style weighting to frequentist model predictions
  • These hybrid methods are common in applied settings where pure approaches have practical limitations

Applications of Bayesian prediction

Financial forecasting

Bayesian time series models predict stock prices, interest rates, and economic indicators while quantifying forecast uncertainty. Hierarchical models capture dependencies across financial instruments, and Bayesian portfolio optimization balances risk and return using full posterior distributions rather than point estimates.

Environmental modeling

Bayesian hierarchical and spatial-temporal models forecast climate impacts, species distributions, and air/water quality. These models are well-suited to environmental science because they can integrate multiple data sources (satellite, ground-station, expert judgment) and propagate uncertainty through complex physical systems.

Clinical trial predictions

Bayesian methods predict treatment effects and patient outcomes, and they enable adaptive trial designs that modify enrollment or dosing rules as data accumulates. Bayesian longitudinal models track disease progression, and subgroup analyses can inform personalized treatment decisions while properly accounting for multiplicity and uncertainty.