🔀Stochastic Processes Unit 3 Review


3.5 Gaussian processes


Written by the Fiveable Content Team • Last updated August 2025

Gaussian process definition

A Gaussian process (GP) is a collection of random variables where any finite subset has a joint Gaussian distribution. This generalizes the familiar Gaussian (normal) distribution from finite-dimensional vectors to functions, letting you place a probability distribution over an entire space of functions rather than just over numbers or vectors.

Two objects fully specify a GP: a mean function and a covariance function. Everything else follows from these two choices.

Gaussian random variables

A quick refresher on the building blocks. A Gaussian random variable follows a normal distribution, completely determined by its mean and variance. The crucial property for GPs is that any finite collection of jointly Gaussian random variables is itself Gaussian. This closure under finite marginalization is exactly what makes the GP definition consistent: no matter which finite set of input points you pick, you get a valid multivariate Gaussian.

Mean function

The mean function m(x) gives the expected value of the process at each input x:

m(x) = \mathbb{E}[f(x)]

In practice, m(x) is often set to zero. This isn't as restrictive as it sounds: the covariance function and observed data will pull the posterior away from zero where needed. When you do have strong prior knowledge about the trend (e.g., a known linear drift), encoding it in m(x) can improve predictions.

Covariance function

The covariance function (also called the kernel) k(x, x') specifies how the random variables at two input points x and x' co-vary:

k(x, x') = \text{Cov}(f(x), f(x'))

This is where you encode your assumptions about the function: how smooth it is, how quickly correlations decay with distance, whether it's periodic, etc. The covariance function must be positive semi-definite for any finite collection of inputs. Popular choices include the squared exponential, Matérn, and periodic kernels.

Gaussian process properties

GPs have several structural properties that make inference tractable and elegant. These aren't just theoretical niceties; they're the reason GP computations work out cleanly.

Marginalization property

If f ~ GP(m, k) and you select any subset of input points, the marginal distribution over that subset is still Gaussian (with mean and covariance given by restricting m and k to those points). This means you can ignore variables you don't care about without breaking the model. It's what lets you evaluate the GP at arbitrary test points without worrying about all the points you haven't observed.

Conditioning property

When you observe some function values, the conditional distribution of the remaining variables is again a Gaussian process. The posterior mean shifts toward the observed data, and the posterior variance shrinks near observations. This is the engine behind GP regression: you condition on training data and get a full posterior GP over the function.

Translation invariance

A GP is stationary (translation invariant) when its covariance function depends only on the displacement between inputs:

k(x, x') = k(x - x')

Stationarity means the statistical properties of the process don't change as you shift across the input space. The squared exponential and Matérn kernels are both stationary. Non-stationary kernels (like the linear kernel) allow the function's behavior to vary by region.

Isotropic vs. anisotropic

  • An isotropic covariance function depends only on the Euclidean distance ||x - x'||, so it's invariant to both translations and rotations. A single lengthscale governs all input dimensions equally.
  • An anisotropic covariance function uses different lengthscales along different input dimensions (sometimes called automatic relevance determination, or ARD). This gives the model more flexibility when some input features vary on very different scales.
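The distinction is easy to see in code. Below is a minimal pure-Python sketch of an ARD-style squared exponential kernel with one lengthscale per input dimension (function and parameter names are illustrative, not from the text):

```python
import math

def ard_rbf(x, x2, lengthscales, sigma2=1.0):
    """Anisotropic (ARD) squared exponential kernel for vector inputs."""
    s = sum(((a - b) / l) ** 2 for a, b, l in zip(x, x2, lengthscales))
    return sigma2 * math.exp(-0.5 * s)

# With a very long lengthscale in the second dimension, differences there
# barely affect the covariance: that feature is effectively irrelevant.
print(ard_rbf((0.0, 0.0), (0.0, 5.0), lengthscales=(1.0, 100.0)))  # close to 1
print(ard_rbf((0.0, 0.0), (5.0, 0.0), lengthscales=(1.0, 100.0)))  # close to 0
```

Making one lengthscale very large is how ARD "switches off" an input dimension during hyperparameter learning.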

Covariance functions

The kernel is the single most important modeling choice in a GP. It controls smoothness, periodicity, amplitude, and how the function extrapolates beyond observed data.

Stationary vs. non-stationary

  • Stationary kernels (squared exponential, Matérn) depend only on x - x'. They assume the function's character is uniform across the input space.
  • Non-stationary kernels (linear, polynomial, neural network kernel) let the function's variability change with location. These are useful when the data has trends or region-dependent behavior.

Squared exponential

The squared exponential (SE), also called the RBF kernel, is the most widely used kernel:

k(x, x') = \sigma^2 \exp\!\left(-\frac{\|x - x'\|^2}{2l^2}\right)

  • σ² (signal variance) controls the overall amplitude of function variations.
  • l (lengthscale) controls how far apart two points need to be before their function values become nearly uncorrelated.

Functions drawn from an SE kernel are infinitely differentiable, which makes them very smooth. This is sometimes too smooth for real data, which is why the Matérn class exists.
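The roles of the two hyperparameters are easy to see numerically. A minimal pure-Python sketch of the SE kernel for scalar inputs (names are illustrative):

```python
import math

def rbf(x, x2, sigma2=1.0, l=1.0):
    """Squared exponential (RBF) kernel for scalar inputs."""
    return sigma2 * math.exp(-((x - x2) ** 2) / (2 * l ** 2))

# Covariance decays smoothly with distance:
print(rbf(0.0, 0.0))  # 1.0 at zero separation (the signal variance)
print(rbf(0.0, 1.0))  # exp(-1/2) ~ 0.607 at one lengthscale apart
print(rbf(0.0, 3.0))  # ~0.011 at three lengthscales: nearly uncorrelated
```

Doubling the lengthscale stretches this decay curve horizontally; doubling σ² scales it vertically.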

Matérn class

The Matérn kernel generalizes the SE by adding a smoothness parameter ν:

k(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,\|x - x'\|}{l}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\,\|x - x'\|}{l}\right)

where K_ν is the modified Bessel function of the second kind.

  • As ν → ∞, the Matérn converges to the SE kernel.
  • At ν = 1/2, it reduces to the exponential kernel (Ornstein-Uhlenbeck process), which produces rough, non-differentiable sample paths.
  • The most common choices in practice are ν = 3/2 (once differentiable) and ν = 5/2 (twice differentiable), which often fit real-world data better than the SE.
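One reason the half-integer values are so popular: they have simple closed forms that avoid the Bessel function entirely. A pure-Python sketch (names are illustrative):

```python
import math

def matern(r, nu, sigma2=1.0, l=1.0):
    """Matern kernel as a function of distance r >= 0, using the closed
    forms available at the common half-integer smoothness values."""
    s = math.sqrt(2 * nu) * r / l
    if nu == 0.5:
        return sigma2 * math.exp(-s)                        # exponential / OU
    if nu == 1.5:
        return sigma2 * (1 + s) * math.exp(-s)              # once differentiable
    if nu == 2.5:
        return sigma2 * (1 + s + s * s / 3) * math.exp(-s)  # twice differentiable
    raise ValueError("closed form implemented only for nu in {0.5, 1.5, 2.5}")

# All variants agree at r = 0 (the signal variance) and decay with distance:
print(matern(0.0, 0.5), matern(0.0, 1.5), matern(0.0, 2.5))  # 1.0 1.0 1.0
```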

Periodic covariance functions

When the underlying function repeats, a periodic kernel is appropriate. A standard choice is the exponential sine squared kernel:

k(x, x') = \sigma^2 \exp\!\left(-\frac{2\sin^2(\pi |x - x'| / p)}{l^2}\right)

where p is the period. You can also multiply a periodic kernel by a stationary kernel (like the SE) to model functions whose periodic pattern slowly evolves over time.
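A quick sketch of the exponential sine squared kernel makes the periodicity visible (pure Python, illustrative names):

```python
import math

def periodic(x, x2, sigma2=1.0, l=1.0, p=2.0):
    """Exponential sine squared kernel with period p."""
    return sigma2 * math.exp(-2 * math.sin(math.pi * abs(x - x2) / p) ** 2 / l ** 2)

# Inputs exactly one period apart are perfectly correlated:
print(periodic(0.0, 2.0, p=2.0))  # 1.0
# Inputs half a period apart are maximally decorrelated:
print(periodic(0.0, 1.0, p=2.0))  # exp(-2) ~ 0.135
```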


Gaussian process regression

Gaussian process regression (GPR) is a non-parametric Bayesian approach to regression. Instead of fitting a fixed number of parameters, GPR places a GP prior over the function and updates it with observed data to get a posterior GP. The result is both a prediction and a calibrated uncertainty estimate at every test point.

Bayesian linear regression

Bayesian linear regression is a useful stepping stone. You place a Gaussian prior on the weight vector w, assume a linear model f(x) = wᵀx, and add Gaussian noise. Because everything is Gaussian, the posterior over w is available in closed form. This turns out to be a special case of GPR with a linear covariance function.

Weight-space view

In the weight-space view, you work with the distribution over parameters:

  1. Specify a Gaussian prior w ~ N(0, Σ_p).
  2. Define the likelihood p(y | X, w) = N(Xᵀw, σₙ²I).
  3. Compute the posterior p(w | X, y) using Bayes' rule (still Gaussian).
  4. Predict at new inputs by integrating over the posterior on w.

This view generalizes to non-linear models by replacing raw inputs with feature vectors φ(x), but the number of features can become very large or infinite.
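The weight-space recipe is easiest to see in the scalar (one-weight) special case, where the posterior is two lines of arithmetic. This toy sketch uses illustrative names (s_p for the prior variance, s_n for the noise variance):

```python
# 1-D Bayesian linear regression: prior w ~ N(0, s_p),
# likelihood y_i ~ N(w * x_i, s_n). The posterior over w is Gaussian:
# precision = 1/s_p + sum(x_i^2)/s_n, mean = (sum(x_i y_i)/s_n) / precision.
def posterior_w(xs, ys, s_p=1.0, s_n=0.1):
    precision = 1.0 / s_p + sum(x * x for x in xs) / s_n
    mean = (sum(x * y for x, y in zip(xs, ys)) / s_n) / precision
    return mean, 1.0 / precision

# Noise-free data generated by w = 2 pulls the posterior mean toward 2
# and shrinks the variance far below the prior's:
mean, var = posterior_w([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(mean, 3), var < 0.1)  # 1.986 True
```

The posterior mean sits just short of 2 because the prior still pulls slightly toward zero; more data would remove that shrinkage.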

Function-space view

The function-space view sidesteps parameters entirely. You place a GP prior directly on ff:

f \sim \mathcal{GP}(m, k)

Given n training observations (X, y) with noise variance σₙ², the predictive distribution at a test point x_* is Gaussian with:

  • Mean: \bar{f}(x_*) = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}
  • Variance: \text{Var}(f(x_*)) = k(x_*, x_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*

where K is the n × n training covariance matrix and k_* is the vector of covariances between x_* and the training points. The key computational cost is inverting the n × n matrix, which is O(n³).

The weight-space and function-space views give identical predictions. The function-space view is more natural for GPs because it works directly with the kernel and avoids explicit features.
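The predictive equations above translate directly into code. Below is a dependency-free sketch of exact GP regression with an RBF kernel, using a small Gaussian-elimination solver where a real implementation would use a Cholesky factorization; all names are illustrative:

```python
import math

def rbf(a, b, l=1.0):
    """Squared exponential kernel for scalar inputs."""
    return math.exp(-(a - b) ** 2 / (2 * l * l))

def solve(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    v = [0.0] * n
    for r in range(n - 1, -1, -1):
        v[r] = (M[r][n] - sum(M[r][k] * v[k] for k in range(r + 1, n))) / M[r][r]
    return v

def gp_predict(X, y, x_star, noise=1e-4, l=1.0):
    """Posterior mean and variance at x_star, per the predictive equations."""
    n = len(X)
    K = [[rbf(X[i], X[j], l) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    alpha = solve(K, y)                           # (K + noise*I)^{-1} y
    k_star = [rbf(x, x_star, l) for x in X]
    mean = sum(ks * a for ks, a in zip(k_star, alpha))
    v = solve(K, k_star)                          # (K + noise*I)^{-1} k_*
    var = rbf(x_star, x_star, l) - sum(ks * vi for ks, vi in zip(k_star, v))
    return mean, var

X, y = [0.0, 1.0, 2.0], [0.0, 1.0, 0.0]
m_train, v_train = gp_predict(X, y, 1.0)   # at a training input
m_far, v_far = gp_predict(X, y, 5.0)       # far from all training data
print(abs(m_train - 1.0) < 0.01, v_train < 0.01)  # True True: data pins it down
print(abs(m_far) < 0.1, v_far > 0.9)              # True True: reverts to prior
```

Note how the posterior behaves exactly as the conditioning property promises: near observations the variance collapses, and far away the prediction reverts to the prior mean and variance.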

Hyperparameter selection

The kernel parameters (lengthscale l, signal variance σ²) and the noise variance σₙ² are hyperparameters. They strongly affect model behavior:

  • Too short a lengthscale → the model overfits, interpolating noise.
  • Too long a lengthscale → the model underfits, producing overly smooth predictions.

Common approaches for setting hyperparameters:

  • Type-II maximum likelihood (empirical Bayes): Maximize the log marginal likelihood log p(y | X, θ) with respect to θ. This is the most common approach and has a built-in Occam's razor that penalizes overly complex models.
  • Cross-validation: Evaluate predictive performance on held-out folds.
  • Full Bayesian treatment: Place priors on the hyperparameters and use MCMC to integrate over them. More principled but computationally heavier.
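For a Gaussian likelihood the marginal likelihood is available in closed form: log p(y | X, θ) = -½ yᵀK_y⁻¹y - ½ log|K_y| - (n/2) log 2π, with K_y = K + σₙ²I. A pure-Python sketch of type-II ML as a simple grid scan over lengthscales (illustrative only; real implementations use gradients and a Cholesky factorization):

```python
import math

def k_rbf(a, b, l):
    return math.exp(-(a - b) ** 2 / (2 * l * l))

def log_marglik(X, y, l, noise):
    """log p(y|X) = -1/2 y^T Ky^-1 y - 1/2 log|Ky| - (n/2) log 2pi,
    Ky = K + noise*I, via elimination (Ky is SPD, so no pivoting needed)."""
    n = len(X)
    M = [[k_rbf(X[i], X[j], l) + (noise if i == j else 0.0) for j in range(n)]
         + [y[i]] for i in range(n)]
    logdet = 0.0
    for c in range(n):
        logdet += math.log(M[c][c])        # log|Ky| = sum of log pivots
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    alpha = [0.0] * n
    for r in range(n - 1, -1, -1):
        alpha[r] = (M[r][n] - sum(M[r][k] * alpha[k]
                                  for k in range(r + 1, n))) / M[r][r]
    quad = sum(yi * ai for yi, ai in zip(y, alpha))   # y^T Ky^-1 y
    return -0.5 * quad - 0.5 * logdet - 0.5 * n * math.log(2 * math.pi)

# Type-II ML: scan a grid of candidate lengthscales and keep the best.
X, y = [0.0, 0.5, 1.0, 1.5, 2.0], [0.0, 0.48, 0.84, 1.0, 0.91]
best = max([0.1, 0.3, 1.0, 3.0], key=lambda l: log_marglik(X, y, l, 0.01))
print(best)
```

The quadratic term rewards fitting the data while the log-determinant term penalizes overly flexible kernels; their balance is the built-in Occam's razor mentioned above.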

Gaussian process classification

Gaussian process classification (GPC) adapts the GP framework to predict discrete class labels. The core idea: place a GP prior on a latent function f(x), then pass it through a squashing function (link function) to get class probabilities. Because the likelihood is no longer Gaussian, exact inference is intractable, so approximations are needed.

Probit likelihood

For binary classification, the probit likelihood maps the latent function to a class probability via the standard normal CDF Φ:

p(y = +1 \mid f(x)) = \Phi(f(x))

The probit is analytically convenient because Gaussian integrals against Φ have closed-form solutions. An alternative is the logistic (sigmoid) likelihood, which behaves similarly but lacks these closed forms.
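The probit link itself is one line, since Φ can be written in terms of the error function:

```python
import math

def probit(f):
    """p(y = +1 | f) via the standard normal CDF:
    Phi(f) = (1 + erf(f / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(f / math.sqrt(2.0)))

print(probit(0.0))   # 0.5: a zero latent value means maximal class uncertainty
print(probit(3.0))   # ~0.9987: large latent values give confident predictions
print(probit(-3.0))  # ~0.0013, by the symmetry Phi(-f) = 1 - Phi(f)
```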

Laplace approximation

The Laplace approximation finds a Gaussian approximation to the non-Gaussian posterior over the latent function values:

  1. Find the posterior mode f̂ by iteratively solving a penalized likelihood problem (Newton's method).
  2. Compute the Hessian of the negative log posterior at f̂.
  3. Approximate the posterior as q(f) = N(f̂, (-∇∇ log p(f | y))⁻¹).

This is fast and simple but can be inaccurate when the true posterior is skewed or multimodal.
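A one-dimensional toy makes the recipe concrete: with a N(0, 1) prior on a single latent value and one probit observation y = +1, Newton's method finds the mode and the curvature there gives the Gaussian approximation. This scalar sketch is illustrative only, not the full GPC algorithm:

```python
import math

def phi(z):   # standard normal pdf
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def Phi(z):   # standard normal cdf
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Posterior over one latent value f: prior N(0, 1), one probit observation
# y = +1, so log p(f | y) = -f^2/2 + log Phi(f) + const.
f = 0.0
for _ in range(20):                  # step 1: Newton's method to the mode
    r = phi(f) / Phi(f)
    grad = -f + r                    # d/df log posterior
    hess = -1.0 - r * (f + r)        # step 2: curvature (always < -1 here)
    f -= grad / hess

var = -1.0 / hess                    # step 3: Laplace variance at the mode
print(f > 0, var < 1.0)              # True True: pulled positive, sharper than prior
```

The single positive observation shifts the mode above zero, and the extra curvature from the likelihood makes the approximate posterior narrower than the prior.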

Expectation propagation

Expectation propagation (EP) provides a more refined Gaussian approximation by iteratively refining local approximations to each likelihood factor:

  1. Initialize approximate Gaussian factors for each data point.
  2. For each factor, remove it from the current approximation, compute the "cavity" distribution, then update the factor by matching moments (minimizing KL divergence locally).
  3. Repeat until convergence.

EP tends to be more accurate than the Laplace approximation, especially for multi-class problems, but each iteration is more expensive and convergence isn't always guaranteed.

Sparse Gaussian processes

Exact GP inference scales as O(n³) in time and O(n²) in memory, which becomes prohibitive for datasets larger than roughly 10,000 points. Sparse methods address this by summarizing the data through a smaller set of m inducing points.

Inducing point methods

The idea is to introduce m ≪ n inducing inputs Z and their corresponding function values u = f(Z). The full covariance matrix is then approximated using the covariances between training points and inducing points:

K \approx K_{nm} K_{mm}^{-1} K_{mn}

Different approximation schemes include:

  • Subset of Regressors (SoR): Uses the low-rank approximation directly, for both training and prediction.
  • Deterministic Training Conditional (DTC): Uses the low-rank approximation during training but the exact conditional at test points, giving more sensible predictive variances.
  • Fully Independent Training Conditional (FITC): Adds a diagonal correction that preserves the exact diagonal of K.

All of these reduce complexity to O(nm²), making GPs feasible for much larger datasets. The inducing point locations can be optimized jointly with the hyperparameters.
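A quick numerical check of the low-rank approximation, in pure Python (the small jitter added before inversion is a standard stabilization trick; all names are illustrative):

```python
import math

def rbf(a, b, l=1.0):
    return math.exp(-(a - b) ** 2 / (2 * l * l))

def kmat(A, B):
    return [[rbf(a, b) for b in B] for a in A]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inv(A, jitter=1e-10):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [[A[i][j] + (jitter if i == j else 0.0) for j in range(n)]
         + [1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

X = [0.0, 0.5, 1.0, 1.5, 2.0]    # n = 5 training inputs
Z = [0.25, 1.0, 1.75]            # m = 3 inducing inputs
K_approx = matmul(matmul(kmat(X, Z), inv(kmat(Z, Z))), kmat(Z, X))
K_exact = kmat(X, X)
err = max(abs(K_approx[i][j] - K_exact[i][j])
          for i in range(5) for j in range(5))
print(err)  # small: a few well-placed inducing points summarize the matrix
```

Because the RBF kernel's eigenvalues decay quickly, three inducing points already reproduce the full 5 × 5 covariance matrix closely; this rapid decay is what sparse methods exploit.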

Variational inference

Titsias (2009) introduced a variational approach that treats the inducing points as variational parameters rather than model parameters. The method maximizes a variational lower bound (ELBO) on the log marginal likelihood:

\log p(\mathbf{y}) \geq \mathbb{E}_{q(\mathbf{u})}[\log p(\mathbf{y} \mid \mathbf{u})] - \mathrm{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u}))

This formulation avoids overfitting the inducing point locations and provides a principled trade-off between approximation quality and computational cost. The gap between the ELBO and the true marginal likelihood also serves as a diagnostic for how well the sparse approximation is working.

Stochastic variational inference

Stochastic variational inference (SVI) extends the variational framework to handle very large datasets by using mini-batches:

  1. Sample a mini-batch of training points.
  2. Compute a noisy estimate of the ELBO gradient.
  3. Update the variational parameters via stochastic gradient ascent.

Because each step only touches a subset of the data, SVI scales to millions of data points. It combines naturally with the inducing point framework and is the basis for most modern scalable GP implementations (e.g., in GPflow or GPyTorch).
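The three-step loop can be sketched with a deliberately trivial "model", a single parameter fit to the mean of a large dataset, to show why noisy mini-batch gradients are enough (pure Python, illustrative only, not an actual GP ELBO):

```python
import random

random.seed(0)
# 100k noisy observations of a quantity whose true value is 2.0
data = [2.0 + random.gauss(0.0, 0.5) for _ in range(100_000)]

theta, lr = 0.0, 0.05      # one "variational parameter" and a step size
for step in range(2000):
    batch = random.sample(data, 64)                       # 1. mini-batch
    grad = sum(theta - x for x in batch) / len(batch)     # 2. noisy gradient
    theta -= lr * grad                                    # 3. stochastic step

print(abs(theta - 2.0) < 0.1)  # True: each step touched only 64 points
```

The gradient from each mini-batch is noisy but unbiased, so the averaged updates still converge; the same logic applies to the ELBO gradients in a sparse GP.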


Gaussian process latent variable models

Gaussian process latent variable models (GPLVMs) flip the usual GP setup: instead of mapping known inputs to outputs, they learn a low-dimensional latent representation of high-dimensional observed data. Think of it as non-linear dimensionality reduction with uncertainty quantification.

Probabilistic PCA

Probabilistic PCA (PPCA) is the linear special case. It assumes:

\mathbf{y} = W\mathbf{x} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)

where x is a low-dimensional latent variable and W is a linear mapping. PPCA gives a probabilistic interpretation of standard PCA and can estimate the latent dimensionality from data. It's a useful baseline, but it can only capture linear structure.

Gaussian process latent variable model

The GPLVM replaces the linear mapping W with a GP, allowing non-linear relationships between latent and observed spaces. Each output dimension is modeled as an independent GP over the latent inputs:

y_d = f_d(\mathbf{x}) + \epsilon_d, \quad f_d \sim \mathcal{GP}(0, k)

The latent coordinates x are learned by maximizing the marginal likelihood (integrating out the GP functions). This is a non-convex optimization, so initialization matters; starting from PCA coordinates is standard practice.

Dynamical variants

For time-series or sequential data, dynamical extensions add temporal structure to the latent space:

  • Gaussian Process Dynamical Model (GPDM): Places a GP prior on the latent dynamics x_{t+1} = g(x_t) + noise, so the latent trajectory is smooth and structured.
  • Variational GP Dynamical Systems (VGPDS): Uses variational inference for more scalable and principled learning of both the dynamics and the observation model.

These models have been applied to motion capture analysis, video synthesis, and speech processing, where the latent dynamics capture meaningful temporal patterns.

Gaussian processes for time series

GPs offer a natural framework for time series because they directly model temporal correlations through the kernel and provide uncertainty estimates that grow in regions with sparse data.

Autoregressive GPs

Autoregressive GPs (ARGPs) condition each observation on a window of previous observations, similar to classical AR(p) models. The difference is that the relationship between past and future values is modeled by a GP rather than fixed linear coefficients. This allows the autoregressive structure to be non-linear. Hyperparameters (kernel parameters and the AR order) are typically learned via maximum likelihood or MCMC.

Gaussian process state space models

GP state space models (GP-SSMs) combine GPs with the state space framework:

  • A transition GP models the latent state dynamics: x_{t+1} = f(x_t) + noise.
  • An observation model (another GP or a parametric function) maps latent states to observations.

Because both the transition and observation models can be non-linear, exact inference is intractable. Approximate methods like variational inference, particle filtering, or combinations of the two are used.

Online learning with GPs

Standard GP inference reprocesses all data when a new point arrives, which is impractical for streaming settings. Online GP methods address this by:

  • Maintaining a fixed-size set of inducing points and updating them incrementally.
  • Using recursive Kalman-style updates when the kernel has a state-space representation.
  • Applying stochastic variational inference with streaming mini-batches.

The goal is to keep computational cost per update roughly constant, independent of how much data has been seen so far.

Advanced topics in Gaussian processes

Multi-output Gaussian processes

Multi-output GPs (MOGPs) model several correlated outputs jointly. The kernel must now capture correlations both across inputs and across outputs. Common constructions include:

  • Linear Model of Coregionalization (LMC): Each output is a linear combination of shared latent GPs.
  • Intrinsic Coregionalization Model (ICM): A special case of LMC with a single shared kernel scaled by an output correlation matrix.

MOGPs are useful for multi-task learning, sensor fusion, and any setting where outputs share statistical structure.

Deep Gaussian processes

Deep GPs (DGPs) stack multiple GP layers, where the output of one layer feeds into the next as input. This creates a hierarchical model capable of learning highly non-linear functions with richer representations than a single-layer GP. The trade-off is that the marginal likelihood becomes intractable, requiring approximate inference (typically doubly stochastic variational inference). DGPs can be seen as a Bayesian, non-parametric analog of deep neural networks.

Bayesian optimization with GPs

Bayesian optimization uses a GP surrogate model to efficiently optimize expensive black-box functions (e.g., tuning neural network hyperparameters, optimizing experimental conditions):

  1. Fit a GP to the observed function evaluations.
  2. Use an acquisition function (e.g., Expected Improvement, Upper Confidence Bound) to decide where to evaluate next, balancing exploration of uncertain regions with exploitation of promising regions.
  3. Evaluate the true function at the chosen point and update the GP.
  4. Repeat until the budget is exhausted.

The GP's uncertainty estimates are what make this work: they tell you where the function might be better than the current best, guiding the search efficiently.
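The acquisition step can be made concrete with Expected Improvement, which has a closed form given the GP posterior mean and standard deviation at a candidate point (a sketch for maximization; names are illustrative):

```python
import math

def norm_pdf(z):
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def expected_improvement(mu, sigma, f_best):
    """EI for maximization, from the GP posterior mean/std at a candidate:
    EI = (mu - f_best) * Phi(z) + sigma * phi(z), z = (mu - f_best)/sigma."""
    if sigma == 0.0:
        return max(mu - f_best, 0.0)
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm_cdf(z) + sigma * norm_pdf(z)

# Exploration is built in: with equal posterior means, the more uncertain
# candidate has higher EI, so the search is drawn to unexplored regions.
print(expected_improvement(1.0, 0.5, f_best=1.0) <
      expected_improvement(1.0, 1.0, f_best=1.0))  # True
```

Exploitation is built in too: a candidate whose mean already exceeds the incumbent gets positive EI even with zero uncertainty.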

Gaussian processes for big data

Scaling GPs beyond the O(n³) barrier is an active area of research. The main strategies:

  • Sparse approximations (inducing points, variational methods) reduce complexity to O(nm²) as discussed above.
  • Structured kernel interpolation (SKI/KISS-GP): Exploits grid structure and Kronecker/Toeplitz algebra for kernels that factor across dimensions, achieving near-linear scaling.
  • Local GP methods: Partition the input space and fit independent GPs on each partition, then combine predictions. Simple to parallelize but can introduce discontinuities at boundaries.
  • Distributed computing: Spread the matrix computations across multiple machines using frameworks like MapReduce or specialized GPU implementations.

In practice, combining sparse methods with GPU acceleration (as in GPyTorch) is currently the most common approach for datasets in the hundreds of thousands to millions range.