🔀Stochastic Processes Unit 2 Review

2.3 Joint probability distributions

Written by the Fiveable Content Team • Last updated August 2025
Joint probability distribution definition

Joint probability distributions describe how two or more random variables behave together. Rather than looking at each variable in isolation, a joint distribution captures the probabilistic relationship between them, telling you how likely different combinations of values are to occur simultaneously. This is foundational in stochastic processes, where you're almost always dealing with multiple uncertain quantities evolving together.

Discrete vs continuous

Discrete joint distributions apply when the random variables take on a countable number of values (integers, specific categories). Continuous joint distributions apply when the variables range over uncountably many values (real numbers on an interval or the whole real line).

The distinction matters because it changes how you compute probabilities:

  • Discrete case: you sum the joint PMF over the outcomes of interest
  • Continuous case: you integrate the joint PDF over a region of interest

Marginal vs conditional distributions

Marginal distributions extract the behavior of a single variable from the joint distribution, ignoring the others.

  • Discrete: sum the joint PMF over all values of the other variables
  • Continuous: integrate the joint PDF over all values of the other variables
  • The result is a univariate distribution for that one variable alone

Conditional distributions describe how one variable behaves given that you know the value of another.

  • Calculated by dividing the joint PMF/PDF by the marginal of the conditioning variable:

p(x | y) = \frac{p(x, y)}{p_Y(y)} \quad \text{or} \quad f(x | y) = \frac{f(x, y)}{f_Y(y)}

  • This tells you how the distribution of X shifts once you observe a specific value of Y
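As a concrete illustration of these formulas, here is a minimal Python sketch; the 2×2 joint PMF values and helper names are assumptions made up for the example:

```python
# Assumed joint PMF for X, Y in {0, 1}, stored as a dict of probabilities.
joint = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}

def marginal_y(joint, y):
    """p_Y(y): sum the joint PMF over all x values."""
    return sum(p for (xi, yi), p in joint.items() if yi == y)

def conditional_x_given_y(joint, x, y):
    """p(x | y) = p(x, y) / p_Y(y)."""
    return joint[(x, y)] / marginal_y(joint, y)

# p_Y(1) = 0.20 + 0.40 = 0.60, so p(X=0 | Y=1) = 0.20 / 0.60 = 1/3
print(marginal_y(joint, 1))
print(conditional_x_given_y(joint, 0, 1))
```

The same pattern (sum out for marginals, divide by a marginal for conditionals) extends to any number of discrete variables.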

Joint probability mass functions

A joint probability mass function (PMF) assigns a probability to each possible combination of values for discrete random variables. Formally, p(x_1, x_2, \ldots, x_n) maps from the set of all possible value-tuples to probabilities in [0, 1].

Two properties must hold for a valid joint PMF:

  • p(x_1, \ldots, x_n) \geq 0 for all combinations
  • \sum_{\text{all } (x_1, \ldots, x_n)} p(x_1, \ldots, x_n) = 1

Discrete random variables

Joint PMFs are defined over a countable sample space consisting of all possible value-tuples. Common multivariate discrete distributions include the multinomial (generalization of the binomial to multiple categories) and joint Poisson models.

Most univariate concepts extend naturally: you can compute expected values, variances, and moment generating functions from the joint PMF, just with sums over multiple indices instead of one.

Multivariate distributions

A multivariate PMF can be laid out in a table or matrix. For two variables X and Y, you'd have a grid where each cell (x_i, y_j) contains p(x_i, y_j). With rows indexed by x_i and columns by y_j, row sums give the marginal PMF of X, and column sums give the marginal PMF of Y.

From this table you can derive marginals, conditionals, and moments by applying the appropriate sums.
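The table view maps directly onto array operations. A sketch using an assumed 2×3 grid (numpy is assumed to be available):

```python
import numpy as np

# Assumed 2x3 joint PMF table: rows are X in {0, 1}, columns are Y in {0, 1, 2}.
P = np.array([
    [0.10, 0.05, 0.15],   # p(X=0, Y=j)
    [0.20, 0.25, 0.25],   # p(X=1, Y=j)
])

p_X = P.sum(axis=1)   # row sums: marginal PMF of X
p_Y = P.sum(axis=0)   # column sums: marginal PMF of Y

assert np.isclose(P.sum(), 1.0)   # valid PMF: total mass is 1
print(p_X, p_Y)
```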

Calculating probabilities

To find the probability of an event A defined by conditions on the variables, sum the PMF over all outcomes in A:

P(A) = \sum_{(x_1,\ldots,x_n) \in A} p(x_1,\ldots,x_n)

For complex events, the inclusion-exclusion principle and combinatorial counting techniques help you identify which outcomes belong to A.

Joint probability density functions

A joint probability density function (PDF) specifies a continuous multivariate distribution. Unlike a PMF, the value f(x_1, \ldots, x_n) at a single point is not a probability; it represents relative likelihood. To get an actual probability, you integrate the PDF over a region.

Continuous random variables

Joint PDFs apply when the random variables can take any value in a continuous range. Common examples include the multivariate normal, joint exponential, and Dirichlet distributions.

Working with densities lets you model continuous quantities (time, distance, temperature) without artificially discretizing them.

Multivariate density functions

A valid joint PDF f(x_1, \ldots, x_n) must satisfy:

  • f(x_1, \ldots, x_n) \geq 0 everywhere
  • \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_n) \, dx_1 \cdots dx_n = 1

Marginal PDFs are obtained by integrating out the other variables, and conditional PDFs are found by dividing the joint PDF by the appropriate marginal, just as in the discrete case.

Probability calculations with integrals

For an event A defined by conditions on continuous random variables:

P(A) = \int_{A} f(x_1,\ldots,x_n) \, dx_1 \cdots dx_n

This often requires setting up multiple integrals with carefully chosen limits that describe the region A. For complex regions, changing the order of integration or switching to polar/spherical coordinates can simplify the computation.
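A numeric sketch of such a region integral, assuming scipy is available and using an assumed joint PDF f(x, y) = e^(-x-y) for x, y > 0 (two independent Exp(1) variables):

```python
import numpy as np
from scipy.integrate import dblquad

# Compute P(X + Y <= 1) by integrating the assumed joint PDF over the
# triangular region 0 <= x <= 1, 0 <= y <= 1 - x.
# Note: dblquad's integrand takes its arguments in the order (y, x).
prob, abserr = dblquad(lambda y, x: np.exp(-x - y), 0, 1,
                       lambda x: 0, lambda x: 1 - x)

print(prob)   # analytic answer: 1 - 2/e ≈ 0.2642
```

The inner limits depend on x, which is exactly how the triangular region A is described by the integral's bounds.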

Joint cumulative distribution functions

The joint cumulative distribution function (CDF) is defined as:

F(x_1,\ldots,x_n) = P(X_1 \leq x_1,\ldots, X_n \leq x_n)

It gives the probability that every variable simultaneously falls at or below its specified value. The CDF works for both discrete and continuous distributions, making it a universal way to specify any joint distribution.

CDF definition for joint distributions

For discrete variables, the CDF is a cumulative sum:

F(x_1,\ldots,x_n) = \sum_{y_1 \leq x_1} \cdots \sum_{y_n \leq x_n} p(y_1,\ldots,y_n)

For continuous variables, it's a cumulative integral:

F(x_1,\ldots,x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f(y_1,\ldots,y_n) \, dy_n \cdots dy_1

You can recover the PDF from the CDF by taking mixed partial derivatives: f(x_1, \ldots, x_n) = \frac{\partial^n F}{\partial x_1 \cdots \partial x_n}.
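The PDF-from-CDF relationship can be checked symbolically. A sketch assuming sympy is available, with an assumed CDF for two independent Exp(1) variables:

```python
import sympy as sp

# Assumed joint CDF of two independent Exp(1) variables (valid for x, y > 0):
x, y = sp.symbols('x y', positive=True)
F = (1 - sp.exp(-x)) * (1 - sp.exp(-y))

# Mixed partial derivative recovers the joint PDF: f = d^2 F / dx dy.
f = sp.diff(F, x, y)
print(sp.simplify(f))
```

The result simplifies to e^(-x-y), the product of the two exponential densities, as independence predicts.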

Properties of joint CDFs

  • Monotonicity: F is non-decreasing in each argument. If x_i \leq y_i for all i, then F(x_1,\ldots,x_n) \leq F(y_1,\ldots,y_n).
  • Boundary behavior: F \to 1 as all arguments go to +\infty, and F \to 0 if any single argument goes to -\infty.
  • Marginal CDFs are obtained by sending the other arguments to +\infty:

F_{X_i}(x_i) = \lim_{x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_n \to \infty} F(x_1,\ldots,x_n)

  • Right-continuity: The joint CDF is right-continuous in each argument.

Relationship to probability

The CDF directly gives the probability of rectangular regions. For two variables, the probability that X falls in (a_1, b_1] and Y falls in (a_2, b_2] is:

P(a_1 < X \leq b_1,\, a_2 < Y \leq b_2) = F(b_1, b_2) - F(a_1, b_2) - F(b_1, a_2) + F(a_1, a_2)

This is an inclusion-exclusion over the four corners of the rectangle. For n variables, the formula generalizes to 2^n terms with alternating signs:

P(a_1 < X_1 \leq b_1, \ldots, a_n < X_n \leq b_n) = \sum_{\text{corners}} (-1)^{\text{number of } a_i\text{'s}} F(\text{corner})
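A quick numeric check of the two-variable formula, using an assumed CDF for two independent Exp(1) variables (numpy assumed available):

```python
import numpy as np

# Assumed joint CDF: X, Y independent Exp(1), F(x, y) = (1 - e^-x)(1 - e^-y).
def F(x, y):
    return (1.0 - np.exp(-x)) * (1.0 - np.exp(-y))

# Rectangle probability by inclusion-exclusion over the four corners.
a1, b1, a2, b2 = 0.5, 1.5, 0.2, 1.0
rect = F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)

# Independence cross-check: product of the two interval probabilities.
check = (np.exp(-a1) - np.exp(-b1)) * (np.exp(-a2) - np.exp(-b2))
print(rect, check)
```

Both quantities agree, confirming the four-corner inclusion-exclusion.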


Independent vs dependent variables

Independence and dependence describe whether knowing the value of one variable tells you anything about the others. This distinction has major consequences for how you work with joint distributions: independence dramatically simplifies calculations, while dependence requires you to carefully track how variables interact.

Definition of independence

Random variables X_1, \ldots, X_n are independent if their joint distribution factors into a product of marginals:

p(x_1,\ldots,x_n) = p_1(x_1) \cdots p_n(x_n) \quad \text{(discrete)}

f(x_1,\ldots,x_n) = f_1(x_1) \cdots f_n(x_n) \quad \text{(continuous)}

Equivalently, the joint CDF factors: F(x_1, \ldots, x_n) = F_1(x_1) \cdots F_n(x_n).

The intuition: knowing the realized values of some variables doesn't change the probability distribution of the others.

Factoring joint distributions

When variables are independent, the product structure lets you:

  • Compute joint probabilities by simply multiplying marginal probabilities
  • Treat each variable separately for expectations: E[g(X)h(Y)] = E[g(X)] \cdot E[h(Y)]
  • Derive that \text{Cov}(X, Y) = 0 (though the converse is not generally true)

Many results about sums, maxima, and other functions of random variables become tractable specifically because of this factorization.
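A sketch of the factorization test on a discrete table, with assumed marginals and the joint built to be independent by construction (numpy assumed available):

```python
import numpy as np

# Assumed marginals; the joint table is their outer product, so X and Y are
# independent by construction: p(x, y) = p_X(x) * p_Y(y).
P = np.outer([0.4, 0.6], [0.3, 0.7])

p_X = P.sum(axis=1)   # recover marginals from the joint table
p_Y = P.sum(axis=0)

# Independence check: does the joint factor into the product of its marginals?
independent = np.allclose(P, np.outer(p_X, p_Y))
print(independent)   # True
```

For a dependent table, the same comparison would fail for at least one cell.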

Conditional distributions for dependence

When variables are not independent, conditional distributions are the primary tool for describing their relationship:

p(x_1,\ldots,x_k \mid x_{k+1},\ldots,x_n) = \frac{p(x_1,\ldots,x_n)}{p(x_{k+1},\ldots,x_n)}

f(x_1,\ldots,x_k \mid x_{k+1},\ldots,x_n) = \frac{f(x_1,\ldots,x_n)}{f(x_{k+1},\ldots,x_n)}

The denominator must be nonzero (positive marginal probability or density at the conditioning point). Conditional distributions are central to Bayesian inference, where you update beliefs about unknown quantities after observing data.

Covariance and correlation

Covariance and correlation quantify the linear dependence between two random variables. They appear throughout probability and statistics, from the variance of sums to the multivariate normal distribution.

Measures of dependence

Covariance between X and Y:

\text{Cov}(X,Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]

  • Positive covariance: X and Y tend to be large (or small) together
  • Negative covariance: when one is large, the other tends to be small
  • Zero covariance: no linear association (but nonlinear dependence can still exist)

Correlation (Pearson correlation coefficient):

\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}}

  • Always between -1 and 1
  • \rho = \pm 1 means Y is an exact linear function of X
  • \rho = 0 means no linear relationship, but a nonlinear one may still exist
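Both quantities can be computed exactly from a joint PMF table. A sketch with an assumed 2×2 table (numpy assumed available):

```python
import numpy as np

# Assumed joint PMF over X, Y in {0, 1}: P[i, j] = p(X = xs[i], Y = ys[j]).
xs = np.array([0.0, 1.0])
ys = np.array([0.0, 1.0])
P = np.array([[0.3, 0.2],
              [0.1, 0.4]])

EX = (xs[:, None] * P).sum()
EY = (ys[None, :] * P).sum()
EXY = (xs[:, None] * ys[None, :] * P).sum()

cov = EXY - EX * EY                                # Cov = E[XY] - E[X]E[Y]
var_x = ((xs[:, None] ** 2) * P).sum() - EX ** 2
var_y = ((ys[None, :] ** 2) * P).sum() - EY ** 2
rho = cov / np.sqrt(var_x * var_y)                 # Pearson correlation
print(cov, rho)
```

Here the covariance comes out positive, matching the table's tendency for X and Y to be 1 together.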

Covariance matrix

For a random vector \mathbf{X} = (X_1, \ldots, X_n), the covariance matrix \Sigma is the n \times n matrix with (i,j) entry \text{Cov}(X_i, X_j).

Key properties:

  • Symmetric: \Sigma_{ij} = \Sigma_{ji} since \text{Cov}(X_i, X_j) = \text{Cov}(X_j, X_i)
  • Positive semi-definite: \mathbf{a}^T \Sigma \mathbf{a} \geq 0 for any vector \mathbf{a}
  • Diagonal entries are the variances: \Sigma_{ii} = \text{Var}(X_i)

The covariance matrix appears in the multivariate normal PDF, in the multivariate CLT, and in linear transformations of random vectors.
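These properties are easy to verify on a sample covariance matrix. A sketch with assumed random data (numpy assumed available):

```python
import numpy as np

# Assumed sample data: 1000 draws of a 3-dimensional random vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))

Sigma = np.cov(X, rowvar=False)   # 3x3 sample covariance matrix

assert np.allclose(Sigma, Sigma.T)                  # symmetric
assert (np.linalg.eigvalsh(Sigma) >= -1e-10).all()  # positive semi-definite
print(np.diag(Sigma))                               # diagonal = sample variances
```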

Correlation coefficient

The correlation matrix \mathbf{R} has (i,j) entry \rho(X_i, X_j). It's the covariance matrix of the standardized variables Z_i = (X_i - \mu_i)/\sigma_i.

  • Diagonal entries are all 1
  • Off-diagonal entries lie in [-1, 1]
  • Often easier to interpret than the covariance matrix because it's scale-free

Transformations of random vectors

Transformations create new random variables or vectors from existing ones. You might standardize variables, rotate coordinate systems, or apply nonlinear mappings. The goal is to find the distribution of the transformed vector given the distribution of the original.

Linear transformations

A linear transformation takes the form \mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}, where \mathbf{A} is an m \times n matrix and \mathbf{b} is an m \times 1 vector.

The mean and covariance transform as:

  • E[\mathbf{Y}] = \mathbf{A}\,E[\mathbf{X}] + \mathbf{b}
  • \text{Cov}(\mathbf{Y}) = \mathbf{A}\,\text{Cov}(\mathbf{X})\,\mathbf{A}^T

These formulas are used constantly in statistics and signal processing (PCA, Kalman filtering, whitening transforms, etc.).
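Both identities can be checked on data, since sample means and sample covariances transform the same way. A sketch with assumed data and an assumed transform (numpy assumed available):

```python
import numpy as np

# Assumed data and transform; the identities hold exactly for sample moments.
rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 2))
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
b = np.array([1.0, -1.0])

Y = X @ A.T + b                      # each row: y = A x + b

Sigma_X = np.cov(X, rowvar=False)
Sigma_Y = np.cov(Y, rowvar=False)

print(np.allclose(Y.mean(axis=0), A @ X.mean(axis=0) + b))   # True
print(np.allclose(Sigma_Y, A @ Sigma_X @ A.T))               # True
```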

Jacobian matrix

For a general (possibly nonlinear) transformation \mathbf{Y} = g(\mathbf{X}) where g is one-to-one and differentiable, the joint PDF of \mathbf{Y} is:

f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{X}}(g^{-1}(\mathbf{y})) \cdot |\det(J_{g^{-1}}(\mathbf{y}))|

Here J_{g^{-1}}(\mathbf{y}) is the Jacobian matrix of the inverse transformation, with (i,j) entry \frac{\partial x_i}{\partial y_j}.

The absolute value of the Jacobian determinant accounts for how the transformation stretches or compresses volume in the sample space. If the transformation expands a region, the density decreases proportionally, and vice versa.
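A symbolic walk-through of the formula, assuming sympy is available and using an assumed linear map (y1, y2) = (x1 + x2, x1 − x2) applied to two independent Exp(1) densities:

```python
import sympy as sp

# Assumed map and source density f_X(x1, x2) = e^(-x1) e^(-x2) for x1, x2 > 0.
x1, x2, y1, y2 = sp.symbols('x1 x2 y1 y2')

# Inverse transformation and its Jacobian matrix (entries dx_i/dy_j).
inv = sp.Matrix([(y1 + y2) / 2, (y1 - y2) / 2])
J = inv.jacobian([y1, y2])
det_J = sp.Abs(J.det())   # |det| = 1/2: the map doubles area, halving density

f_X = sp.exp(-x1) * sp.exp(-x2)
f_Y = f_X.subs({x1: inv[0], x2: inv[1]}) * det_J
print(sp.simplify(f_Y))   # exp(-y1)/2, valid where x1, x2 > 0 (i.e. |y2| < y1)
```

The factor 1/2 is exactly the volume correction described above.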

Distribution of transformed variables

Continuous case: The joint CDF of \mathbf{Y} = g(\mathbf{X}) is:

F_{\mathbf{Y}}(\mathbf{y}) = P(g(\mathbf{X}) \leq \mathbf{y}) = \int_{g(\mathbf{x}) \leq \mathbf{y}} f_{\mathbf{X}}(\mathbf{x}) \, d\mathbf{x}

For linear transformations with invertible \mathbf{A}, the Jacobian simplifies to J_{g^{-1}} = \mathbf{A}^{-1}, so |\det(J)| = 1/|\det(\mathbf{A})|.

Discrete case: The PMF of \mathbf{Y} is found by summing over all pre-images:

p_{\mathbf{Y}}(\mathbf{y}) = \sum_{\mathbf{x}:\, g(\mathbf{x}) = \mathbf{y}} p_{\mathbf{X}}(\mathbf{x})

Sums of random variables

Sums of random variables come up everywhere: total measurement error, aggregate demand, cumulative waiting times. The distribution of a sum depends on the joint distribution of the summands, and when the variables are independent, convolution gives you a clean formula.

Convolution for discrete variables

For independent discrete random variables X and Y, the PMF of Z = X + Y is the convolution of their individual PMFs:

p_Z(z) = \sum_k p_X(k)\, p_Y(z - k)

Each term in the sum represents one way to split the total z between X = k and Y = z - k. For more than two variables:

p_{X_1 + \cdots + X_n}(z) = \sum_{k_1 + \cdots + k_n = z} p_{X_1}(k_1) \cdots p_{X_n}(k_n)

These sums can become computationally expensive, but probability generating functions and the discrete Fourier transform (via the convolution theorem) make them more tractable.
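The classic worked example is the sum of two fair dice. A sketch assuming numpy is available, using `np.convolve` for the discrete convolution:

```python
import numpy as np

# PMF of one fair die: values 1..6, index k holds P(X = k + 1).
die = np.full(6, 1 / 6)

# PMF of the sum of two independent dice; index k holds P(X + Y = k + 2).
p_sum = np.convolve(die, die)

print(p_sum[5])   # P(sum = 7) = 6/36
```

The peak at 7 reflects that six of the 36 equally likely (X, Y) pairs split the total that way.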

Convolution integral for continuous variables

For independent continuous random variables X and Y with PDFs f_X and f_Y, the PDF of Z = X + Y is:

f_Z(z) = \int_{-\infty}^{\infty} f_X(t)\, f_Y(z - t)\, dt

The logic is the same as the discrete case, but with integration replacing summation. Moment generating functions (or characteristic functions) often provide a faster route: if X and Y are independent, then M_Z(s) = M_X(s) \cdot M_Y(s), and you can identify the distribution of Z from its MGF.
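The convolution integral can be approximated on a grid. A sketch (numpy assumed available) for two Uniform(0, 1) densities, whose sum has the well-known triangular density f_Z(z) = z on [0, 1] and 2 − z on [1, 2]:

```python
import numpy as np

# Discretize the Uniform(0, 1) PDF on a grid of step dt.
dt = 1e-3
t = np.arange(0.0, 1.0, dt)
f = np.ones_like(t)                 # Uniform(0, 1) PDF on its support

# Riemann-sum approximation of the convolution integral.
f_Z = np.convolve(f, f) * dt

i = int(0.5 / dt)                   # grid index for z = 0.5
print(f_Z[i])                       # ≈ 0.5, the exact triangular density there
```

The approximation error shrinks with dt; the MGF route gives the same triangular shape analytically.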