Joint probability distribution definition
Joint probability distributions describe how two or more random variables behave together. Rather than looking at each variable in isolation, a joint distribution captures the probabilistic relationship between them, telling you how likely different combinations of values are to occur simultaneously. This is foundational in stochastic processes, where you're almost always dealing with multiple uncertain quantities evolving together.
Discrete vs continuous
Discrete joint distributions apply when the random variables take on a countable number of values (integers, specific categories). Continuous joint distributions apply when the variables range over uncountably many values (real numbers on an interval or the whole real line).
The distinction matters because it changes how you compute probabilities:
- Discrete case: you sum the joint PMF over the outcomes of interest
- Continuous case: you integrate the joint PDF over a region of interest
Marginal vs conditional distributions
Marginal distributions extract the behavior of a single variable from the joint distribution, ignoring the others.
- Discrete: sum the joint PMF over all values of the other variables
- Continuous: integrate the joint PDF over all values of the other variables
- The result is a univariate distribution for that one variable alone
Conditional distributions describe how one variable behaves given that you know the value of another.
- Calculated by dividing the joint PMF/PDF by the marginal of the conditioning variable: p_{X|Y}(x | y) = p_{X,Y}(x, y) / p_Y(y)
- This tells you how the distribution of X shifts once you observe a specific value of Y
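A minimal sketch in Python of both recipes, using a hypothetical joint PMF table:

```python
# Joint PMF of (X, Y) as a dict; the probabilities are hypothetical.
joint = {(0, 0): 0.10, (0, 1): 0.20,
         (1, 0): 0.30, (1, 1): 0.40}

# Marginal of Y: sum the joint PMF over all values of X.
p_Y = {}
for (x, y), p in joint.items():
    p_Y[y] = p_Y.get(y, 0.0) + p

# Conditional PMF of X given Y = 1: divide the joint by the marginal.
cond = {x: joint[(x, 1)] / p_Y[1] for (x, y) in joint if y == 1}
print(p_Y)
print(cond)
```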
Joint probability mass functions
A joint probability mass function (PMF) assigns a probability to each possible combination of values for discrete random variables. Formally, p_{X,Y}(x, y) = P(X = x, Y = y) maps from the set of all possible value-tuples to probabilities in [0, 1].
Two properties must hold for a valid joint PMF:
- p_{X,Y}(x, y) ≥ 0 for all combinations (x, y)
- Σ_x Σ_y p_{X,Y}(x, y) = 1, summing over all possible combinations
Discrete random variables
Joint PMFs are defined over a countable sample space consisting of all possible value-tuples. Common multivariate discrete distributions include the multinomial (generalization of the binomial to multiple categories) and joint Poisson models.
Most univariate concepts extend naturally: you can compute expected values, variances, and moment generating functions from the joint PMF, just with sums over multiple indices instead of one.
Multivariate distributions
A multivariate PMF can be laid out in a table or matrix. For two variables X and Y, you'd have a grid where each cell contains p_{X,Y}(x_i, y_j). Row sums give the marginal PMF of X, and column sums give the marginal PMF of Y.
From this table you can derive marginals, conditionals, and moments by applying the appropriate sums.
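A quick sketch of this table view with NumPy, using a hypothetical 2 × 3 grid:

```python
import numpy as np

# Joint PMF of X (rows, values 0 and 1) and Y (columns); hypothetical numbers.
P = np.array([[0.10, 0.05, 0.05],
              [0.20, 0.30, 0.30]])
assert np.isclose(P.sum(), 1.0)          # valid PMF: entries sum to 1

p_X = P.sum(axis=1)                      # row sums: marginal PMF of X
p_Y = P.sum(axis=0)                      # column sums: marginal PMF of Y
E_X = (np.arange(P.shape[0]) * p_X).sum()  # a moment from the marginal
print(p_X, p_Y, E_X)
```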
Calculating probabilities
To find the probability of an event A defined by conditions on the variables, sum the PMF over all outcomes in A: P(A) = Σ_{(x,y) ∈ A} p_{X,Y}(x, y)
For complex events, the inclusion-exclusion principle and combinatorial counting techniques help you identify which outcomes belong to A.
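For example, summing a joint PMF over an event (here two independent fair dice and the event X + Y ≥ 10):

```python
from itertools import product

# Joint PMF of two independent fair dice; event A = {X + Y >= 10}.
pmf = {(x, y): 1 / 36 for x, y in product(range(1, 7), repeat=2)}
p_A = sum(p for (x, y), p in pmf.items() if x + y >= 10)
print(p_A)  # 6 favorable outcomes out of 36, i.e. 1/6
```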
Joint probability density functions
A joint probability density function (PDF) f_{X,Y}(x, y) specifies a continuous multivariate distribution. Unlike a PMF, the value at a single point is not a probability; it represents relative likelihood. To get an actual probability, you integrate the PDF over a region.
Continuous random variables
Joint PDFs apply when the random variables can take any value in a continuous range. Common examples include the multivariate normal, joint exponential, and Dirichlet distributions.
Working with densities lets you model continuous quantities (time, distance, temperature) without artificially discretizing them.
Multivariate density functions
A valid joint PDF must satisfy:
- f_{X,Y}(x, y) ≥ 0 everywhere
- ∫∫ f_{X,Y}(x, y) dx dy = 1, integrating over the whole plane
Marginal PDFs are obtained by integrating out the other variables, and conditional PDFs are found by dividing the joint PDF by the appropriate marginal, just as in the discrete case.
Probability calculations with integrals
For an event A defined by conditions on continuous random variables: P(A) = ∫∫_A f_{X,Y}(x, y) dx dy
This often requires setting up multiple integrals with carefully chosen limits that describe the region A. For complex regions, changing the order of integration or switching to polar/spherical coordinates can simplify the computation.
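A rough numerical sketch of such a region integral, assuming two independent Exp(1) variables and the event X + Y ≤ 1 (which has the closed form 1 − 2/e for comparison):

```python
import math

# Independent Exp(1) variables: f(x, y) = e^(-x) e^(-y) on the positive quadrant.
# Approximate P(X + Y <= 1) with a midpoint Riemann sum over the triangle.
n = 500
h = 1.0 / n
total = 0.0
for i in range(n):
    x = (i + 0.5) * h
    for j in range(n):
        y = (j + 0.5) * h
        if x + y <= 1.0:               # keep only points inside region A
            total += math.exp(-x - y) * h * h

exact = 1 - 2 / math.e                 # closed form: Gamma(2, 1) CDF at 1
print(total, exact)
```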
Joint cumulative distribution functions
The joint cumulative distribution function (CDF) is defined as: F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y)
It gives the probability that every variable simultaneously falls at or below its specified value. The CDF works for both discrete and continuous distributions, making it a universal way to specify any joint distribution.
CDF definition for joint distributions
For discrete variables, the CDF is a cumulative sum: F(x, y) = Σ_{u ≤ x} Σ_{v ≤ y} p(u, v)
For continuous variables, it's a cumulative integral: F(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(u, v) dv du
You can recover the PDF from the CDF by taking mixed partial derivatives: f(x, y) = ∂²F(x, y) / ∂x ∂y.
Properties of joint CDFs
- Monotonicity: F is non-decreasing in each argument. If x_i ≤ y_i for all i, then F(x_1, …, x_n) ≤ F(y_1, …, y_n).
- Boundary behavior: F → 1 as all arguments go to +∞, and F → 0 if any single argument goes to −∞.
- Marginal CDFs are obtained by sending the other arguments to +∞: F_X(x) = lim_{y → ∞} F_{X,Y}(x, y)
- Right-continuity: The joint CDF is right-continuous in each argument.
Relationship to probability
The CDF directly gives the probability of rectangular regions. For two variables, the probability that X falls in (a, b] and Y falls in (c, d] is: P(a < X ≤ b, c < Y ≤ d) = F(b, d) − F(a, d) − F(b, c) + F(a, c)
This is an inclusion-exclusion over the four corners of the rectangle. For n variables, the formula generalizes to 2^n terms with alternating signs.
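A sketch of the corner formula with two independent Uniform(0, 1) variables, whose joint CDF is F(x, y) = xy:

```python
# Joint CDF of two independent Uniform(0, 1) variables: F(x, y) = x * y.
def F(x, y):
    return max(0.0, min(x, 1.0)) * max(0.0, min(y, 1.0))

# P(a < X <= b, c < Y <= d) via inclusion-exclusion over the four corners.
a, b, c, d = 0.2, 0.5, 0.1, 0.4
p = F(b, d) - F(a, d) - F(b, c) + F(a, c)
print(p)  # mathematically (b - a) * (d - c) = 0.09
```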

Independent vs dependent variables
Independence and dependence describe whether knowing the value of one variable tells you anything about the others. This distinction has major consequences for how you work with joint distributions: independence dramatically simplifies calculations, while dependence requires you to carefully track how variables interact.
Definition of independence
Random variables are independent if their joint distribution factors into a product of marginals: p_{X,Y}(x, y) = p_X(x) p_Y(y) in the discrete case, f_{X,Y}(x, y) = f_X(x) f_Y(y) in the continuous case.
Equivalently, the joint CDF factors: F_{X,Y}(x, y) = F_X(x) F_Y(y).
The intuition: knowing the realized values of some variables doesn't change the probability distribution of the others.
Factoring joint distributions
When variables are independent, the product structure lets you:
- Compute joint probabilities by simply multiplying marginal probabilities
- Treat each variable separately for expectations: E[g(X) h(Y)] = E[g(X)] E[h(Y)]
- Derive that Cov(X, Y) = 0 (though the converse is not generally true)
Many results about sums, maxima, and other functions of random variables become tractable specifically because of this factorization.
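A sketch that tests the factorization numerically on a hypothetical table (built here to be independent):

```python
from itertools import product

# Check whether a joint PMF factors into its marginals; hypothetical numbers.
joint = {(0, 0): 0.12, (0, 1): 0.28,
         (1, 0): 0.18, (1, 1): 0.42}

p_X = {x: sum(p for (a, b), p in joint.items() if a == x) for x in (0, 1)}
p_Y = {y: sum(p for (a, b), p in joint.items() if b == y) for y in (0, 1)}

# Independence holds iff every cell equals the product of its marginals.
independent = all(abs(joint[(x, y)] - p_X[x] * p_Y[y]) < 1e-12
                  for x, y in product((0, 1), repeat=2))
print(independent)
```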
Conditional distributions for dependence
When variables are not independent, conditional distributions are the primary tool for describing their relationship: f_{X|Y}(x | y) = f_{X,Y}(x, y) / f_Y(y)
The denominator must be nonzero (positive marginal probability or density at the conditioning point). Conditional distributions are central to Bayesian inference, where you update beliefs about unknown quantities after observing data.
Covariance and correlation
Covariance and correlation quantify the linear dependence between two random variables. They appear throughout probability and statistics, from the variance of sums to the multivariate normal distribution.
Measures of dependence
Covariance between X and Y: Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
- Positive covariance: X and Y tend to be large (or small) together
- Negative covariance: when one is large, the other tends to be small
- Zero covariance: no linear association (but nonlinear dependence can still exist)
Correlation (Pearson correlation coefficient): ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y)
- Always between −1 and 1
- |ρ| = 1 means Y is an exact linear function of X
- ρ = 0 means no linear relationship, but a nonlinear one may still exist
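A sketch computing both quantities directly from a hypothetical joint PMF:

```python
# Covariance and correlation from a joint PMF; the table is hypothetical.
joint = {(0, 0): 0.3, (0, 1): 0.1,
         (1, 0): 0.2, (1, 1): 0.4}

E_X = sum(x * p for (x, y), p in joint.items())
E_Y = sum(y * p for (x, y), p in joint.items())
E_XY = sum(x * y * p for (x, y), p in joint.items())

cov = E_XY - E_X * E_Y                                     # E[XY] - E[X]E[Y]
var_X = sum(x * x * p for (x, y), p in joint.items()) - E_X ** 2
var_Y = sum(y * y * p for (x, y), p in joint.items()) - E_Y ** 2
corr = cov / (var_X ** 0.5 * var_Y ** 0.5)                 # Pearson coefficient
print(cov, corr)
```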
Covariance matrix
For a random vector X = (X_1, …, X_n), the covariance matrix Σ is the n × n matrix with (i, j) entry Σ_{ij} = Cov(X_i, X_j).
Key properties:
- Symmetric: Σ_{ij} = Σ_{ji} since Cov(X_i, X_j) = Cov(X_j, X_i)
- Positive semi-definite: aᵀ Σ a ≥ 0 for any vector a
- Diagonal entries are the variances: Σ_{ii} = Var(X_i)
The covariance matrix appears in the multivariate normal PDF, in the multivariate CLT, and in linear transformations of random vectors.
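A sketch using NumPy's sample covariance to illustrate these properties (the mixing matrix below is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 samples of a 3-dimensional vector with correlated components.
X = rng.standard_normal((1000, 3)) @ np.array([[1.0, 0.5, 0.0],
                                               [0.0, 1.0, 0.3],
                                               [0.0, 0.0, 1.0]])

Sigma = np.cov(X, rowvar=False)                      # sample covariance, 3 x 3
assert np.allclose(Sigma, Sigma.T)                   # symmetric
assert np.all(np.linalg.eigvalsh(Sigma) >= -1e-10)   # positive semi-definite
print(np.diag(Sigma))                                # diagonal = sample variances
```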
Correlation coefficient
The correlation matrix has (i, j) entry ρ_{ij} = Σ_{ij} / √(Σ_{ii} Σ_{jj}). It's the covariance matrix of the standardized variables (X_i − E[X_i]) / σ_i.
- Diagonal entries are all 1
- Off-diagonal entries lie in [−1, 1]
- Often easier to interpret than the covariance matrix because it's scale-free
Transformations of random vectors
Transformations create new random variables or vectors from existing ones. You might standardize variables, rotate coordinate systems, or apply nonlinear mappings. The goal is to find the distribution of the transformed vector given the distribution of the original.
Linear transformations
A linear transformation takes the form Y = AX + b, where A is an m × n matrix and b is an m × 1 vector.
The mean and covariance transform as: E[Y] = A E[X] + b and Cov(Y) = A Cov(X) Aᵀ
These formulas are used constantly in statistics and signal processing (PCA, Kalman filtering, whitening transforms, etc.).
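A sketch verifying the two formulas with a hypothetical A, b, mean, and covariance:

```python
import numpy as np

# Y = A X + b: apply E[Y] = A E[X] + b and Cov(Y) = A Sigma A^T exactly.
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 1.0],
              [1.0, -1.0]])
b = np.array([0.0, 3.0])

mu_Y = A @ mu + b            # mean of the transformed vector
Sigma_Y = A @ Sigma @ A.T    # covariance of the transformed vector
print(mu_Y)                  # [3. 2.]
print(Sigma_Y)               # [[4. 1.] [1. 2.]]
```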
Jacobian matrix
For a general (possibly nonlinear) transformation Y = g(X) where g is one-to-one and differentiable, the joint PDF of Y is: f_Y(y) = f_X(g⁻¹(y)) |det J|
Here J is the Jacobian matrix of the inverse transformation x = g⁻¹(y), with (i, j) entry ∂x_i / ∂y_j.
The absolute value of the Jacobian determinant accounts for how the transformation stretches or compresses volume in the sample space. If the transformation expands a region, the density decreases proportionally, and vice versa.
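As a worked example (a standard illustration, not specific to this text): let Y_1 = X_1 + X_2 and Y_2 = X_1 − X_2. The inverse is x_1 = (y_1 + y_2)/2 and x_2 = (y_1 − y_2)/2, so:

```latex
J =
\begin{pmatrix}
\frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} \\[2pt]
\frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2}
\end{pmatrix}
=
\begin{pmatrix}
\tfrac12 & \tfrac12 \\[2pt]
\tfrac12 & -\tfrac12
\end{pmatrix},
\qquad
\lvert \det J \rvert = \tfrac12,
\qquad
f_Y(y_1, y_2) = \tfrac12 \, f_X\!\left(\tfrac{y_1 + y_2}{2},\, \tfrac{y_1 - y_2}{2}\right)
```

The factor 1/2 reflects that this rotation-and-scaling doubles areas in the original coordinates.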
Distribution of transformed variables
Continuous case: The joint CDF of Y = g(X) is: F_Y(y) = P(g(X) ≤ y) = ∫_{x : g(x) ≤ y} f_X(x) dx
For linear transformations Y = AX + b with invertible A, the Jacobian determinant simplifies to det(A⁻¹) = 1 / det(A), so f_Y(y) = f_X(A⁻¹(y − b)) / |det A|.
Discrete case: The PMF of Y = g(X) is found by summing over all pre-images: p_Y(y) = Σ_{x : g(x) = y} p_X(x)
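A sketch of the pre-image sum for Y = X², with a hypothetical PMF for X:

```python
# PMF of Y = X^2 by summing p_X over pre-images; p_X is hypothetical.
p_X = {-1: 0.2, 0: 0.1, 1: 0.3, 2: 0.4}

p_Y = {}
for x, p in p_X.items():
    y = x * x                          # both -1 and 1 map to y = 1
    p_Y[y] = p_Y.get(y, 0.0) + p
print(p_Y)  # {1: 0.5, 0: 0.1, 4: 0.4}
```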
Sums of random variables
Sums of random variables come up everywhere: total measurement error, aggregate demand, cumulative waiting times. The distribution of a sum depends on the joint distribution of the summands, and when the variables are independent, convolution gives you a clean formula.
Convolution for discrete variables
For independent discrete random variables X and Y, the PMF of Z = X + Y is the convolution of their individual PMFs: p_Z(z) = Σ_k p_X(k) p_Y(z − k)
Each term in the sum represents one way to split the total z between X and Y. For more than two variables, the convolution is iterated: p_{X+Y+W} = p_X * p_Y * p_W.
These sums can become computationally expensive, but probability generating functions and the discrete Fourier transform (via the convolution theorem) make them more tractable.
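A sketch convolving two fair-die PMFs by direct enumeration:

```python
# PMF of the sum of two independent fair dice via convolution.
die = {k: 1 / 6 for k in range(1, 7)}

p_sum = {}
for x, px in die.items():
    for y, py in die.items():
        # each (x, y) pair is one way to split the total x + y
        p_sum[x + y] = p_sum.get(x + y, 0.0) + px * py
print(p_sum[7])   # the most likely total: 6/36 = 1/6
```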
Convolution integral for continuous variables
For independent continuous random variables X and Y with PDFs f_X and f_Y, the PDF of Z = X + Y is: f_Z(z) = ∫_{−∞}^{∞} f_X(x) f_Y(z − x) dx
The logic is the same as the discrete case, but with integration replacing summation. Moment generating functions (or characteristic functions) often provide a faster route: if X and Y are independent, then M_{X+Y}(t) = M_X(t) M_Y(t), and you can identify the distribution of Z from its MGF.
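A numerical sketch, assuming X ~ Exp(1) and Y ~ Exp(2) so the convolution has the closed form f_Z(z) = 2(e^{−z} − e^{−2z}):

```python
import math

# Numeric convolution: Z = X + Y with X ~ Exp(1), Y ~ Exp(2), independent.
# Both densities vanish for negative arguments, so integrate over [0, z].
def f_Z(z, n=10000):
    h = z / n          # midpoint rule on [0, z]
    return sum(math.exp(-x) * 2.0 * math.exp(-2.0 * (z - x)) * h
               for x in ((i + 0.5) * h for i in range(n)))

z = 1.5
exact = 2 * (math.exp(-z) - math.exp(-2 * z))   # closed-form density
print(f_Z(z), exact)
```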