Fiveable

🔀Stochastic Processes Unit 2 Review

2.1 Discrete probability distributions

Written by the Fiveable Content Team • Last updated August 2025

Discrete probability distributions describe the likelihood of outcomes for random variables that take on countable values. They're the building blocks for modeling random events in stochastic processes, and nearly every topic in this course builds on them.

This section covers the key discrete distributions (Bernoulli, binomial, geometric, Poisson, and others), along with the core machinery: probability mass functions, cumulative distribution functions, expected value, variance, moment generating functions, and joint distributions.

Types of discrete distributions

Discrete distributions model scenarios where a random variable can only take on isolated, countable values (integers, for instance). In stochastic processes, they show up constantly: modeling arrivals to a queue, counting defects in manufacturing, tracking failures in a reliability system.

The distributions you need to know for this unit:

  • Bernoulli and Binomial (success/failure trials)
  • Geometric and Negative Binomial (waiting for successes)
  • Poisson (counting events over an interval)
  • Hypergeometric (sampling without replacement)

Each has a specific PMF, expected value, and variance that you should be comfortable deriving and applying.

Probability mass functions

The probability mass function (PMF) is the primary way to specify a discrete distribution. It tells you the probability that a random variable equals each of its possible values.

Definition of PMF

A PMF assigns a probability to every possible value of a discrete random variable X. It's written as P(X = x), and it answers the question: "What's the probability that X equals exactly x?"

Properties of valid PMFs

For a PMF to be valid, two conditions must hold:

  • Non-negativity: P(X = x) \geq 0 for all x
  • Normalization: \sum_{x} P(X = x) = 1

If either condition fails, you don't have a legitimate probability distribution.
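The two conditions can be checked directly for any candidate PMF. A minimal sketch, assuming the PMF is stored as a dict mapping values to probabilities (the floating-point tolerance is an illustrative choice):

```python
def is_valid_pmf(pmf, tol=1e-9):
    """Check non-negativity and normalization for a PMF given as {value: prob}."""
    non_negative = all(p >= 0 for p in pmf.values())
    normalized = abs(sum(pmf.values()) - 1.0) < tol
    return non_negative and normalized

fair_die = {k: 1/6 for k in range(1, 7)}
broken = {0: 0.5, 1: 0.6}           # probabilities sum to 1.1, not valid
print(is_valid_pmf(fair_die))        # True
print(is_valid_pmf(broken))          # False
```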

Cumulative distribution functions

The cumulative distribution function (CDF) gives a running total of probability. It's especially useful when you need to compute probabilities involving inequalities, like P(X \leq 5).

Definition of CDF

The CDF of a discrete random variable X is defined as:

F(x) = P(X \leq x) = \sum_{t \leq x} P(X = t)

It's a non-decreasing, right-continuous step function. As x \to -\infty, F(x) \to 0, and as x \to \infty, F(x) \to 1.

Relationship between PMF and CDF

You can go back and forth between the two:

  • CDF from PMF: Sum up PMF values: F(x) = \sum_{t \leq x} P(X = t)
  • PMF from CDF: Take differences at jump points: P(X = x) = F(x) - \lim_{y \to x^-} F(y)

For integer-valued random variables, this simplifies to P(X = x) = F(x) - F(x - 1).
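Both directions can be carried out mechanically for an integer-valued PMF. A small sketch (the dict representation is an assumption, not part of the definitions above):

```python
def cdf_from_pmf(pmf):
    """Running sums of an integer-valued PMF, as {value: P(X <= value)}."""
    cdf, running = {}, 0.0
    for x in sorted(pmf):
        running += pmf[x]
        cdf[x] = running
    return cdf

die_pmf = {k: 1/6 for k in range(1, 7)}
die_cdf = cdf_from_pmf(die_pmf)
# Recover the PMF by differencing consecutive CDF values: F(x) - F(x - 1)
recovered = {x: die_cdf[x] - die_cdf.get(x - 1, 0.0) for x in die_cdf}
print(die_cdf[3])   # P(X <= 3), which is 0.5 for a fair die
```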

Expected value

The expected value summarizes where a distribution is "centered." Think of it as the long-run average if you could repeat the random experiment infinitely many times.

Definition of expected value

For a discrete random variable X:

E[X] = \sum_{x} x \cdot P(X = x)

Each possible value gets weighted by its probability. Values that are more likely pull the expected value toward them.
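The weighted-sum definition translates directly into code. A minimal sketch using a fair six-sided die as the example distribution:

```python
def expected_value(pmf):
    """E[X] = sum of x * P(X = x) over the support."""
    return sum(x * p for x, p in pmf.items())

die_pmf = {k: 1/6 for k in range(1, 7)}
print(expected_value(die_pmf))  # 3.5 (up to floating-point rounding)
```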

Linearity of expectation

This is one of the most useful properties in probability:

E[X + Y] = E[X] + E[Y]

This holds regardless of whether X and Y are independent. That's what makes it so powerful. You can also pull out constants: E[aX + b] = aE[X] + b.

Variance and standard deviation

Variance measures how spread out a distribution is around its mean. Two distributions can have the same expected value but very different variances.

Definition of variance

Var(X) = E[(X - E[X])^2] = \sum_{x} (x - E[X])^2 \cdot P(X = x)

An equivalent shortcut that's often faster in practice:

Var(X) = E[X^2] - (E[X])^2

Properties of variance

  • Scaling: Var(aX) = a^2 \, Var(X) (the square matters; variance is not linear)
  • Shifting: Var(X + b) = Var(X) (adding a constant doesn't change spread)
  • Sum of independents: If X and Y are independent, Var(X + Y) = Var(X) + Var(Y)

Note the independence requirement for the sum rule; unlike linearity of expectation, it fails for dependent variables, where you need the covariance term: Var(X + Y) = Var(X) + Var(Y) + 2\,Cov(X, Y).
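Both forms of the variance give the same answer, which is easy to confirm numerically. A sketch using a Bernoulli(0.3) variable, whose variance should be p(1 - p) = 0.21:

```python
def expected_value(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) via the shortcut E[X^2] - (E[X])^2."""
    mean = expected_value(pmf)
    second_moment = sum(x**2 * p for x, p in pmf.items())
    return second_moment - mean**2

p = 0.3
bern = {0: 1 - p, 1: p}   # Bernoulli(p) PMF
print(variance(bern))      # approximately p * (1 - p) = 0.21
```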

Standard deviation vs variance

The standard deviation is \sigma = \sqrt{Var(X)}. Its main advantage is that it's in the same units as X itself, making it more interpretable. Variance is in squared units, which is useful for calculations but harder to reason about directly.

Moment generating functions

Moment generating functions (MGFs) encode the entire distribution of a random variable into a single function. They're especially handy for finding moments and for working with sums of independent variables.

Definition of MGF

M_X(t) = E[e^{tX}] = \sum_{x} e^{tx} \cdot P(X = x)

The MGF exists if this sum converges in some neighborhood of t = 0. When it exists, it uniquely determines the distribution.

Properties and applications of MGFs

  • Extracting moments: The n-th moment is E[X^n] = M_X^{(n)}(0), the n-th derivative evaluated at t = 0. So E[X] = M_X'(0) and E[X^2] = M_X''(0).
  • Sums of independents: If X and Y are independent, M_{X+Y}(t) = M_X(t) \cdot M_Y(t). This makes MGFs a clean way to derive the distribution of a sum.
  • Identifying distributions: If you compute an MGF and recognize it as the MGF of a known distribution, you've identified the distribution of your random variable.
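The product property can be verified numerically for finite-support PMFs: compute the PMF of X + Y by convolution, evaluate its MGF at some t, and compare with the product of the individual MGFs. A sketch (the dict representation and the value t = 0.4 are illustrative choices):

```python
import math

def mgf(pmf, t):
    """M_X(t) = sum of e^{t x} * P(X = x) over the support."""
    return sum(math.exp(t * x) * p for x, p in pmf.items())

def convolve(pmf_x, pmf_y):
    """PMF of X + Y for independent X and Y."""
    out = {}
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            out[x + y] = out.get(x + y, 0.0) + px * py
    return out

X = {0: 0.7, 1: 0.3}   # Bernoulli(0.3)
Y = {0: 0.5, 1: 0.5}   # Bernoulli(0.5)
t = 0.4
# M_{X+Y}(t) and M_X(t) * M_Y(t) should agree up to rounding
print(abs(mgf(convolve(X, Y), t) - mgf(X, t) * mgf(Y, t)))
```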

Common discrete distributions

Bernoulli and binomial distributions

A Bernoulli random variable models a single trial with two outcomes: success (X = 1) with probability p, or failure (X = 0) with probability 1 - p.

  • PMF: P(X = x) = p^x(1-p)^{1-x} for x \in \{0, 1\}
  • E[X] = p, Var(X) = p(1-p)

The Binomial distribution counts the number of successes in n independent Bernoulli trials.

  • PMF: P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} for k = 0, 1, \ldots, n
  • E[X] = np, Var(X) = np(1-p)

Applications include quality control (number of defective items in a batch) and survey sampling (number of respondents choosing a particular option).
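The binomial PMF is straightforward to compute from the formula, and summing k · P(X = k) over the support should recover the mean np. A sketch with illustrative parameters n = 10, p = 0.2:

```python
from math import comb

def binomial_pmf(n, p, k):
    """P(X = k) = C(n, k) * p^k * (1-p)^(n-k) for Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.2
pmf = {k: binomial_pmf(n, p, k) for k in range(n + 1)}
mean = sum(k * q for k, q in pmf.items())
print(mean)  # should be close to n * p = 2.0
```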

Geometric and negative binomial distributions

The Geometric distribution models the number of trials until the first success.

  • PMF: P(X = k) = (1-p)^{k-1} p for k = 1, 2, 3, \ldots
  • E[X] = 1/p, Var(X) = (1-p)/p^2

Be careful: some textbooks define the geometric as the number of failures before the first success, which shifts the support to k = 0, 1, 2, \ldots and changes the PMF to P(X = k) = (1-p)^k p. Check which convention your course uses.

The Negative Binomial generalizes this to the number of trials until the r-th success.

  • PMF: P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r} for k = r, r+1, \ldots
  • E[X] = r/p, Var(X) = r(1-p)/p^2

These distributions are natural models for waiting times in stochastic processes.
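Because the geometric support is infinite, you can't sum it exhaustively, but a long truncated sum approximates E[X] = 1/p to high precision (the truncation point 2000 is an illustrative choice; the neglected tail is astronomically small):

```python
def geometric_pmf(p, k):
    """P(X = k) = (1-p)^(k-1) * p: first success on trial k (k = 1, 2, ...)."""
    return (1 - p)**(k - 1) * p

p = 0.25
# Truncated sum over a long prefix of the support approximates E[X] = 1/p
mean = sum(k * geometric_pmf(p, k) for k in range(1, 2000))
print(mean)  # approximately 1 / 0.25 = 4
```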

Poisson distribution

The Poisson distribution models the count of events in a fixed interval, given that events occur independently at a constant average rate \lambda.

  • PMF: P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!} for k = 0, 1, 2, \ldots
  • E[X] = \lambda, Var(X) = \lambda

The fact that the mean equals the variance is a distinctive feature. Typical applications: customer arrivals per hour, number of typos per page, radioactive decay events per second.

The Poisson also arises as a limit of the binomial when n is large, p is small, and \lambda = np stays moderate.
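That limit is easy to see numerically: with n = 10,000 and p = \lambda/n, the binomial PMF is already very close to the Poisson PMF at every point. A sketch (the parameter values and the comparison range are illustrative choices):

```python
from math import comb, exp, factorial

def poisson_pmf(lam, k):
    """P(X = k) = e^{-lam} * lam^k / k! for Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

def binomial_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Binomial(n, lam/n) approaches Poisson(lam) as n grows
lam, n = 3.0, 10_000
max_gap = max(abs(binomial_pmf(n, lam / n, k) - poisson_pmf(lam, k))
              for k in range(20))
print(max_gap)  # small: the Poisson approximation is already tight here
```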

Hypergeometric distribution

The Hypergeometric distribution models successes in n draws from a finite population of size N containing K successes, without replacement.

  • PMF: P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}
  • E[X] = nK/N, Var(X) = n \frac{K}{N} \frac{N-K}{N} \frac{N-n}{N-1}

The key difference from the binomial: draws are dependent because there's no replacement. As N \to \infty with K/N \to p, the hypergeometric converges to the binomial.
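The PMF and mean formula can be checked directly with binomial coefficients. A sketch with illustrative parameters (N = 50, K = 10, n = 5, so E[X] = nK/N = 1):

```python
from math import comb

def hypergeom_pmf(N, K, n, k):
    """P(X = k): k successes in n draws without replacement
    from a population of N items containing K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

N, K, n = 50, 10, 5
pmf = {k: hypergeom_pmf(N, K, n, k) for k in range(n + 1)}
mean = sum(k * q for k, q in pmf.items())
print(mean)  # should be close to n * K / N = 1.0
```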

Joint distributions

When you're working with two or more random variables simultaneously, you need joint distributions to capture how they behave together.

Joint probability mass functions

The joint PMF of X and Y is P(X = x, Y = y), giving the probability that X = x and Y = y at the same time.

Validity requirements are the same as for single-variable PMFs: non-negativity, and the double sum over all (x, y) pairs must equal 1.

Marginal and conditional distributions

Marginal PMFs recover the distribution of a single variable by summing out the other:

P(X = x) = \sum_{y} P(X = x, Y = y)

Conditional PMFs describe one variable given a specific value of the other:

P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P(X = x)}

This is only defined when P(X = x) > 0.
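Both operations reduce to sums and a division over a joint PMF table. A sketch using a small made-up joint PMF stored as {(x, y): prob} (the numbers are illustrative, not from the text):

```python
# Joint PMF of (X, Y) as {(x, y): prob} -- a small made-up example
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def marginal_x(joint, x):
    """P(X = x) by summing the joint PMF over all y."""
    return sum(p for (xi, y), p in joint.items() if xi == x)

def conditional_y_given_x(joint, y, x):
    """P(Y = y | X = x); requires P(X = x) > 0."""
    px = marginal_x(joint, x)
    return joint.get((x, y), 0.0) / px

print(marginal_x(joint, 0))                # 0.1 + 0.3 = 0.4
print(conditional_y_given_x(joint, 1, 0))  # 0.3 / 0.4 = 0.75
```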

Independent vs dependent random variables

X and Y are independent if and only if their joint PMF factors into the product of marginals for every pair of values:

P(X = x, Y = y) = P(X = x) \cdot P(Y = y) \quad \text{for all } x, y

Equivalently, independence means conditioning on one variable doesn't change the distribution of the other: P(Y = y \mid X = x) = P(Y = y).

If this factorization fails for even a single pair (x, y), the variables are dependent, and you'll need the full joint PMF (or covariance information) to analyze their combined behavior.

Sums of discrete random variables

Adding random variables together comes up constantly in stochastic processes (total service time, aggregate demand, cumulative arrivals, etc.).

Convolution formula for PMFs

For independent random variables X and Y, the PMF of Z = X + Y is:

P(Z = z) = \sum_{x} P(X = x) \cdot P(Y = z - x)

This is called the convolution of the two PMFs. You iterate over all values x that X can take and accumulate the products.

For sums of three or more independent variables, apply convolution iteratively (or use MGFs, which is usually cleaner).
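The convolution formula above translates into a double loop over the two supports. A sketch computing the distribution of the sum of two fair dice, where the total 7 is the most likely outcome with probability 6/36:

```python
def convolve_pmfs(pmf_x, pmf_y):
    """PMF of Z = X + Y for independent X and Y, via convolution."""
    out = {}
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            out[x + y] = out.get(x + y, 0.0) + px * py
    return out

die = {k: 1/6 for k in range(1, 7)}
two_dice = convolve_pmfs(die, die)
print(two_dice[7])  # 6/36, the most likely total of two fair dice
```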

Distribution of sum of independent variables

Certain families of distributions are "closed" under summation of independent variables:

  • Sum of n i.i.d. Bernoulli(p) variables \to Binomial(n, p)
  • Sum of independent Poisson(\lambda_1) and Poisson(\lambda_2) \to Poisson(\lambda_1 + \lambda_2)
  • Sum of independent Negative Binomial(r_1, p) and Negative Binomial(r_2, p) \to Negative Binomial(r_1 + r_2, p)

Recognizing these closure properties saves you from doing convolution by hand.
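The first closure property can be verified by brute force: convolving n Bernoulli(p) PMFs should reproduce the Binomial(n, p) PMF exactly. A sketch with illustrative parameters n = 5, p = 0.3:

```python
from functools import reduce
from math import comb

def convolve_pmfs(pmf_x, pmf_y):
    """PMF of X + Y for independent X and Y, via convolution."""
    out = {}
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            out[x + y] = out.get(x + y, 0.0) + px * py
    return out

# Convolving n i.i.d. Bernoulli(p) PMFs should give Binomial(n, p)
n, p = 5, 0.3
bern = {0: 1 - p, 1: p}
total = reduce(convolve_pmfs, [bern] * n)
binom = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
gap = max(abs(total[k] - binom[k]) for k in range(n + 1))
print(gap)  # ~0: the closure property checks out numerically
```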

Transformations of discrete random variables

Transformations create new random variables as functions of existing ones. This is how you go from a model of raw data to a model of some derived quantity.

PMF and CDF under transformations

Given Y = g(X) where X has a known PMF, the PMF of Y is:

P(Y = y) = \sum_{x:\, g(x) = y} P(X = x)

You collect all values of x that map to the same y and add their probabilities. If g is one-to-one, each sum has only one term.
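This "collect and add" rule is a one-pass accumulation in code. A sketch with X uniform on {-2, -1, 0, 1, 2} and Y = X², where the non-injective map folds ±x onto the same value:

```python
def transform_pmf(pmf, g):
    """PMF of Y = g(X): sum P(X = x) over all x with g(x) = y."""
    out = {}
    for x, p in pmf.items():
        y = g(x)
        out[y] = out.get(y, 0.0) + p
    return out

# X uniform on {-2, -1, 0, 1, 2}; Y = X^2 collapses +/-x to one value
X = {x: 0.2 for x in (-2, -1, 0, 1, 2)}
Y = transform_pmf(X, lambda x: x * x)
print(Y)  # P(Y=0) = 0.2, P(Y=1) = 0.4, P(Y=4) = 0.4
```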

Functions of multiple discrete variables

For a transformation (U, V) = (g_1(X, Y),\, g_2(X, Y)):

P(U = u, V = v) = \sum_{\substack{(x,y):\, g_1(x,y) = u \\ g_2(x,y) = v}} P(X = x, Y = y)

Once you have the joint PMF of (U, V), you can extract marginals and conditionals using the standard formulas from the joint distributions section.

Applications and examples

Modeling real-world scenarios

  • Queueing systems: Customer arrivals are often modeled as Poisson; the number of customers served in a time window can be binomial or geometric depending on the service mechanism.
  • Inventory management: Daily demand for a product might follow a Poisson distribution with rate \lambda = 12 units/day, which lets you set reorder points and safety stock levels.
  • Reliability engineering: The number of component failures in a system over a fixed period can be modeled as Poisson or binomial, depending on whether components are independent and identical.

Solving problems using discrete distributions

  • Quality control: Use the hypergeometric distribution to find the probability that a sample of 10 items from a lot of 200 contains 2 or more defectives.
  • Risk assessment: Model the number of insurance claims per month as Poisson(\lambda) to estimate the probability of exceeding a threshold.
  • Network analysis: Node degree distributions in random graphs often follow Poisson (Erdős–Rényi model) or power-law distributions, which determine connectivity and resilience properties.

The common thread: pick the distribution whose assumptions match your scenario, then use PMFs, CDFs, expected values, and variances to answer quantitative questions about the system.