Random variables give you a way to assign numbers to the outcomes of random processes, which turns uncertainty into something you can actually calculate with. They're the bridge between abstract probability and the kind of quantitative analysis you'll use throughout statistics, science, and decision-making.
This topic covers the definition and types of random variables, their key properties (expected value, variance), how to work with multiple random variables together, and the major theorems that make large-sample statistics possible.
Definition of random variables
A random variable is a function that maps each outcome of a random experiment to a real number. Instead of talking about "heads" or "tails," you assign numerical values (like 1 and 0), which lets you use all the tools of algebra and calculus on probabilistic outcomes.
Discrete vs continuous variables
The two main types of random variables differ in what values they can take:
- Discrete random variables take on countable, distinct values. Think of things you can list out: the number of customers in a store (0, 1, 2, ...), the number of heads in 10 coin flips, or the result of rolling a die.
- Continuous random variables can assume any value within a range. These describe measurements like temperature, height, or time, where values aren't restricted to whole numbers.
This distinction matters because each type uses a different mathematical tool to describe its probabilities: discrete variables use probability mass functions (PMFs), while continuous variables use probability density functions (PDFs).
Probability distributions
A probability distribution is a mathematical function that tells you how likely each outcome (or range of outcomes) is for a given random variable.
- Probability mass function (PMF) applies to discrete variables and gives the exact probability of each specific outcome.
- Probability density function (PDF) applies to continuous variables and gives the relative likelihood of values. For continuous variables, you find probabilities by integrating the PDF over an interval, not by evaluating it at a single point.
- Cumulative distribution function (CDF) works for both types and gives the probability that the variable is less than or equal to a given value: $F(x) = P(X \le x)$.
Common named distributions include the uniform, normal, exponential, and Poisson, each of which models a different kind of real-world scenario.
Expected value
The expected value (or mean) of a random variable represents its long-run average if you repeated the experiment many times. It's a measure of central tendency.
- For discrete variables: $E[X] = \sum_x x \, p(x)$
- For continuous variables: $E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$
You multiply each possible outcome by its probability, then add everything up. For example, if a game pays $10 with probability 0.3 and $0 with probability 0.7, the expected value is $10(0.3) + 0(0.7) = 3$ dollars per play. This kind of calculation shows up constantly in risk assessment and decision-making.
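Expected-value arithmetic like this is easy to check in code. Here's a minimal Python sketch of the game described above:

```python
# Expected value of a discrete payout: sum of (outcome * probability).
# The game from the text: $10 with probability 0.3, $0 with probability 0.7.
outcomes = [(10.0, 0.3), (0.0, 0.7)]
expected_value = sum(x * p for x, p in outcomes)
print(expected_value)  # 3.0
```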
Types of random variables
Different situations call for different probability models. Recognizing which random variable fits a scenario is a key skill, because each one comes with its own formulas for probabilities, expected values, and variances.
Bernoulli random variables
The Bernoulli random variable is the simplest discrete type: it has only two outcomes, typically coded as 1 ("success") and 0 ("failure").
- PMF: $P(X = 1) = p$, $P(X = 0) = 1 - p$
- Expected value: $E[X] = p$
- Variance: $\text{Var}(X) = p(1 - p)$
A single coin flip is the classic example. Any yes/no question with a fixed probability of "yes" can be modeled this way: Does a manufactured part pass inspection? Does a patient test positive?
Binomial random variables
If you repeat a Bernoulli trial $n$ independent times and count the total number of successes, you get a binomial random variable.
- PMF: $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$
- Expected value: $E[X] = np$
- Variance: $\text{Var}(X) = np(1 - p)$
For example, if a factory produces items with a 5% defect rate and you inspect a batch of 20, the number of defective items follows a binomial distribution with $n = 20$ and $p = 0.05$. The expected number of defects is $np = 20 \times 0.05 = 1$.
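A short Python sketch (standard library only) reproduces the defect-rate calculation; the helper name `binomial_pmf` is just illustrative:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 20, 0.05       # batch size and defect rate from the text
mean = n * p          # expected number of defects: 1.0
prob_no_defects = binomial_pmf(0, n, p)  # about 0.358
print(mean, prob_no_defects)
```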
Poisson random variables
The Poisson distribution models the number of events occurring in a fixed interval of time or space, where events happen independently at a constant average rate.
- PMF: $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k = 0, 1, 2, \ldots$
- Both the expected value and variance equal $\lambda$ (the rate parameter).
This is the go-to model for counting rare or random occurrences: the number of emails you receive per hour, radioactive decay events per second, or traffic accidents at an intersection per month. The Poisson distribution also serves as a good approximation to the binomial when $n$ is large and $p$ is small (with $\lambda = np$).
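The approximation can be checked numerically. This sketch compares the two PMFs for illustrative values n = 1000, p = 0.002 (so lambda = 2):

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * exp(-lam) / factorial(k)

# Poisson(lam = n*p) approximates Binomial(n, p) when n is large, p small.
n, p = 1000, 0.002
for k in range(5):
    print(k, binomial_pmf(k, n, p), poisson_pmf(k, n * p))
```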
Normal random variables
The normal (Gaussian) distribution is the most important continuous distribution. Its PDF produces the familiar symmetric, bell-shaped curve.
- PDF: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2 / (2\sigma^2)}$
- Characterized by two parameters: the mean $\mu$ (center) and standard deviation $\sigma$ (spread).
The standard normal distribution is the special case where $\mu = 0$ and $\sigma = 1$. You convert any normal variable to a standard normal using the Z-score: $Z = \frac{X - \mu}{\sigma}$.
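Standardization is a one-line computation. The numbers below (an IQ-style scale with mean 100 and standard deviation 15) are purely illustrative:

```python
def z_score(x, mu, sigma):
    """Standardize: how many standard deviations x lies from the mean mu."""
    return (x - mu) / sigma

# Illustrative example: a score of 130 on a scale with mean 100, sd 15.
print(z_score(130, 100, 15))  # 2.0
```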
The normal distribution appears everywhere because of the Central Limit Theorem (covered below), which says that averages of many independent random variables tend toward a normal distribution regardless of the original distribution. Heights, IQ scores, and measurement errors all approximately follow normal distributions.
Properties of random variables
Variance and standard deviation
While expected value tells you where a distribution is centered, variance tells you how spread out it is. It measures the average squared distance from the mean.
- For discrete variables: $\text{Var}(X) = \sum_x (x - \mu)^2 \, p(x)$
- For continuous variables: $\text{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x) \, dx$
- A useful shortcut: $\text{Var}(X) = E[X^2] - (E[X])^2$
Standard deviation is the square root of variance: $\sigma = \sqrt{\text{Var}(X)}$. It's often more intuitive because it's in the same units as the original variable. If heights have a standard deviation of 3 inches, that directly tells you how much heights typically vary from the mean.
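The shortcut formula is easy to verify on a concrete distribution. Here's a check for a fair six-sided die:

```python
# Verify Var(X) = E[X^2] - (E[X])^2 on a small discrete distribution:
# a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
p = 1 / 6
mean = sum(x * p for x in values)                        # 3.5
var_direct = sum((x - mean) ** 2 * p for x in values)    # definition
var_shortcut = sum(x * x * p for x in values) - mean**2  # shortcut
print(var_direct, var_shortcut)  # both 35/12, about 2.9167
```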
Moments of random variables
Moments are a family of summary statistics that describe different aspects of a distribution's shape:
- The 1st moment is the mean (expected value).
- The 2nd central moment is the variance.
- The 3rd central moment relates to skewness (how asymmetric the distribution is).
- The 4th central moment relates to kurtosis (how heavy the tails are).
The moment generating function (MGF) compactly encodes all moments: $M_X(t) = E[e^{tX}]$
You can extract the $n$th moment by taking the $n$th derivative and evaluating at $t = 0$: $E[X^n] = M_X^{(n)}(0)$. MGFs are especially useful for proving theoretical results about distributions.
Probability density function
A few key properties of the PDF that are worth keeping straight:
- The PDF must be non-negative everywhere and must integrate to 1 over its entire domain.
- The PDF value at a single point is not a probability. For continuous variables, $P(X = x) = 0$ for any specific $x$. Instead, you find probabilities over intervals by integrating: $P(a \le X \le b) = \int_a^b f(x) \, dx$.
- The PDF is the derivative of the CDF: $f(x) = F'(x)$.

Functions of random variables
When you apply a function to a random variable, you create a new random variable. Understanding how transformations affect expected values and variances is essential for modeling.
Linear transformations
A linear transformation takes the form $Y = aX + b$, where $a$ and $b$ are constants.
- Expected value: $E[aX + b] = aE[X] + b$
- Variance: $\text{Var}(aX + b) = a^2 \text{Var}(X)$
Notice that adding a constant $b$ shifts the mean but doesn't change the variance. Multiplying by $a$ scales both the mean and the standard deviation (by $|a|$), while the variance scales by $a^2$.
The most common application is standardization (computing Z-scores): $Z = \frac{X - \mu}{\sigma}$, which is a linear transformation with $a = 1/\sigma$ and $b = -\mu/\sigma$.
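Both rules can be confirmed by simulation. The constants a = 3, b = 5 and the Gaussian parameters below are arbitrary choices for illustration:

```python
import random

random.seed(0)
xs = [random.gauss(10, 2) for _ in range(50_000)]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

a, b = 3.0, 5.0
ys = [a * x + b for x in xs]

# E[aX + b] = a E[X] + b and Var(aX + b) = a^2 Var(X), up to rounding:
print(mean(ys), a * mean(xs) + b)  # nearly equal
print(var(ys), a**2 * var(xs))     # nearly equal
```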
Non-linear transformations
When $Y = g(X)$ for some non-linear function $g$, the distribution of $Y$ can look very different from that of $X$.
- Expected value: $E[g(X)] = \sum_x g(x) \, p(x)$ (discrete) or $E[g(X)] = \int g(x) f(x) \, dx$ (continuous)
- An important caution: in general, $E[g(X)] \ne g(E[X])$. For instance, $E[X^2] \ne (E[X])^2$ — their difference is exactly the variance. This is a common mistake.
Examples include logarithmic transformations (used in modeling multiplicative processes like stock returns) and squaring (used in computing variance itself).
Moment generating functions
The MGF is a powerful theoretical tool for several reasons:
- It uniquely determines a distribution. If two random variables have the same MGF, they have the same distribution.
- For sums of independent random variables, MGFs multiply: $M_{X+Y}(t) = M_X(t) M_Y(t)$. This makes it much easier to find the distribution of a sum.
- It provides a systematic way to compute moments by differentiation.
Joint random variables
So far we've looked at single random variables in isolation. In practice, you often care about the relationship between two or more variables simultaneously.
Joint probability distributions
A joint distribution describes the probability behavior of two (or more) random variables together.
- For discrete variables, the joint PMF gives $p(x, y) = P(X = x, Y = y)$ for every pair $(x, y)$.
- For continuous variables, the joint PDF $f(x, y)$ gives the relative likelihood of each pair of values.
- Both must be non-negative and sum (or integrate) to 1 over all possible pairs.
For example, you might model the joint distribution of height and weight in a population, where knowing someone's height gives you information about their likely weight.
Marginal distributions
A marginal distribution extracts the distribution of one variable from a joint distribution by "summing out" the other variable.
- For discrete variables: $p_X(x) = \sum_y p(x, y)$
- For continuous variables: $f_X(x) = \int_{-\infty}^{\infty} f(x, y) \, dy$
The marginal distribution tells you about one variable alone, ignoring the other. You can always recover marginals from a joint distribution, but you generally cannot reconstruct the joint distribution from marginals alone (because the marginals don't capture the relationship between the variables).
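Summing out one variable takes only a few lines of code. The joint PMF below is a made-up example:

```python
# A small joint PMF for two discrete variables, stored as a dict mapping
# (x, y) pairs to probabilities (illustrative numbers).
joint = {
    (0, 0): 0.1, (0, 1): 0.2,
    (1, 0): 0.3, (1, 1): 0.4,
}

def marginal_x(joint):
    """Marginal PMF of X: sum the joint PMF over all values of y."""
    px = {}
    for (x, _y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    return px

print(marginal_x(joint))  # x=0 -> 0.3, x=1 -> 0.7
```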
Conditional distributions
A conditional distribution describes one variable given that the other takes a specific value.
- For discrete variables: $p_{X|Y}(x \mid y) = \frac{p(x, y)}{p_Y(y)}$
- For continuous variables: $f_{X|Y}(x \mid y) = \frac{f(x, y)}{f_Y(y)}$
This is the probability version of asking "given that Y equals some value, what does X look like?" Conditional distributions are central to Bayesian inference, where you update your beliefs about a parameter after observing data. They also appear in medical diagnosis (probability of disease given a positive test) and weather forecasting (probability of rain given current atmospheric conditions).
Independence of random variables
Definition of independence
Two random variables $X$ and $Y$ are independent if knowing the value of one tells you nothing about the other. Formally, their joint distribution factors into the product of their marginals:
- For discrete variables: $p(x, y) = p_X(x) \, p_Y(y)$ for all $(x, y)$
- For continuous variables: $f(x, y) = f_X(x) \, f_Y(y)$ for all $(x, y)$
Independence is a strong condition. It means the conditional distribution of $X$ given $Y$ is the same as the unconditional distribution of $X$. This concept extends naturally to more than two variables.
Properties of independent variables
When $X$ and $Y$ are independent, several calculations simplify dramatically:
- Expected values of products factor: $E[XY] = E[X]E[Y]$
- Variances add: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
- MGFs multiply: $M_{X+Y}(t) = M_X(t) M_Y(t)$
The variance addition property is especially important. For dependent variables, you'd need to include a covariance term: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$. Independence makes that covariance term zero.
These properties are also what make the Central Limit Theorem work, since it requires independent (or at least uncorrelated) variables.
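The variance-addition rule is straightforward to verify by simulation; the Gaussian parameters below are arbitrary:

```python
import random

random.seed(1)
n = 100_000
xs = [random.gauss(0, 1) for _ in range(n)]  # Var(X) = 1
ys = [random.gauss(0, 2) for _ in range(n)]  # Var(Y) = 4, independent of xs

def var(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

sums = [x + y for x, y in zip(xs, ys)]
# For independent X, Y: Var(X + Y) = Var(X) + Var(Y), here 1 + 4 = 5.
print(var(sums), var(xs) + var(ys))  # both close to 5
```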
Covariance and correlation
Covariance measures how two random variables move together: $\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]$
- Positive covariance: when $X$ is above its mean, $Y$ tends to be above its mean too.
- Negative covariance: they tend to move in opposite directions.
- Zero covariance: no linear relationship.
The correlation coefficient standardizes covariance to a scale from $-1$ to $+1$: $\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$
A correlation of $\pm 1$ means a perfect linear relationship. A correlation of 0 means no linear relationship. One important subtlety: independent variables always have zero correlation, but zero correlation does not guarantee independence. Two variables can be uncorrelated yet still dependent through a non-linear relationship.
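A classic illustration is Y = X^2 with X symmetric around zero: Y is completely determined by X, yet their covariance is zero in the limit. A quick simulation:

```python
import random

random.seed(2)
xs = [random.gauss(0, 1) for _ in range(100_000)]
ys = [x * x for x in xs]  # Y is a deterministic function of X

def mean(v):
    return sum(v) / len(v)

mx, my = mean(xs), mean(ys)
cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
# Cov(X, X^2) = E[X^3] = 0 for a distribution symmetric about 0, so the
# sample covariance is near zero even though Y depends on X perfectly.
print(cov)
```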

Limit theorems
Limit theorems describe what happens to random variables as sample sizes grow large. They're the theoretical backbone of statistical inference.
Law of large numbers
The Law of Large Numbers (LLN) says that as you take more and more observations, the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ converges to the true expected value $\mu$.
- Weak law: the sample mean converges to $\mu$ in probability (the chance of $\bar{X}_n$ being far from $\mu$ shrinks to zero).
- Strong law: the sample mean converges to $\mu$ almost surely (with probability 1).
This formalizes the intuition that if you flip a fair coin thousands of times, the proportion of heads will settle close to 0.5. It's the foundation of insurance pricing, casino profitability, and any setting where long-run averages matter.
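A coin-flip simulation makes the convergence visible:

```python
import random

random.seed(3)

def proportion_of_heads(n_flips):
    """Flip a fair coin n_flips times and return the fraction of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# The sample proportion settles near the true probability 0.5:
for n in (100, 10_000, 1_000_000):
    print(n, proportion_of_heads(n))
```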
Central limit theorem
The Central Limit Theorem (CLT) is arguably the most important result in probability and statistics. It states that if you take $n$ independent, identically distributed random variables with mean $\mu$ and standard deviation $\sigma$, their standardized sum approaches a standard normal distribution as $n$ grows: $\frac{X_1 + \cdots + X_n - n\mu}{\sigma\sqrt{n}} \to N(0, 1)$ in distribution.
The remarkable thing is that this works regardless of the original distribution. Whether the individual variables are uniform, exponential, or something else entirely, the sum still tends toward normal. This is why the normal distribution shows up so often in practice, and it's the reason confidence intervals and many hypothesis tests are built on normal approximations.
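A sketch that standardizes sums of uniform variables shows the normal shape emerging. The choices n = 30 and 50,000 trials are arbitrary:

```python
import random

random.seed(4)

# Sums of Uniform(0, 1) variables: each has mean 1/2 and variance 1/12.
# Standardized sums should look roughly standard normal for modest n.
n, trials = 30, 50_000
mu, sigma = 0.5, (1 / 12) ** 0.5

zs = []
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    zs.append((s - n * mu) / (sigma * n ** 0.5))

# For a standard normal, about 68.3% of values fall within one sd of 0.
within_1 = sum(abs(z) < 1 for z in zs) / trials
print(within_1)  # close to 0.683
```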
Chebyshev's inequality
Chebyshev's inequality gives a universal bound on how far a random variable can stray from its mean: $P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$ for any $k > 0$.
This holds for any distribution with a finite mean and variance. For example, at least 75% of any distribution's values lie within 2 standard deviations of the mean (since $1 - \frac{1}{2^2} = 0.75$), and at least 89% lie within 3 standard deviations (since $1 - \frac{1}{3^2} \approx 0.889$).
The bound is often loose (the normal distribution does much better than Chebyshev predicts), but its power is in its generality. It's also used to prove the weak law of large numbers.
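A quick numerical check on an exponential distribution (which is far from normal) shows the bound holding with room to spare:

```python
import random

random.seed(5)
# Chebyshev guarantees P(|X - mu| >= k*sigma) <= 1/k^2 for ANY distribution
# with finite mean and variance. Check it on an Exponential(1) sample,
# which has mean 1 and standard deviation 1.
xs = [random.expovariate(1.0) for _ in range(100_000)]
mu, sigma = 1.0, 1.0

tails = {}
for k in (2, 3):
    tails[k] = sum(abs(x - mu) >= k * sigma for x in xs) / len(xs)
    print(k, tails[k], "Chebyshev bound:", 1 / k**2)
```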
Applications in probability theory
Stochastic processes
A stochastic process is a collection of random variables indexed by time (or space). Instead of a single random outcome, you're tracking how randomness evolves.
- Markov chains: discrete-time processes where the next state depends only on the current state.
- Poisson processes: model random events occurring continuously over time at a constant rate.
- Brownian motion: models continuous random movement, foundational in physics and finance.
Applications span stock price modeling, particle diffusion, population dynamics, and queueing systems.
Markov chains
A Markov chain is a stochastic process with the memoryless property: the probability of the next state depends only on the current state, not on how you got there.
The chain is fully described by its transition probability matrix, where entry $P_{ij}$ gives the probability of moving from state $i$ to state $j$. Over many steps, many Markov chains converge to a stationary distribution that doesn't change with further transitions.
Applications include Google's PageRank algorithm (modeling a random web surfer), weather prediction models, inventory management, and DNA sequence analysis.
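A two-state chain is enough to see convergence to a stationary distribution; the transition probabilities below are toy numbers:

```python
# A two-state weather chain: state 0 = sunny, 1 = rainy (toy numbers).
# Row i of P holds the probabilities of moving from state i to each state.
P = [
    [0.9, 0.1],  # sunny -> sunny 0.9, sunny -> rainy 0.1
    [0.5, 0.5],  # rainy -> sunny 0.5, rainy -> rainy 0.5
]

def step(dist, P):
    """One transition: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

dist = [1.0, 0.0]  # start sunny with certainty
for _ in range(100):
    dist = step(dist, P)
print(dist)  # converges to the stationary distribution [5/6, 1/6]
```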
Monte Carlo simulations
Monte Carlo methods use repeated random sampling to estimate quantities that are difficult or impossible to compute analytically.
The basic approach:
- Define the problem and identify the random inputs.
- Generate a large number of random samples from the appropriate distributions.
- Compute the quantity of interest for each sample.
- Average the results to get an estimate.
As the number of samples increases, the estimate converges to the true value (by the Law of Large Numbers). Monte Carlo methods are used in option pricing in finance, particle physics simulations, engineering reliability analysis, and estimating complex integrals.
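The classic toy example is estimating pi from uniform random points, and it follows the four steps above exactly:

```python
import random

random.seed(6)

# Estimate pi by sampling points uniformly in the unit square and counting
# the fraction that land inside the quarter circle x^2 + y^2 <= 1.
# That fraction converges to pi/4 by the Law of Large Numbers.
def estimate_pi(n_samples):
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        inside += x * x + y * y <= 1.0
    return 4.0 * inside / n_samples

print(estimate_pi(1_000_000))  # close to 3.14159
```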
Random variables in statistics
Random variables provide the theoretical foundation for statistical methods. Every dataset can be thought of as a collection of realized values from random variables, and statistical inference is about working backward from those values to learn about the underlying distributions.
Parameter estimation
Parameter estimation uses observed data to estimate unknown population parameters (like the true mean or variance).
- Maximum likelihood estimation (MLE): finds the parameter values that make the observed data most probable.
- Method of moments: matches sample moments (mean, variance, etc.) to their theoretical counterparts and solves for the parameters.
- Point estimates give a single best guess for the parameter.
- Interval estimates (confidence intervals) provide a range that likely contains the true parameter, quantifying the uncertainty in your estimate.
Hypothesis testing
Hypothesis testing is a structured way to use data to decide between competing claims.
- State the null hypothesis ($H_0$, the default assumption) and the alternative hypothesis ($H_1$).
- Choose a significance level ($\alpha$, typically 0.05).
- Compute a test statistic from the data.
- Find the p-value: the probability of observing a result at least as extreme as yours, assuming $H_0$ is true.
- If the p-value is less than $\alpha$, reject $H_0$.
The test statistic is itself a random variable, and its distribution under $H_0$ (often derived using the CLT) is what makes the whole framework work. Applications range from clinical drug trials to A/B testing in tech companies.
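As a sketch, here is a two-sided one-sample z-test (known population standard deviation) implemented from scratch with the standard library; the sample numbers are hypothetical:

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Two-sided z-test for H0: mu = mu0, with known population sd sigma."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_value = 2 * (1 - normal_cdf(abs(z)))
    return z, p_value

# Hypothetical numbers: sample of 50 with mean 103, H0 mean 100, sigma 10.
z, p = one_sample_z_test(103, 100, 10, 50)
print(z, p)  # z ≈ 2.12, p ≈ 0.034 -> reject H0 at alpha = 0.05
```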
Confidence intervals
A confidence interval gives a range of plausible values for a population parameter based on sample data.
- A 95% confidence interval means: if you repeated the sampling process many times, about 95% of the resulting intervals would contain the true parameter.
- Wider intervals reflect greater uncertainty (smaller samples or higher variability).
- The formula for a confidence interval for a mean often relies on the CLT: $\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$ (with the sample standard deviation $s$ substituted for $\sigma$ when it's unknown).
Confidence intervals are used in polling (margin of error), medical studies (treatment effect ranges), and engineering (tolerance specifications).
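The interval formula is a one-liner in code; the inputs below are hypothetical poll-style numbers:

```python
from math import sqrt

def mean_confidence_interval(sample_mean, sigma, n, z=1.96):
    """CI for a mean: sample_mean +/- z * sigma / sqrt(n).

    z = 1.96 gives a 95% interval; assumes sigma known or n large (CLT).
    """
    half_width = z * sigma / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Hypothetical poll: sample mean 0.52, sd 0.5, n = 1000 respondents.
lo, hi = mean_confidence_interval(0.52, 0.5, 1000)
print(lo, hi)  # roughly (0.489, 0.551)
```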