Random variables are the mathematical foundation for modeling uncertainty. Whether you're analyzing signal noise, predicting system failures, or modeling network traffic, you're working with random variables. This topic connects directly to everything else in your probability course: from basic probability axioms to statistical inference.
You're being tested on more than definitions here. Exam questions will ask you to choose the right distribution for a scenario, calculate expected values and variances, and apply limit theorems to real problems. Don't just memorize formulas. Understand when each distribution applies, how the PMF/PDF/CDF relate to each other, and why concepts like independence and the Central Limit Theorem matter. Master the underlying mechanics, and the formulas will make sense.
Foundations: Types of Random Variables
Before diving into specific distributions, you need to understand the fundamental distinction between discrete and continuous random variables. This classification determines which mathematical tools you'll use: summations vs. integrals, PMFs vs. PDFs.
Discrete Random Variables
A discrete random variable takes on specific, separated values (often integers). Think of things you can count: the number of defects in a batch, the number of heads in 10 coin flips.
Probability Mass Function (PMF) assigns a probability to each possible value: $P(X=x) \geq 0$ and $\sum_x P(X=x) = 1$.
Key identifier: ask yourself "can I list all possible values?" If yes, it's discrete.
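To make the two PMF conditions concrete, here is a minimal Python sketch; the loaded-die probabilities are invented for illustration:

```python
# A hypothetical loaded die, with its PMF stored as a dict.
pmf = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}

# The two defining PMF properties, checked explicitly:
assert all(p >= 0 for p in pmf.values())       # P(X = x) >= 0 for every x
assert abs(sum(pmf.values()) - 1.0) < 1e-12    # probabilities sum to 1

print(pmf[6])  # P(X = 6) = 0.5
```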
Continuous Random Variables
A continuous random variable can take any value within an interval. Voltage measurements, time between failures, and temperature readings are all continuous.
Probability Density Function (PDF) describes likelihood, but $P(X=x) = 0$ for any single value. Only intervals have nonzero probability.
Integration required: probabilities are calculated as $P(a \leq X \leq b) = \int_a^b f(x)\,dx$, where $f(x)$ is the PDF, and the total area under the PDF equals 1.
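As a sketch of how interval probabilities come from integration, the following uses scipy (one reasonable choice) with the made-up PDF $f(x) = 3x^2$ on $[0, 1]$:

```python
from scipy.integrate import quad

# f(x) = 3x^2 on [0, 1] is a valid PDF: nonnegative, and it integrates to 1.
f = lambda x: 3 * x**2

total, _ = quad(f, 0, 1)       # total area under the PDF
prob, _ = quad(f, 0.5, 0.8)    # P(0.5 <= X <= 0.8)

print(round(total, 6))  # 1.0
print(round(prob, 6))   # 0.8^3 - 0.5^3 = 0.387
```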
Cumulative Distribution Function (CDF)
The CDF works for both discrete and continuous variables, defined as $F(x) = P(X \leq x)$.
Properties to know: non-decreasing, $\lim_{x \to -\infty} F(x) = 0$, and $\lim_{x \to \infty} F(x) = 1$
PDF recovery: for continuous variables, $f(x) = \frac{d}{dx}F(x)$. The derivative of the CDF gives you the PDF.
For discrete variables, the CDF is a staircase function with jumps at each value where the PMF is nonzero.
Compare: PMF vs. PDF. Both describe how probability is distributed, but PMF gives actual probabilities (sum to 1) while PDF gives probability density (integrates to 1). On exams, using $P(X=x)$ with a continuous variable is an instant error.
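To see the PDF-recovery relationship numerically, here is a small sketch on the standard normal, assuming scipy is available; a central difference approximates the derivative of the CDF:

```python
from scipy.stats import norm

x, h = 1.0, 1e-6
# Central-difference derivative of the CDF at x:
numeric_pdf = (norm.cdf(x + h) - norm.cdf(x - h)) / (2 * h)

print(round(numeric_pdf, 6))   # ~0.241971
print(round(norm.pdf(x), 6))   # the true PDF value, matching
```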
Describing Distributions: Location and Spread
Every distribution can be characterized by its moments: numerical summaries that capture where the distribution is centered and how spread out it is. These quantities are essential for comparing distributions and solving problems.
Expected Value (Mean)
The expected value is the long-run average. If you repeated the experiment infinitely many times, this is the average outcome you'd observe.
Discrete: $E[X] = \sum_x x \cdot P(X=x)$
Continuous: $E[X] = \int_{-\infty}^{\infty} x \, f(x)\,dx$
Linearity property: $E[aX+b] = aE[X] + b$, and more generally $E[X+Y] = E[X] + E[Y]$. Linearity holds whether or not the variables are independent, and it simplifies many calculations.
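A quick simulation sketch of the linearity property, using numpy with arbitrarily chosen constants ($a = 3$, $b = 1$, and an exponential with mean 2):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # E[X] = 2 for this choice

a, b = 3.0, 1.0
print(np.mean(a * x + b))    # ~ a*E[X] + b = 7, estimated by simulation
print(a * np.mean(x) + b)    # the same quantity via linearity
```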
Variance and Standard Deviation
Variance measures spread around the mean: $\mathrm{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$. That second form (sometimes called the "computational formula") is almost always easier to use in practice.
Standard deviation: $\sigma = \sqrt{\mathrm{Var}(X)}$ puts dispersion back in the same units as the original variable.
Scaling property: $\mathrm{Var}(aX+b) = a^2 \mathrm{Var}(X)$. The constant $a$ gets squared, and the additive constant $b$ disappears entirely.
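And a matching sketch for the scaling property, again with made-up constants; note that the shift $b$ has no effect on the variance:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=100_000)  # Var(X) = 4

a, b = 3.0, 10.0
print(np.var(a * x + b))   # ~ a^2 * Var(X) = 36; b drops out entirely
print(a**2 * np.var(x))    # the same quantity via the scaling property
```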
Moment Generating Functions
Definition: $M_X(t) = E[e^{tX}]$. This function encodes all moments of a distribution.
Moment extraction: the $n$th moment is $E[X^n] = M_X^{(n)}(0)$, meaning you take the $n$th derivative and evaluate at $t = 0$.
Distribution identification: if two variables have the same MGF, they have the same distribution. This is a powerful tool for proofs and for identifying what distribution a transformed variable follows.
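Here is a worked sketch of moment extraction, using sympy for the symbolic derivatives; the MGF of a Bernoulli(p) variable is $M_X(t) = 1 - p + pe^t$:

```python
import sympy as sp

t, p = sp.symbols('t p')
M = 1 - p + p * sp.exp(t)          # MGF of a Bernoulli(p) variable

m1 = sp.diff(M, t, 1).subs(t, 0)   # first moment:  E[X]   = p
m2 = sp.diff(M, t, 2).subs(t, 0)   # second moment: E[X^2] = p
print(m1, m2)
print(sp.factor(m2 - m1**2))       # variance: p(1 - p), as expected
```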
Compare: Variance vs. Standard Deviation. Variance is mathematically convenient (additive for independent variables), but standard deviation is more interpretable (same units as data). Know when to use which.
Discrete Distributions: Counting Events
These distributions model scenarios where you're counting occurrences. The key is matching the physical situation to the right model based on the underlying assumptions.
Bernoulli Distribution
The simplest random variable: a single trial with two outcomes. Success (1) occurs with probability $p$, failure (0) with probability $1-p$.
Building block: every other discrete distribution in this section is built from Bernoulli trials.
Moments: $E[X] = p$ and $\mathrm{Var}(X) = p(1-p)$. Notice that variance is maximized at $p = 0.5$.
Binomial Distribution
Counts the number of successes in a fixed number n of independent trials, each with the same success probability p.
PMF: $P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$ for $k = 0, 1, \ldots, n$
Moments: $E[X] = np$ and $\mathrm{Var}(X) = np(1-p)$
Typical scenarios: number of defective items in a batch, number of correct answers on a multiple-choice test (if guessing randomly)
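A short sketch of the defective-items scenario with scipy, using made-up numbers ($n = 20$, $p = 0.05$):

```python
from scipy.stats import binom

n, p = 20, 0.05  # 20 items, each independently defective with prob. 0.05

print(binom.pmf(2, n, p))                 # P(exactly 2 defective)
print(binom.cdf(2, n, p))                 # P(at most 2 defective)
print(binom.mean(n, p), binom.var(n, p))  # np = 1.0 and np(1-p) = 0.95
```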
Poisson Distribution
Counts events occurring in a continuous interval of time or space, at a constant average rate $\lambda$.
PMF: $P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k = 0, 1, 2, \ldots$
Key property: $E[X] = \mathrm{Var}(X) = \lambda$. When mean equals variance, think Poisson.
Typical scenarios: number of emails per hour, number of typos per page, number of arrivals at a service counter
Compare: Binomial vs. Poisson. Binomial requires a fixed number of trials $n$; Poisson models events in continuous intervals with no fixed upper bound on the count. Poisson approximates Binomial when $n$ is large and $p$ is small (with $\lambda = np$). If a problem gives you "average rate" language, go Poisson.
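The approximation in this comparison is easy to check numerically; this sketch assumes scipy and picks $n = 1000$, $p = 0.003$ so that $\lambda = np = 3$:

```python
from scipy.stats import binom, poisson

n, p = 1000, 0.003
lam = n * p  # lambda = 3

# The two PMFs agree closely when n is large and p is small:
for k in range(6):
    print(k, round(binom.pmf(k, n, p), 5), round(poisson.pmf(k, lam), 5))
```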
Continuous Distributions: Measuring Quantities
These distributions model measurements that can take any value in an interval. Each has distinct shapes and applications. Learn to recognize them from problem context.
Uniform Distribution
Every value in $[a, b]$ is equally likely. The PDF is flat: $f(x) = \frac{1}{b-a}$ for $a \leq x \leq b$.
Moments: $E[X] = \frac{a+b}{2}$ (the midpoint) and $\mathrm{Var}(X) = \frac{(b-a)^2}{12}$
When to use it: often the right model when you have no information favoring any particular value within a range. Also commonly used for random number generation.
Normal (Gaussian) Distribution
The classic bell curve, symmetric around mean $\mu$, with spread controlled by $\sigma$.
PDF: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$
Standardization: $Z = \frac{X - \mu}{\sigma}$ transforms any normal variable to $N(0,1)$. This is essential for using Z-tables to look up probabilities.
Central role: the Central Limit Theorem makes this distribution appear everywhere in statistics, even when the underlying data isn't normal.
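Here is a standardization sketch with scipy playing the role of the Z-table; the parameters ($\mu = 100$, $\sigma = 15$) are invented for illustration:

```python
from scipy.stats import norm

mu, sigma = 100, 15
z = (130 - mu) / sigma             # standardize: z = 2.0

print(norm.cdf(z))                 # P(Z <= 2) ~ 0.97725, the Z-table value
print(norm.cdf(130, mu, sigma))    # same probability without standardizing
```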
Exponential Distribution
Models the time until the first event, when events occur at a constant rate $\lambda$. It's the continuous counterpart to the Poisson distribution: if events arrive at a Poisson rate, the waiting time between them is exponential.
PDF: $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$
CDF: $F(x) = 1 - e^{-\lambda x}$
Moments: $E[X] = \frac{1}{\lambda}$ and $\mathrm{Var}(X) = \frac{1}{\lambda^2}$
Memoryless property: $P(X > s + t \mid X > s) = P(X > t)$. This means the probability of waiting another $t$ units doesn't depend on how long you've already waited. The exponential is the only continuous distribution with this property.
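A numeric sketch of the memoryless property, with arbitrary choices $\lambda = 0.5$, $s = 2$, $t = 3$, using the survival function $P(X > x) = e^{-\lambda x}$:

```python
import math

lam, s, t = 0.5, 2.0, 3.0
surv = lambda x: math.exp(-lam * x)  # P(X > x) for Exponential(lam)

print(surv(s + t) / surv(s))  # P(X > s+t | X > s), by definition of conditioning
print(surv(t))                # P(X > t); identical, so the past s units don't matter
```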
Compare: Normal vs. Exponential. Normal is symmetric and defined on all real numbers; Exponential is right-skewed and non-negative. Normal models sums of many small effects; Exponential models waiting times. Mixing these up on distribution-selection problems is a common exam mistake.
Multivariate Concepts: Multiple Random Variables
Real problems often involve multiple interacting variables. Understanding how variables relate is crucial for system analysis.
Joint Probability Distributions
A joint distribution describes the simultaneous behavior of two (or more) random variables.
Joint PMF: $P(X=x, Y=y)$ for discrete variables; Joint PDF: $f(x,y)$ for continuous variables
Marginal distributions are recovered by summing out (discrete) or integrating out (continuous) the other variable. For example: $f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy$
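A minimal sketch of marginalization in the discrete case, using numpy and an invented 2x3 joint PMF table:

```python
import numpy as np

# Rows index the values of X, columns index the values of Y.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.15, 0.25, 0.20]])

p_x = joint.sum(axis=1)  # marginal of X: sum out y -> [0.40, 0.60]
p_y = joint.sum(axis=0)  # marginal of Y: sum out x -> [0.25, 0.45, 0.30]
print(p_x, p_y, joint.sum())  # all joint entries sum to 1
```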
Conditional Probability Distributions
Conditional distributions describe one variable given knowledge of another.
Formula: $f(x \mid y) = \frac{f(x,y)}{f_Y(y)}$. Joint divided by marginal.
This is the continuous analog of $P(A \mid B) = P(A \cap B)/P(B)$ from basic probability. Bayesian updating and signal detection both rely on conditional distributions.
Independence of Random Variables
X and Y are independent if knowing the value of one tells you nothing about the other.
Formal condition: $P(X=x, Y=y) = P(X=x) \cdot P(Y=y)$ for all $x, y$ (discrete), or $f(x,y) = f_X(x) \cdot f_Y(y)$ (continuous). The joint factors into the product of the marginals.
Why it matters: independence dramatically simplifies calculations. For instance, $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$ holds only when $X$ and $Y$ are independent (or more generally, uncorrelated).
Covariance and Correlation
Covariance: $\mathrm{Cov}(X,Y) = E[XY] - E[X]E[Y]$. Positive covariance means the variables tend to increase together; negative means one tends to decrease when the other increases.
Correlation: $\rho = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}$. This is standardized to $[-1, 1]$ and measures the strength of the linear relationship.
Independence implies zero correlation, but zero correlation does not imply independence. A classic counterexample: let $X$ be uniform on $[-1, 1]$ and $Y = X^2$. Then $\rho = 0$, but $Y$ is completely determined by $X$.
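That counterexample is easy to verify by simulation; a sketch with numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=1_000_000)  # X uniform on [-1, 1]
y = x**2                                # Y is completely determined by X

print(np.corrcoef(x, y)[0, 1])  # ~0: uncorrelated despite total dependence
```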
Compare: Covariance vs. Correlation. Covariance depends on units and scale, making it hard to interpret on its own. Correlation is dimensionless and bounded by $[-1, 1]$, so you can compare relationship strengths across different variable pairs. If $\rho = 0$, variables are uncorrelated but not necessarily independent.
Limit Theorems: Large-Sample Behavior
These theorems explain why probability works in practice and justify most of statistical inference. They're conceptual cornerstones. Expect them on exams.
Law of Large Numbers
As the sample size $n$ grows, the sample mean $\bar{X}_n$ converges to the true population mean $E[X]$.
Practical meaning: averages of large samples reliably estimate population means. This is why polling works, why casinos are profitable in the long run, and why Monte Carlo simulation converges.
Requirement: the observations must be independent and identically distributed (i.i.d.) with a finite mean.
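A simulation sketch of the convergence, using numpy and i.i.d. Exponential(1) draws (true mean 1):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=100_000)
running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)

for n in (10, 1_000, 100_000):
    print(n, running_mean[n - 1])  # settles toward E[X] = 1 as n grows
```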
Central Limit Theorem
For large n, the standardized sample mean is approximately normal:
$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \approx N(0, 1)$
This holds regardless of the original distribution of the individual observations, as long as they're i.i.d. with finite mean $\mu$ and finite variance $\sigma^2$.
Rule of thumb: $n \geq 30$ often suffices for a good approximation. Fewer samples are needed for symmetric distributions; more are needed for highly skewed ones.
Applications: this justifies confidence intervals, hypothesis tests, and normal approximations to the binomial and Poisson.
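A simulation sketch of the theorem in action, deliberately starting from a skewed Exponential(1) distribution and using the rule of thumb with $n = 50$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 50, 20_000
# Each row is one experiment: the mean of n skewed draws.
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# CLT prediction: mean ~ mu = 1, std ~ sigma/sqrt(n) = 1/sqrt(50) ~ 0.1414
print(means.mean(), means.std())
```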
Compare: Law of Large Numbers vs. Central Limit Theorem. LLN tells you where the sample mean goes (converges to $\mu$). CLT tells you how it gets there (normally distributed around $\mu$ with standard deviation $\sigma/\sqrt{n}$). Both require i.i.d. observations, but CLT also needs finite variance.
Quick Reference Table
| Concept | Best Examples |
| --- | --- |
| Discrete distributions | Bernoulli, Binomial, Poisson |
| Continuous distributions | Uniform, Normal, Exponential |
| Location measures | Expected value, Median |
| Spread measures | Variance, Standard deviation |
| Distribution functions | PMF, PDF, CDF, MGF |
| Multivariate relationships | Joint distributions, Covariance, Correlation |
| Independence concepts | Independent variables, Uncorrelated variables |
| Asymptotic results | Law of Large Numbers, Central Limit Theorem |
Self-Check Questions
A quality engineer counts defective chips in batches of 100. Which distribution applies: Binomial or Poisson? What if she instead counts defects arriving per hour at a testing station?
You're given that $E[X] = 5$ and $\mathrm{Var}(X) = 5$. Which distribution might X follow, and why does this moment relationship matter?
Compare the CDF for discrete vs. continuous random variables. How does the CDF behave at jump points for a discrete variable?
Two random variables have correlation $\rho = 0$. Are they necessarily independent? Provide a counterexample or explain why independence would follow.
You need to approximate the distribution of the sample mean from 50 independent measurements of a skewed variable. Which theorem justifies using a normal approximation, and what parameters would you use?