Definition of expectation
Expectation quantifies the "average" value a random variable takes, weighted by probability. It captures the central tendency, or long-run average, of a random variable and is denoted $E[X]$.
Discrete random variables
For a discrete random variable $X$ with probability mass function $p_X(x)$, the expectation is:
$$E[X] = \sum_{x} x \, p_X(x)$$
You multiply each possible value of $X$ by its probability, then sum everything up.
Example: For a fair six-sided die, each face has probability $\frac{1}{6}$, so:
$$E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5$$
Notice the expected value doesn't have to be a value $X$ can actually take.
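As a quick check, the die calculation can be reproduced with exact rational arithmetic:

```python
from fractions import Fraction

# E[X] for a fair die: weight each face by its probability 1/6 and sum.
outcomes = range(1, 7)
p = Fraction(1, 6)
expected_value = sum(x * p for x in outcomes)
print(expected_value)  # 7/2, i.e., 3.5
```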
Continuous random variables
For a continuous random variable $X$ with probability density function $f_X(x)$, the sum becomes an integral:
$$E[X] = \int_{-\infty}^{\infty} x \, f_X(x) \, dx$$
The logic is the same: weight each value by how likely it is (its density), then integrate over the entire domain.
Example: For a standard normal distribution ($X \sim N(0, 1)$), the density is symmetric about zero, so $E[X] = 0$.
Linearity of expectation
For any constants $a$ and $b$ and random variables $X$ and $Y$:
$$E[aX + bY] = a\,E[X] + b\,E[Y]$$
This holds regardless of whether $X$ and $Y$ are independent. That's what makes linearity so powerful: you can break apart complicated sums without worrying about dependence structure.
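A small sanity check with deliberately dependent variables (here Y is a function of X) shows the identity still holds:

```python
from fractions import Fraction

# X is a fair die roll; Y = 1 if X is even, else 0, so Y is fully determined by X.
p = Fraction(1, 6)
faces = range(1, 7)

E_X = sum(x * p for x in faces)             # 7/2
E_Y = sum((x % 2 == 0) * p for x in faces)  # 1/2

# E[2X + 3Y] computed directly from the joint behavior...
E_sum = sum((2 * x + 3 * (x % 2 == 0)) * p for x in faces)

# ...matches a*E[X] + b*E[Y], despite the dependence.
assert E_sum == 2 * E_X + 3 * E_Y
```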
Law of the unconscious statistician (LOTUS)
LOTUS lets you compute $E[g(X)]$ directly from the distribution of $X$, without first finding the distribution of $g(X)$:
- Discrete: $E[g(X)] = \sum_{x} g(x) \, p_X(x)$
- Continuous: $E[g(X)] = \int_{-\infty}^{\infty} g(x) \, f_X(x) \, dx$
Example: For $g(X) = X^2$, you can find $E[X^2]$ without deriving the distribution of $X^2$:
$$E[X^2] = \sum_{x} x^2 \, p_X(x)$$
This result is used constantly when computing variances.
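For instance, here is LOTUS applied to a fair die to get $E[X^2]$ without touching the distribution of $X^2$:

```python
from fractions import Fraction

# LOTUS for a fair die: E[g(X)] = sum of g(x) * p(x), with g(x) = x**2.
p = Fraction(1, 6)
E_X2 = sum(x**2 * p for x in range(1, 7))
print(E_X2)  # 91/6
```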
Properties of expectation
These properties follow from the definition and are used repeatedly in proofs throughout stochastic processes.
Non-negativity
If $X \ge 0$ (with probability 1), then $E[X] \ge 0$. This follows directly because you're summing or integrating non-negative quantities.
Monotonicity
If $X \le Y$ almost surely (i.e., $P(X \le Y) = 1$), then:
$$E[X] \le E[Y]$$
You can think of this as: if one random variable is always at most as large as another, its average can't exceed the other's average.
Bounds on expectation
The expectation is bounded by the extreme values of the random variable: if $m \le X \le M$, then
$$m \le E[X] \le M$$
Example: If $X$ counts the number of heads in 3 fair coin tosses, then $0 \le X \le 3$, so $0 \le E[X] \le 3$. (The actual value is $1.5$.)
Conditional expectation
Conditional expectation extends expectation to settings where you have partial information. It's one of the most important tools in stochastic processes because it formalizes how predictions update as new information arrives.
Definition and properties
The conditional expectation of $X$ given an event $A$ with $P(A) > 0$ is:
$$E[X \mid A] = \frac{E[X \, \mathbf{1}_A]}{P(A)}$$
where $\mathbf{1}_A$ is the indicator function of event $A$ (equals 1 when $A$ occurs, 0 otherwise). Conditional expectation inherits linearity, non-negativity, and monotonicity from regular expectation.
Example: In a standard deck of 52 cards, assign values Jack = 11, Queen = 12, King = 13. The conditional expected value given that the card is a face card:
$$E[X \mid \text{face card}] = \frac{11 + 12 + 13}{3} = 12$$

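The face-card example can be checked by enumerating the deck and applying the indicator formula directly. Rank values for the non-face cards are an assumption here (Ace = 1); they don't affect the conditional expectation.

```python
from fractions import Fraction

# 52-card deck: ranks 1..13 (Ace = 1 assumed), four suits of each rank.
ranks = [v for v in range(1, 14) for _ in range(4)]
n = len(ranks)

def is_face(v):
    return v >= 11  # Jack = 11, Queen = 12, King = 13

# E[X | A] = E[X * 1_A] / P(A), with A = "card is a face card".
E_X_times_indicator = Fraction(sum(v for v in ranks if is_face(v)), n)
P_A = Fraction(sum(1 for v in ranks if is_face(v)), n)
E_given_face = E_X_times_indicator / P_A
print(E_given_face)  # 12
```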
Tower property (law of iterated expectations)
For random variables $X$ and $Y$:
$$E[E[X \mid Y]] = E[X]$$
In words: if you first compute the expectation of $X$ conditional on $Y$, then average that over all possible values of $Y$, you recover the unconditional expectation of $X$. This is especially useful when the conditional expectation is easier to compute than the unconditional one.
Law of total expectation
For a partition $A_1, A_2, \dots, A_n$ of the sample space:
$$E[X] = \sum_{i=1}^{n} E[X \mid A_i] \, P(A_i)$$
This is the "discrete version" of the tower property. You break the sample space into cases, compute the expectation in each case, and take a weighted average.
Example: A factory produces items that are defective with probability 0.1. Non-defective items cost $10, defective items cost $50. The expected cost per item:
$$E[\text{cost}] = 10 \times 0.9 + 50 \times 0.1 = 9 + 5 = 14$$
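The same case split, as a one-line computation:

```python
# Law of total expectation over the partition {defective, not defective}.
p_defective = 0.1
cost_defective, cost_ok = 50, 10
expected_cost = cost_defective * p_defective + cost_ok * (1 - p_defective)
print(expected_cost)  # 14.0
```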
Variance and standard deviation
While expectation tells you the center, variance and standard deviation tell you how spread out the distribution is around that center.
Definition of variance
Variance is the average squared deviation from the mean:
$$\mathrm{Var}(X) = E\big[(X - E[X])^2\big]$$
A useful computational shortcut:
$$\mathrm{Var}(X) = E[X^2] - (E[X])^2$$
This second form is almost always easier to work with. To use it:
- Compute $E[X]$ (often via LOTUS or linearity).
- Compute $E[X^2]$ (using LOTUS).
- Subtract: $\mathrm{Var}(X) = E[X^2] - (E[X])^2$.
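Applying the shortcut to the fair die, with exact arithmetic:

```python
from fractions import Fraction

# Var(X) = E[X^2] - (E[X])^2 for a fair six-sided die.
p = Fraction(1, 6)
faces = range(1, 7)
E_X = sum(x * p for x in faces)       # 7/2
E_X2 = sum(x**2 * p for x in faces)   # 91/6, via LOTUS
var_X = E_X2 - E_X**2
print(var_X)  # 35/12
```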
Properties of variance
- Non-negativity: $\mathrm{Var}(X) \ge 0$ for any random variable $X$. Variance equals zero only if $X$ is constant with probability 1.
- Scaling: $\mathrm{Var}(aX + b) = a^2 \, \mathrm{Var}(X)$. Note the square on $a$; adding a constant shifts the distribution but doesn't change spread, so $\mathrm{Var}(X + b) = \mathrm{Var}(X)$.
- Additivity for independent variables: If $X$ and $Y$ are independent, $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$. Unlike linearity of expectation, this requires independence (or at least zero covariance).
Standard deviation
Standard deviation is the square root of variance, $\sigma_X = \sqrt{\mathrm{Var}(X)}$. It has the same units as $X$, making it more interpretable than variance. For a normal distribution with mean $\mu$ and standard deviation $\sigma$, the 68-95-99.7 rule applies: approximately 68% of values fall within $\mu \pm \sigma$, 95% within $\mu \pm 2\sigma$, and 99.7% within $\mu \pm 3\sigma$.
Coefficient of variation
The coefficient of variation is defined as $\mathrm{CV} = \sigma / \mu$ (for $\mu \neq 0$). The CV is dimensionless, so it lets you compare variability across random variables with different scales.
Example: Stock A has mean return 10% and standard deviation 5% ($\mathrm{CV} = 0.5$). Stock B has mean return 5% and standard deviation 5% ($\mathrm{CV} = 1$). Even though both have the same absolute spread, Stock B is twice as variable relative to its mean.
Covariance and correlation
These measure the linear relationship between two random variables. They're central to understanding how random variables interact, which matters a great deal in stochastic processes.
Definition of covariance
Covariance measures how two random variables vary together:
$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$$
A computational shortcut analogous to the variance formula:
$$\mathrm{Cov}(X, Y) = E[XY] - E[X]\,E[Y]$$
- Positive covariance: $X$ and $Y$ tend to move in the same direction.
- Negative covariance: they tend to move in opposite directions.
- Zero covariance: no linear association (but they could still be dependent).
Properties of covariance
- Symmetry: $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$
- Linearity in each argument: $\mathrm{Cov}(aX + b, cY + d) = ac \, \mathrm{Cov}(X, Y)$. Constants added to a variable don't affect covariance.
- Self-covariance: $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$
- General variance of a sum: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2 \, \mathrm{Cov}(X, Y)$. This shows why independence (which implies zero covariance) simplifies things.
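The variance-of-a-sum identity can be verified exactly for a pair of dependent variables (here X is a die roll and Y is its square):

```python
from fractions import Fraction

# Exact enumeration over a fair die. X = roll, Y = roll**2 (dependent on X).
p = Fraction(1, 6)
faces = range(1, 7)

def E(g):
    """Expectation of g(roll) under the uniform die distribution."""
    return sum(g(x) * p for x in faces)

E_X, E_Y = E(lambda x: x), E(lambda x: x**2)
var_X = E(lambda x: x**2) - E_X**2
var_Y = E(lambda x: x**4) - E_Y**2
cov_XY = E(lambda x: x**3) - E_X * E_Y

# Var(X + Y) computed directly matches Var(X) + Var(Y) + 2 Cov(X, Y).
var_sum = E(lambda x: (x + x**2) ** 2) - (E_X + E_Y) ** 2
assert var_sum == var_X + var_Y + 2 * cov_XY
```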

Correlation coefficient
Correlation is a normalized version of covariance:
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
always satisfying $-1 \le \rho_{X,Y} \le 1$.
- $\rho = 1$: perfect positive linear relationship
- $\rho = -1$: perfect negative linear relationship
- $\rho = 0$: no linear relationship (uncorrelated)
Keep in mind that uncorrelated does not imply independent, except in special cases (e.g., jointly normal random variables).
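A standard counterexample, checked exactly: X uniform on {-1, 0, 1} and Y = X² are uncorrelated but clearly dependent.

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}, Y = X**2. Y is a deterministic function of X,
# yet their covariance is exactly zero.
p = Fraction(1, 3)
support = [-1, 0, 1]

E_X = sum(x * p for x in support)           # 0
E_Y = sum(x**2 * p for x in support)        # 2/3
E_XY = sum(x * x**2 * p for x in support)   # 0

cov = E_XY - E_X * E_Y
assert cov == 0  # uncorrelated, despite full dependence
```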
Cauchy-Schwarz inequality
This is the inequality that guarantees $|\rho_{X,Y}| \le 1$:
$$|\mathrm{Cov}(X, Y)| \le \sigma_X \, \sigma_Y$$
Equality holds if and only if $Y = aX + b$ for some constants $a$ and $b$ (i.e., $X$ and $Y$ are linearly dependent). The Cauchy-Schwarz inequality appears frequently in proofs throughout probability and statistics.
Moment-generating functions
Moment-generating functions (MGFs) encode all the moments of a distribution into a single function. They're a key tool for identifying distributions and working with sums of independent random variables.
Definition and properties
The MGF of a random variable $X$ is:
$$M_X(t) = E[e^{tX}]$$
provided this expectation exists in a neighborhood of $t = 0$.
Key properties:
- Uniqueness: If two random variables have the same MGF (in a neighborhood of 0), they have the same distribution.
- Affine transformation: $M_{aX + b}(t) = e^{bt} \, M_X(at)$
- Independence and sums: If $X$ and $Y$ are independent, $M_{X+Y}(t) = M_X(t) \, M_Y(t)$. This makes MGFs extremely useful for finding the distribution of sums.
Relationship to expectation and variance
Moments are extracted by differentiating the MGF and evaluating at $t = 0$:
$$E[X] = M_X'(0), \qquad E[X^2] = M_X''(0), \qquad \mathrm{Var}(X) = M_X''(0) - \big(M_X'(0)\big)^2$$
More generally, the $n$-th moment is $E[X^n] = M_X^{(n)}(0)$.
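This can be illustrated numerically: build the MGF of a fair die and approximate its derivatives at $t = 0$ by finite differences (the step size $h$ is an arbitrary choice, assumed small enough for roughly 6-digit accuracy):

```python
import math

# MGF of a fair die: M(t) = (1/6) * sum of exp(t*x) over faces 1..6.
def M(t):
    return sum(math.exp(t * x) for x in range(1, 7)) / 6

h = 1e-4  # finite-difference step
first_moment = (M(h) - M(-h)) / (2 * h)           # ≈ E[X]   = 3.5
second_moment = (M(h) - 2 * M(0) + M(-h)) / h**2  # ≈ E[X^2] = 91/6
variance = second_moment - first_moment**2        # ≈ 35/12
```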
Uniqueness and existence
Not every random variable has an MGF. The MGF exists when $E[e^{tX}] < \infty$ for all $t$ in some open interval around 0. Distributions with light tails (e.g., normal, Poisson, exponential) have MGFs. Heavy-tailed distributions like the Cauchy distribution do not.
When the MGF does exist, it uniquely determines the distribution. This is why "matching MGFs" is a valid proof technique for showing two random variables have the same distribution.
Applications in probability calculations
- Sums of independent random variables: Multiply MGFs, then identify the resulting function. For example, the MGF of a standard normal is $M(t) = e^{t^2/2}$. The sum of $n$ independent standard normals has MGF $\big(e^{t^2/2}\big)^n = e^{nt^2/2}$, which is the MGF of $N(0, n)$.
- Proving limit theorems: MGFs provide one route to proving the central limit theorem and the law of large numbers.
- Identifying distributions: If you compute an MGF and recognize it as belonging to a known family, you've identified the distribution without inverting a transform.
Inequalities involving expectation and variance
These inequalities give you probability bounds using only moments, without needing the full distribution. They get progressively tighter as you use more information.
Markov's inequality
For a non-negative random variable $X$ and any $a > 0$:
$$P(X \ge a) \le \frac{E[X]}{a}$$
This is the weakest of the three inequalities here, but it only requires $X \ge 0$ and a finite mean. It's often used as a stepping stone to derive stronger bounds.
Example: If $E[X] = 5$, then $P(X \ge 10) \le \frac{5}{10} = 0.5$.
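A quick simulation makes the gap visible: for an exponential with mean 5, the true tail probability $P(X \ge 10) = e^{-2} \approx 0.135$ sits well under Markov's bound of 0.5.

```python
import math
import random

random.seed(0)

# X ~ Exponential with mean 5, so E[X] = 5 and Markov gives P(X >= 10) <= 0.5.
n = 100_000
samples = [random.expovariate(1 / 5) for _ in range(n)]

tail_freq = sum(s >= 10 for s in samples) / n
markov_bound = 5 / 10

assert tail_freq <= markov_bound
print(tail_freq, "vs bound", markov_bound, "vs exact", math.exp(-2))
```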
Chebyshev's inequality
For any random variable $X$ with finite mean $\mu$ and variance $\sigma^2$, and any $k > 0$:
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$$
Chebyshev's is stronger than Markov's because it uses both the mean and the variance. It applies to any distribution with finite variance.
Example: For any random variable with finite variance, at least 75% of the probability mass lies within 2 standard deviations of the mean ($P(|X - \mu| \ge 2\sigma) \le \frac{1}{4}$, so at most 25% lies outside).
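An empirical check for one concrete distribution, $X \sim$ Exponential(1), where $\mu = \sigma = 1$:

```python
import random

random.seed(1)

# Chebyshev with k = 2: P(|X - mu| >= 2*sigma) <= 1/4 for any finite-variance X.
# For Exponential(1), mu = sigma = 1 and the true tail is exp(-3), about 0.05.
n = 100_000
samples = [random.expovariate(1.0) for _ in range(n)]

outside = sum(abs(x - 1.0) >= 2.0 for x in samples) / n
assert outside <= 0.25  # well inside Chebyshev's bound
```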
Chernoff bounds
Chernoff bounds provide exponentially decaying tail bounds for sums of independent random variables, making them much tighter than Markov or Chebyshev for large deviations.
The general technique:
- For any $t > 0$, apply Markov's inequality to $e^{tX}$:
$$P(X \ge a) = P\big(e^{tX} \ge e^{ta}\big) \le e^{-ta} \, E[e^{tX}] = e^{-ta} \, M_X(t)$$
- Optimize over $t > 0$ to get the tightest bound.
A common special case (Hoeffding's inequality): for a sum $S_n = X_1 + \dots + X_n$ of independent bounded random variables with $X_i \in [a_i, b_i]$, and any $t > 0$:
$$P\big(S_n - E[S_n] \ge t\big) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right)$$
The exponential decay makes Chernoff bounds far more useful than Chebyshev's inequality when dealing with sums of many independent random variables, which is a common setting in stochastic processes.
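To see the difference concretely, compare the Hoeffding bound with Chebyshev's for 100 fair coin flips and a deviation of $t = 10$ (two standard deviations); the constants come from the formulas above, and the simulation is just a sanity check.

```python
import math
import random

random.seed(2)

# S = sum of n fair coin flips, each X_i in [0, 1]; deviation t above the mean n/2.
n, t = 100, 10

hoeffding = math.exp(-2 * t**2 / n)  # exp(-2), about 0.135
chebyshev = (n / 4) / t**2           # Var(S)/t^2 = 0.25

# Empirical tail frequency for P(S - n/2 >= t).
trials = 20_000
hits = sum(
    sum(random.random() < 0.5 for _ in range(n)) - n / 2 >= t
    for _ in range(trials)
) / trials

assert hits <= hoeffding <= chebyshev  # the exponential bound is the tighter one
```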