Fiveable

🔀Stochastic Processes Unit 5 Review


5.1 Definition and properties of Markov chains


Written by the Fiveable Content Team • Last updated August 2025

Markov chains are stochastic models that describe systems transitioning between states over time. They're defined by a state space, transition probabilities, and the memoryless property, making them powerful tools for analyzing complex systems with probabilistic behavior.

This topic covers the key components and properties of Markov chains, including state space, transition probabilities, and classifications. It also covers long-term behavior, first passage times, and applications in fields like physics, biology, and computer science.

Definition of Markov chains

A Markov chain is a stochastic process where a system transitions between states over time, and the probability of moving to any future state depends only on the current state. The entire history of how the system arrived at its current state is irrelevant. This single constraint is what makes Markov chains both tractable and widely applicable across physics, biology, economics, and computer science.

The key components of a Markov chain are the state space, transition probabilities, initial state distribution, and the memoryless (Markov) property.

State space in Markov chains

The state space is the set of all possible states the system can occupy at any given time. Each state represents a unique configuration or condition of the system.

  • The state space can be discrete (finite or countably infinite) or continuous (uncountably infinite). In this unit, we focus primarily on discrete state spaces.
  • Example: In a weather model, the state space could be {sunny, cloudy, rainy}. In a queueing model, the state space might be {0, 1, 2, ..., n} representing the number of customers waiting.

Transition probabilities

Transition probabilities describe the likelihood of moving from one state to another in a single time step. The transition probability from state $i$ to state $j$ is written as:

$$P(X_{t+1} = j \mid X_t = i)$$

where $X_t$ represents the state at time $t$.

These probabilities are organized into a transition matrix $P$, where entry $P_{ij}$ gives the probability of going from state $i$ to state $j$. Two properties must hold for every row of this matrix:

  • Each entry is non-negative: $P_{ij} \geq 0$
  • Each row sums to 1: $\sum_j P_{ij} = 1$ (the system must go somewhere)

Example: For a two-state chain with states {0, 1}:

$$P = \begin{bmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{bmatrix}$$

This means if you're in state 0, there's a 70% chance of staying and a 30% chance of moving to state 1.
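A quick numerical check of these row properties on the two-state matrix above (a minimal sketch with NumPy):

```python
import numpy as np

# Transition matrix for the two-state chain {0, 1} from the example
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# The two defining properties of a transition matrix:
assert (P >= 0).all()                   # every entry is non-negative
assert np.allclose(P.sum(axis=1), 1.0)  # each row sums to 1

# Probability of moving from state 0 to state 1 in one step
print(P[0, 1])  # 0.3
```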

Initial state distribution

The initial state distribution, denoted $\pi_0$, is a probability vector specifying the likelihood of starting in each state at time $t = 0$. Each entry $\pi_0(i)$ gives the probability of starting in state $i$, and the entries must sum to 1.

Once you have $\pi_0$ and the transition matrix $P$, you can compute the state distribution at any future time step $t$ by:

$$\pi_t = \pi_0 P^t$$

Example: For a three-state chain, $\pi_0 = [0.2, 0.5, 0.3]$ means a 20% chance of starting in state 1, 50% in state 2, and 30% in state 3.
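Using the two-state matrix from the transition-probability example, and assuming for illustration that the chain starts in state 0 with certainty, the formula $\pi_t = \pi_0 P^t$ can be sketched as:

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
pi0 = np.array([1.0, 0.0])  # assumed start: state 0 with probability 1

# Distribution after t steps: pi_t = pi_0 P^t
t = 3
pi_t = pi0 @ np.linalg.matrix_power(P, t)

assert np.isclose(pi_t.sum(), 1.0)  # still a valid probability vector
print(pi_t)  # ≈ [0.583, 0.417]
```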

Memoryless property of Markov chains

The Markov property (also called the memoryless property) is the defining characteristic of a Markov chain. It states that the future evolution of the process depends only on the current state, not on the sequence of states that preceded it.

Mathematically:

$$P(X_{t+1} = j \mid X_t = i, X_{t-1} = k, \ldots) = P(X_{t+1} = j \mid X_t = i)$$

This is what makes Markov chains computationally manageable. Instead of tracking the full history of the process, you only need to know where the system is right now to determine where it goes next.

Classifications of Markov chains

Markov chains can be classified along several dimensions: the nature of time, whether transition probabilities change, the size of the state space, and structural properties like irreducibility and aperiodicity. These classifications determine which analytical tools apply and what long-term behavior to expect.

Discrete-time vs continuous-time

  • Discrete-time Markov chains (DTMCs) evolve at fixed time steps (e.g., daily, weekly). Transitions are governed by transition probabilities.
  • Continuous-time Markov chains (CTMCs) allow transitions at any point in time. Instead of transition probabilities, they use transition rates that describe how quickly transitions occur.

Example: A DTMC could model daily weather changes, while a CTMC could model the random arrival of customers at a service counter.

Time-homogeneous vs time-inhomogeneous

  • Time-homogeneous chains have transition probabilities that stay constant over time: $P(X_{t+1} = j \mid X_t = i)$ does not depend on $t$. Most of the classical theory (stationary distributions, ergodicity) applies to this case.
  • Time-inhomogeneous chains have transition probabilities that vary with $t$. These arise when the system's dynamics change over time, such as seasonal effects on consumer behavior.

Unless stated otherwise, you can generally assume a Markov chain is time-homogeneous in this course.

Finite vs infinite state space

  • Finite state space chains have a fixed number of states {1, 2, ..., n}. These are the most common in applications and have the most developed theory.
  • Countably infinite state space chains (e.g., a random walk on the integers) require more care, as convergence results that hold for finite chains don't always carry over.

Note: A chain with an infinite state space in this course typically means countably infinite (like the non-negative integers), not uncountably infinite. Uncountable state spaces fall under more general Markov process theory.

Irreducible Markov chains

A Markov chain is irreducible if every state can be reached from every other state in some finite number of steps. Formally, for all states $i$ and $j$, there exists some $n \geq 1$ such that $P^n_{ij} > 0$.

  • An irreducible chain has no absorbing states and no isolated subsets that trap the process.
  • Irreducibility ensures that the long-term behavior doesn't depend on which state you start in.

Example: A random walk on a connected graph is irreducible because you can get from any node to any other node.

Aperiodic Markov chains

The period of a state $i$ is the greatest common divisor (gcd) of all return times to that state: $d(i) = \gcd\{n \geq 1 : P^n_{ii} > 0\}$. A state is aperiodic if $d(i) = 1$, and a Markov chain is aperiodic if all its states are aperiodic.

  • Periodicity means the chain cycles through states in a predictable pattern, preventing convergence to a single stationary distribution.
  • A simple way to guarantee aperiodicity: if any state has a positive self-loop probability ($P_{ii} > 0$), that state is aperiodic.

Example: A chain alternating deterministically between two states has period 2 (periodic). A fair coin flip model where you can also stay in the same state is aperiodic.
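The period definition can be checked numerically by collecting the return times with positive probability and taking their gcd. A sketch for the two chains just described (the cutoff `max_n` is an arbitrary choice):

```python
import numpy as np
from math import gcd
from functools import reduce

def period(P, i, max_n=50):
    """Period of state i: gcd of all n <= max_n with (P^n)_{ii} > 0."""
    returns = [n for n in range(1, max_n + 1)
               if np.linalg.matrix_power(P, n)[i, i] > 1e-12]
    return reduce(gcd, returns)

# Deterministic alternation between two states: period 2
P_alt = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
# Same two states but with a chance of staying put: aperiodic
P_lazy = np.array([[0.5, 0.5],
                   [0.5, 0.5]])

print(period(P_alt, 0))   # 2
print(period(P_lazy, 0))  # 1
```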

Long-term behavior of Markov chains

The long-term behavior of a Markov chain describes what happens to the state distribution as the number of time steps grows without bound. This is where the classifications above pay off: irreducibility and aperiodicity together determine whether the chain settles into a predictable steady state.

Stationary distribution

A stationary distribution $\pi$ is a probability distribution over the states that satisfies:

$$\pi P = \pi$$

with $\sum_i \pi_i = 1$ and $\pi_i \geq 0$ for all $i$. If the chain's state distribution ever equals $\pi$, it stays at $\pi$ forever. That's why it's called "stationary."

Example: For the transition matrix $P = \begin{bmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{bmatrix}$, you can solve $\pi P = \pi$ to get $\pi = \left[\frac{4}{7}, \frac{3}{7}\right] \approx [0.571, 0.429]$.

To find $\pi$, set up the system of equations from $\pi P = \pi$ along with the constraint $\pi_1 + \pi_2 + \cdots = 1$, then solve.
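One way to carry out that solve for the two-state example: rearrange $\pi P = \pi$ into the homogeneous system $\pi(P - I) = 0$, append the normalization constraint as an extra equation, and solve by least squares (a sketch with NumPy):

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# pi P = pi  <=>  (P - I)^T pi^T = 0; stack the normalization sum(pi) = 1
n = P.shape[0]
A = np.vstack([(P - np.eye(n)).T, np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(pi)  # ≈ [0.571, 0.429] = [4/7, 3/7]
assert np.allclose(pi @ P, pi)  # pi really is stationary
```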

Existence and uniqueness of stationary distribution

Not every Markov chain has a stationary distribution, and some have more than one. The key result:

  • A finite-state, irreducible, aperiodic Markov chain has exactly one stationary distribution.
  • This follows from the Perron-Frobenius theorem, which guarantees that a primitive non-negative matrix (one that is irreducible and aperiodic) has a unique dominant left eigenvector with eigenvalue 1.

If the chain is irreducible but periodic, a unique stationary distribution still exists, but the chain won't converge to it in the usual pointwise sense (it oscillates instead).

Convergence to stationary distribution

For a finite-state, irreducible, aperiodic Markov chain with stationary distribution $\pi$:

$$\lim_{n \to \infty} P^n_{ij} = \pi_j \quad \text{for all } i, j$$

This means no matter where you start, the distribution of states converges to $\pi$.

The rate of convergence is governed by the second-largest eigenvalue (in absolute value) of $P$. If this eigenvalue is $\lambda_2$, convergence is roughly geometric at rate $|\lambda_2|$. A smaller $|\lambda_2|$ means faster convergence.
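Both claims can be illustrated on the two-state example: the rows of $P^n$ flatten to the stationary distribution, and the eigenvalues of $P$ reveal the geometric rate (a sketch):

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# After many steps, every row of P^n equals the stationary distribution
Pn = np.linalg.matrix_power(P, 50)
pi = np.array([4/7, 3/7])
assert np.allclose(Pn, np.vstack([pi, pi]))

# Eigenvalues sorted by absolute value: the second one sets the rate
eigvals = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
print(eigvals)  # [1.0, 0.3] — convergence is geometric at rate 0.3
```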

Ergodicity in Markov chains

A Markov chain is ergodic if it is irreducible, aperiodic, and positive recurrent. For finite-state chains, irreducibility and aperiodicity automatically imply positive recurrence, so any finite-state irreducible aperiodic chain is ergodic.

Ergodicity gives you a powerful result: time averages equal space averages. Specifically, for any function $f$ of the states:

$$\frac{1}{N} \sum_{t=1}^{N} f(X_t) \to \sum_i \pi_i f(i) \quad \text{as } N \to \infty$$

This means you can estimate the stationary distribution by running the chain for a long time and counting how often each state is visited.
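A minimal simulation sketch of this idea, counting visit frequencies for the two-state example chain (the seed and step count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Run the chain for many steps and record how often each state is visited
N = 100_000
state = 0
counts = np.zeros(2)
for _ in range(N):
    state = rng.choice(2, p=P[state])
    counts[state] += 1

print(counts / N)  # ≈ [0.571, 0.429], the stationary distribution
```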

First passage times

First passage times describe how long it takes a Markov chain to reach a particular state for the first time. These quantities capture the transient behavior of the chain, complementing the long-term analysis above.

Definition of first passage time

The first passage time from state $i$ to state $j$, denoted $T_{ij}$, is the number of steps until the chain first reaches state $j$, starting from state $i$:

$$T_{ij} = \min\{n \geq 1 : X_n = j \mid X_0 = i\}$$

When $i = j$, this is called the first return time to state $i$. A state is recurrent if the chain returns to it with probability 1, and transient if there's a positive probability of never returning.

Example: In a board game modeled as a Markov chain, $T_{ij}$ represents how many turns it takes to reach the winning square for the first time from your current position.
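A simulation sketch for the two-state example chain. From state 0, each step reaches state 1 with probability 0.3, so $T_{01}$ is geometric and $E[T_{01}] = 1/0.3 \approx 3.33$; averaging simulated passage times should land near that value (the routine name and seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

def first_passage_time(start, target):
    """Simulate one realization of T_{start, target}."""
    state, steps = start, 0
    while True:
        state = rng.choice(2, p=P[state])
        steps += 1
        if state == target:
            return steps

# Estimate E[T_01] by averaging many simulated passage times
samples = [first_passage_time(0, 1) for _ in range(20_000)]
print(np.mean(samples))  # ≈ 10/3
```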

Expected first passage times

The expected first passage time $E[T_{ij}]$ is the average number of steps to reach state $j$ from state $i$. For an ergodic chain, the expected return time to state $i$ is:

$$E[T_{ii}] = \frac{1}{\pi_i}$$

This connects first passage times directly to the stationary distribution: states visited more frequently in the long run have shorter expected return times.

For chains with absorbing states, expected first passage times can be computed using the fundamental matrix:

  1. Identify the transient states and form the submatrix $Q$ of $P$ (transitions among transient states only).

  2. Compute the fundamental matrix $N = (I - Q)^{-1}$.

  3. The entry $N_{ij}$ gives the expected number of times the chain visits transient state $j$ before absorption, starting from transient state $i$.

  4. The expected number of steps to absorption from state $i$ is the sum of row $i$ of $N$.
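A sketch of these steps on a small hypothetical chain (the matrix below is invented for illustration): states 0 and 1 are transient, state 2 is absorbing.

```python
import numpy as np

# Hypothetical chain: states 0, 1 transient; state 2 absorbing
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])

Q = P[:2, :2]                      # transitions among transient states
N = np.linalg.inv(np.eye(2) - Q)   # fundamental matrix N = (I - Q)^{-1}

# Row sums of N = expected steps to absorption from each transient state
steps_to_absorption = N.sum(axis=1)
print(steps_to_absorption)  # ≈ [4.21, 3.68]
```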

Absorption probabilities

In chains with absorbing states (states where $P_{jj} = 1$), absorption probabilities tell you the likelihood of ending up in each absorbing state.

To compute them:

  1. Form $Q$ (transient-to-transient transitions) and $R$ (transient-to-absorbing transitions) from the transition matrix.

  2. Compute the fundamental matrix $N = (I - Q)^{-1}$.

  3. The absorption probability matrix is $A = NR$, where $A_{ij}$ gives the probability of being absorbed in absorbing state $j$ starting from transient state $i$.

Example: In a model of disease progression with stages {healthy, mild, severe, terminal}, if "terminal" is absorbing, the absorption probability from "mild" tells you the likelihood that a patient in the mild stage eventually reaches the terminal stage.
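The $A = NR$ computation can be sketched on a small hypothetical chain (the matrix is invented for illustration) with two transient and two absorbing states:

```python
import numpy as np

# Hypothetical chain: states 0, 1 transient; states 2, 3 absorbing
P = np.array([[0.4, 0.3, 0.2, 0.1],
              [0.2, 0.4, 0.1, 0.3],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

Q = P[:2, :2]                      # transient -> transient block
R = P[:2, 2:]                      # transient -> absorbing block
N = np.linalg.inv(np.eye(2) - Q)   # fundamental matrix
A = N @ R                          # absorption probabilities

print(A)
assert np.allclose(A.sum(axis=1), 1.0)  # absorbed somewhere with prob. 1
```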

Applications of Markov chains

Markov chains show up across many disciplines because any system with probabilistic state transitions and the memoryless property can be modeled this way. Here are three major application areas.

Modeling with Markov chains

Markov chains provide a compact way to represent systems where the next state depends probabilistically on the current state. You define the states, estimate transition probabilities from data or domain knowledge, and then use the chain for simulation, prediction, or sensitivity analysis.

Example: An e-commerce website can be modeled with states {home, product page, cart, checkout, exit}. Transition probabilities capture how users navigate between pages. The stationary distribution reveals where users spend most of their time, and absorption probabilities (with "checkout" and "exit" as absorbing states) reveal conversion rates.

Markov decision processes

Markov decision processes (MDPs) extend Markov chains by adding actions and rewards. At each state, a decision-maker chooses an action that influences both the transition probabilities and the reward received.

  • The goal is to find a policy (a mapping from states to actions) that maximizes expected cumulative reward over time.
  • MDPs are the mathematical foundation of reinforcement learning and are widely used in robotics, operations research, and resource allocation.

Example: A robot navigating a grid world chooses movement directions (actions) at each cell (state). The transition probabilities account for stochastic movement (e.g., slipping), and rewards guide the robot toward a goal while avoiding obstacles.
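A minimal value-iteration sketch for a tiny hypothetical MDP (all numbers below are invented for illustration). It repeatedly applies the Bellman optimality update and reads off a greedy policy:

```python
import numpy as np

# Hypothetical MDP: 2 states, 2 actions.
# P[a][s, s'] = transition probability; R[s, a] = immediate reward.
P = np.array([[[0.9, 0.1],    # dynamics under action 0
               [0.4, 0.6]],
              [[0.2, 0.8],    # dynamics under action 1
               [0.1, 0.9]]])
R = np.array([[1.0, 0.0],     # reward for each (state, action) pair
              [0.0, 2.0]])
gamma = 0.9                   # discount factor

# Value iteration: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
V = np.zeros(2)
for _ in range(500):
    Qsa = R + gamma * np.einsum('ast,t->sa', P, V)
    V = Qsa.max(axis=1)

policy = Qsa.argmax(axis=1)   # greedy policy: best action in each state
print(V, policy)
```

At convergence, `V` satisfies the Bellman optimality equation, so it is a fixed point of the update above.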

Hidden Markov models

Hidden Markov models (HMMs) are used when the underlying Markov chain is not directly observable. Instead, each hidden state emits an observation according to some probability distribution, and you only see the observations.

The three central problems for HMMs are:

  1. Evaluation: Given a model and an observation sequence, what's the probability of that sequence? (solved by the forward algorithm)
  2. Decoding: Given observations, what's the most likely sequence of hidden states? (solved by the Viterbi algorithm)
  3. Learning: Given observations, how do you estimate the model parameters? (solved by the Baum-Welch algorithm, a form of EM)
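A minimal Viterbi sketch for the decoding problem, on a hypothetical two-state HMM with two observation symbols (all parameters invented for illustration):

```python
import numpy as np

A = np.array([[0.7, 0.3],      # hidden-state transition matrix
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],      # B[s, o] = P(observe o | hidden state s)
              [0.2, 0.8]])
pi0 = np.array([0.5, 0.5])     # initial hidden-state distribution

def viterbi(obs):
    """Most likely hidden-state sequence for an observation sequence."""
    T, S = len(obs), len(pi0)
    delta = np.zeros((T, S))            # best path probability ending in s
    psi = np.zeros((T, S), dtype=int)   # backpointers
    delta[0] = pi0 * B[:, obs[0]]
    for t in range(1, T):
        trans = delta[t - 1, :, None] * A   # (prev state, next state)
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, obs[t]]
    # Backtrack from the most probable final state
    path = [delta[-1].argmax()]
    for t in range(T - 1, 0, -1):
        path.append(psi[t, path[-1]])
    return path[::-1]

print(viterbi([0, 0, 1, 1]))  # [0, 0, 1, 1]
```

For long sequences a real implementation would work in log-probabilities to avoid underflow.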

Example: In speech recognition, hidden states represent phonemes (sound units) and observations are acoustic features. The HMM infers which phonemes were spoken based on the audio signal.