
🔀Stochastic Processes Unit 5 Review


5.6 Hidden Markov models


Written by the Fiveable Content Team • Last updated August 2025

Definition of hidden Markov models

A hidden Markov model (HMM) is a probabilistic model for sequences where the data you observe is generated by a process you can't directly see. You observe outputs (emissions), but the underlying states producing those outputs are hidden from you.

The core idea: a system moves through a sequence of hidden states following a Markov chain, and at each state it emits an observable output according to some probability distribution. Your job is to use the observations to infer what's happening in the hidden layer.

HMMs show up across many domains: speech recognition (hidden phonemes produce acoustic signals), bioinformatics (hidden gene regions produce nucleotide sequences), and financial modeling (hidden market regimes produce price movements).

Components of HMMs

An HMM is fully specified by five components, often written compactly as $\lambda = (S, V, \pi, A, B)$.

State space

The state space $S = \{s_1, s_2, \ldots, s_N\}$ is the set of possible hidden states. These states are never directly observed but govern what outputs get produced. In speech recognition, states might correspond to phonemes. In financial modeling, they might represent bull vs. bear market regimes.

Observation space

The observation space $V = \{v_1, v_2, \ldots, v_M\}$ is the set of possible outputs the model can emit. Each hidden state produces observations according to its own probability distribution. In speech recognition, observations are acoustic feature vectors. In finance, they could be daily returns or volatility measures.

Initial state distribution

The initial state distribution $\pi$ is a vector where $\pi_i = P(q_1 = s_i)$ gives the probability that the system starts in state $s_i$. This encodes your prior belief about which state the process begins in. The entries must be non-negative and sum to 1.

State transition probabilities

The transition matrix $A$ captures how the system moves between hidden states over time. Entry $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$ is the probability of moving from state $i$ to state $j$. Since the system must go somewhere at each step, each row of $A$ sums to 1 (making it a row-stochastic matrix).

Observation probabilities

The emission matrix $B$ links hidden states to observations. Entry $b_j(k) = P(o_t = v_k \mid q_t = s_j)$ is the probability of emitting observation $v_k$ when in state $s_j$. For discrete observations, each row of $B$ sums to 1. For continuous observations, $B$ is replaced by probability density functions (e.g., Gaussians) associated with each state.
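To make the five components concrete, here is a minimal NumPy sketch of a hypothetical two-state weather HMM (the states, observation symbols, and all numbers below are illustrative, not taken from the text):

```python
import numpy as np

# Hypothetical model: hidden states {0: Rainy, 1: Sunny},
# observations {0: walk, 1: shop, 2: clean}.
pi = np.array([0.6, 0.4])          # initial state distribution
A = np.array([[0.7, 0.3],          # transition matrix A
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],     # emission matrix B
              [0.6, 0.3, 0.1]])

# Sanity checks: pi sums to 1, and A and B are row-stochastic.
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```

Any valid parameter set must pass these row-sum checks; the same arrays are reused in the sketches below.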

Assumptions in HMMs

Markov property of states

The hidden state at time $t$ depends only on the state at time $t-1$:

$$P(q_t \mid q_{t-1}, q_{t-2}, \ldots, q_1) = P(q_t \mid q_{t-1})$$

This is the standard first-order Markov assumption. It makes the model tractable because you don't need to track the full history of states, just the most recent one.

Conditional independence of observations

The observation at time $t$ depends only on the current hidden state, not on any other states or observations:

$$P(o_t \mid q_t, q_{t-1}, \ldots, q_1, o_{t-1}, \ldots, o_1) = P(o_t \mid q_t)$$

This is sometimes called the output independence assumption. Combined with the Markov property, it gives HMMs their characteristic factored joint probability:

$$P(O, Q \mid \lambda) = \pi_{q_1} \, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1}, q_t} \, b_{q_t}(o_t)$$
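This factorization can be evaluated directly for any fixed state path. A short sketch, using a hypothetical two-state toy model (all parameter values are made up for illustration):

```python
import numpy as np

# Illustrative toy parameters (not from the text).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])

def joint_prob(obs, states):
    """P(O, Q | lambda) via the factored form:
    pi_{q1} * b_{q1}(o1) * prod_{t>=2} a_{q_{t-1}, q_t} * b_{q_t}(o_t)."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t-1], states[t]] * B[states[t], obs[t]]
    return p

# Observations (0, 1) with state path (1, 0):
# 0.4 * 0.6 * 0.4 * 0.4 = 0.0384
p = joint_prob([0, 1], [1, 0])
```

Summing `joint_prob` over all $N^T$ state paths would give $P(O \mid \lambda)$, which is exactly the exponential-cost computation the forward algorithm avoids.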

Types of HMM problems

HMMs give rise to three fundamental computational problems. Each has a dedicated algorithm.

Evaluation problem

Question: Given a model $\lambda$ and an observation sequence $O = (o_1, o_2, \ldots, o_T)$, what is $P(O \mid \lambda)$?

This tells you how well a particular model explains the observed data. It's useful for model comparison: if you have several candidate HMMs, pick the one that assigns the highest likelihood to your observations. Solved by the forward algorithm.

Decoding problem

Question: Given a model $\lambda$ and observations $O$, what is the most likely hidden state sequence $Q^* = \arg\max_Q P(Q \mid O, \lambda)$?

This recovers the best explanation for the observations in terms of hidden states. Solved by the Viterbi algorithm.


Learning problem

Question: Given observation sequences, how do you estimate the model parameters $\lambda = (\pi, A, B)$ that maximize $P(O \mid \lambda)$?

This is the training step. Solved by the Baum-Welch algorithm, which is a special case of Expectation-Maximization (EM).

Forward algorithm for evaluation

The naive approach to computing $P(O \mid \lambda)$ would sum over all possible state sequences, giving $O(N^T)$ complexity. The forward algorithm reduces this to $O(N^2 T)$ using dynamic programming.

Forward variables

Define the forward variable $\alpha_t(i)$ as the joint probability of observing the partial sequence up to time $t$ and being in state $i$ at time $t$:

$$\alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t = s_i \mid \lambda)$$

Recursive computation

  1. Initialization ($t = 1$): For each state $i$, $\alpha_1(i) = \pi_i \, b_i(o_1)$. This is the probability of starting in state $i$ times the probability of emitting the first observation from that state.

  2. Recursion ($t = 1, 2, \ldots, T-1$): For each state $j$, $\alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i) \, a_{ij}\right] b_j(o_{t+1})$. The bracketed sum collects all the ways you could arrive at state $j$ at time $t+1$. Multiplying by $b_j(o_{t+1})$ accounts for the new observation.

  3. Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$. Sum over all states at the final time step, since the sequence could end in any state.

Termination and likelihood

The final sum $\sum_{i=1}^{N} \alpha_T(i)$ gives the total likelihood of the observation sequence under the model. This value is used for model selection and also feeds into the Baum-Welch algorithm during training.
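The three steps above translate almost line for line into code. A sketch using the hypothetical two-state toy model from earlier (all parameter values are illustrative):

```python
import numpy as np

# Illustrative toy parameters (not from the text).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])

def forward(obs):
    """Forward algorithm: returns alpha (T x N) and P(O | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                    # initialization
    for t in range(1, T):
        # (alpha[t-1] @ A)[j] = sum_i alpha_{t-1}(i) * a_{ij}
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]  # recursion
    return alpha, float(alpha[-1].sum())            # termination
```

A useful check: summing $P(O \mid \lambda)$ over every possible observation sequence of a fixed length must give exactly 1, since the model defines a probability distribution over sequences.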

Viterbi algorithm for decoding

The Viterbi algorithm finds the single most probable state sequence. It has the same structure as the forward algorithm but replaces summation with maximization and adds backtracking.

Viterbi variables

Define $\delta_t(i)$ as the highest probability of any state path that ends in state $i$ at time $t$ while producing observations $o_1, \ldots, o_t$:

$$\delta_t(i) = \max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_{t-1}, q_t = s_i, o_1, \ldots, o_t \mid \lambda)$$

Also define $\psi_t(j)$ to store which previous state maximized the path into state $j$ at time $t$. This is what enables backtracking.

Recursive computation

  1. Initialization ($t = 1$): For each state $i$, $\delta_1(i) = \pi_i \, b_i(o_1)$ and $\psi_1(i) = 0$.

  2. Recursion ($t = 1, 2, \ldots, T-1$): For each state $j$, $\delta_{t+1}(j) = \max_{1 \le i \le N} \left[\delta_t(i) \, a_{ij}\right] b_j(o_{t+1})$ and $\psi_{t+1}(j) = \arg\max_{1 \le i \le N} \left[\delta_t(i) \, a_{ij}\right]$.

  3. Termination: $P^* = \max_{1 \le i \le N} \delta_T(i)$ and $q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$.

Termination and optimal state sequence

After termination, recover the full optimal path by backtracking:

$$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, T-2, \ldots, 1$$

You start from $q_T^*$ and trace backward through the stored $\psi$ pointers. The result is the most likely sequence of hidden states that explains the observations.
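The max-plus-backtrack structure can be sketched as follows, again on the hypothetical two-state toy model (parameter values are illustrative):

```python
import numpy as np

# Illustrative toy parameters (not from the text).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])

def viterbi(obs):
    """Most likely hidden state path and its joint probability."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)        # backpointers
    delta[0] = pi * B[:, obs[0]]             # initialization
    for t in range(1, T):
        scores = delta[t-1][:, None] * A     # scores[i, j] = delta(i) * a_ij
        psi[t] = scores.argmax(axis=0)       # best predecessor of each j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]         # termination: best final state
    for t in range(T - 1, 0, -1):            # backtrack through psi
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta[-1].max())
```

For short sequences the result can be verified against brute-force enumeration of all $N^T$ paths, which is a handy way to unit-test an implementation.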

Baum-Welch algorithm for learning

The Baum-Welch algorithm estimates $\lambda = (\pi, A, B)$ from observed data when the hidden states are unknown. It alternates between computing expected state occupancies (E-step) and re-estimating parameters (M-step), and is guaranteed to increase (or maintain) the likelihood at each iteration.

Forward and backward variables

The algorithm requires both:

  • Forward variables $\alpha_t(i)$: defined as above, computed left-to-right.
  • Backward variables $\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = s_i, \lambda)$: the probability of the future observations given that you're in state $i$ at time $t$.

The backward recursion mirrors the forward one:

  1. Initialization: $\beta_T(i) = 1$ for all states $i$.
  2. Recursion: $\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)$ for $t = T-1, T-2, \ldots, 1$.
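The backward pass is a right-to-left mirror of the forward pass. A sketch on the same hypothetical toy parameters (illustrative values only):

```python
import numpy as np

# Illustrative toy parameters (not from the text).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])

def backward(obs):
    """Backward algorithm: beta[t, i] = P(o_{t+1}, ..., o_T | q_t = s_i)."""
    T, N = len(obs), len(pi)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                   # initialization
    for t in range(T - 2, -1, -1):
        # (A @ x)[i] = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
```

One consistency check: $\sum_i \pi_i \, b_i(o_1) \, \beta_1(i)$ must equal the same $P(O \mid \lambda)$ that the forward algorithm produces.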

Expectation step

Using α\alpha and β\beta, compute two key posterior quantities:

  • State occupancy probability $\gamma_t(i)$: the probability of being in state $i$ at time $t$ given the full observation sequence.

$$\gamma_t(i) = \frac{\alpha_t(i) \, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j) \, \beta_t(j)}$$

  • Transition probability $\xi_t(i,j)$: the probability of being in state $i$ at time $t$ and state $j$ at time $t+1$, given the full observation sequence.

$$\xi_t(i,j) = \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{\sum_{i'=1}^{N} \sum_{j'=1}^{N} \alpha_t(i') \, a_{i'j'} \, b_{j'}(o_{t+1}) \, \beta_{t+1}(j')}$$

Note that $\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j)$, which provides a useful consistency check.
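Both posteriors fall out of the forward and backward tables with a normalization. A sketch of the E-step on the hypothetical toy parameters (illustrative values only):

```python
import numpy as np

# Illustrative toy parameters (not from the text).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])

def forward_backward(obs):
    """Compute the alpha and beta tables (each T x N)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

def e_step(obs):
    """Posterior occupancies gamma[t, i] and transitions xi[t, i, j]."""
    alpha, beta = forward_backward(obs)
    T, N = alpha.shape
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)        # normalize each t
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # xi[t, i, j] ∝ alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()
    return gamma, xi
```

The identity $\gamma_t(i) = \sum_j \xi_t(i,j)$ mentioned above makes a convenient assertion when debugging an implementation.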


Maximization step

Re-estimate the parameters using the expected counts from the E-step:

  • Initial distribution:

$$\hat{\pi}_i = \gamma_1(i)$$

  • Transition probabilities:

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$

This is the expected number of transitions from $i$ to $j$, divided by the expected number of times you're in state $i$ (excluding the last time step).

  • Emission probabilities (discrete case):

$$\hat{b}_j(k) = \frac{\sum_{t=1}^{T} \gamma_t(j) \, \mathbf{1}_{o_t = v_k}}{\sum_{t=1}^{T} \gamma_t(j)}$$

This is the expected number of times state $j$ emits symbol $v_k$, divided by the expected total time spent in state $j$.

Convergence and learned parameters

The E-step and M-step alternate until convergence. Typical stopping criteria:

  • The change in log-likelihood $\log P(O \mid \lambda)$ between iterations falls below a threshold
  • A maximum number of iterations is reached

Because EM only guarantees convergence to a local maximum, results can depend on initialization. Running the algorithm multiple times with different random starting parameters and selecting the best result is standard practice.
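Putting the E-step and M-step together, one full Baum-Welch iteration for a single discrete observation sequence can be sketched as below. This is a minimal illustration (toy starting parameters, no scaling for long sequences, single sequence only), not a production implementation:

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM iteration; returns updated (pi, A, B) and P(O | old params)."""
    T, N = len(obs), len(pi)
    # Forward and backward passes.
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()
    # E-step: posterior occupancies and transitions.
    gamma = alpha * beta / likelihood
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
    xi /= likelihood
    # M-step: re-estimate parameters from expected counts.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, float(likelihood)
```

Iterating this step should produce a non-decreasing sequence of likelihoods, which is a direct way to test that an implementation respects the EM guarantee.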

Extensions of HMMs

Higher-order Markov models

Standard HMMs use a first-order Markov assumption. Higher-order HMMs let the current state depend on the previous $k$ states:

$$P(q_t \mid q_{t-1}, q_{t-2}, \ldots, q_{t-k})$$

This captures longer-range temporal dependencies but increases the effective state space to $N^k$, making parameter estimation and computation significantly more expensive.

Input-output HMMs

Input-output HMMs (IOHMMs) condition the transition and/or emission probabilities on an external input sequence. This allows the model to adapt to changing external conditions, such as speaker identity in speech recognition or macroeconomic indicators in financial models.

Factorial HMMs

Factorial HMMs (FHMMs) represent the hidden state as a combination of multiple independent Markov chains running in parallel. The observation at each time step depends on the joint configuration of all chains. This is useful for modeling systems with multiple interacting processes (e.g., separating overlapping audio sources). Exact inference in FHMMs is intractable for large models, so approximate methods like variational inference are typically used.

Applications of HMMs

Speech recognition

Hidden states represent phonemes or sub-word units, and observations are acoustic feature vectors (e.g., MFCCs). The Viterbi algorithm decodes the most likely phoneme sequence from audio, and the Baum-Welch algorithm trains the model on labeled speech data. HMMs were the dominant approach in speech recognition for decades before deep learning methods took over, though hybrid HMM-neural network systems remain in use.

Bioinformatics and sequence analysis

In gene finding, hidden states might represent coding regions, introns, and intergenic regions, while observations are nucleotide bases (A, C, G, T). Profile HMMs are a specialized variant used for protein family classification and multiple sequence alignment. HMMs can also identify conserved motifs across related biological sequences.

Financial modeling and regime switching

Hidden states represent unobserved market regimes (e.g., low-volatility vs. high-volatility periods), and observations are financial time series like returns or spreads. Regime-switching HMMs can detect structural breaks in market behavior and capture phenomena like volatility clustering that standard models miss.

Limitations and challenges of HMMs

Model selection and complexity

Choosing the number of hidden states $N$ is not straightforward. Too few states and the model can't capture the true dynamics; too many and it overfits. Common approaches include:

  • Information criteria (AIC, BIC) that penalize model complexity
  • Cross-validation on held-out data
  • Bayesian nonparametric extensions (e.g., the infinite HMM) that let the data determine $N$

Identifiability issues

Different parameter settings can produce identical observation distributions, a problem known as label switching. Permuting the state labels and adjusting $A$, $B$, and $\pi$ accordingly gives a different $\lambda$ with the same likelihood. This doesn't affect prediction but complicates interpretation of learned states and comparison across model runs.

Computational complexity and scalability

The forward, Viterbi, and Baum-Welch algorithms all have time complexity $O(N^2 T)$, where $N$ is the number of states and $T$ is the sequence length. For large state spaces or very long sequences, this becomes expensive. Practical strategies to manage this include:

  • Beam search / pruning: Only track the top-scoring states at each time step
  • Sparse transition matrices: Exploit structure where many $a_{ij} = 0$
  • Variational or sampling-based approximations: Trade exact inference for speed
  • Parallel computation: The per-time-step operations over states can be parallelized
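As one illustration of the first strategy, beam pruning can be grafted onto the Viterbi recursion by zeroing out all but the top-scoring states at each time step. A sketch on the hypothetical toy parameters used earlier (illustrative values; with `beam_width` equal to $N$ it reduces to exact Viterbi, while smaller widths trade accuracy for speed):

```python
import numpy as np

# Illustrative toy parameters (not from the text).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])

def viterbi_beam(obs, beam_width):
    """Best-path probability under beam pruning: only the top
    `beam_width` states survive each time step."""
    T, N = len(obs), len(pi)
    delta = pi * B[:, obs[0]]
    for t in range(1, T):
        keep = np.argsort(delta)[-beam_width:]   # indices of top states
        pruned = np.zeros(N)
        pruned[keep] = delta[keep]               # zero out the rest
        delta = (pruned[:, None] * A).max(axis=0) * B[:, obs[t]]
    return float(delta.max())
```

Because pruning can discard the true best path, the pruned score is a lower bound on the exact Viterbi probability; in a real system one would also work in log space to avoid underflow on long sequences.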