Definition of hidden Markov models
A hidden Markov model (HMM) is a probabilistic model for sequences where the data you observe is generated by a process you can't directly see. You observe outputs (emissions), but the underlying states producing those outputs are hidden from you.
The core idea: a system moves through a sequence of hidden states following a Markov chain, and at each state it emits an observable output according to some probability distribution. Your job is to use the observations to infer what's happening in the hidden layer.
HMMs show up across many domains: speech recognition (hidden phonemes produce acoustic signals), bioinformatics (hidden gene regions produce nucleotide sequences), and financial modeling (hidden market regimes produce price movements).
Components of HMMs
An HMM is fully specified by five components, often written compactly as λ = (S, V, π, A, B).
State space
The state space S = {s_1, s_2, …, s_N} is the set of N possible hidden states. These states are never directly observed but govern what outputs get produced. In speech recognition, states might correspond to phonemes. In financial modeling, they might represent bull vs. bear market regimes.
Observation space
The observation space V = {v_1, v_2, …, v_M} is the set of M possible outputs the model can emit. Each hidden state produces observations according to its own probability distribution. In speech recognition, observations are acoustic feature vectors. In finance, they could be daily returns or volatility measures.
Initial state distribution
The initial state distribution is a vector π = (π_1, …, π_N), where π_i gives the probability that the system starts in state s_i. This encodes your prior belief about which state the process begins in. The entries must be non-negative and sum to 1.
State transition probabilities
The transition matrix A captures how the system moves between hidden states over time. Entry a_ij = P(q_{t+1} = s_j | q_t = s_i) is the probability of moving from state s_i to state s_j. Since the system must go somewhere at each step, each row of A sums to 1 (making it a row-stochastic matrix).
Observation probabilities
The emission matrix B links hidden states to observations. Entry b_j(v_k) = P(o_t = v_k | q_t = s_j) is the probability of emitting observation v_k when in state s_j. For discrete observations, each row of B sums to 1. For continuous observations, B is replaced by probability density functions (e.g., Gaussians) associated with each state.
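As a concrete sketch, here is what the parameter components (π, A, B) of a toy two-state, three-symbol model might look like in NumPy. All numbers are hypothetical, chosen only to illustrate the row-stochastic constraints:

```python
import numpy as np

# Hypothetical 2-state, 3-symbol HMM for illustration.
# States: 0 = "bull", 1 = "bear"; symbols: 0 = "up", 1 = "flat", 2 = "down".
pi = np.array([0.6, 0.4])          # initial state distribution (sums to 1)
A = np.array([[0.7, 0.3],          # transition matrix: row i = P(next state | state i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],     # emission matrix: row j = P(symbol | state j)
              [0.1, 0.3, 0.6]])

# Each row of A and B must be a probability distribution.
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```

The same toy numbers are reused in the later algorithm sketches.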
Assumptions in HMMs
Markov property of states
The hidden state at time t depends only on the state at time t−1:
P(q_t | q_{t−1}, q_{t−2}, …, q_1) = P(q_t | q_{t−1})
This is the standard first-order Markov assumption. It makes the model tractable because you don't need to track the full history of states, just the most recent one.
Conditional independence of observations
The observation at time t depends only on the current hidden state, not on any other states or observations:
P(o_t | q_1, …, q_T, o_1, …, o_{t−1}, o_{t+1}, …, o_T) = P(o_t | q_t)
This is sometimes called the output independence assumption. Combined with the Markov property, it gives HMMs their characteristic factored joint probability:
P(O, Q | λ) = π_{q_1} b_{q_1}(o_1) ∏_{t=2}^{T} a_{q_{t−1} q_t} b_{q_t}(o_t)
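To make the factored joint probability concrete, here is a minimal sketch (with hypothetical toy parameters) that evaluates it for one fixed state path:

```python
import numpy as np

def joint_prob(pi, A, B, states, obs):
    """P(O, Q | lambda) via the factored form:
    pi[q_1] * b_{q_1}(o_1) * product over t of a[q_{t-1}, q_t] * b_{q_t}(o_t)."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p

# Toy 2-state, 3-symbol parameters (hypothetical numbers).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

p = joint_prob(pi, A, B, states=[0, 0, 1], obs=[0, 1, 2])
# 0.6*0.5 * 0.7*0.4 * 0.3*0.6 = 0.01512
```

Summing this quantity over all possible state paths would give P(O | λ), which is exactly what the forward algorithm computes efficiently.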
Types of HMM problems
HMMs give rise to three fundamental computational problems. Each has a dedicated algorithm.
Evaluation problem
Question: Given a model λ and an observation sequence O = (o_1, …, o_T), what is P(O | λ)?
This tells you how well a particular model explains the observed data. It's useful for model comparison: if you have several candidate HMMs, pick the one that assigns the highest likelihood to your observations. Solved by the forward algorithm.
Decoding problem
Question: Given a model λ and observations O, what is the most likely hidden state sequence Q* = (q_1*, …, q_T*)?
This recovers the best explanation for the observations in terms of hidden states. Solved by the Viterbi algorithm.

Learning problem
Question: Given observation sequences, how do you estimate the model parameters λ = (π, A, B) that maximize P(O | λ)?
This is the training step. Solved by the Baum-Welch algorithm, which is a special case of Expectation-Maximization (EM).
Forward algorithm for evaluation
The naive approach to computing P(O | λ) would sum over all N^T possible state sequences, giving O(T · N^T) complexity. The forward algorithm reduces this to O(N²T) using dynamic programming.
Forward variables
Define the forward variable α_t(i) as the joint probability of observing the partial sequence o_1, …, o_t up to time t and being in state s_i at time t:
α_t(i) = P(o_1, o_2, …, o_t, q_t = s_i | λ)
Recursive computation
- Initialization (t = 1): For each state s_i, set α_1(i) = π_i b_i(o_1). This is the probability of starting in state s_i times the probability of emitting the first observation from that state.
- Recursion (t = 2, …, T): For each state s_j, compute α_t(j) = [Σ_{i=1}^{N} α_{t−1}(i) a_ij] · b_j(o_t). The bracketed sum collects all the ways you could arrive at state s_j at time t. Multiplying by b_j(o_t) accounts for the new observation.
- Termination: P(O | λ) = Σ_{i=1}^{N} α_T(i). Sum over all states at the final time step, since the sequence could end in any state.
Termination and likelihood
The final sum gives the total likelihood of the observation sequence under the model. This value is used for model selection and also feeds into the Baum-Welch algorithm during training.
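The three steps above fit in a few lines of NumPy. This is a sketch with hypothetical toy parameters; a production version would rescale α or work in log space to avoid underflow on long sequences:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: alpha[t, i] = P(o_1..o_t, q_t = i | lambda).
    Returns the alpha table and the likelihood P(O | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # recursion over time
    return alpha, alpha[-1].sum()                     # termination: sum over final states

# Toy 2-state, 3-symbol model (hypothetical numbers).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

alpha, likelihood = forward(pi, A, B, obs=[0, 2])
```

Each time step costs one N×N matrix-vector product, which is where the O(N²T) total comes from.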
Viterbi algorithm for decoding
The Viterbi algorithm finds the single most probable state sequence. It has the same structure as the forward algorithm but replaces summation with maximization and adds backtracking.
Viterbi variables
Define δ_t(i) as the highest probability of any state path that ends in state s_i at time t while producing observations o_1, …, o_t:
δ_t(i) = max_{q_1, …, q_{t−1}} P(q_1, …, q_{t−1}, q_t = s_i, o_1, …, o_t | λ)
Also define ψ_t(j) to store which previous state maximized the path into state s_j at time t. This is what enables backtracking.
Recursive computation
- Initialization (t = 1): For each state s_i, δ_1(i) = π_i b_i(o_1) and ψ_1(i) = 0.
- Recursion (t = 2, …, T): For each state s_j, δ_t(j) = max_{1≤i≤N} [δ_{t−1}(i) a_ij] · b_j(o_t) and ψ_t(j) = argmax_{1≤i≤N} [δ_{t−1}(i) a_ij].
- Termination: P* = max_{1≤i≤N} δ_T(i) and q_T* = argmax_{1≤i≤N} δ_T(i).
Termination and optimal state sequence
After termination, recover the full optimal path by backtracking: q_t* = ψ_{t+1}(q_{t+1}*) for t = T−1, T−2, …, 1.
You start from q_T* and trace backward through the stored pointers. The result is the most likely sequence of hidden states that explains the observations.
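The same structure as the forward sketch, with max/argmax in place of the sum and a backtracking pass at the end (hypothetical toy parameters again; log space is preferred in practice):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi decoding: returns the most likely state path and its probability."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))            # delta[t, i]: best path prob ending in state i
    psi = np.zeros((T, N), dtype=int)   # psi[t, j]: best predecessor of state j at time t
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A     # scores[i, j] = delta[t-1, i] * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()              # termination
    for t in range(T - 2, -1, -1):             # backtracking through stored pointers
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()

# Toy model (hypothetical numbers): two "down" symbols should favor state 1.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
path, p_star = viterbi(pi, A, B, obs=[2, 2])
```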
Baum-Welch algorithm for learning
The Baum-Welch algorithm estimates λ = (π, A, B) from observed data when the hidden states are unknown. It alternates between computing expected state occupancies (E-step) and re-estimating parameters (M-step), guaranteed to increase (or maintain) the likelihood at each iteration.
Forward and backward variables
The algorithm requires both:
- Forward variables α_t(i): defined as above, computed left-to-right.
- Backward variables β_t(i) = P(o_{t+1}, …, o_T | q_t = s_i, λ): the probability of the future observations given that you're in state s_i at time t.
The backward recursion mirrors the forward one:
- Initialization: β_T(i) = 1 for all states s_i.
- Recursion: β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j) for t = T−1, T−2, …, 1.
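The backward pass is the mirror image of the forward sketch (same hypothetical toy parameters, re-declared so the snippet stands alone):

```python
import numpy as np

def backward(A, B, obs, N):
    """Backward algorithm: beta[t, i] = P(o_{t+1}..o_T | q_t = i, lambda)."""
    T = len(obs)
    beta = np.ones((T, N))                    # initialization: beta[T-1, i] = 1
    for t in range(T - 2, -1, -1):            # recursion, right to left
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

# Toy model (hypothetical numbers).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 2]
beta = backward(A, B, obs, N=2)

# Consistency check: P(O | lambda) can also be read off at t = 0
# as sum_i pi_i * b_i(o_1) * beta_1(i).
likelihood = (pi * B[:, obs[0]] * beta[0]).sum()
```

Agreement between this likelihood and the one from the forward pass is a standard sanity check on an implementation.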
Expectation step
Using α and β, compute two key posterior quantities:
- State occupancy probability γ_t(i) = α_t(i) β_t(i) / P(O | λ): the probability of being in state s_i at time t given the full observation sequence.
- Transition probability ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O | λ): the probability of being in state s_i at time t and state s_j at time t+1, given the full observation sequence.
Note that γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j) for t < T, which provides a useful consistency check.
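A self-contained E-step sketch that computes both posteriors from scratch (hypothetical toy parameters; unscaled, so suitable only for short sequences):

```python
import numpy as np

def posteriors(pi, A, B, obs):
    """E-step: gamma[t, i] = P(q_t=i | O), xi[t, i, j] = P(q_t=i, q_{t+1}=j | O)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                     # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):            # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / likelihood
    return gamma, xi

# Toy model (hypothetical numbers).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
gamma, xi = posteriors(pi, A, B, obs=[0, 1, 2])

# Consistency checks from the text: gamma rows are distributions,
# and gamma[t] equals xi[t] summed over the next state.
assert np.allclose(gamma.sum(axis=1), 1.0)
assert np.allclose(gamma[:-1], xi.sum(axis=2))
```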

Maximization step
Re-estimate the parameters using the expected counts from the E-step:
- Initial distribution: π'_i = γ_1(i)
- Transition probabilities: a'_ij = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i)
This is the expected number of transitions from s_i to s_j, divided by the expected number of times you're in state s_i (excluding the last time step).
- Emission probabilities (discrete case): b'_j(v_k) = Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
This is the expected number of times state s_j emits symbol v_k, divided by the expected total time spent in state s_j.
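Given γ and ξ from an E-step, the three update rules are a few lines each. In this sketch, `gamma` and `xi` are small made-up posteriors (not produced by a real E-step) used only to exercise the function:

```python
import numpy as np

def m_step(gamma, xi, obs, n_symbols):
    """Re-estimate (pi, A, B) from expected counts."""
    obs = np.asarray(obs)
    pi_new = gamma[0]                                          # pi'_i = gamma_1(i)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # expected transitions / visits
    N = gamma.shape[1]
    B_new = np.zeros((N, n_symbols))
    for k in range(n_symbols):
        B_new[:, k] = gamma[obs == k].sum(axis=0)              # time in state j emitting k
    B_new /= gamma.sum(axis=0)[:, None]                        # total time in state j
    return pi_new, A_new, B_new

# Made-up posteriors for a 2-state model over T = 3 observations;
# each xi[t] row sums to the matching gamma[t] entry, as consistency requires.
gamma = np.array([[0.8, 0.2], [0.5, 0.5], [0.3, 0.7]])
xi = np.array([[[0.6, 0.2], [0.1, 0.1]],
               [[0.3, 0.2], [0.2, 0.3]]])
pi_new, A_new, B_new = m_step(gamma, xi, obs=[0, 1, 2], n_symbols=3)
```

Because the updates are ratios of expected counts, the re-estimated π, A, and B automatically satisfy the stochasticity constraints.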
Convergence and learned parameters
The E-step and M-step alternate until convergence. Typical stopping criteria:
- The change in log-likelihood between iterations falls below a threshold
- A maximum number of iterations is reached
Because EM only guarantees convergence to a local maximum, results can depend on initialization. Running the algorithm multiple times with different random starting parameters and selecting the best result is standard practice.
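Putting the pieces together, a compact single-sequence Baum-Welch loop with a log-likelihood stopping rule might look like the sketch below. It is unscaled, so it is only suitable for short sequences; real implementations rescale α and β or work in log space:

```python
import numpy as np

def baum_welch(obs, N, M, n_iter=100, tol=1e-6, seed=0):
    """Toy Baum-Welch: random init, E-step, M-step, stop when log-likelihood stalls."""
    obs = np.asarray(obs)
    T = len(obs)
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(N))            # random row-stochastic initialization
    A = rng.dirichlet(np.ones(N), size=N)
    B = rng.dirichlet(np.ones(M), size=N)
    prev_ll = -np.inf
    ll = prev_ll
    for _ in range(n_iter):
        # E-step: forward, backward, posteriors.
        alpha = np.zeros((T, N))
        beta = np.ones((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[-1].sum()
        ll = np.log(likelihood)
        gamma = alpha * beta / likelihood
        xi = np.zeros((T - 1, N, N))
        for t in range(T - 1):
            xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / likelihood
        # M-step: expected-count updates.
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.zeros((N, M))
        for k in range(M):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
        # Stopping rule: change in log-likelihood below threshold.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, A, B, ll

pi_hat, A_hat, B_hat, ll = baum_welch([0, 1, 2, 2, 1, 0, 0, 2], N=2, M=3)
```

Running this with several different `seed` values and keeping the result with the highest final log-likelihood is the multiple-restart practice described above.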
Extensions of HMMs
Higher-order Markov models
Standard HMMs use a first-order Markov assumption. Higher-order HMMs let the current state depend on the previous k states:
P(q_t | q_{t−1}, …, q_1) = P(q_t | q_{t−1}, …, q_{t−k})
This captures longer-range temporal dependencies but increases the effective state space to N^k, making parameter estimation and computation significantly more expensive.
Input-output HMMs
Input-output HMMs (IOHMMs) condition the transition and/or emission probabilities on an external input sequence. This allows the model to adapt to changing external conditions, such as speaker identity in speech recognition or macroeconomic indicators in financial models.
Factorial HMMs
Factorial HMMs (FHMMs) represent the hidden state as a combination of multiple independent Markov chains running in parallel. The observation at each time step depends on the joint configuration of all chains. This is useful for modeling systems with multiple interacting processes (e.g., separating overlapping audio sources). Exact inference in FHMMs is intractable for large models, so approximate methods like variational inference are typically used.
Applications of HMMs
Speech recognition
Hidden states represent phonemes or sub-word units, and observations are acoustic feature vectors (e.g., MFCCs). The Viterbi algorithm decodes the most likely phoneme sequence from audio, and the Baum-Welch algorithm trains the model on labeled speech data. HMMs were the dominant approach in speech recognition for decades before deep learning methods took over, though hybrid HMM-neural network systems remain in use.
Bioinformatics and sequence analysis
In gene finding, hidden states might represent coding regions, introns, and intergenic regions, while observations are nucleotide bases (A, C, G, T). Profile HMMs are a specialized variant used for protein family classification and multiple sequence alignment. HMMs can also identify conserved motifs across related biological sequences.
Financial modeling and regime switching
Hidden states represent unobserved market regimes (e.g., low-volatility vs. high-volatility periods), and observations are financial time series like returns or spreads. Regime-switching HMMs can detect structural breaks in market behavior and capture phenomena like volatility clustering that standard models miss.
Limitations and challenges of HMMs
Model selection and complexity
Choosing the number of hidden states N is not straightforward. Too few states and the model can't capture the true dynamics; too many and it overfits. Common approaches include:
- Information criteria (AIC, BIC) that penalize model complexity
- Cross-validation on held-out data
- Bayesian nonparametric extensions (e.g., the infinite HMM) that let the data determine N
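As one concrete instance of the information-criterion approach, a BIC score for a discrete HMM counts the free parameters in π, A, and B. This is a sketch; the parameter count assumes a fully connected discrete model:

```python
import numpy as np

def hmm_bic(log_likelihood, N, M, T):
    """BIC = k * ln(T) - 2 * log-likelihood, where k counts free parameters:
    (N-1) for pi, N*(N-1) for A, N*(M-1) for B (each row must sum to 1,
    so one entry per row is determined by the others)."""
    k = (N - 1) + N * (N - 1) + N * (M - 1)
    return k * np.log(T) - 2.0 * log_likelihood

# Lower BIC is better: fit candidate models with different N on the same
# data, then compare their scores.
score = hmm_bic(log_likelihood=-100.0, N=2, M=3, T=50)
```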
Identifiability issues
Different parameter settings can produce identical observation distributions, a problem known as label switching. Permuting the state labels and adjusting π, A, and B accordingly gives a different λ with the same likelihood. This doesn't affect prediction but complicates interpretation of learned states and comparison across model runs.
Computational complexity and scalability
The forward, Viterbi, and Baum-Welch algorithms all have time complexity O(N²T), where N is the number of states and T is the sequence length. For large state spaces or very long sequences, this becomes expensive. Practical strategies to manage this include:
- Beam search / pruning: Only track the top K highest-scoring states at each time step
- Sparse transition matrices: Exploit structure where many a_ij = 0
- Variational or sampling-based approximations: Trade exact inference for speed
- Parallel computation: The per-time-step operations over states can be parallelized