Intro to Algorithms

String Matching Algorithms


Why This Matters

String matching is one of the most fundamental problems in computer science, appearing everywhere from text editors and search engines to DNA sequence analysis and plagiarism detection. When you're tested on these algorithms, you're really being evaluated on your understanding of preprocessing strategies, amortized analysis, space-time tradeoffs, and heuristic optimization. These concepts extend far beyond strings and show up throughout algorithmic design.

Focus on why each approach achieves its efficiency rather than just memorizing time complexities. What information does preprocessing capture? How does the algorithm avoid redundant work? When does one technique outperform another? Understanding these mechanisms will help you answer questions that ask you to analyze, compare, or apply these algorithms to new scenarios.


Brute Force Approaches

The simplest strategy checks every possible alignment between pattern and text. While inefficient, understanding this baseline helps you appreciate what more sophisticated algorithms optimize.

Naive String Matching Algorithm

  • Compares pattern against every position in text. It slides the pattern one character at a time, checking up to m characters at each of n − m + 1 positions.
  • Time complexity of O(m · n) in the worst case, where m is the pattern length and n is the text length.
  • No preprocessing required. This makes it useful for one-time searches on small inputs or as a baseline for understanding what other algorithms improve upon.

The worst case hits when the text and pattern share many repeated characters (e.g., searching for "aab" in "aaaaaaaaab"), forcing near-complete comparisons at every position before discovering the mismatch.
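The sliding-window comparison described above can be sketched in a few lines of Python (a minimal illustration; the function name `naive_search` is our own):

```python
def naive_search(text, pattern):
    """Return all starting indices where pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):          # n - m + 1 possible alignments
        if text[i:i + m] == pattern:    # up to m character comparisons
            matches.append(i)
    return matches
```

The slice comparison hides the inner character loop, but the cost is the same: each alignment may examine up to m characters before failing.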


Preprocessing the Pattern

These algorithms invest upfront computation to build auxiliary data structures from the pattern, enabling smarter comparisons that skip redundant work. The core insight: information about the pattern's internal structure tells us where mismatches can't possibly lead to matches.

Knuth-Morris-Pratt (KMP) Algorithm

KMP avoids re-examining text characters by using information about the pattern's own repetitive structure.

  • Builds a Longest Prefix-Suffix (LPS) array in O(m) time. For each position i in the pattern, LPS[i] stores the length of the longest proper prefix of the substring P[0..i] that is also a suffix of that substring. This captures how the pattern overlaps with itself.
  • Achieves O(m + n) time complexity. The text pointer never moves backward, making KMP optimal for streaming applications where you process characters one at a time.
  • Exploits partial match information. When a mismatch occurs at position j in the pattern, the LPS array tells you to resume comparison at position LPS[j − 1] rather than starting over. You skip ahead because you already know those characters match.

Z Algorithm

  • Computes the Z-array, where Z[i] stores the length of the longest substring starting at position i that matches a prefix of the string.
  • Linear O(n + m) time complexity. You concatenate the pattern and text with a special sentinel character (e.g., $) that appears in neither, then compute the Z-array in a single pass. Any position where Z[i] = m indicates a match.
  • Particularly clean for reporting every occurrence. The Z-array directly encodes all match positions without additional bookkeeping.

Compare: KMP vs. Z Algorithm: both achieve O(n + m) time through pattern preprocessing, but KMP uses the LPS array for suffix-to-prefix matching while Z computes prefix matches at each position. If an exam asks about streaming text (character by character), KMP is your answer since it never needs to look back. For batch processing where you have the full text available, Z is often simpler to implement and reason about.

Finite Automata for String Matching

  • Constructs a deterministic finite automaton (DFA) with m + 1 states, where each state represents how much of the pattern has been matched so far.
  • Text processing runs in O(n) time after O(m · |Σ|) preprocessing, where |Σ| is the alphabet size. The preprocessing cost comes from computing a transition for every state-character pair.
  • Guarantees single-pass, deterministic behavior. Each character triggers exactly one state transition, making this approach ideal for hardware implementations or situations where you need predictable per-character cost.

The tradeoff compared to KMP is clear: the automaton approach spends more time in preprocessing (especially with large alphabets) but makes each text character lookup a simple table access rather than a conditional loop.
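A straightforward construction makes the state semantics concrete. This sketch (names `build_dfa` and `dfa_search` are our own) computes each transition by brute force, so its preprocessing is slower than the O(m · |Σ|) achievable with a smarter construction, but the search loop shows the promised one-lookup-per-character behavior:

```python
def build_dfa(pattern, alphabet):
    """dfa[q][c] = next state; state q means 'q characters of the
    pattern matched so far'. States run 0..m, with m accepting."""
    m = len(pattern)
    dfa = [dict.fromkeys(alphabet, 0) for _ in range(m + 1)]
    for q in range(m + 1):
        for c in alphabet:
            # Longest suffix of (matched prefix + c) that is also a
            # prefix of the pattern -- computed naively here.
            k = min(m, q + 1)
            while k > 0 and (pattern[:q] + c)[-k:] != pattern[:k]:
                k -= 1
            dfa[q][c] = k
    return dfa

def dfa_search(text, pattern, alphabet):
    dfa = build_dfa(pattern, alphabet)
    m, q, matches = len(pattern), 0, []
    for i, c in enumerate(text):
        q = dfa[q][c]                   # exactly one table lookup per character
        if q == m:                      # accepting state: full match
            matches.append(i - m + 1)
    return matches
```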


Heuristic Optimization

Rather than just avoiding redundant comparisons, these algorithms use heuristics to skip large portions of the text entirely. The insight: sometimes you can prove that many positions are impossible matches without examining them.

Boyer-Moore Algorithm

Boyer-Moore compares the pattern against the text right-to-left (starting from the last character of the pattern), which is what enables its skipping power.

  • Uses two heuristics: the bad character rule and the good suffix rule. When a mismatch occurs, each rule independently calculates a safe shift distance, and the algorithm takes the larger of the two.
    • Bad character rule: If the mismatched text character doesn't appear in the pattern, you can shift the entire pattern past that character. If it does appear, you shift to align the rightmost occurrence.
    • Good suffix rule: If part of the pattern matched before the mismatch, you shift to align with the next occurrence of that matched suffix within the pattern (or a prefix that matches a suffix of the matched portion).
  • Best-case time complexity of O(n/m). This is sublinear, meaning you can find a match while examining fewer than n characters. This happens when mismatches occur on the first comparison (the rightmost character) and the bad character rule lets you skip m positions at a time.
  • Most effective with large alphabets and long patterns. With a large alphabet (like ASCII text), mismatched characters are unlikely to appear elsewhere in the pattern, so the bad character rule triggers large shifts frequently.

Compare: KMP vs. Boyer-Moore: KMP scans left-to-right and guarantees O(n + m) worst case, while Boyer-Moore scans right-to-left and can be faster in practice but has O(m · n) worst case. For exam questions about guaranteed performance, choose KMP. For practical performance on English text, Boyer-Moore typically wins.


Hash-Based Methods

Hashing transforms the comparison problem: instead of checking characters one by one, you compare fixed-size hash values. The tradeoff is that you gain speed but introduce the possibility of false positives from hash collisions.

Rabin-Karp Algorithm

  • Uses a rolling hash function to compute substring hashes in O(1) time per position after the initial O(m) computation. The "rolling" part is key: when you slide the window one position right, you subtract the contribution of the outgoing character and add the incoming one, rather than recomputing from scratch.
  • Average time complexity of O(n + m), but degrades to O(m · n) when many hash collisions occur, since each collision requires character-by-character verification to confirm or reject the match.
  • Excels at multiple pattern matching. You can search for k patterns simultaneously by storing all pattern hashes in a hash table and checking each text window's hash against the table.

A common choice for the hash function is a polynomial rolling hash: treat each character as a digit in base d and take the result modulo a prime q. Choosing a large prime for q reduces collision probability.

Compare: Rabin-Karp vs. KMP: both average O(n + m), but Rabin-Karp is probabilistic (performance depends on hash function quality) while KMP is deterministic. Rabin-Karp shines when searching for multiple patterns at once; KMP is preferred for single-pattern searches requiring guaranteed performance.


Advanced Data Structures

When you need to perform many searches on the same text, preprocessing the text (not just the pattern) becomes worthwhile. These structures trade significant preprocessing time and space for extremely fast queries.

Suffix Trees and Arrays

  • Suffix trees enable O(m) pattern matching after O(n) construction. They store all suffixes of the text in a compressed trie, so searching for a pattern of length m just means traversing m characters down the tree.
  • Suffix arrays offer a space-efficient alternative. A suffix array is a sorted array of all suffix starting positions. Construction takes O(n log n) time (or O(n) with advanced algorithms), and the array requires only O(n) space compared to the many pointers needed by suffix trees.
  • Support powerful string operations beyond matching. Longest common substring, longest repeated substring, and various string statistics all become efficient queries once the structure is built.

Compare: Suffix structures vs. per-query algorithms: if you're searching once, KMP or Boyer-Moore with O(m + n) time beats building a suffix tree. But for k searches on the same text, suffix structures give O(n + km) total time versus O(k(n + m)) for repeated KMP calls. The break-even point depends on the relative sizes of n, m, and k.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Brute force baseline | Naive Algorithm |
| Pattern preprocessing (linear guarantee) | KMP, Z Algorithm |
| State machine approach | Finite Automata |
| Heuristic skipping (sublinear possible) | Boyer-Moore |
| Hash-based comparison | Rabin-Karp |
| Multiple pattern search | Rabin-Karp, Aho-Corasick (extension) |
| Text preprocessing for repeated queries | Suffix Trees, Suffix Arrays |
| Worst-case O(n + m) guarantee | KMP, Z Algorithm, Finite Automata |

Self-Check Questions

  1. Which two algorithms both achieve O(n + m) time through pattern preprocessing but use fundamentally different auxiliary arrays? What does each array capture about the pattern?

  2. An exam question states: "The algorithm achieves sublinear time in the best case." Which algorithm is being described, and what conditions enable this performance?

  3. Compare and contrast Rabin-Karp and KMP: Under what circumstances would you choose each? What's the key tradeoff between them?

  4. You need to search for the same pattern millions of times in a fixed text corpus. Which approach would you use, and why is it better than running KMP repeatedly?

  5. A system processes streaming text character-by-character and must identify pattern matches in real-time without buffering. Which algorithms are suitable, and which are not? Justify your answer based on how each algorithm processes input.