Intro to Algorithms

String Matching Algorithms


Why This Matters

String matching is one of the most fundamental problems in computer science, appearing everywhere from text editors and search engines to DNA sequence analysis and plagiarism detection. When you're tested on these algorithms, you're really being evaluated on your understanding of preprocessing strategies, amortized analysis, space-time tradeoffs, and heuristic optimization. These concepts extend far beyond strings and show up throughout algorithmic design.

Focus on why each approach achieves its efficiency rather than just memorizing time complexities. What information does preprocessing capture? How does the algorithm avoid redundant work? When does one technique outperform another? Understanding these mechanisms will help you answer questions that ask you to analyze, compare, or apply these algorithms to new scenarios.


Brute Force Approaches

The simplest strategy checks every possible alignment between pattern and text. While inefficient, understanding this baseline helps you appreciate what more sophisticated algorithms optimize.

Naive String Matching Algorithm

  • Compares pattern against every position in text. It slides the pattern one character at a time, checking up to m characters at each of n − m + 1 positions.
  • Time complexity of O(m · n) in the worst case, where m is the pattern length and n is the text length.
  • No preprocessing required. This makes it useful for one-time searches on small inputs or as a baseline for understanding what other algorithms improve upon.

The worst case hits when the text and pattern share many repeated characters (e.g., searching for "aab" in "aaaaaaaaab"), forcing near-complete comparisons at every position before discovering the mismatch.
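The sliding-window comparison described above can be sketched in a few lines of Python (a minimal illustration; the function name `naive_search` is our own):

```python
def naive_search(text, pattern):
    """Return all starting indices where pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):          # n - m + 1 possible alignments
        if text[i:i + m] == pattern:    # up to m character comparisons
            matches.append(i)
    return matches
```

The slice comparison hides the inner character loop, but the cost is the same: each alignment may examine up to m characters before failing.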


Preprocessing the Pattern

These algorithms invest upfront computation to build auxiliary data structures from the pattern, enabling smarter comparisons that skip redundant work. The core insight: information about the pattern's internal structure tells us where mismatches can't possibly lead to matches.

Knuth-Morris-Pratt (KMP) Algorithm

KMP avoids re-examining text characters by using information about the pattern's own repetitive structure.

  • Builds a Longest Prefix-Suffix (LPS) array in O(m) time. For each position i in the pattern, LPS[i] stores the length of the longest proper prefix of the substring P[0..i] that is also a suffix of that substring. This captures how the pattern overlaps with itself.
  • Achieves O(m + n) time complexity. The text pointer never moves backward, making KMP optimal for streaming applications where you process characters one at a time.
  • Exploits partial match information. When a mismatch occurs at position j in the pattern, the LPS array tells you to resume comparison at position LPS[j − 1] rather than starting over. You skip ahead because you already know those characters match.

Z Algorithm

  • Computes the Z-array, where Z[i] stores the length of the longest substring starting at position i that matches a prefix of the string.
  • Linear O(n + m) time complexity. You concatenate the pattern and text with a special sentinel character (e.g., $) that appears in neither, then compute the Z-array in a single pass. Any position where Z[i] = m indicates a match.
  • Particularly clean for reporting every occurrence. The Z-array directly encodes all match positions without additional bookkeeping.

Compare: KMP vs. Z Algorithm: both achieve O(n + m) time through pattern preprocessing, but KMP uses the LPS array for suffix-to-prefix matching while Z computes prefix matches at each position. If an exam asks about streaming text (character by character), KMP is your answer since it never needs to look back. For batch processing where you have the full text available, Z is often simpler to implement and reason about.

Finite Automata for String Matching

  • Constructs a deterministic finite automaton (DFA) with m + 1 states, where each state represents how much of the pattern has been matched so far.
  • Text processing runs in O(n) time after O(m · |Σ|) preprocessing, where |Σ| is the alphabet size. The preprocessing cost comes from computing a transition for every state-character pair.
  • Guarantees single-pass, deterministic behavior. Each character triggers exactly one state transition, making this approach ideal for hardware implementations or situations where you need predictable per-character cost.

The tradeoff compared to KMP is clear: the automaton approach spends more time in preprocessing (especially with large alphabets) but makes each text character lookup a simple table access rather than a conditional loop.
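A straightforward construction makes the state semantics concrete. This sketch (names `build_dfa` and `dfa_search` are our own) computes each transition by brute force, so its preprocessing is slower than the O(m · |Σ|) achievable with a smarter construction, but the search loop shows the promised one-lookup-per-character behavior:

```python
def build_dfa(pattern, alphabet):
    """dfa[q][c] = next state; state q means 'q characters of the
    pattern matched so far'. States run 0..m, with m accepting."""
    m = len(pattern)
    dfa = [dict.fromkeys(alphabet, 0) for _ in range(m + 1)]
    for q in range(m + 1):
        for c in alphabet:
            # Longest suffix of (matched prefix + c) that is also a
            # prefix of the pattern -- computed naively here.
            k = min(m, q + 1)
            while k > 0 and (pattern[:q] + c)[-k:] != pattern[:k]:
                k -= 1
            dfa[q][c] = k
    return dfa

def dfa_search(text, pattern, alphabet):
    dfa = build_dfa(pattern, alphabet)
    m, q, matches = len(pattern), 0, []
    for i, c in enumerate(text):
        q = dfa[q][c]                   # exactly one table lookup per character
        if q == m:                      # accepting state: full match
            matches.append(i - m + 1)
    return matches
```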


Heuristic Optimization

Rather than just avoiding redundant comparisons, these algorithms use heuristics to skip large portions of the text entirely. The insight: sometimes you can prove that many positions are impossible matches without examining them.

Boyer-Moore Algorithm

Boyer-Moore compares the pattern against the text right-to-left (starting from the last character of the pattern), which is what enables its skipping power.

  • Uses two heuristics: the bad character rule and the good suffix rule. When a mismatch occurs, each rule independently calculates a safe shift distance, and the algorithm takes the larger of the two.
    • Bad character rule: If the mismatched text character doesn't appear in the pattern, you can shift the entire pattern past that character. If it does appear, you shift to align the rightmost occurrence.
    • Good suffix rule: If part of the pattern matched before the mismatch, you shift to align with the next occurrence of that matched suffix within the pattern (or a prefix that matches a suffix of the matched portion).
  • Best-case time complexity of O(n/m). This is sublinear, meaning you can find a match while examining fewer than n characters. This happens when mismatches occur on the first comparison (the rightmost character) and the bad character rule lets you skip m positions at a time.
  • Most effective with large alphabets and long patterns. With a large alphabet (like ASCII text), mismatched characters are unlikely to appear elsewhere in the pattern, so the bad character rule triggers large shifts frequently.

Compare: KMP vs. Boyer-Moore: KMP scans left-to-right and guarantees O(n + m) worst case, while Boyer-Moore scans right-to-left and can be faster in practice but has O(m · n) worst case. For exam questions about guaranteed performance, choose KMP. For practical performance on English text, Boyer-Moore typically wins.


Hash-Based Methods

Hashing transforms the comparison problem: instead of checking characters one by one, you compare fixed-size hash values. The tradeoff is that you gain speed but introduce the possibility of false positives from hash collisions.

Rabin-Karp Algorithm

  • Uses a rolling hash function to compute substring hashes in O(1) time per position after the initial O(m) computation. The "rolling" part is key: when you slide the window one position right, you subtract the contribution of the outgoing character and add the incoming one, rather than recomputing from scratch.
  • Average time complexity of O(n + m), but degrades to O(m · n) when many hash collisions occur, since each collision requires character-by-character verification to confirm or reject the match.
  • Excels at multiple pattern matching. You can search for k patterns simultaneously by storing all pattern hashes in a hash table and checking each text window's hash against the table.

A common choice for the hash function is a polynomial rolling hash: treat each character as a digit in base d and take the result modulo a prime q. Choosing a large prime for q reduces collision probability.

Compare: Rabin-Karp vs. KMP: both average O(n + m), but Rabin-Karp is probabilistic (performance depends on hash function quality) while KMP is deterministic. Rabin-Karp shines when searching for multiple patterns at once; KMP is preferred for single-pattern searches requiring guaranteed performance.


Advanced Data Structures

When you need to perform many searches on the same text, preprocessing the text (not just the pattern) becomes worthwhile. These structures trade significant preprocessing time and space for extremely fast queries.

Suffix Trees and Arrays

  • Suffix trees enable O(m) pattern matching after O(n) construction. They store all suffixes of the text in a compressed trie, so searching for a pattern of length m just means traversing m characters down the tree.
  • Suffix arrays offer a space-efficient alternative. A suffix array is a sorted array of all suffix starting positions. Construction takes O(n log n) time (or O(n) with advanced algorithms), and the array requires only O(n) space compared to the many pointers needed by suffix trees.
  • Support powerful string operations beyond matching. Longest common substring, longest repeated substring, and various string statistics all become efficient queries once the structure is built.

Compare: Suffix structures vs. per-query algorithms: if you're searching once, KMP or Boyer-Moore with O(m + n) time beats building a suffix tree. But for k searches on the same text, suffix structures give O(n + km) total time versus O(k(n + m)) for repeated KMP calls. The break-even point depends on the relative sizes of n, m, and k.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Brute force baseline | Naive Algorithm |
| Pattern preprocessing (linear guarantee) | KMP, Z Algorithm |
| State machine approach | Finite Automata |
| Heuristic skipping (sublinear possible) | Boyer-Moore |
| Hash-based comparison | Rabin-Karp |
| Multiple pattern search | Rabin-Karp, Aho-Corasick (extension) |
| Text preprocessing for repeated queries | Suffix Trees, Suffix Arrays |
| Worst-case O(n + m) guarantee | KMP, Z Algorithm, Finite Automata |

Self-Check Questions

  1. Which two algorithms both achieve O(n + m) time through pattern preprocessing but use fundamentally different auxiliary arrays? What does each array capture about the pattern?

  2. An exam question states: "The algorithm achieves sublinear time in the best case." Which algorithm is being described, and what conditions enable this performance?

  3. Compare and contrast Rabin-Karp and KMP: Under what circumstances would you choose each? What's the key tradeoff between them?

  4. You need to search for the same pattern millions of times in a fixed text corpus. Which approach would you use, and why is it better than running KMP repeatedly?

  5. A system processes streaming text character-by-character and must identify pattern matches in real-time without buffering. Which algorithms are suitable, and which are not? Justify your answer based on how each algorithm processes input.