
String Matching Algorithms


Why This Matters

String matching is one of the most fundamental problems in computer science, appearing everywhere from text editors and search engines to DNA sequence analysis and plagiarism detection. When you're tested on these algorithms, you're really being evaluated on your understanding of preprocessing strategies, amortized analysis, space-time tradeoffs, and heuristic optimization. These concepts extend far beyond strings—they're core algorithmic design principles you'll encounter throughout your career.

Don't just memorize the time complexities for each algorithm. Instead, focus on why each approach achieves its efficiency: What information does preprocessing capture? How does the algorithm avoid redundant work? When does one technique outperform another? Understanding these mechanisms will help you tackle FRQ-style questions that ask you to analyze, compare, or apply these algorithms to novel scenarios.


Brute Force Approaches

The simplest strategy checks every possible alignment between pattern and text. While inefficient, understanding this baseline helps you appreciate what more sophisticated algorithms optimize.

Naive String Matching Algorithm

  • Compares the pattern against every position in the text—slides the pattern one character at a time, checking up to $m$ characters at each of the $n - m + 1$ positions
  • Time complexity of $O(m \cdot n)$ in the worst case, where $m$ is the pattern length and $n$ is the text length
  • No preprocessing required—makes it useful for one-time searches on small inputs or as a baseline for understanding optimization gains
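
To make the baseline concrete, here's a minimal Python sketch (the function name `naive_search` is our own, not a library call):

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return every index where pattern begins in text."""
    n, m = len(text), len(pattern)
    matches = []
    # Try each of the n - m + 1 alignments; each check costs up to m comparisons.
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            matches.append(i)
    return matches
```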

Preprocessing the Pattern

These algorithms invest upfront computation to build auxiliary data structures from the pattern, enabling smarter comparisons that skip redundant work. The key insight: information about the pattern's internal structure tells us where mismatches can't possibly lead to matches.

Knuth-Morris-Pratt (KMP) Algorithm

  • Builds a Longest Prefix-Suffix (LPS) array in $O(m)$ time—captures how the pattern overlaps with itself to avoid re-examining characters
  • Achieves $O(m + n)$ time complexity—never backtracks in the text, making it optimal for streaming applications
  • Exploits partial match information—when a mismatch occurs, the LPS array indicates exactly where to resume comparison
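
Here's a compact sketch of both pieces in Python (the names `build_lps` and `kmp_search` are ours; it assumes a non-empty pattern):

```python
def build_lps(pattern: str) -> list[int]:
    """lps[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    lps = [0] * len(pattern)
    length = 0  # length of the current matched prefix
    for i in range(1, len(pattern)):
        while length > 0 and pattern[i] != pattern[length]:
            length = lps[length - 1]  # fall back to a shorter border
        if pattern[i] == pattern[length]:
            length += 1
        lps[i] = length
    return lps

def kmp_search(text: str, pattern: str) -> list[int]:
    """Return all match positions in O(n + m); never backtracks in text."""
    lps = build_lps(pattern)
    matches, j = [], 0
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = lps[j - 1]  # the LPS array says exactly where to resume
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = lps[j - 1]  # keep going to catch overlapping matches
    return matches
```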

Z Algorithm

  • Computes the Z-array, where $Z[i]$ stores the length of the longest substring starting at position $i$ that matches a prefix of the string
  • Linear $O(n + m)$ time complexity—concatenates pattern and text with a sentinel character, then processes in a single pass
  • Particularly elegant for a wide range of string problems—the Z-array directly reveals every occurrence of the pattern without additional bookkeeping
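
A sketch of the Z-array and the concatenation trick in Python, assuming a sentinel character that appears in neither string:

```python
def z_array(s: str) -> list[int]:
    """z[i] = length of the longest substring starting at i that
    matches a prefix of s; computed in a single O(len(s)) pass."""
    n = len(s)
    if n == 0:
        return []
    z = [0] * n
    z[0] = n
    l = r = 0  # [l, r) is the rightmost prefix-match window seen so far
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])  # reuse earlier comparisons
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1  # extend the match explicitly
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

def z_search(text: str, pattern: str, sentinel: str = "\x00") -> list[int]:
    """Find pattern in text via the Z-array of pattern + sentinel + text."""
    s = pattern + sentinel + text
    m = len(pattern)
    z = z_array(s)
    # z[i] == m exactly where a full copy of the pattern begins in text.
    return [i - m - 1 for i in range(m + 1, len(s)) if z[i] >= m]
```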

Compare: KMP vs. Z Algorithm—both achieve $O(n + m)$ time through pattern preprocessing, but KMP uses the LPS array for suffix-to-prefix matching while Z computes prefix matches at each position. If an exam asks about streaming text (character by character), KMP is your answer; for batch processing, Z is often simpler to implement.

Finite Automata for String Matching

  • Constructs a deterministic finite automaton (DFA) with states representing how much of the pattern has been matched
  • Text processing runs in $O(n)$ time after $O(m \cdot |\Sigma|)$ preprocessing, where $|\Sigma|$ is the alphabet size
  • Guarantees single-pass, deterministic behavior—each character triggers exactly one state transition, making it ideal for hardware implementations
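
One way to build the matching DFA, sketched in Python over an explicit alphabet (the construction uses a KMP-style fallback state; the names here are ours and it assumes a non-empty pattern):

```python
def build_dfa(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """dfa[state][ch] = next state, where state = # of pattern chars matched.
    Construction is O(m * |alphabet|)."""
    m = len(pattern)
    dfa = [dict() for _ in range(m + 1)]
    dfa[0] = {c: 0 for c in alphabet}
    dfa[0][pattern[0]] = 1
    x = 0  # fallback: state reached by the pattern with its first char dropped
    for state in range(1, m + 1):
        for c in alphabet:
            dfa[state][c] = dfa[x][c]  # on mismatch, behave like the fallback state
        if state < m:
            dfa[state][pattern[state]] = state + 1  # on match, advance
            x = dfa[x][pattern[state]]
    return dfa

def dfa_search(text: str, pattern: str, alphabet: str) -> list[int]:
    """Exactly one state transition per text character: O(n) scanning."""
    dfa = build_dfa(pattern, alphabet)
    matches, state, m = [], 0, len(pattern)
    for i, ch in enumerate(text):
        state = dfa[state].get(ch, 0)  # characters outside the alphabet reset to 0
        if state == m:
            matches.append(i - m + 1)
    return matches
```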

Heuristic Optimization

Rather than just avoiding redundant comparisons, these algorithms use clever heuristics to skip large portions of the text entirely. The insight: sometimes we can prove that many positions are impossible matches without examining them.

Boyer-Moore Algorithm

  • Uses two heuristics: bad character and good suffix rules—when a mismatch occurs, calculates the maximum safe shift from both rules and takes the larger
  • Best-case time complexity of $O(n/m)$—achieves sublinear performance by potentially skipping $m$ characters per comparison
  • Most effective with large alphabets and long patterns—the bad character rule becomes more powerful when mismatched characters are unlikely to appear elsewhere in the pattern
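
Below is a simplified sketch using only the bad-character rule; the good-suffix rule is omitted for brevity, so this is not full Boyer-Moore, just an illustration of how the skipping works:

```python
def bm_bad_char(text: str, pattern: str) -> list[int]:
    """Boyer-Moore search with the bad-character heuristic only."""
    n, m = len(text), len(pattern)
    last = {c: i for i, c in enumerate(pattern)}  # rightmost index of each char
    matches, i = [], 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1  # compare the window right to left
        if j < 0:
            matches.append(i)
            i += 1  # conservative shift; the good-suffix rule would shift further
        else:
            # Align text[i + j] with its rightmost occurrence in the pattern;
            # if it never occurs, slide the whole window past it.
            i += max(1, j - last.get(text[i + j], -1))
    return matches
```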

Compare: KMP vs. Boyer-Moore—KMP scans left-to-right and guarantees $O(n + m)$ worst case, while Boyer-Moore scans right-to-left and can be faster in practice but has an $O(m \cdot n)$ worst case. For exam questions about guaranteed performance, choose KMP; for practical performance on English text, Boyer-Moore typically wins.


Hash-Based Methods

Hashing transforms the comparison problem: instead of checking characters one by one, we compare fixed-size hash values. The tradeoff: we gain speed but introduce the possibility of false positives from hash collisions.

Rabin-Karp Algorithm

  • Uses a rolling hash function to compute each substring hash in $O(1)$ time per position after an initial $O(m)$ computation
  • Average time complexity of $O(n + m)$, but degrades to $O(m \cdot n)$ when many hash collisions occur, since each candidate requires character-by-character verification
  • Excels at multiple pattern matching—can search for $k$ patterns simultaneously by storing all pattern hashes in a hash table
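
A rolling-hash sketch in Python; the base and modulus below are illustrative choices, not canonical constants:

```python
def rabin_karp(text: str, pattern: str,
               base: int = 256, mod: int = 10**9 + 7) -> list[int]:
    """Average O(n + m); every hash hit is verified to rule out collisions."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the character leaving the window
    p_hash = t_hash = 0
    for i in range(m):  # initial hashes in O(m)
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # A hash match is only a candidate: confirm character by character.
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:  # roll the window forward in O(1)
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return matches
```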

Compare: Rabin-Karp vs. KMP—both average $O(n + m)$, but Rabin-Karp is probabilistic (its speed depends on hash function quality) while KMP is deterministic. Rabin-Karp shines when searching for multiple patterns; KMP is preferred for single-pattern searches requiring guaranteed performance.


Advanced Data Structures

When you need to perform many searches on the same text, preprocessing the text (not just the pattern) becomes worthwhile. These structures trade significant preprocessing time and space for extremely fast queries.

Suffix Trees and Arrays

  • Suffix trees enable $O(m)$ pattern matching after $O(n)$ construction—store all suffixes of the text in a compressed trie structure
  • Suffix arrays offer a space-efficient alternative—$O(n \log n)$ construction time and just $n$ integers of storage, versus the many node and edge pointers a suffix tree must maintain
  • Support powerful string operations beyond matching—longest common substring, longest repeated substring, and string statistics all become efficient queries
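
Here's a readable (deliberately not asymptotically optimal) Python sketch: construction is a simple sort of suffixes, and each query is two binary searches over the sorted order:

```python
def build_suffix_array(text: str) -> list[int]:
    """Starting indices of all suffixes in lexicographic order.
    This naive sort is easy to read; faster O(n log n) constructions exist."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_search(text: str, sa: list[int], pattern: str) -> list[int]:
    """All occurrences of pattern via binary search: O(m log n) per query."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-char prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-char prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])  # every suffix in between starts with pattern
```

Build the array once per text; after that, each of the $k$ queries pays only $O(m \log n)$, which is exactly the repeated-query payoff described above.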

Compare: Suffix structures vs. per-query algorithms—if you're searching once, KMP or Boyer-Moore with $O(m + n)$ time beats building a suffix tree. But for $k$ searches on the same text, suffix structures give $O(n + km)$ total time versus $O(k(n + m))$ for repeated KMP calls.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Brute force baseline | Naive Algorithm |
| Pattern preprocessing (linear guarantee) | KMP, Z Algorithm |
| State machine approach | Finite Automata |
| Heuristic skipping (sublinear possible) | Boyer-Moore |
| Hash-based comparison | Rabin-Karp |
| Multiple pattern search | Rabin-Karp, Aho-Corasick (extension) |
| Text preprocessing for repeated queries | Suffix Trees, Suffix Arrays |
| Worst-case $O(n + m)$ guarantee | KMP, Z Algorithm, Finite Automata |

Self-Check Questions

  1. Which two algorithms both achieve $O(n + m)$ time through pattern preprocessing but use fundamentally different auxiliary arrays? What does each array capture about the pattern?

  2. An exam question states: "The algorithm achieves sublinear time in the best case." Which algorithm is being described, and what conditions enable this performance?

  3. Compare and contrast Rabin-Karp and KMP: Under what circumstances would you choose each? What's the key tradeoff between them?

  4. You need to search for the same pattern millions of times in a fixed text corpus. Which approach would you use, and why is it better than running KMP repeatedly?

  5. A system processes streaming text character-by-character and must identify pattern matches in real-time without buffering. Which algorithms are suitable, and which are not? Justify your answer based on how each algorithm processes input.