14.3 String Searching Algorithms

Written by the Fiveable Content Team • Last updated August 2025

String Searching Algorithms

String searching algorithms find occurrences of a pattern (a short string) within a larger text. They show up everywhere: text editors doing find-and-replace, search engines scanning documents, bioinformatics tools matching DNA sequences. This section covers three approaches: brute-force, KMP, and Boyer-Moore, each with different tradeoffs in preprocessing cost and search speed.

Brute-force String Searching

The brute-force approach is the most straightforward way to search for a pattern in text. You slide the pattern across the text one position at a time and check for a match at each position.

How it works:

  1. Align the pattern with position 0 of the text.
  2. Compare characters left to right: pattern[0] with text[0], pattern[1] with text[1], and so on.
  3. If all characters match, you've found the pattern. Return the current position.
  4. If a mismatch occurs, shift the pattern one position to the right and start comparing again from pattern[0].
  5. Repeat until the pattern is found or you've exhausted all valid positions in the text.

For example, searching for "hello" in "hello world" starts by comparing at index 0. All five characters match, so it returns 0.
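The steps above can be sketched in a few lines of Python (a minimal illustration; the function name is my own):

```python
def brute_force_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):          # try each alignment of the pattern
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                      # compare left to right
        if j == m:                      # all m characters matched
            return i
    return -1                           # every alignment failed
```

Calling `brute_force_search("hello world", "hello")` returns 0, matching the example above.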

Complexity:

  • Time: O(mn) worst case, where m is the pattern length and n is the text length. Searching for a pattern of length 5 in text of length 100 could require up to 500 comparisons. The worst case happens with inputs like pattern "AAAB" in text "AAAAAAAAAB", where almost-complete matches force many comparisons before each mismatch.
  • Best case: O(n), when the first character of the pattern mismatches at most positions, so you skip ahead quickly.
  • Average case: O(n) for typical text with a reasonably large alphabet (like English), since mismatches tend to happen early.
  • Space: O(1), just a few index variables.

The brute-force method works fine for short patterns and small texts, but it wastes effort. After a partial match fails, it "forgets" everything it just learned and starts from scratch. KMP and Boyer-Moore fix this problem in different ways.

KMP Algorithm for Substring Searching

The Knuth-Morris-Pratt (KMP) algorithm avoids redundant comparisons by preprocessing the pattern. When a mismatch occurs partway through a match, KMP uses information about the pattern's internal structure to skip ahead intelligently instead of restarting from the next position.

The failure function (prefix function):

The key idea is the failure function (sometimes called the prefix function or partial match table). For each position i in the pattern, it stores the length of the longest proper prefix of the substring pattern[0..i] that is also a suffix of that substring.

For the pattern "ABABC":

Position | Substring | Longest prefix = suffix | Failure value
0        | A         | (none)                  | 0
1        | AB        | (none)                  | 0
2        | ABA       | A                       | 1
3        | ABAB      | AB                      | 2
4        | ABABC     | (none)                  | 0

So the failure function is [0, 0, 1, 2, 0].
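A short Python sketch of computing this table (a standard formulation; the function name is my own):

```python
def failure_function(pattern: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of pattern[0..i]
    that is also a suffix of that substring."""
    fail = [0] * len(pattern)
    k = 0                               # length of the prefix matched so far
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]             # fall back to the next-shorter candidate
        if pattern[i] == pattern[k]:
            k += 1                      # extend the current prefix match
        fail[i] = k
    return fail
```

For the pattern "ABABC" this produces [0, 0, 1, 2, 0], matching the table above.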

Algorithm steps:

  1. Preprocess: Compute the failure function for the pattern.

  2. Search: Compare the pattern against the text left to right, maintaining a pointer j into the pattern.

  3. On a match: Advance both the text pointer and j. If j reaches the end of the pattern, you've found a match.

  4. On a mismatch: Don't reset j to 0. Instead, set j to the failure function value at position j - 1. This jumps j back to the length of the longest prefix that could still be part of a valid match, skipping the characters you already know match.

  5. Repeat until the text is exhausted.

Why this helps: Suppose you're matching "ABABC" and you've matched "ABAB" before a mismatch. The failure value at position 3 is 2, meaning the last two characters of what you matched ("AB") are also the first two characters of the pattern. So you can continue comparing from pattern[2] instead of starting over.
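The full search can be sketched as follows, with the failure-function preprocessing inlined so the example is self-contained (function name is my own):

```python
def kmp_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    # Preprocess: failure function (longest proper prefix that is also a suffix).
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Search: the text pointer i only moves forward, never backward.
    j = 0                               # pointer into the pattern
    for i, c in enumerate(text):
        while j > 0 and c != pattern[j]:
            j = fail[j - 1]             # fall back instead of restarting at 0
        if c == pattern[j]:
            j += 1
        if j == m:
            return i - m + 1            # match ends at text index i
    return -1
```

Searching for "ABABC" in "ABABDABABC" returns 5: after the mismatch at "D", the failure function falls back without re-reading the text characters already consumed.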

Complexity:

  • Time: O(m + n). Building the failure function takes O(m), and the search takes O(n). Each character in the text is examined at most twice (once on match, once on mismatch fallback), so the total is linear.
  • Space: O(m) for the failure function array.

Boyer-Moore Algorithm

The Boyer-Moore algorithm takes a different approach: it compares characters right to left within the pattern and uses two preprocessing heuristics to skip large portions of the text. In practice, it's often the fastest string search algorithm for typical inputs.

Two heuristics:

  • Bad character heuristic: When a mismatch occurs, look at the mismatched character in the text. If that character appears elsewhere in the pattern, shift the pattern so that character aligns with its rightmost occurrence in the pattern. If it doesn't appear in the pattern at all, shift the entire pattern past that position. For example, if you're matching against text character "x" and "x" doesn't appear anywhere in the pattern, you can safely skip ahead by the full pattern length.
  • Good suffix heuristic: When a mismatch occurs after matching some suffix of the pattern, check whether that matched suffix appears elsewhere in the pattern. If it does, shift the pattern to align with that other occurrence. If not, shift to align with the longest prefix of the pattern that matches a suffix of the matched portion.

Algorithm steps:

  1. Preprocess the pattern to build the bad character table and the good suffix table.
  2. Align the pattern with the beginning of the text.
  3. Compare characters from right to left (starting at the last character of the pattern).
  4. If all characters match, report the match.
  5. On a mismatch, compute the shift from both heuristics and take the maximum of the two shifts.
  6. Shift the pattern by that amount and repeat from step 3.

Why right-to-left comparison matters: By starting at the end of the pattern, a mismatch on the very first comparison can let you skip ahead by up to m positions, since you immediately learn something about a character that's far into the text.
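A minimal sketch of the idea in Python, using only the bad character heuristic (this is the simplified Boyer-Moore-Horspool variant; full Boyer-Moore adds the good suffix table, omitted here for brevity, and the function name is my own):

```python
def bm_horspool_search(text: str, pattern: str) -> int:
    """Simplified Boyer-Moore (Horspool): bad-character shifts only."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # Shift table: for each character (except the last position), distance
    # from its rightmost occurrence to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    i = 0                               # left end of the current alignment
    while i <= n - m:
        j = m - 1                       # compare right to left
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1
        if j < 0:
            return i                    # full match at alignment i
        # Shift by the table entry for the text character under the
        # pattern's last position; characters absent from the pattern
        # allow a jump of the full pattern length.
        i += shift.get(text[i + m - 1], m)
    return -1
```

When the aligned text character doesn't occur in the pattern at all, the alignment jumps forward by the full pattern length, which is where the sublinear average-case behavior comes from.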

Complexity:

  • Time: O(mn) worst case (pathological inputs like pattern "AAA" in text "AAAAAA"), but the average case is O(n/m). That sublinear average means it can actually examine fewer characters than exist in the text. Searching for a 5-character pattern in 100 characters of text might need only about 20 comparisons on average.
  • Space: O(m + |Σ|), where |Σ| is the alphabet size. The bad character table needs an entry for each possible character (e.g., 256 for ASCII), and the good suffix table needs O(m) space.

Performance Comparison

Factor               | Brute-Force   | KMP           | Boyer-Moore
Preprocessing        | None          | O(m)          | O(m + |Σ|)
Worst-case time      | O(mn)         | O(m + n)      | O(mn)
Average-case time    | O(n)          | O(m + n)      | O(n/m)
Space                | O(1)          | O(m)          | O(m + |Σ|)
Comparison direction | Left to right | Left to right | Right to left

When to use each:

  • Brute-force is fine for short patterns in short texts (searching for a word in a sentence). The simplicity means less overhead, and for small inputs the O(mn) worst case doesn't matter.
  • KMP excels when the pattern contains repeated substrings or characters (like "ABABABABAB"). The failure function captures that repetition and prevents backtracking. It also guarantees O(m + n) regardless of input, making it a safe default for large inputs.
  • Boyer-Moore tends to be fastest in practice for searching short patterns in large texts with a reasonably sized alphabet. The sublinear average case means it often skips most of the text entirely. It's the algorithm behind many real-world search tools.

Factors that affect your choice:

  • Pattern length vs. text length: Longer patterns give Boyer-Moore more room to skip. Very long patterns with internal repetition favor KMP.
  • Alphabet size and character distribution: A larger alphabet (like ASCII) helps Boyer-Moore's bad character heuristic, since mismatched characters are less likely to appear in the pattern. A small alphabet (like DNA's {A, C, G, T}) reduces that advantage.
  • Repetition in the pattern: Patterns with lots of repeated prefixes/suffixes are where brute-force performs worst and KMP provides the biggest improvement.