14.3 String Searching Algorithms

Written by the Fiveable Content Team • Last updated August 2025

String Searching Algorithms

String searching algorithms find occurrences of a pattern (a short string) within a larger text. They show up everywhere: text editors doing find-and-replace, search engines scanning documents, bioinformatics tools matching DNA sequences. This section covers three approaches: brute-force, KMP, and Boyer-Moore, each with different tradeoffs in preprocessing cost and search speed.

Brute-force String Searching

The brute-force approach is the most straightforward way to search for a pattern in text. You slide the pattern across the text one position at a time and check for a match at each position.

How it works:

  1. Align the pattern with position 0 of the text.
  2. Compare characters left to right: pattern[0] with text[0], pattern[1] with text[1], and so on.
  3. If all characters match, you've found the pattern. Return the current position.
  4. If a mismatch occurs, shift the pattern one position to the right and start comparing again from pattern[0].
  5. Repeat until the pattern is found or you've exhausted all valid positions in the text.

For example, searching for "hello" in "hello world" starts by comparing at index 0. All five characters match, so it returns 0.
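The steps above can be sketched in a few lines of Python (a minimal illustration; the function name is my own):

```python
def brute_force_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):          # try each alignment of the pattern
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                      # compare left to right
        if j == m:                      # all m characters matched
            return i
    return -1                           # every alignment failed
```

Calling `brute_force_search("hello world", "hello")` returns 0, matching the example above.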

Complexity:

  • Time: O(mn) worst case, where m is the pattern length and n is the text length. Searching for a pattern of length 5 in text of length 100 could require up to 500 comparisons. The worst case happens with inputs like pattern "AAAB" in text "AAAAAAAAAB", where almost-complete matches force many comparisons before each mismatch.
  • Best case: O(n), when the first character of the pattern mismatches at most positions, so you skip ahead quickly.
  • Average case: O(n) for typical text with a reasonably large alphabet (like English), since mismatches tend to happen early.
  • Space: O(1), just a few index variables.

The brute-force method works fine for short patterns and small texts, but it wastes effort. After a partial match fails, it "forgets" everything it just learned and starts from scratch. KMP and Boyer-Moore fix this problem in different ways.

KMP Algorithm for Substring Searching

The Knuth-Morris-Pratt (KMP) algorithm avoids redundant comparisons by preprocessing the pattern. When a mismatch occurs partway through a match, KMP uses information about the pattern's internal structure to skip ahead intelligently instead of restarting from the next position.

The failure function (prefix function):

The key idea is the failure function (sometimes called the prefix function or partial match table). For each position i in the pattern, it stores the length of the longest proper prefix of the substring pattern[0..i] that is also a suffix of that substring.

For the pattern "ABABC":

Position | Substring | Longest prefix = suffix | Failure value
0        | A         | (none)                  | 0
1        | AB        | (none)                  | 0
2        | ABA       | A                       | 1
3        | ABAB      | AB                      | 2
4        | ABABC     | (none)                  | 0

So the failure function is [0, 0, 1, 2, 0].
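A short Python sketch of computing this table (a standard formulation; the function name is my own):

```python
def failure_function(pattern: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of pattern[0..i]
    that is also a suffix of that substring."""
    fail = [0] * len(pattern)
    k = 0                               # length of the prefix matched so far
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]             # fall back to the next-shorter candidate
        if pattern[i] == pattern[k]:
            k += 1                      # extend the current prefix match
        fail[i] = k
    return fail
```

For the pattern "ABABC" this produces [0, 0, 1, 2, 0], matching the table above.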

Algorithm steps:

  1. Preprocess: Compute the failure function for the pattern.

  2. Search: Compare the pattern against the text left to right, maintaining a pointer j into the pattern.

  3. On a match: Advance both the text pointer and j. If j reaches the end of the pattern, you've found a match.

  4. On a mismatch: Don't reset j to 0. Instead, set j to the failure function value at position j - 1. This jumps j back to the length of the longest prefix that could still be part of a valid match, skipping the characters you already know match.

  5. Repeat until the text is exhausted.

Why this helps: Suppose you're matching "ABABC" and you've matched "ABAB" before a mismatch. The failure value at position 3 is 2, meaning the last two characters of what you matched ("AB") are also the first two characters of the pattern. So you can continue comparing from pattern[2] instead of starting over.
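The full search can be sketched as follows, with the failure-function preprocessing inlined so the example is self-contained (function name is my own):

```python
def kmp_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    # Preprocess: failure function (longest proper prefix that is also a suffix).
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Search: the text pointer i only moves forward, never backward.
    j = 0                               # pointer into the pattern
    for i, c in enumerate(text):
        while j > 0 and c != pattern[j]:
            j = fail[j - 1]             # fall back instead of restarting at 0
        if c == pattern[j]:
            j += 1
        if j == m:
            return i - m + 1            # match ends at text index i
    return -1
```

Searching for "ABABC" in "ABABDABABC" returns 5: after the mismatch at "D", the failure function falls back without re-reading the text characters already consumed.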

Complexity:

  • Time: O(m + n). Building the failure function takes O(m), and the search takes O(n). Each character in the text is examined at most twice (once on match, once on mismatch fallback), so the total is linear.
  • Space: O(m) for the failure function array.

Boyer-Moore Algorithm

The Boyer-Moore algorithm takes a different approach: it compares characters right to left within the pattern and uses two preprocessing heuristics to skip large portions of the text. In practice, it's often the fastest string search algorithm for typical inputs.

Two heuristics:

  • Bad character heuristic: When a mismatch occurs, look at the mismatched character in the text. If that character appears elsewhere in the pattern, shift the pattern so that character aligns with its rightmost occurrence in the pattern. If it doesn't appear in the pattern at all, shift the entire pattern past that position. For example, if you're matching against text character "x" and "x" doesn't appear anywhere in the pattern, you can safely skip ahead by the full pattern length.
  • Good suffix heuristic: When a mismatch occurs after matching some suffix of the pattern, check whether that matched suffix appears elsewhere in the pattern. If it does, shift the pattern to align with that other occurrence. If not, shift to align with the longest prefix of the pattern that matches a suffix of the matched portion.

Algorithm steps:

  1. Preprocess the pattern to build the bad character table and the good suffix table.
  2. Align the pattern with the beginning of the text.
  3. Compare characters from right to left (starting at the last character of the pattern).
  4. If all characters match, report the match.
  5. On a mismatch, compute the shift from both heuristics and take the maximum of the two shifts.
  6. Shift the pattern by that amount and repeat from step 3.

Why right-to-left comparison matters: By starting at the end of the pattern, a mismatch on the very first comparison can let you skip ahead by up to m positions, since you immediately learn something about a character that's far into the text.
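A minimal sketch of the idea in Python, using only the bad character heuristic (this is the simplified Boyer-Moore-Horspool variant; full Boyer-Moore adds the good suffix table, omitted here for brevity, and the function name is my own):

```python
def bm_horspool_search(text: str, pattern: str) -> int:
    """Simplified Boyer-Moore (Horspool): bad-character shifts only."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # Shift table: for each character (except the last position), distance
    # from its rightmost occurrence to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    i = 0                               # left end of the current alignment
    while i <= n - m:
        j = m - 1                       # compare right to left
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1
        if j < 0:
            return i                    # full match at alignment i
        # Shift by the table entry for the text character under the
        # pattern's last position; characters absent from the pattern
        # allow a jump of the full pattern length.
        i += shift.get(text[i + m - 1], m)
    return -1
```

When the aligned text character doesn't occur in the pattern at all, the alignment jumps forward by the full pattern length, which is where the sublinear average-case behavior comes from.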

Complexity:

  • Time: O(mn) worst case (pathological inputs like pattern "AAA" in text "AAAAAA"), but the average case is O(n/m). That sublinear average means it can actually examine fewer characters than exist in the text. Searching for a 5-character pattern in 100 characters of text might need only about 20 comparisons on average.
  • Space: O(m + |Σ|), where |Σ| is the alphabet size. The bad character table needs an entry for each possible character (e.g., 256 for ASCII), and the good suffix table needs O(m) space.

Performance Comparison

Factor               | Brute-Force   | KMP           | Boyer-Moore
Preprocessing        | None          | O(m)          | O(m + |Σ|)
Worst-case time      | O(mn)         | O(m + n)      | O(mn)
Average-case time    | O(n)          | O(m + n)      | O(n/m)
Space                | O(1)          | O(m)          | O(m + |Σ|)
Comparison direction | Left to right | Left to right | Right to left

When to use each:

  • Brute-force is fine for short patterns in short texts (searching for a word in a sentence). The simplicity means less overhead, and for small inputs the O(mn) worst case doesn't matter.
  • KMP excels when the pattern contains repeated substrings or characters (like "ABABABABAB"). The failure function captures that repetition and prevents backtracking. It also guarantees O(m + n) regardless of input, making it a safe default for large inputs.
  • Boyer-Moore tends to be fastest in practice for searching short patterns in large texts with a reasonably sized alphabet. The sublinear average case means it often skips most of the text entirely. It's the algorithm behind many real-world search tools.

Factors that affect your choice:

  • Pattern length vs. text length: Longer patterns give Boyer-Moore more room to skip. Very long patterns with internal repetition favor KMP.
  • Alphabet size and character distribution: A larger alphabet (like ASCII) helps Boyer-Moore's bad character heuristic, since mismatched characters are less likely to appear in the pattern. A small alphabet (like DNA's {A, C, G, T}) reduces that advantage.
  • Repetition in the pattern: Patterns with lots of repeated prefixes/suffixes are where brute-force performs worst and KMP provides the biggest improvement.