String Search: Knuth-Morris-Pratt
August 25, 2009
We examined the brute-force method of searching for a pattern in a string in the previous exercise. Today, we look at a better string-searching method developed by Donald Knuth and Vaughn Pratt, and independently by James Morris, which they published jointly in 1977.
In the brute-force method, if the beginning of a pattern matches at the current position of the text, but leads to a mis-match before the end of the pattern, the search restarts at the beginning of the pattern moved one character along in the string. The Knuth-Morris-Pratt method takes advantage of the partial-match; since the characters are already known, and in fact are part of the pattern, it is possible to pre-compute based on the pattern, even before the search begins, where is the next character at which a search could possibly succeed. Consider for instance the pattern ABABAC being matched against the string ABABABAC. The first comparison looks like this:
str: A B A B A B A C
pat: A B A B A C
The first five characters match, the sixth fails. But instead of sliding the pattern one character to the right, as in the brute-force search,
str: A B A B A B A C
pat: A B A B A C
we can slide the pattern two characters to the right, since we already know the five characters of the pattern that matched:
str: A B A B A B A C
pat: A B A B A C
And we’re done. The skip size can be computed before the search ever starts, just by comparing the pattern to itself; the skip array, which is the same length as the pattern, tells how many characters to skip if the mis-match is found at a given pattern character. The first element of the skip array is always 0; if the first character of the pattern doesn’t match the first character of the string, slide the pattern one character to the right (that is, skip 0 characters). The second element of the skip array is also 0, because the A that is the first character of the pattern doesn’t match the B that is the second character of the pattern. The third element of the skip array is 1, because the A that is the third character of the pattern matches the A that is the first character of the pattern. The fourth element of the skip array is 2, because the AB that is the third and fourth characters of the pattern matches the AB that is the first and second characters of the pattern. The fifth element of the skip array is 3, because the ABA that is the third through fifth characters of the pattern matches the ABA that is the first through third characters of the pattern. And the sixth element of the skip array is 0, because the C that is the sixth character of the pattern doesn’t match any previous characters in the pattern.
Thus, the Knuth-Morris-Pratt algorithm works by pre-computing the skip array, then comparing the pattern to the string, character by character, occasionally skipping characters according to the skip array.
Your task is to write a function that searches for a pattern in a string using the Knuth-Morris-Pratt algorithm. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.