September 1, 2009
We examined the brute-force, Knuth-Morris-Pratt and Boyer-Moore methods of searching for a pattern in a string in three previous exercises. Today, we conclude this series of string-searching exercises with a look at a method devised by Michael Rabin (the same Michael Rabin we saw earlier in the exercise on primality checking) and Richard Karp in 1987 that uses hashing to find a substring in a text.
The basic idea of the Rabin-Karp algorithm is to take every pattern-length substring of the string being searched, hash the characters to a number, and compare that number to the hash number of the pattern. When a substring hash is equal to the pattern hash, a final check must be made to confirm the match.
It is useful to use a hash function that considers the characters in order; that way, considering each substring in sequence makes it easy to compute the various hashes. One approach is to add the integer values corresponding to each character; to compute the next hash, drop off the integer value of the first character and add the integer value of the current character to the running hash value.
The hash function that we will adopt treats each substring as a base-256 number. Thus, cat, with ascii values of 99, 97 and 116 for its three letters, hashes to 99 · 2562 + 97 · 2561 + 116 · 2560 = 6513012. Since hash values can grow large quickly, it is customary to take them modulo some convenient prime number.
Given this hash function, it is easy to roll the hash sequentially through the search string. If the search string is catch, the next hash after cat will be 6386787, which can be computed as 6513012 – 97 · 2562 = 24948, multipled by 256, thus 24948 × 256 = 6386688, then add 99, thus 6386688 + 99 = 6386787.
Your task is to write a function that searches for a pattern in a string using the Rabin-Karp algorithm. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.