String Search: Knuth-Morris-Pratt

August 25, 2009

We examined the brute-force method of searching for a pattern in a string in the previous exercise. Today, we look at a better string-searching method developed by Donald Knuth and Vaughn Pratt, and independently by James Morris, which they published jointly in 1977.

In the brute-force method, if the beginning of a pattern matches at the current position of the text, but leads to a mis-match before the end of the pattern, the search restarts at the beginning of the pattern moved one character along in the string. The Knuth-Morris-Pratt method takes advantage of the partial-match; since the characters are already known, and in fact are part of the pattern, it is possible to pre-compute based on the pattern, even before the search begins, where is the next character at which a search could possibly succeed. Consider for instance the pattern ABABAC being matched against the string ABABABAC. The first comparison looks like this:

str: A B A B A B A C
pat: A B A B A C

The first five characters match, the sixth fails. But instead of sliding the pattern one character to the right, as in the brute-force search,

str: A B A B A B A C
pat:   A B A B A C

we can slide the pattern two characters to the right, since we already know the five characters of the pattern that matched:

str: A B A B A B A C
pat:     A B A B A C

And we’re done. The skip size can be computed before the search ever starts, just by comparing the pattern to itself; the skip array, which is the same length as the pattern, tells how many characters to skip if the mis-match is found at a given pattern character. The first element of the skip array is always 0; if the first character of the pattern doesn’t match the first character of the string, slide the pattern one character to the right (that is, skip 0 characters). The second element of the skip array is also 0, because the A that is the first character of the pattern doesn’t match the B that is the second character of the pattern. The third element of the skip array is 1, because the A that is the third character of the pattern matches the A that is the first character of the pattern. The fourth element of the skip array is 2, because the AB that is the third and fourth characters of the pattern matches the AB that is the first and second characters of the pattern. The fifth element of the skip array is 3, because the ABA that is the third through fifth characters of the pattern matches the ABA that is the first through third characters of the pattern. And the sixth element of the skip array is 0, because the C that is the sixth character of the pattern doesn’t match any previous characters in the pattern.

Thus, the Knuth-Morris-Pratt algorithm works by pre-computing the skip array, then comparing the pattern to the string, character by character, occasionally skipping characters according to the skip array.

Your task is to write a function that searches for a pattern in a string using the Knuth-Morris-Pratt algorithm. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.

About these ads

Pages: 1 2

4 Responses to “String Search: Knuth-Morris-Pratt”

  1. Connochaetes said

    I remember this from college, I think the approach I took here is due to Sedgewick if my memory serves me. The initial -1 in the skip array is for the null-prefix of the pattern.

    Ruby:

    def preprocess(pattern)
    skip = [-1]
    i, j = 0, -1
    while i = 0) and (pattern[i] != pattern[j])
    j = skip[j]
    end
    i, j = i+1, j+1
    skip[i] = j
    end
    return skip
    end

    def kmp_search(str, pattern)
    i = 0; j = 0
    skip = preprocess(pattern)
    while i =0) & (str[i] != pattern[j])
    j = skip[j]
    end
    i,j = i+1, j+1
    return i-j if j == pattern.length
    end
    return nil
    end

  2. Connochaetes said

    I remember this from college, I think the approach I took here is due to Sedgewick if my memory serves me. The initial -1 in the skip array is for the null-prefix of the pattern.

    This is the correcty formatted version, please kill the previous post :)

    Ruby:

    def preprocess(pattern)
    skip = [-1]
    i, j = 0, -1
    while i = 0) and (pattern[i] != pattern[j])
    j = skip[j]
    end
    i, j = i+1, j+1
    skip[i] = j
    end
    return skip
    end
    
    def kmp_search(str, pattern)
    i = 0; j = 0
    skip = preprocess(pattern)
    while i =0) & (str[i] != pattern[j])
    j = skip[j]
    end
    i,j = i+1, j+1
    return i-j if j == pattern.length
    end
    return nil
    end
    
  3. Connochaetes said

    I lost tabs – oh well, I give up :) Time for bed.

  4. Vikas Tandi said

    My implementation in C

    #include "interface.h"
    #include <stdlib.h>
    #include <string.h>
    
    static void build_table(char *p, int *table);
    
    int string_index_2(char *s, char *p)
    {
    	int m, i;
    	int *table;
    
    	table = (int*)malloc(sizeof(int)*strlen(p));
    	if(table == NULL)
    		return -1;
    
    	/* precompute the skip table */
    	build_table(p, table);
    
    	/* search the pattern in string using skip table */
    	for(m = 0, i = 0; s[m+i] != '\0';)
    	{
    		for(; p[i] != '\0' && p[i] == s[m+i]; i++);
    		if(p[i] == '\0')
    		{
    			free(table);
    			return m+1;
    		}
    
    		/* set next values of m and i using skip table */
    		m = m + i - table[i];
    		if(table[i] < 0)
    			i = 0;
    		else
    			i = table[i];
    	}
    	free(table);
    	return -1;
    }
    
    static void build_table(char *p, int *table)
    {
    	int i, j;
    
    	table[0] = -1;
    	if(p[1] == '\0')
    		return;
    
    	table[1] = 0;
    
    	for(i = 0, j = 2; p[j] != '\0';)
    	{
    		if(p[i] == p[j-1])				/* the substring continues */
    			table[j++] = ++i;
    		else if(i > 0)					/* doesn't match but we can fallback */
    			i = table[i];
    		else
    			table[j++] = 0;
    	}
    }
    

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 577 other followers

%d bloggers like this: