Beautiful Code
September 11, 2009
In their book The Practice of Programming, Brian Kernighan and Rob Pike present this code for matching simple regular expressions consisting of literal characters, a dot matching any character, a star consisting of zero or more repetitions of the preceding character, and a caret and a dollar sign representing the beginning or end of the search string:
/* match: search for re anywhere in text */
int match(char *re, char *text)
{
if (re[0] == '^')
return matchhere(re+1, text);
do { /* must look at empty string */
if (matchhere(re, text))
return 1;
} while (*text++ != '\0');
return 0;
}
/* matchhere: search for re at beginning of text */
int matchhere(char *re, char *text)
{
if (re[0] == '\0')
return 1;
if (re[1] == '*')
return matchstar(re[0], re+2, text);
if (re[0] == '$' && re[1] == '\0')
return *text == '\0';
if (*text!='\0' && (re[0]=='.' || re[0]==*text))
return matchhere(re+1, text+1);
return 0;
}
/* matchstar: search for c*re at beginning of text */
int matchstar(int c, char *re, char *text)
{
do { /* a * matches zero or more instances */
if (matchhere(re, text))
return 1;
} while (*text!='\0' && (*text++==c || c=='.'));
return 0;
}
Many readers commented on the beauty of the code, and in a later book, Beautiful Code by Andy Oram and Greg Wilson, Kernighan explained the history of the code.
Your task is to port the code to your favorite language. You should use the features and idioms of your language, while simultaneously preserving the beauty of Rob Pike’s regular expression matcher. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.
Porter Stemming
September 8, 2009
Stemming, in the parlance of searching and information retrieval, is the operation of stripping the suffices from a word, leaving its stem. Google, for instance, uses stemming to search for web pages containing the words connected, connecting, connection and connections when you ask for a web page that contains the word connect.
There are basically two ways to implement stemming. The first approach is to create a big dictionary that maps words to their stems. The advantage of this approach is that it works perfectly (insofar as the stem of a word can be defined perfectly); the disadvantages are the space required by the dictionary and the investment required to maintain the dictionary as new words appear. The second approach is to use a set of rules that extract stems from words. The advantages of this approach are that the code is typically small, and it can gracefully handle new words; the disadvantage is that it occasionally makes mistakes. But, since stemming is imperfectly defined, anyway, occasional mistakes are tolerable, and the rule-based approach is the one that is generally chosen.
In 1979, Martin Porter developed a stemming algorithm that, with minor modifications, is still in use today; it uses a set of rules to extract stems from words, and though it makes some mistakes, most common words seem to work out right. Porter describes his algorithm and provides a reference implementation in C at http://tartarus.org/~martin/PorterStemmer/index.html; the description of the algorithm is repeated on the next page.
Your task is to write a function that stems words according to Porter’s algorithm; you should be aware that this exercise requires rather more code than we usually write, though it’s no harder than usual. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.
Ron’s Cipher #4
September 4, 2009
RC4 is a stream cipher, similar to the Blum Blum Shub cipher of an earlier exercise, invented by Ron Rivest in 1987. RC4 provides encryption for the SSL protocol that permits secure internet connections and the WEP protocol that secures wireless networks.
RC4 works by using a key to initialize a 256-byte array, which then forms the basis of a pseudo-random byte generator. The stream of random bytes is xor’ed with plaintext to produce ciphertext; decryption is performed by xor’ing the same stream of random bytes with the ciphertext to produce plaintext.
The key array K is initialized like this:
for i from 0 to 255
K[i] := i
j := 0
for i from 0 to 255
j := (j + K[i] + key[i mod klen]) mod 256
swap K[i], K[j]
Once the key array is initialized, the pseudo-random byte generator uses a similar calculation to build the stream of random bytes:
i := 0
j := 0
start:
i := (i + 1) mod 256
j := (j + K[i]) mod 256
swap K[i], K[j]
output K[ (K[i] + K[j]) mod 256 ]
goto start
Your task is to write a function that takes a key and a text and produces the corresponding output text; the same function can be used for both encryption and decryption. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.
String Search: Rabin-Karp
September 1, 2009
We examined the brute-force, Knuth-Morris-Pratt and Boyer-Moore methods of searching for a pattern in a string in three previous exercises. Today, we conclude this series of string-searching exercises with a look at a method devised by Michael Rabin (the same Michael Rabin we saw earlier in the exercise on primality checking) and Richard Karp in 1987 that uses hashing to find a substring in a text.
The basic idea of the Rabin-Karp algorithm is to take every pattern-length substring of the string being searched, hash the characters to a number, and compare that number to the hash number of the pattern. When a substring hash is equal to the pattern hash, a final check must be made to confirm the match.
It is useful to use a hash function that considers the characters in order; that way, considering each substring in sequence makes it easy to compute the various hashes. One approach is to add the integer values corresponding to each character; to compute the next hash, drop off the integer value of the first character and add the integer value of the current character to the running hash value.
The hash function that we will adopt treats each substring as a base-256 number. Thus, cat, with ascii values of 99, 97 and 116 for its three letters, hashes to 99 · 2562 + 97 · 2561 + 116 · 2560 = 6513012. Since hash values can grow large quickly, it is customary to take them modulo some convenient prime number.
Given this hash function, it is easy to roll the hash sequentially through the search string. If the search string is catch, the next hash after cat will be 6386787, which can be computed as 6513012 – 97 · 2562 = 24948, multipled by 256, thus 24948 × 256 = 6386688, then add 99, thus 6386688 + 99 = 6386787.
Your task is to write a function that searches for a pattern in a string using the Rabin-Karp algorithm. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.
String Search: Boyer-Moore
August 28, 2009
The two previous exercises discussed the brute-force and Knuth-Morris-Pratt algoritms for searching strings. Today we discuss the Boyer-Moore string search algorithm, invented by Bob Boyer and J Strother Moore in 1977, in a variant devised by Nigel Horspool.
The Boyer-Moore algorithm is a “backwards” version of the Knuth-Morris-Pratt algorithm. It looks at the last character of the pattern first, working its way right-to-left until it finds a mis-match, when it slides the pattern right along the search string for a skip size based on the current character.
Consider the pattern ABABAC, the same pattern used in the prior exercise. The skip array is:
String Search: Knuth-Morris-Pratt
August 25, 2009
We examined the brute-force method of searching for a pattern in a string in the previous exercise. Today, we look at a better string-searching method developed by Donald Knuth and Vaughn Pratt, and independently by James Morris, which they published jointly in 1977.
In the brute-force method, if the beginning of a pattern matches at the current position of the text, but leads to a mis-match before the end of the pattern, the search restarts at the beginning of the pattern moved one character along in the string. The Knuth-Morris-Pratt method takes advantage of the partial-match; since the characters are already known, and in fact are part of the pattern, it is possible to pre-compute based on the pattern, even before the search begins, where is the next character at which a search could possibly succeed. Consider for instance the pattern ABABAC being matched against the string ABABABAC. The first comparison looks like this:
str: A B A B A B A C
pat: A B A B A C
The first five characters match, the sixth fails. But instead of sliding the pattern one character to the right, as in the brute-force search,
str: A B A B A B A C
pat: A B A B A C
we can slide the pattern two characters to the right, since we already know the five characters of the pattern that matched:
str: A B A B A B A C
pat: A B A B A C
And we’re done. The skip size can be computed before the search ever starts, just by comparing the pattern to itself; the skip array, which is the same length as the pattern, tells how many characters to skip if the mis-match is found at a given pattern character. The first element of the skip array is always 0; if the first character of the pattern doesn’t match the first character of the string, slide the pattern one character to the right (that is, skip 0 characters). The second element of the skip array is also 0, because the A that is the first character of the pattern doesn’t match the B that is the second character of the pattern. The third element of the skip array is 1, because the A that is the third character of the pattern matches the A that is the first character of the pattern. The fourth element of the skip array is 2, because the AB that is the third and fourth characters of the pattern matches the AB that is the first and second characters of the pattern. The fifth element of the skip array is 3, because the ABA that is the third through fifth characters of the pattern matches the ABA that is the first through third characters of the pattern. And the sixth element of the skip array is 0, because the C that is the sixth character of the pattern doesn’t match any previous characters in the pattern.
Thus, the Knuth-Morris-Pratt algorithm works by pre-computing the skip array, then comparing the pattern to the string, character by character, occasionally skipping characters according to the skip array.
Your task is to write a function that searches for a pattern in a string using the Knuth-Morris-Pratt algorithm. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.
String Search: Brute Force
August 21, 2009
In this exercise, and several that will follow, we will examine algorithms for searching a string. Our goal is to write a function that takes a pattern and a string and either returns the index of the first location of the pattern or an indication that the pattern is not present in the string; an optional third argument allows the search to start at an arbitrary point in the string. Our pattern is simply a fixed string; unlike regular expressions, there are no meta-characters.
Our first string-search algorithm uses brute force: place the pattern at each possible location in the search string and compare character-by-character to determine if there is a match.
Your task is to write a function that performs brute-force string searching as described above. Since we will be writing several other string searching functions, you should also write a test suite that will find any errors in your function. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Blum Blum Shub
August 18, 2009
A stream cipher takes its input one plaintext character at a time, producing the corresponding ciphertext character before it moves on to the next plaintext character. An easy way to do this is to seed a pseudo-random number generator and xor its output with the stream of plaintext characters; because of the xor, decryption is the exact same operation as encryption. The pseudo-random number generator must be crypographically secure; such common pseudo-random number generators as linear congruential, lagged fibonacci, and mersenne twister all fail due to serial correlation between successive values.
A simple pseudo-random number generator that is cryptographically secure is known as Blum Blum Shub, after its three inventors. Successive values are computed as the least-significant bit (or bits) of the number xk = xk-12 (mod n). The parameter n is the product of two large primes p and q, each congruent to 3 modulo 4. The initial value x0 is calculated as s2 (mod n), where s is the seed, which must be coprime to n.
Your task is to write a function that enciphers and deciphers a stream of characters as described above. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.
Pairing Heaps
August 14, 2009
We previously implemented priority queues using leftist heaps. In this exercise we will implement the same library using a different data structure known as a pairing heap. Pairing heaps are an unusual data structure; they are simple to implement, and performance is good, but their time complexity is not proved, only conjectured, and the analysis has defied many good algorithmists. Our presentation is based on Chris Okasaki’s book Purely Functional Data Structures.
Pairing heaps are heap-ordered multi-way trees, with each node of each tree less than its children. A pairing heap is either empty or is a node consisting of an item and a list of pairing heaps, none of which may be empty. The find-first function simply returns the item at the root of the tree of nodes. Merge takes two trees and makes the tree with the larger root the leftmost child of the tree with the smaller root. Insert builds a singleton node, then calls merge.
The fundamental operation on pairing heaps is the delete-first operation, which discards the root of the tree and merges its children in two passes. The first pass runs from left to right, merging child nodes pair-wise, the first child with the second, the third child with the fourth, and so on. The second pass merges the resulting trees, again pair-wise, from right to left.
Your task is to implement priority queues using pairing heaps, using the same interface as the prior exercise. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Update: The Daily WTF
August 14, 2009
Alex Papadimoulis has agreed to change the name of his series of programming exercises, thus ending the dispute between us. I thank him for agreeing to the change, and wish him well.
To all of my regular readers: I apologize again for involving you in this dispute. The next exercise will be published in a few hours, and I hope you are as anxious to get back to programming as I am.
To all those people who found my blog for the first time: Welcome! Please stay. Enjoy the exercises. And please post your solutions, so we may all learn from your work.