Green Eyes

September 29, 2009

We haven’t done a math problem in a while. This one comes from my daughter’s high-school math class. She is never quite sure whether or not to ask me for help; sometimes she gets much more help than she really wants.

In a group of twenty-seven people, eleven have blue eyes, thirteen have brown eyes, and three have green eyes. If three people are randomly selected from the group, what is the probability that exactly one of them will have green eyes?

Your task is to find the probability. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.

Pages: 1 2

Grep

September 25, 2009

Grep is the quintessential unix utility: it has a funny name, has more command-line switches than a Swiss Army knife has blades, and dates to prehistoric times. In its simplest incarnation, grep takes a regular expression and a file and prints those lines of the file that match the regular expression.

Since we wrote a regular-expression matcher in the three previous exercises, it is easy to write our own version of grep. We’ll keep it simple:

grep [-v] regex [file ...]

The -v flag inverts the sense of the match: print only those lines that do not match the regular expression. One or more files may be named on the command line; if so, the filename is printed before each matching (or non-matching, for -v) line. If no filename is given, grep reads from standard input. All output goes to standard output.

Your task is to write grep. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.

Pages: 1 2

Regular Expressions, Part 3

September 22, 2009

We are generally rather cavalier about testing the code we write on Programming Praxis, but the regular expression parser and matcher we wrote in the two previous exercises are complicated, with many opportunities for error, and deserve proper testing.

Your task is to write a comprehensive test suite for the regular expression parser and matcher of the two previous exercises. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.

Pages: 1 2

Regular Expressions, Part 2

September 18, 2009

In the previous exercise, we decided on an intermediate representation for regular expressions and wrote a parser that produces the intermediate representation corresponding to a regular expression given as input.

Your task today is to write the corresponding matcher that takes an intermediate representation and a target string and returns a boolean indicating whether or not the regular expression matches the target string, using the same conventions as the prior exercise. Your matcher will be similar to Rob Pike’s matcher that we studied previously. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Regular Expressions, Part 1

September 15, 2009

In the previous exercise we studied a simple regular-expression matcher. In this exercise, and the one that follows, we will write our own matcher for a somewhat larger set of regular expressions, based on those provided in the books Software Tools and Software Tools in Pascal by Brian W. Kernighan and P. J. Plauger:

  • c — the literal character c
  • . — any character
  • ^ and $ — the beginning or end of the search text
  • […] and [^…] — the set (or the negation of the set) of characters between brackets
  • x* — zero or more repetitions of the regular expression element x, which may be a literal character c, the unspecified character dot, or a character class

Anywhere a single character may appear, an escaped character may appear. An escape sequence consists of a backslash followed by a single character. If the character is n, the escape sequence represents a newline. If the character is t, the escape sequence represents a horizontal tab. If the character is anything else, it represents itself literally.

A character class consists of characters, escaped characters, or character ranges. A character range consists of a lower-case letter followed by a hyphen followed by a larger lower-case letter, or the same sequence using upper-case letters, or a digit followed by a hyphen followed by a larger digit.

Here are some sample regular expressions:

  • [0-9][0-9]* — matches an integer
  • ^..*$ — matches a non-empty line of text
  • hello — matches the string hello anywhere in the search text
  • ^ *hello *$ — matches a search text containing the string hello, possibly surrounded by space characters, but nothing else
  • ^[^x].*[0-9] *x$ — matches a string that starts with any character except a literal x, followed by any number (including zero) of characters, followed by a single digit, followed by any number (including zero) of space characters, and ends with a literal x

Pike’s matcher works by interpreting the regular expression directly from its input. That simple method won’t work for the more complicated regular expressions we are considering here, because it is too time-consuming to re-interpret each regular expression element each time it appears. Instead, we will pre-compile the regular expression into an intermediate form, using a parser, then match the input text using the pre-compiled intermediate form. The intermediate form consists of the following elements, arranged in an array or linked list: the beginning-of-line and end-of-line markers, the any character and literal characters, character classes and negated character classes, and closures of the any character, a literal character, or either of the two types of character classes.

Your task today is to determine a suitable intermediate form, then write a parser that takes a regular expression and outputs the corresponding intermediate form; in the next exercise, you will write the corresponding matcher. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.

Pages: 1 2

Beautiful Code

September 11, 2009

In their book The Practice of Programming, Brian Kernighan and Rob Pike present this code for matching simple regular expressions consisting of literal characters, a dot matching any character, a star consisting of zero or more repetitions of the preceding character, and a caret and a dollar sign representing the beginning or end of the search string:

/* match: search for re anywhere in text */
int match(char *re, char *text)
{
   if (re[0] == '^')
      return matchhere(re+1, text);
   do { /* must look at empty string */
      if (matchhere(re, text))
         return 1;
   } while (*text++ != '\0');
   return 0;
}

/* matchhere: search for re at beginning of text */
int matchhere(char *re, char *text)
{
   if (re[0] == '\0')
      return 1;
   if (re[1] == '*')
      return matchstar(re[0], re+2, text);
   if (re[0] == '$' && re[1] == '\0')
      return *text == '\0';
   if (*text!='\0' && (re[0]=='.' || re[0]==*text))
      return matchhere(re+1, text+1);
   return 0;
}

/* matchstar: search for c*re at beginning of text */
int matchstar(int c, char *re, char *text)
{
   do { /* a * matches zero or more instances */
      if (matchhere(re, text))
         return 1;
   } while (*text!='\0' && (*text++==c || c=='.'));
   return 0;
}

Many readers commented on the beauty of the code, and in a later book, Beautiful Code by Andy Oram and Greg Wilson, Kernighan explained the history of the code.

Your task is to port the code to your favorite language. You should use the features and idioms of your language, while simultaneously preserving the beauty of Rob Pike’s regular expression matcher. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.

Pages: 1 2

Porter Stemming

September 8, 2009

Stemming, in the parlance of searching and information retrieval, is the operation of stripping the suffices from a word, leaving its stem. Google, for instance, uses stemming to search for web pages containing the words connected, connecting, connection and connections when you ask for a web page that contains the word connect.

There are basically two ways to implement stemming. The first approach is to create a big dictionary that maps words to their stems. The advantage of this approach is that it works perfectly (insofar as the stem of a word can be defined perfectly); the disadvantages are the space required by the dictionary and the investment required to maintain the dictionary as new words appear. The second approach is to use a set of rules that extract stems from words. The advantages of this approach are that the code is typically small, and it can gracefully handle new words; the disadvantage is that it occasionally makes mistakes. But, since stemming is imperfectly defined, anyway, occasional mistakes are tolerable, and the rule-based approach is the one that is generally chosen.

In 1979, Martin Porter developed a stemming algorithm that, with minor modifications, is still in use today; it uses a set of rules to extract stems from words, and though it makes some mistakes, most common words seem to work out right. Porter describes his algorithm and provides a reference implementation in C at http://tartarus.org/~martin/PorterStemmer/index.html; the description of the algorithm is repeated on the next page.

Your task is to write a function that stems words according to Porter’s algorithm; you should be aware that this exercise requires rather more code than we usually write, though it’s no harder than usual. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.

Pages: 1 2 3

Ron’s Cipher #4

September 4, 2009

RC4 is a stream cipher, similar to the Blum Blum Shub cipher of an earlier exercise, invented by Ron Rivest in 1987. RC4 provides encryption for the SSL protocol that permits secure internet connections and the WEP protocol that secures wireless networks.

RC4 works by using a key to initialize a 256-byte array, which then forms the basis of a pseudo-random byte generator. The stream of random bytes is xor’ed with plaintext to produce ciphertext; decryption is performed by xor’ing the same stream of random bytes with the ciphertext to produce plaintext.

The key array K is initialized like this:

for i from 0 to 255
    K[i] := i

j := 0

for i from 0 to 255
    j := (j + K[i] + key[i mod klen]) mod 256
    swap K[i], K[j]

Once the key array is initialized, the pseudo-random byte generator uses a similar calculation to build the stream of random bytes:

i := 0
j := 0

start:
    i := (i + 1) mod 256
    j := (j + K[i]) mod 256
    swap K[i], K[j]
    output K[ (K[i] + K[j]) mod 256 ]
    goto start

Your task is to write a function that takes a key and a text and produces the corresponding output text; the same function can be used for both encryption and decryption. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.

Pages: 1 2

String Search: Rabin-Karp

September 1, 2009

We examined the brute-force, Knuth-Morris-Pratt and Boyer-Moore methods of searching for a pattern in a string in three previous exercises. Today, we conclude this series of string-searching exercises with a look at a method devised by Michael Rabin (the same Michael Rabin we saw earlier in the exercise on primality checking) and Richard Karp in 1987 that uses hashing to find a substring in a text.

The basic idea of the Rabin-Karp algorithm is to take every pattern-length substring of the string being searched, hash the characters to a number, and compare that number to the hash number of the pattern. When a substring hash is equal to the pattern hash, a final check must be made to confirm the match.

It is useful to use a hash function that considers the characters in order; that way, considering each substring in sequence makes it easy to compute the various hashes. One approach is to add the integer values corresponding to each character; to compute the next hash, drop off the integer value of the first character and add the integer value of the current character to the running hash value.

The hash function that we will adopt treats each substring as a base-256 number. Thus, cat, with ascii values of 99, 97 and 116 for its three letters, hashes to 99 · 2562 + 97 · 2561 + 116 · 2560 = 6513012. Since hash values can grow large quickly, it is customary to take them modulo some convenient prime number.

Given this hash function, it is easy to roll the hash sequentially through the search string. If the search string is catch, the next hash after cat will be 6386787, which can be computed as 6513012 – 97 · 2562 = 24948, multipled by 256, thus 24948 × 256 = 6386688, then add 99, thus 6386688 + 99 = 6386787.

Your task is to write a function that searches for a pattern in a string using the Rabin-Karp algorithm. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.

Pages: 1 2