Word Hy-phen-a-tion By Com-pu-ter
April 24, 2009
In 1982, Frank Liang published in his doctoral dissertation a description of the hyphenation algorithm that he invented for Donald Knuth’s TeX typesetting program; his algorithm is the basis of most of the hyphenation programs used today. Liang’s algorithm is described in Appendix H of The TeXbook by Knuth. Liang’s algorithm never inserts a hyphen where one does not belong, but does miss some opportunities to insert hyphens where they do belong; Liang claims to find 95% of the hyphens in most texts. Liang’s algorithm is quick, and reasonably stingy with space.
Liang’s method works by pre-computing a list of hyphenating and inhibiting patterns based on a given hyphenation dictionary. First, a list of hyphenating patterns is established; for instance, -tion and c-c are good hyphenating patterns. But all the patterns have exceptions; for instance, the -tion pattern improperly hyphenates the word cat-ion. Thus, a set of inhibiting patterns prevents hypenations; in this case, the inhibition pattern .ca=t (the dot indicates the beginning or end of a word, the equal-sign inhibits a hyphen) overrides the hyphenation pattern and prevents the hyphen at ca-tion. Of course, there are exceptions to the inhibition patterns; Liang’s algorithm goes five levels deep — hyphenation, inhibition, hyphenation, inhibition, hyphenation — to get a good set of patterns, and even then requires an exception listing to fix a few words that would otherwise be hyphenated incorrectly. It is also required that a word must have a least five letters to be hyphenated, no hyphens can be inserted after the first letter or before the last letter of a word, and any word that includes non-alphabetic characters is never hyphenated.
Consider this example of hyphenating the word hyphenation: there are two hyphenating patterns 1na and 1tio, then three inhibiting patterns n2at, 2io, and he2n, then another hyphenating pattern hy3ph, another inhibiting pattern hena4, and finally another hyphenating pattern hen5at:
. h y p h e n a t i o n .
1n a
1t i o
n2a t
2i o
h e2n
h y3p h
h e n a4
h e n5a t
.0h0y3p0h0e2n5a4t2i0o0n0.
h y-p h e n-a t i o n
After all the patterns present in the word are identified, the highest number at any inter-letter position wins (we’ll call that number a rule), and hyphens are inserted at all the odd-numbered rules. In this case, the 1na pattern originally placed a hyphen at hyphe-nation, but the he2n pattern inhibited the hyphen, because rule 2 trumps rule 1; likewise, hena4 inhibited 1tio. Note that three patterns start at the fourth letter of hyphenation: he2n, hena4, and hen5at.
Liang’s hyphenation algorithm iterates through the letters of the input word, identifying all the patterns that are present in the word, and takes the maximum rule at each position. The task of identifying the patterns calls for some sort of search algorithm; Liang used a trie that he cleverly packed into an array.
Finally, Liang’s dissertation described a program patgen for pre-computing the hyphenating and inhibiting rules, for English or any other language, given a hyphenating dictionary. See his dissertation for details. The original lists of patterns and exceptions used by Liang for TeX 82 are shown on the next page.
Your task is to write a function that takes an input word and returns a list of suitable hyphenation points. When you are finished, you can read or run a suggested solution, or post your solution or discuss the exercise in the comments below.
Probabilistic Spell Checking
April 21, 2009
In a previous exercise we built a spell checker based on storing words in a trie. That spell checker was exact: the spell checker reported success if and only if the checked word was in the dictionary. Today we will build a spell checker that is probabilistic: it correctly reports failure if the checked word is not in the dictionary, and correctly reports success if the checked word is in the dictionary, but may also incorrectly report success even if the checked word is not in the dictionary. The probability of error can be made arbitrarily small, as determined by the programmer.
We will use a bloom filter, a data structure invented by Burton Bloom in 1970 to test membership is a set. A bloom filter consists of an array of m bits, plus k different hash functions that map set elements to the range 0 to m-1. All the bits are initially 0. To add an element, calculate the k hash values of the element, and set each kth bit to 1. To test if an element is in the set, calculate the k hash values of the element, return true if all of the kth bits are 1, and false if any is 0. In this way, it is certain that the element is not in the set if any hash returns 0, but it is possible that an element not in the set may be incorrectly reported as being in the set if all of the hashes return 1, but one of the hashes was set by some other element.
The easiest way to build a large number of hash functions is to use a single hash function and “salt” the dictionary words with an additional letter. For instance, to hash the word “hello” three times, use “ahelloa”, “bhellob”, and “chelloc” and hash with a standard string-hashing function.
There is some considerable math involved in determining the appropriate values of m and k. For a set of n elements, the probability p of a false positive is given by the formula:
To give this a sense of scale, storing a fifty thousand word dictionary in a bloom filter of a million bits using seven hash functions will result in a false positive every 5102 words, on average.
Your task is to build a probabilistic spell checker as described above. When you are finished, you can read or run a suggested solution, or post your solution or discuss the exercise in the comments below.
Spell Checking
April 17, 2009
Spell checkers are ubiquitous. Word processors have spell checkers, as do browser-based e-mail clients. They all work the same way: a dictionary is stored in some data structure, then each word of input is submitted to a search in the data structure, and those that fail are flagged as spelling errors. There is a certain art to building the word list on which a spell checker is based; Exxon isn’t a word in anybody’s dictionary, but it is likely included in the word list, but cwm (a geological structure), which is certainly a word, is most likely omitted from the word list, on the grounds it is more likely an error than a legitimate spelling (unless you are a geologist).
There are many appropriate data structures to store the word list, including a sorted array accessed via binary search, a hash table, or a bloom filter. In this exercise you are challenged to store the word list character-by-character in a trie. Consider this sample trie that stores the words CAR, CART, CAT and DOG:

To see that CAT is a valid spelling, follow the C branch, then the A branch, then the T branch, where you find a filled circle, indicating a word. To see that CAB is not a valid spelling, follow the C branch, then the A branch, and fail when there is no B branch.
Tries can be very fast; for instance, to see that QARTER (a misspelling of QUARTER) is not a word, follow the Q branch, then fail when there is no A branch. This is even faster than hashing, which must read all six letters of QARTER just to compute the hash value. And tries can also be space-efficient, since space is shared between common prefixes.
Your task is to build a spell checker based on tries. When you are finished, you can read or run a suggested solution, or post your solution or discuss the exercise in the comments below.
Google Treasure Hunt 2008 Puzzle 4
April 14, 2009
This puzzle comes from the 2008 Google Treasure Hunt:
Find the smallest number that can be expressed as the sum of 7 consecutive prime numbers, the sum of 17 consecutive prime numbers, the sum of 41 consecutive prime numbers, the sum of 541 consecutive prime numbers, and is itself a prime number.
For example, 41 is the smallest prime number that can be expressed as the sum of 3 consecutive primes (11 + 13 + 17 = 41) and the sum of 6 consecutive primes (2 + 3 + 5 + 7 + 11 + 13 = 41).
Your task is to find the number. When you are finished, you can read or run a suggested solution, or post your solution or discuss the exercise in the comments below.
Anagrams
April 10, 2009
Words that are formed from the same set of letters are anagrams of each other. For instance, pots, post, stop, spot, opts, and tops are anagrams.
Your task is to write a program that, given a dictionary and an input word, prints all the anagrams of the input word. You are also to determine the largest anagram class in your dictionary. When you are finished, you can read or run a suggested solution, or post your own solution or discuss the exercise in the comments below.
Flipping Pancakes
April 7, 2009
Pity the poor waiter:
The chef in our place is sloppy, and when he prepares a stack of pancakes they come out all different sizes. Therefore, when I deliver them to a customer, on the way to the table I rearrange them (so that the smallest winds up on top, and so on, down to the largest at the bottom) by grabbing several from the top and flipping them over, repeating this (varying the number I flip) as many times as necessary.
Your task is to write a function that sorts a list of unique positive integers into ascending order using the pancake-flipping algorithm. When you are finished, you can read or run a suggested solution, or post your solution or discuss the exercise in the comments below.
Programming the Turing Machine
April 3, 2009
In a previous exercise you wrote a turing machine simulator. In this exercise, you are challenged to write a program for the turing machine that performs multiplication. Write a program that takes an input like _ _ _ [1] 1 1 * 1 1 1 1 _ _ _, which indicates the multiplication 3 × 4, and writes the output _ _ _ 1 1 1 1 1 1 1 1 1 1 1 [1] _ _ _.
Rail-Fence Cipher
March 31, 2009
The rail-fence cipher is a transposition cipher that rearranges the characters of a clear-text to form the cipher-text. The clear-text is arranged in up-and-down waves like the tops of the pickets on a rail fence; the cipher key is the height of the fence. For instance, the encipherment of “PROGRAMMING PRAXIS” with a key of 4 is shown below, using an underscore to make the space character visible:
P R O G R A M M I N G _ P R A X I S
P M P = P M P
R A M _ R S = R A M _ R S
O R I G A I = O R I G A I
G N X = G N X
The cipher-text is read at the right of the pickets: “PMPRAM RSORIGAIGNX”.
Your task is to write functions that encipher and decipher texts using the rail-fence method. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.
A Turing Machine Simulator
March 27, 2009
A turing machine is a hypothetical computer, invented in 1936 by the British mathematician Alan Turing, that can evaluate any computable function. Though hypothetical, turing machines play a large role in computer science. Many variants of turing machines have been devised; the purpose of this exercise is to build a simulator for a particular type of turing machine.
The hardware of our turing machine consists of a read/write head and an infinite tape divided into cells that passes across the read/write head. Each cell on the tape contains a symbol, which we take as an ascii character; there is a special blank symbol that is present in all cells which are not yet written. The read/write head can see one cell at a time, read its contents, and write a new symbol (possibly the blank symbol) to the cell.
The software of our turing machine consists of a table encoding a transition function and a state register that maintains the current state. In our machine, states are numbered with non-negative integers; the machine always starts in state zero, and halts if it enters a negative-numbered state. The program being executed by the turing machine is encoded in the transition function, which takes two inputs, the current state and the symbol in the current cell, and then writes a new symbol (or the same symbol, or the special symbol blank) in the current cell, then either moves the read/write head one cell to the left or one cell to the right or remains in the same place, then resets the state to a new state. The transition function is thus encoded in a table of 5-tuples q1 s1 s2 d q2 where q1 is the current state, s1 is the symbol currently under the read/write head, s2 is the symbol to be written, d is the direction to move the tape, and q2 is the next state to enter.
As an example, consider the tape _ _ _ [1] 1 1 + 1 1 1 1 1 _ _ _. This is our notation for a tape that contains, in the middle of an infinite number of blanks, the nine symbols consisting of three one’s, a plus sign, and five one’s, with the read/write head positioned over the first one; the tape can be considered to model the problem of adding the two numbers three and five. The program
0 1 1 R 0
0 + 1 R 1
1 1 1 R 1
1 _ _ L 2
2 1 _ L -1
adds the two numbers and writes _ _ _ 1 1 1 1 1 1 1 [1] _ _ _ as its output.
Write a function that takes a turing-machine program and a tape and simulates the action of a turing machine, writing the final tape as output.
Binary Search
March 23, 2009
Children who are learning arithmetic sometimes play a number-guessing game: “I’m thinking of a number between 1 and 100. Can you guess it?” “Is the number less than 50?” “Yes.” “Is the number less than 25?” “No.” And so on, halving the interval at each step until only one number is left. This technique is known colloquially as the binary chop.
Your first task is to write a function that takes a target number and an array of numbers in non-decreasing order and returns either the position of the number in the array, or -1 to indicate the target number is not in the array. For instance, bsearch(32, [13 19 24 29 32 37 43]) should return 4, since 32 is the fourth element of the array (counting from zero).
Beware that this exercise is harder than it looks. Jon Bentley, in his book Programming Pearls, reports that 90% of professional programmers cannot write a proper implementation of binary search in two hours, and Donald Knuth, in the second volume of his book The Art of Computer Programming, reports that though the first binary search was published in 1946, the first bug-free binary search wasn’t published until 1962.
Thus, your second task is to write a suitable test program that shows the accuracy of your binary search function.