April 28, 2009
Yesterday was the 218th anniversary of the birth of Samuel F. B. Morse, who invented the telegraph. We present this simple exercise in his honor.
Morse Code is an encoding of alphanumeric characters using short and long pulses of sound. Originally developed in the 1840s for the use of Morse’s telegraph, it is still in use today, primarily in the fields of amateur radio and aviation. Letters and digits consist of dots and dashes as shown in the chart at right; a single space is inserted between characters, and an extra space is inserted between words. For instance, “Programming Praxis” is rendered in Morse code as the string “
• — — • • — • — — — — — • • — • • — — — — — • • — • — — • • — — • • — • • — — • • — • • • • •“.
|A • —||N — •||1 • — — — —|
|B — • • •||O — — —||2 • • — — —|
|C — • — •||P • — — •||3 • • • — —|
|D — • •||Q — — • —||4 • • • • —|
|E •||R • — •||5 • • • • •|
|F • • — •||S • • •||6 — • • • •|
|G — — •||T —||7 — — • • •|
|H • • • •||U • • —||8 — — — • •|
|I • •||V • • • —||9 — — — — •|
|J • — — —||W • — —||0 — — — — —|
|K — • —||X — • • —|
|L • — • •||Y — • — —|
|M — —||Z — — • •|
Your task is to write functions that convert back and forth between character strings and Morse code. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.
April 24, 2009
In 1982, Frank Liang published in his doctoral dissertation a description of the hyphenation algorithm that he invented for Donald Knuth’s TeX typesetting program; his algorithm is the basis of most of the hyphenation programs used today. Liang’s algorithm is described in Appendix H of The TeXbook by Knuth. Liang’s algorithm never inserts a hyphen where one does not belong, but does miss some opportunities to insert hyphens where they do belong; Liang claims to find 95% of the hyphens in most texts. Liang’s algorithm is quick, and reasonably stingy with space.
Liang’s method works by pre-computing a list of hyphenating and inhibiting patterns based on a given hyphenation dictionary. First, a list of hyphenating patterns is established; for instance,
c-c are good hyphenating patterns. But all the patterns have exceptions; for instance, the
-tion pattern improperly hyphenates the word
cat-ion. Thus, a set of inhibiting patterns prevents hypenations; in this case, the inhibition pattern
.ca=t (the dot indicates the beginning or end of a word, the equal-sign inhibits a hyphen) overrides the hyphenation pattern and prevents the hyphen at
ca-tion. Of course, there are exceptions to the inhibition patterns; Liang’s algorithm goes five levels deep — hyphenation, inhibition, hyphenation, inhibition, hyphenation — to get a good set of patterns, and even then requires an exception listing to fix a few words that would otherwise be hyphenated incorrectly. It is also required that a word must have a least five letters to be hyphenated, no hyphens can be inserted after the first letter or before the last letter of a word, and any word that includes non-alphabetic characters is never hyphenated.
Consider this example of hyphenating the word hyphenation: there are two hyphenating patterns
1tio, then three inhibiting patterns
he2n, then another hyphenating pattern
hy3ph, another inhibiting pattern
hena4, and finally another hyphenating pattern
. h y p h e n a t i o n .
1t i o
h y3p h
h e n a4
h e n5a t
h y-p h e n-a t i o n
After all the patterns present in the word are identified, the highest number at any inter-letter position wins (we’ll call that number a rule), and hyphens are inserted at all the odd-numbered rules. In this case, the
1na pattern originally placed a hyphen at
hyphe-nation, but the
he2n pattern inhibited the hyphen, because rule 2 trumps rule 1; likewise,
1tio. Note that three patterns start at the fourth letter of hyphenation:
Liang’s hyphenation algorithm iterates through the letters of the input word, identifying all the patterns that are present in the word, and takes the maximum rule at each position. The task of identifying the patterns calls for some sort of search algorithm; Liang used a trie that he cleverly packed into an array.
Finally, Liang’s dissertation described a program patgen for pre-computing the hyphenating and inhibiting rules, for English or any other language, given a hyphenating dictionary. See his dissertation for details. The original lists of patterns and exceptions used by Liang for TeX 82 are shown on the next page.
Your task is to write a function that takes an input word and returns a list of suitable hyphenation points. When you are finished, you can read or run a suggested solution, or post your solution or discuss the exercise in the comments below.
April 21, 2009
In a previous exercise we built a spell checker based on storing words in a trie. That spell checker was exact: the spell checker reported success if and only if the checked word was in the dictionary. Today we will build a spell checker that is probabilistic: it correctly reports failure if the checked word is not in the dictionary, and correctly reports success if the checked word is in the dictionary, but may also incorrectly report success even if the checked word is not in the dictionary. The probability of error can be made arbitrarily small, as determined by the programmer.
We will use a bloom filter, a data structure invented by Burton Bloom in 1970 to test membership is a set. A bloom filter consists of an array of m bits, plus k different hash functions that map set elements to the range 0 to m-1. All the bits are initially 0. To add an element, calculate the k hash values of the element, and set each kth bit to 1. To test if an element is in the set, calculate the k hash values of the element, return true if all of the kth bits are 1, and false if any is 0. In this way, it is certain that the element is not in the set if any hash returns 0, but it is possible that an element not in the set may be incorrectly reported as being in the set if all of the hashes return 1, but one of the hashes was set by some other element.
The easiest way to build a large number of hash functions is to use a single hash function and “salt” the dictionary words with an additional letter. For instance, to hash the word “hello” three times, use “ahelloa”, “bhellob”, and “chelloc” and hash with a standard string-hashing function.
There is some considerable math involved in determining the appropriate values of m and k. For a set of n elements, the probability p of a false positive is given by the formula:
To give this a sense of scale, storing a fifty thousand word dictionary in a bloom filter of a million bits using seven hash functions will result in a false positive every 5102 words, on average.
Your task is to build a probabilistic spell checker as described above. When you are finished, you can read or run a suggested solution, or post your solution or discuss the exercise in the comments below.
April 17, 2009
Spell checkers are ubiquitous. Word processors have spell checkers, as do browser-based e-mail clients. They all work the same way: a dictionary is stored in some data structure, then each word of input is submitted to a search in the data structure, and those that fail are flagged as spelling errors. There is a certain art to building the word list on which a spell checker is based; Exxon isn’t a word in anybody’s dictionary, but it is likely included in the word list, but cwm (a geological structure), which is certainly a word, is most likely omitted from the word list, on the grounds it is more likely an error than a legitimate spelling (unless you are a geologist).
There are many appropriate data structures to store the word list, including a sorted array accessed via binary search, a hash table, or a bloom filter. In this exercise you are challenged to store the word list character-by-character in a trie. Consider this sample trie that stores the words CAR, CART, CAT and DOG:
To see that CAT is a valid spelling, follow the C branch, then the A branch, then the T branch, where you find a filled circle, indicating a word. To see that CAB is not a valid spelling, follow the C branch, then the A branch, and fail when there is no B branch.
Tries can be very fast; for instance, to see that QARTER (a misspelling of QUARTER) is not a word, follow the Q branch, then fail when there is no A branch. This is even faster than hashing, which must read all six letters of QARTER just to compute the hash value. And tries can also be space-efficient, since space is shared between common prefixes.
April 14, 2009
This puzzle comes from the 2008 Google Treasure Hunt:
Find the smallest number that can be expressed as the sum of 7 consecutive prime numbers, the sum of 17 consecutive prime numbers, the sum of 41 consecutive prime numbers, the sum of 541 consecutive prime numbers, and is itself a prime number.
For example, 41 is the smallest prime number that can be expressed as the sum of 3 consecutive primes (11 + 13 + 17 = 41) and the sum of 6 consecutive primes (2 + 3 + 5 + 7 + 11 + 13 = 41).
April 10, 2009
Words that are formed from the same set of letters are anagrams of each other. For instance, pots, post, stop, spot, opts, and tops are anagrams.
Your task is to write a program that, given a dictionary and an input word, prints all the anagrams of the input word. You are also to determine the largest anagram class in your dictionary. When you are finished, you can read or run a suggested solution, or post your own solution or discuss the exercise in the comments below.
April 7, 2009
Pity the poor waiter:
The chef in our place is sloppy, and when he prepares a stack of pancakes they come out all different sizes. Therefore, when I deliver them to a customer, on the way to the table I rearrange them (so that the smallest winds up on top, and so on, down to the largest at the bottom) by grabbing several from the top and flipping them over, repeating this (varying the number I flip) as many times as necessary.
Your task is to write a function that sorts a list of unique positive integers into ascending order using the pancake-flipping algorithm. When you are finished, you can read or run a suggested solution, or post your solution or discuss the exercise in the comments below.
April 3, 2009
In a previous exercise you wrote a turing machine simulator. In this exercise, you are challenged to write a program for the turing machine that performs multiplication. Write a program that takes an input like
_ _ _  1 1 * 1 1 1 1 _ _ _, which indicates the multiplication 3 × 4, and writes the output
_ _ _ 1 1 1 1 1 1 1 1 1 1 1  _ _ _.