Word Hy-phen-a-tion By Com-pu-ter
April 24, 2009
In 1982, Frank Liang published in his doctoral dissertation a description of the hyphenation algorithm that he invented for Donald Knuth’s TeX typesetting program; his algorithm is the basis of most of the hyphenation programs used today. Liang’s algorithm is described in Appendix H of The TeXbook by Knuth. Liang’s algorithm never inserts a hyphen where one does not belong, but does miss some opportunities to insert hyphens where they do belong; Liang claims to find 95% of the hyphens in most texts. Liang’s algorithm is quick, and reasonably stingy with space.
Liang’s method works by pre-computing a list of hyphenating and inhibiting patterns based on a given hyphenation dictionary. First, a list of hyphenating patterns is established; for instance, -tion
and c-c
are good hyphenating patterns. But all the patterns have exceptions; for instance, the -tion
pattern improperly hyphenates the word cat-ion
. Thus, a set of inhibiting patterns prevents hypenations; in this case, the inhibition pattern .ca=t
(the dot indicates the beginning or end of a word, the equal-sign inhibits a hyphen) overrides the hyphenation pattern and prevents the hyphen at ca-tion
. Of course, there are exceptions to the inhibition patterns; Liang’s algorithm goes five levels deep — hyphenation, inhibition, hyphenation, inhibition, hyphenation — to get a good set of patterns, and even then requires an exception listing to fix a few words that would otherwise be hyphenated incorrectly. It is also required that a word must have a least five letters to be hyphenated, no hyphens can be inserted after the first letter or before the last letter of a word, and any word that includes non-alphabetic characters is never hyphenated.
Consider this example of hyphenating the word hyphenation: there are two hyphenating patterns 1na
and 1tio
, then three inhibiting patterns n2at
, 2io
, and he2n
, then another hyphenating pattern hy3ph
, another inhibiting pattern hena4
, and finally another hyphenating pattern hen5at
:
. h y p h e n a t i o n .
1n a
1t i o
n2a t
2i o
h e2n
h y3p h
h e n a4
h e n5a t
.0h0y3p0h0e2n5a4t2i0o0n0.
h y-p h e n-a t i o n
After all the patterns present in the word are identified, the highest number at any inter-letter position wins (we’ll call that number a rule), and hyphens are inserted at all the odd-numbered rules. In this case, the 1na
pattern originally placed a hyphen at hyphe-nation
, but the he2n
pattern inhibited the hyphen, because rule 2 trumps rule 1; likewise, hena4
inhibited 1tio
. Note that three patterns start at the fourth letter of hyphenation: he2n
, hena4
, and hen5at
.
Liang’s hyphenation algorithm iterates through the letters of the input word, identifying all the patterns that are present in the word, and takes the maximum rule at each position. The task of identifying the patterns calls for some sort of search algorithm; Liang used a trie that he cleverly packed into an array.
Finally, Liang’s dissertation described a program patgen for pre-computing the hyphenating and inhibiting rules, for English or any other language, given a hyphenating dictionary. See his dissertation for details. The original lists of patterns and exceptions used by Liang for TeX 82 are shown on the next page.
Your task is to write a function that takes an input word and returns a list of suitable hyphenation points. When you are finished, you can read or run a suggested solution, or post your solution or discuss the exercise in the comments below.