Word Hy-phen-a-tion By Com-pu-ter

April 24, 2009

In 1982, Frank Liang published in his doctoral dissertation a description of the hyphenation algorithm that he invented for Donald Knuth’s TeX typesetting program; his algorithm is the basis of most of the hyphenation programs used today. Liang’s algorithm is described in Appendix H of The TeXbook by Knuth. Liang’s algorithm never inserts a hyphen where one does not belong, but does miss some opportunities to insert hyphens where they do belong; Liang claims to find 95% of the hyphens in most texts. Liang’s algorithm is quick, and reasonably stingy with space.

Liang’s method works by pre-computing a list of hyphenating and inhibiting patterns based on a given hyphenation dictionary. First, a list of hyphenating patterns is established; for instance, -tion and c-c are good hyphenating patterns. But all the patterns have exceptions; for instance, the -tion pattern improperly hyphenates the word cat-ion. Thus, a set of inhibiting patterns prevents hypenations; in this case, the inhibition pattern .ca=t (the dot indicates the beginning or end of a word, the equal-sign inhibits a hyphen) overrides the hyphenation pattern and prevents the hyphen at ca-tion. Of course, there are exceptions to the inhibition patterns; Liang’s algorithm goes five levels deep — hyphenation, inhibition, hyphenation, inhibition, hyphenation — to get a good set of patterns, and even then requires an exception listing to fix a few words that would otherwise be hyphenated incorrectly. It is also required that a word must have a least five letters to be hyphenated, no hyphens can be inserted after the first letter or before the last letter of a word, and any word that includes non-alphabetic characters is never hyphenated.

Consider this example of hyphenating the word hyphenation: there are two hyphenating patterns 1na and 1tio, then three inhibiting patterns n2at, 2io, and he2n, then another hyphenating pattern hy3ph, another inhibiting pattern hena4, and finally another hyphenating pattern hen5at:

. h y p h e n a t i o n .
           1n a
               1t i o
            n2a t
                 2i o
        h e2n
  h y3p h
        h e n a4
        h e n5a t
.0h0y3p0h0e2n5a4t2i0o0n0.
  h y-p h e n-a t i o n

After all the patterns present in the word are identified, the highest number at any inter-letter position wins (we’ll call that number a rule), and hyphens are inserted at all the odd-numbered rules. In this case, the 1na pattern originally placed a hyphen at hyphe-nation, but the he2n pattern inhibited the hyphen, because rule 2 trumps rule 1; likewise, hena4 inhibited 1tio. Note that three patterns start at the fourth letter of hyphenation: he2n, hena4, and hen5at.

Liang’s hyphenation algorithm iterates through the letters of the input word, identifying all the patterns that are present in the word, and takes the maximum rule at each position. The task of identifying the patterns calls for some sort of search algorithm; Liang used a trie that he cleverly packed into an array.

Finally, Liang’s dissertation described a program patgen for pre-computing the hyphenating and inhibiting rules, for English or any other language, given a hyphenating dictionary. See his dissertation for details. The original lists of patterns and exceptions used by Liang for TeX 82 are shown on the next page.

Your task is to write a function that takes an input word and returns a list of suitable hyphenation points. When you are finished, you can read or run a suggested solution, or post your solution or discuss the exercise in the comments below.

About these ads

Pages: 1 2 3

3 Responses to “Word Hy-phen-a-tion By Com-pu-ter”

  1. [...] Praxis – Word hypenation By Remco Niemeijer Today’s Programming Praxis problem is about word hyphenation. Let’s see what we can come up [...]

  2. Remco Niemeijer said

    My Haskell solution (see http://bonsaicode.wordpress.com/2009/04/24/programming-praxis-word-hypenation/ for the commented version):

    import Data.Char
    import Data.List
    import Data.List.HT
    
    exceptions :: [(String, String)]
    exceptions = zip (map (filter isLetter) ws) ws
        where ws = words "as-so-ciate as-so-ciates dec-li-na-tion \
                         \oblig-a-tory phil-an-thropic present presents \
                         \project projects reci-procity re-cog-ni-zance \
                         \ref-or-ma-tion ret-ri-bu-tion ta-ble"
    
    main :: IO ()
    main = do patterns <- fmap words $ readFile "patterns.txt"
              print $ hyphenate patterns "hyphenation"
              print $ hyphenate patterns "associate"
    
    hyphenate :: [String] -> String -> String
    hyphenate ps s = maybe (hyphenate' s ps) id $ lookup s exceptions
    
    hyphenate' :: String -> [String] -> String
    hyphenate' s = concat . intersperse "-" . map (filter isLetter) .
                   chop (\c -> isDigit c && odd (digitToInt c)) .
                   foldl (flip (tryPattern . format)) ("." ++ format s ++ ".")
    
    format :: String -> String
    format (x:y:xs) | all isLetter [x, y] = x : '0' : format (y:xs)
    format (x:xs)   = x : format xs
    format []       = []
    
    tryPattern :: String -> String -> String
    tryPattern _ [] = []
    tryPattern p s  = x : tryPattern p xs
                      where (x:xs) = if match p s then overlay p s else s
    
    match :: String -> String -> Bool
    match (x:xs) (y:ys) = (all isDigit [x, y] || x == y) && match xs ys
    match xs     _      = null xs
    
    overlay :: String -> String -> String
    overlay p = zipWith max (p ++ repeat '0')
    
  3. programmingpraxis said

    The standard solution doesn’t follow the requirement that any word containing a non-alphabetic character is never hyphenated. Here is a version that does:

    (define (new-hyphenate word)
      (let ((ws (string->list word)))
        (cond ((< (length ws) 5) (list word))
              ((any? (lambda (c) (not (char-alphabetic? c))) ws) (list word))
              (else (let loop ((ws (append (list #\.) ws (list #\.)))
                               (front (make-list (+ (length ws) 3) 0)) (back '()))
                      (if (null? ws)
                          (let loop ((cs (string->list word)) (hs (fixup back)) (p '()) (ps '()))
                            (cond ((null? (cdr hs)) (reverse (cons (list->string (reverse p)) ps)))
                                  ((odd? (car hs))
                                    (loop (cdr cs) (cdr hs) (list (car cs))
                                          (cons (list->string (reverse p)) ps)))
                                  (else (loop (cdr cs) (cdr hs) (cons (car cs) p) ps))))
                          (let ((new-front (fold-left max-rule front (t-looks ws t-hyph))))
                            (loop (cdr ws) (cdr new-front) (cons (car new-front) back))))))))))

    This function requires any?, which will soon be added to the Standard Prelude:

    (define (any? pred? xs)
      (cond ((null? xs) #f)
            ((pred? (car xs)) #t)
            (else (any? pred? (cdr xs)))))

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 632 other followers

%d bloggers like this: