Word Hy-phen-a-tion By Com-pu-ter

April 24, 2009

We will follow Liang’s algorithm closely, but use the tries of a previous exercise instead of Liang’s packed tries. We store the patterns and exceptions in lists (they are symbols, not strings, because that’s how Liang’s listing is given):

(define patterns '(.ach4 .ad4der .af1t ... zte4 4z1z2 z4zy))

(define exceptions '(as-so-ciate ... ta-ble))

Our first task is to split a pattern like hen5at into the key henat and rules 000500; note the use of symbol->string:

(define (split-pat pat)
  (let loop ((ps (string->list (symbol->string pat))) (cs '()) (ns '()))
    (cond ((null? ps)
            (if (= (length cs) (length ns))
                (values (reverse cs) (reverse (cons 0 ns)))
                (values (reverse cs) (reverse ns))))
          ((and (null? (cdr ps)) (char-numeric? (car ps)))
            (values (reverse cs)
                    (reverse (cons (- char->integer (car ps)) 48) ns))))
          ((not (char-numeric? (car ps)))
            (loop (cdr ps) (cons (car ps) cs) (cons 0 ns)))
          (else (loop (cddr ps)
                      (cons (cadr s) cs)
                      (cons (- (char->integer (car ps)) 48) ns))))))

Exceptions are handled by making them un-exceptional: an exception like ta-ble is turned into a pattern like 6.6t6a7b6l6e6.6 and entered into the trie along with the regular patterns. Because 7 and 6 are greater than the other hyphenation and inhibition rules, and because the pattern is anchored with a dot at beginning and end, the exception rules will always trump any other rules that may exist in the word. Turning the exceptions into patterns means they don’t have to be handled specially, so the program can be simpler. Here is the code to split an exception into pattern and key:

(define (split-exn exn)
  (let loop ((ps (append (list #\.) (string->list (symbol->string exn)) (list #\.)))
             (cs '()) (ns '()))
    (cond ((null? ps) (values (reverse cs) (reverse (cons 6 ns))))
          ((char=? (car ps) #\-)
            (loop (cddr ps) (cons (cadr ps) cs) (cons 7 ns)))
          (else (loop (cdr ps) (cons (car ps) cs) (cons 6 ns))))))

T-hyph is the trie that stores all the key/rule pairs, built by looping through the lists of patterns and exceptions:

(define t-hyph
  (let loop ((pats patterns) (exns exceptions) (t t-null))
    (cond ((null? exns) t)
          ((null? pats)
            (call-with-values (lambda () (split-exn (car exns)))
              (lambda (cs ns) (loop pats (cdr exns) (t-bind cs ns t)))))
          (else (call-with-values (lambda () (split-pat (car pats)))
                  (lambda (cs ns) (loop (cdr pats) exns (t-bind cs ns t))))))))

The standard trie returns the single rule associated with a key. We need instead to get a list of all the rules along the path from a substring; for instance, when examining the substring henation., we find the three patterns he2n, hena4, and hen5at. T-looks is similar to t-look, but returns a list, possibly null, instead of a single value:

(define (t-looks ks t)
  (if (null? ks)
      (if (pair? (car t)) (list (caar t)) '())
      (let ((x (a-look (car ks) (cdr t))))
        (if x
            (if (pair? (car t))
                (cons (caar t) (t-looks (cdr ks) (cdr x)))
                (t-looks (cdr ks) (cdr x)))
            (if (pair? (car t)) (list (caar t)) '())))))

We are ready now to show the function that performs hyphenation. Hyphenate takes a single word, as a string, and returns a list of substrings between hyphenation points; for instance, (hyphenate "praxis") returns ("prax" "is").

The outer loop iterates over the letters of the input word, surrounded by dots. The rules are stored in a two-part list, the parts before the current letter and the part after the current letter. When the input word is exhausted, the substrings are collected in the first inner loop. Otherwise, the fold-left in the second inner loop saves the maximum rule at each position for each pattern returned by t-looks.

Hyphenate looks complicated, but examined in pieces it’s not too bad:

(define (hyphenate word)
  (if (< (string-length word) 5) (list word)
    (let loop ((ws (append (list #\.) (string->list word) (list #\.)))
               (front (make-list (+ (string-length word) 3) 0)) (back '()))
      (if (null? ws)
          (let loop ((cs (string->list word)) (hs (fixup back)) (p '()) (ps '()))
            (cond ((null? (cdr hs)) (reverse (cons (list->string (reverse p)) ps)))
                  ((odd? (car hs))
                    (loop (cdr cs) (cdr hs) (list (car cs))
                          (cons (list->string (reverse p)) ps)))
                  (else (loop (cdr cs) (cdr hs) (cons (car cs) p) ps))))
          (let ((new-front (fold-left max-rule front (t-looks ws t-hyph))))
            (loop (cdr ws) (cdr new-front) (cons (car new-front) back)))))))

Hyphenate calls two helper functions: max-rule is called by fold-left to loop through the rules for a particular pattern, saving only the maximum at each step, and fixup implements the requirement that a word cannot begin or end with a single-letter substring:

(define (max-rule xs ys)
  (let loop ((xs xs) (ys ys) (zs '()))
    (if (or (null? xs) (null? ys))
        (append (reverse zs) xs)
        (loop (cdr xs) (cdr ys) (cons (max (car xs) (car ys)) zs)))))

(define (fixup xs)
  (cons 0 (cons 0 (cdddr (reverse (cons 0 (cons 0 (cddr xs))))))))

Here is a peek at hyphenate in action:

> (hyphenate "hyphenation")
("hy" "phen" "ation")

Fold-left and make-list come from the Standard Prelude. You can run this program at http://programmingpraxis.codepad.org/s0N597UX.

Pages: 1 2 3

3 Responses to “Word Hy-phen-a-tion By Com-pu-ter”

  1. […] Praxis – Word hypenation By Remco Niemeijer Today’s Programming Praxis problem is about word hyphenation. Let’s see what we can come up […]

  2. Remco Niemeijer said

    My Haskell solution (see http://bonsaicode.wordpress.com/2009/04/24/programming-praxis-word-hypenation/ for the commented version):

    import Data.Char
    import Data.List
    import Data.List.HT
    
    exceptions :: [(String, String)]
    exceptions = zip (map (filter isLetter) ws) ws
        where ws = words "as-so-ciate as-so-ciates dec-li-na-tion \
                         \oblig-a-tory phil-an-thropic present presents \
                         \project projects reci-procity re-cog-ni-zance \
                         \ref-or-ma-tion ret-ri-bu-tion ta-ble"
    
    main :: IO ()
    main = do patterns <- fmap words $ readFile "patterns.txt"
              print $ hyphenate patterns "hyphenation"
              print $ hyphenate patterns "associate"
    
    hyphenate :: &#91;String&#93; -> String -> String
    hyphenate ps s = maybe (hyphenate' s ps) id $ lookup s exceptions
    
    hyphenate' :: String -> [String] -> String
    hyphenate' s = concat . intersperse "-" . map (filter isLetter) .
                   chop (\c -> isDigit c && odd (digitToInt c)) .
                   foldl (flip (tryPattern . format)) ("." ++ format s ++ ".")
    
    format :: String -> String
    format (x:y:xs) | all isLetter [x, y] = x : '0' : format (y:xs)
    format (x:xs)   = x : format xs
    format []       = []
    
    tryPattern :: String -> String -> String
    tryPattern _ [] = []
    tryPattern p s  = x : tryPattern p xs
                      where (x:xs) = if match p s then overlay p s else s
    
    match :: String -> String -> Bool
    match (x:xs) (y:ys) = (all isDigit [x, y] || x == y) && match xs ys
    match xs     _      = null xs
    
    overlay :: String -> String -> String
    overlay p = zipWith max (p ++ repeat '0')
    
  3. programmingpraxis said

    The standard solution doesn’t follow the requirement that any word containing a non-alphabetic character is never hyphenated. Here is a version that does:

    (define (new-hyphenate word)
      (let ((ws (string->list word)))
        (cond ((< (length ws) 5) (list word))
              ((any? (lambda (c) (not (char-alphabetic? c))) ws) (list word))
              (else (let loop ((ws (append (list #\.) ws (list #\.)))
                               (front (make-list (+ (length ws) 3) 0)) (back '()))
                      (if (null? ws)
                          (let loop ((cs (string->list word)) (hs (fixup back)) (p '()) (ps '()))
                            (cond ((null? (cdr hs)) (reverse (cons (list->string (reverse p)) ps)))
                                  ((odd? (car hs))
                                    (loop (cdr cs) (cdr hs) (list (car cs))
                                          (cons (list->string (reverse p)) ps)))
                                  (else (loop (cdr cs) (cdr hs) (cons (car cs) p) ps))))
                          (let ((new-front (fold-left max-rule front (t-looks ws t-hyph))))
                            (loop (cdr ws) (cdr new-front) (cons (car new-front) back))))))))))

    This function requires any?, which will soon be added to the Standard Prelude:

    (define (any? pred? xs)
      (cond ((null? xs) #f)
            ((pred? (car xs)) #t)
            (else (any? pred? (cdr xs)))))

Leave a comment