Spell Checking

April 17, 2009

Spell checkers are ubiquitous. Word processors have spell checkers, as do browser-based e-mail clients. They all work the same way: a dictionary is stored in some data structure, then each word of input is submitted to a search in the data structure, and those that fail are flagged as spelling errors. There is a certain art to building the word list on which a spell checker is based; Exxon isn’t a word in anybody’s dictionary, but it is likely included in the word list, but cwm (a geological structure), which is certainly a word, is most likely omitted from the word list, on the grounds it is more likely an error than a legitimate spelling (unless you are a geologist).

There are many appropriate data structures to store the word list, including a sorted array accessed via binary search, a hash table, or a bloom filter. In this exercise you are challenged to store the word list character-by-character in a trie. Consider this sample trie that stores the words CAR, CART, CAT and DOG:

trie

To see that CAT is a valid spelling, follow the C branch, then the A branch, then the T branch, where you find a filled circle, indicating a word. To see that CAB is not a valid spelling, follow the C branch, then the A branch, and fail when there is no B branch.

Tries can be very fast; for instance, to see that QARTER (a misspelling of QUARTER) is not a word, follow the Q branch, then fail when there is no A branch. This is even faster than hashing, which must read all six letters of QARTER just to compute the hash value. And tries can also be space-efficient, since space is shared between common prefixes.

Your task is to build a spell checker based on tries. When you are finished, you can read or run a suggested solution, or post your solution or discuss the exercise in the comments below.

Pages: 1 2

6 Responses to “Spell Checking”

  1. FalconNL said

    In Haskell:

    import qualified Data.ByteString.Char8 as B
    import Data.Trie

    main = do trie <- fmap (fromList . map (flip (,) ()) . B.lines) $ B.readFile "words.txt" print $ valid trie "qarter" valid trie = flip member trie . B.pack [/sourcecode]

  2. Eric H said

    Another in scheme:

    #! /bin/sh
    #| Hey Emacs, this is -*-scheme-*- code!
    #$Id: v4-script-template.ss 5887 2008-12-30 18:12:50Z erich $
    exec mzscheme -l errortrace –require "$0" –main — ${1+"$@"}
    |#
    #lang scheme
    (require "trie.ss"
    (planet schematics/schemeunit:3)
    (planet schematics/schemeunit:3/text-ui))
    (define snarf-dictionary
    (match-lambda
    [(? string? inp)
    (snarf-dictionary (build-path inp))]
    [(? path? inp)
    (fprintf (current-error-port) "Reading dictionary ~s … " inp)
    (let ((dict (call-with-input-file inp snarf-dictionary)))
    (fprintf (current-error-port) "done; ~s words~%"
    (dict-count dict))
    dict)]
    [(? input-port? inp)
    (for/fold ([dict (make-immutable-trie)])
    ([word (in-lines inp)])
    (dict-set dict word word ))]))
    (define dict-tests
    (test-suite
    "dictionary"
    (let ((d (snarf-dictionary "/usr/share/dict/words")))
    (check-not-false (dict-ref d "dog"))
    (check-false (dict-ref d "I bet this word isn't in the dictionary"))
    (printf "Hey: ~a~%"
    (for/list ((w (in-list '("sam" "Sam" "snord" "flutter" "butter" "smith" "Smith"))))
    (cons w (dict-ref d w))))
    )))
    (define trie-tests
    (test-suite
    "top"
    (test-case
    "tries"
    (let ((t (make-immutable-trie)))
    (check-true (trie? t))
    (check-equal? (dict-count t) 0)
    (let ((t (dict-set t "c" "The letter 'c'")))
    (check-equal? (dict-count t) 1)
    (check-equal? (dict-ref t "c") "The letter 'c'")
    (let* ((exp "The furry 'cat'")
    (t (dict-set t "cat" exp)))
    (check-false (dict-ref t "ca" #f))
    (check-equal? (dict-count t) 2)
    (check-equal? (dict-ref t "cat") exp)
    ))
    )
    (let ((t (make-immutable-trie)))
    (check-true (dict? t))
    (check-false (dict-ref t "cat" #f))
    (check-equal? (dict-count t) 0)
    (let ((t (dict-set t "c" 'plurgh)))
    (check-equal? (dict-ref t "c" #f) 'plurgh)
    (check-equal? (dict-count t) 1)
    (let ((t (dict-set t "cat" "hat")))
    (check-equal? (dict-ref t "cat" #f) "hat")
    (let ((t (dict-set t "cats" "mats")))
    (check-equal? (dict-ref t "cat" #f) "hat")
    (check-equal? (dict-ref t "cats" #f) "mats")
    (check-equal? (dict-count t) 3))
    )
    )))
    dict-tests))
    (provide main)
    (define (main . args)
    (exit (run-tests
    trie-tests
    'verbose)))

    view raw
    trie-tests.ss
    hosted with ❤ by GitHub

    #lang scheme
    (require mzlib/trace)
    (define (ref t key [failure-result #f])
    (define (ref-inner t chars)
    (if (null? chars)
    (if (box? (trie-value t))
    (unbox (trie-value t))
    failure-result)
    (let ((probe (dict-ref (trie-alist t) (car chars) #f)))
    (cond
    ((not probe)
    failure-result)
    ((box? probe)
    (unbox probe))
    (#t
    (ref-inner probe
    (cdr chars)))))))
    ;; (trace ref-inner)
    (ref-inner t (string->list key)))
    ;; (trace ref)
    (define (set t key value)
    (define (set-inner t chars value)
    ;; chars: () => set the box
    ;; chars: (ch . rest) => lookup ch => old value; (set-inner oldvalue (cdr chars) value)
    (let ((new (make-trie '() (box-immutable value))))
    (if (null? chars)
    new
    (let ((probe (dict-ref (trie-alist t) (car chars) (make-trie '() #f))))
    (make-trie
    (dict-set (trie-alist t) (car chars)
    (set-inner probe (cdr chars) value))
    (trie-value t))))))
    ;; (trace set-inner)
    (set-inner t (string->list key) value))
    ;; (trace set)
    (define (count t)
    (foldl +
    (if (box? (trie-value t))
    1
    0)
    (map (compose count cdr) (trie-alist t))))
    ;; (trace count)
    (define (iterate-first t)
    (and (not (null? t))
    0))
    ;; (trace iterate-first)
    (define (iterate-next t pos)
    (if (= pos (sub1 (length t)))
    #f
    (add1 pos)))
    ;; (trace iterate-next)
    (define (iterate-key t pos)
    (car (list-ref t pos)))
    ;; (trace iterate-key)
    (define (iterate-value t pos)
    (cdr (list-ref t pos)))
    ;; (trace iterate-value)
    (define-struct trie
    (alist
    value ;;either #f or a box
    )
    #:property prop:dict (vector
    ref
    #f set
    #f remove
    count
    iterate-first iterate-next
    iterate-key iterate-value)
    #:transparent)
    (provide make-immutable-trie trie?)
    (define (make-immutable-trie)
    (make-trie '() #f))

    view raw
    trie.ss
    hosted with ❤ by GitHub

  3. Vikas Tandi said

    implemented in c language:
    http://codepad.org/PmYVlSN6

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: