Probabilistic Spell Checking

April 21, 2009

We define k, m and bloom as global variables; since standard Scheme doesn’t provide bit-arrays (though every Scheme implementation provides them) we use a vector of booleans:

(define k 7)

(define m 1000000)

(define bloom (make-vector m #f))

Function (key i word) downcases the input word, salts it with the ith salt, hashes it with string-hash, and maps the hash to the range 0 to m-1:

(define (key i word)
  (let* ((c (string (integer->char (+ i 96))))
         (s (string-append c (string-downcase word) c)))
    (modulo (string-hash s) m)))

We read the dictionary and insert each word into the bloom filter for each of the k salts:

(with-input-from-file "/usr/dict/words"
  (lambda ()
    (do ((word (read-line) (read-line))) ((eof-object? word))
      (do ((i 0 (+ i 1))) ((= i k))
        (vector-set! bloom (key i word) #t)))))

Then to query the bloom filter we look up each of the k hashes of the word to be checked:

(define (spell word)
  (let loop ((i 0))
    (if (= i k) #t
      (and (vector-ref bloom (key i word))
           (loop (+ i 1))))))

Here’s an example:

> (spell "programming")
#t

The advantage of bloom filters for a spell checker, as compared to the tries of the previous exercise, is the small space they occupy. Real spell checkers strip prefixes and suffixes — for instance, mis, re-, pre- and -ed are stripped from misrepresented, leaving the stem sent — then look up the stem in a bloom filter. With affixes removed, an English dictionary of 125,000 words can be represented as about 20,000 stems, which can be checked with an error rate of 1 in 5102 using 7 hash functions and a 400,000 bit bloom filter, which fits in just 50KB. Adding space for a simple-minded affix stripper and for the program itself, a high-quality spell checker can fit in about 64KB, which is tiny compared to modern RAM sizes.

String-hash, read-line and string-downcase come from the Standard Prelude. You can run the spell-checker at http://codepad.org/zUt3Ezz1.

Pages: 1 2

One Response to “Probabilistic Spell Checking”

  1. FalconNL said

    In Haskell:

    import Data.BloomFilter.Easy
    import Data.Char

    main = do dict <- fmap (easyList 0.01 . map lowercase . lines) $ readFile "words.txt" print $ dict `contains` "ValiD" print $ dict `contains` "xyzzy" contains b s = elemB (lowercase s) b lowercase = map toLower [/sourcecode] Long live appropriately named libraries :)

Leave a comment