Probabilistic Spell Checking
April 21, 2009
We define k
, m
and bloom
as global variables; since standard Scheme doesn’t provide bit-arrays (though every Scheme implementation provides them) we use a vector of booleans:
(define k 7)
(define m 1000000)
(define bloom (make-vector m #f))
Function (key i word)
downcases the input word, salts it with the ith salt, hashes it with string-hash
, and maps the hash to the range 0 to m-1:
(define (key i word)
(let* ((c (string (integer->char (+ i 96))))
(s (string-append c (string-downcase word) c)))
(modulo (string-hash s) m)))
We read the dictionary and insert each word into the bloom filter for each of the k salts:
(with-input-from-file "/usr/dict/words"
(lambda ()
(do ((word (read-line) (read-line))) ((eof-object? word))
(do ((i 0 (+ i 1))) ((= i k))
(vector-set! bloom (key i word) #t)))))
Then to query the bloom filter we look up each of the k hashes of the word to be checked:
(define (spell word)
(let loop ((i 0))
(if (= i k) #t
(and (vector-ref bloom (key i word))
(loop (+ i 1))))))
Here’s an example:
> (spell "programming")
#t
The advantage of bloom filters for a spell checker, as compared to the tries of the previous exercise, is the small space they occupy. Real spell checkers strip prefixes and suffixes — for instance, mis, re-, pre- and -ed are stripped from misrepresented, leaving the stem sent — then look up the stem in a bloom filter. With affixes removed, an English dictionary of 125,000 words can be represented as about 20,000 stems, which can be checked with an error rate of 1 in 5102 using 7 hash functions and a 400,000 bit bloom filter, which fits in just 50KB. Adding space for a simple-minded affix stripper and for the program itself, a high-quality spell checker can fit in about 64KB, which is tiny compared to modern RAM sizes.
String-hash, read-line and string-downcase come from the Standard Prelude. You can run the spell-checker at http://codepad.org/zUt3Ezz1.
In Haskell:
import Data.BloomFilter.Easy
import Data.Char
main = do dict <- fmap (easyList 0.01 . map lowercase . lines) $ readFile "words.txt" print $ dict `contains` "ValiD" print $ dict `contains` "xyzzy" contains b s = elemB (lowercase s) b lowercase = map toLower [/sourcecode] Long live appropriately named libraries :)