Probabilistic Spell Checking

April 21, 2009

We define k, m and bloom as global variables; since standard Scheme doesn’t provide bit-arrays (though every Scheme implementation provides them) we use a vector of booleans:

(define k 7)

(define m 1000000)

(define bloom (make-vector m #f))

Function (key i word) downcases the input word, salts it with the i^th salt, hashes it with string-hash, and maps the hash to the range 0 to m-1:

(define (key i word) (let* ((c (string (integer->char (+ i 96)))) (s (string-append c (string-downcase word) c))) (modulo (string-hash s) m)))

We read the dictionary and insert each word into the bloom filter for each of the k salts:

(with-input-from-file "/usr/dict/words" (lambda () (do ((word (read-line) (read-line))) ((eof-object? word)) (do ((i 0 (+ i 1))) ((= i k)) (vector-set! bloom (key i word) #t)))))

Then to query the bloom filter we look up each of the k hashes of the word to be checked:

(define (spell word) (let loop ((i 0)) (if (= i k) #t (and (vector-ref bloom (key i word)) (loop (+ i 1))))))

Here’s an example:

> (spell "programming") #t

The advantage of bloom filters for a spell checker, as compared to the tries of the previous exercise, is the small space they occupy. Real spell checkers strip prefixes and suffixes — for instance, mis, re-, pre- and -ed are stripped from misrepresented, leaving the stem sent — then look up the stem in a bloom filter. With affixes removed, an English dictionary of 125,000 words can be represented as about 20,000 stems, which can be checked with an error rate of 1 in 5102 using 7 hash functions and a 400,000 bit bloom filter, which fits in just 50KB. Adding space for a simple-minded affix stripper and for the program itself, a high-quality spell checker can fit in about 64KB, which is tiny compared to modern RAM sizes.

String-hash, read-line and string-downcase come from the Standard Prelude. You can run the spell-checker at http://codepad.org/zUt3Ezz1.

Posted by programmingpraxis

Filed in Exercises

1 Comment »

One Response to “Probabilistic Spell Checking”

FalconNL said
April 21, 2009 at 9:53 AM
In Haskell:

import Data.BloomFilter.Easy
import Data.Char

main = do dict <- fmap (easyList 0.01 . map lowercase . lines) $ readFile "words.txt" print $ dict `contains` "ValiD" print $ dict `contains` "xyzzy" contains b s = elemB (lowercase s) b lowercase = map toLower [/sourcecode] Long live appropriately named libraries :)

S	M	T	W	T	F	S
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30

Programming Praxis