A Statisticle Speling Korrecter

December 29, 2009

We begin with the code to read the training corpus. Read-word gets the next word (a maximal sequence of alphabetic characters) from the input and stores word/count pairs in a hash table; the big.txt training corpus comes from Norvig:

(define (read-word) (let loop ((in-word? #f) (c (read-char)) (word '())) (cond ((eof-object? c) (if in-word? (list->string (reverse word)) c)) ((char-alphabetic? c) (loop #t (read-char) (cons (char-downcase c) word))) (in-word? (list->string (reverse word))) (else (loop #f (read-char) word)))))

(define nwords (make-hash string-hash string=? #f 24989))

(with-input-from-file "big.txt" (lambda () (do ((word (read-word) (read-word))) ((eof-object? word)) (nwords 'update (string-downcase word) (lambda (k v) (+ v 1)) 1))))

Python provides sets as a native data type, but Scheme does not. We provide sets represented as lists with no duplicates; set converts a list to a set by eliminating duplicates, and set-of provides set comprehensions similar to list-of list comprehensions:

(define (set xs) (unique string=? (sort string<? xs)))

(define-syntax set-of (syntax-rules () ((_ arg ...) (fold-of (lambda (d a) (if (memv a d) d (cons a d))) '() arg ...))))

Norvig’s edits1 function returns all the words that are an edit distance of one from the input word by considering all possible deletions, transpositions, replacements and insertions. Our function is similar, though we work with lists of characters instead of explicit indexes into strings, and somewhat more verbose due to differences between Python and Scheme:

(define alfa (string->list "abcdefghijklmnopqrstuvwxyz"))

(define (edits1 word) (let* ((word (string->list word)) (s (list-of (list (take i word) (drop i word)) (i range (add1 (length word)))))) (set (map list->string (append (set-of (append (car xs) (cdadr xs)) (xs in s) (pair? (cadr xs))) (set-of (append (car xs) (list (cadadr xs)) (list (caadr xs)) (drop 2 (cadr xs))) (xs in s) (pair? (cadr xs)) (pair? (cdadr xs))) (set-of (append (car xs) (list c) (cdadr xs)) (xs in s) (pair? (cadr xs)) (c in alfa)) (set-of (append (car xs) (list c) (cadr xs)) (xs in s) (c in alfa)))))))

Known-edits2 gets all words in the training corpus that are an edit-distance of two from the input word by starting from all words that are an edit-distance of one from the input word; it does the lookup directly, instead of returning all words at an edit-distance of two, in order to avoid creation of the large intermediate set of non-words:

(define (known-edits2 word) (set (set-of e2 (e1 in (edits1 word)) (e2 in (edits1 e1)) (nwords 'lookup e2))))

Finally, we come to the correct function that takes an input word and returns the most likely correction; it calls known and max-word:

(define (known words) (filter (lambda (w) (nwords 'lookup w)) words))

(define (max-word words) (let loop ((count 0) (maxword "") (words words)) (if (null? words) maxword (let ((c (nwords 'lookup (car words)))) (if (< count c) (loop c (car words) (cdr words)) (loop count maxword (cdr words)))))))

(define (correct word) (if (nwords 'lookup word) word (let ((w1 (known (edits1 word)))) (if (pair? w1) (max-word w1) (let ((w2 (known-edits2 word))) (if (pair? w2) (max-word w2) word))))))

And that’s it. Norvig explains the math behind the corrector, and shows a test regime; though it’s not shown here, our corrector passes Norvig’s tests. We used several functions from the Standard Prelude: make-hash, string-hash, fold-of, list-of, string-downcase, take, drop, sort and unique. You can see the code collected at http://programmingpraxis.codepad.org/ciNSny8H, but you can’t run it there because file input is not provided.

Posted by programmingpraxis

Filed in Exercises

4 Comments »

4 Responses to “A Statisticle Speling Korrecter”

Programming Praxis – A Statisticle Speling Korrecter « Bonsai Code said
December 29, 2009 at 7:09 PM
[…] Praxis – A Statisticle Speling Korrecter By Remco Niemeijer In today’s Programming Praxis exercise, we have to implement Peter Norvig’s spelling corrector. […]

Remco Niemeijer said

December 29, 2009 at 7:09 PM

My Haskell solution (see http://bonsaicode.wordpress.com/2009/12/29/programming-praxis-a-statisticle-speling-korrecter/ for a version with comments):

import Data.Char
import Data.List
import qualified Data.List.Key as K
import qualified Data.Map as M
import qualified Data.Set as S
import qualified Data.Text as T

nwords :: IO (M.Map String Int)
nwords = fmap (M.fromListWith (+) . map (flip (,) 1 . T.unpack) .
               T.splitBy (not . isLetter) . T.toLower . T.pack) $
         readFile "big.txt"

edits :: String -> [String]
edits word = S.elems . S.fromList $ dels ++ trans ++ repls ++ ins
    where s     = zip (inits word) (tails word)
          dels  = [a ++ b     | (a, _:b)   <- s]
          trans = [a ++ c:b:d | (a, b:c:d) <- s]
          repls = [a ++ c:b   | (a, _:b)   <- s, c <- ['a'..'z']]
          ins   = [a ++ c:b   | (a, b)     <- s, c <- ['a'..'z']]

known_edits :: M.Map String a -> String -> Int -> [String]
known_edits dict word n = filter (`M.member` dict) $ 
    iterate (edits =<<) [word] !! n

correct :: M.Map String Int -> String -> String
correct dict word = maybe word (K.maximum (`M.lookup` dict)) .
    find (not . null) $ map (known_edits dict word) [0..2]

Raphaël Lemaire said
December 31, 2009 at 12:28 PM
A scala version :

import scala.io.Source.fromFile
import java.io.File
import scala.collection.mutable.Map

object Spelling {
def main(args : Array[String]) {
// learning
var wordMap = Map[String, Int]()
for (file <- new File("textes").listFiles()) {
for (line <- fromFile(file).getLines.toList) {
val words = line.toLowerCase().split("[\\s\\W]+").filter(!_.matches("\\d+"))
for (word newValue)
}
}
}

for (word
List(word.take(i) + word.drop(i + 1)) ++ // deletion
(‘a’ to ‘z’).map(word.take(i) + _ + word.drop(i + 1)) ++ // suppressions
(‘a’ to ‘z’).map(word.take(i – 1) + _ + word.drop(i + 1)) ++ // insertions
(if (i > 0) // and transposition
List(word.take(i – 2) + word(i) + word(i – 1) + word.drop(i + 1))
else
List[String]())
}.filter(wordMap.contains(_)) // keep only existing words

// suggestion
val distance1WithFrequency = atDistance1.map((w) => (w, wordMap(w)))
val sorted = (List() ++ distance1WithFrequency).sort((t1, t2) => t1._2 > t2._2)
if (!sorted.isEmpty) {
println(word + ” : did you mean ” + sorted.head._1 + ” ?”)
}
}
}
}
}

Scott Haug said

January 4, 2010 at 12:35 AM

Rich Hickey uses this problem in his presentations as an example of Clojure’s succinctness in comparison to Python. The solution below is what from this presentation (and can be seen in more detail here: http://en.wikibooks.org/wiki/Clojure_Programming/Examples/Norvig_Spelling_Corrector).

(defn words [text] (re-seq #"[a-z]+" (.toLowerCase text)))
 
(defn train [features]
  (reduce (fn [model f] (assoc model f (inc (get model f 1)))) {} features))
 
(def *nwords* (train (words (slurp "big.txt"))))
 
(defn edits1 [word]
  (let [alphabet "abcdefghijklmnopqrstuvwxyz", n (count word)]
    (distinct (concat
      (for [i (range n)] (str (subs word 0 i) (subs word (inc i))))
      (for [i (range (dec n))]
        (str (subs word 0 i) (nth word (inc i)) (nth word i) (subs word (+ 2 i))))
      (for [i (range n) c alphabet] (str (subs word 0 i) c (subs word (inc i))))
      (for [i (range (inc n)) c alphabet] (str (subs word 0 i) c (subs word i)))))))
 
(defn known [words nwords] (seq (for [w words :when (nwords w)]  w)))
 
(defn known-edits2 [word nwords] (seq (for [e1 (edits1 word) e2 (edits1 e1) :when (nwords e2)]  e2)))
 
(defn correct [word nwords]
  (let [candidates (or (known [word] nwords) (known (edits1 word) nwords) 
                       (known-edits2 word nwords) [word])]
    (apply max-key #(get nwords % 1) candidates)))

S	M	T	W	T	F	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Programming Praxis