A Statisticle Speling Korrecter
December 29, 2009
We begin with the code to read the training corpus. Read-word
gets the next word (a maximal sequence of alphabetic characters) from the input and stores word/count pairs in a hash table; the big.txt training corpus comes from Norvig:
(define (read-word)
(let loop ((in-word? #f) (c (read-char)) (word '()))
(cond ((eof-object? c) (if in-word? (list->string (reverse word)) c))
((char-alphabetic? c) (loop #t (read-char) (cons (char-downcase c) word)))
(in-word? (list->string (reverse word)))
(else (loop #f (read-char) word)))))
(define nwords (make-hash string-hash string=? #f 24989))
(with-input-from-file "big.txt"
(lambda ()
(do ((word (read-word) (read-word))) ((eof-object? word))
(nwords 'update (string-downcase word) (lambda (k v) (+ v 1)) 1))))
Python provides sets as a native data type, but Scheme does not. We provide sets represented as lists with no duplicates; set
converts a list to a set by eliminating duplicates, and set-of
provides set comprehensions similar to list-of
list comprehensions:
(define (set xs)
(unique string=?
(sort string<? xs)))
(define-syntax set-of
(syntax-rules ()
((_ arg ...)
(fold-of
(lambda (d a) (if (memv a d) d (cons a d)))
'() arg ...))))
Norvig’s edits1
function returns all the words that are an edit distance of one from the input word by considering all possible deletions, transpositions, replacements and insertions. Our function is similar, though we work with lists of characters instead of explicit indexes into strings, and somewhat more verbose due to differences between Python and Scheme:
(define alfa (string->list "abcdefghijklmnopqrstuvwxyz"))
(define (edits1 word)
(let* ((word (string->list word))
(s (list-of (list (take i word) (drop i word))
(i range (add1 (length word))))))
(set (map list->string (append
(set-of (append (car xs) (cdadr xs))
(xs in s) (pair? (cadr xs)))
(set-of (append (car xs) (list (cadadr xs))
(list (caadr xs)) (drop 2 (cadr xs)))
(xs in s) (pair? (cadr xs)) (pair? (cdadr xs)))
(set-of (append (car xs) (list c) (cdadr xs))
(xs in s) (pair? (cadr xs)) (c in alfa))
(set-of (append (car xs) (list c) (cadr xs))
(xs in s) (c in alfa)))))))
Known-edits2
gets all words in the training corpus that are an edit-distance of two from the input word by starting from all words that are an edit-distance of one from the input word; it does the lookup directly, instead of returning all words at an edit-distance of two, in order to avoid creation of the large intermediate set of non-words:
(define (known-edits2 word)
(set (set-of e2
(e1 in (edits1 word))
(e2 in (edits1 e1))
(nwords 'lookup e2))))
Finally, we come to the correct
function that takes an input word and returns the most likely correction; it calls known
and max-word
:
(define (known words)
(filter (lambda (w) (nwords 'lookup w)) words))
(define (max-word words)
(let loop ((count 0) (maxword "") (words words))
(if (null? words) maxword
(let ((c (nwords 'lookup (car words))))
(if (< count c)
(loop c (car words) (cdr words))
(loop count maxword (cdr words)))))))
(define (correct word)
(if (nwords 'lookup word) word
(let ((w1 (known (edits1 word))))
(if (pair? w1) (max-word w1)
(let ((w2 (known-edits2 word)))
(if (pair? w2) (max-word w2) word))))))
And that’s it. Norvig explains the math behind the corrector, and shows a test regime; though it’s not shown here, our corrector passes Norvig’s tests. We used several functions from the Standard Prelude: make-hash, string-hash, fold-of, list-of, string-downcase, take, drop, sort and unique. You can see the code collected at http://programmingpraxis.codepad.org/ciNSny8H, but you can’t run it there because file input is not provided.
[…] Praxis – A Statisticle Speling Korrecter By Remco Niemeijer In today’s Programming Praxis exercise, we have to implement Peter Norvig’s spelling corrector. […]
My Haskell solution (see http://bonsaicode.wordpress.com/2009/12/29/programming-praxis-a-statisticle-speling-korrecter/ for a version with comments):
A scala version :
import scala.io.Source.fromFile
import java.io.File
import scala.collection.mutable.Map
object Spelling {
def main(args : Array[String]) {
// learning
var wordMap = Map[String, Int]()
for (file <- new File("textes").listFiles()) {
for (line <- fromFile(file).getLines.toList) {
val words = line.toLowerCase().split("[\\s\\W]+").filter(!_.matches("\\d+"))
for (word newValue)
}
}
}
for (word
List(word.take(i) + word.drop(i + 1)) ++ // deletion
(‘a’ to ‘z’).map(word.take(i) + _ + word.drop(i + 1)) ++ // suppressions
(‘a’ to ‘z’).map(word.take(i – 1) + _ + word.drop(i + 1)) ++ // insertions
(if (i > 0) // and transposition
List(word.take(i – 2) + word(i) + word(i – 1) + word.drop(i + 1))
else
List[String]())
}.filter(wordMap.contains(_)) // keep only existing words
// suggestion
val distance1WithFrequency = atDistance1.map((w) => (w, wordMap(w)))
val sorted = (List() ++ distance1WithFrequency).sort((t1, t2) => t1._2 > t2._2)
if (!sorted.isEmpty) {
println(word + ” : did you mean ” + sorted.head._1 + ” ?”)
}
}
}
}
}
Rich Hickey uses this problem in his presentations as an example of Clojure’s succinctness in comparison to Python. The solution below is what from this presentation (and can be seen in more detail here: http://en.wikibooks.org/wiki/Clojure_Programming/Examples/Norvig_Spelling_Corrector).