A Statisticle Speling Korrecter

December 29, 2009

Many word processors and search engines have a feature that suggests alternate spellings of words that appear to be misspelled; for instance, if you search Google for “speling” it replies “Did you mean: spelling?”

Peter Norvig showed how Google’s spelling corrector works in an article on his web site. Norvig’s system works in two phases. In the training phase, it reads a text, counting the number of occurrences of each word. Then, when asked to suggest a correction, it considers similar words, based on their edit distance from the target word, and chooses the one that occurred most frequently in the training text.

The edit distance between two words is the number of deletions (remove one letter), transpositions (swap two adjacent letters), alterations (change one letter to another), and insertions (add one letter) that would be required to convert one word to the other. For instance, Tuesday can be converted into Thursday by inserting an h after the initial T and changing the e to an r, so there is an edit distance of two between them.

Norvig chose a simple rule for selecting a correction: If the input word is in the training corpus, it’s spelled correctly, so return it unchanged. Otherwise, from the set of words an edit distance of one from the input word, select the word that appears most frequently in the training corpus. If neither the input word nor any words that are an edit distance of one from the input word appear in the training corpus, generate the set of words an edit distance of two from the input word and select the word that appears most frequently in the training corpus. And if that still fails, return the original input word back to the user, unchanged.

Your task is to write a spelling corrector similar to Norvig’s. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Posted by programmingpraxis

Filed in Exercises

4 Comments »

4 Responses to “A Statisticle Speling Korrecter”

Programming Praxis – A Statisticle Speling Korrecter « Bonsai Code said
December 29, 2009 at 7:09 PM
[…] Praxis – A Statisticle Speling Korrecter By Remco Niemeijer In today’s Programming Praxis exercise, we have to implement Peter Norvig’s spelling corrector. […]

Remco Niemeijer said

December 29, 2009 at 7:09 PM

My Haskell solution (see http://bonsaicode.wordpress.com/2009/12/29/programming-praxis-a-statisticle-speling-korrecter/ for a version with comments):

import Data.Char
import Data.List
import qualified Data.List.Key as K
import qualified Data.Map as M
import qualified Data.Set as S
import qualified Data.Text as T

nwords :: IO (M.Map String Int)
nwords = fmap (M.fromListWith (+) . map (flip (,) 1 . T.unpack) .
               T.splitBy (not . isLetter) . T.toLower . T.pack) $
         readFile "big.txt"

edits :: String -> [String]
edits word = S.elems . S.fromList $ dels ++ trans ++ repls ++ ins
    where s     = zip (inits word) (tails word)
          dels  = [a ++ b     | (a, _:b)   <- s]
          trans = [a ++ c:b:d | (a, b:c:d) <- s]
          repls = [a ++ c:b   | (a, _:b)   <- s, c <- ['a'..'z']]
          ins   = [a ++ c:b   | (a, b)     <- s, c <- ['a'..'z']]

known_edits :: M.Map String a -> String -> Int -> [String]
known_edits dict word n = filter (`M.member` dict) $ 
    iterate (edits =<<) [word] !! n

correct :: M.Map String Int -> String -> String
correct dict word = maybe word (K.maximum (`M.lookup` dict)) .
    find (not . null) $ map (known_edits dict word) [0..2]

Raphaël Lemaire said
December 31, 2009 at 12:28 PM
A scala version :

import scala.io.Source.fromFile
import java.io.File
import scala.collection.mutable.Map

object Spelling {
def main(args : Array[String]) {
// learning
var wordMap = Map[String, Int]()
for (file <- new File("textes").listFiles()) {
for (line <- fromFile(file).getLines.toList) {
val words = line.toLowerCase().split("[\\s\\W]+").filter(!_.matches("\\d+"))
for (word newValue)
}
}
}

for (word
List(word.take(i) + word.drop(i + 1)) ++ // deletion
(‘a’ to ‘z’).map(word.take(i) + _ + word.drop(i + 1)) ++ // suppressions
(‘a’ to ‘z’).map(word.take(i – 1) + _ + word.drop(i + 1)) ++ // insertions
(if (i > 0) // and transposition
List(word.take(i – 2) + word(i) + word(i – 1) + word.drop(i + 1))
else
List[String]())
}.filter(wordMap.contains(_)) // keep only existing words

// suggestion
val distance1WithFrequency = atDistance1.map((w) => (w, wordMap(w)))
val sorted = (List() ++ distance1WithFrequency).sort((t1, t2) => t1._2 > t2._2)
if (!sorted.isEmpty) {
println(word + ” : did you mean ” + sorted.head._1 + ” ?”)
}
}
}
}
}

Scott Haug said

January 4, 2010 at 12:35 AM

Rich Hickey uses this problem in his presentations as an example of Clojure’s succinctness in comparison to Python. The solution below is what from this presentation (and can be seen in more detail here: http://en.wikibooks.org/wiki/Clojure_Programming/Examples/Norvig_Spelling_Corrector).

(defn words [text] (re-seq #"[a-z]+" (.toLowerCase text)))
 
(defn train [features]
  (reduce (fn [model f] (assoc model f (inc (get model f 1)))) {} features))
 
(def *nwords* (train (words (slurp "big.txt"))))
 
(defn edits1 [word]
  (let [alphabet "abcdefghijklmnopqrstuvwxyz", n (count word)]
    (distinct (concat
      (for [i (range n)] (str (subs word 0 i) (subs word (inc i))))
      (for [i (range (dec n))]
        (str (subs word 0 i) (nth word (inc i)) (nth word i) (subs word (+ 2 i))))
      (for [i (range n) c alphabet] (str (subs word 0 i) c (subs word (inc i))))
      (for [i (range (inc n)) c alphabet] (str (subs word 0 i) c (subs word i)))))))
 
(defn known [words nwords] (seq (for [w words :when (nwords w)]  w)))
 
(defn known-edits2 [word nwords] (seq (for [e1 (edits1 word) e2 (edits1 e1) :when (nwords e2)]  e2)))
 
(defn correct [word nwords]
  (let [candidates (or (known [word] nwords) (known (edits1 word) nwords) 
                       (known-edits2 word nwords) [word])]
    (apply max-key #(get nwords % 1) candidates)))

S	M	T	W	T	F	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Programming Praxis