A Statisticle Speling Korrecter
December 29, 2009
Many word processors and search engines have a feature that suggests alternate spellings of words that appear to be misspelled; for instance, if you search Google for “speling” it replies “Did you mean: spelling?”
Peter Norvig showed how Google’s spelling corrector works in an article on his web site. Norvig’s system works in two phases. In the training phase, it reads a text, counting the number of occurrences of each word. Then, when asked to suggest a correction, it considers similar words, based on their edit distance from the target word, and chooses the one that occurred most frequently in the training text.
The edit distance between two words is the number of deletions (remove one letter), transpositions (swap two adjacent letters), alterations (change one letter to another), and insertions (add one letter) that would be required to convert one word to the other. For instance, Tuesday can be converted into Thursday by inserting an h after the initial T and changing the e to an r, so there is an edit distance of two between them.
Norvig chose a simple rule for selecting a correction: If the input word is in the training corpus, it’s spelled correctly, so return it unchanged. Otherwise, from the set of words an edit distance of one from the input word, select the word that appears most frequently in the training corpus. If neither the input word nor any words that are an edit distance of one from the input word appear in the training corpus, generate the set of words an edit distance of two from the input word and select the word that appears most frequently in the training corpus. And if that still fails, return the original input word back to the user, unchanged.
Your task is to write a spelling corrector similar to Norvig’s. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Pages: 1 2