Spell Checking
April 17, 2009
Spell checkers are ubiquitous. Word processors have spell checkers, as do browser-based e-mail clients. They all work the same way: a dictionary is stored in some data structure, then each word of input is submitted to a search in the data structure, and those that fail are flagged as spelling errors. There is a certain art to building the word list on which a spell checker is based; Exxon isn’t a word in anybody’s dictionary, but it is likely included in the word list, but cwm (a geological structure), which is certainly a word, is most likely omitted from the word list, on the grounds it is more likely an error than a legitimate spelling (unless you are a geologist).
There are many appropriate data structures to store the word list, including a sorted array accessed via binary search, a hash table, or a bloom filter. In this exercise you are challenged to store the word list character-by-character in a trie. Consider this sample trie that stores the words CAR, CART, CAT and DOG:
To see that CAT is a valid spelling, follow the C branch, then the A branch, then the T branch, where you find a filled circle, indicating a word. To see that CAB is not a valid spelling, follow the C branch, then the A branch, and fail when there is no B branch.
Tries can be very fast; for instance, to see that QARTER (a misspelling of QUARTER) is not a word, follow the Q branch, then fail when there is no A branch. This is even faster than hashing, which must read all six letters of QARTER just to compute the hash value. And tries can also be space-efficient, since space is shared between common prefixes.
Your task is to build a spell checker based on tries. When you are finished, you can read or run a suggested solution, or post your solution or discuss the exercise in the comments below.
In Haskell:
import qualified Data.ByteString.Char8 as B
import Data.Trie
main = do trie <- fmap (fromList . map (flip (,) ()) . B.lines) $ B.readFile "words.txt" print $ valid trie "qarter" valid trie = flip member trie . B.pack [/sourcecode]
In scheme :
http://codepad.org/aHBN2HbM
Another in scheme:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
trie-tests.ss
hosted with ❤ by GitHub
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
trie.ss
hosted with ❤ by GitHub
implemented in c language:
http://codepad.org/PmYVlSN6
In Rust:
https://gist.github.com/ftt/1e22d9f7873eb955455b11bfee503cf4.js
Sorry, bad link above :( https://gist.github.com/ftt/1e22d9f7873eb955455b11bfee503cf4