Spell Checking
April 17, 2009
Spell checkers are ubiquitous. Word processors have spell checkers, as do browser-based e-mail clients. They all work the same way: a dictionary is stored in some data structure, then each word of input is submitted to a search in the data structure, and those that fail are flagged as spelling errors. There is a certain art to building the word list on which a spell checker is based; Exxon isn’t a word in anybody’s dictionary, but it is likely included in the word list, but cwm (a geological structure), which is certainly a word, is most likely omitted from the word list, on the grounds it is more likely an error than a legitimate spelling (unless you are a geologist).
There are many appropriate data structures to store the word list, including a sorted array accessed via binary search, a hash table, or a bloom filter. In this exercise you are challenged to store the word list character-by-character in a trie. Consider this sample trie that stores the words CAR, CART, CAT and DOG:
To see that CAT is a valid spelling, follow the C branch, then the A branch, then the T branch, where you find a filled circle, indicating a word. To see that CAB is not a valid spelling, follow the C branch, then the A branch, and fail when there is no B branch.
Tries can be very fast; for instance, to see that QARTER (a misspelling of QUARTER) is not a word, follow the Q branch, then fail when there is no A branch. This is even faster than hashing, which must read all six letters of QARTER just to compute the hash value. And tries can also be space-efficient, since space is shared between common prefixes.
Your task is to build a spell checker based on tries. When you are finished, you can read or run a suggested solution, or post your solution or discuss the exercise in the comments below.
In Haskell:
import qualified Data.ByteString.Char8 as B
import Data.Trie
main = do trie <- fmap (fromList . map (flip (,) ()) . B.lines) $ B.readFile "words.txt" print $ valid trie "qarter" valid trie = flip member trie . B.pack [/sourcecode]
In scheme :
http://codepad.org/aHBN2HbM
Another in scheme:
trie-tests.ss
hosted with ❤ by GitHub
trie.ss
hosted with ❤ by GitHub
implemented in c language:
http://codepad.org/PmYVlSN6
In Rust:
https://gist.github.com/ftt/1e22d9f7873eb955455b11bfee503cf4.js
Sorry, bad link above :( https://gist.github.com/ftt/1e22d9f7873eb955455b11bfee503cf4