Decoding Text-Speak
July 2, 2013
Sm ppl cmprs txt msgs by rtnng only ths vwls tht bgn a wrd and by rplcng dbld ltrs wth sngl ltrs.
With a proper dictionary, it is possible to expand all the possibilities for a word. For instance, the “Sm” that starts the sentence above is properly translated “Some” but these other words are possible: same, sam, sum, seem, seam, sumo, and others.
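The compression rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the suggested solution; the name `compress` and the choice of `aeiou` as the vowel set (so `y` counts as a consonant) are assumptions.

```python
import re

VOWELS = "aeiou"  # assumption: "y" is treated as a consonant


def compress(word):
    """Return the text-speak form of one word (hypothetical helper)."""
    word = word.lower()
    if not word:
        return word
    # Replace every run of the same letter with a single letter.
    word = re.sub(r"(.)\1+", r"\1", word)
    # Keep the first letter, then drop every later vowel.
    return word[0] + "".join(c for c in word[1:] if c not in VOWELS)


for w in ["some", "same", "seem", "sumo", "letters", "retaining"]:
    print(w, "->", compress(w))
```

The first four words all collapse to `sm`, which is why a single text-speak word can expand to many candidates.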
Your task is to write a program that, given a sentence in text-speak, returns a list of all possibilities for each word. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Here is an attempt in Python. I follow a simple approach: first load the English dictionary into a trie, then take the signature word and find all matching words in the trie by expanding with doubled consonants, vowels, and the characters of the signature word.
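Since the code for this comment is not shown, here is a rough sketch of how a trie search along those lines might look. The names `build_trie`, `expand`, and the `END` marker are hypothetical, and the tiny word list stands in for a real dictionary file.

```python
VOWELS = set("aeiou")
END = "$"  # marks "a word ends here"; assumes no dictionary word contains "$"


def build_trie(words):
    """Store the words in a nested-dict trie, one level per letter."""
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def expand(signature, trie):
    """Return every word in the trie that compresses to the signature."""
    sig = signature.lower()
    found = []

    def walk(node, i, prev):
        # i counts the signature letters matched so far;
        # prev is the previous letter of the candidate word.
        if i == len(sig) and END in node:
            found.append(node[END])
        for ch, child in node.items():
            if ch == END:
                continue
            if ch == prev:
                walk(child, i, ch)      # doubled letter, dropped by the sender
            elif i > 0 and ch in VOWELS:
                walk(child, i, ch)      # non-initial vowel, dropped by the sender
            elif i < len(sig) and ch == sig[i]:
                walk(child, i + 1, ch)  # letter kept in the signature

    walk(trie, 0, "")
    return sorted(found)


words = ["some", "same", "sam", "sum", "seem", "seam", "sumo", "smash", "people", "pole"]
trie = build_trie(words)
print(expand("sm", trie))
print(expand("ppl", trie))
```

Branches that cannot match the signature are abandoned early, so only a small part of the trie is visited for each word.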
Another Python version, more in line with the Programming Praxis solution. This one is much faster and much shorter.
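The code for this version is not shown either. One common way to get a short and fast solution, offered here as an assumption rather than as what either solution actually does, is to compress every dictionary word once and index the words by their compressed form. The names `compress`, `build_index`, and `decode` are hypothetical.

```python
import re
from collections import defaultdict


def compress(word):
    """Collapse doubled letters, then drop all vowels except a leading one."""
    word = re.sub(r"(.)\1+", r"\1", word.lower())
    return word[:1] + re.sub(r"[aeiou]", "", word[1:])


def build_index(words):
    """Map each compressed form to the list of words that produce it."""
    index = defaultdict(list)
    for word in words:
        index[compress(word)].append(word)
    return index


def decode(sentence, index):
    """Return one list of candidate words per token of the sentence."""
    return [index.get(token.lower(), []) for token in sentence.split()]


# A tiny stand-in for a real word list.
words = ["some", "same", "sum", "people", "pole", "compress", "text", "texts"]
index = build_index(words)
print(decode("Sm ppl cmprs txt", index))
```

After the one-time indexing pass, each token costs a single dictionary lookup, which would explain a large speed difference over searching the word list per token.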
In Clojure:
@Paul – I don’t think your second version compresses words correctly if they have two of the same consonant separated by vowels, e.g., “people” should compress to “ppl”, but yours compresses to “pl”.
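One plausible cause of that behavior, and this is a guess since the code in question is not shown, is the order of the two steps. Removing vowels first makes the two p's adjacent, and the doubled-letter step then merges them. Both function names below are hypothetical.

```python
import re


def strip_then_dedupe(word):
    """Drop vowels first, then collapse repeats (the order that goes wrong)."""
    word = word.lower()
    stripped = word[:1] + re.sub(r"[aeiou]", "", word[1:])
    return re.sub(r"(.)\1+", r"\1", stripped)


def dedupe_then_strip(word):
    """Collapse letters doubled in the original word, then drop vowels."""
    word = re.sub(r"(.)\1+", r"\1", word.lower())
    return word[:1] + re.sub(r"[aeiou]", "", word[1:])


for w in ["people", "retaining", "letters"]:
    print(w, strip_then_dedupe(w), dedupe_then_strip(w))
```

The second order agrees with the example sentence in the exercise, where "retaining" is written "rtnng" with both n's kept.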
@Mike. That is correct. I saw it already, but did not post a correction. Thanks for pointing out the word lists from 12dicts.
"""
Try to match all the words in encrypt text
"""
import re

ENCRYPT_TEXT = "Sm ppl cmprs txt msgs by rtnng only ths vwls tht bgn "
ENCRYPT_TEXT += "a wrd and by rplcng dbld ltrs wth sngl ltrs"
ENCRYPT_TEXT = ENCRYPT_TEXT.split()


def main():
    """
    The main function
    """
    word_dict = {}
    answer = {}
    # Build the dictionary by first letter
    for capital in range(ord('a'), ord('z') + 1):
        word_dict[chr(capital)] = []
    with open("/usr/share/dict/words", "r") as dict_file:
        for word in dict_file:
            word = word.strip()
            # Skip blank lines and words that do not start with a-z
            if word and word[0].lower() in word_dict:
                word_dict[word[0].lower()].append(word)
    # Join each bucket into one newline-separated string for re.findall
    for capital in word_dict:
        word_dict[capital] = "\n".join(word_dict[capital])
    # For each word, try to find it in dictionary
    for word in ENCRYPT_TEXT:
        # Allow any characters between the letters of the text-speak word
        reg_re = word[0] + "".join([".*" + c for c in word[1:]]) + ".*"
        reg_re = reg_re.lower()
        answer[word] = re.findall(reg_re, word_dict[reg_re[0]])
    print(answer)


main()
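The `.*` pattern above accepts any characters between the letters of the text-speak word, so "sm" also matches words such as "smash". A tighter pattern, sketched here under the assumption that the vowels are `aeiou`, allows only repeats of each kept letter and runs of vowels in between. The function name `signature_regex` is hypothetical.

```python
import re


def signature_regex(signature):
    """Build a regex for words that could compress to the signature."""
    sig = signature.lower()
    parts = []
    for i, ch in enumerate(sig):
        if i > 0:
            # Dropped vowels may sit between kept letters. Two equal kept
            # letters need at least one vowel between them, otherwise they
            # would have been merged as a doubled letter.
            parts.append("[aeiou]+" if ch == sig[i - 1] else "[aeiou]*")
        # The kept letter itself, possibly doubled in the original word.
        parts.append(re.escape(ch) + "+")
    parts.append("[aeiou]*")  # dropped vowels at the end of the word
    return re.compile("".join(parts))


pattern = signature_regex("sm")
for w in ["some", "sumo", "summer", "smash"]:
    print(w, bool(pattern.fullmatch(w)))
```

Used with `fullmatch` on each lowercased dictionary word, this should reject most of the false positives that the looser pattern lets through.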
http://pastebin.com/rkfZwvsL