Decoding Text-Speak
July 2, 2013
Our program “signs” each word in a dictionary by converting it to text-speak, then returns all real words that match a given text-speak signature when requested. For instance, “vowels” is signed as “vwls”, “letters” is signed as “ltrs”, and “and” is signed as “and”. The sign function returns the signature of a word:
(define (vowel? c)
  (member c (list #\a #\e #\i #\o #\u)))
(define (unvowel cs)
  (filter (lambda (c) (not (vowel? c))) cs))
(define (undouble cs) (unique char=? cs))
(define (sign word)
  (let ((cs (undouble (string->list (string-downcase word)))))
    (list->string (cons (car cs) (unvowel (cdr cs))))))
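For readers who do not read Scheme, here is a minimal Python sketch of the same signing rule. It assumes that `unique char=?` collapses adjacent duplicate characters, and that the input word is non-empty; the names `VOWELS`, `undouble` and `sign` simply mirror the Scheme code above.

```python
VOWELS = set("aeiou")  # same vowel list as the Scheme vowel? predicate


def undouble(chars):
    """Collapse runs of adjacent identical characters into one."""
    out = []
    for ch in chars:
        if not out or out[-1] != ch:
            out.append(ch)
    return out


def sign(word):
    """Return the text-speak signature of a non-empty word."""
    chars = undouble(word.lower())
    head, tail = chars[0], chars[1:]
    # The first letter is always kept; later vowels are dropped.
    return head + "".join(ch for ch in tail if ch not in VOWELS)


print(sign("vowels"), sign("letters"), sign("and"))  # vwls ltrs and
```

Doubled letters are collapsed before vowels are removed, so “people” keeps both of its p's and signs as “ppl”.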
The next step is to create a signing dictionary in which each word is stored with its signature as a key. We read the words from a word-list file with one word per line:
(define dict (make-dict string<?))
(with-input-from-file "words" (lambda ()
  (do ((word (read-line) (read-line))) ((eof-object? word))
    (set! dict (dict 'update (lambda (k v) (cons word v)) (sign word) (list word))))))
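The same idea can be sketched in Python with an ordinary dict that maps each signature to the list of words sharing it. This is an illustration, not a translation of `make-dict`; `build_index` is a hypothetical name, and the four-word list stands in for a real word file.

```python
import itertools
from collections import defaultdict


def sign(word):
    """Signature: collapse adjacent doubles, keep the first letter, drop later vowels."""
    chars = [ch for ch, _ in itertools.groupby(word.lower())]
    return chars[0] + "".join(ch for ch in chars[1:] if ch not in "aeiou")


def build_index(words):
    """Map each signature to the list of words that produce it."""
    index = defaultdict(list)
    for word in words:
        word = word.strip()
        if word:  # skip blank lines
            index[sign(word)].append(word)
    return dict(index)


# A tiny stand-in word list (an assumption for illustration only).
index = build_index(["people", "pupil", "text", "and"])
print(index["ppl"])  # ['people', 'pupil']
```

Because `build_index` accepts any iterable of strings, an open file with one word per line could be passed to it directly.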
Then we can look up each word in a text message, returning either the original input if there is no corresponding signature in the dictionary, or a single word if there is exactly one word that corresponds to the signature, or a list of words if there is more than one.
(define (lookup txt)
  (let ((words (dict 'lookup txt)))
    (if (not words) txt
      (if (pair? (cddr words))
          (cdr words)
          (cadr words)))))
(define (decode txt-spk)
  (map lookup (map string-downcase
    (string-split #\space txt-spk))))
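The three-way result of lookup (the original token, a single word, or a list of candidates) can be sketched in Python as follows. The index here is a hand-written dict from signature to word list, purely for illustration, and `lookup` and `decode` take it as an explicit argument rather than using a global.

```python
def lookup(index, token):
    """Return the token itself, a single word, or a list of candidate words."""
    words = index.get(token)
    if not words:
        return token          # no word has this signature
    if len(words) == 1:
        return words[0]       # exactly one candidate
    return words              # several candidates


def decode(index, message):
    """Decode a message token by token; splits on any whitespace."""
    return [lookup(index, token) for token in message.lower().split()]


# Hypothetical index, as a signing dictionary might contain.
index = {"ppl": ["people", "pupil"], "txt": ["text"]}
print(decode(index, "Sm ppl txt"))  # ['sm', ['people', 'pupil'], 'text']
```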
Here’s an example:
> (decode ...)
As shown above, our decoder is rather limited. A better decoder might build in some common abbreviations such as “u r” for “you are”. Affix analysis as in the Porter Stemmer might be useful. And a smaller dictionary would probably help rather than hurt, as real-life text-speak seems to draw on a small vocabulary.
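The abbreviation idea could be sketched as a small table consulted before any signature lookup. Only “u r” for “you are” comes from the text above; the table name and function name are hypothetical.

```python
# Only the "u r" -> "you are" pair is taken from the text; a real table
# would need many more entries.
ABBREVIATIONS = {"u": "you", "r": "are"}


def expand_abbreviations(message):
    """Replace known abbreviations token by token, leaving other tokens alone."""
    return [ABBREVIATIONS.get(token, token) for token in message.lower().split()]


print(expand_abbreviations("u r late"))  # ['you', 'are', 'late']
```

Tokens left unchanged by this step would then go through the signature lookup as before.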
We used filter, unique, string-downcase and string-split from the Standard Prelude. You can run the program, with a tiny dictionary, at http://programmingpraxis.codepad.org/UFMBAHLE.
Here is an attempt in Python. I follow a simple approach. First, load the English dictionary into a trie. Then take the signature word and try to find all words in the trie by expanding with double consonants, vowels and the characters from the signature word.
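One way to read that description is sketched below. It is not the commenter's actual code: the nested-dict trie, the `END` marker, and the names `build_trie` and `expand` are assumptions made for illustration. While walking the trie, a vowel or a repeat of the previous letter is followed without consuming anything from the signature, and any other letter must match the next signature character.

```python
VOWELS = set("aeiou")
END = "$"  # marker key for "a word ends here" (assumes "$" never occurs in words)


def build_trie(words):
    """Build a nested-dict trie; each complete word is stored under END."""
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def expand(trie, signature):
    """Return all words in the trie whose signature equals `signature`."""
    signature = signature.lower()
    matches = []
    if not signature or signature[0] not in trie:
        return matches

    def walk(node, i, prev):
        # i = number of signature characters consumed so far
        if i == len(signature) and END in node:
            matches.append(node[END])
        for ch, child in node.items():
            if ch == END:
                continue
            if ch == prev or ch in VOWELS:
                walk(child, i, ch)        # doubled letter or dropped vowel
            elif i < len(signature) and ch == signature[i]:
                walk(child, i + 1, ch)    # consonant matching the signature

    # The first letter is always kept, so it must match exactly.
    walk(trie[signature[0]], 1, signature[0])
    return matches


trie = build_trie(["people", "pupil", "pile", "apple", "purple"])
print(sorted(expand(trie, "ppl")))  # ['people', 'pupil']
```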
Another Python version, more in line with the Programming Praxis solution. This one is much faster and much shorter.
In Clojure:
@Paul – I don’t think your second version compresses words correctly if they have two of the same consonant separated by vowels, e.g., “people” should compress to “ppl”, but yours compresses to “pl”.
@Mike. That is correct. I saw it already, but did not post a correction. Thanks for pointing out the word lists from 12dicts.
"""
Try to match all the words in encrypt text
"""
import re
ENCRYPT_TEXT = "Sm ppl cmprs txt msgs by rtnng only ths vwls tht bgn "
ENCRYPT_TEXT += "a wrd and by rplcng dbld ltrs wth sngl ltrs"
ENCRYPT_TEXT = ENCRYPT_TEXT.split()
def main():
    """
    The main function
    """
    word_dict = {}
    answer = {}
    # Build the dictionary by first letter
    for capital in range(ord('a'), ord('z')+1):
        word_dict[chr(capital)] = []
    with open("/usr/share/dict/words", "r") as dict_file:
        for word in dict_file.readlines():
            word = word.strip()
            # Skip blank lines and words that do not start with a-z
            if len(word) > 0 and word[0].lower() in word_dict:
                word_dict[word[0].lower()].append(word)
    # Join each bucket into one newline-separated string for re.findall
    for capital in word_dict.keys():
        word_dict[capital] = "\n".join(word_dict[capital])
    # For each word, try to find it in dictionary
    for word in ENCRYPT_TEXT:
        reg_re = word[0] + "".join([".*"+c for c in word[1:]]) + ".*"
        reg_re = reg_re.lower()
        answer[word] = re.findall(reg_re, word_dict[reg_re[0]])
    print(answer)
main()
http://pastebin.com/rkfZwvsL