Rhyming Dictionary

April 24, 2012

We begin by reading the dictionary and storing the words in a list; we could use a hash table or some kind of balanced tree, but a list is simple, convenient to program, and fast enough. Each word is itself stored as a list, with the word in the car and the list of phonemes in reverse order in the cdr; we find it convenient to store the phonemes back-to-front instead of front-to-back, since rhyming occurs at the end of the word. Here is read-dict:

(define (read-dict file-name)
  (with-input-from-file file-name
    (lambda ()
      (let loop ((line (read-line))) ; discard commentary
        (if (string=? "##" (substring line 0 2))
            (loop (read-line))))
      (let loop ((line (read-line)) (ds (list)))
        (if (eof-object? line) ds
          (loop (read-line)
                (let ((xs (string-split #\space line)))
                  (cons (cons (car xs) (reverse (cddr xs))) ds))))))))

The dictionary supplied by CMU has some error of form in it. We clean them up, remove words with no vowels, and store the dictionary in the global variable dict:

(define dict
  (filter (lambda (w) (any? vowel? w))
    (map (lambda (w) (remove "" w))
      (read-dict "c06d"))))

Our first function distinguishes vowels from consonants:

(define (vowel? str)
  (let ((len (string-length str)))
    (char-numeric? (string-ref str (- len 1)))))

To determine if two words rhyme, we loop through the reversed phonemes. If either list ends, or if there is a mis-match, we return #f. If the two phonemes are the same vowel, we return #t. Otherwise, we go on to the next phoneme.

(define (rhyme? w1 w2)
  (let ((d1 (assoc w1 dict)) (d2 (assoc w2 dict)))
    (if (not (and d1 d2)) -1
      (if (string=? w1 w2) 0
        (let loop ((d1 (cdr d1)) (d2 (cdr d2)))
          (cond ((or (null? d1) (null? d2)) #f)
                ((not (string=? (car d1) (car d2))) #f)
                ((vowel? (car d1)) #t)
                (else (loop (cdr d1) (cdr d2)))))))))

Our strategy for finding all words that rhyme with a given word is to preprocess dict to bring together all words that have the same final syllable. The sign function computes the final syllable, which is a word’s signature:

(define (sign phonemes)
  (let loop ((ps phonemes) (ss (list)))
    (if (null? ps) #f
      (if (vowel? (car ps))
          (string-join #\space (cons (car ps) ss))
          (loop (cdr ps) (cons (car ps) ss))))))

We use a utility function to bring together groups of words with identical signatures:

(define (group eql? xs)
  (let loop ((xs xs) (ys (list)) (zs (list)))
    (cond ((and (null? xs) (null? ys)) zs)
          ((and (null? xs) (pair? ys)) (cons ys zs))
          ((null? ys)
            (loop (cdr xs)
                  (cons (caar xs) (list (cdar xs)))
                  zs))
          ((eql? (caar xs) (car ys))
            (loop (cdr xs)
                  (cons (car ys) (cons (cdar xs) (cdr ys)))
                  zs))
          (else (loop (cdr xs)
                      (cons (caar xs) (list (cdar xs)))
                      (cons ys zs))))))

Now we can create a second global variable sign-dict that encodes the rhyming dictionary:

(define sign-dict
  (group string=?
    (sort (lambda (x y) (string<? (car x) (car y)))
      (map (lambda (w) (cons (sign (cdr w)) (car w))) dict))))

And finally, here is the function that determines all the rhymes for a given input word:

(define (rhymes word)
  (cdr (assoc (sign (cdr (assoc word dict))) sign-dict)))

Let’s look at a few examples:

> (rhyme? "PRAXIS" "AXIS")
#f
> (rhyme? "PRAXIS" "TERRORISTS(4)")
#t
> (rhymes "SKYDIVE")
("SKYDIVE")
> (length (rhymes "PROGRAMMING"))
5214
> (length (rhymes "PRAXIS"))
990

These examples point out some of the problems that doubtless caused the poet on Reddit to call the CMU dictionary “choppy.” According to the dictionary, AXIS and PRAXIS don’t rhyme (AXIS AE1 K S AH0 S and PRAXIS P R AE1 K S IH0 S) but PRAXIS and TERRORISTS(4) do (TERRORISTS(4) T EH1 R ER0 IH0 S). DIVE and SKYDIVE differ because the stress on the vowel is different (DIVE D AY1 V and SKYDIVE S K AY1 D AY0 V), which we could fix by ignoring vowel stress when comparing two signatures. And of 129481 words in the dictionary, there are only 2576 signatures, of which 661 have only a single word, so that the remaining 1951 signatures have an average 67 words; the maximum signature has 12333 words (the syllable IY0 as in TURKEY). If we wanted to, we could ameliorate the latter problem by choosing rhymes of two syllables if the final syllable had more than some threshhold of matches. But we’ll stop here, leaving improvements to someone else.

We used read-line, string-split, string-join, and any? from the Standard Prelude. You can see the entire program at http://programmingpraxis.codepad.org/GijwSqhR.

Pages: 1 2 3

4 Responses to “Rhyming Dictionary”

  1. sweetopiagirl said

    Reblogged this on InspiredWeightloss!.

  2. Mike said

    Python 2.7 version

    Uses the pronunciation file to build two dictionaries:
    The first one maps a word to its ending phenomes (from the primary stress to the end of the word);
    The second one maps the ending phonemes to a list of words that have that ending.

    do_rhyme() and find_rhymes() then use the dictionaries.

    import gzip
    from collections import defaultdict
    
    db = defaultdict(set)
    rdb = defaultdict(set)
    
    with gzip.open('c06d.gz') as f:
      
      for line in (s.strip() for s in f if s[0].isalpha()):
        
        line = line.split()
        word = line[0]
    
        if word.endswith(')'):
            word = word[:word.index('(')]
    
        ending = []
        for phoneme in reversed(line[1:]):
            ending.append(phoneme)
            if phoneme.endswith('1'):
                break
        ending = ' '.join(reversed(ending))
    
        db[word].add(ending)
        rdb[ending].add(word)
    
    db  = {k:sorted(v) for k,v in db.items()}
    rdb = {k:sorted(v) for k,v in rdb.items()}
    
    def do_rhyme(word1, word2):
      return bool(set(db[word1]).intersection(db[word2]))
    
    def find_rhymes(word):
      return [rdb[sound] for sound in db[word]]
    
    
  3. Gray said

    Ruby style, matching every item after a vowel and building a map of maps of arrays hash[vowel][consonant] = [words with these stuff] (if I understood the rules well :-))

    f = File.new(ARGV[0])
    daMap = Hash.new() do |h, k|
      h[k] = Hash.new() do |hh, kk|
        hh[kk] = Array.new()
      end
    end
    
    while !f.eof()
      line = f.readline()
      trailingStr = /[A-Z]+[0-2] ?([A-Z ]+)*$/.match(line)
      next if trailingStr.nil?()
    
      daWord = /^.*? /.match(line).to_s().chop()
    
      trailingBits = trailingStr.to_s().split(' ')
      vowel = trailingBits.shift()
    
      trailingBits.each() do |bit|
        daMap[vowel][bit] << daWord
      end
    end
    
    daMap.each() do |vowel, trailingGuys|
      trailingGuys.each() do |guy, people|
        if people.size() > 1
          puts "#{people.join(', ')} do rhyme !"
        end
      end
    end
    
  4. robert said

    i like the programming job done at http://www.prime-rhyme.com

Leave a comment