Soundex
February 16, 2010
Soundex is an algorithm originally developed by Margaret Odell and Robert Russell in the early 1900s to convert people’s last names into a standard encoding in a way that brings together similar names, thus correcting misspellings. Knuth (AoCP3, introduction to Chapter 6) describes the algorithm as follows:
- Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions.
- Assign the following numbers to the remaining letters after the first:
- b, f, p, v → 1
- c, g, j, k, q, s, x, z → 2
- d, t → 3
- l → 4
- m, n → 5
- r → 6
- If two or more letters with the same code were adjacent in the original name (before step 1), omit all but the first.
- Convert to the form “letter, digit, digit, digit” by adding trailing zeros (if there are less than three digits), or by dropping rightmost digits (if there are more than three).
The soundex method isn’t perfect. As Knuth points out, related names like Rogers and Rodgers, or Sinclair and St. Clair, or Tchebysheff and Chebyshev, are not matched. On the other hand, soundex works well enough to be generally useful, and if your first attempt at a match fails, it isn’t too hard to try again with an alternate spelling.
Your task is to write a function that takes a name and returns its associated soundex code. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.