Levenshtein Distance
September 12, 2014
In a previous exercise we computed the edit distance between two strings where the allowable operations were insert or delete characters; we made the calculation by determining the longest common subsequence. A different definition of edit distance allows substitutions as well as insertions and deletions, and is called the Levenshtein distance, since it was studied by Vladimir Levenshtein in 1965. For example, the edit distance between “hill” and “hilt” is 2 (delete the “l” and insert the “t”) but the Levenshtein distance is 1 (replace “l” by “t”).
The basic algorithm is recursive: if either string is empty, the Levenshtein distance is the length of the other string, otherwise it is the minimum of the Levenshtein distance computed by deleting the first character from the first string, by deleting the first character from the second string, or by deleting the first characters of each of the two strings (adding 1 if the two characters differ), plus the Levenshtein distance of the remaining substrings. That computation takes exponential time as it recomputes the same substring Levenshtein distances many times. A better algorithm uses dynamic programming to build up the substring distances, so they are always available as they are needed.
Your task is to write two functions to compute the Levenshtein distance between two strings. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Very cool. Here’s my version in Swift — memoize caches the results by paired substrings for reuse:
let levenshteinDistance = memoize { (levenshteinDistance: (String, String) -> Int, s1: String, s2: String) -> Int in if isEmpty(s1) { return countElements(s2) } if isEmpty(s2) { return countElements(s1) } // drop first letter of each string let s1Crop = s1[s1.startIndex.successor()..<s1.endIndex] let s2Crop = s2[s2.startIndex.successor()..<s2.endIndex] // if first characters are equal, continue with both cropped if s1[s1.startIndex] == s2[s2.startIndex] { return levenshteinDistance(s1Crop, s2Crop) } // otherwise find smallest of the three options let (c1, c2, c3) = (levenshteinDistance(s1Crop, s2), levenshteinDistance(s1, s2Crop), levenshteinDistance(s1Crop, s2Crop)) return 1 + min(min(c1, c2), c3) } levenshteinDistance("hill", "hilt") // 1And here’s the memoize function:
func memoize<T1: Hashable, T2: Hashable, U>(body: ((T1, T2) -> U, T1, T2) -> U) -> ((T1, T2) -> U) { var memo = [T1: [T2: U]]() var result: ((T1, T2) -> U)! result = { (value1: T1, value2: T2) -> U in if let cached = memo[value1]?[value2] { return cached } let toCache = body(result, value1, value2) if memo[value1] == nil { memo[value1] = [:] } memo[value1]![value2] = toCache return toCache } return result }Two versions in Python. A recursive version with memoization and an iterative version with 2 matrix rows.
memo = {} def levenshtein_distance_recursive(a, b): if a == b: return 0 if a == "": return len(b) if b == "": return len(a) if (a, b) not in memo: l1 = levenshtein_distance_recursive(a[1:], b) + 1 l2 = levenshtein_distance_recursive(a, b[1:]) + 1 l3 = levenshtein_distance_recursive(a[1:], b[1:]) + (a[0] != b[0]) memo[(a,b)] = min(l1, l2, l3) return memo[(a,b)] def levenshtein_distance(s, t): if s == t: return 0 if s == "": return len(t) if t == "": return len(s) v0 = range(len(t) + 1) v1 = [0] * len(v0) for i, si in enumerate(s): v1[0] = i + 1 for j, tj in enumerate(t): v1[j + 1] = min(v1[j] + 1, v0[j + 1] + 1, v0[j] + (si != tj)) v0[:] = v1 return v1[-1]