Levenshtein Distance

September 12, 2014

In a previous exercise we computed the edit distance between two strings where the allowable operations were insert or delete characters; we made the calculation by determining the longest common subsequence. A different definition of edit distance allows substitutions as well as insertions and deletions, and is called the Levenshtein distance, since it was studied by Vladimir Levenshtein in 1965. For example, the edit distance between “hill” and “hilt” is 2 (delete the “l” and insert the “t”) but the Levenshtein distance is 1 (replace “l” by “t”).

The basic algorithm is recursive: if either string is empty, the Levenshtein distance is the length of the other string, otherwise it is the minimum of the Levenshtein distance computed by deleting the first character from the first string, by deleting the first character from the second string, or by deleting the first characters of each of the two strings (adding 1 if the two characters differ), plus the Levenshtein distance of the remaining substrings. That computation takes exponential time as it recomputes the same substring Levenshtein distances many times. A better algorithm uses dynamic programming to build up the substring distances, so they are always available as they are needed.

Your task is to write two functions to compute the Levenshtein distance between two strings. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Advertisement

Pages: 1 2

3 Responses to “Levenshtein Distance”

  1. Nate said

    Very cool. Here’s my version in Swift — memoize caches the results by paired substrings for reuse:

    let levenshteinDistance = memoize {
        (levenshteinDistance: (String, String) -> Int, s1: String, s2: String) -> Int in
    
        if isEmpty(s1) { return countElements(s2) }
        if isEmpty(s2) { return countElements(s1) }
        
        // drop first letter of each string
        let s1Crop = s1[s1.startIndex.successor()..<s1.endIndex]
        let s2Crop = s2[s2.startIndex.successor()..<s2.endIndex]
        
        // if first characters are equal, continue with both cropped
        if s1[s1.startIndex] == s2[s2.startIndex] {
            return levenshteinDistance(s1Crop, s2Crop)
        }
        
        // otherwise find smallest of the three options
        let (c1, c2, c3) = (levenshteinDistance(s1Crop, s2), levenshteinDistance(s1, s2Crop), levenshteinDistance(s1Crop, s2Crop))
        return 1 + min(min(c1, c2), c3)
    }
    
    levenshteinDistance("hill", "hilt")
    // 1
    
  2. Nate said

    And here’s the memoize function:

    func memoize<T1: Hashable, T2: Hashable, U>(body: ((T1, T2) -> U, T1, T2) -> U) -> ((T1, T2) -> U) {
        var memo = [T1: [T2: U]]()
        var result: ((T1, T2) -> U)!
        result = {
            (value1: T1, value2: T2) -> U in
            if let cached = memo[value1]?[value2] { return cached }
        
            let toCache = body(result, value1, value2)
            if memo[value1] == nil { memo[value1] = [:] }
            memo[value1]![value2] = toCache
            return toCache
        }
        return result
    }
    
  3. Paul said

    Two versions in Python. A recursive version with memoization and an iterative version with 2 matrix rows.

    memo = {}
    def levenshtein_distance_recursive(a, b):
        if a == b:
            return 0
        if a == "":
            return len(b)
        if b == "":
            return len(a)
        if (a, b) not in memo:
            l1 = levenshtein_distance_recursive(a[1:], b) + 1
            l2 = levenshtein_distance_recursive(a, b[1:]) + 1
            l3 = levenshtein_distance_recursive(a[1:], b[1:]) + (a[0] != b[0])
            memo[(a,b)] = min(l1, l2, l3)
        return memo[(a,b)]
        
    def levenshtein_distance(s, t):
        if s == t:
            return 0
        if s == "":
            return len(t)
        if t == "":
            return len(s)
        v0 = range(len(t) + 1)
        v1 = [0] * len(v0)
        for i, si in enumerate(s):
            v1[0] = i + 1
            for j, tj in enumerate(t):
                v1[j + 1] = min(v1[j] + 1, v0[j + 1] + 1, v0[j] + (si != tj))
            v0[:] = v1
        return v1[-1]
    

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: