Levenshtein Distance
September 12, 2014
The recursive calculation is straightforward:
(define (exponential-levenshtein x y)
  (let loop ((xs (string->list x)) (ys (string->list y)))
    (cond ((null? xs) (length ys))
          ((null? ys) (length xs))
          ((char=? (car xs) (car ys))
            (loop (cdr xs) (cdr ys)))
          (else (+ 1 (min (loop (cdr xs) ys)
                          (loop xs (cdr ys))
                          (loop (cdr xs) (cdr ys))))))))
The first two cond clauses are the base cases. The third clause recurs, adding nothing to the Levenshtein distance, when the two leading characters are identical. The fourth clause computes the cost of deleting a character from the first string, deleting a character from the second string, and deleting a character from both strings (a substitution), then adds one to the minimum of the three. Here are two examples:
> (time (exponential-levenshtein "persimmon" "cantaloupe"))
(time (exponential-levenshtein "persimmon" ...))
3 collections
359 ms elapsed cpu time, including 0 ms collecting
346 ms elapsed real time, including 0 ms collecting
21126272 bytes allocated, including 25252720 bytes reclaimed
10
> (time (exponential-levenshtein "persimmons" "cantaloupes"))
(time (exponential-levenshtein "persimmons" ...))
9 collections
1560 ms elapsed cpu time, including 15 ms collecting
1557 ms elapsed real time, including 2 ms collecting
75050576 bytes allocated, including 75850032 bytes reclaimed
10
We can clearly see the exponential time complexity of the algorithm: adding a single character to each string roughly quadruples the time required for the computation. The dynamic programming approach, by comparison, is O(mn), where m and n are the lengths of the two strings:
(define (dynamic-programming-levenshtein x y)
  (let* ((x-len (string-length x))
         (y-len (string-length y))
         (costs (make-matrix (+ x-len 1) (+ y-len 1) 0)))
    (do ((i 0 (+ i 1))) ((< x-len i))
      (do ((j 0 (+ j 1))) ((< y-len j))
        (matrix-set! costs i j (+ i j))))
    (do ((i 1 (+ i 1))) ((< x-len i))
      (do ((j 1 (+ j 1))) ((< y-len j))
        (let ((add-cost (+ (matrix-ref costs (- i 1) j) 1))
              (del-cost (+ (matrix-ref costs i (- j 1)) 1))
              (sub-cost (+ (matrix-ref costs (- i 1) (- j 1))
                           (if (char=? (string-ref x (- i 1))
                                       (string-ref y (- j 1)))
                               0 1))))
          (matrix-set! costs i j
            (min add-cost del-cost sub-cost)))))
    (matrix-ref costs x-len y-len)))
Here we build up an (m+1) by (n+1) matrix in which entry (i, j) holds the Levenshtein distance between the first i characters of x and the first j characters of y. Since each entry depends only on entries that have already been computed, nothing is calculated twice, and the final distance is read from the bottom-right corner of the matrix. This algorithm is much faster than the first, and grows much more slowly:
> (time (dynamic-programming-levenshtein "persimmon" "cantaloupe"))
(time (dynamic-programming-levenshtein "persimmon" ...))
no collections
0 ms elapsed cpu time
0 ms elapsed real time
25824 bytes allocated
10
> (time (dynamic-programming-levenshtein "persimmons" "cantaloupes"))
(time (dynamic-programming-levenshtein "persimmons" ...))
no collections
0 ms elapsed cpu time
0 ms elapsed real time
30800 bytes allocated
10
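To make the table in the explanation above concrete, here is a small helper, added for illustration and not part of the original exercise, that fills the same cost matrix and prints it row by row; show-levenshtein-table is a hypothetical name, and it again uses the Standard Prelude matrix routines:
(define (show-levenshtein-table x y)
  (let* ((x-len (string-length x))
         (y-len (string-length y))
         (costs (make-matrix (+ x-len 1) (+ y-len 1) 0)))
    ; initialize as before, then fill the interior exactly as
    ; dynamic-programming-levenshtein does
    (do ((i 0 (+ i 1))) ((< x-len i))
      (do ((j 0 (+ j 1))) ((< y-len j))
        (matrix-set! costs i j (+ i j))))
    (do ((i 1 (+ i 1))) ((< x-len i))
      (do ((j 1 (+ j 1))) ((< y-len j))
        (matrix-set! costs i j
          (min (+ (matrix-ref costs (- i 1) j) 1)
               (+ (matrix-ref costs i (- j 1)) 1)
               (+ (matrix-ref costs (- i 1) (- j 1))
                  (if (char=? (string-ref x (- i 1))
                              (string-ref y (- j 1)))
                      0 1))))))
    ; print the finished table, one row per prefix of x
    (do ((i 0 (+ i 1))) ((< x-len i))
      (do ((j 0 (+ j 1))) ((< y-len j))
        (display (matrix-ref costs i j))
        (display " "))
      (newline))))
For example:
> (show-levenshtein-table "cat" "hat")
0 1 2 3
1 1 2 3
2 2 1 2
3 3 2 1
The bottom-right entry, 1, is the Levenshtein distance between "cat" and "hat".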
This algorithm requires O(mn) space. That can be reduced to just two rows, O(n), by storing only the current row and the prior row as the matrix is built; a rough sketch of that variant appears below. You can run the program at http://programmingpraxis.codepad.org/tmX4ta6y, which includes the matrix code from the Standard Prelude.
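Here is a minimal sketch of the two-row variant, added for illustration and not part of the original exercise; it uses plain Scheme vectors of length n+1 for the two rows, and two-row-levenshtein is a hypothetical name:
(define (two-row-levenshtein x y)
  (let* ((x-len (string-length x))
         (y-len (string-length y))
         (prev (make-vector (+ y-len 1) 0))   ; row i-1 of the full matrix
         (curr (make-vector (+ y-len 1) 0)))  ; row i of the full matrix
    ; row 0: reaching the first j characters of y from the empty
    ; prefix of x costs j insertions
    (do ((j 0 (+ j 1))) ((< y-len j))
      (vector-set! prev j j))
    (do ((i 1 (+ i 1))) ((< x-len i))
      (vector-set! curr 0 i)                  ; column 0: i deletions
      (do ((j 1 (+ j 1))) ((< y-len j))
        (let ((add-cost (+ (vector-ref prev j) 1))
              (del-cost (+ (vector-ref curr (- j 1)) 1))
              (sub-cost (+ (vector-ref prev (- j 1))
                           (if (char=? (string-ref x (- i 1))
                                       (string-ref y (- j 1)))
                               0 1))))
          (vector-set! curr j (min add-cost del-cost sub-cost))))
      ; the current row becomes the prior row for the next iteration
      (let ((tmp prev)) (set! prev curr) (set! curr tmp)))
    (vector-ref prev y-len)))
It computes the same distances as the matrix version; for instance, (two-row-levenshtein "persimmons" "cantaloupes") evaluates to 10, matching the result shown above.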
Very cool. Here’s my version in Swift — memoize caches the results by paired substrings for reuse:
And here’s the memoize function:
Two versions in Python. A recursive version with memoization and an iterative version with 2 matrix rows.
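The commenters' Swift and Python code does not appear above. As a rough sketch of the memoization idea in the same Scheme style as the exercise (an illustration, not the commenters' actual code), the recursion can be cached on index pairs with the Standard Prelude matrix routines; memoized-levenshtein is a hypothetical name:
(define (memoized-levenshtein x y)
  (let* ((x-len (string-length x))
         (y-len (string-length y))
         ; cache entry (i, j) holds the distance between the suffixes
         ; starting at x[i] and y[j]; #f marks an uncomputed entry
         (cache (make-matrix (+ x-len 1) (+ y-len 1) #f)))
    (let loop ((i 0) (j 0))
      (or (matrix-ref cache i j)
          (let ((dist (cond ((= i x-len) (- y-len j))
                            ((= j y-len) (- x-len i))
                            ((char=? (string-ref x i) (string-ref y j))
                             (loop (+ i 1) (+ j 1)))
                            (else (+ 1 (min (loop (+ i 1) j)
                                            (loop i (+ j 1))
                                            (loop (+ i 1) (+ j 1))))))))
            (matrix-set! cache i j dist)
            dist)))))
Like the dynamic programming version, it solves each subproblem only once, so (memoized-levenshtein "persimmons" "cantaloupes") evaluates to 10 without the exponential blow-up.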