Longest Common Subsequence

June 9, 2009

Finding the longest common subsequence of two sequences is a classic computer science problem with an equally classic solution that dates to the folklore of computing. The longest common subsequence is the longest set of elements that appear in order (not necessarily contiguous) in two sequences; for instance, the longest common subsequence of PROGRAMMING and PRAXIS is PRAI:

P R   O G R   A   M M   I   N G
| |           |         |
P R           A    X    I    S

The classic solution uses dynamic programming. A matrix is prepared with one sequence in the rows and the other sequence in the columns, giving in each cell the minimum edit distance between the two, where the edit distance is the least number of adds, deletes and changes that can convert one sequence to the other. For instance:

    P R O G R A M M I N G
  0 0 0 0 0 0 0 0 0 0 0 0
P 0 1 1 1 1 1 1 1 1 1 1 1
R 0 1 2 2 2 2 2 2 2 2 2 2
A 0 1 2 2 2 2 3 3 3 3 3 3
X 0 1 2 2 2 2 3 3 3 3 3 3
I 0 1 2 2 2 2 3 3 3 4 4 4
S 0 1 2 2 2 2 3 3 3 4 4 4

If we represent the rows as Xi and the columns as Yj, the matrix can be filled top-to-bottom, left-to-right using the formula:

           { 0,                           if i=0 or j=0
           {
LCS(i,j) = { LCS(i-1,j-1) + 1,            if X(i) = Y(j)
           {
           { max(LCS(i-1,j), LCS(i,j-1)), otherwise

Intuitively, the two sequences are scanned in parallel. If the current two cells are identical, the length of the longest common subsequence increases by one; otherwise, there are two possibilities to consider recursively, after deleting the current cell from either one input sequence or the other, and the length of the longest common subsequence is simply the greater of the two possibilities.

Once the matrix of longest common subsequence lengths has been calculated, the longest common subsequence itself can be recovered by noting each point where the length “bumps” to the next lower value along the diagonal, starting at the lower right-hand corner: from 4 to 3 when both sequences are at I, from 3 to 2 when both sequences are at A, from 2 to 1 when both sequences are at R, and from 1 to 0 when both sequences are at P.

Note that the longest common sequence is not necessarily unique. For instance, given the two sequences ABC and BAC, there are two longest common subsequences, AC and BC. Either answer is correct.

Your task is to write a function that takes two sequences and returns their longest common subsequence. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.

Pages: 1 2

9 Responses to “Longest Common Subsequence”

  1. Remco Niemeijer said

    My Haskell solution (see http://bonsaicode.wordpress.com/2009/06/09/programming-praxis-longest-common-subsequence/ for a version with comments):

    import Data.Array

    at :: Int -> Int -> Array Int (Array Int e) -> e
    at x y a = a ! y ! x

    lcs :: Eq a => [a] -> [a] -> [a]
    lcs xs ys = reverse $ walk imax jmax where
    imax = length xs
    jmax = length ys
    ax = listArray (1, imax) xs
    ay = listArray (1, jmax) ys
    ls = listArray (0, jmax) [listArray (0, imax)
    [lcs’ i j | i <- [0..imax]] | j <- [0..jmax]] lcs' 0 _ = 0 lcs' _ 0 = 0 lcs' i j | ax ! i == ay ! j = 1 + at (i-1) (j-1) ls | otherwise = max (at (i-1) j ls) (at i (j-1) ls) walk 0 _ = [] walk _ 0 = [] walk i j | at (i-1) j ls == at i j ls = walk (i-1) j | at i (j-1) ls < at i j ls = ax ! i : walk i (j-1) | otherwise = walk i (j-1) main :: IO () main = print $ lcs "programming" "praxis" [/sourcecode]

  2. […] Praxis – Longest Common Subsequence By Remco Niemeijer Today’s Programming Praxis problem is about finding the longest common subsequence of two sequences. Our […]

  3. Mike said

    My Python version. Incorporates a couple optimizations:

    1. identifies matching sequences at start and end of the two input sequences, and only runs LCS algorithm on the middle portions of the input sequences.

    2. a row of the matrix in the LCS algorithm only depends on the previous row, so only keep one previous row instead of entire matrix.

    The LCS algorithm accumulates the indices of the members of the LCS. The reduce call at the end, assembles the LCS from the indices. The result has the same type (i.e., string, list, tuple) as seq1.

    from operator import add
    
    def lcs(seq1, seq2):
        """return longest common subsequence of seq1 and seq2.
    
        return type is the same as seq1
    
        >>> lcs("programming", "praxis")
        'prai'
        >>> lcs("abcDzFjkl","abcGzIjkl")
        'abczjkl'
        """
    
        start = 0
        while seq1[start] == seq2[start]:
            start += 1
    
        end1, end2 = len(seq1) - 1, len(seq2) - 1
        while seq1[end1] == seq2[end2]:
            end1 -= 1
            end2 -= 1
    
        row = [ [] ] * (end2 - start + 1)
    
        for r in range(start, end1):
            prev = row
            row = [ [] ]
            for c in range(start, end2):
                if seq1[r] == seq2[c]:
                    row.append(prev[c - start] + [r])
                else:
                    row.append(max(prev[c - start +1], row[-1], key=len))
    
        seq = reduce(add, (seq1[i:i + 1] for i in row[-1]))
    
        return seq1[:start] + seq + seq1[end1 + 1:]
    
    if __name__ == "__main__":
        import doctest
        print doctest.testmod()
    
  4. Jebb said
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <assert.h>
    #define max(A,B) (((A) > (B)) ? (A) : (B))
    #define element(M, i, j) (*((M)->elt + \
    						    ((M)->width * (i) + (j)) * sizeof(int)))
    
    typedef struct matrix {
    	int *elt;
    	int width;
    	int height;
    } matrix;
    
    char *longest_comm_seq(char *str1, char *str2);
    matrix *init_matrix(int width, int length);
    matrix *lcs_matrix(char *str1, char *str2);
    void scan_matrix(matrix *LCS, char *str1, char *tmpchar);
    
    int main(int argc, char* argv[])
    {
    	char *lcs;
    	if (argc != 3) {
    		printf("usage: lcs <string 1> <string 2>\n");
    		exit(EXIT_FAILURE);
    	}
    	lcs = longest_comm_seq(argv[1], argv[2]);
    	printf("Longest common sequence: \"%s\"\n", lcs); 
    	free(lcs);
    	exit(EXIT_SUCCESS);
    }
    
    char *longest_comm_seq(char *str1, char *str2)
    {
    	char *tmpchar;
    	int lcsLength;
    	matrix *LCS;
    	LCS = lcs_matrix(str1, str2);
    	lcsLength = element(LCS, LCS->height - 1, LCS->width - 1);
    	tmpchar = (char *)malloc((lcsLength + 1) * sizeof(char));
    	assert(tmpchar != NULL);
    	scan_matrix(LCS, str1, tmpchar);
    	free(LCS->elt);
    	free(LCS);
    	return tmpchar;
    }
    
    void scan_matrix(matrix *LCS, char *str1, char *tmpchar)
    {
    	int row, col, i;
    	row = LCS->height - 1;
    	col = LCS->width - 1;
    	i = element(LCS, row, col);
    	tmpchar[i--] = '\0';
    	while (i >= 0) {
    		while (element(LCS, row, col) == element(LCS, row - 1, col))
    			row--;
    		while (element(LCS, row, col) == element(LCS, row, col - 1))
    			col--;
    		tmpchar[i] = str1[col - 1];
    		col--;
    		row--;
    		i--;
    	}
    }
    
    matrix *init_matrix(int width, int height)
    {
    	matrix *tmpLCS;
    	tmpLCS = (matrix *)malloc(sizeof(matrix));
    	tmpLCS->elt = (int *)malloc(width * height * sizeof(int));
    	assert(tmpLCS != NULL);
    	assert(tmpLCS->elt != NULL);
    	tmpLCS->width = width;
    	tmpLCS->height = height;
    	return tmpLCS;
    }
    
    matrix *lcs_matrix(char *str1, char *str2)
    {
    	int row, col;
    	matrix *tmpLCS;
    	tmpLCS = init_matrix(strlen(str1) + 1,
    						 strlen(str2) + 1);
    	for (row = 0; row < tmpLCS->height; ++row) {
    		for (col = 0; col < tmpLCS->width; ++col) {
    			if ((row == 0) || (col == 0))
    				element (tmpLCS, row, col) = 0;
    			else if (str1[col - 1] == str2[row - 1])
    				element(tmpLCS, row, col) =
    						element(tmpLCS, row - 1, col - 1) + 1;
    			else
    				element(tmpLCS, row, col) = 
    							max(element(tmpLCS, row -1, col),
    							    element(tmpLCS, row, col - 1));
    		}
    	}
    	return tmpLCS;
    }
    
  5. Jebb said

    Ooooops… memory management bits me in the backside.

    Lines 6 & 7 should read

    7 #define element(M, i, j) (*((M)->elt + \
    8 ((M)->width * (i) + (j))))

    Nasty bug there, the sizeof(int) in the macro I’d initially written was completely uncalled for: gcc knows perfectly well already that (M)->elt is an int*. As a result, my code was writing the matrix to an area 4 times the size of what I’d malloc’d in init_matrix.

  6. Jebb said

    I’ve worked on a generic version of this in preparation for the 08 June 2010 exercise. I’ve packed all the generic functions in a str_util.c file. The headers:

    //str_util.h
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <assert.h>
    #define max(A,B) (((A) > (B)) ? (A) : (B))
    #define element(M, i, j) (*((M)->elt + \
    		            ((M)->width * (i) + (j))))
    
    typedef struct vector{
    	int length;
    	int *array;
    } vector;
    typedef struct matrix{
    	int *elt;
    	int width;
    	int height;
    } matrix;
    
    void print_vector(vector *vec);
    vector *lcs_indices(void *sequence1, 
    				 void *sequence2, 
    				 void *(*IndexFn)(void *, int),
    				 int (*compFn)(void *, void *),
    				 int (*lenFn)(void *));
    static void scan_matrix(matrix *LCS, vector *vec);
    static void print_matrix(matrix *arr);
    static matrix *init_matrix(int width, int height);
    static matrix *lcs_matrix(void *sequence1,
    				   void *sequence2,
    				   void *(*indexFn)(void *, int),
    				   int (*compFn)(void *, void *),
    				   int (*lenFn)(void *));
    

    the lcs_util.c file itself:

    #include "lcs_util.h"
    void print_vector(vector *vec)
    {
    	int i;
    	for (i = 0; i < vec->length; i++)
    		printf("%d ", vec->array[i]);
    	printf("\n");
    }
    
    vector *lcs_indices(void *sequence1, 
    				 void *sequence2, 
    				 void *(*IndexFn)(void *, int),	//See comments in lcs_matrix
    				 int (*compFn)(void *, void *),
    				 int (*lenFn)(void *))
    {
    	vector *tmpvec;
    	matrix *LCS;
    	int test;
    	LCS = lcs_matrix(sequence1, sequence2, IndexFn, compFn, lenFn);
    	
    	tmpvec = (vector *)malloc(sizeof(vector));
    	assert(tmpvec != NULL);
    	tmpvec->length = element(LCS, LCS->height - 1, LCS->width - 1);
    	tmpvec->array = (int *)malloc((tmpvec->length) * sizeof(int));
    	assert(tmpvec->array != NULL);
    
    	scan_matrix(LCS, tmpvec);
    	free(LCS->elt);
    	free(LCS);
    	return tmpvec;
    }
    
    static void scan_matrix(matrix *LCS, vector *vec)
    {
    	int row, col, i;
    	row = LCS->height - 1;	//first row and column full of 0s
    	col = LCS->width - 1;
    	i = vec->length - 1;
    	while (i >= 0) {
    		while (element(LCS, row, col) == element(LCS, row - 1, col))
    			row--;
    		while (element(LCS, row, col) == element(LCS, row, col - 1))
    			col--;
    		vec->array[i--] = col - 1; //index the array from 0
    		col--;
    		row--;
    	}
    }
    
    static void print_matrix(matrix *arr)
    {
    	int row, col;
    	for (row = 0; row < arr->height; ++row) {
    		printf("%d: ", row);
    		for (col = 0; col < arr->width; ++col)
    			printf("%d ", element(arr, row, col));
    		printf("\n");
    	}
    }
    
    static matrix *init_matrix(int width, int height)
    {
    	matrix *tmpLCS;
    	tmpLCS = (matrix *)malloc(sizeof(matrix));
    	tmpLCS->elt = (int *)malloc(width * height * sizeof(int));
    	assert(tmpLCS != NULL);
    	assert(tmpLCS->elt != NULL);
    	tmpLCS->width = width;
    	tmpLCS->height = height;
    	return tmpLCS;
    }
    
    static matrix *lcs_matrix(void *sequence1, 
    				   void *sequence2, 
    				   void *(*indexFn)(void *, int i), 
    				   int (*compFn)(void *, void *), 
    				   int (*lenFn)(void *))
    /* Initialises and returns the matrix of minimum edit distances 
     * lenFn applies to a sequence: "programming" or "praxis",
     * 		or a text seen as a sequence of lines
     * compFn compares two items in a sequence: letters in "programming" 
     * 		or "praxis", or lines in a text sequence
     * indexFn returns a pointer to an item in a sequence: 
     * 		a letter in "programming", 
     * 		a (char *) to a line in a text sequence */
    {
    	int row, col;
    	matrix *tmpLCS;
    	tmpLCS = init_matrix(lenFn(sequence1) + 1,
    						 lenFn(sequence2) + 1);
    	for (row = 0; row < tmpLCS->height; ++row) {
    		for (col = 0; col < tmpLCS->width; ++col) {
    			if ((row == 0) || (col == 0))
    				element (tmpLCS, row, col) = 0;
    			else if (compFn(indexFn(sequence1, col - 1), 
    				            indexFn(sequence2, row - 1)) == 0)
    				element(tmpLCS, row, col) =
    						element(tmpLCS, row - 1, col - 1) + 1;
    			else
    				element(tmpLCS, row, col) = 
    							max(element(tmpLCS, row -1, col),
    							    element(tmpLCS, row, col - 1));
    		}
    	}
    	return tmpLCS;
    }
    

    …and (finally) the string-specific bits:

    #include "lcs_util.h"
    int lenStr(void *str);
    void *indexStr(void *str, int i);
    int compStr(void *str1, void *str2);
    void print_str(char *str, vector *lcs);
    
    int main(int argc, char* argv[])
    {
    	vector *lcs;
    	if (argc != 3) {
    		printf("usage: lcs <string 1> <string 2>\n");
    		exit(EXIT_FAILURE);
    	}
    	lcs = lcs_indices((void *)argv[1],
    					  (void *)argv[2],
    					  indexStr,
    					  compStr,
    					  lenStr);
    	printf("Longest common sequence: "); 
    	print_str(argv[1], lcs);
    	free(lcs->array);
    	free(lcs);
    	exit(EXIT_SUCCESS);
    }
    
    void print_str(char *str, vector *lcs)
    {
    	int *ip;
    	for (ip = lcs->array; ip - lcs->array < lcs->length; ip++)
    		putchar(str[*ip]);
    	putchar('\n');
    }
    
    int lenStr(void *str)
    {
    	return strlen((char *)str);
    }
    
    void *indexStr(void *str, int i)
    {
    	char *cp;
    	cp = (char *)str + i * sizeof(char);
    	return (void *)cp;
    }
    
    int compStr(void *str1, void *str2)
    {
    	return (int)(*((char *)str1) - *((char *)str2));
    }
    
  7. […] Longest Common Subsequence Finding the longest common subsequence of two sequences is a classic computer science problem with an equally classic solution that dates to the folklore of computing. The longest common subsequence is the longest set of elements that appear in order (not necessarily contiguous) in two sequences, the solution can be used in matching DNA sequences and it is also the basis of Unix “DIFF” utility. […]

  8. Vikas Tandi said

    my implementation in c

    #include <string.h>
    #include <stdlib.h>
    #include <stdio.h>
    
    static int max(int x, int y);
    
    int longest_common_subsequence(char *s, char *t, char *lcs)
    {
    	int i, j, k;
    	int rowmax, colmax;
    	int **matrix;
    
    	/* calculate the matrix row size and column size */
    	colmax = strlen(s) + 1;
    	rowmax = strlen(t) + 1;
    
    	/* allocate memory for the matrix */
    	matrix = (int**)calloc(rowmax, sizeof(int*));
    
    	if(matrix == NULL)
    		return 0;
    
    	for(k = 0; k < rowmax; k++)
    	{
    		matrix[k] = (int*)calloc(colmax, sizeof(int));
    
    		if(matrix[k] == NULL)
    			return 0;
    	}
    
    	/* perform lcs operation on the matrix */
    	for(i = 1; i < rowmax; i++)
    	{
    		for(j = 1; j < colmax; j++)
    		{
    			if(s[j-1] == t[i-1])
    				matrix[i][j] = matrix[i-1][j-1] + 1;
    			else
    				matrix[i][j] = max(matrix[i-1][j], matrix[i][j-1]);
    		}
    	}
    
    	/* backtrack the matrix to identify the largest common subsequence. */
    	for(i = i-1, j = j-1, k = 0; i || j;)
    	{
    		if(matrix[i-1][j] == matrix[i][j]	&&
    			matrix[i][j-1] == matrix[i][j])
    		{
    			i--;
    			j--;
    		}
    		else if(matrix[i-1][j] == matrix[i][j])
    			i--;
    		else if(matrix[i][j-1] == matrix[i][j])
    			j--;
    		else
    		{
    			lcs[k++] = t[i-1];
    			i--;
    			j--;
    		}
    	}
    	lcs[k] = '\0';
    	strrev(lcs);
    	return 1;
    }
    
    static int max(int x, int y)
    {
    	return (x > y) ? x : y;
    }
    
    
  9. gautam said

    #include
    int max(int a, int b){
    return a>b?a:b;
    }
    void printResult( char * str1 , char * str2 ){
    int len1=0,len2=0;
    int i,j,count=1;
    int a[20][20];
    // find out the lengths
    while(str1[len1++] != ” );
    while(str2[len2++] != ” );
    len1–;
    len2–;
    // populate the matrix
    for ( i=0 ; i<=len2 ; i++ )
    for ( j=0 ; j<=len1 ; j++ ){
    if ( i==0 || j==0 ){
    a[i][j] = 0;
    continue;
    }
    if ( str1[j-1] == str2[i-1] )
    a[i][j]=a[i-1][j-1] +1;
    else
    a[i][j]=max(a[i-1][j],a[i][j-1]);
    }
    // print the answer
    for ( i=0 ; i<len1 ; i++ ){
    if ( a[len2][i+1] == count ){
    printf("%c",str1[i]);
    count++;
    }
    }
    printf("\n");
    }
    void main(){
    int n,i;
    char a[10][10];
    char b[10][10];
    // scan the number of elements
    scanf("%d",&n);
    // scan elements
    for ( i=0 ; i<n ; i++ ){
    scanf("%s",a[i]);
    scanf("%s",b[i]);
    }
    for ( i=0 ; i<n ; i++ )
    printResult(a[i],b[i]);
    }

Leave a comment