Longest Common Subsequence (LCS)


The Longest Common Subsequence (LCS) is a classic problem in computer science and string processing. Given two strings, the goal of LCS is to find the longest subsequence (not necessarily contiguous) that is common to both strings.

For example, consider the two strings:

  • String 1: "AGGTAB"
  • String 2: "GXTXAYB"

The longest common subsequence (LCS) of these two strings is "GTAB" with a length of 4.

Unlike substrings, subsequences do not require characters to be adjacent, but they must appear in the same relative order.

How the Longest Common Subsequence (LCS) Works

The LCS problem is typically solved using Dynamic Programming (DP). The main idea is to break the problem down into subproblems. By solving smaller subproblems, you can build the solution to the overall problem.

LCS Recursive Formula:

Given two strings X and Y of lengths m and n, the LCS can be defined as:

  • If X[m-1] == Y[n-1], then the LCS of X and Y is: LCS(X,Y)=1+LCS(X[0..m2],Y[0..n2])
  • If X[m-1] != Y[n-1], then the LCS of X and Y is: LCS(X,Y)=max(LCS(X[0..m2],Y[0..n1]),LCS(X[0..m1],Y[0..n2]))
  • Base Case: If either string is empty (i.e., m == 0 or n == 0), then LCS(X, Y) = 0.

This recursive relation can be efficiently solved using dynamic programming, which eliminates overlapping subproblems.


Dynamic Programming Approach for LCS

The dynamic programming approach stores the results of subproblems in a 2D array (or table) dp, where dp[i][j] represents the length of the LCS of the first i characters of string X and the first j characters of string Y.

Steps to Solve LCS using DP:

  1. Initialize a 2D array dp of size (m+1) x (n+1) where m and n are the lengths of strings X and Y.
  2. For each pair of characters (X[i-1], Y[j-1]), if they are equal, then: dp[i][j]=dp[i1][j1]+1 Otherwise: dp[i][j]=max(dp[i1][j],dp[i][j1])
  3. The value at dp[m][n] will contain the length of the LCS.

Example: Finding LCS of "AGGTAB" and "GXTXAYB"

LCS Table (Step by Step):

Let's break down the steps with the strings "AGGTAB" and "GXTXAYB".

    G  X  T  X  A  Y  B
A   0  0  0  0  0  0  0
G   0  0  0  0  0  0  0
G   0  0  0  0  0  0  0
T   0  0  0  0  0  0  0
A   0  0  0  0  0  0  0
B   0  0  0  0  0  0  0

Steps to Fill the Table:

  1. For each pair of characters, fill in the table according to the recursive relations described earlier.
  2. At the end, dp[6][7] will give the length of the LCS, which is 4.

Sample Code for LCS

Python Code for LCS

def lcs(X, Y):
    m = len(X)
    n = len(Y)
    
    # Create a 2D table to store lengths of longest common subsequence
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    # Fill the table in bottom-up manner
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    
    # The length of the longest common subsequence is in dp[m][n]
    return dp[m][n]

# Example usage
X = "AGGTAB"
Y = "GXTXAYB"
print(f"Length of LCS: {lcs(X, Y)}")

C++ Code for LCS

#include <iostream>
#include <vector>
#include <string>
using namespace std;

int lcs(string X, string Y) {
    int m = X.length();
    int n = Y.length();
    
    // Create a 2D table to store lengths of longest common subsequence
    vector<vector<int>> dp(m + 1, vector<int>(n + 1, 0));
    
    // Fill the table in bottom-up manner
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            if (X[i - 1] == Y[j - 1]) {
                dp[i][j] = dp[i - 1][j - 1] + 1;
            } else {
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
    }
    
    // The length of the longest common subsequence is in dp[m][n]
    return dp[m][n];
}

int main() {
    string X = "AGGTAB";
    string Y = "GXTXAYB";
    cout << "Length of LCS: " << lcs(X, Y) << endl;
    return 0;
}

Time and Space Complexity of LCS

  • Time Complexity:
    The algorithm has O(m * n) time complexity, where m and n are the lengths of the two strings. This is because we are filling a 2D table of size (m+1) x (n+1).

  • Space Complexity:
    The space complexity is also O(m * n) due to the 2D table used to store the lengths of the LCS at each step.


Applications of Longest Common Subsequence

  1. Text Comparison:
    LCS is often used to compare two texts or files to identify similarities and differences. This is useful in diff tools (e.g., git diff), plagiarism detection, and version control systems.

  2. Bioinformatics:
    LCS is widely used in DNA sequence analysis, where it helps in comparing sequences to identify common patterns or mutations in biological data.

  3. Data Compression:
    LCS can be used in data compression algorithms to find repeated patterns that can be compressed, reducing the size of the data.

  4. Machine Learning and Natural Language Processing (NLP):
    In NLP tasks, LCS can be applied to measure the similarity between text sequences, such as in text classification, question answering systems, and semantic analysis.

  5. Sequence Alignment:
    In computational biology, LCS is used for sequence alignment algorithms, helping to align sequences of biological molecules like DNA, RNA, or proteins.