Longest Common Subsequence (LCS)
The Longest Common Subsequence (LCS) is a classic problem in computer science and string processing. Given two strings, the goal of LCS is to find the longest subsequence (not necessarily contiguous) that is common to both strings.
For example, consider the two strings:
The longest common subsequence (LCS) of these two strings is "GTAB" with a length of 4.
Unlike substrings, subsequences do not require characters to be adjacent, but they must appear in the same relative order.
The LCS problem is typically solved using Dynamic Programming (DP). The main idea is to break the problem down into subproblems. By solving smaller subproblems, you can build the solution to the overall problem.
Given two strings X and Y of lengths m and n, the LCS can be defined as:
X[m-1] == Y[n-1]
, then the LCS of X and Y is: X[m-1] != Y[n-1]
, then the LCS of X and Y is: LCS(X, Y) = 0
.This recursive relation can be efficiently solved using dynamic programming, which eliminates overlapping subproblems.
The dynamic programming approach stores the results of subproblems in a 2D array (or table) dp
, where dp[i][j]
represents the length of the LCS of the first i
characters of string X and the first j
characters of string Y.
dp
of size (m+1) x (n+1)
where m
and n
are the lengths of strings X and Y.(X[i-1], Y[j-1])
, if they are equal, then: Otherwise: dp[m][n]
will contain the length of the LCS.Let's break down the steps with the strings "AGGTAB" and "GXTXAYB".
G X T X A Y B
A 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0
T 0 0 0 0 0 0 0
A 0 0 0 0 0 0 0
B 0 0 0 0 0 0 0
dp[6][7]
will give the length of the LCS, which is 4.
def lcs(X, Y):
m = len(X)
n = len(Y)
# Create a 2D table to store lengths of longest common subsequence
dp = [[0] * (n + 1) for _ in range(m + 1)]
# Fill the table in bottom-up manner
for i in range(1, m + 1):
for j in range(1, n + 1):
if X[i - 1] == Y[j - 1]:
dp[i][j] = dp[i - 1][j - 1] + 1
else:
dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
# The length of the longest common subsequence is in dp[m][n]
return dp[m][n]
# Example usage
X = "AGGTAB"
Y = "GXTXAYB"
print(f"Length of LCS: {lcs(X, Y)}")
#include <iostream>
#include <vector>
#include <string>
using namespace std;
int lcs(string X, string Y) {
int m = X.length();
int n = Y.length();
// Create a 2D table to store lengths of longest common subsequence
vector<vector<int>> dp(m + 1, vector<int>(n + 1, 0));
// Fill the table in bottom-up manner
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (X[i - 1] == Y[j - 1]) {
dp[i][j] = dp[i - 1][j - 1] + 1;
} else {
dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]);
}
}
}
// The length of the longest common subsequence is in dp[m][n]
return dp[m][n];
}
int main() {
string X = "AGGTAB";
string Y = "GXTXAYB";
cout << "Length of LCS: " << lcs(X, Y) << endl;
return 0;
}
Time Complexity:
The algorithm has O(m * n) time complexity, where m and n are the lengths of the two strings. This is because we are filling a 2D table of size (m+1) x (n+1).
Space Complexity:
The space complexity is also O(m * n) due to the 2D table used to store the lengths of the LCS at each step.
Text Comparison:
LCS is often used to compare two texts or files to identify similarities and differences. This is useful in diff tools (e.g., git diff
), plagiarism detection, and version control systems.
Bioinformatics:
LCS is widely used in DNA sequence analysis, where it helps in comparing sequences to identify common patterns or mutations in biological data.
Data Compression:
LCS can be used in data compression algorithms to find repeated patterns that can be compressed, reducing the size of the data.
Machine Learning and Natural Language Processing (NLP):
In NLP tasks, LCS can be applied to measure the similarity between text sequences, such as in text classification, question answering systems, and semantic analysis.
Sequence Alignment:
In computational biology, LCS is used for sequence alignment algorithms, helping to align sequences of biological molecules like DNA, RNA, or proteins.