Rabin-Karp Algorithm


The Rabin-Karp algorithm is a well-known string matching algorithm that uses hashing to find a pattern within a longer text. The naive string matching algorithm compares the pattern against the text character by character at every position. Rabin-Karp instead computes a hash of the pattern and compares it with the hash of each pattern-length window of the text, falling back to a character-by-character comparison only when the two hashes agree. Because the window hash can be updated in constant time as the window slides, this is usually faster than the naive approach, especially when searching for several patterns at once.

The main advantage of the Rabin-Karp algorithm is its ability to handle multiple pattern searches efficiently. However, in the worst case, its time complexity degrades to that of the naive algorithm, O(n * m), where n is the length of the text and m is the length of the pattern.
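For reference, the naive baseline mentioned above can be sketched in a few lines of Python. The function name `naive_search` and the choice to return a list of indices are illustrative assumptions; the point is only that every position triggers a direct comparison of up to m characters.

```python
def naive_search(text, pattern):
    """Return every index where pattern occurs, comparing at each position."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Each slice comparison can cost up to m character comparisons,
    # which is where the O(n * m) bound comes from.
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


print(naive_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD"))  # prints [15]
```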

How the Rabin-Karp Algorithm Works

The Rabin-Karp algorithm follows these basic steps:

  1. Hash the pattern: Compute a hash value for the pattern you're searching for.
  2. Hash the text: Compute the hash value for substrings of the text, with the same length as the pattern.
  3. Compare hash values: If the hash values match, compare the actual strings to confirm the match (to avoid hash collisions).
  4. Slide the window: Slide the window over the text one character at a time, updating the hash for the new substring.

Rabin-Karp Algorithm Details:

The algorithm uses a rolling hash, which lets it compute the hash of the next window from the hash of the current one in constant time.

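For the window of length m starting at position i of a string s, the hash is calculated as:

$$\mathrm{hash}_i = \left(s[i]\,p^{m-1} + s[i+1]\,p^{m-2} + \cdots + s[i+m-1]\,p^{0}\right) \bmod q$$

Where:

  • p is the base of the hash. It is often a small prime such as 31, or the size of the alphabet (the sample code below uses d = 256).
  • q is a large prime number used as the modulus to reduce collisions.
  • m is the length of the pattern.

The leading character of the window carries the highest power of p, which is what makes the rolling update possible. For each subsequent window, the hash can be updated using:

$$\mathrm{hash}_{i+1} = \left(p\,\bigl(\mathrm{hash}_i - s[i]\,p^{m-1}\bigr) + s[i+m]\right) \bmod q$$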

This allows the algorithm to update the hash in O(1) time for each shift of the window.
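As a minimal sketch of the update above, the following Python snippet rolls a hash across a short text and checks each rolled value against a hash recomputed from scratch. The helper names `window_hash` and `roll`, the sample text, and the values p = 31 and q = 1_000_000_007 are illustrative assumptions rather than part of the algorithm's definition.

```python
def window_hash(s, p, q):
    """Polynomial hash of s; the first character carries the highest power of p."""
    h = 0
    for ch in s:
        h = (h * p + ord(ch)) % q
    return h


def roll(h, outgoing, incoming, high, p, q):
    """Slide the window one character: drop `outgoing`, append `incoming`.

    `high` must equal p**(m - 1) % q for a window of length m.
    """
    return ((h - ord(outgoing) * high) * p + ord(incoming)) % q


p, q = 31, 1_000_000_007   # assumed base and modulus, chosen for illustration
text, m = "ABCDABCE", 4
high = pow(p, m - 1, q)    # p^(m-1) mod q, computed once

h = window_hash(text[:m], p, q)
for i in range(len(text) - m):
    h = roll(h, text[i], text[i + m], high, p, q)
    # The rolled hash must agree with a hash recomputed from scratch.
    assert h == window_hash(text[i + 1:i + 1 + m], p, q)
```

Each call to `roll` does a fixed amount of arithmetic regardless of m, whereas `window_hash` touches all m characters.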


Example: Finding a Pattern in a Text Using Rabin-Karp

Consider the text "ABC ABCDAB ABCDABCDABDE" and the pattern "ABCDABD". The goal is to find the pattern in the text using the Rabin-Karp algorithm.

  1. Compute the hash of the pattern.
  2. Compute the hash of the first substring of the text with the same length as the pattern.
  3. Slide the window across the text, compute the hash for each new substring, and check if it matches the pattern's hash.
  4. If hashes match, compare the actual substrings to confirm the match.
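To make these steps visible on the example, here is a hedged sketch that prints the hash of every window as it slides. The generator name `trace_windows` is hypothetical, and the parameters d = 256 and q = 101 are assumed to match the sample code that follows.

```python
def trace_windows(text, pattern, d=256, q=101):
    """Yield (index, window, window_hash, is_match) for every window of the text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return
    h = pow(d, m - 1, q)  # weight of the window's leading character

    # Steps 1 and 2: hash the pattern and the first window.
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (d * p_hash + ord(pattern[i])) % q
        t_hash = (d * t_hash + ord(text[i])) % q

    for i in range(n - m + 1):
        window = text[i:i + m]
        # Step 4: a match needs equal hashes and equal characters.
        yield i, window, t_hash, (t_hash == p_hash and window == pattern)
        # Step 3: slide the window and update the hash in O(1).
        if i < n - m:
            t_hash = (d * (t_hash - ord(text[i]) * h) + ord(text[i + m])) % q


text = "ABC ABCDAB ABCDABCDABDE"
pattern = "ABCDABD"
for i, window, w_hash, is_match in trace_windows(text, pattern):
    print(f"{i:2d}  {window!r:11}  hash={w_hash:3d}  {'MATCH' if is_match else ''}")
```

The text has 23 characters and the pattern has 7, so 17 windows are examined; the only one flagged as a match starts at index 15.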

Sample Code for Rabin-Karp Algorithm

Python Code for Rabin-Karp Algorithm

def rabin_karp(text, pattern):
    # Hashing parameters
    d = 256  # Base of the hash: number of characters in the input alphabet
    q = 101  # A prime number for modulus (small here, for illustration)

    # Length of the pattern and text
    m = len(pattern)
    n = len(text)

    # An empty pattern, or one longer than the text, cannot be matched here
    if m == 0 or m > n:
        return

    # Calculate the hash value of the pattern and the first window of the text
    p_hash = 0  # Hash value for pattern
    t_hash = 0  # Hash value for text

    # The value of d^(m-1) % q, the weight of the window's leading character
    h = 1
    for i in range(m - 1):
        h = (h * d) % q

    # Calculate initial hash values of the pattern and the first window
    for i in range(m):
        p_hash = (d * p_hash + ord(pattern[i])) % q
        t_hash = (d * t_hash + ord(text[i])) % q

    # Slide the pattern over text one by one
    for i in range(n - m + 1):
        # If the hash values match, check the actual strings
        if p_hash == t_hash:
            if text[i:i + m] == pattern:
                print(f"Pattern found at index {i}")

        # Calculate the hash value for the next window
        if i < n - m:
            # Python's % returns a non-negative result for a positive modulus,
            # so no correction for negative values is needed here
            t_hash = (d * (t_hash - ord(text[i]) * h) + ord(text[i + m])) % q

# Example usage
text = "ABC ABCDAB ABCDABCDABDE"
pattern = "ABCDABD"
rabin_karp(text, pattern)

Explanation:

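  • d = 256: The base of the hash, set to the number of possible byte values so that every ASCII character maps to a distinct digit.
  • q = 101: A small prime modulus chosen to keep the example readable. In practice a much larger prime makes hash collisions (spurious hits) less likely.
  • p_hash and t_hash are the hash values of the pattern and the current window of the text, respectively.
  • h is d^(m-1) mod q, the weight of the window's leading character. It is used to subtract that character's contribution when the window slides.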
  • The algorithm computes the hash for the pattern and the first substring of the text, and then slides the window across the text to compute hashes for other substrings.
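The effect of the modulus q can be observed directly. The sketch below counts spurious hits, meaning windows whose hash equals the pattern's hash although the characters differ. The function name `count_spurious_hits`, the repeated test text, and the larger modulus 1_000_000_007 are assumptions made for illustration.

```python
def count_spurious_hits(text, pattern, q, d=256):
    """Count windows whose hash equals the pattern hash but whose text differs."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return 0
    h = pow(d, m - 1, q)
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (d * p_hash + ord(pattern[i])) % q
        t_hash = (d * t_hash + ord(text[i])) % q

    spurious = 0
    for i in range(n - m + 1):
        if p_hash == t_hash and text[i:i + m] != pattern:
            spurious += 1  # hashes agree, characters do not
        if i < n - m:
            t_hash = (d * (t_hash - ord(text[i]) * h) + ord(text[i + m])) % q
    return spurious


text = "ABC ABCDAB ABCDABCDABDE" * 200
pattern = "ABCDABD"
for q in (101, 1_000_000_007):
    print(f"q = {q}: {count_spurious_hits(text, pattern, q)} spurious hits")
```

Every spurious hit costs a wasted character-by-character comparison, which is why the choice of q affects running time but not correctness.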

C++ Code for Rabin-Karp Algorithm

#include <iostream>
#include <string>
using namespace std;

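void rabin_karp(const string& text, const string& pattern) {
    const int d = 256;  // Number of characters in the input alphabet
    const int q = 101;  // A prime number for modulus
    const int m = static_cast<int>(pattern.length());
    const int n = static_cast<int>(text.length());

    // An empty pattern, or one longer than the text, cannot be matched here
    if (m == 0 || m > n) {
        return;
    }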

    // The value of d^(m-1), used for rolling hash
    int h = 1;
    for (int i = 0; i < m - 1; i++) {
        h = (h * d) % q;
    }

    // Compute hash of the pattern and the first window of text
    int p_hash = 0, t_hash = 0;
    for (int i = 0; i < m; i++) {
        p_hash = (d * p_hash + pattern[i]) % q;
        t_hash = (d * t_hash + text[i]) % q;
    }

    // Slide the pattern over the text one by one
    for (int i = 0; i <= n - m; i++) {
        // If the hash values match, check the actual strings
        if (p_hash == t_hash) {
            // compare() checks the window in place, without copying a substring
            if (text.compare(i, m, pattern) == 0) {
                cout << "Pattern found at index " << i << endl;
            }
        }

        // Calculate the hash value for the next window
        if (i < n - m) {
            t_hash = (d * (t_hash - text[i] * h) + text[i + m]) % q;
            if (t_hash < 0) {
                t_hash += q;  // Ensure positive hash value
            }
        }
    }
}

int main() {
    string text = "ABC ABCDAB ABCDABCDABDE";
    string pattern = "ABCDABD";
    rabin_karp(text, pattern);
    return 0;
}

Time and Space Complexity of Rabin-Karp

  • Time Complexity:

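    • Average Case: The time complexity of Rabin-Karp is O(n + m), where n is the length of the text and m is the length of the pattern. Hashing the pattern and the first window takes O(m), and each of the n - m slides updates the hash in O(1). With a good hash function, spurious hits are rare, so when true matches are also few, verification adds little extra work.
    • Worst Case: When every window's hash equals the pattern's hash, each window needs a full O(m) comparison, giving a worst-case time complexity of O(n * m), the same as the naive string matching algorithm. This happens with many hash collisions, and also when every window is a true match (for example, a text of repeated "a" characters and a pattern of repeated "a" characters).
  • Space Complexity:
    The auxiliary space complexity is O(1), since the algorithm stores only the hash values and a few other variables. The slice text[i:i + m] in the Python sample creates a temporary string of length m during verification; comparing character by character avoids this.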
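The worst case can be reproduced with a short experiment. The following sketch counts how many characters are examined during verification; the function name `count_char_comparisons` and the input sizes are assumptions chosen for illustration.

```python
def count_char_comparisons(text, pattern, d=256, q=101):
    """Run Rabin-Karp and count the characters examined during verification."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return 0
    h = pow(d, m - 1, q)
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (d * p_hash + ord(pattern[i])) % q
        t_hash = (d * t_hash + ord(text[i])) % q

    comparisons = 0
    for i in range(n - m + 1):
        if p_hash == t_hash:
            # Verify character by character, stopping at the first mismatch.
            for j in range(m):
                comparisons += 1
                if text[i + j] != pattern[j]:
                    break
        if i < n - m:
            t_hash = (d * (t_hash - ord(text[i]) * h) + ord(text[i + m])) % q
    return comparisons


n, m = 1000, 100
# Every window is a true match, so all (n - m + 1) * m characters are checked.
print(count_char_comparisons("a" * n, "a" * m))  # prints 90100
# No window matches; any verification stops within the first two characters.
print(count_char_comparisons("ab" * (n // 2), "a" * m))
```

In the first call the hash saves nothing, because every window must be verified in full. In the second, each window is either rejected by its hash or by an early mismatch.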


Applications of Rabin-Karp Algorithm

  1. String Matching:
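    The primary use of the Rabin-Karp algorithm is in string matching, where you need to find one or more occurrences of a substring in a given text. It is particularly efficient when there are multiple patterns of the same length to search for in the same text, because the pattern hashes can be stored in a hash table and each window hash looked up in expected constant time.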

  2. Plagiarism Detection:
    In plagiarism detection systems, Rabin-Karp can be used to identify similar sections of text by comparing substrings using hashing.

  3. Search Engines:
    Search engines can use variations of string matching algorithms like Rabin-Karp to search for relevant documents based on keyword matching.

  4. Bioinformatics:
    In DNA sequence analysis, Rabin-Karp is used to find specific subsequences within a large DNA sequence.
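The multiple-pattern use mentioned under String Matching can be sketched as follows. This is a minimal illustration that assumes all patterns have the same length; the function name `rabin_karp_multi`, the modulus, and the DNA-like sample text are hypothetical choices, not taken from a particular library.

```python
def rabin_karp_multi(text, patterns, d=256, q=1_000_000_007):
    """Find all occurrences of several equal-length patterns in one pass.

    Returns (index, pattern) pairs in order of increasing index.
    """
    patterns = set(patterns)
    if not patterns:
        return []
    m = len(next(iter(patterns)))
    if any(len(p) != m for p in patterns):
        raise ValueError("this sketch assumes all patterns have the same length")
    n = len(text)
    if m == 0 or m > n:
        return []

    def full_hash(s):
        value = 0
        for ch in s:
            value = (d * value + ord(ch)) % q
        return value

    # Map each pattern hash to the patterns that produce it.
    by_hash = {}
    for p in patterns:
        by_hash.setdefault(full_hash(p), []).append(p)

    high = pow(d, m - 1, q)
    t_hash = full_hash(text[:m])
    found = []
    for i in range(n - m + 1):
        # One dictionary lookup replaces a separate scan per pattern.
        for p in by_hash.get(t_hash, ()):
            if text[i:i + m] == p:
                found.append((i, p))
        if i < n - m:
            t_hash = (d * (t_hash - ord(text[i]) * high) + ord(text[i + m])) % q
    return found


print(rabin_karp_multi("GATTACAGATTACA", ["GATT", "TACA"]))
# prints [(0, 'GATT'), (3, 'TACA'), (7, 'GATT'), (10, 'TACA')]
```

The text is scanned once regardless of how many patterns are supplied, which is the property that makes the approach attractive for plagiarism detection and sequence search.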