An alternative to brute force. Text search with hash function

    Earlier, I wrote about the basics of text search, now I want to continue and write about how algorithms are developing in the direction of efficiency.
    So how did Michael Rabin and Richard Karp overclock the algorithm?


    Why is brute force so slow? Perhaps because we are doing too much unnecessary action. Then the idea appears to optimize the inner cycle. But as? One could compare strings by some numbers characterizing them.
    There are such numbers, we will get them using the hash function .

    First approach


    Maybe something comes to mind like “let's calculate the hash code for the template and for each of the substrings, if they are the same - there is a coincidence here”.
    Let's try to solve the problem by writing our hash function, and also add a code that will character-by-character match the substring with the template only if the hash codes are equal. We just make the hash function of the string just the sum of the character codes that make up this string: The substring search function itself will look like this:

    private int GetHashOfString(string s)
    {
        int result = 0;
        for (int i = 0; i < s.Length; i++)
        {
            result += s[i];
        }
        return result;
    }




    public int Match(string input, string pattern)
    {
        int inputLength = input.Length;
        int patternLength = pattern.Length;
        int patternHash = GetHashOfString(pattern);
        int substringHash;

        for (int i = 0; i <= inputLength - patternLength; i++)
        {
            substringHash = GetHashOfString(input.Substring(i, patternLength));
            if (patternHash == substringHash)
            {
                bool success = true;
                for (int j = 0; j < patternLength; j++)
                {
                    if (input[i + j] != pattern[j])
                    {
                        success = false;
                        break;
                    }
                }
                if (success)
                    return i;
            }
        }
        return -1;
    }


    Overclock the algorithm


    But again: to find a hash code at each position, we perform exactly the same actions as with brute force. It is necessary to optimize. Our hash construction option allows us to significantly speed up its calculation at each subsequent step: subtract the ASCII code of the character that was at position zero and add a new character code.

    Ring hash

    The code will change as follows:

    ...
    int patternHash = GetHashOfString(pattern);
    int substringHash = GetHashOfString(input.Substring(0, patternLength));

    for (int i = 0; i <= inputLength - patternLength; i++)
    {
        if (i > 0)
            substringHash =
                substringHash - input[i - 1] + input[i + patternLength - 1];
        if (patternHash == substringHash)
    ...


    Continue overclocking?


    You can even overclock the algorithm by using another hash function. One of these hash functions interprets each substring as a number in some number system, the base of which is a large prime number.

    Record a string in the number system

    How to get the hash value from the previous value in a new step? Since the length of the template is constant for us, we can once calculate and remember the base in the degree of the length of the template minus one: maxBase = 61 ^ (length – 1). Instead of subtracting the value of the code being thrown out, we subtract its value multiplied by maxBase, that is, 'a' * 61 ^ 3.
    After this, it is necessary to add a new code and multiply the obtained value by the base of our system (61).
    This can be written as pseudocode:

    substringHash = substringHash - input[i - 1];
    substringHash = substringHash + input[i + patternLength - 1];
    substringHash = substringHash * base; // base – основание системы


    Another question: what will happen to the hash with a longer line length (more precisely, with a sufficiently long template length)? 61 to the sixth power (for a length of 7 characters) no longer fits in a four-byte integer.

    “Modulo arithmetic” comes to the rescue. We will not store huge numbers that cannot be fit into a thirty-two-digit int, we will take their remainder by dividing by some prime q.
    Just in case, I’ll say that “modulo arithmetic” is based on identities such as:
    (a + b + c) mod x = (a mod x + b mod x + c mod x) mod x
    (a * b * c) mod x = (a mod x * b mod x * c mod x) mod x


    We implement the algorithm


    So, the hash function will not consist of the sum of character codes multiplied by the selected base base to the appropriate degree, but of this sum taken modulo q. The values of q and base will choose the longer length of the alphabet, ie for ASCII this is> 256, and for Unicode> 65536.

    Our function for string s 3 characters long will be:
    ((ascii(s[0]) * base^2) mod q + (ascii(s[1]) * base^1) mod q + (ascii(s[2]) * base^0) mod q) mod q

    Since the pattern length during a search remains unchanged, and remains unchanged base ^ (length – 1) mod q. We take out the calculation of this value in a separate method: As before, create a method for the hash function: Now the search function itself: find the value base ^ (patternLength - 1) modulo q first find the hash values ​​for the template and the first substring if the hash value match - compare the lines completely

    private int GetMaxBase(int length, int q, int b)
    {
        int result = 1;
        for (int i = 0; i < length - 1; i++)
            result = (result * b) % q;
        return result;
    }




    private int GetHashOfString(string s, int q, int b)
    {
        int result = 0;
        int length = s.Length;

        for (int i = 0; i < length; i++)
            result = (b * result + s[i]) % q;
        return result;
    }



    public int Match(string input, string pattern, int b, int q)
    {
        int inputLength = input.Length;
        int patternLength = pattern.Length;


        int maxBase = GetMaxBase(patternLength, q, b);

        int patternHash = GetHashOfString(pattern, q, b);
        int substringHash =
            GetHashOfString(input.Substring(0, patternLength), q, b);

        for (int i = 0; i <= inputLength - patternLength; i++)
        {


            if (patternHash == substringHash)
            {
                bool success = true;
                for (int j = 0; j < patternLength; j++)
                {
                    if (input[i + j] != pattern[j])
                    {
                        success = false;
                        break;
                    }
                }
                if (success)
                    return i;
            }

    if we are not at the last step, we find the new value of the hash function, if we get a negative number, we make it positive :)
            if (i != inputLength - patternLength)
                substringHash =
                    (b * (substringHash - input[i] * maxBase) +
                    input[i + patternLength]) % q;


            if (substringHash < 0) substringHash += q;
        }
        return -1;
    }


    At last


    So we are done. It is worth mentioning that the time costs of this algorithm, even in the worst case, are not inferior to the brute force algorithm, and on average it is quite nice to look at the algorithm's indicators.

    I hope that in the near future I will be able to write about which way other people were moving, trying to find effective substring search algorithms. Thanks to those who read to the end.

    The project with tests can be downloaded here.

    Also popular now: