NeoNs January 8, 2011 at 23:16

String Search Algorithms

From the sandbox

Statement of the search problem in the line

Often you have to deal with a specific search, the so-called string search (search in a string). Let there be some text T and a word (or image) W. It is necessary to find the first occurrence of this word in the specified text. This action is typical of any word processing system. (Elements of arrays T and W are symbols of some finite alphabet - for example, {0, 1}, or {a, ..., z}, or {a, ..., i}.)

The most typical application of such a task is documentary search: a collection of documents consisting of a sequence of bibliographic references is given, each link is accompanied by a “descriptor” indicating the subject of the corresponding link. We need to find some keywords found among descriptors. Could be, for example, the request "Programming" and "Java". Such a request can be interpreted as follows: are there articles with descriptors "Programming" and "Java".

A string search is formally defined as follows. Let an array T of N elements and an array W of M elements be given, and 0 Example. It is required to find all occurrences of the pattern W = abaa in the text T = abcabaabcabca.

The sample is included in the text only once, with a shift of S = 3, index i = 4.

Direct search algorithm

The idea of the algorithm:
1. I = 1,
2. compare the I-th character of the array T with the first character of the array W,
3. match → compare the second characters and so on,
4. mismatch → I: = I + 1 and go to point 2 ,

Algorithm termination condition:
1. in a row M comparisons are successful,
2. I + M> N, that is, the word was not found.

Algorithm Complexity:
Worst Case. Let an array T → {AAA ... .AAAB}, a length │T│ = N, a sample W → {A ... .AB}, a length │W│ = M. Obviously, to find a match at the end of the line, you need to make an order of N * M comparisons, that is, O (N * M).

Disadvantages of the algorithm:
1. high complexity - O (N * M), in the worst case - Θ ((N-M + 1) * M);
2. after a mismatch, viewing always starts with the first character of the sample and therefore may include T characters that were previously viewed (if the string is read from secondary memory, such returns take a lot of time);
3. information about the text T obtained by checking this shift S is not used in any way when checking subsequent shifts.

Algorithm of D. Knut, D. Maurice and V. Pratt (KMP-search)

The KMP search algorithm actually requires only the order of N comparisons even in the worst case.
Example.
(The characters that have been compared are underlined.)

After the initial part of the image W partially matches the corresponding characters of the string T, we actually know the passed part of the string and can “calculate” some information (based on the image of W itself), with the help of which we will quickly move forward in the text .

The idea of the KMP search is that with each mismatch of two characters of the text and the image, the image is shifted by the entire distance traveled, since smaller shifts cannot lead to a complete match.

Features of the KMP search:
1. Requires (N + M) character comparisons to get the result;
2. The KMP-search scheme gives a real gain only when a certain number of matches preceded the failure. Only in this case the image shifts by more than one. Unfortunately, coincidences are much less common than mismatches. Therefore, the gain from the KMP search in most cases of texts is very insignificant.

The algorithm of R. Bower and D. Moore (BM search)

In practice, the BM search algorithm is most effective if the pattern W is long and the power of the alphabet is large enough.

The idea of BM search is that the comparison of characters starts from the end of the pattern, and not from the beginning, that is, the comparison of individual characters occurs from right to left. Then, using a certain heuristic procedure, the shift to the right s is calculated. Again, characters are compared, starting at the end of the pattern.

This method not only improves the handling of the worst case, but also gives a win in intermediate situations.
Almost always, except for specially constructed examples, BM search requires significantly fewer N comparisons. In the most favorable circumstances, when the last character of the pattern always falls on a mismatching character of the text, the number of comparisons is (N / M), in the worst case, O ((N-M + 1) * M + p), where p is the power of the alphabet .

Rabin-Karp algorithm (RK-search)

Let the alphabet D = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, that is, each character in the alphabet is a d – decimal digit, where d = │D│.

Example. Let the sample have the form W = 3 1 4 1 5
Calculate the values of the numbers from the window of length | W | = 5 in mod q, q is a prime.

23590 (mod 13) = 8, 35902 (mod 13) = 9, 59023 (mod 13) = 9, ...
k1 = 314157 (mod 13) - sample entry,
k2 = 673997 (mod 13) - idle .

It does not follow from the equality ki = kj (mod q) that ki = kj (for example, 31415 = 67399 (mod 13), but this does not mean that 31415 = 67399). If ki = kj (mod q), then we still need to check whether the lines W [1 ... m] and T [s + 1 ... s + m] actually coincide.
If the prime q is large enough, then the additional cost of analyzing idle responses will be small.
In the worst case, the runtime of the RK algorithm is Θ ((N-M + 1) * M), on average, it works quite fast - during O (N + M) time.

Example: How many idle trips k will the RK algorithm make if
q = 11, 13, 17. Let W = {2 6}

26 mod 11 = 4 → k = 3 idle trials,
26 mod 13 = 0 → k = 1 idle tripping,
26 mod 17 = 9 → k = 0 idle times.

Obviously, the number of blanks k is a function of the prime q (if the sample processing function is mod q) and, in general, the type of function for processing the sample W and text T.

Tags: