Incorrect interpretation of the Aho-Korasik algorithm

In the distant (or maybe not very distant) 1975 Alfred Aho and Margaret Korasik published an article, which described in detail the algorithm for effective search of all occurrences of all sample strings in a given string. In the future, this algorithm was called the “Aho-Korasik algorithm”. It is not surprising that after some time there appeared technical and “artistic” translations of this article into Russian. Sometimes I even came across free expositions of the essence of the algorithm in the form in which the author understands it. Moreover, the latter, judging by the text, found out about the algorithm far from the original source. I don’t know if there was a translation that served as the primary source of the problem, but more and more I come across articles describing the Aho-Korasik algorithm, in which the same cardinal error was made. The last straw was an article by rmqon a hub which this error did not pass. Actually, I would like to tell the public about this mistake in my article.
Note: the rmq user article , as it turned out later, does not contain this error. How it is solved there I will tell below. And many thanks to him, if not for his topic, I would not have written my own and would not have received an invite!
Before starting, a few words about the target audience: Most likely, those who have been familiar with the Aho-Korasik algorithm for a long time will not be interested in my article, since they have long known about its features. At least, all my familiar programmers more than once using this algorithm know about the existence of its incorrect interpretations firsthand. But for beginners and those who have not been able to often apply it in practice, this article can be quite useful.
So, let's begin.

In order not to waste readers' time, I will miss a detailed description of the algorithm (this can be read in the original article) and start with an example and a picture in order to quickly go over the terms and talk further “in one language”.
Let us look for the following list of sample strings: {"HE", "SHE", "HIS", "HER"}. First of all, we need to build an automaton. The very finite state machine with states and transitions, on the basis of which the Aho-Korasik algorithm is built. The black circle indicates the initial state of the automaton. Blue circles will indicate the next state of the automaton, the achievement of which does not mean the fact of detection of the sample string. And, finally, blue circles with a crosshair - this is the next state of the machine, the achievement of which means the fact of detecting the sample string. Naturally, transitions stretch between them - black arrows signed with letters, which are the condition for the transition. We will call them the main transitions. If there is no main transition for the next input symbol, the machine goes to its initial state. We call them zero transitions. I don’t specifically indicate these transitions, so as not to clutter up the picture. About zero transitions, it is worth noting that when you click on them, the input symbol is not absorbed, that is, it remains in the input line and will be used again after the transition. An exception is the zero transition from the initial state, where, in fact, the state does not change, and the input symbol is absorbed.

The figure shows four blue circles with a crosshair, each of which corresponds to one of the lines of the samples. In fact, this very error is already inherent in this example. The original algorithm describes the automaton a little differently. But first things first.
Recall the so-called suffix links. These are transitions that indicate which state to switch to if there is no main transition for the next character of the input line. In the following figure, they are indicated by dotted black arrows. We will call them suffix transitions or suffix links. Passing through suffix links by analogy with a null transition does not absorb the input character. As in the first case, I do not depict transitions leading to the initial state in the absence of the main transition.

In the example, without suffix links, but with zero transitions, all occurrences (“HE”, “HER”, “HE”, “HIS”) will be found in the “HERHEHIS” line. But in the input string “HISHER” only three entries will be found (“HIS”, “HE”, “HER”) and the machine will not detect the occurrence of the sample string “SHE”, which is easy to verify. In the case of suffix links for the string “HERHEHIS” the result will be the same, and for the string “HISHER” the result will be a list of four found entries: (“HIS”, “SHE”, “HE”, “HER”). That is, in the latter case, all occurrences were found. No, no, this is not the mistake I wanted to talk about. In no case do not think so. For someone to forget about suffix links, I have not seen this before. Still to come.
I did not begin to describe in detail the algorithm for constructing an automaton, but we will consider the search algorithm for the cases described above in more detail. To do this, for convenience, we will first give an automaton without suffix links, in which we number its states (blue numbers near the corresponding circles).

Yes, by the way, let me remind you that, despite the absence of suffix links, this machine goes into its initial state in the absence of a main direction while preserving the input symbol. These transitions are not shown in the figure. Let's get started. Initial state 0, input string “HERHEHIS”:
0 H -> 1 ERHEHIS
1 E -> 2 RHEHIS (“HE”)
2 R -> 3 HEHIS (“HER”)
3 H -> 0 HEHIS //главного направления нет, переходим в начальное состояние с сохранением входного символа
0 H -> 1 EHIS
1 E -> 2 HIS (“HE”)
2 H -> 0 HIS //главного направления снова нет
0 H -> 1 IS
1 I -> 7 S
7 S -> 8 (“HIS”)

Here they are - all four occurrences. Now let's do the same with the “HISHER” line:
0 H -> 1 ISHER
1 I -> 7 SHER
7 S -> 8 HER (“HIS”)
8 H -> 0 HER
0 H -> 1 ER
1 E -> 2 R (“HE”)
2 R -> 3 (“HER”)

As expected, only three occurrences were found. Now you can delve into the essence of the examples and formulate the following statement: "Not all occurrences will be found in the machine without suffix links for the input string containing overlapping sample strings." We are talking about overlapping ones (“HIS” and “SHE” in “HISHE”), but not entering each other (“HE” and “HER” in “HER”).
In a machine with suffix links, of course, this does not happen. Let's check on the same examples. Let's number the state of the machine with suffix links:

Take the input string “HERHEHIS”:
0 H -> 1 ERHEHIS
1 E -> 2 RHEHIS (“HE”)
2 R -> 3 HEHIS (“HER”)
3 H -> 0 HEHIS //главного направления и суффиксных ссылок нет
0 H -> 1 EHIS
1 E -> 2 HIS (“HE”)
2 H -> 0 HIS //главного направления и суффиксных ссылок нет
0 H -> 1 IS
1 I -> 7 S
7 S -> 8 (“HIS”)

For this input line, the progress of the machine is exactly the same as for the machine without suffix links. Now slip the line “HISHER” to the machine:
0 H -> 1 ISHER
1 I -> 7 SHER
7 S -> 8 HER (“HIS”)
8 H -> 4 HER //переход по суффиксной ссылке
4 H -> 5 ER
5 E -> 6 R (“SHE”)
6 R -> 2 R (“HE”) //переход по суффиксной ссылке
2 R -> 3 (“HER”)

Indeed, all four occurrences were discovered. What can be formulated here: "Without suffix links, the hope of finding all occurrences of sample strings tends to zero."
Often I came across descriptions of the Aho-Korasik algorithm, which ended exactly on this. The algorithm for constructing the tree was written in detail, and it was written about the search algorithm: "it is elementary - we go through the states and give the result." And the authors of these works would be absolutely right if they were not mistaken. To begin with, I will show an example that even an automaton with suffix links cannot handle. We will construct a similar automaton, but introduce two additional sample strings into it: “I” and “IS”. Schematically, such an automaton with its main and suffix transitions can be represented as follows:

As an input line we take all the same “HISHER” known to us. So, let's go:
0 H -> 1 ISHER
1 I -> 7 SHER
7 S -> 8 HER (“HIS”)
8 H -> 10 HER (“IS”)
10 H -> 4 HER
4 H -> 5 ER
5 E -> 6 R (“SHE”)
6 R -> 2 R (“HE”)
2 R -> 3 (“HER”)

In total, we have five occurrences found. But among them there is no pattern string “I”. What is the matter? How did it happen? In fact, everything is elementary. When following the suffix link, we passed state 9, which we could only visit from states 7 and 0. Therefore, wherever there is a substring “HIS”, the sample string “I” will not be found. Similarly, in the input line “HISHER” we would find the sample string “H” only once, after adding it to the list and building the corresponding automaton. And the point here is not that the lines “I” and “H” consist of one character. Check for yourself: replace “HIS” with “HIPS”, “IS” with “IPS”, “I” with “IP” in the list, and feed the input line “HIPSHER” to the machine. The pattern string “IP” will not be found for the same reasons. So what was wrong: the structure of the automaton or the search algorithm?
Options for solving this problem have one obvious favorite. Faced with a similar example and sorting out the reason, the developers choose the most obvious frontal solution: “after clicking on the suffix link, run back through the main transitions to the initial state”. Indeed, this saves us from the problem, and such an implementation has the right to life. Only two can be noted here:
  1. Assessment of the complexity of the algorithm (its time costs) is changing for the worse;
  2. There was really no problem.

The rmq user’s article mentioned above uses a different solution: run not by the main transitions, but by suffix links to the initial state, and only by “good suffix links”, that is, those that lead to the state of the fixing pattern string. Such an approach is an order of magnitude better than the “phaforitic” solution, but it nevertheless forces us to do extra runs on states. Despite this, in practical implementation, this algorithm will work in the same time as the original one. Thanks, rmq , for an interesting solution.
Let us return to the original algorithm. For those who are familiar with the original article, or at least with its correct translation, they could find an error in the very first machine. If for a certain state we record the fact of detecting not only the full (from the initial state) found pattern string, but all its suffixes which are also pattern strings, then everything will work correctly.
I will give once again a diagram of the last considered automaton with some changes.

Upon reaching state 6, we fix the occurrences of “SHE” and “HE”. It is on state 6, and not after following the suffix link. Similarly, state 7 will record the occurrence of the pattern string “I”, and state 8 - “HIS” and “IS”. In this case, the search algorithm will have to remove the fixation of the entry after following the suffix link, otherwise the same entry will be fixed twice. For example, after the transition from the 6th state to the 2nd state, it is not necessary to fix the occurrence of “HE”, we did this earlier after the transition from the 5th state to the 6th state. Now no need to return to the initial state. The search is really simple and straightforward. Parsing the input string “HISHER” will now look like this:
0 H -> 1 ISHER
1 I -> 7 SHER (“I”)
7 S -> 8 HER (“HIS”, “IS”)
8 H -> 10 HER
10 H -> 4 HER
4 H -> 5 ER
5 E -> 6 R (“SHE”, “HE”)
6 R -> 2 R
2 R -> 3 (“HER”)

Great, now all 6 occurrences are detected. It may seem to some that this complicates the construction of an automaton, but this is not at all the case. It will not be more complicated than our previous example. I did not describe the algorithm for constructing an automaton from the very beginning, so I will not give it now. It is beautifully described in an original article by Alfred Aho and Margaret Korasik, which perhaps someone should read.
In conclusion, I would like to advise those who read something not from the original source to be aware of its possible incorrectness. And if possible, get to the truth before making another mistake in implementation. And for those who write “adapted” statements, take your work more responsibly, without misleading the rest.

Also popular now: