Subtleties of regular expressions. Part 2: refunds and their quantity

    Part 1: metacharacters inside and outside character classes .

    In this part, I would like to talk about how regular expression engines work, why some people think that regular expressions are very slow, and why the authors of many engines do not follow the POSIX standard.

    As in many tasks, there are many approaches and principles for solving it to the problem of “finding a match in a string for regular expression”. The basic principles in this case are 2: matching searches controlled by a regular expression, and matching searches controlled by a string. Let's take a closer look at what each of these principles means.

    String Driven Matching


    This type of regex engine is implemented using the so-called DFA ( Deterministic State Machine ). The main advantage is the huge speed of finding matches (compared to another principle). Naturally, this principle has a bunch of flaws (otherwise it would not be interesting, right?).

    Firstly, the compilation time (meaning compiling a regular expression into the internal structures of the engine, rather than compiling the source code of the program) can be extremely long (compared to another principle). In fact, during compilation, an entire program for comparing a specific pattern is created. Since DFA is “controlled” by the string, and we do not know the string at compile time, the algorithm is created taking into account all possible strings. For a complex expression, this is obviously not so fast.

    Secondly, DKA does not support many of the convenient and necessary features that we are used to in regular expressions. For example, retaining brackets are not supported. Of course, with the set available to the DKA, it is impossible to do in more or less real situations. Therefore, in its pure form, DKA is now not used almost anywhere. But it is successfully used in the so-called "hybrid" engines, which in the simplest case select the engine that is most suitable for a particular regular expression.

    Thus, if you need to search in a huge amount of text and at the same time your regular expression is not too complicated, it is difficult to find anything better than DKA.

    So how does DKA still work? The engine is called "line driven" because it works just like that. The state machine reads the line in which to find matches, and compares the line with the pattern. At each moment in time, the algorithm “knows” which parts of the pattern and the string match. When the algorithm completely scans the entire line, it remains to select only the longest match, which is closer to the beginning of the line and give it as an answer.

    For definiteness, we assume that we have a regular expression ^abc\w+e$and two lines abcdeand abce. It is obvious that the expression matches the first line, but not the second.

    Let us consider how the search for coincidence occurs in the DKA mechanism. For the first line has the following sequence:

    • The first character of the string “a” - coincides in position with the metacharacter ^. Found 1 partial match.
    • The first character of the string “a” - matches the value of the second character of the pattern. The algorithm continues to lead 1 partial match.
    • The second character of the string “b” - coincides in value with the third character of the pattern. The algorithm continues to lead 1 partial match.
    • The third character of the string “c” - coincides in value with the fourth character of the pattern. The algorithm continues to lead 1 partial match.
    • The fourth character of the string "d" - coincides in value with the fifth metacharacter of the \wpattern. A quantifier requires a metacharacter match to occur at least once. The condition is fulfilled. The algorithm continues to lead 1 partial match.
    • The fifth character of the string “e” - coincides in value with the eighth character of the pattern. The quantifier condition for the previous metacharacter is fulfilled; we can count the “e” symbol as a coincidence with the eighth character of the pattern, or as a coincidence with a metacharacter \w. The algorithm does both this and that, and continues to lead 2 partial matches. Remember this point, in another principle there will be a very significant difference.
    • The sixth character of the line is absent - coincides in position with the metacharacter $. One complete match was found, and the second partial does not live up to expectations and is discarded.
    • The line is over, we return the result.


    What will happen in the second case:
    • The beginning of the algorithm is exactly the same, immediately go to the fourth character of the string "e"
    • The fourth character of the string “e” - coincides in value with the fifth metacharacter of the pattern \w. The quantifier condition requires that there exists at least one match for a character. Therefore, "e" is captured by the metacharacter \w. 1 partial match.
    • The fifth character of the line is missing - there is no match with the character “e”. Obviously the string does not match.
    • The line has ended, there are no matches, we return the result.


    What can we see in these two examples? The algorithm doesn’t care if we have a match or not. It simply scans line by line character by character and tries to match matches based on the pattern. If at the end of the work we have several complete matches, the result will be that which is located closer to the left edge of the line and the longest.

    Actually, it becomes clear why the algorithm works so fast. Except in some cases, each character of a string is viewed only 1 time. It also becomes clear why the compilation of the pattern takes a long time - for complex patterns, many conditions and dependencies must be taken into account in order to correctly conduct partial matches.

    At the moment, the DKA engine is used in the grep utility as part of the engine selection mechanism and in the Tcl language as part of the hybrid engine, it is probably used elsewhere, but I don’t know about it.

    Regular Expression Matching


    This type of regular expression engines is implemented using the so-called NKA ( Nondeterministic State Machine ). The main advantage is the support of a large number of functions (compared to another principle). Naturally, this principle also has a bunch of disadvantages.

    Firstly, the operating time can be extremely long (compared to another principle). Up to the exponential asymptotic of the length of the input string. Many popular engines even have protection against this, interrupting the search.

    Secondly, NKA very much depends on how the regular expression is written. If DFA is absolutely the same on the form of recording: time and search results will be exactly the same for any equivalent expressions, then for NKA this is a whole problem. One expression can give exponential asymptotics, and the other is almost linear. Accordingly, the question of optimizing the pattern arises in full growth. NKA is currently used in most regular expression engines.

    A key feature of the classic NKA is that it interrupts the search as soon as it finds the first complete match. Those. search results in some cases will be very different from DKA. This is a feature of the algorithm.

    So why are the authors of the engines in no hurry to follow POSIX? The fact is that the semantics of DKA search are fixed in the standard. Those. of course, it is not literally said that DKA should be used, but it is said about the “leftmost coincidence of the greatest length”. But the NKA returns just the first, and not the greatest length. And what to do? In order to comply with POSIX, they came up with a modification of the algorithm, which is usually called POSIX NKA. And it works much longer than just an NCA. Because in order to meet the standard, one has to go through all the options that a regular expression allows. Further I will explain why it is much longer.

    Compiled (the value is the same as for DFA) NKA is a sequential algorithm, instructions based on the pattern. A key feature of the NCA is the "returns". The search speed directly depends on their number.

    What returns are best explained by example. Let's take the same two lines that were used in the example for DFA, and the same regular expression.

    For the first line:
    • The first metacharacter of the pattern ^- coincides in position with the beginning of the line. The return stack is empty.
    • The second character of the pattern “a” - coincides in value with the first character of the string. The return stack is empty.
    • The third character of the pattern “b” - coincides in value with the second character of the string. The return stack is empty.
    • The fourth character of the pattern “c” - coincides in value with the third character of the string. The return stack is empty.
    • Fifth string metacharacter \w- matches the value of the fourth character of the string. A quantifier requires at least one character, so there can be no return. The return stack is empty.
    • Fifth string metacharacter \w- matches the value of the fifth character of the string. A quantifier requires at least one character that has already been captured. At this point there may be a refund. The return stack contains 1 element.
      At this intermediate stage, the NCA found the whole line “abcde” as a coincidence, but the pattern has not yet been completed to the end. The current position in the line is located behind the symbol "e", although the symbol "e" from the pattern has not yet coincided. It is at this point that the fun begins. Watch your hands.
    • The eighth character of the pattern “e” does not coincide in value with the sixth character of the string (it simply does not exist). No match found. Checking the return stack - there is one record. We carry out the return. After returning, we are in the position “before the e symbol” in the line and in the position “before the e symbol” in the pattern. The return stack is empty.
      When the NCA does not find a match, it checks the return stack. When performing a return, the position in the line is restored and the return is removed from the stack. If the return stack is empty and there is no match, then it is considered that there are no matches from this starting position in the line.
    • The eighth character of the pattern “e” - coincides in value with the fifth character of the string. The return stack is empty.
    • Ninth metacharacter of the pattern $- coincides in position with the end of the line. The return stack is empty.
    • The pattern is over, there is a match at the moment. We return it.


    What will happen in the second case:
    • The beginning of the algorithm is exactly the same, immediately go to the eighth character of the pattern “e”.
    • The eighth character of the pattern “e” - does not coincide in value with the fifth character of the string (again, it simply does not exist). No match found. We check the return stack - it is empty (empty because in the previous step we didn’t add anything to the return stack, because the quantifier +requires at least one match). Refunds cannot be made. The line did not match, we return the result.


    For a better understanding, I’ll tell you that the quantifier +always captures as many characters of a string as it can. Therefore, it is called maximum. A common mistake made by beginners is associated with this. The pattern \(.+\)will match in the line “here is the text (bracket) and here is the text (still the bracket) and here is the text” with the part “(bracket) and here is the text (still the bracket)”, but not at all with “(bracket)”. Just because at the processing stage the .+entire line will be captured to the end, and as soon as we get to the symbol ")", we will immediately find a complete match. Let's change the regex a bit:\([^)]+\)and now there is no longer a problem in it that has interfered with us. But another appeared (and what would happen without it). Now in the line “here is the text (the bracket (and inside another bracket) and, again, the text) and here the text” will be captured “(bracket (and inside another bracket),” which is also of course incorrect. This will happen because at the capture stage, we won’t get to the second closing parenthesis and immediately return the match found on the first one. allows you to solve the problem balance checks (namely checks).

    In fact, I was a little cunning when I described how the algorithm specifically worked. The fact is that it will work this way if the engine applies optimizations (it is obvious that if the pattern starts with a character ^, then it will either coincide at the beginning of the line, or not at all). In fact, if we discard optimizations, the engine will perform exactly the same search starting from each next character of the string, if it does not find a single complete match in the previous step. Those. for the second line, 5 searches in a row will be performed (we start before the character “a”, then “b”, “c”, “e” and the last after the character “e”), and only after that we can say that there are no matches. DKA would have dealt with all cases in 1 pass. But NKA in the absence of coincidence will need 5.

    The difference between the POSIX NKA engine will be that in the first case (when a match was found) the search would also not be stopped, and all 5 searches would be performed, as in the absence of a match, which of course is much longer.

    Now that we’ve examined the technique of performing returns, we can guess where the exponential search time comes from. In some situations, the algorithm falls into the trap, creating a huge number of returns. Consider, for example, an expression (\w+)*a. If you apply it to the “bbb” line, then what will happen? First, the first quantifier+will capture the entire line and transfer control to the second quantifier. Since there are no more characters in the line, we go to the “a” character of the pattern. Which is not in the line, we return by one character, intercept the vacated character in a new group, check “a”, there is no match again, return. Further, the meaning should be clear. The algorithm will iterate over all possible line breaks for capture by two quantifiers. For the string “bbb” these will be the options: “[bbb]”, “[bb] [b]”, “[b] [bb]”, “[b] [b] [b]”. Where square brackets conditionally indicate the captured group. The number of partitions grows exponentially from the length of the string in this case.

    Another situation. Imagine 1 megabyte text and a view pattern.*a. What happens when you search? At each stage, the entire megabyte of the string will be captured and sequential returns will be performed until the “a” character is found. And if there is no such character in the string? Then it will just be done 1000001 search. Which each time will capture the entire line from the current character to the end and return on it until it finds the character "a" (which is not). In order to avoid this, it would be enough in this case to write [^a]*a. But this is for this case. In the general case (as in the example with brackets) this is not enough.

    There are a lot more to write about optimization techniques for returns. The next article I plan to devote to this.

    Based on the book Jeffrey Friedl, Mastering Regular Expressions .

    Also popular now: