Super-greedy quantifiers

    In the article, Regexp is a "programming language." The basics were: to write a regular expression that contains text in double quotes in the string of characters, and inside the quotes "..." there may be the characters themselves "if they are escaped with a backslash, for example:
    one two "foo:=\"quux\"; print" three "four"
    Here our regex must find the correspondence to the string with the
    "foo:=\"quux\"; print"
    Author ( of that article) the following solution was proposed:
    / " ( \\" | [^"] )* " /x
    (hereinafter Perl syntax; the / x switch means that gaps in the regex are not taken into account, we added them only for illustrative purposes, so that parts of the regex do not merge into a single “modem noise”).
    This regex works when there is a match (text in quotation marks). The problem is that it finds the text in quotation marks even when the text in quotation marks (according to our backslash escape rules) is simply not there. For example, in the "\" chain, regex finds a match (equal to the entire string "\"), although it should not be: the quotation mark is open, the escaped quote ... but there is no closing quote.
    The situation is easy to correct, the initial task is not difficult to solve by making a few simple changes to regex ... but this is not about that, but that if you have a modern tool in your hands, that is, the regex engine (the latest version of Perl, Java or PHP with PCRE), then you can "fix" the described regex by adding only 1 character to it. Which one? Where to? Why? If you know the answers, then you should not read further ;-)

    / " ( \\" | [^"] )* " /x
    let's see why the original version matches the string "\"
    Events develop (very roughly) like this:
    1. "(in regex) matches" (in the checked chain)
    2. \\ "(the 1st option in the alternative (... | ...) in regex) matches \" (in the chain)
    3. We have come to the end of the chain. There is still a closing quotation mark in regex "(or again the contents of the alternative (... | ...)), but in the chain it’s empty, there’s nothing to match." The regex engine sees that there is no match.
      There seems to be an end to the fairy tale ... but there it was! This is where the fun begins. The regex engine rolls back to the beginning of step 2. And this time it tries to match the next chain, the second alternative: [^ "]. And - oddly! - it works.
    4. [^ "] in regex matches \ in the chain. Is backlash not a quote? Not a quote. So, a match.
    5. The regex engine will try to find (... | ...) again, because * is a greedy quantifier (A * means "find as many gizmos as possible matching A"). He will not succeed: only the quotation mark remains in the chain, which is neither a combination of \ "nor a non-quotation mark [^"]
    6. Then the regex engine will match the rest of the chain, the lone quotation mark, the remaining part of the regex, the lone quotation mark. Match! Match found.

    In general, the whole thing is “magic bubbles”, or rather, the rollback of the regular expression engine. It rolls back after the alternative with a quantifier (... | ...) * captures the rest of the chain to the end, and the rest of the regex "gets nothing." Is it possible to ask the regex engine not to roll back? It turns out that in modern regex engines (in particular, in Perl 5.10, in relatively newer versions of Java and PHP with a fresh PCRE) this is possible. Come to the rescue ... Chip and Dale Rescuers Malibu

    ... Super-greedy quantifiers

    And what kind of fruit is it? Everyone knows the usual quantifiers:
    * + ? {m, n}
    They are greedy , that is, they “grab” as much of the checked chain as possible. There are also non-greedy (non-greedy, lazy, reluctant, lazy) quantifiers
    *? +? ?? {m, n}?
    that capture as little as possible from the chain.
    In modern regex engines, in addition to these two “classical” types of quantifiers, possessive quantifiers are also implemented :
    *+ ++ ?+ {m, n}+
    We do not know the well-established Russian translation for this term, so they were called, as it seemed reasonable: super-greedy quantifiers. Why "super greedy"? Because they, firstly, behave as greedy, that is, they capture as much as possible from the chain. Secondly, they, in a sense, are even more “greedy” greedy and go further than them: once “grabbing” something, they never roll back, they do not “give” pieces of the rehexa they have captured to the next parts.
    Example. Regex
    / " .* " /x
    with line processing
    one "two" three "four"
    finds a match:
    "two" three "four"
    as the * greedy and "eats" all the quotes, which only can eat (including he eats and the quotation mark after four, but then he "gives" it back, that is to the engine regex does not see.. why match the last "from regex).
    But the" super-greedy "option
    / " .*+ " /x
    will not find anything in the same chain at all! (i.e. there will be no match). Why? Because. * + Eats up the rest of the chain after the opening quotation mark, and he doesn’t care if there is anything to close the closing one from regex. He still won’t give away part of the piece of the chain “eaten” by him.
    Why do we need such strange quantifiers? , they are very useful when the "rollback" of the regex engine back is undesirable for us. And as practice shows, the rollback is not always desirable ... unless, of course, we are talking about embezzlement of public money ;-)

    So we have the initial task of finding text in quotation marks - there was just such a problem ema, a kickback worked, which led to the match in "unnecessary" chains like "\" Add the
    / " ( \\" | [^"] )* " /x
    little red plus sign to the original regex :
    / " ( \\" | [^"] )*+ " /x
    and the problem is solved! Such a regex option will not match strings like
    who "\"the\"heck\" is quux anyway

    From a gun on sparrows

    In fact, for such a simple task as finding “quoted” text, no advanced features like possessive quantifiers are needed. The following regex will do a great job of this task:
    /" ( \\. | [^"\\] )* "/x
    that is, literally “an opening quotation mark, (backslash followed by any character OR any character except backslash and quotation marks) zero or more times, closing the quotation mark”.
    If the new-line character can occur after \ and we consider this permissible, then you need to add the / s switch, because without this switch, a dot. the new-line character does not match.

    Also popular now: