Improving regular expressions

    After reading a book about regular expressions (hereinafter simply "regexes"), I had some thoughts about their readability. When regexes first appeared and there were only a few symbols like \d, \w and so on, things were probably not so scary, although even then it was worth thinking about clarity. Today, reading code that contains regexes is a quiet horror. A short regex poses no particular problem, but as expressions grow more complex and assorted brackets appear, everything turns into a nightmare. The situation is aggravated by the fact that in some languages (we will not point fingers) you constantly have to double the backslashes.

    In addition, the regex notation now used in most programming languages forces you to resort to various tricks in some seemingly simple situations. The first example that comes to mind is composing a regular expression for the rule: if "abc", then NOT "xyz".
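
    As an illustration of such a trick, here is a minimal Python sketch. It assumes the rule is read as "abc that is not immediately followed by xyz", which is only one possible interpretation, and it relies on the negative lookahead of the standard re module.

```python
import re

# "abc" that is NOT immediately followed by "xyz".
# The negative lookahead (?!...) is the usual workaround in classical notation.
pattern = re.compile(r"abc(?!xyz)")

found_plain = pattern.search("abcdef")    # "abc" followed by "def"
found_blocked = pattern.search("abcxyz")  # "abc" followed by the forbidden "xyz"
```

    Nothing in the pattern itself says "not"; the reader has to know what (?!...) stands for.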



    In my opinion, it is time to abandon the notation currently in use and create a new one that is closer to ordinary programming languages, because regex notation is essentially a language, just a terribly designed one. The worst thing about today's notation is the abundance of brackets such as (...), (?:...), (?=...), (?!...), (?<=...), (?<!...) and the like. Because of them, expressions become confusing: you cannot take in the whole regex at a glance and immediately say what it is looking for. Instead you have to check every character, not forgetting, for example, that ^ in the middle of a regex means the beginning of a line, while at the start of square brackets [^...] it means negation. It is also wrong that whenever a new feature appears, the developers invent yet another notation, some (&^%$#@...), which by itself tells you absolutely nothing.

    After all, what is the charm of ordinary programming languages (leaving aside extreme cases)? When we see an if or while statement in an unfamiliar language, we can immediately say roughly what it does. Yes, you could replace these operators with character sequences like @#$% and #&$^ respectively, and you could even get used to them, but as the joke about the Russian language lesson in Georgia goes, "this must be memorized, because it is impossible to understand."

    Perhaps the situation could be improved by smart code editors that highlight the brackets (?:...), (?=...), etc. differently inside a regular expression, so that their scope is visible at once, but for most programming languages this is almost impossible. There a regex is just a string, and the editor would have to determine from the string's contents whether it is a regex or plain text. And in any case, with deeply nested brackets the regex would turn into a multi-colored rainbow.

    Generally speaking, regex readability has already improved considerably thanks to modes that allow a regex to be written across several lines and thanks to comments inside the regex. There is even a construct with the self-explanatory form (?if...); in Perl (better not mentioned after dark) program code can be embedded in a regular expression; and in .NET, instead of a simple regex replacement, the replacement value can be generated by a specially trained delegate. On the whole it is already possible to write a more or less understandable regex, but it is still not the real thing; these are more like crutches.
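
    For readers who have not met these features, here is a small Python sketch of two of them: the multi-line mode with comments (re.VERBOSE) and a replacement computed by a function, which plays roughly the role of the .NET delegate. The number pattern and the bracket function are made up for this illustration.

```python
import re

# Multi-line mode: whitespace inside the pattern is ignored and comments are allowed.
number = re.compile(r"""
    [+-]?           # optional sign
    \d+             # integer part
    (?: \. \d+ )?   # optional fractional part
""", re.VERBOSE)


def bracket(match: re.Match) -> str:
    """Build the replacement text from the match instead of using a fixed string."""
    return "<" + match.group(0) + ">"


result = number.sub(bracket, "x = -3.14, y = 42")
print(result)  # x = <-3.14>, y = <42>
```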

    It is time to create a regex language that resembles other "human" programming languages rather than Brainfuck. Then it would be possible to give it sensible syntax highlighting, IntelliSense-style hints and, in the future, perhaps even step-by-step regex debugging.

    Below I would like to show what I would like regexes to look like.

    First, regexes need to be distinguished somehow from ordinary strings. Functions clearly need strings to work with, and I am not sure that regexes should be built into the languages themselves, as is done in Perl. Let them remain strings, but mark them inside the quotes with some additional notation. It could be anything: for example, instead of "\d\w" (for clarity I will not double the backslashes) one could write "!\d\w!" or "<\d\w>", and then an editor could easily tell regexes from ordinary strings. From here on I will use the form "!...!", but this, like the rest of the notation, is not important; what matters is the idea.

    Second, regexes should be written only in the mode where spaces and line breaks are ignored, and to separate literals, which always stay unchanged, from regex constructs inside the expression, the literals can be enclosed in quotes (it does not matter which kind). For example, instead of "abcd\d\wxyz" one could write:

    "! 'abcd'
    \d\w
    'xyz'
    !"


    Or even "!'abcd' \d\w 'xyz'!"

    A code editor can then color abcd and xyz separately. It might be worth using the "+" sign to join these parts. That would be even clearer: "!'abcd' + \d\w + 'xyz'!", because the individual parts of the regex are better separated visually.
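
    Something close to this separation of literals and constructs can already be approximated in Python by joining parts with "+" and passing the literal parts through re.escape. This is only a sketch of the idea, not the proposed notation itself.

```python
import re

# Literals go through re.escape, regex constructs are written as raw strings,
# and "+" joins the parts, much like "!'abcd' + \d\w + 'xyz'!".
pattern = re.compile(re.escape("abcd") + r"\d\w" + re.escape("xyz"))

# A literal containing a special character stays a literal after escaping.
dotted = re.compile(re.escape("a.b") + r"\d")
```

    With this style the "." in 'a.b' matches only a real dot, so "a.b7" matches while "axb7" does not.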

    You may be bothered that the "+" sign currently means "one or more matches", but that is no problem, because nobody is going to use it in that sense any more: it is not logical. There are such self-explanatory constructs as {min, max}, so let us use them together with the "*" operator. The "*" operator should be used precisely in the sense of "multiply", that is, the expression "!'abc' * 3!" means that the string 'abc' must be repeated 3 times. The regex "!'abc' * {1, 3}!" means that 'abc' must be repeated from 1 to 3 times. Similarly, "!'abc' * {1,}!" can be used for "one or more matches" instead of "+", and instead of the old "*" operator one writes "!'abc' * {0,}!". The form "!'abc' * {3, 3}!" is equivalent to "!'abc' * 3!", which we have already seen. With the lower bound omitted, the old "*" operator becomes simply "!'abc' * {,}!".

    Perhaps square or round brackets should be used instead of curly ones.

    The question remains how to denote the minimal (non-greedy) "*" operator. One could use the division operator, but that is not logical either, so it can be written directly as "!'abc' min* 3!". Here min* is a single operator, written without a space. I do not really like this notation, but at least its name explains what it does.
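
    To make the mapping onto today's quantifiers concrete, here is a minimal Python sketch. The helper name repeat and its parameters are hypothetical; it simply turns the proposed "multiplication" into classical {min,max} notation, with lazy=True standing in for min*.

```python
import re
from typing import Optional


def repeat(expr: str, low: int = 0, high: Optional[int] = None, lazy: bool = False) -> str:
    """Translate the proposed `expr * {low, high}` into classical notation.

    high=None means "no upper bound"; lazy=True corresponds to `min*`.
    """
    if high is not None and high == low:
        quantifier = "{%d}" % low
    else:
        quantifier = "{%d,%s}" % (low, "" if high is None else high)
    return "(?:%s)%s%s" % (expr, quantifier, "?" if lazy else "")


three_times = repeat("abc", 3, 3)                  # 'abc' * 3
one_to_three = repeat("a", 1, 3)                   # 'a' * {1, 3}
lazy_one_to_three = repeat("a", 1, 3, lazy=True)   # non-greedy variant
one_or_more = repeat("abc", 1)                     # 'abc' * {1,}
```

    Applied to the text "aaaa", the greedy form takes three characters and the lazy form takes one.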

    Most brackets should be replaced with built-in functions. For example, instead of "[abc]" one would write "!any(a, b, c)!"; in the same way the expression "(?:abc)|(?:xyz)" can be replaced with "!any('abc', 'xyz')!", which also lets us get rid of the "|" operator. Regexes can be used as function arguments, for example "!any(\d\w, 'abc')!".
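
    A function of this kind is easy to emulate on top of an existing engine. The sketch below uses a hypothetical helper any_of that joins its arguments with "|" inside a non-capturing group; the arguments are assumed to be already valid classical regex fragments.

```python
import re


def any_of(*alternatives: str) -> str:
    """Hypothetical counterpart of any(...): match any one of the given fragments."""
    return "(?:" + "|".join(alternatives) + ")"


# Counterpart of "!any(\d\w, 'abc')!": a regex fragment and a literal side by side.
pattern = re.compile(any_of(r"\d\w", re.escape("abc")))
```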

    We need to decide what to do with the simplest expressions like \w, \b, \d, etc. On the one hand they are quite compact, but I, for example, like the notation that can currently be used inside square brackets: [:alnum:]. For convenience it could be replaced with a form like _alnum_. Or perhaps the simplest ones, \d and \w, should be left as they are. And instead of ".", which is hard to spot in the text, one could write _any_. Spaces and tabs, which are ignored in the expression itself, can be written as _space_ or simply put in quotes.

    An ordinary if-then-else statement is a must. Its meaning is that if the expression after if matches, the regex in the then branch is checked; otherwise the one in the else branch is. I think the word then can be omitted. Then regexes like this become possible:

    "! 'abc'
    if (\w * 3)
    {
    'xyz'
    }
    else
    {
    \d * {1, } 'klmn'
    }
    !"


    Here I used the syntax of C-like languages, but that is not essential. Literally, this expression means: first comes the string 'abc', then the regex '\w * 3' is checked; if it matches, 'xyz' must follow, otherwise at least one digit must follow, and then 'klmn'.
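
    One way to emulate such a branch with today's engines is a pair of lookaheads. The sketch below assumes that the condition is only checked and does not consume any text, which the description above leaves open; the helper if_then_else and the "id:" pattern are hypothetical and chosen just for the demonstration.

```python
import re


def if_then_else(condition: str, then: str, otherwise: str) -> str:
    """Emulate `if (condition) { then } else { otherwise }` with lookaheads.

    Assumption of this sketch: the condition is a zero-width check.
    """
    return f"(?:(?={condition})(?:{then})|(?!{condition})(?:{otherwise}))"


# After "id:", a digit means a three-digit code must follow;
# otherwise a lowercase word must follow.
pattern = re.compile("id:" + if_then_else(r"\d", r"\d{3}", r"[a-z]+"))
```

    Here "id:123" and "id:abc" match, while "id:12" does not: the condition holds, so only the then branch is tried.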

    It might even be worth introducing operators like case, while and for. In addition, the logical operations AND, OR and NOT are needed for use in conditions. I am not sure about AND and OR, because the expression "!if ('abc' && 'xyz')!" is equivalent to "!if ('abcxyz')!", and "!if ('abc' || 'xyz')!" to "!if (any('abc', 'xyz'))!". But the negation operator is definitely needed to specify what must not appear at a given place.

    A variable is needed that holds the position in the string where the search is currently taking place (call it _pos_), as well as a variable that holds the string to which the regex is applied (_this_). Then the "^" operator can be replaced by the more understandable "!_pos_ == 0!", and "$" by "!_pos_ == (strlen(_this_) - 1)!". It may be worth introducing a separate notation for the end of the string, for example by analogy with Python: _pos_ == -1. These same variables will make lookahead and lookbehind checks possible.
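
    Python's match objects already expose positions, so the idea of replacing anchors with explicit position comparisons can be tried out today. In this sketch the end-of-text test is written as m.end() == len(text), which is the sketch's own convention and not the _pos_ formula from the paragraph above.

```python
import re

text = "abc abc"

# For every occurrence of "abc", record where it starts and whether it
# sits at the very beginning or the very end of the text.
positions = [
    (m.start(), m.start() == 0, m.end() == len(text))
    for m in re.finditer("abc", text)
]
print(positions)  # [(0, True, False), (4, False, True)]
```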

    Comments must stay. What they look like is not important.

    The assignment operator must work in two modes. The first checks the regular expression and assigns the string that matched it to a variable, which is what notation like "(?...)" is used for today: "!foo = \w\d*;!". A semicolon is needed to show where the assignment ends.
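
    For comparison, this is roughly how the first mode looks in Python today, using a named group. The sample text is made up; the group name foo follows the example above.

```python
import re

# Today's counterpart of "!foo = \w\d*;!": a named group captures the matched text.
match = re.search(r"(?P<foo>\w\d*)", "x42 = 7")
foo = match.group("foo")
print(foo)  # x42
```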

    The second assignment mode stores a regular expression without checking it. This is useful for clarity, for example:

    "!
    foo = !\d\w*!

    'abc' foo 'xyz' foo
    !"


    Here the expression !\d\w*! (note the exclamation marks) is subsequently used under the variable name foo.
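
    In a host language the same effect is usually achieved by keeping the fragment in an ordinary variable and splicing it into the final pattern. A minimal Python sketch of the example above, with the literals written directly into the pattern:

```python
import re

# The fragment is stored without being matched against anything.
foo = r"\d\w*"

# Counterpart of "'abc' foo 'xyz' foo": the fragment is reused twice.
pattern = re.compile(f"abc(?:{foo})xyz(?:{foo})")
```

    Wrapping each use in (?:...) keeps the fragment self-contained if it ever grows an alternation.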

    These are the main ideas I have had about regexes. It would be interesting to try such expressions in practice, but unfortunately I am unlikely to get around to implementing such a parser. One could start by translating such expressions into the classical regex form and then handing them to an existing library.

    Finally, a small example that finds a URL. It probably does not account for everything; for instance, it assumes that the top-level domain can only be com, net, info or a two-letter one.

    "!
    unicode = !% any(\d, A-F) * 2! // Unicode representation in the address.
    // The variable is only created, not checked
    domain = !any('com', 'net', 'info', (a-z) * {1, 2})!
    host = !any(\w, '_', unicode)!

    'http://' (host '.') * {1,} domain '/' * {0, 1}
    !"

    I hope I have not made a mistake anywhere, but even if I have, it does not matter; the main thing was to convey the idea.
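
    For comparison, here is a hand translation of the URL example into classical notation, assembled from named fragments in Python. It is a sketch under two assumptions of mine: each host label is taken to be one or more characters, and the pattern is applied to the whole string. It inherits the same simplifications about domain zones.

```python
import re

# Fragments are stored first and only combined at the end.
unicode_escape = r"%[0-9A-F]{2}"                # "%" followed by two hex digits
domain = r"(?:com|net|info|[a-z]{1,2})"         # allowed domain zones
host = rf"(?:\w|{unicode_escape})+"             # assumption: one or more characters

url = re.compile(rf"http://(?:{host}\.)+{domain}/?")
```

    With these definitions "http://example.com/" and "http://www.example.ru" match as whole strings, while "http://example.org" does not, because "org" is not among the allowed zones.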

    In conclusion, I will say once more that the main goal of all this was to work out how to make regular expressions more readable. Of course, the amount of text to type will grow, but for large regexes it is worth it.
