l2k March 28, 2009 at 17:54

Regexp is a "programming language": the basics

A few years ago I thought a regexp performed a linear text search, and I was surprised to find out that it does not. I then saw for myself that simply swapping a and b in a pattern of the form (... a ...) | (... b ...) can completely change the result.

So now I will tell you how regexp actually works.
Once you understand these simple principles, you can write any pattern.
As an example I will take a task that looks difficult at first glance but is in fact very simple: finding all strings in quotation marks.

Branching in regexp

Regular expressions are executed as a directed tree.
For example, the branching (a ...) | (b ...) | (c ...) can be written as:

if (char == 'a') {
    ...
} else if (char == 'b') {
    ...
} else if (char == 'c') {
    ...
}

That is why swapping 'b ...' and 'a ...' changes the result: the branches are tried in the order in which you wrote them. So it is worth deciding which branch matters more and ordering the conditions so that they exclude each other cleanly. Moreover, whatever follows 'a' is executed only if everything up to and including 'a' has matched.
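The effect of branch order can be seen in a minimal sketch. It assumes Python's re module as the engine (the article does not name one), and the two-branch patterns and variable names are made up for illustration.

```python
import re

text = "ab"

# Branches are tried left to right. The first branch that lets the
# whole pattern succeed wins, even if a later branch would match more.
short_first = re.match(r"a|ab", text).group()
long_first = re.match(r"ab|a", text).group()

print(short_first)  # a
print(long_first)   # ab
```

The same input and the same two branches give different matches, purely because of the order in which the branches are written.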

Writing a quotes parser

Consider a simple example.

Let's take the test string 123"ABC\"D\EF"GHI""\"KEY and start experimenting on it.

The first thing that comes to mind is the expression /".*"/. It acts according to this algorithm:
1) Find the first "
2) While there is any character (including "), move on
3) The last character must also be "
The result, "ABC\"D\EF"GHI""\", is exactly what this algorithm prescribes, although it is not what we want.
That is, the regexp found the first quote, then kept taking characters (including ") for as long as it could, and stopped at the last " in the string.
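The greedy behaviour described above can be reproduced with a short sketch, assuming Python's re module; the variable names are illustrative only.

```python
import re

# The article's test string; a raw literal keeps every backslash as-is.
text = r'123"ABC\"D\EF"GHI""\"KEY'

# ".*" is greedy: it runs to the end of the string and then backs up
# to the last quote it can find.
greedy_matches = re.findall(r'".*"', text)

for match in greedy_matches:
    print(match)  # "ABC\"D\EF"GHI""\"
```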

Now let's improve the algorithm: make it accept any character except ".
Our regex becomes /"[^"]*"/. It acts according to this algorithm:
1) Find the first "
2) While the character is not ", move on
3) Stumbled upon " - the end.
The result is closer to the truth: "ABC\", "GHI" and "\" are selected.
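A quick check of the negated character class, again as a sketch that assumes Python's re module:

```python
import re

text = r'123"ABC\"D\EF"GHI""\"KEY'

# [^"]* can never cross a quote, so each match ends at the nearest ".
# Escaped quotes are not recognised yet, so \" still ends a match.
simple_matches = re.findall(r'"[^"]*"', text)

for match in simple_matches:
    print(match)
# "ABC\"
# "GHI"
# "\"
```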

Now we need to recognize the sequence \" and keep it inside the string instead of treating it as the end.
To do this, extend the condition [^"] with an alternative that matches \".
It now looks like this: /"([^"]|(\\\\"))*"/.

We added \\\\" to the condition. Why four backslashes? Because the pattern is written inside a string literal, where every \\ stands for one \, so the regexp engine receives \\. And since regexp itself uses \ for expressions such as \w, \d, \s and so on, it takes \\ in the pattern to match one literal \.
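The two layers of escaping can be made visible in a small sketch. It assumes Python, where a raw string literal removes the first layer; the article's own host language is not stated, and the names below are illustrative.

```python
import re

# Ordinary literal: the string parser turns \\\\ into \\ and \" into ".
from_plain_literal = "\\\\\""
# Raw literal: what you type is what the regexp engine receives.
from_raw_literal = r'\\"'

print(from_plain_literal == from_raw_literal)  # True
print(len(from_raw_literal))                   # 3 characters: \ \ "

# The engine reads \\ as one literal backslash, so the pattern
# matches a backslash followed by a quote.
found = re.search(from_raw_literal, r'a\"b').group()
print(found)  # \"
```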

But our expression does not work yet.
Why not? Because [^"] is tried first, and \" is tried only if that fails:
1) Find the first "
2) While the character is not ", move on (the \ in front of an escaped quote is consumed here as an ordinary character);
if the character is " (the first condition failed), compare it with \ (a " is never equal to \)
3) Stumbled upon " - the end.

So the correct approach is to swap the conditions, /"((\\\\")|[^"])*"/, so that \" is checked first and only then any other character that is not ".
Now everything works correctly and selects "ABC\"D\EF" and "". Sounds like magic, right? The algorithm worked correctly.
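To compare the two orderings side by side, here is a sketch assuming Python's re module. The helper all_matches is hypothetical; it is only there because re.findall returns group contents rather than whole matches when the pattern contains groups.

```python
import re

text = r'123"ABC\"D\EF"GHI""\"KEY'

def all_matches(pattern, subject):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, subject)]

# [^"] first: the backslash is eaten as an ordinary character,
# so the \" branch never gets a chance to fire.
wrong_order = all_matches(r'"([^"]|\\")*"', text)

# \" first: an escaped quote is consumed as a pair.
right_order = all_matches(r'"(\\"|[^"])*"', text)

print(wrong_order == all_matches(r'"[^"]*"', text))  # True: no improvement
for match in right_order:
    print(match)
# "ABC\"D\EF"
# ""
```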

I must say right away that the variant [^"\\\\]|(\\\\") does not fit: when the algorithm finds \, it goes to the second condition (\ must be followed by "), which fails in our case on \E. To handle this, the condition has to be: if \ is followed by ", skip that character as well, otherwise just move on. The expression then takes the form /"([^"\\\\]|(\\\\(")?))*"/.
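The difference between the rejected variant and the one with the optional quote can be checked with a sketch like the following, assuming Python's re module; all_matches is a hypothetical helper.

```python
import re

text = r'123"ABC\"D\EF"GHI""\"KEY'

def all_matches(pattern, subject):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, subject)]

# Rejected variant: a backslash may only appear as part of \",
# so the \E inside the first string makes that string unmatchable.
strict_escape = all_matches(r'"([^"\\]|\\")*"', text)

# Variant with an optional quote: \ followed by " is taken as a pair,
# a lone \ is taken on its own.
optional_quote = all_matches(r'"([^"\\]|\\"?)*"', text)

print(strict_escape)  # ['"GHI"']
for match in optional_quote:
    print(match)
# "ABC\"D\EF"
# ""
```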

Improving the algorithm

Let's add parsing of the ' character.
In regular expressions, you can refer to characters matched earlier in later checks (a backreference), and this is what we will use:

We start our expression by searching for any quotation mark or apostrophe, /(["'])... - the quote we find is stored in the special variable \1, which we can use later in the pattern. In this case it holds a single character, either " or '. At the end, to check that this quote is closed, use ...(\1)/. Inside, check not for the absence of ", but for the absence of \1.

We optimize the code a bit and get /(["\'])(\\\\\1|.)*?\1/. Note that I used ? (lazy) in the expression so that the final \1 serves as the stop condition; everything else still works as it did for ".
Why did I do this instead of [^\1]? Because \1 does not work inside [].

Now the code does the following:
1) Find the first " or ' (store it in \1)
2) Is the next character the quote stored in \1?
if not, are the next two characters \" or \' (depending on the opening quote)?
if not, just skip one character
3) Stumbled upon the quote stored in \1 - the end.
And the string 1'2'a3"A'BC\"DEF"GHI""\"KEY is parsed into '2', "A'BC\"DEF" and "".
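The backreference version can be tried out with the sketch below, assuming Python's re module. Triple-quoted raw literals are used only so that both quote characters can appear unescaped; the variable names are illustrative.

```python
import re

text = r'''1'2'a3"A'BC\"DEF"GHI""\"KEY'''

# Group 1 captures the opening quote; \1 must then match the same character.
# The lazy *? makes the engine test the closing \1 before taking
# another character.
quote_pattern = r'''(["'])(\\\1|.)*?\1'''

# finditer + group(0) is used because the pattern contains groups.
quoted = [m.group(0) for m in re.finditer(quote_pattern, text)]

for match in quoted:
    print(match)
# '2'
# "A'BC\"DEF"
# ""
```

The apostrophe inside "A'BC\"DEF" does not end the match, because \1 holds " for that string.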

This expression can be used to pick out string regions inside other constructs.
For example, a function:
function a () {
    b = "{}";
}

Add curly braces to the expression: /{((["\'])(\\\\.|[^\\\\])*?\2|.)*?}/. This expression now selects {b="{}";}. Since one more pair of parentheses appeared in the expression, \1 has become \2 - be sure to keep track of this.
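A sketch of the brace pattern, assuming Python's re module. The function is written on one line here because . does not match a newline by default, the braces are escaped as \{ and \} to rule out any confusion with quantifier syntax, and the naive pattern is added only for comparison; none of this comes from the article.

```python
import re

source = 'function a () { b = "{}"; }'

# Naive attempt: stops at the first }, which is inside the string literal.
naive_body = re.search(r'\{.*?\}', source).group(0)

# Quote-aware attempt: a quoted string is consumed as one unit
# (group 2 holds its opening quote), anything else one character at a time.
body_pattern = r'''\{((["'])(\\.|[^\\])*?\2|.)*?\}'''
quoted_aware_body = re.search(body_pattern, source).group(0)

print(naive_body)         # { b = "{}
print(quoted_aware_body)  # { b = "{}"; }
```

For a multi-line source, passing re.DOTALL would be one way to let . cross line breaks.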

Upd. I forgot to mention backtracking. When the algorithm cannot find a match moving forward, it steps back and tries the other alternatives :). Therefore it is better to use [^\\\\] instead of . (see the last example). In that case the string "\" is not matched, as it should be.
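Backtracking is easy to observe on the three-character input "\" with a sketch like this one, assuming Python's re module; the variable names are illustrative.

```python
import re

# Three characters: quote, backslash, quote. The string is unterminated,
# because its only other quote is escaped.
unterminated = r'"\"'

# With . as the fallback: \" is consumed first, no closing quote follows,
# so the engine backs up, lets . take the backslash alone, and the
# escaped quote is then accepted as the closing one.
with_dot = re.search(r'''(["'])(\\\1|.)*?\1''', unterminated)

# With [^\\] as the fallback: a backslash can only be consumed together
# with the character after it, so backtracking has nothing to fall back on.
without_dot = re.search(r'''(["'])(\\.|[^\\])*?\1''', unterminated)

print(with_dot.group(0))  # "\"
print(without_dot)        # None
```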
