michurin July 9, 2009 at 09:53

Caution! Regexps!

Do you often use regular expressions? Do you think about how justified their use is? What are the alternatives, what are the possibilities and limitations? What is the cost of using regexp?

I have long and often noticed that people (especially from the Perl world) tend to mystify regular expressions, endowing them (in their minds) with universal super-powers.

By this article, I urge you to think ~~again~~ .

There are two errors. And the first one

“Regular expressions are equally good for all tasks.”

However, in simple tasks, regexps are ineffective.

I am silent about decisions like:

if (/./) {print "not empty \ n"; }

obviously it is less efficient than comparing with an empty string:

if ($ _ ne "") {print "not empty \ n"; }

(by the way, these two conditions are not completely equivalent and this may conceal tricks that will then suddenly come out at the most inopportune moment).

But there are blatantly irrational decisions (I’m not saying that they are bad, but they are definitely not rational).

I propose a simple test (to determine if the penultimate dozen characters consists of one “a”)

use Benchmark qw (: all);
my $ a = 'a'x8000;
cmpthese (1_000_000, {
  'regex' => sub {$ a = ~ / a {10}. {10} $ /; },
  'noregex' => sub {substr ($ a, -20, 10) eq "aaaaaaaaaa"; },
});

The result does not require comments:

             Rate regex noregex
regex 414 / s - -100%
noregex 4413793 / s 1065100% -

<Lyrical digression 1> Strictly speaking, I chose a non-random example for the test, but the essence of this does not change; For those who are interested in optimizing regexp, I recommend reading Freedle's Regular Expressions. <Lyrical digression 2> The fact that the terrifyingly braking solution looks more elegant on the pearl should not force programmers to use an irrational approach to solving the problem. Perhaps this circumstance should make the programmer think: "and not choose the language in which the optimal solution looks compact and beautiful." Here are two pieces of Python code:

# regex option (brake)
import re
cre = re.compile (r'a {10}. {10} $ ')
if (cre.search (string)):
    # do something
# option with explicit substring comparison
if (string [-20: -10] == 'aaaaaaaaaa'):
    # do something

But this is a completely different story. And I will move on to the second error:

“Regular expressions are equally good for all tasks.”

This time we will focus on problems that cannot be solved with regular expressions.

About four years ago, I was at an interview in one large company . The interview was generally dull a little more, than completely, but I finally got the question: “Write a regular expression that checks the correct placement of the brackets.” (That is, the absence of situations "{<}>".)

I immediately asked what the maximum parenthesis depth is allowed. The answer was bewilderment: “Anyone!”

Obviously, the questioner was absolutely sure that such an expression can be written, that I will write it now, and he will easily check it.

What a bitter fallacy!

If someone did not have time to figure it out, I’ll explain.

A regular expression describes a state machine. If you need to control an infinite number of brackets, then the state machine will not help you.

The question asked is akin to the question: "What is the sum of the angles of a triangle if its sides are 1, 2, and 50 centimeters." He betrays complete ignorance of the subject.

Although, strictly speaking,

Perl developers back in 2004 (if I don’t confuse) assured that this problem is being solved. For such things, the construct "(?? {...})" was invented. But the work of this construction very often leads to an ugly fall of Perl, usually with this message:

panic: regfree data code 'b' during global destruction.

(the letter "b" depends on your encoding :-))

It is not surprising that these add-ons still cannot get rid of the stigma of experimentation. In PCRE, this feature has not been enabled.

The developers did not stop there

and recently introduced a new mechanism and syntax "(? 1)". The mechanism is spared the mass of clumsiness inherent in the old version.

But in my opinion, recursive regular expressions should have been done separately and not called them “regulars,” because

the use of recursion in regular expressions makes them irregular .

They no longer describe a finite state machine, but describe a full-fledged Turing machine. On such regular expressions, any computational problem can be solved .

Thus, Perl, in fact, lost its regular expressions. Now the programmer cannot be sure that his "regular expression" will use a finite amount of memory. Or do not go in cycles. (Classic regular expressions satisfy these requirements: they never cycle and require a finite amount of memory, which is determined when the expression is compiled.) I belong to people who consider this refinement harmful, bearing an inexhaustible charge of misfortunes and vulnerabilities, and depriving the programmer of access to the simple and reliable regular expression engine.

But this is again a completely different story. (And, by the way, the topic of recursion in regular expressions was already covered on Habr.)

And I just wanted to say: “Caution! Regexps! "

Thank you all and success!

Tags:

Caution! Regexps!

“Regular expressions are equally good for all tasks.”

“Regular expressions are equally good for all tasks.”

If someone did not have time to figure it out, I’ll explain.

Although, strictly speaking,

The developers did not stop there

the use of recursion in regular expressions makes them irregular .

Also popular now: