Learning Regular Expression Algorithm in Ruby

Original author: Pat Shaughnessy
  • Transfer

According to Wikipedia, Oniguruma means “devil’s chariot” in Japanese.

We are all familiar with regular expressions. They are the "Swiss Army Developer Knife". Whatever you look for, whatever text you parse, you can always do this using regular expressions. In fact, you probably started using them much earlier than you started using Ruby - they have long been included in most popular programming languages: Perl, JavaScript, PHP, Java and others. Ruby appeared in the mid-1990s, while regular expressions were back in the 1960s, that is, almost 30 years earlier!

But how do regular expressions actually work?

If you want to learn more about the information technology behind the regular expression engine, you must read the fantastic series of articles by Russ Cox:


I do not want to repeat everything that Russ wrote. But I will quickly note that Ruby uses the “Non-recursive Backtracking Implementation”, which is discussed in its second article, which leads to an exponential decline in performance, just like when working with regular expressions in Perl . In other words, Ruby does not use the most optimal Thompson NFA (Thompson algorithm for non-deterministic finite state machines), which Russ talks about in his first article.

Today I'm going to take a close look at Oniguruma, the regex engine that is built into Ruby. Using some simple graphical schemes and some examples of regular expressions, I will illustrate how it works. Read on to get an idea of ​​what's going on inside Ruby every time you use regular expressions, there are probably a lot of things that you did not even know about.

Oniguruma


Starting with version 1.9, the MRI uses a slightly modified version of the open C library called “Oniguruma” to process regular expressions, which is also used in PHP. Along with supporting all standard regular expression functions, it also supports multi-byte characters, such as, for example, Japanese text.

At a very high level, this is what happens when a regular expression goes through Oniguruma:



In the first step, Oniguruma reads the regular expression pattern, breaks it into tokens and translates it into a tree structure containing a set of syntax nodes - Abstract Syntax Tree (AST). This is very similar to how the interpreter itself processes your Ruby code. In fact, you can think of the Oniguruma regex engine as the second programming language built right into Ruby. Whenever you use regular expressions in your code, you really create a second program in a completely different language. After parsing the regex pattern, Oniguruma compiles it into a set of instructions that are then executed on the virtual machine. These instructions implement the state machine that Russ Cox describes in his articles.

Then, when you are actually searching for a regular expression, the Oniguruma virtual machine executes it by simply processing the instructions that were previously compiled, applying them to the given string.

This week, to understand how Oniguruma works and compare with what Russ Cox writes about, I rebuilt Ruby 2.0 with the ONIG_DEBUG_COMPILE and ONIG_DEBUG_MATCH flags set. Setting these flags allowed me to see a lot of debugging messages when I searched through several regular expressions. In them, I saw in which virtual machine instructions templates were compiled, and how it executed them. Here is what I found ...

Example 1: Matching a String


Here is a very simple Ruby script that searches for the word “brown” in the given string:

str = "The quick brown fox jumps over the lazy dog."
p str.match(/brown/)

When I run this code with a rebuilt Ruby interpreter, I observe a lot of extra debugging output (much more than I will show):

$ ruby regex.rb
PATTERN: /brown/ (US-ASCII)
optimize: EXACT_BM
exact: [brown]: length: 5
code length: 7
0:[exact5:brown] 6:[end]
match_at: str: 140261368903056 (0x7f912511b590) etc ...
size: 44, start offset: 10
  10> "brown f..."         [exact5:brown]
  15> " fox ju..."         [end]
#

The key point is the following " 0:[exact5:brown] 6:[end]" - this line describes the two virtual machine commands compiled by Oniguruma when compiling the / brown / template. Here's what this program looks like:



You can think of this scheme as a state machine for searching / brown /:

  • exact5:brown checks the given string at the current position for the presence of 5 letters "brown";
  • end means that the search is finished, returns the found matches and stops the execution.

When performing a regular expression search, Oniguruma executes these instructions in a virtual machine with the given string. Let's look at how this works in my example. Firstly, " exact5:brown" is applied to a given line in the place where the word "brown" is located:



How does Oniguruma know where to start the line from? It turns out that it contains an optimizer that decides where to start the search based on the given string and the first instruction of the virtual machine. You can see it above: " optimize: EXACT_BM.. exact: [brown]: length: 5.. start offset: 10". In this case, since it was precisely known that it was necessary to search for the word "brown", Oniguruma jumped to the position where the word first appeared. Yes, I know this sounds like a hack, but in fact it is just a simple optimization of search acceleration in common regular expressions.

Then, Oniguruma executes the " exact5:brown" instruction , checking whether the next 5 characters match the pattern or not. Since they match, Oniguruma moves to the position after them and executes the following instruction:



Now the last instruction is executed, which simply means that everything is complete. Whenever the virtual machine executes " end", it stops execution, declares success, and returns a matching string.

Example 2: Matching one of two lines


Let's take a more complex example and see what happens. In this template, I want to look for the entry in the string either “black” or “brown”:

str = "The quick brown fox jumps over the lazy dog."
p str.match(/black|brown/)

Run again:

$ ruby regex.rb
PATTERN: /black|brown/ (US-ASCII)
optimize: EXACT
exact: [b]: length: 1
code length: 23
0:[push:(11)] 5:[exact5:black] 11:[jump:(6)] 16:[exact5:brown] 22:[end]
match_at: str: 140614855412048 (0x7fe37281c950), ...
size: 44, start offset: 10
  10> "brown f..."         [push:(11)]
  10> "brown f..."         [exact5:black]
  10> "brown f..."         [exact5:brown]
  15> " fox ju..."         [end]
#

Again, the key here is " 0:[push:(11)] 5:[exact5:black] 11:[jump:(6)] 16:[exact5:brown] 22:[end]". This is a virtual machine program that searches for our template / black | brown /:



Everything is a bit more complicated here! First of all, you may notice that the optimizer now only searches for the letter “b”: " optimize: EXACT.. exact: [b]: length: 1". This is because the words “black” and “brown” both begin with this letter.

Now let’s take a step-by-step look at the execution of this program:



The " push" command is executed at the first position of the letter "b". It passes the following command and place on the source line to what is called a Backtrack Stack:



A little later we will see that the Backtrack Stack is a key place in Oniguruma's work. She uses it to find alternative search paths in a given string if the first path did not produce a result. Let's continue with this example and you will see what I mean.

So, we are going to proceed to the execution of the " exact5:black" command , but there is only the word "brown" in the line. This means that no matches will be found and the search will not be successful. However, before returning the result, Oniguruma will check the stack for alternative search paths. In this case, there is also a " exact5.brown" - the second part of my regular expression / black | brown /. Now Oniguruma pulls out this command and continues execution:



Now there is a coincidence, so Oniguruma moves 5 characters and moves on to the next instruction:



We again reached the " end" command , so just return the matching value.

Example 3: Design *


Now my last example. Let's see what happens when I use the following regular expression:

str = "The quick brown fox jumps over the lazy dog."
p str.match(/brown.*/)

We want to find the occurrence of “brown”, and then any sequence of characters to the end of the line. Let's see the debug output:

$ ruby regex.rb
PATTERN: /brown.*/ (US-ASCII)
optimize: EXACT_BM
exact: [brown]: length: 5
code length: 8
0:[exact5:brown] 6:[anychar*] 7:[end]
match_at: str: 140284579067040 (0x7f968c80b4a0), ...
size: 44, start offset: 10
  10> "brown f..."         [exact5:brown]
  15> " fox ju..."         [anychar*]
  44> ""                   [end]
#

And here we have a new state machine:



This time you can see the new Oniguruma virtual machine instruction: " anychar*". As you can guess, it represents the syntax pattern " .*". Let's see what happens at each step



again : We started again at position 10 in the line where “brown” appeared for the first time, and again found a match, as a result Oniguruma goes further and proceeds to the next instruction:



Next instruction " anychar*", and here is how it works:
  • Firstly, it matches any character, therefore it always biases execution by one character;
  • But, instead of moving on to the next command, Oniguruma goes back and executes this instruction again, but for a new character;
  • It also pushes the current character and the next command, in this case " end", onto the stack .



Now Oniguruma simply walks through the rest of the line, repeating these instructions for each “brown fox jumps over the lazy dog.” Character. Since it goes through the rest of the line, it writes the instruction " end" onto the stack again and again :



And again:



And finally it reaches the end of the original line:



This time " anychar*" does not find matches, since there are no more characters in the line. What happens in such cases when a team cannot find a match? As in the previous example, Oniguruma will remove the command from the stack and continue searching with it. Thus, in this case, the command " end" will be executed with the last position of the match in the string. This means that Oniguruma will receive all the text to the end of the line “brown fox jumps over the lazy dog.”

But why put the " anychar*" instruction on the stack after each character? The reason is that if there were a few more patterns after " .*", or " .*" was embedded inside a more complex general expression, it would not be clear which of the segments of the original string would ultimately lead to a complete match. Perhaps Oniguruma would have to try all the options. In this simple example, the pattern is correct to the end of the line, so there is no need to get more than one command from the stack.

One interesting detail. If you add more commands after " .*", for example, if you search for /.*brown/, Oniguruma will not use the instruction " anychar*". Instead, it will use another similar command, for example, "anychar*-peek-next:banychar*"but instead of pushing the next position on the line each time, it pushes only the coincidence position, in this case, with" b "on the stack. This optimization works because Oniguruma knows that the next character should be" b ".

Regular Expression Pattern Pathology


I mentioned above that Ruby will show the same poor performance as Perl if you give it a very complex regex pattern. It is actually very easy to reproduce this on your computer. Try a very simple regular expression search example:

str = "aaa"
p str.match(/a?a?a?aaa/)

It should execute very quickly:

$ time ruby regex.rb
#
ruby regex.rb  0.02s user 0.01s system 30% cpu 0.080 total

However, if you repeat it using 29 repetitions instead of 3, then the time to find the entry will have explosive growth, as Russ shows in the chart on the left in his first article:

str = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"  # 29 повторений
p str.match(/a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaa/)

Running with 29 occurrences:

$ time ruby regex.rb
#
ruby regex.rb  17.09s user 0.01s system 99% cpu 17.098 total

Or with 30 entries:

$ time ruby regex.rb
#
ruby regex.rb  34.00s user 0.01s system 99% cpu 34.025 total

For 31 entries, the time is 67 seconds and increases exponentially. The reason for this is because the algorithm that Oniguruma uses goes through the list of possible combinations, which grows exponentially with respect to the length of the regular expression pattern. This would not happen if Oniguruma and Ruby used the Thompson algorithm Russ talked about!

Superficial explanation


As I said, here is just a superficial explanation of what Oniguruma and Ruby can do. As you probably know, there are many, very many operators and regular expression options that you can use, each of them has corresponding instructions inside the virtual machine. In addition, my examples were extremely simple and straightforward - in regular applications you can have very complex regular expressions that are compiled into hundreds of different instructions on the Oniguruma virtual machine, which creates many different paths when searching for the desired expression.

However, the basic idea will always be the same. Each regular expression is compiled into a series of instructions of a virtual machine, which is a state machine. When Oniguruma comes to a standstill during the search, it will select a different option for the path to the target from the stack, each of which could be left by the " anychar*" operator or another similar command.

Also popular now: