# Inside regular expressions

Regular expressions (RVs) are a very convenient form of writing so-called regular or automaton languages. Therefore, PBs are used as an input language in many systems processing chains. Consider examples of such systems:

In this article, we will first familiarize ourselves with finite state machines and their types (DFA and NKA), and then consider an example of constructing a minimal DFA by regular expression.

A finite state machine (KA) is a converter that allows you to map the corresponding output to an input, and this output may depend not only on the current input, but also on what happened earlier, on the history of the state machine. Even human behavior, and not just artificial systems, can be described using spacecraft. For example, your reaction to your neighbor listening to loud music at night will be one after the first such event and completely different after several such cases. There can be an infinite number of such histories; the question arises: what kind of memory does the spacecraft have to behave differently for each histogram? It is clear that it is impossible to store an infinite number of histories. Therefore, the automaton, as it were, breaks up all possible histories into equivalence classes. Two stories are equivalent, if they equally affect the behavior of the automaton in the future. The equivalence class to which the automaton related its current background is also called the internal state of the automaton.

Consider an example of a primitive spacecraft:

This spacecraft consists of:

The reader can move in one direction, usually from left to right, thereby reading the characters of the input chain. For each such movement, it can count one character. Further, the read symbol is transmitted to the control unit. The control unit changes the state of the machine based on the transition rules. If the list of transition rules does not contain a rule for the character read, then the machine "dies."

Now consider the ways in which the spacecraft can be defined. They can be specified as graphs or as control tables. In the form of a graph, the spacecraft is defined as follows:

In the form of a control table, like this:

An example of a spacecraft in the form of a graph and in the form of a control table will be presented below.

The main difference between DKA and NKA is that DKA in the process of work can only be in one state, and NKA in several states at the same time. An example of the work of the NCA is the idea of the American physicist Hugh Everettfrom the fact that any event divides the world into several worlds, in each of which this event ended in its own way. For example, in one world, Hitler won the Second World War, in another - Newton went into business instead of physics, and the discovery of the laws of classical mechanics had to be postponed for 50 years. In order to draw any conclusions from the work of the automaton, one should study all the “worlds”. After the entire input chain has been read, we believe that the NCA admits this chain if it completed work in an accepting state in at least one of the many “worlds”. Accordingly, the machine rejects the chain if it completed the work in an inadmissible state in each "world". A DFA accepts a chain, this is obvious if after reading the entire input chain it is in a valid state.

In most cases, building an NKA is much easier than a DKA. But, despite this, using NKA for modeling is not a good idea. Fortunately, for each NFA, it is possible to construct a DFA that admits the same input language. In this article, we will not give an algorithm for constructing a DFA from an NFA, but consider this algorithm based on a clear example below.

To begin with, we give a list of RV operations used in this article, in order of priority:

Consider an example, given a regular expression:

xy * (x | y *)

It is necessary to construct a minimal DFA from a regular expression and demonstrate recognition of the correct and incorrect chains.

To begin with, we simplify this RV, using the right-hand distribution law of concatenation with respect to union, we obtain the following RV:

(xy * | ab | (x | a *)) (x | y *)

Now we construct an automaton according to this RV:

According to the concatenation transformation rule (we will not to give the rules for converting PBs into spacecraft, since they are quite obvious), we obtain the following automaton:

By the union transformation rule:

By the concatenation transformation rule:

And at the end, we apply the closure transformation rule and get εНКА. It should be noted here that εNKA is an NKA that contains ε-transitions. An ε-transition, in turn, is a transition in which the automaton does not take into account the input chain or, in other words, a transition by an empty symbol.

We get rid of ε-transitions (the “final state” is indicated by an “asterisk”):

In this SCA, the states s3 and s5 are equivalent, since δ (s3, x) = δ (s5, x) = s1 and δ (s3, y) = δ ( s5, y) = s5, s7. Rename the states s6 -> s5 and s7 -> s6: we

construct a DFA according to the NFA:

In this DFA, the states p1 and p5 are equivalent, since

δ (p1, x) = δ (p5, x) = p4 and δ (p1, y) = δ (p5, y) = p5. Rename the states p6 -> p5 and p7 -> p6:

This automaton is the minimum DFA.

Let δ be the transition function, then the extended transition function constructed from δ is denoted by δ ', and ω is the input chain.

Suppose that the chain ω = aaax is fed into the input, we expect the automaton to be in one of the admissible states.

δ '(p0, ε) = p0

δ' (p0, a) = δ (δ '(p0, ε), a) = δ (p0, a) = p3

δ' (p0, aa) = δ (δ ' (p0, a), a) = δ (p3, a) = p5

δ '(p0, aaa) = δ (δ' (p0, aa), a) = δ (p5, a) = p5

δ '(p0 , aaax) = δ (δ '(p0, aaa), x) = δ (p5, x) = p4

p4 is an admissible final state, so the chain aaax is correct for this automaton.

Now suppose that ω = xyyb:

δ '(p0, ε) = p0

δ' (p0, x) = δ (δ '(p0, ε), x) = δ (p0, x) = p1

δ' (p0 , xy) = δ (δ '(p0, x), y) = δ (p1, y) = p1

δ '(p0, xyy) = δ (δ' (p0, xy), y) = δ (p1, y) = p1

δ '(p0, xyyb) = δ (δ' (p0, xyy), b) = δ (p1, b) = ∅

Here we see that if the symbol b is input to the automaton when it is in state p1, then this automaton will die, therefore the chain xyyb is incorrect.

PS In this article, the algorithm for constructing DFA on RV was considered, but there are more convenient algorithms, in particular for programming, but this is a topic for another article ...

- The grep command of the Unix operating system or similar commands to find conversations that can be found in Web browsers or text formatting systems. In such systems, PBs are used to describe patterns that the user is looking for in a file. Various search engines convert RV into either a deterministic finite state machine (DFA) or a non-deterministic finite state machine (NKA) and apply this automaton to the file in which the search is performed.
- Generators of lexical analyzers. Lexical analyzers are a compiler component; they break the source program into logical units (tokens), which can consist of one or several characters and have a certain meaning. The generator of lexical analyzers receives formal descriptions of tokens, which are essentially RV, and creates a DFA, which recognizes which of the tokens appears at its input.
- RV in programming languages.

In this article, we will first familiarize ourselves with finite state machines and their types (DFA and NKA), and then consider an example of constructing a minimal DFA by regular expression.

#### State machines

A finite state machine (KA) is a converter that allows you to map the corresponding output to an input, and this output may depend not only on the current input, but also on what happened earlier, on the history of the state machine. Even human behavior, and not just artificial systems, can be described using spacecraft. For example, your reaction to your neighbor listening to loud music at night will be one after the first such event and completely different after several such cases. There can be an infinite number of such histories; the question arises: what kind of memory does the spacecraft have to behave differently for each histogram? It is clear that it is impossible to store an infinite number of histories. Therefore, the automaton, as it were, breaks up all possible histories into equivalence classes. Two stories are equivalent, if they equally affect the behavior of the automaton in the future. The equivalence class to which the automaton related its current background is also called the internal state of the automaton.

Consider an example of a primitive spacecraft:

This spacecraft consists of:

- the tape represented by the input chain.
- reader device.
- a control unit that contains a list of transition rules.

The reader can move in one direction, usually from left to right, thereby reading the characters of the input chain. For each such movement, it can count one character. Further, the read symbol is transmitted to the control unit. The control unit changes the state of the machine based on the transition rules. If the list of transition rules does not contain a rule for the character read, then the machine "dies."

Now consider the ways in which the spacecraft can be defined. They can be specified as graphs or as control tables. In the form of a graph, the spacecraft is defined as follows:

- the vertices of the graph correspond to the states of the spacecraft.
- directed edges correspond to transition functions (the symbol along which the transition is performed is indicated near each such edge).
- a vertex with an edge entering it that does not exit from more than one state corresponds to the initial state.
- final states of the spacecraft are marked in bold.

In the form of a control table, like this:

- spacecraft states are located in the rows of the table.
- Recognized language characters are in columns.
- at the intersection, the state into which you can get from this state by this symbol is indicated.

An example of a spacecraft in the form of a graph and in the form of a control table will be presented below.

#### DKA and NKA

The main difference between DKA and NKA is that DKA in the process of work can only be in one state, and NKA in several states at the same time. An example of the work of the NCA is the idea of the American physicist Hugh Everettfrom the fact that any event divides the world into several worlds, in each of which this event ended in its own way. For example, in one world, Hitler won the Second World War, in another - Newton went into business instead of physics, and the discovery of the laws of classical mechanics had to be postponed for 50 years. In order to draw any conclusions from the work of the automaton, one should study all the “worlds”. After the entire input chain has been read, we believe that the NCA admits this chain if it completed work in an accepting state in at least one of the many “worlds”. Accordingly, the machine rejects the chain if it completed the work in an inadmissible state in each "world". A DFA accepts a chain, this is obvious if after reading the entire input chain it is in a valid state.

In most cases, building an NKA is much easier than a DKA. But, despite this, using NKA for modeling is not a good idea. Fortunately, for each NFA, it is possible to construct a DFA that admits the same input language. In this article, we will not give an algorithm for constructing a DFA from an NFA, but consider this algorithm based on a clear example below.

#### Building a minimal DCA by regular expression

To begin with, we give a list of RV operations used in this article, in order of priority:

- iteration (wedge closure) using the symbol "*"
- concatenation is specified using a space or an empty string (for example: ab)
- union using the character "|"

Consider an example, given a regular expression:

xy * (x | y *)

**|**ab (x | y *)**|**(x | a *) (x | y *)It is necessary to construct a minimal DFA from a regular expression and demonstrate recognition of the correct and incorrect chains.

To begin with, we simplify this RV, using the right-hand distribution law of concatenation with respect to union, we obtain the following RV:

(xy * | ab | (x | a *)) (x | y *)

Now we construct an automaton according to this RV:

According to the concatenation transformation rule (we will not to give the rules for converting PBs into spacecraft, since they are quite obvious), we obtain the following automaton:

By the union transformation rule:

By the concatenation transformation rule:

And at the end, we apply the closure transformation rule and get εНКА. It should be noted here that εNKA is an NKA that contains ε-transitions. An ε-transition, in turn, is a transition in which the automaton does not take into account the input chain or, in other words, a transition by an empty symbol.

We get rid of ε-transitions (the “final state” is indicated by an “asterisk”):

In this SCA, the states s3 and s5 are equivalent, since δ (s3, x) = δ (s5, x) = s1 and δ (s3, y) = δ ( s5, y) = s5, s7. Rename the states s6 -> s5 and s7 -> s6: we

construct a DFA according to the NFA:

In this DFA, the states p1 and p5 are equivalent, since

δ (p1, x) = δ (p5, x) = p4 and δ (p1, y) = δ (p5, y) = p5. Rename the states p6 -> p5 and p7 -> p6:

This automaton is the minimum DFA.

Let δ be the transition function, then the extended transition function constructed from δ is denoted by δ ', and ω is the input chain.

Suppose that the chain ω = aaax is fed into the input, we expect the automaton to be in one of the admissible states.

δ '(p0, ε) = p0

δ' (p0, a) = δ (δ '(p0, ε), a) = δ (p0, a) = p3

δ' (p0, aa) = δ (δ ' (p0, a), a) = δ (p3, a) = p5

δ '(p0, aaa) = δ (δ' (p0, aa), a) = δ (p5, a) = p5

δ '(p0 , aaax) = δ (δ '(p0, aaa), x) = δ (p5, x) = p4

p4 is an admissible final state, so the chain aaax is correct for this automaton.

Now suppose that ω = xyyb:

δ '(p0, ε) = p0

δ' (p0, x) = δ (δ '(p0, ε), x) = δ (p0, x) = p1

δ' (p0 , xy) = δ (δ '(p0, x), y) = δ (p1, y) = p1

δ '(p0, xyy) = δ (δ' (p0, xy), y) = δ (p1, y) = p1

δ '(p0, xyyb) = δ (δ' (p0, xyy), b) = δ (p1, b) = ∅

Here we see that if the symbol b is input to the automaton when it is in state p1, then this automaton will die, therefore the chain xyyb is incorrect.

PS In this article, the algorithm for constructing DFA on RV was considered, but there are more convenient algorithms, in particular for programming, but this is a topic for another article ...