Regular expressions + logic programming. What is the result?
Hello, dear readers.
Regular expressions are a well-known tool used in many projects, most often for fairly simple parsing of structured text. I was working on what at first glance is a rather different task: the inverse synthesis of program models (recovering the original block model of a problem from program code that some system generated automatically from that model), as well as the synthesis of program models from a textual description of a task. There I ran into the problem of text analysis, or more precisely, of matching text fragments against custom templates. I wanted a fairly simple and flexible (customizable) solution. At first glance, regular expressions did not look like one: even in a task as simple as checking words against a dictionary, you unfortunately have to list all the options explicitly in the expression. And they do not build a syntax tree. However, they clearly could be improved. That is what this article is about.
So, the following tasks were set:
- A regular expression must be able to produce a parse tree, with standard means of access to this tree.
- A regular expression should be able to check found fragments against a dictionary (exactly or loosely, by Levenshtein distance), as well as run more complex checks against several tables at once; a sketch of such a fuzzy check follows this list.
- In addition to the simple checks listed above, I also wanted fuzzier checks, for example of the compatibility of words and expressions, using a neural network.
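To make the second item concrete, here is a minimal Python sketch of a non-strict dictionary check (my own illustration, not code from the prototype): a word passes if its Levenshtein distance to some dictionary entry does not exceed a threshold.

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def fuzzy_check(word, dictionary, max_dist=1):
    # non-strict check: close enough to at least one dictionary entry
    return any(levenshtein(word, entry) <= max_dist for entry in dictionary)

print(fuzzy_check("мор", {"море", "небо"}))  # True: distance 1 to "море"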
Parse tree
In standard regular expressions, parsed fragments are identified by the numbers of their parentheses. This is, to put it mildly, inconvenient, so it was decided that parentheses could be given names, and it is these names that appear in the parse tree. The syntax is simple:
(expression_fragment)->{bracket_name}
If the parentheses in the original expression were followed by an operator (*, +, etc.), it moves past the closing brace. For example:
(\w+\s)->{A}+
Nothing prevents naming nested parentheses as well, for example:
((\w+)->{ID}\s)->{A}+
In the last example, the modified regular expression generates a parse tree with a conditional root: instances of A sit at the next level (there may be more than one of them), and the ID values at the level below. It is convenient to address such a tree with XPath, which I implemented; for example, the following query is possible:
//A[2]/ID[text()!=""]/text()
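For illustration, here is a minimal Python sketch (not the Free Pascal prototype) of what the named brackets produce for the expression above: it matches the repeated word-plus-space groups with the standard re module, builds the corresponding tree as XML, and queries it. Note that xml.etree.ElementTree supports only a subset of XPath, so the query is slightly simplified.

import re
import xml.etree.ElementTree as ET

# Emulate ((\w+)->{ID}\s)->{A}+ on a sample string: each match of
# (\w+)\s becomes an <A> node holding an <ID> node.
text = "alpha beta gamma "
root = ET.Element("root")
for m in re.finditer(r"(\w+)\s", text):
    a = ET.SubElement(root, "A")
    ET.SubElement(a, "ID").text = m.group(1)

# Roughly //A[2]/ID/text() from the article; ElementTree's XPath
# subset has no text() predicates, so that part is checked in Python.
second_id = root.find("A[2]/ID")
if second_id is not None and second_id.text:
    print(second_id.text)  # -> "beta"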
Checking words against dictionaries, tables and simple neural networks
Parsing a regular expression is very similar to evaluating a simple logical expression, for example in Prolog. This suggests that Prolog-like fragments can be incorporated organically into such expressions. These can be:
a) chains of various checks. This is easy, especially since variables have already appeared (in the form of named parentheses);
b) additions of detected fragments to tables/dictionaries;
c) removals of detected fragments from tables/dictionaries;
d) commands to create/train a neural network.
The general syntax is as follows:
(expression_fragment)operation_character=>{predicate1,predicate2,...}
The operation_character depends on the operation: checking (?), adding (+), removing (-), creating/training (*). Each predicate is written as its name (standard or custom) followed by a parameter list in parentheses. Parameters can be constants, input variables, output variables (prefixed with "$"), the wildcard "_", or a reference to the current value of expression_fragment (a single "$" character). Parsing such an expression is considered successful if all predicates in the chain succeed.
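To give a feel for the operation characters, here is a hypothetical reading of how three of them might act on a mutable table of fragments (my own sketch, not the prototype's code):

# Hypothetical semantics of the operation characters, acting on a
# mutable table of detected fragments.
table = set()

def apply_op(op, fragment):
    if op == "?":                 # checking
        return fragment in table
    if op == "+":                 # adding the detected fragment
        table.add(fragment)
        return True
    if op == "-":                 # removing the detected fragment
        table.discard(fragment)
        return True
    raise ValueError(op)          # "*" (create/train) is more involved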
For example, the expression:
([А-Яа-я]+)->{V1}\s+([А-Яа-я]+)->{V2}()?=>{check(V1),check(V2),correlate(V1,V2)}
selects two consecutive Russian words, puts them into the variables V1 and V2, then checks each word with the check predicate (this can be a simple table lookup) and, finally, with the correlate predicate (which can also be a table lookup). Whether the check is strict or non-strict is determined by the predicate's specification.
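As a rough illustration of what such predicates might do under the hood (my own sketch, not code from the prototype), check can be a plain set lookup and correlate a lookup in a table of allowed pairs:

# Hypothetical table-backed predicates for the expression above.
DICTIONARY = {"синее", "море", "небо"}
CORRELATE_TABLE = {("синее", "море"), ("синее", "небо")}

def check(word):
    # strict check: the word must be present in the dictionary
    return word in DICTIONARY

def correlate(w1, w2):
    # the pair must be listed as compatible
    return (w1, w2) in CORRELATE_TABLE

# The match succeeds only if every predicate in the chain succeeds.
def chain(v1, v2):
    return check(v1) and check(v2) and correlate(v1, v2)

print(chain("синее", "море"))  # True
print(chain("синее", "поле"))  # False: "поле" is not in the dictionary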
If the correlate table contains not the words themselves but some word codes together with their compatibility characteristics, the expression might look like this:
()->{C1}()->{C2}([А-Яа-я]+)?=>{check($,$C1)}\s+([А-Яа-я]+)->{V2}()?=>{check(V2,$C2),correlate(C1,C2,1)}
Here two variables, C1 and C2, are introduced first; check looks up each word and returns its code through them. The correlate predicate can also be a neural network with two inputs (the word codes) and one output (compatibility), trained on some predefined or dynamically assembled (while the regular expression runs) data set.
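To make the neural-network variant of correlate concrete, here is a deliberately tiny sketch (my assumption of what such a predicate could look like, not the prototype's implementation): a single logistic neuron with two inputs (word codes) and one output (compatibility), trained on a small invented set of labeled pairs.

import math
import random

random.seed(1)

# Toy training set: pairs of word codes labeled with compatibility.
# Codes and labels are invented; here a pair is "compatible" when
# the sum of its codes exceeds 1, which one neuron can learn.
data = [((0.2, 0.3), 0), ((0.1, 0.4), 0), ((0.8, 0.7), 1),
        ((0.9, 0.5), 1), ((0.3, 0.2), 0), ((0.6, 0.9), 1)]

w = [random.random(), random.random()]
b = 0.0

def predict(c1, c2):
    # single logistic neuron: two inputs, one output in (0, 1)
    z = w[0] * c1 + w[1] * c2 + b
    return 1.0 / (1.0 + math.exp(-z))

# plain stochastic gradient descent on the log loss
lr = 0.5
for _ in range(2000):
    for (c1, c2), y in data:
        grad = predict(c1, c2) - y
        w[0] -= lr * grad * c1
        w[1] -= lr * grad * c2
        b -= lr * grad

def correlate(c1, c2, expected=1):
    # succeeds when the predicted compatibility rounds to `expected`,
    # matching the third argument in correlate(C1,C2,1)
    return round(predict(c1, c2)) == expected

print(correlate(0.85, 0.6))  # expected: True
print(correlate(0.1, 0.2))   # expected: False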
It remains to add that special parallel and sequential directives can also be specified as predicates; they respectively enable and disable parallel execution within the predicate chain (the chain is converted into a predicate dependency tree, according to which it is parallelized). In addition, a special mode can be enabled in which the engine tries to predict the execution time of individual predicates dynamically and to decide whether parallelism will pay off or, on the contrary, will only add overhead.
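As an illustration of the dependency-tree idea (my own sketch, with hypothetical predicate functions): predicates that share no output variables can be submitted to a thread pool at once, and the chain succeeds when all of them do.

from concurrent.futures import ThreadPoolExecutor

# check(V1), check(V2) and correlate(V1, V2) read only the already
# bound variables and produce no outputs for one another, so the
# dependency tree lets all three run in parallel.
def run_chain(v1, v2, check, correlate):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(check, v1),
                   pool.submit(check, v2),
                   pool.submit(correlate, v1, v2)]
        return all(f.result() for f in futures)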
Conclusion
All the described modifications of regular expressions are implemented in a prototype (a modification of the standard regexpr.pas) in Free Pascal.
I hope that these ideas will be useful to someone.
Using such regular-logical expressions together with programming in pure Prolog, we managed, first, to write an extension to the program generation system that restores the original program model from code the system previously generated, and second, to add elements of a natural-language interface to this system (the problem statement is given in very simplified Russian, a problem model is built from it, and the program is then generated from that model).