Expression screwdriver

    The scope of the analysis of mathematical expressions is not difficult to imagine - these are all kinds of parsers for SQL queries, and handlers of formulas entered by the user (the same graphing or filters for the database) - up to creating your own languages ​​(I don’t intentionally write the word “programming”, etc. to. Often these are data description languages ​​and others like them).

    Maybe I'm wrong, but I could not find a more or less usable expression parser for PHP on the open spaces of the network - and, as those who read my articles periodically are used to, I went to implement this business on my own, i.e. reinvent the wheel. : ^)

    The result of my attempts you can find here. In the archive you will find the scripts necessary for the library to function, and an example of its operation (sample.php). The library is compiled as standalone.

    But, I believe, it would be interesting to figure out what was happening there.

    As you might guess, the library is built on the basis of SCR (reverse Polish notation, or notation, if you are more familiar). However, the SCR made sense to finalize to the desired condition, because in its description only the basic functionality for parsing expressions is given. So, here is the library's working scheme: The

    description of expression parsing rules is made in a class inherited from the CIpr class. The definition of grammars and classes is done in the protected p_build method.

    First you need to define grammar classes (or grammar), on the basis of which the parser will divide the string passed to it into tokens. Each token consists of a token (the actual word) and a class identifier (the one that caused the selection of this token). The class identifier is the index of the array, i.e. it can be a string or an integer (I used strings for clarity). I have identified the following possible types of grammars:

    grammar_char (characters [, combinations])
    Such grammars are defined by the character set (the string in which these characters are listed) and possibly character combinations (array of strings).
    The definition looks like this: grammar_char ("+ - * / =", array ("+ =", "++")) . This grammar defines single characters +, -, *, / and = and their combinations + = and ++.

    grammar_list (tokens [,
    This type of grammar is defined by predefined sets of tokens. You can also determine whether the definition of a token is case sensitive (yes by default).
    You can use it like this: grammar_list (array ("prefix1", "prefix2"), false) . This grammar defines case insensitive prefix1 and prefix2 tokens. It is important that it will simply search for one token from the list at the beginning of the line without taking into account delimiters , i.e. the token prefix1 will be selected from the expression prefix1something.

    grammar_preg (start [, extract])
    This type of grammar, along with symbolic ones, I think will be most useful. These are regular expression grammars. They are described by the regularity of the beginning and, possibly, the regularity for extraction. If the extraction is not specified, then the beginning is used instead. Here is such a regular expression, for example, it will select comments of the form // comment : /^\/\/.*/

    grammar_quot (quotation mark symbol [, escaping mode = C_ESCAPE_CHAR])
    Grammars of this type are enclosed in quotation marks. The escaping mode can take the following values:
    C_ESCAPE_NONE - characters are not escaped in a string. (The string "10 \" "will be recognized as 10 \).
    C_ESCAPE_CHAR - only control characters can be escaped. (String" 10 \ '- \ "" will be recognized as 10 \' - ").
    C_ESCAPE_ALL - escape character \ can be placed before all characters. (The string "1 \ 0 \ '- \" "will be recognized as 10'-").

    grammar_brac (parenthesis characters [, escaping mode = C_ESCAPE_ALL [, <allow nesting = true>]])
    These are bracket grammars (that is, highlighting expressions of type) Their use to me personally seems doubtful in this context - they are intended for other purposes (this library is only part of the platform). But you can use them if you come in handy. : ^)

    grammar_user (start, extract)
    This is a custom grammar defined by the start and extract functions.

    Parsing takes place as follows: The input line is checked in turn by all grammars, and if the beginning of one of them is found (usually matching the beginning of the line), then it is extracted. As a result of parsing, you get an array of tokens.

    Further, more interestingly, classes of possible commands in the expression code are defined. The following types of commands can be defined:

    ipr_lexeme (token)
    Specifies that the specified token should be treated as a token-type command. A token can act as a full token defined through the lexeme and ilexeme functions, or just a string that defines the token of the token.

    ipr_operator (token, priority [, arity = 2 [, associativity = no (which is now equivalent to the left, which is not entirely true)]])
    Defines the operator.
    You can also use functions:
    ipr_perfix - prefix unary operator;
    ipr_postfix - postfix unary operator;
    ipr_binary is a binary operator.

    ipr_compound (token array, priority [, arity = 3 [, associativity = right [, calculate automatically = no]]])
    Defines a complex operator. What is the difference from simple? The operands of a simple operator are calculated by the engine itself, which, say, for the?: Operator is not entirely correct, because only one of its second and third operands needs to be calculated. For this, a complex operator is needed - for it, only the first operand is automatically calculated, and all subsequent operands are passed to the user method for processing. Thus, it makes sense to define complex operators with arity at least 2.
    You can also use the following functions:
    ipr_access - binary operator for accessing the object field;
    ipr_ternary is a ternary operator.

    ipr_brackets (opening token, closing token, comma [, can be used without operand = no [, arity = not defined [, calculate automatically = yes]]])
    Defines parentheses. Brackets can be used without an operand (for example (a + b)) or with an operand (for example, arr [10]). The arity of brackets can also be determined when a strictly defined number of operands must be specified in them. About automatic can be read above.
    You can also use functions:
    ipr_default - definition of brackets used to set priority (i.e., default brackets).

    ipr_instruction (token opening token, closing token, comma [, arity = not defined [, calculate automatically = no]])
    Defines the instruction. In fact, the same thing as brackets with the prohibition of auto use, only instead of the operand before the brackets, the instruction token is used.

    ipr_end (token)
    Defines the token on which the expression is considered to end.

    ipr_until (token)
    Defines the token on which the expression is considered to end. In this case, the token remains at the beginning of the remainder of the expression.

    Now you need to define handlers for all types of commands, of which 5 are the operand, token, operator, complex operator, brackets, and instruction. These are the following protected methods:

    p_run_operand (command)
    p_run_lexeme (command)
    p_run_operator (command, operands)
    p_run_compound (command, operand, operands)
    p_run_brackets (command, operand, operands)
    p_run_instruction (command, operands)

    The command contains the type of command (not important, because it defines the handler, i.e. the situation when an operator command comes to the operand handler is impossible), the class of the command, and the token that was defined as the command. The operand contains the calculated operand, and the operands contain the operands, either calculated or awaiting processing through the p_run method (depending on the type of command and the “calculate automatically” flag).

    Now, to create an expression handler, you need to create an instance of the handler class to which the expression to be processed is passed to the constructor. The constructor will parse and compile the expression. To start an expression, use the run method.

    Well, that seems to be all. Error handling and non-associative operators are on the way now - I hope I will complete all this in the near future. As usual, I ask you not to be too strict for those to whom this is no longer the case - I publish my developments for those who may find it useful, because the developments, it seems to me, are not entirely useless.

    Thank you for your attention! I will be very glad to hear your feedback and comments in the comments, regardless of their positivity. : ^)

    Also popular now: