PHP: Implementing formal grammars

    Recently, I needed to write a parser for a search string that casts strings of the form
    (aa & bb) ^ (! Cc ^! (Dd ^ ee)) to a string of the form of a piece of SQL: (? F LIKE "% aa%" AND? F LIKE "% bb% ") OR (? f NOT LIKE"% cc% "OR! ((? f LIKE"% dd% "OR? f LIKE"% ee% "))). I wrote like and SQL for the sake of simplicity, in fact, there was SPHINX, and it was not required in the end, but talk about how I achieved this by writing formal grammars and implementing them in PHP.

    If I’ve already entered a post, I’ll open before you: despite my higher education in mathematics, I did this for the first time, and before that I had heard, but had not seen how to do it. Therefore, I ask all highbrow mathematicians and gurus to throw me the right decisions and correct me in every way,

    There were several options to achieve the goal: the
    first is to write a recursive (remember “? R”) regular, but I discarded the option “not sports”, but lemmings punish me for this =) the
    second is to write the tree construction manually, but how it would be painful to write a lot, and somehow it was painfully clumsy.
    the third is to recall the words “formal grammar”. Yes, they seem scary at first glance, but in practice everything is much better and more useful - I chose it. I repeat that this is my first implementation of such a concept, so feel free to point out my mistakes!

    so let's get started:
    Let's make the alphabet
    Alphabet = {(,),!, &, ^, (,), aZ, space here, -, _, a-Z}
    we compose a meta-alphabet for our task, taking into account the priorities of the operations (! priority is & & ^, and the latter are of the same priority)

    F -> T | T&F | T ^ F
    T * -> I |! I |! S
    I -> (F) | S
    S -> C | SC
    C -> [a-Z_ a-Z-]

    These are the rules by which meta-alphabet elements are formed from bottom to top
    F - formula, T - term, I - item, S - string, C - symbol. He called it like it.

    * it is entered in T! S so that “! ball of the ball of the blah” should be interpreted as? f NOT LIKE “% of the ball of the ball blah%”, and not as! (? f LIKE “% of the ball blah blah%”), in short it is possible without it.

    As you can see, the grammar containing the parsing of negation (!) Is lower than the grammar containing the parsing of other operations (& and ^),

    I got a contextually independent G 2 grammar for the Chomsky hierarchy for such a task - simple shorter, but not as simple as a finite state machine (therefore, with pure "regulars", such problems are not solved without recursions).

    Now it remains to realize all this. I don’t think it’s interesting how I coded, so I’ll lay out the final code without tormenting the reader, I’ll just notice that it doesn’t implement the search for syntax errors (Although I got sick and parsed a couple of the most egregious errors into the code).
    Copy Source | Copy HTML
    1.  
    2. class GrammarParcer2SQLToken{
    3.     private $string = '';
    4.     private $index =  0;
    5.     private $sql_token = '';
    6.     private $error = null;
    7.     public function __construct($s=null){
    8.         if (!empty($s))
    9.             $this->string = $s;
    10.         $this->index =  0;
    11.         $this->error = null;
    12.     }
    13.  
    14.     public function parce($s=null, $e=null){
    15.         if (!empty($s))
    16.             $this->string = $s;
    17.         $this->index =  0;
    18.         $this->error = null;
    19.         $this->F();
    20.         $e = $this->error;
    21.         return $this->sql_token;
    22.     }
    23.  
    24.     public function getError(){
    25.         return $this->error;
    26.     }
    27.  
    28.     protected function setError($mes){
    29.         $this->error = $mes;
    30.     }
    31.  
    32.     protected function isEnd(){
    33.         if (!empty($this->error) or $this->index >= strlen($this->string))
    34.             return true;
    35.         return false;
    36.     }
    37.  
    38.     protected function F(){
    39.         $this->T();
    40.         if ($this->isEnd())
    41.             return;
    42.         if ($this->string{$this->index} == '&'){
    43.             $this->sql_token.= ' AND ';
    44.         }elseif ($this->string{$this->index} == '^' or $this->string{$this->index} == '|'){
    45.             $this->sql_token.= ' OR ';
    46.         }else{
    47.             return;
    48.         }
    49.         $this->index++;
    50.         $this->F();
    51.     }
    52.  
    53.     protected function T(){
    54.         $this->I();
    55.         if ($this->isEnd())
    56.             return;
    57.         if ($this->string{$this->index} == '!'){
    58.             $this->index++;
    59.             if (!$this->S('invert')){
    60.                 $this->sql_token.= '!(';
    61.                 $this->I();
    62.                 $this->sql_token.= ')';
    63.             }
    64.         }
    65.     }
    66.  
    67.     protected function I(){
    68.         if ($this->string{$this->index} == '('){
    69.             $this->sql_token.= '(';
    70.             $this->index++;
    71.             $this->F();
    72.             if ($this->string{$this->index} !== ')'){
    73.                 $this->setError('parse error, expected ")" on offset: '.$this->index);
    74.                 //нет закрывающей скобки.
    75.             }
    76.             $this->sql_token.= ')';
    77.             $this->index++;
    78.         }else{
    79.             $this->S();
    80.         }
    81.     }
    82.  
    83.     protected function S($invert=false){
    84.         if ($this->isEnd())
    85.             return false;
    86.         $string='';
    87.         while(!$this->isEnd()){
    88.             if (preg_match('~[0-9a-zа-я_ -]~i', $this->string{$this->index})){
    89.                 $string.=$this->string{$this->index};
    90.                 $this->index++;
    91.                 continue;
    92.             }elseif (!preg_match('~[&)(!|^]~i', $this->string{$this->index})){
    93.                 $this->setError('parse error, unexpected "'.$this->string{$this->index}.'" on offset: '.$this->index);
    94.                 //ненормальный символ =)
    95.             }
    96.             break;
    97.         }
    98.         if (empty($string))
    99.             return false;
    100.         $this->sql_token.= '?f '.($invert?'NOT ':'').'LIKE "%'.mysql_escape_string($string).'%"';
    101.         return true;
    102.     }
    103. }


    see an example:

    Copy Source | Copy HTML
    1. $input = '(aa&bb)^(!cc^!(dd^ee))';
    2. $parcer = new GrammarParcer2SQLToken();
    3. echo $parcer->parce($input);
    4. echo "\n".$parcer->getError();


    UPD did the code coloring - thanks sc.me

    Also popular now: