Subtleties of writing an LR generator and its possible features

    Introduction


    Good afternoon.
    In this final part about writing my own LALR parser generator, I would like to describe its subtleties and possible features. I will also describe what I found lacking in existing solutions and why I set out to reinvent the wheel.

    To set the context: the grammar under analysis is ECMAScript, also known as JavaScript. The specific specification is ECMA-262, edition 5.1 of June 2011.


    When I first ran into the difficulties of writing a parser, I did not immediately rush to write compiler code. First I studied how large companies solved these problems: Mozilla and Google (Opera and IE sources are not publicly available). As it turned out, they did not bother much with formalizing the grammar and wrote the code by hand, that is, exactly the way I warned against in the first part. An example of handling if in V8 (Chrome's JS engine):
    IfStatement* Parser::ParseIfStatement(ZoneStringList* labels, bool* ok) {
      // IfStatement ::
      //   'if' '(' Expression ')' Statement ('else' Statement)?
      Expect(Token::IF, CHECK_OK);
      Expect(Token::LPAREN, CHECK_OK);
      Expression* condition = ParseExpression(true, CHECK_OK);
      Expect(Token::RPAREN, CHECK_OK);
      Statement* then_statement = ParseStatement(labels, CHECK_OK);
      Statement* else_statement = NULL;
      if (peek() == Token::ELSE) {
        Next();
        else_statement = ParseStatement(labels, CHECK_OK);
      } else {
        else_statement = factory()->NewEmptyStatement();
      }
      return factory()->NewIfStatement(condition, then_statement, else_statement);
    }
    


    Why did they do that? There are several reasons:
    1. Speed. Although an FSM is very fast, "straight-line" execution is still faster.
    2. Flexibility. The fact that they used an in-house parser instead of bison/ANTLR/boost::spirit indirectly led me to suspect that this grammar is not so simple, and that I was not the only one to run into limitations.
    3. They are not designing a language from scratch (as with Dart, for example), so it is enough to write the parser once and forget about it for a couple of years until the next revision of the standard. In other words, the grammar does not change every day.

    Of course, there are also some downsides:
    1. 5-10 thousand lines of code just for parsing is a bit much.
    2. You cannot get by without the specification. The code is so fragmented that reconstructing the language's structure from the source alone is very difficult.
    3. Error handling inflates the code despite tricky macros:
      #define CHECK_OK  ok);   \
        if (!*ok) return NULL; \
        ((void)0
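    To see why the macro counts as tricky, here is how a call expands (my own illustration around the real macro above):
      // Expect(Token::IF, CHECK_OK); expands to:
      Expect(Token::IF, ok);
      if (!*ok) return NULL;
      ((void)0);
      // The macro supplies the 'ok' argument, closes the call's parenthesis,
      // adds the early return on error, and leaves a no-op expression to
      // absorb the caller's trailing ';'.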
      



    Parser / Parser Generator Requirements


    Now we can move on to the grammar's pitfalls and my expectations. I should say right away that some of these points are fully covered by third-party software; but no single tool covers them all, and for some points I could not find a solution anywhere.

    Comments are helpful!

    For my task I needed to keep the text and position of comments rather than discard them. Both ANTLR and bison/flex can do this: the former thanks to lexer multiplexing, the latter thanks to the power of flex. Since I have not yet started writing my own lexer, I use flex, and I solved the problem as follows:
     /* [Comment] = [MultiLineComment] | [SingleLineComment] */
     /* [MultiLineComment] */
     /* %x multilinecomment must be declared in the definitions section */
     /* start of a multiline comment; we grab the whole comment, hence yymore() */
    "/*" {
        BEGIN(multilinecomment);
        yymore();
        m_loc.setMore(true);
    }
     /* consume all comment chars except "*" */
    <multilinecomment>[^*\n\r]* { yymore(); }
     /* consume every "*" that is not followed by "/" */
    <multilinecomment>"*"+[^*/\n\r]* { yymore(); }
     /* end of comment */
    <multilinecomment>"*"+"/" {
        m_comments.push_back(CParser::Comment(yytext, m_loc.tokenLine(), m_loc.tokenCol()));
        m_loc.setMore(false);
        BEGIN(INITIAL);
    }
     /* [SingleLineComment] also increments lineNo */
    ("//"([^\r\n])*{LineTerminator})|("//"([^\r\n])*) {
        m_comments.push_back(CParser::Comment(yytext, m_loc.tokenLine(), m_loc.tokenCol()));
        m_loc.linc();
    }
    

    Nothing special here. We use the lexer's internal state plus yymore(), which lets us assemble a complete token from several rules.

    Parser-controlled lexer modes

    JS has a special literal form for regular expressions: they can be written directly in the source code, and the marker is "/". But the same character is also the division sign, and the token stream differs significantly between the two cases. A slash is treated as a division sign if the parser allows one at the current position, and as the start of a regexp in all other cases:
    // here the slash starts a regexp
    /asd[^c]/i.match(s);
    // and here it is the division sign
    a = b / c;
    

    That is, the parser must be able to set a flag indicating whether a division operator may appear at the moment. At the flex level this is solved simply:
     /* [RegularExpressionLiteral] is longer than [DivPunctuator] (at least 3 chars, "/X/", versus 2, "/="), so flex tries it first, before [DivPunctuator] */
    "/"{RegularExpressionFirstChar}({RegularExpressionFirstChar}|"*")*"/"({IdentifierStart}|[0-9])* {
        if (m_mode != Mode::M_REGEXP)
            REJECT;
        yylval->assign(yytext, yyleng);
        return token::T_REGEXP;
    }
     /* [DivPunctuator]: "/" or "/=" */
    "/" return token::T_DIV;
    "/=" return token::T_DIVAS;
    

    Since flex is a greedy lexer that tries to match the longest possible lexeme, the input first falls into the regexp rule; if division mode is set, REJECT then sends it on to the next matching rule.
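    For completeness, here is a rough sketch of the parser side. The Mode enum matches the lexer snippet above, but afterShift(), setMode() and the token names here are my own assumptions, not the real code: division is allowed only after a token that can end an expression.
    // Hypothetical sketch: switch the lexer's slash interpretation
    // depending on the token that was just shifted.
    void Parser::afterShift(token::type tok) {
        switch (tok) {
        case token::T_IDENT:     // b / c
        case token::T_NUMERIC:   // 2 / c
        case token::T_RPAREN:    // (a + b) / c
        case token::T_RBRACKET:  // a[i] / c
            m_lexer.setMode(Mode::M_DIV);     // '/' means division here
            break;
        default:
            m_lexer.setMode(Mode::M_REGEXP);  // '/' starts a regexp literal
            break;
        }
    }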

    Parse tree as output

    I really wanted the result of the analysis to be the parse tree itself: it is very convenient to work with for my task. An AST would also do, but alas, ANTLR, the only tool supporting that format, requires explicit tree-building rules on top of the grammar. In my case no special effort was required, only small additions to the analyzer algorithm:
        while (!accepted && !error)
        {
            ...
            switch (act.type)
            {
            ...
            case Action::ACTION_SHIFT:
                ...
                // a terminal was shifted: push a leaf node for it
                node.col = m_loc.tokenCol();
                node.line = m_loc.tokenLine();
                node.type = ParseNode::NODE_TERM;
                node.termVal = s;
                node.symb.term = tok;
                nodes.push(tree.create(node));
                break;
            case Action::ACTION_REDUCE:
                ...
                // a rule was reduced: create a nonterminal node and
                // reparent the top ruleLength nodes of the stack under it
                node.type = ParseNode::NODE_NONTERM;
                node.symb.nonTerm = (Symbols::SymbolsType)act.reduce.nonTerm;
                CTree::TreeNode * parent = tree.create(node);
                unsigned long ruleLen = act.reduce.ruleLength;
                while (ruleLen)
                {
                    tree.add(parent, nodes.top(), true);
                    nodes.pop();
                    ruleLen--;
                }
                nodes.push(parent);
                break;
            }
        }
        // whatever remains on top of the stack is the root of the tree
        if (nodes.top())
            tree.SetRoot(nodes.top());
    


    Custom lookahead

    In the grammar you can come across things like this:
    ExpressionStatement = [lookahead ∉ {{, function}] Expression ; 


    No, this is not the usual expected symbol of a reduction that I talked about in the previous part. It is rather the opposite: an attempt to filter the first symbol of Expression. According to this rule, we must remove from Expression those derived productions that start with the terminal "{". Here this avoids a conflict between Block, which may look like "{}", and one of the derivations of a plain Expression: "{}" as an empty JavaScript object literal. Similarly for "function".

    This can be solved without any parser magic, purely by reworking the grammar:
    ExpressionStatement = ExpressionWithoutBracket
    ExpressionWithoutBracket = ... | AssignmentExpressionWithoutBracket | ...
    ...
    
    And so on until we reach the rule that begins with the terminal "{" and exclude it at the next-to-last step. The problem is that there are about 20 intermediate steps. On top of that, multiply by 2 (do not forget "function"!), and then by 2 again, because the same trick is applied to the same rule in another situation (two rules: Expression and ExpressionNoIn). No, that does not appeal to me at all.

    Technically, in my parser, this is solved quite simply:
    ExpressionStatement = Expression {filter: [objectLiteral, functionDeclaration]}
    ...
    ObjectLiteral = {commonId: objectLiteral} '{' '}' | '{' PropertyNameAndValueList '}' | '{' PropertyNameAndValueList ',' '}'
    
    During closure we propagate the filtering label from parent to child where needed, and when we encounter a rule that falls under the filter, we simply do not include it in the closure.
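    In code the idea looks roughly like this. This is a minimal sketch with hypothetical types and names (Rule, Item, rulesFor), not the real closeItem():
    #include <set>
    #include <vector>

    // Simplified illustration: items carry a set of filtered commonIds;
    // a filtered production never enters the closure, and survivors
    // inherit the filter so the label travels down the derivation chain.
    struct Rule { int commonId; /* lhs, rhs, dot position, ... */ };
    struct Item
    {
        const Rule * rule;
        std::set<int> filter; // commonIds banned below this item
        bool operator<(const Item & o) const
        { return rule != o.rule ? rule < o.rule : filter < o.filter; }
    };

    // Stub: in the real generator this returns the productions of the
    // nonterminal after the item's dot.
    std::vector<const Rule *> rulesFor(const Item &) { return {}; }

    void closeItem(const Item & item, std::set<Item> & closure)
    {
        for (const Rule * rule : rulesFor(item))
        {
            if (item.filter.count(rule->commonId))
                continue;                  // filtered out of the closure
            Item child{rule, item.filter}; // propagate the label
            if (closure.insert(child).second)
                closeItem(child, closure); // close the new item as well
        }
    }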

    Turn off some child rules

    This is almost a subspecies of the previous point, except that here we need to look not only at the first symbol but at an arbitrary one. The illustration is the aforementioned ExpressionNoIn and the related VariableDeclarationListNoIn. Looking closer, the roots of the problem lie in the two forms of for:
    // first syntax
    for (var id in obj) alert(id);
    // second syntax
    for (var id = 1; id < 10; id++) alert(id);

    Under normal conditions we may use in when initializing variables, a la a = 'eval' in window; this operator checks whether a member exists in the given object (hussars, silence!). In a for loop, however, it iterates over all members of an object. Confusion arises all too easily, so in is forbidden in the second syntax. An LALR parser actually resolves this case with no trouble, thanks to reductions that check the next symbol (';' or ')'), but the grammar's authors apparently targeted a wider class of parsers and so introduced about 20 duplicate rules for the case without the in operator. Still, the grammar must be followed, so a mechanism is needed to switch off the objectionable child (grand-grand-...) rules.

    In my generator this works just like the previous point:
    ExpressionNoIn = Expression {filter: [RelationalExpression_6]}
    ...
    RelationalExpression = ShiftExpression | RelationalExpression '<' ShiftExpression | RelationalExpression '>' ShiftExpression | RelationalExpression '<=' ShiftExpression | RelationalExpression '>=' ShiftExpression | RelationalExpression 'instanceof' ShiftExpression | {id: RelationalExpression_6} RelationalExpression 'in' ShiftExpression
    

    This is an optional feature that simply makes life easier. And it is also a good way to surface errors: in the grammar itself they forgot to add NoIn in one of the rules:
    ConditionalExpressionNoIn = LogicalORExpressionNoIn ? AssignmentExpression : AssignmentExpressionNoIn
    
    Obviously, the second AssignmentExpression should carry the NoIn suffix.

    Auto insert ';'

    Another entertaining JavaScript feature is that terminating a statement with a semicolon is optional, even in places where the grammar requires one. The insertion rules are simple to state yet tricky to implement:
    1. The parser reported an error on the next token, and that token is '}' or is separated from the previous one by at least one line break.
    2. An unexpected EOF was encountered.
    3. The rules do not apply to a for header: the semicolons between its parentheses are never inserted automatically.

    The insertion itself is quite simple: once the need for it is detected, we just run one more FSM iteration with the last read token replaced by a semicolon. On the following shift we do not read a new token but reuse the one that caused the error.
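    As a sketch, the decision step might look like this. The names are my own invention: T_RBRACE, T_EOF, newlineBefore(), m_inForHeader and the pending-token fields are assumptions, not the real code.
    // Hypothetical sketch: decide whether automatic semicolon insertion
    // applies, then replay the FSM with ';' and keep the offending token.
    bool Parser::tryAutoSemicolon(token::type offending)
    {
        bool allowed =
               offending == token::T_RBRACE   // rule 1: the token is '}'
            || m_loc.newlineBefore()          // rule 1: a line break precedes it
            || offending == token::T_EOF;     // rule 2: unexpected end of input
        if (!allowed || m_inForHeader)        // rule 3: never inside a for header
            return false;
        m_pending = offending;                // reuse it on the next shift
        m_current = token::T_SEMICOLON;       // one more FSM round with ';'
        return true;
    }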

    Unicode Support

    JS is very thorough in this regard: UTF-16 support is rigidly written into the standard, to the point that the entire source must be converted to UTF-16 before parsing. Unfortunately, flex cannot do this (and no, \xXX escapes will not do: offhand, about a thousand characters would have to be encoded that way). So for now this requirement remains unmet.

    Quantifier support

    Another fairly big topic that makes grammars easier to write. First of all I needed the following quantifiers: "|" for rule variants with a common left-hand side and different right-hand sides; "?" for an optional symbol in a rule; "*" for a symbol repeated one or more times. Yes, all of them can be expressed at the level of plain BNF grammars:
    A = B | C
    // becomes
    A = B
    A = C
    A = B? C
    // becomes
    A = B C
    A = C
    A = B*
    // becomes
    A = A B
    A = B
    

    Everything is fine here except the last example: it produces a staircase in the tree. If B is repeated 5 times, we get a chain of A nodes 5 levels deep, which is very inconvenient. So once again I decided to modify the parser to obtain a convenient flat representation: BBBBB forms a single node A with 5 leaves B. In practice this means a new operation type, ReduceRepeated, plus changes to the closure function and to reduction detection: in closeItem() we loop the repeated element back on itself.
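    In the analyzer loop this could look roughly as follows; the action name mirrors the shift/reduce code above, while isRepetition() is a helper I made up for the sketch:
        case Action::ACTION_REDUCE_REPEATED: {
            // like a normal reduce, but instead of popping a fixed
            // ruleLength we keep popping while the stack top is another
            // repetition of the same symbol: BBBBB becomes one node A
            // with five B leaves
            node.type = ParseNode::NODE_NONTERM;
            node.symb.nonTerm = (Symbols::SymbolsType)act.reduce.nonTerm;
            CTree::TreeNode * parent = tree.create(node);
            while (!nodes.empty() && isRepetition(nodes.top(), act.reduce.nonTerm))
            {
                tree.add(parent, nodes.top(), true);
                nodes.pop();
            }
            nodes.push(parent);
            break;
        }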

    And here, for comparison, is what happens if the reduction is done the usual way: every repetition adds another nested A node, and the tree degenerates into a staircase.
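    For BBBBB and the expansion A = A B | B, the usual reduction yields:
    A
    ├── A
    │   ├── A
    │   │   ├── A
    │   │   │   ├── A
    │   │   │   │   └── B
    │   │   │   └── B
    │   │   └── B
    │   └── B
    └── B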


    That's all. Thanks for reading this article.

    Parts of the article


    1. Part 1. Basic Theory
    2. Part 2. Description of LR generators
    3. Part 3. Subtleties of writing an LR generator and its possible features
