Writing a compiler for LALR(1) parsers. Basic theory

    Introduction, or why parsers are needed


    Good afternoon.
    Not so long ago I was faced with the task of parsing a certain grammar. Alas, the existing solutions did not suit me, so I had to write my own parser generator. Even though the topic is quite popular and there is no shortage of articles and books on it, I decided to describe the process once more, starting from the most basic concepts.

    This part is devoted to the foundations, the general computer-science theory. It may even be taught in schools and universities in Russia. The real meat starts in the second part.

    So, why would anyone need to write a parser, and what is it anyway? A parser is code that gives an incoming stream of characters semantic meaning. That is, it analyzes these characters, and based on that analysis the program understands how to interpret the letters and digits. A simple example is “1 + 2”: after (or during) parsing, the “+” sign is not just a plus character but the designation of the binary addition operator, while in “+3” it is the unary sign operator of a number. This is obvious to most people, but not to the machine.

    Parsers are used everywhere: in Word, for analyzing sentences, word forms, formulas, and so on; on almost any website, when validating input such as email addresses, phone numbers, and credit card numbers; in configuration files; in serialized data (e.g. XML); in many games, for scripted scenes, AI scripts, and the console. In general, they are an integral part of computer science.


    Two types of parsers


    OK, having realized the importance of this technology, we need to ask how this class of programs is usually written.

    Roughly speaking, parsers can be divided into two types: those that rely on a formal description of the language, and those where the analysis is coded directly, without an abstract data scheme.

    The first approach means splitting the analyzer into two parts: a description of the format/scheme of the input data, and the logic, which no longer operates on a raw data stream but on data structured according to that scheme. Example:
    scheme:
        DIGIT = '0'..'9'
        ADD = DIGIT '+' DIGIT
        SIGN = ('+' | '-') DIGIT
        EXPR = ADD | SIGN
    code:
        long result = 0;
        ParseNode & expr = Parse(input);
        ParseNode & oper = expr.child();
        switch (oper.type)
        {
        // 0: DIGIT, 1: '+', 2: DIGIT
        case ADD:
            result = oper.child(0) + oper.child(2);
            break;
        // 0: '+' or '-', 1: DIGIT
        case SIGN:
            result = oper.child(0) == '-' ? - oper.child(1) : oper.child(1);
            break;
        }
    


    The second approach is easier to show than to explain:
        char c = 0;
        long result = 0;
        input >> c;
        // we have three variants of first symbol
        switch (c)
        {
        case '+':
            // next symbol is digit
            input >> c;
            if (c < '0' || c > '9')
                return SyntaxError;
            result = c - '0';
            break;
        case '-':
            // next symbol must be a digit; a leading '-' makes the number negative
            input >> c;
            if (c < '0' || c > '9')
                return SyntaxError;
            result = -(c - '0');
            break;
        default:
            if (c < '0' || c > '9')
                return SyntaxError;
            result = c - '0';
            input >> c;
            if (c != '+')
                return SyntaxError;
            input >> c;
            if (c < '0' || c > '9')
                return SyntaxError;
            result += (c - '0');
            break;
        }
    


    In principle, this example already lets us draw conclusions about which way is better, but let me still collect the advantages of each type. Here a plus for one is essentially a minus for the other.

    Grammar Description Benefits:
    1. First of all, language support becomes easy: even the most radical modifications are simple to apply.
    2. Since the syntactic constructions are collected compactly in one place, it is very easy to review the structure of the language as a whole. With the alternative, you can simply drown among thousands and tens of thousands of lines mixed with logic. In addition, the language stays transparent and the evolution of its constructs is clearly traced. That is, in our example we see how DIGIT turns into ADD and then into EXPR, which effectively lets us outline the points where semantic logic should be attached.
    3. Easy error detection, and in two categories at once: syntax errors in the input data and mistakes in the parser itself. Hand-written code may contain contradictions, dead sections, and other logical joys, and still compile. With a declared scheme this is impossible: contradictions are revealed right at the generation stage.
    4. The extra layer of abstraction simplifies the source code and lets you debug and test higher-level entities, which are by definition fewer in number and more specific than a raw stream of characters.


    Despite all this, hand-coding has a number of advantages:
    1. The entry threshold is much lower. Almost any programmer can be told “sit down and write the code”, and the parser will be programmed in the proposed style. To put together a formal grammar you need a small but still important amount of theory: the basics of mathematical logic, an understanding of how grammars are constructed, and familiarity with a code-generation tool.
    2. Although a grammar can describe almost anything, there are exceptions, and the limitations of the tooling add to them. Direct coding is as flexible as it gets.
    3. There is one more tiny plus: everything is in one place. With a well-organized program you can pick out the key pieces of logic and immediately understand their essence. In the first variant we effectively need two screens, one for the scheme and one for the code, and sometimes you have to switch between them to clarify the nuances, which is not entirely convenient. But again, this is an extremely small advantage, especially when weighed against the second advantage of keeping the scheme separate.


    Parser structure


    Of course, I, like most normal programmers, almost always choose the first approach. Besides, implementing the second one requires no special knowledge, and I have no additional theory to offer for it. So from here on I will talk about the structure of a parser based on a formal grammar.

    [Diagram: lexical analyzer -> syntax analyzer -> semantic analyzer]

    This is a very rough scheme: an analyzer may consist of all three parts, combine the first two or the last two, or even be a single monolithic piece with no separation into components. There are hardly any established best practices for organizing it, but personally I prefer to keep separate things separate. I will not start a holy war here, just describe the options.

    First of all, it is worth understanding what these components are. The lexical analyzer receives an input stream of characters from a given alphabet (letters, digits, punctuation, and so on). It then breaks this stream into more complex primitives (yes, such an oxymoron), the so-called tokens, or terminals. For example, instead of a sequence of digits of some length, the parser receives a "Number" token with an attribute equal to the value of that number. Why this is needed, I will explain below. Next, the syntax analyzer, using the tokens and the rules for composing symbols of the second hierarchical level (the first level being tokens), the so-called non-terminals, builds a parse tree. This ADT unambiguously represents the structure of the recognized data. And finally, the last stage is the semantic analysis of the resulting tree and the implementation of the business logic.
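
    To make the idea of tokens a bit more concrete, here is a minimal C++ sketch of such a lexer (the names Token, TokenType and Lexer are mine, invented for illustration, not taken from any particular tool): it collapses a run of digits into a single Number token carrying its numeric value as an attribute.

        #include <cctype>
        #include <cstddef>
        #include <string>

        // Illustrative sketch only: a token is a kind plus an optional attribute.
        enum class TokenType { Number, Plus, Minus, End };

        struct Token
        {
            TokenType type;
            long      value; // attribute, meaningful only for Number
        };

        class Lexer
        {
        public:
            explicit Lexer(std::string text) : text_(std::move(text)), pos_(0) {}

            // Collapse a run of digits into one Number token, so the parser
            // never has to deal with individual digit characters.
            Token Next()
            {
                while (pos_ < text_.size() && text_[pos_] == ' ')
                    ++pos_;                          // skip spaces
                if (pos_ >= text_.size())
                    return {TokenType::End, 0};
                char c = text_[pos_];
                if (std::isdigit(static_cast<unsigned char>(c)))
                {
                    long value = 0;
                    while (pos_ < text_.size() &&
                           std::isdigit(static_cast<unsigned char>(text_[pos_])))
                        value = value * 10 + (text_[pos_++] - '0');
                    return {TokenType::Number, value};
                }
                ++pos_;
                if (c == '+') return {TokenType::Plus, 0};
                if (c == '-') return {TokenType::Minus, 0};
                return {TokenType::End, 0};          // anything else ends the toy stream
            }

        private:
            std::string text_;
            std::size_t pos_;
        };

    For the input "12 + 34" such a lexer would emit Number(12), Plus, Number(34), End.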

    Why do we need a lexical analyzer at all, if it could simply be merged into the syntax part? After all, what difference does it make which alphabet to consume, the original one or the new synthetic one? The answer is fairly obvious: first, it usually narrows the alphabet, i.e. simplifies the semantics; second, it removes one or even several lower levels of the tree while fully preserving its properties. Of course the lexical analyzer should not take on the role of the syntax analyzer, let alone the semantic one, so it should not return 3 for "1 + 2", but simple actions such as assembling numbers are perfectly appropriate. Let's complicate the example a bit and look at the resulting tree in the two cases; at the same time this will show what kind of tree it is to those who did not quite follow the brief explanation.
        DIGIT = '0'..'9'
        NUMBER = NUMBER? DIGIT
        ADD = NUMBER '+' NUMBER
        SIGN = ('+' | '-') NUMBER
        EXPR = ADD | SIGN
    


    The analysis runs over the expression 12 + 34; the resulting trees for the two cases are sketched below.

    Even with such a simple example it is clear that it is more convenient to split the analysis into at least two stages. Besides, the specialization of the lexical analyzer allows it to use more efficient algorithms, different from those used for parsing grammars. The main thing that distinguishes a lexer (besides the empirical one presented above) from a parser is the absence of production rules; or rather, they may be present implicitly, but only characters of the alphabet appear on their right-hand sides, and these rules never refer to the lexer's other non-terminals. That is, they are self-sufficient and describe only the expected stream of characters.
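
    A rough sketch of the two trees for this expression (my own drawing based on the grammar above, not the figure from the original article):

        Without a separate lexer, the digits are derived by the grammar itself:

            EXPR
              ADD
                NUMBER
                  NUMBER
                    DIGIT '1'
                  DIGIT '2'
                '+'
                NUMBER
                  NUMBER
                    DIGIT '3'
                  DIGIT '4'

        With a lexer that emits NUMBER tokens carrying a value attribute:

            EXPR
              ADD
                NUMBER(12)
                '+'
                NUMBER(34)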

    Now consider the second possible merger: integrating the semantic logic directly into the parser, bypassing the tree-building stage. The tactic here is simple: we mark the points that carry semantic meaning and write the code right there. Example:
        DIGIT = '0'..'9' {value = child[0] - '0'}
        ADD = DIGIT '+' DIGIT {value = child[0].value + child[2].value}
        SIGN = ('+' | '-') DIGIT {value = child[0].value == '-' ? - child[1].value : child[1].value}
        EXPR = ADD | SIGN {result = child[0].value}
    
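
    To show roughly what a generator could turn such a rule into, here is a hand-written C++ sketch for the ADD rule (the function name and error handling are mine; this is not the output of any real tool): the semantic action runs as soon as the rule is matched, and no tree node is ever built.

        #include <istream>
        #include <stdexcept>

        // Hypothetical generated code for
        //   ADD = DIGIT '+' DIGIT {value = child[0].value + child[2].value}
        long ParseAdd(std::istream & input)
        {
            char c = 0;
            input >> c;                               // child[0]: DIGIT
            if (c < '0' || c > '9')
                throw std::runtime_error("digit expected");
            long left = c - '0';
            input >> c;                               // child[1]: the '+' terminal
            if (c != '+')
                throw std::runtime_error("'+' expected");
            input >> c;                               // child[2]: DIGIT
            if (c < '0' || c > '9')
                throw std::runtime_error("digit expected");
            long right = c - '0';
            return left + right;                      // the embedded action
        }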


    It is clear that a complex grammar cannot be described this way. Moreover, such a mixture negates some of the advantages of keeping the code separate. But for simple grammars it is quite a good way to write the code, and its importance should not be downplayed.

    LL vs LR, or elephant vs whale

    Writing a good lexer is a separate large topic, so from here on I will describe only the development of the parser.

    Analyzers fall into two groups: bottom-up (LR) and top-down (LL).

    The names come from the way the parse tree is constructed: the first builds the tree from the bottom up, the second from the top down. Let me dwell on this a little and give the most naive algorithms.

    An LR parser, roughly speaking, maintains a stack in which it stores the most recently seen symbols (both terminals and non-terminals). At each step, having read the next token, the parser tries to find a rule that can be applied to the symbols at the top of the stack; if it finds one, it collapses that sequence of symbols into a single non-terminal. That is, if the stack looks like {DIGIT, +, DIGIT}, the analyzer will collapse it to {ADD}, and then to {EXPR}. The pseudocode of such a parser looks roughly like this:

    input >> term;
    while (term != EOF)
    {
        // shift: push the incoming terminal onto the stack
        stack.push(term);
        // reduce greedily while any rule matches the top of the stack
        do
        {
            reduced = false;
            for (Rule rule : rules)
                if (rule.Reduce(stack))
                {
                    // replace the rule's right-hand side with its left-hand non-terminal
                    stack.pop(rule.right.length);
                    stack.push(rule.symbol);
                    reduced = true;
                    break;
                }
        } while (reduced);
        input >> term;
    }
    

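    As an illustration, for the input "1 + 2" and the very first grammar above, the stack in this naive loop would evolve roughly like this:

        stack                     remaining input   action
        {}                        1 + 2             shift '1', reduce to DIGIT
        {DIGIT}                   + 2               shift '+'
        {DIGIT, '+'}              2                 shift '2', reduce to DIGIT
        {DIGIT, '+', DIGIT}                         reduce to ADD, then to EXPR
        {EXPR}                                      input exhausted, done
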

    An LL parser tries to do the opposite: for each incoming symbol it guesses which rule it belongs to. The most primitive variant is to choose among the alternatives (for example, EXPR can be expanded into ADD or into SIGN, i.e. two alternatives) by the first character, and then recursively descend with the new set of rules produced by the chosen path, and so on until the rules bottom out at terminals. The description is rather convoluted; the code is much easier to understand:
    function ExpandRule(symbol, term)
    {
        // pick the alternative for 'symbol' whose first terminal matches the lookahead
        Rule rule = null;
        for (Rule r : rules)
        {
            if (r.symbol == symbol && firstTerminal(r) == term)
            {
                rule = r;
                break;
            }
        }
        if (rule == null)
            throw SyntaxError;
        // expand the right-hand side: recurse into non-terminals, match terminals
        for (s : rule)
        {
            if (s.type == NonTerminal)
                term = ExpandRule(s, term);
            else
            {
                if (term != s)
                    throw SyntaxError;
                input >> term;
            }
        }
        return term;
    }
    input >> term;
    ExpandRule(EXPR, term);
    
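
    For the same input "1 + 2", and assuming firstTerminal() returns the digit class for ADD and '+'/'-' for SIGN, this naive descent would unfold roughly as follows:

        ExpandRule(EXPR, '1')        '1' is a digit, so EXPR is expanded as ADD
          ExpandRule(ADD, '1')       ADD = DIGIT '+' DIGIT
            ExpandRule(DIGIT, '1')   matches '1', reads the next terminal: '+'
            match '+'                reads the next terminal: '2'
            ExpandRule(DIGIT, '2')   matches '2', reads the next terminal: EOF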


    Which parser to use is a matter of taste; in almost every respect they are equivalent. There is a tale that LL(x) parsers were favored in the Soviet Union and LR(x) in the West, but I do not know how true it is. Personally, I found LR more appealing ideologically, and I will tell more about it in the next part.

    P.S. The importance of writing parsers correctly can be seen right here in this article: just look at the second block of code, which was left unhighlighted because it got wrapped into the same source block as the first example.

    Parts of the article


    1. Part 1. Basic Theory
    2. Part 2. Description of LR generators
    3. Part 3. Implementation details and possible features of LR generators
