Compilation. 5: top-down parsing

    So far we have been busy with bottom-up parsing. What other options are there?
    Let's put bison aside for a while and go back to theory.

    In this post:

    1. Idea
    2. Implementation
    3. Holy war
    4. Backtracking

    Idea


    Recall the general idea of bottom-up parsing: as we read the input, we shift it onto the stack one character at a time; as soon as the top of the stack matches the right-hand side of some grammar rule, we reduce it to a single nonterminal and carry on.
    An important feature of this method is that at the moment a rule is reduced we already have all of its components at hand, so we can build a tree of whatever shape we like out of them, or even use them on the fly for calculations.

    The opposite approach is to start by expanding the start symbol of the grammar, choosing which rule to expand by according to the next input character.
    For example, if the grammar contains the rules
    OP: '{' OPS '}' // block
        | EXPR ';' // expression
        | 'if' '(' EXPR ')' OP
        | 'if' '(' EXPR ')' OP 'else' OP
        | 'while' '(' EXPR ')' OP
        | 'exit' ';' ;
    
    and we see that the next thing in the text is while, then we expand according to the rule OP: 'while' '(' EXPR ')' OP ;

    On a pushdown automaton, such parsing can be implemented as follows:
    • When expanding, we push the right-hand side of the rule onto the stack: the rest of the text will be parsed against it.
    • If there is a terminal on top of the stack and the same terminal comes next in the input, the text matches our expectations: we pop the terminal off the stack and read the next character of the text.
    • If there is a nonterminal on top of the stack, it has to be expanded. We choose the rule to expand by based on the next input character.
    Note right away that it is pointless to perform an action at the moment of expansion: the right-hand side of the expanded rule has not been read yet, and it is not even known whether it will be read at all. We could, when expanding, push a marker onto the stack underneath the symbols of the right-hand side, meaning "the rule has been read successfully, you may perform the action"; but by that moment the entire right-hand side has already been popped off the stack! What would the action work with?
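    Here is a minimal sketch of such a machine in C. The grammar S: 'a' S 'b' | 'c' ; is made up purely for illustration (it is not part of our language); it is LL(1), so one lookahead character always determines the rule, and the awkward question of choosing a rule does not arise yet.

    /* A toy pushdown recognizer for the made-up grammar  S: 'a' S 'b' | 'c' ;
     * following the steps above: push the right side when expanding,
     * match terminals against the input, pick the rule by one lookahead char. */
    #include <stdio.h>

    int parse(const char *text)
    {
        char stack[256];
        int top = 0;
        stack[top++] = 'S';                 /* start by expanding the start symbol */

        while (top > 0) {
            char sym = stack[--top];        /* symbol on top of the stack */
            if (sym == 'S') {               /* nonterminal: choose a rule by lookahead */
                if (*text == 'a') {         /* S: 'a' S 'b' -- push the right side reversed */
                    stack[top++] = 'b';
                    stack[top++] = 'S';
                    stack[top++] = 'a';
                } else if (*text == 'c') {  /* S: 'c' */
                    stack[top++] = 'c';
                } else {
                    return 0;               /* no rule starts with this character */
                }
            } else {                        /* terminal: must match the input */
                if (*text != sym)
                    return 0;
                text++;                     /* consume the matched character */
            }
        }
        return *text == '\0';               /* success only if the whole text was consumed */
    }

    int main(void)
    {
        printf("%d %d %d\n", parse("aacbb"), parse("c"), parse("aacb"));  /* expect 1 1 0 */
        return 0;
    }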

    But the main difficulty of top-down parsing is not this; it is how to choose the right rule to expand by.
    Recall the grammar that proved too tough for bison:
    WORD: S1 'a' 'i' 'l' | S2 'a' 'l' 'e';
    S1: 's';
    S2: 's';
    

    How do we expand WORD when the next character is 's'?
    There is no way to choose: both rules for WORD begin with exactly that 's'.

    Implementation


    We could allow the automaton to look ahead by more than one character: judging by the reviews, that is the approach taken by ANTLR, a popular LL parser generator.
    The technical difficulty is that every additional character on which the choice of expansion depends adds a dimension to the transition tables; so the uncompressed tables for an automaton of N states that reads 3 characters of the text (plus one symbol from the stack) would contain N·256⁴ entries, which is tens of gigabytes.
    Therefore the practicality of LL parsing with long lookahead is determined precisely by the parser generator's ability to compress tables efficiently (remember how bison squeezed our 258×14 table into seven short one-dimensional arrays).

    As far as I know, nobody makes LR parsers with long lookahead, because things are fine as they are: conflicts caused by insufficient lookahead are rare in LR parsers. For bison to recognize our grammar, it is enough not to make it choose between two identical reductions:
    WORD: 's' 'a' 'i' 'l' | 's' 'a' 'l' 'e';
    
    For the LL parser, on the other hand, nothing has changed: just as before, both rules match the character 's'. Therefore, to "adapt" the grammar for LL parsing, the common beginning of different rules for the same nonterminal is moved out into an auxiliary nonterminal:
    WORD1: 's' 'a' WORD;
    WORD: 'i' 'l' | 'l' 'e';
    

    This trick is called "left factorization": a "common factor" is, as it were, taken out of rules that start the same way.
    Now the expansion of WORD1 is unique, and the two possible expansions of WORD begin with different letters.

    But why bother with factorization and litter the grammar with meaningless nonterminals, if ANTLR's miracle table compressor lets it look ahead as far as it likes?
    Let's go back to the grammar
    OP: '{' OPS '}' // block
        | EXPR ';' // expression
        | 'if' '(' EXPR ')' OP
        | 'if' '(' EXPR ')' OP 'else' OP
        | 'while' '(' EXPR ')' OP
        | 'exit' ';' ;
    

    How do we expand OP when the text continues with if? Two rules can begin that way; and their common part, 'if' '(' EXPR ')' OP, can be arbitrarily long, so no lookahead, however long, will save the parser here.

    Another problem LL cannot cope with is left recursion. Remember the grammar of our market-haggling calculator:
    EXPR: NUM | EXPR OP NUM;
    

    Both rules for EXPR begin with NUM: the first explicitly, the second implicitly; yet there is no common "factor" that could be taken out of these rules. If we also recall that the length of NUM is unbounded, it is clear that no amount of lookahead will solve the problem.

    To fix the grammar, we start from its meaning: the first NUM on the line is a complete expression by itself, and every further expression has two operands.
    EXPR: NUM EXPR1;
    EXPR1: | OP NUM EXPR1;
    

    Using left factorization and eliminating left recursion, more or less any unambiguous grammar can be adapted for LL parsing, at the cost of adding a heap of auxiliary nonterminals that mean nothing at all.
    For example, when reducing the rule EXPR: EXPR OP NUM ; we could easily build a syntax tree node: the left argument is right there in $1, the right argument in $3. And what can you do with the expansion EXPR1: OP NUM EXPR1 ;? The left argument has already been expanded and popped off the stack; what we hold in our hands is EXPR1, the sub-expression that comes after the right argument. What good is that?
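    In a hand-written recursive-descent parser there is a common workaround, sketched below. It is not what our bison-based setup does, and all the helper names are made up: the EXPR1 tail is turned into a loop that carries the already-built left subtree along as an accumulator, so the left-associative tree can still be assembled.

    /* Sketch: building a left-associative tree for  EXPR: NUM EXPR1;  EXPR1: | OP NUM EXPR1;
     * by carrying the left subtree through the EXPR1 loop as an accumulator. */
    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Node {
        char op;                                    /* operator, or 0 for a number leaf */
        int value;                                  /* used only when op == 0 */
        struct Node *left, *right;
    } Node;

    static Node *leaf(int v) { Node *n = calloc(1, sizeof *n); n->value = v; return n; }
    static Node *binop(char op, Node *l, Node *r) {
        Node *n = calloc(1, sizeof *n); n->op = op; n->left = l; n->right = r; return n;
    }

    static Node *parse_num(const char **p) {        /* NUM: a run of digits */
        int v = 0;
        while (isdigit((unsigned char)**p)) v = v * 10 + (*(*p)++ - '0');
        return leaf(v);
    }

    static Node *parse_expr(const char **p) {       /* EXPR: NUM EXPR1; */
        Node *left = parse_num(p);                  /* the first NUM is a finished expression */
        while (**p == '+' || **p == '-') {          /* the EXPR1 tail, iterated instead of recursed */
            char op = *(*p)++;
            Node *right = parse_num(p);
            left = binop(op, left, right);          /* accumulate: keeps left associativity */
        }
        return left;
    }

    static void show(Node *n) {                     /* print the tree fully bracketed */
        if (n->op == 0) { printf("%d", n->value); return; }
        printf("("); show(n->left); putchar(n->op); show(n->right); printf(")");
    }

    int main(void)
    {
        const char *text = "1+2-3";
        show(parse_expr(&text));                    /* prints ((1+2)-3) */
        printf("\n");
        return 0;
    }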

    The important point is that left recursion is natural for all left-associative operations, and most operations are left-associative. So a muddle like the one above is not a rare curiosity but the typical case.

    On the one hand, converting a grammar to LL form is a purely formal procedure, and it can be entrusted to a soulless piece of hardware. On the other hand, how do you debug an automaton that works not by the grammar you gave it, but by one of its own?
    In particular, we don't get to attach actions to expansions, because the rules being expanded are not the ones we wrote, but some other rules that the hardware devised for itself. It is nice if such a parser can still produce the parse tree of the original grammar; to do that, it has to correlate the rules of the given grammar with the rules the automaton actually works by, and rebuild the tree, putting back in place the nodes that have migrated from one rule to another.

    Holy war


    The argument about the advantages of LL vs. LR is as old as automatic parsing itself; both approaches have their supporters. For example, Niklaus Wirth, whom I personally hold in deep respect, was an active advocate of LL parsing, and one of the design goals of Pascal was to allow LL(1) parsing, i.e. by an automaton that sees only the one next character of the text. Most "living" languages do not fit into LL: for example, in C, int can begin the declaration of either a variable or a function, and we cannot tell one from the other until we have read the line to the end.

    As for convenience: by and large, everyone praises the tool they are used to and dislikes the unfamiliar one.
    For example, here is a quote from one "best answer" on Stack Overflow on the subject of "ANTLR vs. bison":
    In terms of personal taste, I think that LALR grammars are a lot easier to construct and debug. The downside is you have to deal with somewhat cryptic errors like shift-reduce and (the dreaded) reduce-reduce. These are errors that Bison catches when generating the parser, so it doesn't effect the end-user experience, but it can make the development process a bit more interesting.

    The advantages listed for ANTLR in that debate are anything but the quality of the parsing itself: a convenient development environment, code generation in many languages, and readable generated code.

    The readable generated code is indeed the most striking difference between ANTLR and table-driven parsers. In essence, the ordinary call stack is used instead of a specialized parsing stack, and recognition of each grammar rule is implemented as a function call (a recursive function for a recursive rule). So when debugging, the call stack shows at a glance how the parser got into its current state; whereas with bison, as we remember, you have to turn on debug printing of the transitions between states and pore over the automaton's printout to understand why they happened.
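    Roughly, the generated code looks something like the sketch below. This is not actual ANTLR output; the grammar is cut down to two rules, with 'x' standing in for a whole expression, just so the sketch stays short.

    /* Sketch of "a function per rule" for the trimmed grammar
     *     OP:  '{' OPS '}' | 'x' ';' ;      ('x' stands in for EXPR)
     *     OPS: | OP OPS ;
     * The recursive rules become (mutually) recursive functions, so a debugger
     * backtrace shows directly how the parser reached its current state. */
    #include <stdio.h>
    #include <stdlib.h>

    static const char *text;

    static void error(const char *what) { fprintf(stderr, "expected %s at '%s'\n", what, text); exit(1); }
    static void expect(char c) { if (*text != c) error((char[]){c, 0}); text++; }

    static void parse_ops(void);

    static void parse_op(void)                  /* OP: '{' OPS '}' | 'x' ';' ; */
    {
        if (*text == '{') {
            expect('{'); parse_ops(); expect('}');
        } else {
            expect('x'); expect(';');
        }
    }

    static void parse_ops(void)                 /* OPS: | OP OPS ; */
    {
        while (*text == '{' || *text == 'x')    /* the recursive tail, written as a loop */
            parse_op();
    }

    int main(void)
    {
        text = "{x;{x;x;}}";
        parse_op();
        if (*text) error("end of text");
        puts("parsed OK");
        return 0;
    }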
    The drawbacks of the recursive implementation compared to the table-driven one are also clear: a much larger volume of code, and hence memory consumption; and the impossibility of "patching" the generated code by hand, because it changes whenever the grammar changes, and on top of that is spread across dozens of functions.

    Backtracking


    LL parsers, which choose a rule in advance for every expansion, are not the only option for top-down parsing. An alternative idea: when several rules fit, let's try them all, one of them will work out. For example, we can do something like fork(): create clones of the parser in its current state, and let each clone try one of the possible expansions. If a clone runs into a syntax error, it dies. If all the clones have died, then no expansion fits: there is a syntax error in the input text.
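    A toy illustration of the clone idea, using a literal fork(): this is only to make the picture concrete, not how real backtracking parsers are built, and the grammar is the WORD example from above, flattened to two strings.

    /* Each alternative expansion of WORD is tried by its own clone (child process);
     * a clone that hits a mismatch dies, and the text is accepted if at least one
     * clone survives.  WORD: 's' 'a' 'i' 'l' | 's' 'a' 'l' 'e'; */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static const char *alternatives[] = { "sail", "sale" };

    int main(int argc, char **argv)
    {
        const char *text = argc > 1 ? argv[1] : "sail";
        int n = sizeof alternatives / sizeof alternatives[0];

        for (int i = 0; i < n; i++) {
            if (fork() == 0)                /* this clone tries exactly one expansion */
                exit(strcmp(text, alternatives[i]) == 0 ? 0 : 1);
        }

        int survivors = 0, status;
        for (int i = 0; i < n; i++) {       /* collect the clones' fates */
            wait(&status);
            if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                survivors++;
        }

        if (survivors)
            printf("parsed OK (%d clone(s) survived)\n", survivors);
        else
            printf("syntax error: all clones died\n");
        return survivors ? 0 : 1;
    }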

    For correct texts this approach is more or less acceptable; but error detection becomes a problem. First, which of the detected errors do we report? Most of them are the consequence of a wrongly chosen expansion, not of a mistake in the program text; but from the parser's point of view they all look exactly the same.

    Second, the parsing may never finish: every time we expand by a left-recursive rule, one of the clones ends up in exactly the same state as before the expansion; so at each step one more identical clone appears, and so on ad infinitum. If some clone reaches the end of the parse, all the rest can be killed; but what if, on the contrary, all the other clones run into syntax errors and die, and only the uselessly self-cloning clone lives on?

    Finally, what do we do about ambiguous grammars? Both LL and LR parsers detect conflicts at parser-generation time; here there is no generation step as such: the parser takes the grammar almost raw, i.e. interprets it on the fly.
    So it may turn out that for the same text we sometimes get one parse and sometimes another, depending on which clone manages to finish first.

    The last problem was solved in a most original way: it was declared that the very possibility of ambiguous parsing is a fundamental flaw of grammars, and that parsing expressions should be used instead; in essence they differ only in that a strict priority is imposed on the expansion rules. For example, if the grammars
    OP: EXPR ';' | 'if' '(' EXPR ')' OP | 'if' '(' EXPR ')' OP 'else' OP ;
    and
    OP: EXPR ';' | 'if' '(' EXPR ')' OP 'else' OP | 'if' '(' EXPR ')' OP ;
    are one and the same ambiguous grammar, then the parsing expression
    OP: EXPR ';' / 'if' '(' EXPR ')' OP 'else' OP / 'if' '(' EXPR ')' OP
    unambiguously tells the parser: "first try to recognize if-with-else, and only if that fails, recognize if-without-else". And the expression
    OP: EXPR ';' / 'if' '(' EXPR ')' OP / 'if' '(' EXPR ')' OP 'else' OP
    means "first recognize if-without-else", so the else branches will never be recognized at all.

    Now, instead of trying all possible expansions at the same time, we try them in turn, in order of priority; we move on to the next expansion only when the previous one has run into an error. A link mentioned by commenters gives an excellent metaphor for this parsing method:
    Have you ever ridden a lazy elephant? It walks slowly and sways violently. And it doesn't walk along the road: it wanders off in whatever directions it finds interesting, and if they don't coincide with where you need to go, the mahout tells the elephant "no, not that way". So, having tried all the options, the elephant ends up at the one point about which it was not told "not that way". Parser combinators likewise try all combinations of options until one of them works. And after every rollback they start doing the same work all over again. <...> A program a couple of lines long it understood quickly. Three lines already took a few seconds. Five lines I never waited for. Remember, children: such parser combinators are suitable only for very simple grammars.
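    Here is a sketch of that "try in turn, roll back on failure" behaviour for the first parsing expression above (the one with if-with-else ahead of if-without-else). Everything is squeezed down for the illustration: the single letter 'c' stands in for '(' EXPR ')', 'e;' stands in for EXPR ';', and there is no memoization yet, so failed attempts really are repeated from scratch.

    /* PEG-style ordered choice with backtracking for the trimmed rule
     *     OP: 'if' c OP 'else' OP / 'if' c OP / 'e;'
     * Alternatives are tried in the order written; after a failed attempt the
     * input position is rolled back and the next alternative is tried. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    static const char *pos;                     /* current input position */

    static bool lit(const char *s)              /* match a literal at the current position */
    {
        size_t n = strlen(s);
        if (strncmp(pos, s, n) != 0) return false;
        pos += n;
        return true;
    }

    static bool parse_op(void)
    {
        const char *saved = pos;                /* remember where this attempt started */

        if (lit("if") && lit("c") && parse_op() && lit("else") && parse_op())
            return true;                        /* alternative 1: if-with-else */
        pos = saved;                            /* roll back, try the next alternative */

        if (lit("if") && lit("c") && parse_op())
            return true;                        /* alternative 2: if-without-else */
        pos = saved;

        if (lit("e;"))
            return true;                        /* alternative 3: an expression statement */
        pos = saved;
        return false;
    }

    int main(void)
    {
        pos = "ifcifce;elsee;";                 /* if (c) if (c) e; else e; */
        printf("%s\n", parse_op() && *pos == '\0' ? "parsed OK" : "syntax error");
        return 0;
    }

    Swapping alternatives 1 and 2 reproduces the problem described above: the shorter if would always succeed first, and the else would never be consumed.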

    It remains to learn to detect parsing loops, to achieve an acceptable running time and, icing on the cake, decent error localization.

    Loops are the easier part: if we end up in the same state twice without advancing through the input text, then we will end up in it a third time, and as many times as you like after that. It follows that for every position in the input text we need to keep a list of all the states already visited at that position: when we arrive at the same state a second time, we say "no, that's enough, I'm not going there again".

    As a free bonus we get "linear" running time, if we remember not just the check mark "been here already" but also the result of the previous attempt (the expansion matched / did not match); then the worst case is visiting every position of the text in every possible state. If the text is N characters long, and the parsing expression in the grammar has M rules (counting the alternative expansions of every nonterminal), we get a running time of O(M·N). If you squint slyly and pretend that M is a constant, then yes, the time is linear. For comparison, traditional LL and LR parsers, with a predetermined action in every state, have exactly the same worst case of O(M·N):
    S: | T1 S;
    T1: T2;
    T2: T3;
    ...
    TM: 't';
    
    Here LR performs M reductions after shifting each 't': 't' -> TM -> ... -> T3 -> T2 -> T1; and LL, before "eating" each 't', performs M expansions: T1 -> T2 -> T3 -> ... -> TM -> 't'.
    The whole question is how far the average case is from the worst: at the very least, the "linear elephant" performs more expansions on any parse than LL does on the same grammar.
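    A minimal sketch of the memo table follows (the names and the flattened grammar are made up for the illustration): one cell per (rule, position) pair, holding either "failed", "matched up to position p", or "attempt in progress". The last value doubles as the loop detector described above: running into our own "in progress" mark at the same position means the expansion would loop, so we refuse to recurse.

    /* Packrat-style memoization for a flattened two-rule grammar:
     *     WORD:    's' LETTERS ;
     *     LETTERS: 'a' 'i' 'l' | 'a' 'l' 'e' ;
     * Each (rule, position) pair is attempted at most once; the result is cached. */
    #include <stdio.h>
    #include <string.h>

    #define MAXN 128
    enum { UNKNOWN = -3, IN_PROGRESS = -2, FAIL = -1 };   /* values >= 0: "matched, next position" */

    static const char *text;
    static int memo[2][MAXN];                             /* rule 0 = WORD, rule 1 = LETTERS */

    static int parse(int rule, int pos);

    static int attempt(int rule, int pos)                 /* try the rule without consulting the cache */
    {
        if (rule == 1) {                                  /* LETTERS */
            if (strncmp(text + pos, "ail", 3) == 0) return pos + 3;
            if (strncmp(text + pos, "ale", 3) == 0) return pos + 3;
            return FAIL;
        }
        return text[pos] == 's' ? parse(1, pos + 1) : FAIL;   /* WORD */
    }

    static int parse(int rule, int pos)
    {
        if (memo[rule][pos] == IN_PROGRESS) return FAIL;  /* we'd loop here: refuse to recurse */
        if (memo[rule][pos] != UNKNOWN) return memo[rule][pos];   /* already know the answer */
        memo[rule][pos] = IN_PROGRESS;
        return memo[rule][pos] = attempt(rule, pos);
    }

    int main(void)
    {
        text = "sale";
        for (int r = 0; r < 2; r++)
            for (int p = 0; p < MAXN; p++) memo[r][p] = UNKNOWN;
        printf("%s\n", parse(0, 0) == (int)strlen(text) ? "parsed OK" : "syntax error");
        return 0;
    }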

    Another catch is memory consumption. We need to store the M·N results of past attempts; and that is on top of having to keep the entire input text in memory, because the elephant has to run back and forth across it endlessly. Compare this with pushdown-automaton parsers, which never return to already-read text, so their memory requirements depend only on the grammar, not on the text.
    Has everyone read "the story of one byte"? And yet one of the most natural applications of new compilers is supporting all kinds of compact platforms, where saved memory really matters.

    Now, about detecting syntax errors. Our elephant, which is actually called packrat (a hoarder), concludes that the text contains an error when not a single expansion has matched. So the position at which the error is detected can be much earlier than the place of the error itself: suppose we have the parsing expression
    DECL: DECLVAR / DECLFN;
    DECLVAR: TYPE ID ';' ;
    DECLFN: TYPE ID '(' ARGS ')';
    
    which remotely resembles the syntax of declarations in C.
    If the input text contains a sequence that parses as TYPE ID '!', at what position is the syntax error? Packrat tries the first expansion, it fails; the parser rolls back to the beginning of TYPE and tries the second expansion; that one does not fit either. So the error ends up being reported at the position before TYPE; and what if TYPE is a hard-core expression half a line long?
    It would be logical to report as the error position the rightmost position that at least one expansion managed to reach, i.e. the last position at which the parser still had hope that the text would parse successfully.
    I assume there are packrat implementations where error localization is implemented exactly this way.
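    A sketch of that heuristic (hypothetical names; the DECL rules are trimmed, with 'T' standing in for TYPE, 'x' for ID, and ARGS dropped): every terminal match notes how far into the text it was attempted, and the furthest such position is reported when the whole parse fails.

    /* Report the rightmost position any expansion managed to reach.
     * Trimmed rules:  DECL: DECLVAR / DECLFN;
     *                 DECLVAR: 'T' 'x' ';' ;   DECLFN: 'T' 'x' '(' ')' ; */
    #include <stdio.h>

    static const char *text;
    static size_t furthest;                       /* rightmost position any attempt reached */

    static int lit(size_t *pos, char c)           /* try to match one terminal */
    {
        if (*pos > furthest) furthest = *pos;     /* note how far we have gotten */
        if (text[*pos] != c) return 0;
        (*pos)++;
        return 1;
    }

    static int parse_decl(size_t *pos)
    {
        size_t saved = *pos;
        if (lit(pos, 'T') && lit(pos, 'x') && lit(pos, ';'))
            return 1;                             /* DECLVAR */
        *pos = saved;
        if (lit(pos, 'T') && lit(pos, 'x') && lit(pos, '(') && lit(pos, ')'))
            return 1;                             /* DECLFN */
        *pos = saved;
        return 0;
    }

    int main(void)
    {
        text = "Tx!";                             /* TYPE ID '!' -- the example from the text */
        size_t pos = 0;
        if (!parse_decl(&pos))
            printf("syntax error at position %zu ('%c'), not at position 0\n",
                   furthest, text[furthest]);
        return 0;
    }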

    That's it; this was the last post of the series devoted to parsing.
    Next time we'll move on to semantics, and then our toy language will truly become a programming language.
