Compilation. 3: bison
This is the only post in the series devoted to the good old bison, which some find tiresome. For those who do not write in C, the post should still be interesting, because LR parser generators built on the same principle exist for a great many languages. Those who reject LR parsers on ideological grounds, I have nothing to tempt with today.
Further in the post:
- Grammar Compilation
- Two-stage parser
- What is inside it?
- Grammar Conflicts
- How does it work?
What have we been doing so far?
Last time we put together a grammar for arithmetic expressions:
EXPR: TERM | EXPR '+' TERM | EXPR '-' TERM;
TERM: NUM | TERM '*' NUM | TERM '/' NUM;
NUM: DIGIT | NUM DIGIT;
DIGIT: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
We finished by examining the pushdown automaton that parses it.
We don't know how to devise such automata ourselves, and we don't need to: bison can build them for us!
With bison's help, we can compile our parser and see how it all works for real.
Grammar Compilation
The general principle is the same as with flex in the first part: we list the grammar rules, and next to each rule we write the code to be executed when that rule is reduced. Last time we noted that during a reduction we pop a group of symbols (terminals or variables) from the stack, look at their values, set the value of the variable being reduced, and push it onto the stack in place of the symbols we removed.
%{
#include <stdio.h>
int yylex() { return getc(stdin); }
void yyerror(char *s) {
fprintf (stderr, "%s\n", s);
}
%}
%%
EVALUATE: EXPR '\n' { printf("=%d\n", $$); } ;
EXPR: TERM
| EXPR '+' TERM { $$ = $1 + $3; }
| EXPR '-' TERM { $$ = $1 - $3; }
;
TERM: NUM
| TERM '*' NUM { $$ = $1 * $3; }
| TERM '/' NUM { $$ = $1 / $3; }
;
NUM: DIGIT
| NUM DIGIT { $$ = $1*10+$2; }
;
DIGIT: '0' { $$=0; } | '1' { $$=1; } | '2' { $$=2; } | '3' { $$=3; }
| '4' { $$=4; } | '5' { $$=5; } | '6' { $$=6; } | '7' { $$=7; }
| '8' { $$=8; } | '9' { $$=9; }
;
%%
Dissecting the code
Just like in lex files, the code between the %{ %} tags is copied into the parser unchanged. Two functions must be defined: int yylex() returns the next input character, and void yyerror(char *s) prints an error message. The classic yacc included a ready-made implementation of yyerror, which could be overridden if needed; its GNU clone bison requires the programmer to supply one explicitly.
%%, as in lex files, separates the declarations area from the grammar rules area. The rules are listed in the form we are used to, with a reduction action added at the end of each rule. Inside the action, $-tags refer to the values of the symbols on the parser stack: $$ refers to the variable being reduced (the left-hand side of the rule), $1 to the leftmost symbol of the right-hand side (the deepest in the parser stack), $2 to the second from the left, and so on. If the right-hand side has N symbols, the values $1 through $N are available. If no action is given and the right-hand side consists of a single symbol, bison "inherits" its value by default: $$ = $1.
The very first declared variable is considered the "root" of the grammar, i.e. all input must eventually reduce to that root. EXPR would do as a root, but then printing the computed value would have to go into a reduction of EXPR, which means the value of every subexpression along the way would be printed too. For convenient printing, we introduce a new variable EVALUATE, used only as the root. At the same time this lets us consume the newline at the end of the expression: besides convenient printing, we get convenient input.
When compiling the parser, you need to link the liby library, which contains bison's standard main().
Unlike the market-stall calculator, which could ignore spaces and comments, the bison calculator understands only a strictly defined syntax: digits, operation signs, a newline at the end:
[tyomitch@home ~]$ bison 2.y
[tyomitch@home ~]$ cc -ly 2.tab.c
[tyomitch@home ~]$ ./a.out
22+3*4-5
=29
[tyomitch@home ~]$ ./a.out
2 + 2
syntax error
To allow, for example, spaces between any pair of symbols, we would have to add them to the grammar explicitly:
EXPR: TERM | EXPR WS '+' WS TERM | EXPR WS '-' WS TERM ;
TERM: NUM | TERM WS '*' WS NUM | TERM WS '/' WS NUM ;
NUM: DIGIT | NUM DIGIT ;
DIGIT: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
WS: | WS ' ' ;
Clearly, this is inconvenient. Writing ten separate rules just to recognize digits is inconvenient too; and if we also needed Latin letters, say in variable names, would we really write 52 rules?!
How great it would be to combine the advantages of flex and bison! To keep simple things (numbers, spaces) easy to recognize, and to make complex constructs possible.
Two-stage parser
If we already have a flex parser that successfully recognizes numbers and strips out comments and spaces, we can run the input text through it, and feed the result through the smarter bison parser. In fact, there is no need to store an intermediate result: both stages can be performed in a single pass. flex reads the input text character by character, now and then passing tokens, the terminal symbols of the grammar, "up" to bison. The values for the tokens are set by flex itself.
For such a symbiosis to be possible, flex must somehow know the terminal symbols of the bison grammar. We run bison with the -d key, and it generates a .h file listing all the terminals. For this to work, the terminals must be declared (with %token) in the grammar file; then all that remains is to include the generated .h file from the .lex file.
We no longer need the yylex function: the input characters will now come from flex rather than from stdin.
In addition, we dropped the '\n' after EXPR (flex swallows it now), and removed all the rules for NUM and DIGIT:
%{
#include <stdio.h>
void yyerror(char *s) {
fprintf (stderr, "%s\n", s);
}
%}
%token NUM
%%
EVALUATE: EXPR { printf("=%d\n", $$); } ;
EXPR: TERM
| EXPR '+' TERM { $$ = $1 + $3; }
| EXPR '-' TERM { $$ = $1 - $3; }
;
TERM: NUM
| TERM '*' NUM { $$ = $1 * $3; }
| TERM '/' NUM { $$ = $1 / $3; }
;
%%
The corresponding .lex file:
%{
#include "3.tab.h"
%}
%option yylineno
%option noyywrap
%%
[/][/].*\n ; // comment
[0-9]+ { yylval = atoi(yytext);
return NUM;
}
[ \t\r\n] ; // whitespace
. { return *yytext; }
%%
The file with the token definitions gets the suffix .tab.h. Just about the only thing in it is
#define NUM 258
All tokens receive codes above 256, so that they differ from "ordinary" characters. To pass a token to bison, its value is written into the global (oh horror!) variable yylval, and the token's code is returned.
We pass ordinary characters along in the usual way (by returning the character code).
The noyywrap option indicates that there is only one input text, and that on EOF there is no need to try moving on to a next one. This option was implied as long as we used %option main, which also set up reading from stdin. Now main() will come from bison, so we need neither flex's standard main() nor one of our own. We compile the two-stage parser:
[tyomitch@home ~]$ lex 3.lex
[tyomitch@home ~]$ bison -d 3.y
[tyomitch@home ~]$ cc -ly lex.yy.c 3.tab.c
[tyomitch@home ~]$ ./a.out
22+ // hello
3*4 - 5
=29
[tyomitch@home ~]$ ./a.out
22+x
syntax error
A note on terminology: in such a two-stage model, the lower parser, which recognizes tokens, is traditionally called a lexical analyzer ("lexer"), and the upper one, which recognizes constructs built from tokens, is called simply a parser. This describes the parser's role, not its mechanism: other parsing systems may, for example, use pushdown parsers for both stages.
What is inside it?
To see the generated automaton, there is no need to dive into the depths of the generated code: bison has powerful tools for debugging grammars. Specify the -v key, and look at the file with the .output suffix. After moving number parsing into the lexer, the automaton is left with 14 states, and they are described like this:
...
state 7

    4 EXPR: EXPR '-' . TERM

    NUM  shift, and go to state 1

    TERM  go to state 11

state 8

    6 TERM: TERM '*' . NUM

    NUM  shift, and go to state 12

state 9

    7 TERM: TERM '/' . NUM

    NUM  shift, and go to state 13

state 10

    3 EXPR: EXPR '+' TERM .
    6 TERM: TERM . '*' NUM
    7     | TERM . '/' NUM

    '*'  shift, and go to state 8
    '/'  shift, and go to state 9

    $default  reduce using rule 3 (EXPR)
...
For each state, the rules recognized in that state are given (together with their numbers in the grammar), and the actions performed for each input symbol are listed. Note that no next state is given after a reduction; instead, the automaton returns to the state read from the stack, and in that state looks up the "go to" entry for the freshly reduced non-terminal. Thus the automaton's transition table is two-dimensional: in each state, the action depends only on the input symbol, and does not depend on the contents of the stack. (But the next state after a reduction is taken from the stack.)
To avoid crawling over the automaton's printout with a pencil, feeding it symbol by symbol, you can ask bison to print every state-to-state transition during the parse. To do this, compile the grammar with the -t key, and a global flag yydebug appears in the parser. It must be set to 1, for example in main().
If, in addition, we want the symbol values to be printed, we need to define the YYPRINT macro:
The code after the second %% tag is copied into the parser unchanged, just as if it were inside %{ %}.
Now that we define main() ourselves, we no longer need to link liby when compiling:
%{
#include <stdio.h>
void yyerror(char *s) {
fprintf (stderr, "%s\n", s);
}
#define YYPRINT(file, type, value) fprintf(file, "%d", value);
%}
%token NUM
%%
EVALUATE: EXPR { printf("=%d\n", $$); } ;
EXPR: TERM
| EXPR '+' TERM { $$ = $1 + $3; }
| EXPR '-' TERM { $$ = $1 - $3; }
;
TERM: NUM
| TERM '*' NUM { $$ = $1 * $3; }
| TERM '/' NUM { $$ = $1 / $3; }
;
%%
int main () { yydebug=1; return yyparse(); }
[tyomitch@home ~]$ lex 3.lex
[tyomitch@home ~]$ bison -dt 3.y
[tyomitch@home ~]$ cc lex.yy.c 3.tab.c
[tyomitch@home ~]$ ./a.out
Starting parse
Entering state 0
Reading a token: 22+3*4-5
Next token is token NUM (22)
Shifting token NUM, Entering state 1
Reducing stack by rule 5 (line 20), NUM -> TERM
Stack now 0
Entering state 4
Reading a token: Next token is token '+' (22)
Reducing stack by rule 2 (line 15), TERM -> EXPR
Stack now 0
Entering state 3
Next token is token '+' (22)
Shifting token '+', Entering state 6
Reading a token: Next token is token NUM (3)
Shifting token NUM, Entering state 1
Reducing stack by rule 5 (line 20), NUM -> TERM
Stack now 0 3 6
Entering state 10
Reading a token: Next token is token '*' (3)
Shifting token '*', Entering state 8
Reading a token: Next token is token NUM (4)
Shifting token NUM, Entering state 12
Reducing stack by rule 6 (line 21), TERM '*' NUM -> TERM
Stack now 0 3 6
Entering state 10
Reading a token: Next token is token '-' (4)
Reducing stack by rule 3 (line 16), EXPR '+' TERM -> EXPR
Stack now 0
Entering state 3
Next token is token '-' (4)
Shifting token '-', Entering state 7
Reading a token: Next token is token NUM (5)
Shifting token NUM, Entering state 1
Reducing stack by rule 5 (line 20), NUM -> TERM
Stack now 0 3 7
Entering state 11
Reading a token: Now at end of input.
Reducing stack by rule 4 (line 17), EXPR '-' TERM -> EXPR
Stack now 0
Entering state 3
Now at end of input.
Reducing stack by rule 1 (line 13), EXPR -> EVALUATE
=29
Stack now 0
Entering state 2
Now at end of input.
Of the stack, only the states are printed; the kinds of the symbols on the stack, and their values, remain to be guessed from context.
If the YYPRINT macro is not defined, then the values of the tokens read will also have to be guessed: bison prints only empty parentheses.
Grammar Conflicts
Last time, ambiguous grammars were mentioned: grammars that admit several parses of a single expression.
What will bison say if we try to compile an ambiguous grammar? When the grammar allows several continuations from one state, bison does not know which action to perform. In our case, it wavers between a shift and a reduction. You can fix the grammar, as we did last time; or you can "push" bison in the right direction, hinting at what to do in case of a conflict. This should be regarded as a quick hack: the parser starts working, but a "grammar with hints" becomes harder to debug.
%{
#include <stdio.h>
void yyerror(char *s) {
fprintf (stderr, "%s\n", s);
}
%}
%token NUM
%%
EVALUATE: EXPR { printf("=%d\n", $$); } ;
EXPR: NUM
| EXPR '+' EXPR { $$ = $1 + $3; }
| EXPR '-' EXPR { $$ = $1 - $3; }
| EXPR '*' EXPR { $$ = $1 * $3; }
| EXPR '/' EXPR { $$ = $1 / $3; }
;
%%
[tyomitch@home ~]$ bison 4.y
4.y: conflicts: 16 shift/reduce
Since arithmetic expressions are a typical source of conflicts, the hints come in a ready-made form: declaring operator priorities (perform multiplication before addition) and associativities (among operators of equal priority, which to perform first). The lower an operator appears in the list, the higher its priority. Operators on the same line of the list receive equal priority. For right-associative operators there is the %right directive.
How unnatural the priorities hack is can be judged by the classic ambiguity
if (1) if (2) foo(); else bar();
To make it parse in the usual way, as
if (1) { if (2) foo(); else bar(); }
you need else to have a higher priority than ')'. It is hard to call these two terminals operators, and harder still to assign them a "natural" priority. But it works!
%{
#include <stdio.h>
void yyerror(char *s) {
fprintf (stderr, "%s\n", s);
}
%}
%token NUM
%left '+' '-'
%left '*' '/'
%%
EVALUATE: EXPR { printf("=%d\n", $$); } ;
EXPR: NUM
| EXPR '+' EXPR { $$ = $1 + $3; }
| EXPR '-' EXPR { $$ = $1 - $3; }
| EXPR '*' EXPR { $$ = $1 * $3; }
| EXPR '/' EXPR { $$ = $1 / $3; }
;
%%
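For reference, a hedged sketch of how such a hint could be written in a bison grammar (the token names IF, ELSE, EXPR and the rules are illustrative, not from this article's calculator). A rule with no declared precedence takes the priority of its last terminal, here ')'; since ELSE is declared below it, a shift of the else wins over the reduction, attaching the else to the nearest if:

```yacc
%nonassoc ')'    /* lower priority                  */
%nonassoc ELSE   /* higher priority than ')'        */
%%
STMT : IF '(' EXPR ')' STMT            /* priority taken from ')' */
     | IF '(' EXPR ')' STMT ELSE STMT  /* shifting ELSE wins      */
     ;
```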
The "grammar with hints" is more compact than the unambiguous one, both in its source form (half the length) and as an automaton (one state saved).
Even with an unambiguous grammar, conflicts can arise from the way the pushdown automaton operates: when performing each action, it sees only one next input symbol. For example, the grammar
WORD: S1 'a' 'i' 'l' | S2 'a' 'l' 'e';
S1: 's';
S2: 's';
is unambiguous, and it matches only two words, sail and sale. When the parser has shifted the first letter 's' and sees 'a' after it, it cannot choose whether to reduce to S1 or to S2: the correct reduction depends on the letter that comes after the 'a', and the automaton does not see it yet. This is the second reason the parser is made two-stage: because the lexer compresses character runs into tokens and discards the unneeded characters between tokens, the LR parser manages to "look further": not one letter, but one token ahead.
How does it work?
As with flex, at the core of the parser are a transition table and a small loop. Two parallel stacks are used: a state stack yyssa and a value stack yyvsa; after all, states and symbols always enter and leave the stack in pairs.
As last time, symbols that are identical from the parser's point of view are combined into classes. In this case there are 7 classes, and they are listed in the .output file. The array static const unsigned char yytranslate[259] maps every terminal to its class. (0 through 255 are the ordinary characters; 258 is the terminal NUM that we declared.)
The transition tables are cleverly compressed. The main table stores descriptions of the actions: for a shift (a positive number), which state to go to; for a reduction (a negative number), by which rule to reduce.
static const unsigned char yytable[] =
{
       6,     7,     5,     8,     9,    10,    11,     1,    12,    13
};
Surprisingly, the table is not just one-dimensional: it even has fewer elements than we have states (14 of them).
The trick is that the index of the first action for each state is stored in a separate table:
#define YYPACT_NINF -5
static const yysigned_char yypact[] =
{
       4,    -5,     2,    -4,    -3,    -5,     4,     4,     5,     6,
      -3,    -3,    -5,    -5
};
YYPACT_NINF means that no yytable element corresponds to the state; in other words, the action performed does not depend on the input symbol. The default actions for each state are stored in yet another table:
static const unsigned char yydefact[] =
{
       0,     6,     0,     2,     3,     1,     0,     0,     0,     0,
       4,     5,     7,     8
};
Only a reduction can be independent of the input symbol, so the values in yydefact are the numbers of the grammar rules to reduce by.
yypact[n] stores the yytable index of the action for state n and symbol class 0. The action for symbol class k is stored in yytable[yypact[n]+k]; this is why yypact may contain negative indices: each value is merely a "base" to which the class number is added. To check which symbol each action in yytable refers to, there is one more table:
static const unsigned char yycheck[] =
{
       4,     5,     0,     6,     7,     6,     7,     3,     3,     3
};
What do we see? yytable[0] refers to a symbol of class 4, yytable[1] to a symbol of class 5, and so on. Let's try to find some specific action, say, from the listing above:
...
state 7

    4 EXPR: EXPR '-' . TERM

    NUM  shift, and go to state 1

    TERM  go to state 11

state 8

    6 TERM: TERM '*' . NUM

    NUM  shift, and go to state 12
...
The class of the NUM terminal is 3.
We look up the first shift:
yytable[yypact[7]+3] == yytable[4+3] == 1
(and indeed, yycheck[yypact[7]+3] == 3)
The second shift:
yytable[yypact[8]+3] == yytable[5+3] == 12
(and indeed, yycheck[yypact[8]+3] == 3)
The "go to" table, which determines which state to enter after a reduction, is split into three arrays in the same way.
The code of the parser itself (the uninteresting pieces are cut out, and the interesting ones are annotated with comments):
yynewstate:
  *++yyssp = yystate; // push the current state onto the stack
  yyn = yypact[yystate]; // index of the first action
  if (yyn == YYPACT_NINF)
     goto yydefault; // does not depend on the input symbol
  yychar = yylex(); // read an input symbol
  yytoken = yytranslate[yychar]; // determine its class
  yyn += yytoken; // index into yytable
  if (yyn < 0 || YYLAST < yyn || yycheck[yyn] != yytoken)
    goto yydefault; // no suitable action
  yyn = yytable[yyn];
  if (yyn <= 0) {
      yyn = -yyn;
      goto yyreduce;
  }
  *++yyvsp = yylval; // shift
  yystate = yyn; // next state
  goto yynewstate;

yydefault:
  yyn = yydefact[yystate];
  if (yyn == 0) // syntax error:
    goto yyerrlab; // no action matched,
                   // and there is no default action

yyreduce:
  yylen = yyr2[yyn]; // length of the rule's right-hand side
  yyval = yyvsp[1-yylen]; // by default: inherit $1

  // reduction actions:
  // in place of the $-tags, bison substituted yyval on the left and yyvsp[] on the right
  switch (yyn) {
    case 2:
      { printf("=%d\n", yyval); }
      break;
    case 4:
      { yyval = yyvsp[-2] + yyvsp[0]; }
      break;
    case 5:
      { yyval = yyvsp[-2] - yyvsp[0]; }
      break;
    case 7:
      { yyval = yyvsp[-2] * yyvsp[0]; }
      break;
    case 8:
      { yyval = yyvsp[-2] / yyvsp[0]; }
      break;
  }
  yyvsp -= yylen; // pop from the stacks
  yyssp -= yylen;
  *++yyvsp = yyval; // push the freshly reduced variable
  yyn = yyr1[yyn]; // variable number from the rule number
  // compute the next state after the reduction
  yystate = yypgoto[yyn - YYNTOKENS] + *yyssp;
  if (0 <= yystate && yystate <= YYLAST && yycheck[yystate] == *yyssp)
    yystate = yytable[yystate];
  else
    yystate = yydefgoto[yyn - YYNTOKENS];
  goto yynewstate;
Once again we see: the entire parser, together with the evaluation of the expression, fits in a couple of pages of code; and a third of that is the clever lookup in the compressed tables. The advantage of bison and its ilk is that only the tables and the switch with the reduction actions are compiled from the .y file. The rest of the parser code is universal: no spaghetti, no recursive functions calling each other in intricate combinations. The grammar rules are genuinely compiled, not dressed up in the syntax of a high-level language.
Next time, we will take up parsing a toy programming language.