Compilation. 1: lexer
The mystery of a program's birth from another program has always fascinated me. Unfortunately, Russian universities pay little attention to this interesting topic. I hope to write a series of posts in which we will gradually build a small but efficient compiler.
The first posts of the series have already been written and beta-tested on a small, tight-knit community. Nevertheless, I will keep revising them to take the wishes of the venerable Habr public into account.
Further in the post:
- Why write compilers?
- Overall plan
- Text analysis
- Practical example
- How it works?
Why write compilers?
Haven't all the necessary programming languages been written already? Who, in our enlightened age, could need to write a compiler of their own?
Everything that can be invented has been invented.
-- Charles H. Duell, Commissioner of the US Patent Office, 1899 (attributed)
What has been will be again, and what has been done will be done again; there is nothing new under the sun.
-- Ecclesiastes 1:9 (c. 3rd century BC)
Firstly, languages are constantly being created and developed. Of the currently popular ones, Ruby, PHP, Java, and C# were created within living memory and came into heavy use only a few years ago. Microsoft is pushing its new F# language right now, and given Microsoft's weight, it is likely to end up among the common ones too.
There are still niches for new languages as well: for example, attempts to come up with a convenient imperative language for parallel programming show no sign of stopping.
Secondly, the techniques used in compilation (first and foremost, grammar parsing) have many other applications. There is often a need for source-to-source transformations (refactoring, translating code into another language, and so on), where you have to parse text in a programming language, process it, and output the processed text (in the same language or a different one).
Of course, compilation is a rather exotic pastime for a programmer, somewhere alongside programming huge fighting humanoid robots. However, unlike many other extravagant programming hobbies, all of this has practical applications.
Overall plan

The compilation process, in essence, consists of two main stages:
- Source text analysis
- Machine code generation
The most challenging part of a compiler is code optimization. In the first stage, high-level optimizations are performed on tree nodes (for example, loops are unrolled and functions are inlined); in the second, low-level ones, on the instruction stream (for example, instructions are reordered to keep the pipelines of a particular processor as full as possible). Optimization traditionally never makes it into university courses; but we will try to implement the simplest kinds (copy elimination, constant propagation) in our example.
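To give a feel for those last two, here is a hypothetical before-and-after fragment in plain C (the names and numbers are invented for illustration); the "after" version is what the optimizer effectively leaves behind:

/* Before: 'a' always holds the constant 5, and 'b' is a mere copy. */
int value_before(void)
{
    int a = 5;       /* a constant        */
    int b = a;       /* a copy of 'a'     */
    return b * 2;    /* uses the copy     */
}

/* After constant propagation (5 replaces 'a'), copy elimination
   ('b' disappears), and constant folding (5 * 2 computed at
   compile time), the function effectively becomes: */
int value_after(void)
{
    return 10;
}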
I have seen older posts about parsing here on Habr; but it seems their authors never once got as far as generating code.
Text analysis
When a novice programmer sets out to write a text parser, the natural approach is recursive: find the beginning of a construct (for example, {); find its end (for example, the } at the same nesting level); cut out the contents of the construct; and parse them recursively.
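A minimal sketch of that naive approach, for concreteness (a hypothetical helper handling only curly braces):

/* Return the index of the '}' matching the '{' at text[0],
   at the same nesting level; -1 if the braces are unbalanced. */
int find_matching_brace(const char *text)
{
    int depth = 0;
    for (int i = 0; text[i]; i++) {
        if (text[i] == '{')
            depth++;
        else if (text[i] == '}' && --depth == 0)
            return i;
    }
    return -1;
}
/* The caller would then cut out the contents between the braces and
   call itself recursively on them, re-scanning the same characters
   once per nesting level. */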
This approach has two problems: firstly, excessive complexity (we walk back and forth over the same fragment of text); secondly, poor maintainability (the language's syntax is smeared across kilobytes upon kilobytes of branching code).
The language syntax can instead be defined declaratively; regular expressions, for example, are familiar to everyone. Ideally we would write a set of regexps for all the language's constructs and, next to each one, a description of the node to be created in the program tree; a "universal parser" would then simply match the program against one regexp after another and create nodes according to the descriptions.
The first of those problems is solved by the fact that matching all the regexps can be done in a single pass over the text, i.e. there is no need to keep the entire program in memory: just read it one character at a time, process the character, and forget it.
The second is solved because we now have a centralized, formal description of the language: we can change the regexps without touching the code at all, and vice versa, change the code without any risk of breaking the parser.
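In other words, the ideal is something like the following table (a purely hypothetical sketch, not the format of any real tool): the entire lexical syntax in one place, driven by a generic matching loop.

/* Hypothetical declarative syntax description. */
struct rule {
    const char *regexp;      /* construct to recognize     */
    const char *node_kind;   /* tree node to create for it */
};

static const struct rule rules[] = {
    { "[0-9]+",                 "NumberLiteral" },
    { "[A-Za-z_][A-Za-z0-9_]*", "Identifier"    },
    { "\"[^\"]*\"",             "StringLiteral" },
    /* ...a universal parser tries these in order, in one pass over the text... */
};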
Practical example
The "universal parser" just described has, of course, long since been implemented, and more than once. The classic implementation is lex, which accepts a description of the actions for each regexp in C. The standard GNU utilities include flex, a lex-compatible tool that we will use for the examples. (Similar utilities exist for C#, Java, and so on.) By tradition, the first example is a calculator:
%{
#include <stdio.h>
#include <stdlib.h>

int reg = 0;
char op = '+';
int unmin = 0;
%}

%option main
%option yylineno

%%

\/\/.*\n   ; // comment

[0-9]+  { int opnd = atoi(yytext);
          if (unmin) opnd = -opnd;
          unmin = 0;
          switch (op) {
            case '+': reg += opnd; break;
            case '-': reg -= opnd; break;
            case '*': reg *= opnd; break;
            case '/': reg /= opnd; break;
          }
          op = 0;
        }

[-+*/]  { if (op) {
            if (*yytext == '-')
              unmin = 1;
            else {
              printf("Unexpected operator in line %d\n", yylineno);
              exit(1);
            }
          } else
            op = *yytext;
        }

[;]     { if (op) {
            printf("Unexpected semicolon in line %d\n", yylineno);
            exit(1);
          }
          printf("=%d\n", reg);
          reg = 0;
          op = '+';
        }

[ \t\r\n]  ; // whitespace

.       { printf("Syntax error in line %d\n", yylineno); exit(1); }

%%
What we have simulated is a market-stall calculator: no brackets, no operator precedence, no fractions. Expressions are separated by semicolons, and comments can be inserted with // to the end of the line. Let's compile our calculator and play with it:
[tyomitch@home ~]$ lex 1.lex
[tyomitch@home ~]$ cc lex.yy.c
[tyomitch@home ~]$ ./a.out
2+2;
=4
2+2*2;
=8
2 + // hello
- 3
;
=-1
2 / /
Unexpected operator in line 6
Walking through the code
- At the top, between the %{ %} markers, is code that will be copied verbatim into the parser. We declare three variables: reg ("register") for the intermediate result, op for the operator entered, and unmin, a flag meaning a minus was entered right after an operator and should be treated as the sign of the second operand.
- %option main says we are satisfied with the standard main (which reads from stdin until EOF). %option yylineno creates a variable int yylineno in the parser holding the current line number of the input text: it is handy for diagnostic messages.
- %% separates the declarations area from the actual language rules area.
- In each rule, the regexp is written on the left and C code on the right. In the first rule the code is empty, i.e. such a construct (this is a comment) will simply be ignored. Likewise, the next-to-last rule says to ignore spaces and line breaks between recognized constructs.
- In the second rule we use the variable yytext: it holds the text matched by the regexp (in our case, the digits of the operand).
- The third rule is an example of handling an error in the program text (using yylineno in the message).
- Rules are tried in the order they appear, from first to last. If none matches, the parser simply prints the current character to stdout. To prevent that, we add the last rule: . matches any single character and prints an error message.
How it works?
Mathematicians have the concept of a DFA (deterministic finite automaton): a machine that can be in one of N states; that reads the input stream character by character; and that has a table saying, for every combination of current state and character read, which state to go to next. What flex does is build a DFA from the given set of regexps; in certain states this DFA not only moves on but also triggers one of our rules.
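To make the idea concrete before digging into flex's output, here is a toy hand-rolled DFA (my own sketch, not flex's actual tables): two states, two character classes, and an "action" that fires when a run of digits ends.

#include <stdio.h>

/* Toy DFA: state 0 = outside a number, state 1 = inside one.
   Character classes: 0 = digit, 1 = everything else. */
static const int next_state[2][2] = {
    /* digit  other */
    {  1,     0  },   /* from state 0 */
    {  1,     0  },   /* from state 1 */
};

void scan(const char *text)
{
    int state = 0, start = 0;
    for (int i = 0; ; i++) {
        int cls  = (text[i] >= '0' && text[i] <= '9') ? 0 : 1;
        int next = text[i] ? next_state[state][cls] : 0;
        if (state == 1 && next == 0)   /* leaving a number: run the action */
            printf("number: %.*s\n", i - start, text + start);
        if (state == 0 && next == 1)   /* entering a number: remember where */
            start = i;
        if (!text[i])
            break;
        state = next;
    }
}

Calling scan("2+15*3") would print the three numbers one per line; flex's machine is the same loop, only with more states and real actions.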
You can see the DFA that flex actually constructed by looking inside the generated parser lex.yy.c. It took 15 states. To save space, the transition table is not stored explicitly (it would be 15x256) but split into cleverly overlapping lists. To see it in its most visual form, compile the parser with the option -Cef ("disable table compression"):

static yyconst short yy_nxt[][8] =
{
{0, 0, 0, 0, 0, 0, 0, 0},
{3, 4, 5, 6, 7, 8, 9, 10},
{3, 4, 5, 6, 7, 8, 9, 10},
{-3, -3, -3, -3, -3, -3, -3, -3},
{3, -4, -4, -4, -4, -4, -4, -4},
{3, -5, -5, -5, -5, -5, -5, -5},
{3, -6, -6, -6, -6, -6, -6, -6},
{3, -7, -7, -7, -7, -7, -7, -7},
{3, -8, -8, -8, -8, 11, -8, -8},
{3, -9, -9, -9, -9, -9, 12, -9},
{3, -10, -10, -10, -10, -10, -10, -10},
{3, 13, 13, 14, 13, 13, 13, 13},
{3, -12, -12, -12, -12, -12, 12, -12},
{3, 13, 13, 14, 13, 13, 13, 13},
{3, -14, -14, -14, -14, -14, -14, -14},
};
static yyconst short int yy_accept[15] =
{0,
0, 0, 8, 6, 5, 5, 3, 3, 2, 4,
0, 2, 0, 1
};
The characters are divided into 8 classes that are identical from the parser's point of view (for example, all digits are merged into one class). A separate array, static yyconst int yy_ec[256], maps each character to its class. The main parser loop is very straightforward:
yy_match:
    while ((yy_current_state = yy_nxt[yy_current_state][yy_ec[(unsigned char)(*yy_cp)]]) > 0)
    {
        if (yy_accept[yy_current_state])
        {
            yy_last_accepting_state = yy_current_state;
            yy_last_accepting_cpos = yy_cp;
        }
        ++yy_cp;
    }
    yy_current_state = -yy_current_state;
Positive numbers in the transition table mean "go to this state and keep reading"; negative numbers mean "go to this state and perform an action". Which action to perform upon entering a state is stored in yy_accept. Let's trace an example: for digits, the symbol class number is 6.
- In the initial state (1), under symbol class 6 the transition table holds 9: go to state 9 and keep reading.
- In state 9, if another digit (class 6) arrives, go to state 12 and keep reading.
- In state 12, another digit simply keeps us reading (the table holds 12 there).
- If a non-digit arrives (any class except 6), an action must be performed: row 9 of the table is filled with -9, and row 12 with -12.
Now check yy_accept: in both cases we perform rule 2. (Recall that the rule recognizing numbers is indeed the second one in our parser.) It is unclear why flex decided to split states 9 and 12 apart when it does the same thing in both; but it is a soulless piece of iron, it knows best.
The remarkable thing is how simple the finished parser is: instead of branching recognition code for the various constructs, there is one big table and a ten-line loop. But there is a significant problem. Let's return to the example from the very beginning of the post: "find the beginning of a construct (for example, {); find its end (for example, the } at the same nesting level)..." How can a regexp express the condition "at the same nesting level"?
Mathematicians have tried as well: they proved that it cannot. So, to parse languages that allow nested constructs (that is, every language more complex than BAT files), we need a more powerful tool than regexps. More about it next time.