Developing a compiler for the Cool educational language in C # for .NET (Part 1)
Introduction
Hello, dear Habrauser. I would like to present you material on the practical creation of a compiler that will translate code written in the Cool language into the code of the CIL (Common Intermediate Language) virtual machine under the .NET platform. I decided to split this material into two parts
In the first part we will describe the process of writing a grammar taking into account the priorities of operators in the ANTLR environment, as well as generating a lexer and a parser for C #. Also, it will examine the pitfalls that met on my way. Thus, I will try to save at least someone time (maybe for myself in the future).
In the secondthe same part will describe the process of constructing a semantic code analyzer, code generation, and self-made code optimization
according to ANTLR on the official website and talking with “fumbling” classmates. So the developed software is
So that the reader understands me better, I will first give some definitions first (from Wikipedia):
- Token - sequences of characters in lexical analysis in computer science, corresponding to the token.
- A lexer is a module for analytic analysis of an input sequence of characters (for example, such as source code in one of the programming languages) in order to obtain a sequence of characters called “tokens” at the output (like grouping letters into words).
- A parser is a module for comparing a linear sequence of tokens (words, tokens) of a language with its formal grammar. The result is a parse tree or a syntax tree.
So, the following programs and libraries were used to develop the compiler:
- ANTLRWorks - IDE for writing grammars and generating code for lexers and parsers for various languages (C, C #, Java, JavaScript, Objective-C, Perl, Python, Ruby, etc.). Based on LL (*) parsing.
- ANTLR C # runtime distribution (Antlr3.Runtime.dll) is used in the generated lexer and parser.
- Visual Studio 2010 In the next article, we will consider a component under WPF, an advanced text editor for syntax highlighting and code completion AvalonEdit ).
- Java Runtime Environment (since ANTLRWorks is written in Java).
- IL Disassembler will be used in the second part to understand how a compiler, such as C #, generates CIL code and how it should look.
Description of language Cool
Cool - Classroom Object-Oriented Language is, as the name implies, an educational programming language that
All fields are visible within the base and derived classes. All functions are visible from everywhere (public).
This language is built on expressions ( expression) All functions can take expressions as arguments, and the bodies of all functions are expressions that return any result. Even functions such as line output return a result, namely an instance of the class from which they are called.
Cool also contains the static classes String, Int, Bool, from which they cannot be inherited and which cannot take null values (here it is called void ). Objects of these classes are passed by value , not by reference.
Keywords of this language: class, else, false, fi, if, in, inherits, isvoid, let, loop, pool, then, while, case, esac, new, of, not, true
I want to note that some operators end not with a curly brace (as in C-like ones) or with the word end; (how in
What do you think, if each operator returns some result, then what should the if return? After all, he can return an option of two alternatives. The correct answer: the closest common ancestor (in the textbook it is written about it somehow complicated, there is still a special operator used). I give the picture for clarity:
Figure 1.
Here, for classes A and B, the closest common ancestor is class D.
The case statement is similar to the if statement applied several times.
The void is always returned in the while construct (unless of course the loop is looping).
Such a design {
Functions here can be called not only in the class and its ancestor (virtual functions), but in general in any ancestors of this class. This can be done using the '@' operator. As an example, consider the class hierarchy in Figure 1. Suppose that class E has a func () function, all other descendants redefine it in their own way. Then, if we have an instance of object A and want to call the func function from class D, we need to use the following syntax:
result <- (new A)@D.func()
The SELF_TYPE keyword is used as a synonym for a class that describes a specific function. The identifier self is a pointer to the current class (i.e. this in other languages).
Local variables are introduced in the Cool language using the let operator and they can only be used within the given let block.
void - an analogue of null, an indication that the object is not created.
I think all the other operators do not need explanations, and if it is not clear, then you can study the manual for this dialect of the Cool language using the link indicated at the end of the article.
Writing Cool Grammar in ANTLRWorks
So, the original grammar is given in the following form:
program ::= [[class; ]]+
class ::= class TYPE [inherits TYPE] { [[feature; ]]*}
feature ::= ID( [ formal [[, formal]]*] ) : TYPE { expr } | ID : TYPE [ <- expr ]
formal ::= ID : TYPE
expr ::=
ID <- expr
| expr[@TYPE].ID( [ expr [[, expr]]*] )
| ID( [ expr [[, expr]]*] )
| if expr then expr else expr fi
| while expr loop expr pool
| { [[expr; ]] +}
| let ID : TYPE [ <- expr ] [[, ID : TYPE [ <- expr ]]]* in expr
| case expr of [[ID : TYPE => expr; ]]+ esac
| new TYPE
| isvoid expr
| expr + expr
| expr ? expr
| expr ? expr
| expr / expr
| ~expr
| expr < expr
| expr <= expr
| expr = expr
| not expr
| (expr)
| ID
| integer
| string
| true
| false
Legend: [[]] * - iteration; [[]] + Is a positive iteration; So, what needs to be done after seeing such a grammar?
- The existence of left recursion . Of course, ANTLRWorks, based on the recursive descent method, will immediately detect and produce this kind of error: [fatal] rule compilation_unit has non-LL (*) decision due to recursive rule invocations reachable from alts ... Resolve by left-factoring or using syntactic predicates or using backtrack = true option ... And, of course, ANTLR has the option backtrack = true, which allows you to overcome this limitation. But applying it, you have to come to terms with another node in the generated spacecraft for expression with left recursion. So all the same, it would be nice to get rid of it. In addition, as can be seen from this article , this option slows down the parser and the author recommends abandoning it whenever possible.
- But the main thing is that the priority of operators is not taken into account in this form of grammar . If left recursion affects only the parser efficiency, ignoring this problem will lead to incorrect code generation later (for example, 2 + 2 * 2 will be 8, not 6)
- .
- @
- ~
- isvoid
- * /
- + -
- <= <=
- not
- <-
expr:
(ID ASSIGN^)* not;
not:
(NOT^)* relation;
relation:
addition ((LE^ | LT^ | GE^ | GT^ | EQUAL^) addition)*;
addition:
multiplication ((PLUS^ | MINUS^) multiplication)*;
multiplication:
isvoid ((MULT^ | DIV^) isvoid)*;
isvoid:
(ISVOID^)* neg;
neg:
(NEG^)* dot;
dot:
term ((ATSIGN typeName)? DOT ID LPAREN invokeExprs? RPAREN)? ->
^(Term term ID? typeName? invokeExprs?);
The term in this case includes all the remaining expressions that have not been considered, i.e. expressions with the same priority.
Comments in the language are essentially a sequence of characters that must be ignored when parsing. To do this, the ANTLR uses a design
{$channel = Hidden;};
written to the right of each rule describing comments. For instance:COMMENT : '--' .* ('\n'|'\r') {$channel = Hidden;};
ANTLR Operators
Next, I will describe the ANTLR statements used:
- ! - needed so that extra language tokens do not interfere with parsing. For example, brackets and other operators that are needed only for the parser (for example, the same endings of fi, esac syntax constructs). The operator is needed to ensure that the syntax tree is built from the token stream, and not the parse tree).
- -> - is needed in order to convert the sequence of tokens standing on the left side into a sequence of tokens standing on the right side, which would be convenient to process when generating the code. Together with this operator, the ^ operator is also used.
- ^ is used to create a new node in the generated syntax tree from the token marked with this operator and its descendants. There are two forms to this operator:
- Postfix. For example, for such a rule:
a node is inserted into the syntax tree with one of the operators indicated in parentheses and children addition , which is to the left of the operator and addition , which is to the right of the operator.relation: addition ((LE^ | LT^ | GE^ | GT^ | EQUAL^) addition)*;
- Prefix. For example, for such a rule:
the parent node of the Class with the descendants of typeName featureList typeName is inserted into the syntax tree ?CLASS typeName (INHERITS typeName)? LCURLY featureList RCURLY -> ^(Class typeName featureList typeName?)
- Postfix. For example, for such a rule:
All other ANTLR operators used do not need comments (for a person who has studied discrete mathematics), and if they need them, then small ones:
- * - Iteration.
- + - Positive iteration.
- ? - A left-handed token or a group of tokens can either be present or absent.
STRING: '"'
{ StringBuilder b = new StringBuilder(); }
( '"' '"' { b.Append('"');}
| c=~('"'|'\r'|'\n') { b.Append((char)c);}
)*
'"'
{ Text = b.ToString(); }
;
Here, StringBuilder is used to make string processing efficient. In principle, other intersperses of such code are acceptable for optimizing the lexer, but not for all rules. In order to specify the namespace in the C # lexer and parser code (for example, System.Text for StringBuilder), the following constructs are used (some “unnecessary” warnings are also disabled here):
@lexer::header {#pragma warning disable 3021
using System;
using System.Text;
}
@parser::header {#pragma warning disable 3021
using System.Text;}
After we compiled the grammar in ANTLRWorks, we need to generate the C # lexer and parser code (thanks, cap).
ANTLRWorks with the given settings generates 3 files:
- CoolGrammar.tokens
- CoolGrammarLexer.cs
- CoolGrammarParser.cs
Using the generated lexer in C # code
Getting a list of all tokens in the generated C # code does not look very trivial (although this is not always necessary):
var stream = new ANTLRStringStream(Source); // Source - исходный код.
lexer = new CoolGrammarLexer(stream, new RecognizerSharedState() { errorRecovery = true });
IToken token;
token = lexer.NextToken();
while (token.Type != CoolGrammarLexer.EOF)
{
// Здесь обрабатываем каждый токен.
// Имя токена: CoolTokens.Dictionary[token.Type] (Например для '(' это LPAREN,
// используется дополнительный заранее подготовленный словарь, сопоставляющий имя токена и его значение)
// Текст токена: token.Text (Например для '(' это просто "(")
// Номер линии токена в коде: token.Line
// Позиция токена в коде от начала строки: token.Column
token = lexer.NextToken(); // Читаем следующий токен, то тех пор, пока не дойдем до конца файла.
}
// Далее перегружаем поток токенов и устанавливаем позицию в коде в 0,
// чтобы использовать этот же поток токенов потом (читай в парсере).
lexer.Reset();
CoolTokens.Dictionary should be generated as follows:
var Lines = File.ReadAllLines(fileName);
Dictionary = new Dictionary(Lines.Length);
foreach (var line in Lines)
{
var parts = line.Split('=');
if (!Dictionary.ContainsKey(int.Parse(parts[1])))
Dictionary.Add(int.Parse(parts[1]), parts[0]);
}
fileName - Path to CoolCompiler.tokens fileUsing the generated parser in C # code
var tokenStream = new CommonTokenStream(lexer); // как получить lexer, было описано ранее.
parser = new CoolGrammarParser(tokenStream);
Tree = parser.program(); // Самый верхний узел, отвечающий за аксиому грамматики (т.е. program в нашем случае).
To bypass the syntax tree, the following properties and methods of the IList interface are used:- treeNode.ChildCount - the number of children at the treeNode node
- treeNode.GetChild (i) - getting a child from the treeNode node i
- treeNode.Type - the token of this node. It is of type int and must be compared with constants declared in the lexer class. For example: treeNode.Type == CoolGrammarLexer.EQUAL (whether a given token is a '=' sign). I want to note that when traversing a tree, it is necessary to use the Type property, not Text, since adding numbers is faster than comparing strings. the likelihood of error is reduced, since all constants are automatically generated.
- treeNode.Line - token line number in code
- treeNode.CharPositionInLine - token position in the code from the beginning of the line
Conclusion
This article described how to describe grammar in ANTLRWorks, which is then used to generate C # lexers and parsers, and how to use a lexer and parser. The next article will tell you about processing semantics of a language (i.e. type checking, handling semantic errors), how to bypass each node of the syntax tree and generate CIL code for the corresponding syntactic construction, explaining CIL instructions, explaining how the virtual machine stack works and using the built-in .NET library System.Reflection.Emit to facilitate code generation. Of course, I will describe the pitfalls that I met on my way (In particular, creating a class hierarchy using System.Reflection.Emit). It will also describe how to make
Finally, I would like to ask one question to the habrasociety, the answer to which I could not find:
For C # lexer, I always generate such code to exclude NoViableAltException, and this throw cannot be intercepted in any way:
catch (NoViableAltException nvae) {
DebugRecognitionException(nvae);
throw;
}
How can I write a grammar file so that in this line there is not just a throw, but for example my exception is raised or does the lexer continue to work? And you have to correct this line every time C # code is generated. Note: manipulations with @rulecatch in ANTLR did not lead to anything. It would be nice if someone figured out this problem and wrote the answer in the comments.
Recommended reading
- Definite ANTLR Reference, Terence Parr, 2007.
- Compilers Principles, Technologies and Tools, 2008, Alfred W. Aho, Monica S. Lam, Ravi Seti, Jeffrey D. Ullman