Developing a compiler for the Cool educational language in C# for .NET (Part 1)

Introduction


Hello, dear Habr reader. I would like to present some material on the practical creation of a compiler that translates code written in the Cool language into CIL (Common Intermediate Language) code for the .NET platform. I decided, out of laziness to describe it all at once, to split this material into two parts.

The first part describes the process of writing a grammar in the ANTLR environment, taking operator precedence into account, as well as generating a lexer and a parser for C#. It also examines the pitfalls I encountered along the way. Thus, I will try to save at least someone some time (maybe my future self).

The second part will describe the process of building a semantic analyzer, code generation, and a home-grown code optimization that nobody needs. It will also describe how to make a nice interface, with blackjack and hookers: syntax highlighting and code folding, as in modern IDEs. At the end of the second part I will, of course, publish all the sources of my solution and talk about further improvements to the architecture and code, at least as they seem to me.

Warning: before developing this compiler I had studied practically no literature on the subject. Everything came down to reading several articles that interested me, watching the ANTLR video tutorials on the official website, and talking with knowledgeable classmates. So the developed software is far from ideal.

So that the reader understands me better, I will first give some definitions (from Wikipedia):
  • Token — a sequence of characters identified during lexical analysis that corresponds to a single lexical unit.
  • A lexer is a module for lexical analysis of an input sequence of characters (for example, the source code in some programming language) that produces a sequence of tokens as output (like grouping letters into words).
  • A parser is a module that matches a linear sequence of tokens (words, lexemes) of a language against its formal grammar. The result is a parse tree or a syntax tree.
On Habrahabr, and elsewhere, there are many articles explaining in detail how to write grammars for mathematical expressions and languages, so I will not dwell on setting up the environment. Nor will I delve into the theory of compiler construction, since there are a number of good articles on that subject as well, for example, this series.

So, the following programs and libraries were used to develop the compiler:
  1. ANTLRWorks — an IDE for writing grammars and generating lexer and parser code for various languages (C, C#, Java, JavaScript, Objective-C, Perl, Python, Ruby, etc.). Based on LL(*) parsing.
  2. ANTLR C# runtime (Antlr3.Runtime.dll), used by the generated lexer and parser.
  3. Visual Studio 2010 (in the next article we will also look at AvalonEdit, an advanced WPF text editor component with syntax highlighting and code completion).
  4. Java Runtime Environment (since ANTLRWorks is written in Java).
  5. IL Disassembler — will be used in the second part to understand how an existing compiler, such as the C# compiler, generates CIL code and what that code should look like.

Description of language Cool


Cool (Classroom Object-Oriented Language) is, as the name implies, an educational programming language (that nobody needs), featuring classes and objects. Cool is an expression-oriented, strongly typed language: type checking occurs statically at compile time. A program is essentially a collection of classes. Classes contain a set of fields and functions (here they are called features).
All fields are visible within the base and derived classes. All functions are visible from everywhere (public).
This language is built on expressions (expression). All functions can take expressions as arguments, and the bodies of all functions are expressions that return a result. Even functions such as string output return a result, namely an instance of the class on which they are called.
Cool also contains the built-in classes String, Int, and Bool, which cannot be inherited from and whose values cannot be null (here it is called void). Objects of these classes are passed by value, not by reference.
Keywords of this language: class, else, false, fi, if, in, inherits, isvoid, let, loop, pool, then, while, case, esac, new, of, not, true
I want to note that some constructs end not with a curly brace (as in C-like languages) or with the word end (as in Pascal), but with the name of the opening keyword spelled backwards (if → fi, loop → pool, case → esac).
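For illustration, a small sketch in Cool (the identifiers are made up):

```
-- 'fi' closes 'if', 'pool' closes 'loop'
if x < 0 then 0 - x else x fi;
while i < 10 loop
    i <- i + 1
pool;
```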
Now, if every construct returns some result, what should an if return? After all, it can yield either of two alternatives. The correct answer: a value of the closest common ancestor type of the two branches (the textbook describes this in a somewhat complicated way, using a special operator). Here is a picture for clarity:

Figure 1.

Here, for classes A and B, the closest common ancestor is class D.
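Continuing the example, a sketch in Cool (the class and variable names are assumed from Figure 1):

```
-- A and B both descend from D, so the static type of the whole if is D
if flag then new A else new B fi
```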

The case statement is similar to the if statement applied several times.

The while construct always returns void (unless, of course, the loop never terminates).

The block construct { expr1; ... ; exprn; } returns the value of exprn, i.e. of the last expression.
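A minimal sketch:

```
-- the block returns the value of its last expression, here 3
{ 1; 2; 3; }
```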

Functions here can be called not only on the class and its ancestor (virtual functions), but on any ancestor of the class, using the '@' operator. As an example, consider the class hierarchy in Figure 1. Suppose class E has a function func(), and all its descendants override it in their own way. Then, if we have an instance of A and want to call func from class D, we use the following syntax:
result <- (new A)@D.func()

The SELF_TYPE keyword is used as a synonym for the class in which the function is defined. The identifier self is a reference to the current object (i.e. this in other languages).

Local variables in Cool are introduced with the let operator and can only be used within the corresponding let block.
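A sketch of a let expression:

```
-- x and y exist only inside this let; the whole expression evaluates to 30
let x : Int <- 5, y : Int <- x + 1 in x * y
```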

void is an analogue of null, an indication that the object has not been created.

I think the remaining operators need no explanation; if something is unclear, you can study the manual for this dialect of the Cool language via the link at the end of the article.

Writing Cool Grammar in ANTLRWorks


So, the original grammar is given in the following form:

program ::= [[class; ]]+
class ::= class TYPE [inherits TYPE] { [[feature; ]]*}
feature ::= ID( [ formal [[, formal]]*] ) : TYPE { expr } | ID : TYPE [ <- expr ]
formal ::= ID : TYPE
expr ::=
ID <- expr
| expr[@TYPE].ID( [ expr [[, expr]]*] )
| ID( [ expr [[, expr]]*] )
| if expr then expr else expr fi
| while expr loop expr pool
| { [[expr; ]] +}
| let ID : TYPE [ <- expr ] [[, ID : TYPE [ <- expr ]]]* in expr
| case expr of [[ID : TYPE => expr; ]]+ esac
| new TYPE
| isvoid expr
| expr + expr
| expr - expr
| expr * expr
| expr / expr
| ~expr
| expr < expr
| expr <= expr
| expr = expr
| not expr
| (expr)
| ID
| integer
| string
| true
| false


Legend: [[ ]]* — iteration; [[ ]]+ — positive (one or more) iteration. So, what should be done after seeing such a grammar? Forget about it and about compilers forever — who needs yet another half-baked language? But seriously, if you rewrite this grammar in ANTLR as-is and generate the lexer and parser code, nothing good will come of it, for the following reasons:
  • The presence of left recursion. ANTLRWorks, based on the recursive descent method, will immediately detect it and produce an error of this kind: [fatal] rule compilation_unit has non-LL(*) decision due to recursive rule invocations reachable from alts ... Resolve by left-factoring or using syntactic predicates or using backtrack=true option ... Of course, ANTLR has the backtrack=true option, which overcomes this limitation. But by applying it you have to accept an extra node in the generated finite automaton for an expression with left recursion. So it is still better to get rid of it. Besides, as can be seen from this article, the option slows down the parser, and its author recommends avoiding it whenever possible.
  • But the main thing is that this form of the grammar does not take operator precedence into account. While left recursion affects only parser efficiency, ignoring this problem leads to incorrect code generation later (for example, 2 + 2 * 2 would evaluate to 8 instead of 6).
Reading the manual further, we find the operator precedences spelled out. They look as follows (highest at the top, lowest at the bottom):
  • .
  • @
  • ~
  • isvoid
  • * /
  • + -
  • <= < =
  • not
  • <-
So, we need to build our grammar with these precedences in mind. To do this, we follow this construction rule: for each operator of lower priority we create a new rule that contains the operator of this priority and, on the right-hand side, the non-terminal for the operator of the next higher priority. I know that sounds confusing, so here is the construction for our original grammar:
expr:
  (ID ASSIGN^)* not;
not:
  (NOT^)* relation;
relation:
  addition ((LE^ | LT^ | GE^ | GT^ | EQUAL^) addition)*;
addition:
  multiplication ((PLUS^ | MINUS^) multiplication)*;
multiplication:
  isvoid ((MULT^ | DIV^) isvoid)*;
isvoid:
  (ISVOID^)* neg;
neg:
  (NEG^)* dot;
dot:
  term ((ATSIGN typeName)? DOT ID LPAREN invokeExprs? RPAREN)? ->
  ^(Term term ID? typeName? invokeExprs?);

The term rule here covers all the remaining expressions not considered above, i.e. the expressions of equal (highest) priority.
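For example, with the rules above, the expression 2 + 2 * 2 is parsed into the following tree, so multiplication correctly binds tighter than addition:

```
^(PLUS 2 ^(MULT 2 2))   -- evaluates to 6, not 8
```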

Comments in the language are essentially sequences of characters that must be ignored during parsing. For this, ANTLR uses the construct
{$channel = Hidden;};
written on the right-hand side of each rule that describes comments. For instance:
COMMENT : '--' .* ('\n'|'\r') {$channel = Hidden;};
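Cool also has (* ... *) comments; a non-greedy ANTLR 3 rule of the same shape could look like this (a sketch: it sends the text to the hidden channel but does not handle nested comments, which the full Cool specification allows):

```
ML_COMMENT : '(*' ( options {greedy=false;} : . )* '*)' {$channel = Hidden;};
```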


ANTLR Operators


Next, I will describe the ANTLR operators used:
  1. ! — needed so that auxiliary language tokens do not clutter the tree: for example, parentheses and other tokens needed only by the parser (such as the closing fi and esac of syntax constructs). This operator ensures that an abstract syntax tree, rather than a full parse tree, is built from the token stream.
  2. -> — converts the sequence of tokens on the left-hand side into the tree structure on the right-hand side, which is convenient to process during code generation. The ^ operator is used together with it.
  3. ^ — creates a new node in the generated syntax tree from the token marked with this operator, with its neighbors as children. The operator has two forms:
    • Postfix. For example, for such a rule:
      relation: addition ((LE^ | LT^ | GE^ | GT^ | EQUAL^) addition)*;
      a node with one of the operators listed in parentheses is inserted into the syntax tree, with the addition to the left of the operator and the addition to the right as its children.
    • Prefix. For example, for such a rule:
      CLASS typeName (INHERITS typeName)? LCURLY featureList RCURLY -> ^(Class typeName featureList typeName?)
      a parent Class node is inserted into the syntax tree with children typeName, featureList, and an optional second typeName.

All the other ANTLR operators used need no comments (for a person who has studied discrete mathematics), and if they do, then only small ones:
  1. * - Iteration.
  2. + - Positive iteration.
  3. ? — the preceding token or group of tokens may be present or absent.
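To illustrate '!', '^', and '->' together (a sketch; the actual rule and token names in my grammar may differ), an if rule can drop the auxiliary keywords and build a node around an imaginary If token declared in the tokens {} section:

```
tokens { If; }

// with the rewrite operator: keywords are simply not mentioned on the right
ifExpr : IF expr THEN expr ELSE expr FI -> ^(If expr expr expr) ;

// a similar tree, with the real IF token as the root, using '^' and '!'
ifExpr2 : IF^ expr THEN! expr ELSE! expr FI! ;
```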
I want to note that when using the '->' and '^' operators, there is a need to introduce new tokens. This is done in the corresponding tokens {} section. By the way, I am in favor of keeping the amount of code tied to a particular lexer/parser target language (in our case, C#) in the .g file to a minimum. Nevertheless, such code still had to be used for the STRING token:

STRING: '"'
{ StringBuilder b = new StringBuilder(); }
( '"' '"' { b.Append('"');}
| c=~('"'|'\r'|'\n') { b.Append((char)c);}
)*
'"'
{ Text = b.ToString(); }
;


Here, StringBuilder is used to make string processing efficient. In principle, other insertions of such code are acceptable for optimizing the lexer, but not for all rules. To specify namespaces in the C# lexer and parser code (for example, System.Text for StringBuilder), the following constructs are used (some "unnecessary" warnings are also disabled here):

@lexer::header {#pragma warning disable 3021
using System;
using System.Text;
}
@parser::header {#pragma warning disable 3021
using System.Text;}


After the grammar compiles in ANTLRWorks, we need to generate the C# lexer and parser code (thanks, Cap).

ANTLRWorks with the given settings generates 3 files:
  • CoolGrammar.tokens
  • CoolGrammarLexer.cs
  • CoolGrammarParser.cs

Using the generated lexer in C # code


Getting a list of all tokens in the generated C# code looks not entirely trivial (although it is not always necessary):
var stream = new ANTLRStringStream(Source); // Source is the source code.
lexer = new CoolGrammarLexer(stream, new RecognizerSharedState() { errorRecovery = true });
IToken token;
token = lexer.NextToken();
while (token.Type != CoolGrammarLexer.EOF)
{
  // Process each token here.
  // Token name: CoolTokens.Dictionary[token.Type] (e.g. for '(' this is LPAREN;
  //  an auxiliary pre-built dictionary maps token types to token names)
  // Token text: token.Text (e.g. for '(' this is simply "(")
  // Token line number in the code: token.Line
  // Token position from the beginning of the line: token.Column
  token = lexer.NextToken(); // Read the next token until we reach the end of the file.
}
// Then reset the token stream and set the position back to 0
// so that the same token stream can be reused later (namely, in the parser).
lexer.Reset();

CoolTokens.Dictionary can be generated as follows:
var Lines = File.ReadAllLines(fileName);
Dictionary = new Dictionary<int, string>(Lines.Length);
foreach (var line in Lines)
{
  var parts = line.Split('=');
  if (!Dictionary.ContainsKey(int.Parse(parts[1])))
    Dictionary.Add(int.Parse(parts[1]), parts[0]);
}
fileName is the path to the CoolGrammar.tokens file.

Using the generated parser in C # code


var tokenStream = new CommonTokenStream(lexer); // obtaining the lexer was described earlier.
parser = new CoolGrammarParser(tokenStream);
Tree = parser.program(); // The topmost node, corresponding to the grammar axiom (i.e. program in our case).
To traverse the syntax tree, the following properties and methods of the ITree interface are used:
  • treeNode.ChildCount — the number of children of the treeNode node
  • treeNode.GetChild(i) — gets the i-th child of the treeNode node
  • treeNode.Type — the token type of this node. It is of type int and should be compared with the constants declared in the lexer class, for example: treeNode.Type == CoolGrammarLexer.EQUAL (is this token an '=' sign). Note that when traversing the tree you should use the Type property rather than Text, since comparing integers is faster than comparing strings; moreover, the likelihood of error is reduced, since all the constants are generated automatically.
  • treeNode.Line — the token's line number in the code
  • treeNode.CharPositionInLine — the token's position from the beginning of the line
I also want to draw special attention to the fact that when generating C# code from the grammar, you need to put the public modifier before the grammar axiom (in our case, before program) so that it can later be called from your code (otherwise it will be private).

Conclusion


This article described how to write a grammar in ANTLRWorks, generate a C# lexer and parser from it, and use them. The next article will cover processing the semantics of the language (i.e. type checking and handling semantic errors), how to visit each node of the syntax tree and generate CIL code for the corresponding syntactic construct, with explanations of CIL instructions and of how the virtual machine stack works, and the use of the built-in .NET library System.Reflection.Emit to simplify code generation. Of course, I will describe the pitfalls I encountered along the way (in particular, creating a class hierarchy using System.Reflection.Emit). It will also describe how to make a beautiful and modern interface! However, what is written in this conclusion was already partially covered in the introduction.

Finally, I would like to ask one question to the habrasociety, the answer to which I could not find:

For the C# lexer, code like the following is always generated for NoViableAltException, and this throw cannot be intercepted in any way:

catch (NoViableAltException nvae) {
  DebugRecognitionException(nvae);
  throw;
}
How can I write the grammar file so that this line does not simply rethrow but, for example, raises my own exception, or lets the lexer continue working? For now I have to correct this line every time the C# code is regenerated. Note: manipulations with @rulecatch in ANTLR did not lead to anything.

It would be nice if someone who has figured out this problem wrote the answer in the comments.

Recommended reading

  • The Definitive ANTLR Reference, Terence Parr, 2007.
  • Compilers: Principles, Techniques, and Tools, 2008, Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman.
  • The Cool language manual, its grammar file (.g extension), and the ANTLRWorks environment.
