We write our programming language, part 3: Translator Architecture. Parsing language structures and mathematical expressions

  • Tutorial


Greetings to you, interested reading developers in no matter what languages ​​in which I orient these articles and whose support and opinions I value.

To begin with, according to established traditions, I will provide links to previous articles:

Part 1: write a language VM
Part 2: intermediate presentation of programs

In order to form in your head a complete understanding of what we write in these articles, you should familiarize yourself with the previous parts.

Also, I should post at once a link to an article about a project that was written by me earlier and on the basis of which all this debriefing is taking place: Klats syudy . With him, perhaps it is worth reading the first thing.

And a little about the project:

A small project site
GitHub repository

Well, I’ll also say right away that everything is written in Object Pascal, namely, on FPC.

So, let's begin.

The principle of operation of most translators

First of all, it should be understood that I could not write anything worthwhile without first reading a bunch of theoretical material and a number of stateek. I will describe the main thing in a couple of words.

The first task of the translator is to prepare the code for the analysis (for example, delete comments from it) and break the code into tokens (the token is the minimum character set for the language).

Next, by analyzing and transforming, you need to parse the code into some kind of intermediate representation and then assemble the application ready for execution or ... What is there to do in general?

Yes, I didn’t really say anything with this bunch of text, but now the task is divided into several subtasks.

Let's skip how the code is prepared for execution, since It is too boring to describe the process. Suppose we have a set of tokens ready for analysis.

Code analysis

You may have heard about building a code tree and analyzing it or even more abstruse things. As always - this is nothing more than a clutter in simple, scary terms. By code analysis, I mean a much simpler set of actions. The task is to go through the list of tokens and parse the code, each of its construction.

As a rule, in imperative languages, the code is already represented as a kind of tree from constructions.

Agree, it is not acceptable to start the cycle “A” in the body of the cycle “B”, and to finish - outside the body of the cycle “B”. The code is a structure consisting of a set of structures.
And what is each design? That's right - the beginning and the end (and maybe something else in the middle, but not the essence).

Accordingly, the analysis of the code can be made single pass, plainly without building a tree.
To do this, you need a cycle that will run through the code and a huge switch-case that will perform the basic code analysis and analysis.

Those. we run by tokens, we have a token (for example, let it be ...) “if” - I really doubt that such a token can stand in the code just like that -> this is the beginning of the if..then [.. else] .. end!

Perform analysis of all subsequent tokens, appropriately for the construction of conditions in our language.

Some bugs in the code

At the stage of parsing structures and mileage code, error handling is better not to score. This is useful translator functionality. If an error occurs during the parsing of a structure, then it is logical that the structure is not built properly and the developer should be notified about this.

Now about Mash. How is the analysis of language constructs?

Above, I described a generalized concept of a translator device. Now it's time to talk about my work.

In fact, the translator turned out to be very similar to the one described above. But for me it does not break the code into a bunch of tokens for further analysis.

Before parsing, the code is presented in a more beautiful view. Comments are deleted and all constructs are connected in long lines if they are described in several lines.

Thus, in each separate line there is a language construction or its part. This is great, now we can parse each line in our large switch-case, instead of searching for these constructions in a set of tokens. Also, the advantage here is that the line has an end and it means that it is easier to identify errors in the construction with this approach.

Accordingly, the analysis of individual structures occurs by separate methods, which return an intermediate representation of the code of structures or its parts.

Ps In the previous article, I described building a translator from an intermediate language to bytecode for a VM. Actually - this intermediate language is an intermediate representation.

It should be understood that structures can consist of several simpler structures. Since we have a parsing of each construction described by separate methods, then we can call them from each other without any problems when parsing each construction.


Warm-up run by code

For starters, the translator should quickly get acquainted with the code, run through it and pay attention to some of the structures.

At this stage, you can deal with global variables, uses constructs, as well as imports, procedures & functions, and OOP constructs.

It is better to generate the intermediate representation in several objects for storage, so that the
code for global variables is inserted after initialization, but before starting execution of main ().

The code for OOP constructs can be inserted at the end.

Complex designs

Ok, with simple designs, we figured out. Now it's time to be tricky. I do not think that you managed to forget the example with two cycles. As we know, designs usually come in the form of a certain tree. This means that we can disassemble complex structures using the stack.

What does the stack have to do with it? Besides.

We first describe the class that will be pushed onto the stack. When parsing complex constructs, we can generate an intermediate representation for the beginning and end of this block, for example, the for, while, until loop, if constructs, methods, and indeed everything in the Mash language are parsed.

This class needs fields for storing intermediate representations, meta information (for some variable constructions) and, of course, for storing the type of block.

Just give the whole code, because there is not a lot of it here:

unit u_prep_codeblock;
{$mode objfpc}{$H+}interfaceuses
  Classes, SysUtils;
  TBlockEntryType = (btProc, btFunc, btIf, btFor, btWhile, btUntil, btTry, btClass,
                     btSwitch, btCase);
  TCodeBlock = class(TObject)
    bType: TBlockEntryType;
    mName, bMeta, bMCode, bEndCode: string;
    constructorCreate(bt: TBlockEntryType; MT, MC, EC: string);end;
implementationconstructorTCodeBlock.Create(bt: TBlockEntryType; MT, MC, EC: string);begininherited Create;
  bType := bt;
  bMeta := MT;
  bMCode := MC;
  bEndCode := EC;

Well, the stack is a simple TList, it’s just silly to reinvent the wheel.

Thus, parsing the construction, say the same while loop, looks like this:

functionParseWhile(s: string; varmgr: TVarManager):string;
  WhileNum, ExprCode: string;
  Delete(s, 1, 5);         //"while"
  Delete(s, Length(s), 1); //":" 
  s := Trim(s);
  //Циклов в коде много, а в промежуточном представлении структуры кода нету...//Нужно точки входа все же отличать как то :)
  WhileNum := '__gen_while_' + IntToStr(WhileBlCounter);
  Inc(WhileBlCounter); //Номер очередной конструкции while в коде//Метод проверяет, является ли строка логическим или математическим выражениемif IsExpr(s) then
    ExprCode := PreprocessExpression(s, varmgr)
    ExprCode := PushIt(s, varmgr);
  //Теперь в ExprCode лежит промежуточное представление выражения//Результат после выполнения будет лежать в вершине стека//Проверка условия перед итерацией//(промежуточное представление ставится на место начала конструкции)
  Result := WhileNum + ':' + sLineBreak + 'pushcp ' + WhileNum + '_end' +
    sLineBreak + ExprCode + sLineBreak + 'jz' +
    sLineBreak + 'pop';
  //Ну и соответственно генерация кода на завершение конструкции//Последний аргумент - название точки входа на выход из цикла//нужен для генератора break
  BlockStack.Add(TCodeBlock.Create(btWhile, '', 'pushcp ' + WhileNum +
    sLineBreak + 'jp' + sLineBreak + WhileNum +
    '_end:', WhileNum + '_end'));

About math expressions

Perhaps you might have overlooked this, but mathematical / logical expressions are also structured code.

I implemented their analysis in the stack way. First, all the individual elements of the expression go on the stack, then the intermediate representation code is generated in several passes.

Several times - because There are priority math operations like multiplication.
I see no reason to bring the code here, since its a lot and it's boring.

Ps /lang/u_prep_expressions.pas - here it is fully and completely put on your display.


So, we have implemented a translator that can convert ... For example, this is the code:

proc PrintArr(arr):
  for(i ?= 0; i < len(arr); i++):
    PrintLn("arr[", i, "] = ", arr[i])
proc main():
  var arr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

What is missing our language? That's right, support the PLO. We will discuss this in my next article.

Thank you for reading to the end, if you did it.

If you do not understand something, then I am waiting for your comments. Or questions on the forum , yes, yes ... I sometimes check it.

And now a small survey (so that I looked at it and rejoiced at the importance of my articles):

Only registered users can participate in the survey. Sign in , please.

Was this article of interest to you?

Also popular now: