We write our programming language, part 2: intermediate presentation of programs

Tutorial

Introduction

Greetings to all who looked to read my next article.

I repeat, I describe the creation of a programming language, based on the work carried out earlier, the results of which are described in this post .

In the first part (link: habr.com/post/435202 ) I described the steps of designing and writing a language VM that will execute our future applications in our future language.
In this article, I plan to describe the main stages of creating an intermediate programming language that will be assembled into abstract bytecode for direct execution on our VM.

I think that it would not hurt to immediately provide links to the project site and its repository.

Site
repository

I’ll say right away that all the code is written in FPC and I’ll give examples on it.

So let's start our enlightenment.

Why did we give up intermediate PP?

It is worth understanding that converting a program from a high-level language right away with an executable bytecode, which consists of a limited set of instructions, is so trivial that it is better to simplify it by an order of magnitude by adding an intermediate language to the project. It is much better to simplify the code gradually, than to immediately represent mathematical expressions, structures and classes with a set of opcodes. By the way, this is the way most third-party translators and compilers work.

In my previous article I wrote about how you can implement a language VM. Now we need to implement an assembler-like PL for it and a functional for the further writing of a translator. At these stages, we lay the foundation for the future project. It should be understood, the better the foundation - the steeper the building.

We take the first step to the realization of this miracle

To get started is to set a goal. What are we actually going to write? What characteristics should the final code have and what should it do?

I can form a list of the main functional parts of which this part of the project should consist:

Simple assembler. Converts simple instructions into a set of opcodes for VM.
The basic implementation of the functional for the implementation of variables.
Basic implementation of functionality for working with constants.
Functionality to support entry points to methods and the calculation of their addresses at the broadcasting stage.
Perhaps a couple of functional buns.

The illustration above shows a code fragment in an intermediate language, which is converted into code for a VM by a primitive translator, which will be discussed.

So, the goals are set, let's start implementation.

We write a simple assembler

We ask ourselves what is an assembler?

In essence, this is a program that performs the substitution of opcodes instead of their text descriptions.

Consider this code:

push 0
push 1
add
peek 2
pop

After processing the code by the assembler, we get the executable code for the VM.

We see that the instructions can be monosyllabic and disyllabic. More complex instructions for stack VMs are not required.

We need a code that could extract tokens from a string (we take into account that there may be strings among them).

We write it:

functionTk(s: string; w: word):string;
begin
  Result := '';
  while (length(s) > 0) and (w > 0) dobeginif s[1] = '"'thenbegin
      Delete(s, 1, 1);
      Result := copy(s, 1, pos('"', s) - 1);
      Delete(s, 1, pos('"', s));
      s := trim(s);
    endelseif Pos(' ', s) > 0thenbegin
      Result := copy(s, 1, pos(' ', s) - 1);
      Delete(s, 1, pos(' ', s));
      s := trim(s);
    endelsebegin
      Result := s;
      s := '';
    end;
    Dec(w);
  end;
end;

Ok, now you need to implement something like a switch-case construction for each operator and our simple assembler is ready.

Variables

Recall that our VM for variable support has an array of pointers and, accordingly, static addressing. This means that the functionality for working with variables can be represented as TStringList, in which strings are the names of variables, and their indices are their static addresses. It should be understood that duplication of variable names in this list is unacceptable. I think that you can imagine the necessary code and / or even write it yourself.

If you want to look at the finished implementation, then you are welcome: /lang/u_variables.pas

Constants

The principle here is the same as with variables, but there is one thing. In order to optimize, it is better to bind not to the names of constants, but to their values. Those. each constant value can have a TStringList, which will serve to store the names of constants with this value.
For constants, you should specify the data type and, accordingly, in order to add them to the language you will have to write a small parser.

Implementation: /lang/u_consts.pas

Entry points to methods

To implement code blocking, support various constructions, etc. should support this functionality at the assembly level.

Consider the sample code:

Summ:
  peek 0
  pop
  peek 1
  pop
  push 0
  new
  peek 2
  mov
  push 2
  push 0
  add
jr

The above is an example of a translation of the Summ method:

func Summ(a, b):
  return a + b
end

It should be understood that the opcodes for entry points are missing. What is the entry point to the Summ method? This prime number is the offset of the opcode following the entry point. (opcode offset is the opcode number relative to the start of the abstract bytecode being executed). Now we have a task - we need to calculate this offset at the compilation stage and, as an option, declare the Summ constant with this number.

We will write for this a certain counter of the weight of each operator. We have simple one-syllable operators, for example “pop”. They occupy 1 byte. There are more complicated ones, for example, “push 123” - they occupy 5 bytes, 1 for the opcode and 4 for the unsigned int type.

The essence of the code for adding support for entry points by the assembler:

We have a counter, say i = 0.
We run over the code, if we have a push 123 construction, we add 5 to it, if the simple opcode is 1. If we have an entry point, then we remove it from the code and declare the corresponding constant with the counter value and the name of the entry point.

Other functionality

This is, for example, a simple conversion of the code before processing.

Results

We implemented our small assembler. We will need it to implement a more complex translator on its basis. Now we can write small programs for our VM. Accordingly, further articles will describe the process of writing a more complex translator.

Thank you for reading to the end, if you did it.

If something is not clear to you, then I am waiting for your comments.

Only registered users can participate in the survey. Sign in , please.

Was this article of interest to you?

Tags: