We write our programming language, part 1: we write the language VM

  • Tutorial


Good day to all habrachitelemy!

So, perhaps it should be said that the purpose of my work, on the basis of which a number of articles will be written, was to go all the way to create a fully functional PL with 0 itself and then share my knowledge, experience and experience with interested people.

I will describe the creation of a language that I described earlier here .

He was interested in many and caused a heated discussion in the comments. Therefore - the topic is interesting to many.

I think that it is worthwhile to immediately post information about the project: the

Site (to be filled with documentation a little later).

To touch the project yourself and see everything in action, it is better to download the repository and launch everything from the bin folder. In the release, I'm not in a hurry to post the latest versions of the language and runtime, because I sometimes just lazy to do it.

I know how to code in C / C ++ and Object Pascal. I wrote the project on FPC, because In my opinion, this language is much simpler and better suited for writing this. The second determining factor was that FPC supports a huge number of target platforms and you can rebuild the project under the desired platform with a minimum of rework. If for some reason I don’t like Object Pascal, then do not rush to close the post and run to throw stones at the comments. This language is very beautiful and clear, and I will not give so much code. Just what you need.

So, perhaps I'll start my story.

Set goals

First of all, any project needs its goals and TK, which will have to be implemented in the future. You need to decide in advance what type of language will be created to write the primary VM for it.

The key points that determined the further development of my VM are as follows:

  • Dynamic typing and type conversion. I decided to support her at the development stage.
  • Multithreading support. I included this item in this list in advance in order to properly design the VM architecture and organize support for multithreading at the VM kernel level, and not later on with crutches.
  • Export external methods. Without this, language will be useless. Is that to embed it in any project.
  • The compilability of the language (in a single abstract executable file). Partially compiled or interpreted? Much depends on it.
  • The overall architecture of the VM. Stack or register will be our VM? I tried to implement both. Chose to support stack VM.
  • How do you see work with variables, arrays, structures? Personally, at that moment I wanted to implement a language in which almost everything is automatically tied up on implicit pointers, because such an approach would greatly save memory and simplify the life of the developer. If we allow the transfer of something large to the methods, then only a pointer to this large will be automatically transmitted.

So, I chose the above priorities and I started to implement a language virtual machine. Strangely, of course, usually parsers / translators are written first, and then VM. Well, I began to develop the project in this order and will describe it further in the order in which I developed it.

I’ll say right away that I called the VM most eloquently - SVM (Stack-based Virtual Machine).

Let's start with the implementation of the class variable

Initially, I just used the variant type, because it's easier and faster. It was a crutch, but he propped up the project and allowed me to quickly implement the first version of VM and language. Later I sat down at the code and wrote the implementation of my “variant”. In essence, you need to write a class that stores a pointer to a value in memory, in my implementation it null/cardinal/int64/double/string/array. It would be possible to use case typing, but I thought it would be better to implement it the way I implemented it.

Before starting to write class code, I decided to immediately throw the {$ H +} directive into the module header for more flexible support of strings in a future language.
Ps for those who may not be aware of the difference between the H- and H + FPC mode.

When assembling code in H-mode, strings will be represented as an array of characters. With H + - in the form of a pointer to a piece of memory. In the first case, the lines will be initially fixed length and limited to 256 characters by default. In the second case, the lines will be dynamically expandable and it will be possible to shove much more characters into them. Will work a little slower, but more functional. With H +, you can also declare strings as an array of characters, like this:

var s:string[256];
So, to begin with, we will declare the Enum type, which we will use as a kind of flag, to determine the data type by the pointer:

  TSVMType = (svmtNull, svmtWord, svmtInt, svmtReal, svmtStr, svmtArr);

Next, we describe the basic structure of our variable type and some methods:

  TSVMMem = class
    m_val: pointer;
    m_type: TSVMType;
  m_val := nil;
  m_type := svmtNull;
begincase m_type of
    svmtNull: { nop };
    svmtWord: Dispose(PCardinal(m_val));
    svmtInt:  Dispose(PInt64(m_val));
    svmtReal: Dispose(PDouble(m_val));
    svmtStr:  Dispose(PString(m_val));
    svmtArr:  begin
                SetLength(PMemArray(m_val)^, 0);

The class is not inherited from anything, so the inherited calls in the constructor and destructor can be avoided. Pay attention to the directive inline. In the header of the file it is better to add {$ inline on}, for sure. Its active use in VM rather significantly increased performance (mb somewhere as much as 15-20%!). It tells the compiler that it is better to embed the body of the method in place of its call. The output code will be slightly larger in the end, but it will work faster. In this case, using inline is advisable.

Ok, we wrote down the basis of our class at this stage. Now we need to describe a number of setters and getters (setter & getter) in our class.

The task is to write a couple of methods that allow you to stuff and later get the values ​​back from our class.

First, let's look at the assignment of values ​​for our class. First you can write a generalized setter, well, then, for individual data types:

procedureTSVMMem.SetV(const value; t:TSVMType);inline;
beginif (m_val <> nil) and (m_type = t) thenbegincase t of
       svmtWord: PCardinal(m_val)^ := Cardinal(value);
       svmtInt:  PInt64(m_val)^ := Int64(value);
       svmtReal: PDouble(m_val)^ := Double(value);
       svmtStr:  PString(m_val)^ := String(value);
   endelsebeginif m_val <> nilthen
     m_type := t;
     case t of
       svmtWord: begin
                   PCardinal(m_val)^ := Cardinal(value);
       svmtInt:  begin
                   PInt64(m_val)^ := Int64(value);
       svmtReal: begin
                   PDouble(m_val)^ := Double(value);
       svmtStr:  begin
                   PString(m_val)^ := String(value);
beginif (m_val <> nil) and (m_type = svmtWord) then
   PCardinal(m_val)^ := value
  elsebeginif m_val <> nilthen
     m_type := svmtWord;
     PCardinal(m_val)^ := value;

Now you can write a code for a pair of getters:

functionTSVMMem.GetW:cardinal; inline;
  Result := 0;
  case m_type of
    svmtWord: Result := PCardinal(m_val)^;
    svmtInt:  Result := PInt64(m_val)^;
    svmtReal: Result := Trunc(PDouble(m_val)^);
    svmtStr:  Result := StrToQWord(PString(m_val)^);

Ok, great, now, after you've spent some time staring at the IDE and typing the code of setters and getters enthusiastically, we are faced with the task of implementing support for our type of mathematical and logical operations. As an example, I will give the implementation of the operation of addition:

begincase m_type of
    svmtWord: case m.m_type of
                svmtWord: SetW(GetW             + m.GetW);
                svmtInt:  SetI(GetW             + m.GetI);
                svmtReal: SetD(GetW             + m.GetD);
                svmtStr:  SetD(GetW             + StrToFloat(m.GetS));
    svmtInt:  case m.m_type of
                svmtWord: SetI(GetI             + m.GetW);
                svmtInt:  SetI(GetI             + m.GetI);
                svmtReal: SetD(GetI             + m.GetD);
                svmtStr:  SetD(GetI             + StrToFloat(m.GetS));
    svmtReal: case m.m_type of
                svmtWord: SetD(GetD             + m.GetW);
                svmtInt:  SetD(GetD             + m.GetI);
                svmtReal: SetD(GetD             + m.GetD);
                svmtStr:  SetD(GetD             + StrToFloat(m.GetS));
    svmtStr:  case m.m_type of
                svmtWord: SetS(GetS             + IntToStr(m.GetW));
                svmtInt:  SetS(GetS             + IntToStr(m.GetI));
                svmtReal: SetS(GetS             + FloatToStr(m.GetD));
                svmtStr:  SetS(GetS             + m.GetS);

It's simple. Similarly, further operations can be described and our class is ready.
For arrays, of course, we need a couple of methods, an example of getting an element by index:

functionTSVMMem.ArrGet(index: cardinal; grabber:PGrabber): pointer; inline;
  Result := nil;
  case m_type of
    svmtArr: Result := PMemArray(m_val)^[index];
    svmtStr: begin
               Result := TSVMMem.CreateFW(Ord(PString(m_val)^[index]));

Super. Now we can move on.

We implement the stack

After a while I came to such thoughts. The stack must be both static (for speed) and dynamic (for flexibility) at the same time.

Therefore, the stack is block-implemented. Those. how it should work - initially the stack array has a certain size (I decided to set the block size to 256 elements so that it was beautiful and not small). Accordingly, complete with an array is a counter indicating the current top of the stack. Re-allocating memory is an extra long operation that can be performed less frequently. If more values ​​are added to the stack, then its size can always be expanded by the size of one more block.

I cite the implementation of the entire stack:

  TStack = objectpublic
    items: arrayof pointer;
    size, i_pos: cardinal;
    parent_vm: pointer;
    procedureinit(vm: pointer);procedurepush(p: pointer);functionpeek: pointer;
    procedurepop;functionpopv: pointer;
  PStack = ^TStack;
  procedureTStack.init(vm: pointer);begin
    SetLength(items, StackBlockSize);
    i_pos := 0;
    size := StackBlockSize;
    parent_vm := vm;
  procedureTStack.push(p: pointer);inline;
    items[i_pos] := p;
    if i_pos >= size thenbegin
       size := size + StackBlockSize;
       SetLength(items, size)
  functionTStack.peek: pointer; inline;
    Result := items[i_pos - 1];
    if size - i_pos > StackBlockSize thenbegin
       size := size - StackBlockSize;
       SetLength(items, size);
  functionTStack.popv: pointer; inline;
    Result := items[i_pos];
    if size - i_pos > StackBlockSize thenbegin
       size := size - StackBlockSize;
       SetLength(items, size);
    p: pointer;
    p := items[i_pos - 2];
    items[i_pos - 2] := items[i_pos - 1];
    items[i_pos - 1] := p;
    SetLength(items, StackBlockSize);
    size := StackBlockSize;
    i_pos := 0;

The external methods of the VM will pass a pointer to the stack so that they can take the necessary arguments from there. A pointer to the VM thread was added later so that callback calls from external methods could be implemented and, in general, to transfer more power over VM methods.

So, as with how the stack is arranged you have read. The callback stack is arranged in the same way, for simplicity and convenience of call & return operations and the garbage collector stack. The only thing - other block sizes.

Talk about trash

It is usually a lot, a lot. And you need to do something with it.

First of all, I want to talk about how garbage collectors are arranged in other languages, for example, in Lua, Ruby, Java, Perl, PHP, etc. They work on the principle of counting pointers to objects in memory.

Those. we allocated memory for something, it is logical - the pointer was immediately placed in a variable / array / somewhere else. The runtime garbage collector immediately adds this pointer to itself with a list of possible garbage objects. In the background, the garbage collector constantly monitors all variables, arrays, etc. If there does not appear to be a pointer to something from the list of possible garbage, then this is garbage and the memory from under it must be removed.

I decided to sell my bike. I am more accustomed to working with memory on the principle of Taras Bulba. I gave birth to you - I will kill you, I mean, when I call the next Free from the next class. Therefore, the garbage collector from my VM is semi-automatic. Those. It needs to be called manually and work with it accordingly. Pointers to the declared temporary objects are added to its turn (this role falls mostly on the translator and a little on the developer). To release memory from under other objects, you can use a separate opcode.

Those. at the time of the call, the garbage collector has a ready list of pointers to run through and free up memory.

So, now let's deal with the compilation into an abstract executable file.

The idea was originally that applications written in my language will be able to run without source, as is the case with many similar languages. Those. It can be used for commercial purposes.

To do this, determine the format of executable files. I got the following:

  1. Title, for example, "SVMEXE_CNS".
  2. Section containing the list of libraries from which methods will be imported.
  3. The import section of the required methods, the libraries from which the methods are imported are indicated by their number in the section above.
  4. Section constants.
  5. Code section

I don’t think it’s worthwhile to lay out the detailed steps for implementing parsers for these sections, because you can see everything in my repository yourself.

Code execution

After parsing the sections listed above and initializing the VM, we are left with one section with the code. In my VM, an unaligned bytecode is executed, i.e. instructions can be of arbitrary length.

A set of opcodes - instructions for a virtual machine with small comments I show in advance below:

  TComand = (
    {** for stack **}
    bcPH,     // [top] = [var]
    bcPK,     // [var] = [top]
    bcPP,     // pop
    bcSDP,    // stkdrop
    bcSWP,    // [top] <-> [top-1]{** jump's **}
    bcJP,     // jump [top]
    bcJZ,     // [top] == 0 ? jp [top-1]
    bcJN,     // [top] <> 0 ? jp [top-1]
    bcJC,     // jp [top] & push callback point as ip+1
    bcJR,     // jp to last callback point & rem last callback point{** for untyped's **}
    bcEQ,     // [top] == [top-1] ? [top] = 1 : [top] = 0
    bcBG,     // [top] >  [top-1] ? [top] = 1 : [top] = 0
    bcBE,     // [top] >= [top-1] ? [top] = 1 : [top] = 0
    bcNOT,    // [top] = ![top]
    bcAND,    // [top] = [top] and [top-1]
    bcOR,     // [top] = [top] or  [top-1]
    bcXOR,    // [top] = [top] xor [top-1]
    bcSHR,    // [top] = [top] shr [top-1]
    bcSHL,    // [top] = [top] shl [top-1]
    bcNEG,    // [top] = -[top]
    bcINC,    // [top]++
    bcDEC,    // [top]--
    bcADD,    // [top] = [top] + [top-1]
    bcSUB,    // [top] = [top] - [top-1]
    bcMUL,    // [top] = [top] * [top-1]
    bcDIV,    // [top] = [top] / [top-1]
    bcMOD,    // [top] = [top] % [top-1]
    bcIDIV,   // [top] = [top] \ [top-1]
    bcMV,     // [top]^ = [top-1]^
    bcMVBP,   // [top]^^ = [top-1]^
    bcGVBP,   // [top]^ = [top-1]^^
    bcMVP,    // [top]^ = [top-1]{** memory operation's **}
    bcMS,     // memory map size = [top]
    bcNW,     // [top] = @new
    bcMC,     // copy [top]
    bcMD,     // double [top]
    bcRM,     // rem @[top]
    bcNA,     // [top] = @new array[  [top]  ] of pointer
    bcTF,     // [top] = typeof( [top] )
    bcSF,     // [top] = sizeof( [top] ){** array's **}
    bcAL,     // length( [top] as array )
    bcSL,     // setlength( [top] as array, {stack} )
    bcPA,     // push ([top] as array)[top-1]
    bcSA,     // peek [top-2] -> ([top] as array)[top-1]{** memory grabber **}
    bcGPM,    // add pointer to TMem to grabber task-list
    bcGC,     // run grabber{** constant's **}
    bcPHC,    // push copy of const
    bcPHCP,   // push pointer to original const{** external call's **}
    bcPHEXMP, // push pointer to external method
    bcINV,    // call external method
    bcINVBP,  // call external method by pointer [top]{** for thread's **}
    bcPHN,    // push null
    bcCTHR,   // [top] = thread(method = [top], arg = [top+1]):id
    bcSTHR,   // suspendthread(id = [top])
    bcRTHR,   // resumethread(id = [top])
    bcTTHR,   // terminatethread(id = [top]){** for try..catch..finally block's **}
    bcTR,     // try @block_catch = [top], @block_end = [top+1]
    bcTRS,    // success exit from try/catch block
    bcTRR,    // raise exception, message = [top]{** for string's **}
    bcSTRD,     // strdel
    {** [!] directly memory operations **}
    bcALLC,  //alloc memory
    bcRALLC, //realloc memory
    bcDISP,  //dispose memory
    bcGTB,   //get byte
    bcSTB,   //set byte
    bcCBP,   //mem copy
    bcRWBP,  //read word
    bcWWBP,  //write word
    bcRIBP,  //read int
    bcWIBP,  //write int
    bcRFBP,  //read float
    bcWFBP,  //write float
    bcRSBP,  //read string
    bcWSBP,  //write string
    bcTHREXT,//stop code execution
    bcDBP    //debug method call

So, you are fluent in what operations a VM written by me can perform. Now I want to say how it all works.

VM is implemented as an object, so you can easily implement support for multithreading.

It has a pointer to an array with opcodes, IP (Instruction Pointer) is the offset of the instruction being executed and pointers to other VM structures.

Code execution is a big switch-case.

Just give a description of VM:

  TSVM = objectpublic
    ip, end_ip: TInstructionPointer;
    mainclasspath: string;
    mem: PMemory;
    stack: TStack;
    cbstack: TCallBackStack;
    bytes: PByteArr;
    grabber: TGrabber;
    consts: PConstSection;
    extern_methods: PImportSection;
    try_blocks: TTRBlocks;
    procedureRun;procedureRunThread;procedureLoadByteCodeFromFile(fn: string);procedureLoadByteCodeFromArray(b: TByteArr);end;

Little about exception handling

To do this, the VM has a stack of exception handlers and a large try / catch block in which the execution of the code is wrapped. With the stack, you can put a structure that has an entry point offset at the catch and finally / end of the exception handling block. I also provided a trs opcode that is placed before catch and flips the code to finally / end, if it succeeds, simultaneously removing a block with information about exception handlers from the top of the corresponding stack. Simply? Simply. Conveniently? Conveniently.

Talk about external methods and libraries.

I have already mentioned them before. Imports, libraries ... Without them, the language will not have the desired flexibility and functionality.

First of all, in the implementation of the VM, we will declare the type of the external method and the protocol for calling it.

  TExternalFunction = procedure(PStack: pointer);cdecl;
  PExternalFunction = ^TExternalFunction;

The parser of the import section fills, when initializing the VM, an array of pointers to external methods. Consequently, each method has a static address, which is calculated at the stage of building an application for VM and by which the required method can be called.

The call then goes on like this during the execution of the code:


Let's write a simple library for our VM

And let it implement the Sleep method for starters:

library bf;
{$mode objfpc}{$H+}uses SysUtils, svm_api in'..\svm_api.pas';
exports DSleep name'SLEEP';


At this point I will probably finish my first article from the conceived cycle.

Today I described in some detail the creation of the language runtime. I believe that this article will be very useful for people who decide to try to write their own programming language or deal with how such programming languages ​​work.

The full VM code is available in the repository, in the / runtime / svm branch.

If you liked this article, then do not be lazy to throw a plus into karma and pick it up in the top, I tried and I will try for you.

If something is not clear to you, then welcome to the comments or to the forum .

Perhaps your questions and answers to them will be interesting not only to you.

Also popular now: