OdobenusRosmarus August 25, 2010 at 11:08

Compilation. 5 and 1/2: llvm as back-end

In a series of articles by tyomitch “Compilation” ( here , here , here , here , here and here ), the construction of the jsk toy language translator described in 4 parts was considered .
As a back-end for this translator, tyomitch proposed a bytecode implementation and an interpreter for this bytecode.

In my opinion, a more reasonable approach would be to use existing solutions for the backend, for example llvm, and following the principle of “Criticism without concrete sentences - criticism”, I propose an implementation option for this small jsk language with llvm.

What will this give for jsk? This compilation, that is, the result will be an executable file that does not depend on any runtime, the possibility of serious optimization, code profiling, and we automatically get back-end documentation (which will facilitate maintenance).

How will llvm work from the point of view of the front-end developer?

For llvm to work, you need to create context, use it to create a module. Each module can have several functions. Since jsk does not support functions, we will create one function - main ().
Each function can have several blocks of code.

What is a code block

Often programs create the need to transfer control forward to a label that is not yet defined. libjit allows you to create and use a label first, and define it later, Gnu llightning allows you to write control transfers to code, and later patch this code to substitute the address you want, and llvm uses a completely different approach - code blocks.

That is, the llvm API does not know such a thing as a label at all. If you need to transfer control, a link to the code block is passed to the command. For example, for a construction In the LLVM API, you need to create code blocks for the then branch, a code block for the else branch, and specify these blocks in the conditional branch command. Something like that. For the actual code generation, IRBuilder is used, which contains (roughly speaking) methods for generating all LLVM commands.

 if условие then какой то код else код endif

BasicBlock *ThenBB = BasicBlock::Create(context, "");

 BasicBlock *ElseBB = BasicBlock::Create(context, "");

 BasicBlock *AfterIfBB = BasicBlock::Create(context, "");

 Builder.CreateCondBr(CondV, ThenBB, ElseBB);


И после этого заполнить кодом ThenBB, ElseBB, завершив и тот и другой безусловным переходом на AfterIfBB.

Once you have created all the code, what's next?

This is at our discretion. We can simply print the received bytecode in readable form using the dump () method of the llvm module, execute the code (that is, use llvm as a JIT compiler), or, more correctly for our case, save the bytecode in a file. This will allow us to further optimize this byte code and convert it to an executable file using standard llvm tools.

So,

Goals are defined, tasks are set, comrades work (s)

I can’t put all the code here, if anyone wants to play with the sources, the archive is available here or here .
First of all, we need to include certain headers in the source text.

#include "llvm / DerivedTypes.h"
#include "llvm / LLVMContext.h"
#include "llvm / Module.h"
#include "llvm / Analysis / Verifier.h"
#include "llvm / Support / IRBuilder.h"
# include
#include
using namespace llvm;

That should be enough. Next, we initialize the context, the module, create the main function and IRBuilder to add code to the function

In the declaration:

static Module * TheModule;
static IRBuilder <> Builder (getGlobalContext ());

and at the beginning of the code is the following:

  LLVMContext & Context = getGlobalContext ();
  TheModule = new Module ("jsk", Context);
  const Type * voidType = Type :: getVoidTy (Context);

  func_main = cast(TheModule-> getOrInsertFunction ("main", voidType, NULL));
  func_main-> setCallingConv (CallingConv :: C);
  BasicBlock * block = BasicBlock :: Create (getGlobalContext (), "code", func_main);
  Builder.SetInsertPoint (block);

After generating the code, we generate ret void to exit our dummy function and dump the bytecode into the a.out.bc file

  Builder.CreateRetVoid ();

  std :: string ErrStr;
  raw_fd_ostream bitcode ("a.out.bc", ErrStr, 0);
  WriteBitcodeToFile (TheModule, bitcode);
  bitcode.close ();

Now about generating code for the specific operation of our jsk language

Variables

For each variable, we reserve a place on the stack using the

%varname = alloca i32

Or instruction , in terms of the API.

Builder.CreateAlloca(IntegerType::get(context,32), 0, VarName.c_str());

The problem is that we cannot allocate a place on the stack exactly where we met the variable for the first time. Because if a variable occurs in the middle of a while loop, we will spoil the entire stack with copies of the variable.

Therefore, we need to have a list of already defined variables, and when we meet a variable in the text, check, and if it occurs the first time, place the memory allocation for it at the very top of the function code. That is, take the first block from the current function, and add the alloca command there on top of this block. Fortunately, llvm allows you to write commands not only at the end, but also at the beginning of the block. How this is done, you can see in the source code - the CreateEntryBlockAlloca () function;

Assigning a variable to a value Accordingly, get the value of a variable

 Value *result = value->emit();

 Builder.CreateStore(result,varreference);


т.е.


store i32 %result, i32* %varreference



return Builder.CreateLoad(varref);

 то есть

%val = load i32* %varref

Binary and unary operations.

It's simple here

switch (op) {
      case '+': return Builder.CreateAdd (L, R);
      case '-': return Builder.CreateSub (L, R);
      case '*': return Builder.CreateMul (L, R);
      case '/': return Builder.CreateSDiv (L, R);
      case '<': tmp = Builder.CreateICmpSLT (L, R); return Builder.CreateZExt (tmp, IntegerType :: get (getGlobalContext (), 32));
      case '>': tmp = Builder.CreateICmpSGT (L, R); return Builder.CreateZExt (tmp, IntegerType :: get (getGlobalContext (), 32));
      case 'L': tmp = Builder.CreateICmpSLE (L, R); return Builder.CreateZExt (tmp, IntegerType :: get (getGlobalContext (), 32));
      case 'G': tmp = Builder.CreateICmpSGE (L, R); return Builder.CreateZExt (tmp, IntegerType ::
      case 'N': tmp = Builder.CreateICmpNE (L, R); return Builder.CreateZExt (tmp, IntegerType :: get (getGlobalContext (), 32));
      case '=': tmp = Builder.CreateICmpEQ (L, R); return Builder.CreateZExt (tmp, IntegerType :: get (getGlobalContext (), 32));

      default: return ErrorV ("invalid binary operator");
      }

IF and WHILE

As already described, we use code blocks. For example for IF:

     Value * CondV = cond-> emit ();
      // Compare the condition expression with 0
      CondV = Builder.CreateICmpNE (CondV, zero, "ifcond");
      // Code blocks for IF branches
      BasicBlock * ThenBB = BasicBlock :: Create (getGlobalContext (), "thenblock");
      BasicBlock * ElseBB = BasicBlock :: Create (getGlobalContext (), "elseblock");
      BasicBlock * MergeBB = BasicBlock :: Create (getGlobalContext (), "afterifblock");
      // Actually branching for IF
      Builder.CreateCondBr (CondV, ThenBB, ElseBB);

      // Next, add the code to the THEN
      Builder.SetInsertPoint (ThenBB) branch ;
      this-> thenops.emit ();
      // At the end of THEN switch to "after IF"

      TheFunction-> getBasicBlockList (). Push_back (ElseBB);
      // Next, add code to the century ELSE
      Builder.SetInsertPoint (ElseBB);
      this-> elseops.emit ();
      // At the end of else, go to "After IF"
      Builder.CreateBr (MergeBB);

      TheFunction-> getBasicBlockList (). Push_back (MergeBB);
      // After IF, add code only to MergeBB
      Builder.SetInsertPoint (MergeBB);
      return zero;

For WHILE:

      BasicBlock * CondBB = BasicBlock :: Create (getGlobalContext (), "whilexpr", TheFunction);
      BasicBlock * LoopBB = BasicBlock :: Create (getGlobalContext (), "loop");
      BasicBlock * AfterBB = BasicBlock :: Create (getGlobalContext (), "after");
      Builder.CreateBr (CondBB);

      // Condition block. Add the code here. we generate a condition code, Compare the result with 0
      // And if the condition is not met, go to AfterBB
      Builder.SetInsertPoint (CondBB);
      Value * CondV = cond-> emit ();
      if (CondV == 0) return 0;
      // Convert condition to a bool by comparing equal to 0.
      CondV = Builder.CreateICmpNE (CondV, zero, "whilecond");
      Builder.CreateCondBr (CondV,

      // Body block while. We write the body code here, then go to the WHILE condition
      TheFunction-> getBasicBlockList (). Push_back (LoopBB);
      Builder.SetInsertPoint (LoopBB);
      Value * Ops = this-> ops.emit ();
      Builder.CreateBr (CondBB);

      // Block 'After WHILE', At the moment, we just say
      // that all the following code should be written to this block
      TheFunction-> getBasicBlockList (). Push_back (AfterBB);
      Builder.SetInsertPoint (AfterBB);

Finally, we can try our compiler

#make

bison -d jsk.y

flex jsk.lex

g++ -O2 jsk.tab.c lex.yy.c `llvm-config --cxxflags --libs` -lrt -ldl -o jskc 

jsk.tab.c: In function ‘int yyparse()’:

jsk.tab.c:2026: warning: deprecated conversion from string constant to ‘char*’

jsk.tab.c:2141: warning: deprecated conversion from string constant to ‘char*’

jsk.lex: In function ‘int yylex()’:

jsk.lex:34: warning: deprecated conversion from string constant to ‘char*’

jsk.lex:35: warning: deprecated conversion from string constant to ‘char*’

jsk.lex:39: warning: deprecated conversion from string constant to ‘char*’

So. The compiler is ready. Let's try it on some test task. For example on Weigh: ./jskc

a=b=88;

b=b+1;

echo("test4=",a," ",b,"\n");

And we get a.out.bc in the current directory. We can disassemble it:

llvm-dis

; ModuleID = ''


@.format1 = internal constant [3 x i8] c"%d\00" ; <[3 x i8]*> [#uses=1]

@.format2 = internal constant [3 x i8] c"%s\00" ; <[3 x i8]*> [#uses=1]

@0 = internal constant [7 x i8] c"test4=\00" ; <[7 x i8]*> [#uses=1]

@1 = internal constant [2 x i8] c" \00" ; <[2 x i8]*> [#uses=1]

@2 = internal constant [2 x i8] c"\0A\00" ; <[2 x i8]*> [#uses=1]


define void @main() {

code:

 %a = alloca i32 ;  [#uses=2]

 %b = alloca i32 ;  [#uses=4]

 %int_for_scanf___ = alloca i32 ;  [#uses=0]

 store i32 88, i32* %b

 store i32 88, i32* %a

 %0 = load i32* %b ;  [#uses=1]

 %1 = add i32 %0, 1 ;  [#uses=1]

 store i32 %1, i32* %b

 call void (i8*, ...)* @printf(i8* getelementptr inbounds ([3 x i8]* @.format2, i32 0, i32 0), i8* getelementptr inbounds ([7 x i8]* @0, i32 0, i32 0))

 %2 = load i32* %a ;  [#uses=1]

 call void (i8*, ...)* @printf(i8* getelementptr inbounds ([3 x i8]* @.format1, i32 0, i32 0), i32 %2)

 call void (i8*, ...)* @printf(i8* getelementptr inbounds ([3 x i8]* @.format2, i32 0, i32 0), i8* getelementptr inbounds ([2 x i8]* @1, i32 0, i32 0))

 %3 = load i32* %b ;  [#uses=1]

 call void (i8*, ...)* @printf(i8* getelementptr inbounds ([3 x i8]* @.format1, i32 0, i32 0), i32 %3)

 call void (i8*, ...)* @printf(i8* getelementptr inbounds ([3 x i8]* @.format2, i32 0, i32 0), i8* getelementptr inbounds ([2 x i8]* @2, i32 0, i32 0))

 ret void

}


declare void @printf(i8*, ...)


declare void @scanf(i8*, ...)


declare void @exit(i32)



Мы можем интерпретировать при помощи lli, или (что нас интересует) - странслировать его в выполняемый файл:


$ llvmc a.out.bc 


Получаем (УРА!) a.out, первый в мире бинарник, скомпилированный из программы на языке JSK:


$ ls -al ./a.out

-rwxr-xr-x 1 walrus walrus 4639 2010-08-25 16:49 ./a.out


$ file a.out

a.out: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, not stripped


$ ldd a.out

 linux-gate.so.1 => (0x00762000)

 libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0x00456000)

 /lib/ld-linux.so.2 (0x00ee4000)



Запускаем, 
$ ./a.out

test4=88 89



Что нам и требовалось.

Tags: