My first LLVM compiler

Original author: Wilfred Hughes
  • Transfer
  • Tutorial
This tutorial is about writing the simplest compiler in LLVM. No preliminary preparation is required.



The input language of our compiler will be BF. This is a classic “toy” language for compilers, and there is even a BF compiler in the LLVM examples ! In this post I will give the compiler writing process with explanations.

The first LLVM program


The LLVM framework is built around the LLVM IR programming language . First, we will write the simplest LLVM IR program as possible.

Since LLVM already uses LLVM IR, we can simply use clang to see what a simple program looks like. Take a simple C program

int main() {
    return 42;
}

Pass it through clang:

# -O3 ensures we discard any unnecessary instructions.
$ clang -S -emit-llvm -O3 forty_two.c

We get the file forty_two.ll, which looks like this:

define i32 @main() {
  ret i32 42
}

Our first LLVM IR program! We can use lli to run it:

$ lli forty_two.ll 
$ echo $?
42

Writing a skeleton


We will compile the BF commands into the LLVM IR instruction sequence. However, these instructions must be placed inside the main function so that the LLVM knows where the entry point is. We will also need to allocate and initialize the memory cells and cell index.

And again, we just write the equivalent in C and look at the LLVM IR code that Clang will generate. Here is the skeleton of the program that we will use:

declare i8* @calloc(i32, i32)
declare void @free(i8*)
define i32 @main() {
  ; Allocate 30,000 cells on the heap.
  %cells = call i8* @calloc(i32 30000, i32 1)
  ; Allocate a stack variable to track the cell index.
  %cell_index_ptr = alloca i32
  ; Initialise it to zero.
  store i32 0, i32* %cell_index_ptr
  ;;;;;;;;;;;;;;;;;;;;
  ; Our BF code will go here!
  ;;;;;;;;;;;;;;;;;;;;
  ; Free the memory for the cells.
  call void @free(i8* %cells)
  ret i32 0
}

Manually compiling>


">" Is the easiest BF command to compile. Open a text editor and write its equivalent on LLVM IR.

If your BF knowledge is a little rusty, ">" simply increments the cell index.

%cell_index = load i32* %cell_index_ptr
%new_cell_index = add i32 1, %cell_index
store i32 %new_cell_index, i32* %cell_index_ptr

To verify that the code is correct, run it. Just insert it into our skeleton , run it in lli, and see what happened. The implementation of ">" is straightforward, and we will also write a program for "<".

Manually compiling +


The BF "+" command increments the current cell. We are required to dereference the cell pointer, calculate the new value and save it. In C, it looks like this:

char *cell_ptr = cells + cell_index;
char current_value = *cell_ptr;
char new_value = current_value + 1;
*cell_ptr = new_value;

LLVM uses the getelementptr statement to calculate the pointer. The code will look like this:

%cell_index = load i32* %cell_index_ptr
%cell_ptr = getelementptr i8* %cells, i32 %cell_index
%current_value = load i8* %cell_ptr
%new_value = add i8 %current_value, 1
store i8 %new_value, i8* %cell_ptr

Again we test this by putting programs in our skeleton, and we do the same for "-".

Input Output


BF has two I / O commands: "," which reads from stdin into a cell, and "." which writes from a cell to stdout. We need to call the C functions for this: putchar and getchar .

We need to declare these functions, as we did before for malloc :

declare i32 @putchar(i32)
declare i32 @getchar()

To implement the command, we call getchar , trim the result to char, and write to the current cell.

%cell_index = load i32* %cell_index_ptr
%cell_ptr = getelementptr i8* %cells, i32 %cell_index
%input_int = call i32 @getchar()
%input_byte = trunc i32 %input_int to i8
store i8 %input_byte, i8* %cell_ptr

"." it works in the reverse order: read from the cell, make a signed extension, and call putchar .

%cell_index = load i32* %cell_index_ptr
%cell_ptr = getelementptr i8* %cells, i32 %cell_index
%current_cell = load i8* %cell_ptr
%current_cell_word = sext i8 %current_cell to i32
call i32 @putchar(i32 %current_cell_word)

Cycles


LLVM IR instructions are organized into base units; sequence of instructions without transitions inside the block. At the end of each base unit, you must either switch to another base unit or complete a return.

To compile "[x] y", we need to check the current cell, then either go to x , our loop body, or to y . At the end of x , we need to go to the beginning.

loop_header:
  %cell_index = load i32* %cell_index_ptr
  %cell_ptr = getelementptr i8* %cells, i32 %cell_index
  %cell_value = load i8* %cell_ptr
  %is_zero = icmp eq i8 %cell_value, 0
  br i1 %is_zero, label %loop_after, label %loop_body
loop_body:
  ; x
  br label %loop_header
loop_after:
  ; y

Note that there is recursion, x may also contain loops. An example program is here .

Putting it all together


We did the hardest! We are now using the LLVM API to generate instructions. Each IR instruction has a corresponding C ++ object that you can add to your base unit.

The LLVM API also has a convenient IRBuilder concept . The IRBuilder class provides methods for creating all IR instructions, making IR generation simple.

Here is the C ++ code to generate the LLVM IR for ">". The excellent LLVM tutorial includes instructions for compiling a simple C ++ program using the LLVM API.

BasicBlock *BB;
// Instantiate an IRBuilder that appends to our
// current basic block.
IRBuilder<> Builder(getGlobalContext());
Builder.SetInsertPoint(BB);
// We want to increment by 1, but since cell_index is
// 32-bit, our constant must be 32-bit too.
Value *IncrementAmount =
    ConstantInt::get(getGlobalContext(), APInt(32, 1));
// Emit the load, add and store instructions.
Value *CellIndex = Builder.CreateLoad(CellIndexPtr, "cell_index");
Value *NewCellIndex =
    Builder.CreateAdd(CellIndex, IncrementAmount, "new_cell_index");
Builder.CreateStore(NewCellIndex, CellIndexPtr);

Compiling other BF commands is a simple translation of our hand-written snippets. You can see the fully working compiler here .

Machine code generation


Our compiler generates only LLVM IR. Real compilers generate machine code. We use llc to convert to an object file, and then link to get an executable file.

$ ./compiler hello_world.bf
$ llc -filetype=obj hello_world.ll
$ gcc hello_world.o -o hello_world
$ ./hello_world
Hello World!

That's all!

Application: Optimization


You can do optimizations to get faster executables from BF programs. However, LLVM is smart enough to compile a simple, loop-free BF program into an optimal LLVM IR view!

$ cat exclamation.bf 
+++++ +++++
+++++ +++++
+++++ +++++
+++ ASCII 33 is '!'
. Write ! to stdout
$ ./compiler exclamation.bf 
$ opt -S -O3 exclamation.ll -o optimised_exclamation.ll

We get:

define i32 @main() {
entry:
  %0 = tail call i32 @putchar(i32 33)
  ret i32 0
}

Also popular now: