Flying Pigs, or Optimizing Bytecode Interpreters

    "You can, however, make a faster pig" (a comment in the Emacs source code)

    Everyone knows that pigs can't fly. Just as popular is the opinion that bytecode interpreters, as a technique for executing high-level languages, cannot be sped up without resorting to time-consuming dynamic compilation.

    In this second part of the series on bytecode interpreters, I will use a small virtual machine called PigletVM ("Pig Virtual Machine") to show that not all is lost for hard-working piglets with ambitions, and that it is possible, within (mostly) standard C, to speed up such interpreters by at least a factor of one and a half.

    Part One, introductory
    Part Two, optimizing (current)
    Part Three, applied

    Piglet VM

    Let's get acquainted.

    Piglet VM is an ordinary stack machine based on the example from the first article in the series. It knows only one data type, the 64-bit machine word, and all (integer) calculations are performed on a stack with a maximum depth of 256 machine words. Besides the stack, the machine has a working memory of 65,536 machine words. The result of a program, a single machine word, can either be placed in the result register or simply written to standard output (stdout).

    All of the Piglet VM's state is stored in a single structure:

    static struct {
        /* Current instruction pointer */
        uint8_t *ip;

        /* Fixed-size stack */
        uint64_t stack[STACK_MAX];
        uint64_t *stack_top;

        /* Operational memory */
        uint64_t memory[MEMORY_SIZE];

        /* A single register containing the result */
        uint64_t result;
    } vm;

    All of this puts our machine among the low-level virtual machines, where almost all of the overhead falls on dispatching in the main interpreter loop:
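    The loop below leans on a few helper macros (NEXT_OP, NEXT_ARG, PUSH, POP, PEEK, TOS_PTR). Their exact PigletVM definitions are not shown here, so what follows is one plausible sketch, assuming 16-bit big-endian immediate arguments, with the state structure repeated so that the snippet is self-contained:

```c
#include <stdint.h>

#define STACK_MAX 256
#define MEMORY_SIZE 65536

static struct {
    uint8_t *ip;
    uint64_t stack[STACK_MAX];
    uint64_t *stack_top;
    uint64_t memory[MEMORY_SIZE];
    uint64_t result;
} vm;

/* Plausible definitions; the real PigletVM macros may differ in details
 * such as the byte order of immediate arguments. */
#define NEXT_OP()  (*vm.ip++)                 /* fetch the next opcode byte */
#define NEXT_ARG() (vm.ip += 2, (uint16_t)((vm.ip[-2] << 8) | vm.ip[-1]))
#define PEEK()     (*(vm.stack_top - 1))      /* read the top of the stack */
#define POP()      (*(--vm.stack_top))        /* pop the top of the stack */
#define PUSH(v)    (*vm.stack_top++ = (v))    /* push a value */
#define TOS_PTR()  (vm.stack_top - 1)         /* pointer to the top of the stack */
```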

    interpret_result vm_interpret(uint8_t *bytecode)
    {
        for (;;) {
            uint8_t instruction = NEXT_OP();
            switch (instruction) {
            case OP_PUSHI: {
                /* get the argument, push it onto stack */
                uint16_t arg = NEXT_ARG();
                PUSH(arg);
                break;
            }
            case OP_ADD: {
                /* Pop 2 values, add 'em, push the result back to the stack */
                uint64_t arg_right = POP();
                *TOS_PTR() += arg_right;
                break;
            }
            /*
             * Lots of other instruction handlers here
             */
            case OP_DONE: {
                return SUCCESS;
            }
            default:
                return ERROR_UNKNOWN_OPCODE;
            }
        }
        return ERROR_END_OF_STREAM;
    }
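    To make the shape of this loop concrete, here is a stripped-down but compilable variant with only three opcodes; the opcode numbers, the result-reporting convention, and the error codes are illustrative, not PigletVM's actual ones:

```c
#include <stdint.h>

/* Toy opcode numbering and error codes, invented for this demo */
enum { OP_PUSHI = 1, OP_ADD = 2, OP_DONE = 3 };
enum { SUCCESS = 0, ERROR_UNKNOWN_OPCODE = 1 };

static uint64_t stack[256];
static uint64_t *stack_top;

int mini_interpret(const uint8_t *bytecode, uint64_t *result)
{
    stack_top = stack;
    const uint8_t *ip = bytecode;
    for (;;) {
        uint8_t instruction = *ip++;
        switch (instruction) {
        case OP_PUSHI: {
            /* 16-bit immediate argument, big-endian here */
            uint16_t arg = (uint16_t)((ip[0] << 8) | ip[1]);
            ip += 2;
            *stack_top++ = arg;
            break;
        }
        case OP_ADD: {
            /* pop the right operand, add it to the new top of the stack */
            uint64_t right = *--stack_top;
            stack_top[-1] += right;
            break;
        }
        case OP_DONE:
            *result = stack_top[-1];
            return SUCCESS;
        default:
            return ERROR_UNKNOWN_OPCODE;
        }
    }
}
```

Feeding it the bytecode for PUSHI 2; PUSHI 3; ADD; DONE leaves 5 in the result.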

    From the code it is clear that for each opcode the piglet should:

    1. Extract opcode from instruction stream.
    2. Make sure that the opcode is within the allowed range of opcode values (the C compiler adds this check when generating the switch code).
    3. Go to the body instructions.
    4. Extract instruction arguments from the stack or decode an instruction argument placed directly in bytecode.
    5. Perform an operation.
    6. If there is a result of the calculation, put it on the stack.
    7. Move the pointer from the current instruction to the next.

    The payload is only in step 5; everything else is overhead: decoding or fetching instruction arguments from the stack (step 4), checking the opcode value (step 2), and the repeated return to the top of the main loop followed by a hard-to-predict conditional branch (step 3).

    In short, the pig has clearly exceeded the recommended body mass index, and if we want to bring it into shape, we will have to deal with all these excesses.

    Swine Assembly and Sieve of Eratosthenes

    To begin, we define the rules of the game.

    Writing programs for a virtual machine directly in C is bad form, and creating a full programming language takes too long, so the pig and I decided to limit ourselves to a pig assembly language.

    A program that sums the numbers from 1 to 65535 looks like this in this assembly:

    # sum numbers from 1 to 65535
    # init the current sum and the index
    PUSHI 1
    PUSHI 1
    # stack: s=1, i=1
    STOREI 0
    # stack: s=1

    # routine: increment the counter, add it to the current sum
    incrementandadd:
    # check if index is too big
    LOADI 0
    # stack: s, i
    ADDI 1
    # stack: s, i+1
    DUP
    # stack: s, i+1, i+1
    GREATER_OR_EQUALI 65535
    # stack: s, i+1, 1 or 0
    JUMP_IF_TRUE done
    # stack: s, i+1
    DUP
    # stack: s, i+1, i+1
    STOREI 0
    # stack: s, i+1
    ADD
    # stack: s+i+1
    JUMP incrementandadd
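    As a sanity check, the same loop can be written directly in C, mirroring the stack comments above (the 65535 bound and the exact loop shape are my reading of the listing):

```c
#include <stdint.h>

/* A C mirror of the assembly loop: the sum starts at 1, the counter lives
 * in memory cell 0, and the loop stops once the incremented counter
 * reaches 65535. */
uint64_t pig_sum(void)
{
    uint64_t s = 1;
    uint64_t memory0 = 1;          /* STOREI 0 */
    for (;;) {
        uint64_t i = memory0 + 1;  /* LOADI 0; ADDI 1 */
        if (i >= 65535)            /* GREATER_OR_EQUALI 65535; JUMP_IF_TRUE done */
            return s;
        memory0 = i;               /* DUP; STOREI 0 */
        s += i;                    /* ADD */
    }
}
```

With this reading of the bound, pig_sum() returns 1 + 2 + ... + 65534 = 2147385345.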

    Not Python, of course, but it has everything needed for pig happiness: comments, labels, conditional and unconditional jumps to them, mnemonics for instructions, and the ability to specify immediate instruction arguments.

    The Piglet VM ships with an assembler and a disassembler, which readers brave in spirit and rich in free time can try out in battle on their own.

    The numbers add up very quickly, so to test performance I wrote another program: a naive implementation of the sieve of Eratosthenes.

    In fact, the piglet runs pretty quickly as it is (its instructions are close to machine instructions), so to get meaningful numbers, every measurement below runs the program a hundred times.

    The first version of our non-optimized pig runs like this:

    > ./pigletvm runtimes test/sieve-unoptimized.bin 100 > /dev/null
    PROFILE: switch code finished took 545ms

    Half a second! The comparison is certainly unfair, but the same algorithm in Python does its hundred runs just a little more slowly:

    > python test/ > /dev/null

    4.5 seconds, or nine times slower. We must give the pig its due: it has talent! Well, now let's see whether our pig can work on its abs.

    Exercise One: Static Superinstructions

    The first rule of fast code: don't do extra work. The second rule of fast code: never do any extra work. So what extra work does the Piglet VM do?

    Observation one: profiling our program shows that some sequences of instructions occur more often than others. We won't torment our pig too much and will confine ourselves to pairs of instructions:

    1. LOADI 0, ADD — load the number from memory at address 0 and add it to the number on top of the stack.
    2. PUSHI 65536, GREATER_OR_EQUAL — put a number on the stack and compare it with the number that was previously at the top of the stack, putting the result of the comparison (0 or 1) back onto the stack.
    3. PUSHI 1, ADD - put a number on the stack, add it to the number that was previously at the top of the stack, and put the result of the addition back to the stack.

    The Piglet VM has a little over 20 instructions, while a whole byte, 256 values, is available for encoding them. Adding new instructions is not a problem. Here is what we do:

    for (;;) {
        uint8_t instruction = NEXT_OP();
        switch (instruction) {
        /*
         * Other instructions here
         */
        case OP_LOADADDI: {
            /* get the immediate argument as a memory address, add the value
             * at that address to the top of the stack */
            uint16_t addr = NEXT_ARG();
            uint64_t val = vm.memory[addr];
            *TOS_PTR() += val;
            break;
        }
        case OP_GREATER_OR_EQUALI: {
            /* get the immediate argument, compare the top of the stack with it */
            uint64_t arg_right = NEXT_ARG();
            *TOS_PTR() = PEEK() >= arg_right;
            break;
        }
        case OP_ADDI: {
            /* Add the immediate value to the top of the stack */
            uint16_t arg_right = NEXT_ARG();
            *TOS_PTR() += arg_right;
            break;
        }
        /*
         * Other instructions here
         */
        }
    }

    Nothing complicated. Let's see what came of it:

    > ./pigletvm runtimes test/sieve.bin 100 > /dev/null
    PROFILE: switch code finished took 410ms

    Wow! Just three new instructions, and we've won almost a hundred and fifty milliseconds!

    The win comes from the fact that our piglet makes no extra moves when executing such instructions: the execution flow does not fall back out into the main loop, nothing is decoded again, and the instruction arguments do not make an extra round trip through the stack.

    These are called static superinstructions, because the extra instructions are defined statically, that is, by the virtual machine's programmer at design time. It is a simple and effective technique that all virtual machines for programming languages use in one form or another.

    The main problem with static superinstructions is that, without a specific program, it is impossible to decide which instructions to combine. Different programs use different sequences of instructions, and those sequences only become known once a specific piece of code is running.

    The next step would be compiling superinstructions dynamically, in the context of a specific program, that is, dynamic superinstructions (in the 90s and early 2000s this technique served as a primitive form of JIT compilation).

    Creating instructions on the fly is impossible within ordinary C, and our little pig quite rightly does not consider that fair competition. Fortunately, I have a couple of better exercises for him.

    Exercise two: checking the range of opcode values

    Following our rules of quick code, let us once again ask ourselves the eternal question: what can we not do?

    When we got acquainted with the design of the Piglet VM, I listed all the actions the virtual machine performs for each opcode. And step 2 (checking that the opcode value falls within the valid interval of switch values) raises the most suspicion.

    Let's look at how GCC compiles the switch construct:

    1. A jump table is built, that is, a table mapping opcode values to the addresses of the code executing the instruction bodies.
    2. Code is inserted that checks whether the received opcode falls within the interval of all possible switch values and jumps to the default label if there is no handler for it.
    3. Code is inserted that jumps to the handler.

    But why perform the range check for every instruction? We reason that an opcode is either valid, in which case execution ends with the OP_DONE instruction, or invalid, in which case we have run past the end of the bytecode. The tail of the opcode stream is marked with zero, and zero is the opcode of the OP_ABORT instruction, which terminates execution of the bytecode with an error.

    It turns out this check isn't needed at all; the pig just has to find a way to convey the idea to the compiler. Let's tweak the main switch a bit:

    uint8_t instruction = NEXT_OP();
    /* Let the compiler know that opcodes are always between 0 and 31 */
    switch (instruction & 0x1f) {
    /* All the instructions here */
    case 26 ... 0x1f: {
        /* Handle the remaining 6 non-existing opcodes */
        return ERROR_UNKNOWN_OPCODE;
    }
    }

    Knowing that we have only 26 instructions, we apply a bit mask (the hexadecimal value 0x1f is binary 0b11111, covering the interval from 0 to 31) to the opcode and add handlers for the unused values in the interval from 26 to 31.

    Bitwise instructions are among the cheapest in the x86 architecture, and they are certainly cheaper than problematic conditional branches like the one implementing the range check. In theory, we gain a few cycles on every instruction executed, provided the compiler takes the hint.

    By the way, specifying an interval of values in a case is not standard C but a GCC extension. For our purposes this code will do, though, especially since it is easy to rewrite it as individual handlers for each of the unused values.
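    A toy, compilable illustration of the trick (the opcode layout and return codes here are invented for the demo, and the case-range syntax requires GCC or Clang):

```c
#include <stdint.h>

/* With `op & 0x1f`, the compiler can prove the switch value fits in 0..31
 * and omit the bounds check it would otherwise generate. */
int masked_dispatch(uint8_t op)
{
    switch (op & 0x1f) {
    case 0:                 /* OP_ABORT: e.g., we ran off the bytecode */
        return -1;
    case 1:                 /* some real instruction */
        return 1;
    /* ... cases up to 25 would go here ... */
    case 26 ... 0x1f:       /* GCC/Clang case-range extension */
        return -2;          /* ERROR_UNKNOWN_OPCODE */
    default:                /* the remaining real instructions, 2..25 */
        return 0;
    }
}
```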

    We try:

    > ./pigletvm runtimes test/sieve.bin  100 > /dev/null
    PROFILE: switch code finished took 437ms
    PROFILE: switch code (no range check) finished took 383ms

    Another 50 milliseconds! Piglet, it seems nothing is beyond you!

    Exercise three: traces

    What other exercises can help our pig? The biggest savings so far came from superinstructions, which reduce the number of trips back into the main loop and get rid of the corresponding overhead.

    The central switch is the main trouble spot for any processor with out-of-order execution. Modern branch predictors have learned to predict even such complex indirect jumps reasonably well, but spreading the branch points throughout the code could help the processor move from instruction to instruction faster.

    Another problem is the byte-by-byte reading of instruction opcodes and immediate arguments from the bytecode. Physical machines operate on 64-bit machine words and do not much like code that works with smaller values.

    Compilers often work with basic blocks, that is, sequences of instructions without branches or labels inside. A basic block starts either at the beginning of the program or at a label, and ends with the end of the program, a conditional branch, or a direct jump to the label starting the next basic block.

    Working with basic blocks has many advantages, but our pig cares about their key property: the instructions within a basic block execute sequentially. It would be nice to somehow single out these basic blocks and execute their instructions without wasting time on exits into the main loop.

    In our case we can even extend the notion of a basic block into a trace. A trace, in Piglet VM terms, includes all basic blocks connected sequentially, that is, by unconditional jumps.

    Besides executing instructions sequentially, it would also be nice to decode the instructions' immediate arguments in advance.

    All this sounds rather scary and resembles the dynamic compilation we decided not to use. The piglet even doubted his strength for a moment, but in practice things turned out not so bad.

    Let's first think about how to represent an instruction that goes into a trace:

    typedef struct scode {
        uint64_t arg;
        trace_op_handler *handler;
    } scode;

    Here arg is a pre-decoded instruction argument, and handler is a pointer to a function that executes the instruction logic.

    Now each trace looks like this:

    typedef scode trace[MAX_TRACE_LEN];

    That is, a trace is a sequence of s-codes of limited length. The trace cache inside the virtual machine looks like this:

    trace trace_cache[MAX_CODE_LEN];

    It is simply an array of traces, as long as the maximum possible length of the bytecode. This is a lazy solution; in practice, to save memory, a hash table would make more sense.

    Before interpretation starts, the first handler of each trace is set to a handler that compiles that trace:

    for (size_t trace_i = 0; trace_i < MAX_CODE_LEN; trace_i++)
        vm_trace.trace_cache[trace_i][0].handler = trace_compile_handler;

    The main interpreter loop now looks like this:

    while (vm_trace.is_running) {
        scode *code = &vm_trace.trace_cache[vm_trace.pc][0];
        code->handler(code);
    }

    The trace-compiling handler is a little more complicated: besides building the trace starting at the current instruction, it also runs it:

    static void trace_compile_handler(scode *trace_head)
    {
        scode *trace_tail = trace_head;
        /*
         * Trace building here
         */

        /* now, run the chain: trace_compile_handler has been replaced with a
         * proper instruction handler function pointer */
        trace_head->handler(trace_head);
    }

    A normal instruction handler:

    static void op_add_handler(scode *code)
    {
        uint64_t arg_right = POP();
        *TOS_PTR() += arg_right;

        /*
         * Call the next handler in the trace: s-codes are located in an array,
         * so we can use pointer arithmetic to get to it
         */
        code++;
        code->handler(code);
    }

    Each trace is terminated by a handler that makes no further calls at the end of the function:

    static void op_done_handler(scode *code)
    {
        (void) code;
        vm_trace.is_running = false;
        vm_trace.error = SUCCESS;
    }
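    Putting the pieces together, here is a minimal self-contained sketch of the whole trace mechanism, with a hand-built trace standing in for a real trace compiler; the names mirror the article, but the surrounding state is simplified and the opcode set is a toy one:

```c
#include <stdbool.h>
#include <stdint.h>

/* An s-code: a pre-decoded argument plus a handler pointer */
typedef struct scode scode;
typedef void trace_op_handler(scode *code);

struct scode {
    uint64_t arg;
    trace_op_handler *handler;
};

static uint64_t stack[256];
static uint64_t *stack_top;
static bool is_running;
static uint64_t result;

static void op_pushi_handler(scode *code)
{
    *stack_top++ = code->arg;  /* argument was decoded at trace-build time */
    code++;                    /* s-codes are contiguous: pointer arithmetic */
    code->handler(code);       /* call the next handler in the trace */
}

static void op_add_handler(scode *code)
{
    uint64_t arg_right = *--stack_top;
    stack_top[-1] += arg_right;
    code++;
    code->handler(code);
}

static void op_done_handler(scode *code)
{
    (void) code;               /* terminates the trace: no further calls */
    is_running = false;
    result = stack_top[-1];
}

uint64_t run_demo_trace(void)
{
    /* A hand-built trace for PUSHI 2; PUSHI 3; ADD; DONE */
    scode trace[] = {
        {2, op_pushi_handler},
        {3, op_pushi_handler},
        {0, op_add_handler},
        {0, op_done_handler},
    };
    stack_top = stack;
    is_running = true;
    trace[0].handler(&trace[0]);
    return result;
}
```

run_demo_trace() executes the chain starting from the first s-code and returns 5.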

    All this is, of course, more complicated than adding superinstructions, but let's see whether it got us anything:

    > ./pigletvm runtimes test/sieve.bin  100 > /dev/null
    PROFILE: switch code finished took 427ms
    PROFILE: switch code (no range check) finished took 395ms
    PROFILE: trace code finished took 367ms

    Hooray, another 30 milliseconds!

    How so? Instead of simple jumps between labels, we build chains of calls to instruction handlers, spending time on the calls and on argument passing, and yet our little pig still runs along the traces faster than the plain switch does with its labels.

    Such a performance gain in runs is achieved due to three factors:

    1. Predicting branches scattered in different places in the code is easy.
    2. The arguments of handlers are always pre-decoded into a full machine word, and this is done only once - during the compilation of the trace.
    3. The compiler turns each chain of handler functions into a single call to the first handler, which is possible thanks to tail-call optimization.

    Before summing up our training sessions, the pig and I decided to try one more ancient program interpretation technique: threaded code.

    Exercise Four: threaded code

    Anyone interested in the history of interpreters has, like our pig, heard of threaded code. There are many variants of this technique, but they all boil down to walking through an array of, for example, function pointers or labels instead of an array of opcodes, jumping through them directly, without an intermediate opcode.

    Function calls are expensive and make little sense these days, and most of the other variants of threaded code cannot be implemented within standard C. Even the technique described below relies on a widespread but non-standard C extension: pointers to labels.

    In the variant of threaded code I chose for our swine goals (token threaded code), we keep the bytecode, but before interpreting it we build a table mapping instruction opcodes to the addresses of the instruction handlers:

    const void *labels[] = {
        [OP_PUSHI] = &&op_pushi,
        [OP_LOADI] = &&op_loadi,
        [OP_LOADADDI] = &&op_loadaddi,
        [OP_STORE] = &&op_store,
        [OP_STOREI] = &&op_storei,
        [OP_LOAD] = &&op_load,
        [OP_DUP] = &&op_dup,
        [OP_DISCARD] = &&op_discard,
        [OP_ADD] = &&op_add,
        [OP_ADDI] = &&op_addi,
        [OP_SUB] = &&op_sub,
        [OP_DIV] = &&op_div,
        [OP_MUL] = &&op_mul,
        [OP_JUMP] = &&op_jump,
        [OP_JUMP_IF_TRUE] = &&op_jump_if_true,
        [OP_JUMP_IF_FALSE] = &&op_jump_if_false,
        [OP_EQUAL] = &&op_equal,
        [OP_LESS] = &&op_less,
        [OP_LESS_OR_EQUAL] = &&op_less_or_equal,
        [OP_GREATER] = &&op_greater,
        [OP_GREATER_OR_EQUAL] = &&op_greater_or_equal,
        [OP_GREATER_OR_EQUALI] = &&op_greater_or_equali,
        [OP_POP_RES] = &&op_pop_res,
        [OP_DONE] = &&op_done,
        [OP_PRINT] = &&op_print,
        [OP_ABORT] = &&op_abort,
    };

    Note the && characters: these are pointers to the labels holding the instruction bodies, a non-standard GCC extension.

    To start executing the code, it is enough to jump through the pointer to the label corresponding to the program's first opcode:

    goto *labels[NEXT_OP()];

    There is no loop here, and there won't be one: each instruction itself jumps to the next handler:

    op_pushi: {
        /* get the argument, push it onto stack */
        uint16_t arg = NEXT_ARG();
        PUSH(arg);
        /* jump to the next instruction */
        goto *labels[NEXT_OP()];
    }

    The absence of a switch "smears" the branch points across the instruction bodies, which in theory should help the branch predictor under out-of-order execution. We have, in effect, built the switch directly into the instructions and formed the jump table by hand.

    That's the whole technique. The piglet liked it for its simplicity. Let's see how it does in practice:

    > ./pigletvm runtimes test/sieve.bin  100 > /dev/null
    PROFILE: switch code finished took 443ms
    PROFILE: switch code (no range check) finished took 389ms
    PROFILE: threaded code finished took 477ms
    PROFILE: trace code finished took 364ms

    Oops! This is the slowest of all our techniques! What happened? Let's run the same tests with all GCC optimizations turned off:

    > ./pigletvm runtimes test/sieve.bin  100 > /dev/null
    PROFILE: switch code finished took 969ms
    PROFILE: switch code (no range check) finished took 940ms
    PROFILE: threaded code finished took 824ms
    PROFILE: trace code finished took 1169ms

    With optimizations off, the threaded code shows itself better.

    Three factors play a role here:

    1. The optimizing compiler builds a jump table no worse than our manual table of labels.
    2. Modern compilers are remarkably good at eliminating unnecessary function calls.
    3. Roughly since Intel's Haswell generation of processors, branch predictors have learned to accurately predict jumps going through a single branch point.

    Out of old habit this technique is still used in, for example, the code of the Python VM's interpreter, but, to be honest, these days it is already an archaism.

    Let's finally summarize and evaluate the success that our pig has achieved.

    Debriefing the pig flights

    I'm not sure this can be called flight, but let's admit it: our little pig has come a long way, from 550 milliseconds per hundred runs of the "sieve" down to the final 370 milliseconds. We used various techniques: superinstructions, eliminating the range check on opcode values, the complicated trace machinery, and finally even threaded code. Throughout, we generally stayed within features implemented in all popular C compilers. A speedup of one and a half times seems a good result to me, and the pig has earned an extra helping of bran in his trough.

    One of the implicit conditions the pig and I set for ourselves is preserving the Piglet VM's stack architecture. Moving to a register architecture usually reduces the number of instructions a program needs and, accordingly, can get rid of unnecessary trips into the instruction dispatcher. I think this could shave off another 10-20% of the time.

    Our main condition, the absence of dynamic compilation, is not a law of nature either. Pumping the pig full of steroids in the form of JIT compilation is very easy these days: libraries like GNU Lightning or LibJIT have already done all the dirty work. But development time and the total amount of code grow considerably even with such libraries.

    There are, of course, other techniques our piglet's hooves have not yet reached. But there is no limit to perfection, and our swine journey, the second article in the series on bytecode interpreters, has to end somewhere. If readers come up with interesting ways to speed up a pig, the pig and I will gladly try them.

    PS Special thanks to my sister, Renate Casanova, for the sketches of the illustrations, and to our illustrator, Vladimir Shopotov, for the final drawings.

    PPS The original piglet is not very talkative and understands only its primitive assembler. But true-grue, in some magical way, made a small language for it in just a few hours: PigletC. Now you can grunt without restrictions!

    PPPS Reader iliazeus suggested another optimization: caching the top of the stack in a separate variable. Surprisingly, this change makes the threaded code the fastest variant of all: with it, the sieve of Eratosthenes runs twice as fast as on the original naive version of the interpreter. The curious are invited to look at the code.
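    For those who stop short of reading the code, here is a sketch of the idea on a plain switch loop: the top-of-stack value lives in a local variable that the compiler can keep in a register, and only the values below it go through memory. The three-opcode set is invented for this sketch, and the details of iliazeus's actual patch may differ:

```c
#include <stdint.h>

enum { OP_PUSHI = 1, OP_ADD = 2, OP_DONE = 3 };

uint64_t tos_cached_interpret(const uint8_t *bytecode)
{
    uint64_t stack[256];
    uint64_t *stack_bottom = stack;  /* everything below the cached top */
    uint64_t tos = 0;                /* cached top of the stack (dummy at start) */
    const uint8_t *ip = bytecode;

    for (;;) {
        switch (*ip++) {
        case OP_PUSHI: {
            uint16_t arg = (uint16_t)((ip[0] << 8) | ip[1]);
            ip += 2;
            *stack_bottom++ = tos;   /* spill the old top to memory */
            tos = arg;               /* the new top stays in a register */
            break;
        }
        case OP_ADD:
            tos += *--stack_bottom;  /* pop directly into the cached top */
            break;
        case OP_DONE:
            return tos;
        }
    }
}
```

For PUSHI 2; PUSHI 3; ADD; DONE this touches memory only for the spilled values, while the additions happen entirely in the cached variable.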
