Do-it-yourself bytecode interpreters

Virtual machine programming languages in recent decades have become very widespread. A lot of time has passed since the presentation of the Java Virtual Machine in the second half of the 90s, and it can be said with confidence that bytecode interpreters are not the future, but the present.

But this technique, in my opinion, is almost universal, and an understanding of the basic principles of interpreter development will be useful not only to the creator of the next contender for the title "Language of the Year" according to TIOBE , but to any programmer in general.

In a word, if you are interested in knowing how numbers add up to our favorite programming languages, what the developers of virtual machines are still arguing about and how painlessly to match strings and regular expressions, please under the cat.

Part One, introductory (current)
Part Two, optimizing
Part Three, applied

Prehistory

One of the self-written systems of the Business Intelligence department of our company has an interface in the form of a simple query language. In the first version of the system, this language was interpreted on the fly, without compilation, directly from the input string with the query. The second version of the parser will work with intermediate byte-code, which will allow to separate the query language from their execution and greatly simplify the code.

In the process of working on the second version of the system, I had a vacation, during which for an hour or two I was distracted every day from family matters to study materials on the architecture and performance of bytecode interpreters. I decided to share the resulting notes and examples of interpreters with Habr's readers in the form of a series of articles.

The first one presents five small (up to a hundred lines of simple C code) virtual machines (ok), each of which reveals a certain aspect of the development of such interpreters.

Where there are bytecodes in programming languages

Virtual machines, a wide variety of virtual instruction sets over the past several decades, a great many have been invented. Wikipedia claims that the first programming languages were compiled into various simplified intermediate representations back in the 60s of the last century. Some of these first bytecodes were converted to machine codes and executed by real processors, others were interpreted by virtual processors on the fly.

The popularity of virtual instruction sets as an intermediate code representation is explained by three reasons:

Programs in the form of bytecodes are easily transferred to new platforms.
Bytecode interpreters work faster than syntax tree interpreters.
You can develop a simple virtual machine in just a couple of hours.

Let's make some simplest C virtual machines and use these examples to highlight the main technical aspects of the implementation of virtual machines.

Full sample codes are posted on GitHub . Examples can be collected with any relatively fresh GCC:

gcc interpreter-basic-switch.c -o interpreter
./interpreter

All examples have the same structure: first comes the code of the virtual machine itself, after - the main function with assertions that check the operation of the code. I tried to clearly comment on opcodes and key interpreter locations. I hope the article will be clear even to people who do not write daily in C.

The world's easiest bytecode interpreter.

As I said before, the simplest interpreter is very easy to make. Comments - right after the listing, but let's start directly with the code:

struct {uint8_t *ip;
    uint64_t accumulator;
} vm;
typedefenum {
    /* increment the register */
    OP_INC,
    /* decrement the register */
    OP_DEC,
    /* stop execution */
    OP_DONE
} opcode;
typedefenum interpret_result {
    SUCCESS,
    ERROR_UNKNOWN_OPCODE,
} interpret_result;
voidvm_reset(void){
    puts("Reset vm state");
    vm = (typeof(vm)) { NULL };
}
interpret_result vm_interpret(uint8_t *bytecode){
    vm_reset();
    puts("Start interpreting");
    vm.ip = bytecode;
    for (;;) {
        uint8_t instruction = *vm.ip++;
        switch (instruction) {
        case OP_INC: {
            vm.accumulator++;
            break;
        }
        case OP_DEC: {
            vm.accumulator--;
            break;
        }
        case OP_DONE: {
            return SUCCESS;
        }
        default:
            return ERROR_UNKNOWN_OPCODE;
        }
    }
    return SUCCESS;
}

There are less than one hundred lines, but all the characteristic attributes of a virtual machine are represented. The car has a single register ( vm.accumulator), three operations (register increment, register decrement and completion of program execution) and a pointer to the current instruction ( vm.ip).

Each operation (eng. Operation code , or opcode ) is encoded in one byte, and dispatching is performed using the usual switchin the function vm_interpret. Branches in switchcontain the logic of operations, that is, change the state of the register or complete the execution of the program.

The operations are passed to the function vm_interpretas an array of bytes - bytecode ( bytecode ) - and are executed successively until the virtual machine shutdown operation ( OP_DONE) is encountered in the bytecode .

A key aspect of a virtual machine is semantics, that is, a set of operations that are possible on it. In this case, there are only two operations, and they change the value of a single register.

Some researchers ( Virtual-machine Abstraction and Optimization Techniques , 2009) propose to divide virtual machines into high-level and low-level ones according to the proximity of the semantics of the virtual machine to the semantics of the physical machine on which the byte code will be executed.

In the extreme case, the bytecode of low-level virtual machines can completely repeat the machine code of a physical machine with simulated RAM, a full set of registers, instructions for working with the stack, and so on. The Bochs virtual machine , for example, repeats the x86 architecture instruction set.

And vice versa: the operations of high-level virtual machines closely reflect the semantics of the specialized programming language compiled into byte-code. This is how SQLite, Gawk, and numerous versions of Prolog work, for example.

Intermediate position is occupied by interpreters of general-purpose programming languages, having elements of both high and low levels. The most popular Java Virtual Machine has both low-level instructions for working with a stack, and built-in support for object-oriented programming with automatic memory allocation.

The code above is more likely to be the most primitive of low-level virtual machines: each virtual instruction is a wrapper over one or two physical instructions, and the virtual register fully corresponds to one register of the “iron” processor.

Arguments bytecode instructions

We can say that the only case in our virtual machine example is both the argument and the return value of all the instructions being executed. However, we may need the ability to pass arguments to instructions. One way is to directly put them into bytecode.

We extend the example by inserting instructions (OP_ADDI, OP_SUBI), which take an argument in the form of a byte, immediately following the opcode:

struct {uint8_t *ip;
    uint64_t accumulator;
} vm;
typedefenum {
    /* increment the register */
    OP_INC,
    /* decrement the register */
    OP_DEC,
    /* add the immediate argument to the register */
    OP_ADDI,
    /* subtract the immediate argument from the register */
    OP_SUBI,
    /* stop execution */
    OP_DONE
} opcode;
typedefenum interpret_result {
    SUCCESS,
    ERROR_UNKNOWN_OPCODE,
} interpret_result;
voidvm_reset(void){
    puts("Reset vm state");
    vm = (typeof(vm)) { NULL };
}
interpret_result vm_interpret(uint8_t *bytecode){
    vm_reset();
    puts("Start interpreting");
    vm.ip = bytecode;
    for (;;) {
        uint8_t instruction = *vm.ip++;
        switch (instruction) {
        case OP_INC: {
            vm.accumulator++;
            break;
        }
        case OP_DEC: {
            vm.accumulator--;
            break;
        }
        case OP_ADDI: {
            /* get the argument */uint8_t arg = *vm.ip++;
            vm.accumulator += arg;
            break;
        }
        case OP_SUBI: {
            /* get the argument */uint8_t arg = *vm.ip++;
            vm.accumulator -= arg;
            break;
        }
        case OP_DONE: {
            return SUCCESS;
        }
        default:
            return ERROR_UNKNOWN_OPCODE;
        }
    }
    return SUCCESS;
}

New instructions (see function vm_interpret) read their argument from bytecode and add it to the register / subtract it from the register.

Such an argument is called an immediate argument (the immediate argument ), since it is located directly in the array of opcodes. The main limitation in our implementation is that the argument is a single byte and can only take 256 values.

In our virtual machine, the range of possible values of the arguments of instructions does not play a big role. But if the virtual machine will be used as an interpreter of a real language, it makes sense to complicate the byte-code by adding a table of constants separate from the opcode array and instructions with a direct argument corresponding to the address of the present argument in the table of constants.

Stack machine

Instructions in our simple virtual machine always work with one register and cannot transfer data to each other in any way. In addition, the argument of the instruction can only be immediate, but, say, the operation of addition or multiplication takes two arguments.

Simply put, we have no way to evaluate complex expressions. To solve this problem, you need a stack machine, that is, a virtual machine with a built-in stack:

#define STACK_MAX 256struct {uint8_t *ip;
    /* Fixed-size stack */uint64_tstack[STACK_MAX];
    uint64_t *stack_top;
    /* A single register containing the result */uint64_t result;
} vm;
typedefenum {
    /* push the immediate argument onto the stack */
    OP_PUSHI,
    /* pop 2 values from the stack, add and push the result onto the stack */
    OP_ADD,
    /* pop 2 values from the stack, subtract and push the result onto the stack */
    OP_SUB,
    /* pop 2 values from the stack, divide and push the result onto the stack */
    OP_DIV,
    /* pop 2 values from the stack, multiply and push the result onto the stack */
    OP_MUL,
    /* pop the top of the stack and set it as execution result */
    OP_POP_RES,
    /* stop execution */
    OP_DONE,
} opcode;
typedefenum interpret_result {
    SUCCESS,
    ERROR_DIVISION_BY_ZERO,
    ERROR_UNKNOWN_OPCODE,
} interpret_result;
voidvm_reset(void){
    puts("Reset vm state");
    vm = (typeof(vm)) { NULL };
    vm.stack_top = vm.stack;
}
voidvm_stack_push(uint64_t value){
    *vm.stack_top = value;
    vm.stack_top++;
}
uint64_t vm_stack_pop(void)
{
    vm.stack_top--;
    return *vm.stack_top;
}
interpret_result vm_interpret(uint8_t *bytecode){
    vm_reset();
    puts("Start interpreting");
    vm.ip = bytecode;
    for (;;) {
        uint8_t instruction = *vm.ip++;
        switch (instruction) {
        case OP_PUSHI: {
            /* get the argument, push it onto stack */uint8_t arg = *vm.ip++;
            vm_stack_push(arg);
            break;
        }
        case OP_ADD: {
            /* Pop 2 values, add 'em, push the result back to the stack */uint64_t arg_right = vm_stack_pop();
            uint64_t arg_left = vm_stack_pop();
            uint64_t res = arg_left + arg_right;
            vm_stack_push(res);
            break;
        }
        case OP_SUB: {
            /* Pop 2 values, subtract 'em, push the result back to the stack */uint64_t arg_right = vm_stack_pop();
            uint64_t arg_left = vm_stack_pop();
            uint64_t res = arg_left - arg_right;
            vm_stack_push(res);
            break;
        }
        case OP_DIV: {
            /* Pop 2 values, divide 'em, push the result back to the stack */uint64_t arg_right = vm_stack_pop();
            /* Don't forget to handle the div by zero error */if (arg_right == 0)
                return ERROR_DIVISION_BY_ZERO;
            uint64_t arg_left = vm_stack_pop();
            uint64_t res = arg_left / arg_right;
            vm_stack_push(res);
            break;
        }
        case OP_MUL: {
            /* Pop 2 values, multiply 'em, push the result back to the stack */uint64_t arg_right = vm_stack_pop();
            uint64_t arg_left = vm_stack_pop();
            uint64_t res = arg_left * arg_right;
            vm_stack_push(res);
            break;
        }
        case OP_POP_RES: {
            /* Pop the top of the stack, set it as a result value */uint64_t res = vm_stack_pop();
            vm.result = res;
            break;
        }
        case OP_DONE: {
            return SUCCESS;
        }
        default:
            return ERROR_UNKNOWN_OPCODE;
        }
    }
    return SUCCESS;
}

In this example, there are already more operations, and almost all of them work only with the stack. OP_PUSHI pushes its immediate argument onto the stack. The instructions OP_ADD, OP_SUB, OP_DIV, OP_MUL extract a pair of values from the stack, calculate the result and push it back onto the stack. OP_POP_RES removes the value from the stack and places it in the result register, intended for the results of the virtual machine.

For the operation of division (OP_DIV), the error of division by zero is trapped, which stops the operation of the virtual machine.

The capabilities of such a machine are much wider than the previous one with a single register and allow, for example, to calculate complex arithmetic expressions. Another (and important!) Advantage is the simplicity of compiling programming languages into a byte-code stack machine.

Register machine

Due to its simplicity, stack virtual machines are the most widely used among developers of programming languages; the same JVM and Python VMs use them.

However, such machines have drawbacks: they have to add special instructions for working with the stack, when calculating expressions all arguments repeatedly pass through a single data structure, a lot of unnecessary instructions appear in the stacking code.

Meanwhile, the implementation of each extra instruction entails the cost of scheduling, that is, decoding the opcode and the transition to the body of instructions.

An alternative to stack machines is register-based virtual machines. They have a more complicated byte-code: each instruction clearly encoded the number of registers-arguments and the number of the register-result. Accordingly, instead of a stack, an extended set of registers is used as a repository of intermediate values.

#define REGISTER_NUM 16struct {uint16_t *ip;
    /* Register array */uint64_t reg[REGISTER_NUM];
    /* A single register containing the result */uint64_t result;
} vm;
typedefenum {
    /* Load an immediate value into r0  */
    OP_LOADI,
    /* Add values in r0,r1 registers and put them into r2 */
    OP_ADD,
    /* Subtract values in r0,r1 registers and put them into r2 */
    OP_SUB,
    /* Divide values in r0,r1 registers and put them into r2 */
    OP_DIV,
    /* Multiply values in r0,r1 registers and put them into r2 */
    OP_MUL,
    /* Move a value from r0 register into the result register */
    OP_MOV_RES,
    /* stop execution */
    OP_DONE,
} opcode;
typedefenum interpret_result {
    SUCCESS,
    ERROR_DIVISION_BY_ZERO,
    ERROR_UNKNOWN_OPCODE,
} interpret_result;
voidvm_reset(void){
    puts("Reset vm state");
    vm = (typeof(vm)) { NULL };
}
voiddecode(uint16_t instruction,
            uint8_t *op,
            uint8_t *reg0, uint8_t *reg1, uint8_t *reg2,
            uint8_t *imm){
    *op = (instruction & 0xF000) >> 12;
    *reg0 = (instruction & 0x0F00) >> 8;
    *reg1 = (instruction & 0x00F0) >> 4;
    *reg2 = (instruction & 0x000F);
    *imm = (instruction & 0x00FF);
}
interpret_result vm_interpret(uint16_t *bytecode){
    vm_reset();
    puts("Start interpreting");
    vm.ip = bytecode;
    uint8_t op, r0, r1, r2, immediate;
    for (;;) {
        /* fetch the instruction */uint16_t instruction = *vm.ip++;
        /* decode it */
        decode(instruction, &op, &r0, &r1, &r2, &immediate);
        /* dispatch */switch (op) {
        case OP_LOADI: {
            vm.reg[r0] = immediate;
            break;
        }
        case OP_ADD: {
            vm.reg[r2] = vm.reg[r0] + vm.reg[r1];
            break;
        }
        case OP_SUB: {
            vm.reg[r2] = vm.reg[r0] - vm.reg[r1];
            break;
        }
        case OP_DIV: {
            /* Don't forget to handle the div by zero error */if (vm.reg[r1] == 0)
                return ERROR_DIVISION_BY_ZERO;
            vm.reg[r2] = vm.reg[r0] / vm.reg[r1];
            break;
        }
        case OP_MUL: {
            vm.reg[r2] = vm.reg[r0] * vm.reg[r1];
            break;
        }
        case OP_MOV_RES: {
            vm.result = vm.reg[r0];
            break;
        }
        case OP_DONE: {
            return SUCCESS;
        }
        default:
            return ERROR_UNKNOWN_OPCODE;
        }
    }
    return SUCCESS;
}

The example shows a register machine with 16 registers. The instructions take 16 bits each and are encoded in three ways:

4 bits per opcode + 4 bits per register name + 8 bits per argument.
4 bits for the operation code + three times 4 bits for the names of the registers.
4 bits per opcode + 4 bits for a single register name + 8 unused bits.

Our small virtual machine has very few operations, so four bits (or 16 possible operations) per opcode are enough. The operation determines exactly what the remaining bits of the instruction represent.

The first type of coding (4 + 4 + 8) is needed to load data into the registers with the OP_LOADI operation. The second type (4 + 4 + 4 + 4) is used for arithmetic operations that need to know where to take a couple of arguments and where to add the result of the calculation. And finally, the last view (4 + 4 + 8 unnecessary bits) is used for instructions with a single register as an argument, in our case it is OP_MOV_RES.

For encoding and decoding instructions, special logic (function decode) is now needed . On the other hand, the logic of instructions due to the explicit indication of the location of the arguments becomes easier - operations with the stack disappear.

The key features are: there are fewer instructions in the byte code of the register machines, separate instructions are wider, compiling into such byte code is more complicated - the compiler has to decide how to use the available registers.

It should be noted that in practice, register virtual machines usually have a stack where, for example, function arguments are placed; registers are used to evaluate individual expressions. Even if there is no explicit stack, then an array is used to build the stack, which plays the same role as the RAM in physical machines.

Stack and register machines, comparison

There is an interesting study ( Virtual machine showdown: Stack versus registers , 2008), which had a great influence on all subsequent developments in the field of virtual machines for programming languages. Its authors proposed a method for live translation from the stack code of the standard JVM to the register code and compared the performance.

The method is non-trivial: the code is first translated, and then it is optimized in a rather complicated way. But the subsequent comparison of the performance of the same program showed that the additional processor cycles spent on decoding instructions are fully compensated by a decrease in the total number of instructions. In general, in short, the register machine turned out to be more efficient than the stack one.

As mentioned above, this efficiency has a very tangible price: the compiler must allocate the registers itself and additionally it is desirable to have an advanced optimizer.

The debate about which architecture is better is still not over. If we talk about Java compilers, the Dalvik VM bytecode, which until recently worked in every Android device, was registered; but the titular JVM retained the stack instruction set. The Lua virtual machine uses the register machine, but the Python VM is still a stack. And so on.

Byte code in regular expression interpreters

Finally, to distract from low-level virtual machines, let's look at a specialized interpreter that checks strings for regular expression matching:

typedefenum {
    /* match a single char to an immediate argument from the string and advance ip and cp, or
     * abort*/
    OP_CHAR,
    /* jump to and match either left expression or the right one, abort if nothing matches*/
    OP_OR,
    /* do an absolute jump to an offset in the immediate argument */
    OP_JUMP,
    /* stop execution and report a successful match */
    OP_MATCH,
} opcode;
typedefenum match_result {
    MATCH_OK,
    MATCH_FAIL,
    MATCH_ERROR,
} match_result;
match_result vm_match_recur(uint8_t *bytecode, uint8_t *ip, char *sp){
    for (;;) {
        uint8_t instruction = *ip++;
        switch (instruction) {
        case OP_CHAR:{
            char cur_c = *sp;
            char arg_c = (char)*ip ;
            /* no match? FAILed to match */if (arg_c != cur_c)
                return MATCH_FAIL;
            /* advance both current instruction and character pointers */
            ip++;
            sp++;
            continue;
        }
        case OP_JUMP:{
            /* read the offset and jump to the instruction */uint8_t offset = *ip;
            ip = bytecode + offset;
            continue;
        }
        case OP_OR:{
            /* get both branch offsets */uint8_t left_offset = *ip++;
            uint8_t right_offset = *ip;
            /* check if following the first offset get a match */uint8_t *left_ip = bytecode + left_offset;
            if (vm_match_recur(bytecode, left_ip, sp) == MATCH_OK)
                return MATCH_OK;
            /* no match? Check the second branch */
            ip = bytecode + right_offset;
            continue;
        }
        case OP_MATCH:{
            /* success */return MATCH_OK;
        }
        }
        return MATCH_ERROR;
    }
}
match_result vm_match(uint8_t *bytecode, char *str){
    printf("Start matching a string: %s\n", str);
    return vm_match_recur(bytecode, bytecode, str);
}

The main instruction is OP_CHAR. It takes its immediate argument and compares it with the current character in the string ( char *sp). In case of coincidence of the expected and current characters in the line, the transition to the next instruction and the next character takes place.

The machine also understands the transition operation (OP_JUMP), which takes a single immediate argument. Argument means absolute offset in bytecode, from where the calculation should be continued.

The last important operation is OP_OR. It takes two offsets, trying to apply the code first on the first one, then, in case of an error, the second one. It does this with the help of a recursive call, that is, the instruction makes a crawl into the depth of the tree of all possible variants of a regular expression.

Surprisingly, four opcodes and seventy lines of code are enough to express regular expressions of the form "abc", "a? Bc", "(ab | bc) d", "a * bc". In this virtual machine, there is not even an explicit state, since everything that is needed - pointers to the beginning of the instruction stream, the current instruction and the current symbol - is passed in by the arguments of the recursive function.

If you are interested in the details of the work of regular expression engines, for a start you can familiarize yourself with the series of articles by Russ Cox ( Google Co2 , author Russ Cox ), the author of the regular expression engine from Google RE2 .

Results

Let's summarize.

For general purpose programming languages, as a rule, two architectures are used: stack and register.

In a stack model, the main data structure and the way to transfer arguments between instructions is the stack. In the register model, a set of registers is used to evaluate expressions, but an explicit or implicit stack is still used to store function arguments.

The presence of an explicit stack and a set of registers brings such machines closer to low-level and even physical ones. The abundance of low-level instructions in such a bytecode means that a substantial amount of physical processor resources is spent on decoding and dispatching virtual instructions.

On the other hand, high-level instructions play an important role in popular virtual machines. In Java, for example, these are instructions for polymorphic function calls, allocation of objects, and garbage collection.

Purely high-level virtual machines — for example, byte-code interpreters of languages with advanced semantics far from iron — spend most of their time not in the controller or decoder, but in the instructions bodies and, accordingly, are relatively effective.

Practical recommendations:

If you need to execute any byte-code and do it in a reasonable time, then try to operate with instructions that are closest to your task; the higher the semantic level, the better. This will reduce the cost of scheduling and simplify code generation.
If greater flexibility and heterogeneous semantics were required, then you should at least try to isolate the common denominator in the byte-code so that the resulting instructions are conditionally average.
If in the future it may be necessary to calculate any expressions, make a stack machine, this will reduce the headache when compiling bytecode.
If no expressions are foreseen, then make a trivial register machine, which will avoid the cost of the stack and simplify the instructions themselves.

In the following articles, I will analyze the practical implementations of virtual machines in popular programming languages and will explain why the Business Intelligence Badoo department needed a bytecode.

Tags: