
Tutorial: writing an interpreter with a JIT, using PyPy
All source code and examples from this article are available here.
The first time I watched the PyPy project, it took me a while to figure out what it was. It consists of two things:
- a set of tools for writing interpreters of programming languages;
- Python implementation using this toolkit.
Most people probably think that PyPy is just the second part, but this guide is not about the Python interpreter. It is about how to write an interpreter of your language.
I took this guide in order to better understand how PyPy works and what it is. It is assumed that you know very little about PyPy, so I will start from the very beginning.
What is PyPy?
Suppose we want to write an interpreter. This includes writing a source code parser, a bytecode execution loop, and a lot of the code in the standard library.
Writing a parser and compiler is usually not at all fun, so there are tools that generate a parser and compiler for you.
And even then you have to take care of memory management in the interpreter and implement basic data types such as arbitrary-precision integers, hash tables, and so on. This alone is enough to make many people think twice about implementing their own interpreter.
Wouldn't it be great if you could implement your language in a high-level language like Python, for example? You would have at your disposal all the advantages of a high-level language, such as automatic memory management and a rich set of data types. But interpreting a language in another interpreted language should be very slow, right?
As you might guess, PyPy solves this problem. PyPy is a sophisticated toolkit for analyzing your interpreter code and translating it into C (or to the JVM or CLI). This process is called "translation". It cannot translate every bit of Python's syntax, but it handles a large part of it. All you have to do is write your interpreter in RPython, a subset of the Python language, and you get a very efficient interpreter as a result.
Because writing efficient interpreters shouldn't be hard.
The Language
The language I chose to implement is terribly simple. A program operates on a tape of integer cells, initialized to zero, with a single pointer to the current cell. The language has only 8 commands:
< - move the pointer to the previous cell.
> - move the pointer to the next cell.
+ - increase the number in the current cell by one.
- - decrease the number in the current cell by one.
[ - if the number in the current cell is 0, skip forward to the instruction after the matching ].
] - jump back to the matching [.
. - write one byte from the current cell to the standard output stream.
, - read one byte from the standard input stream into the current cell.
Any unrecognized characters should be ignored.
Some of you may have recognized this language: it's brainfuck.
The language itself is already a bytecode, so it does not require a separate translation of the source code into bytecode. This means that the main execution loop of our interpreter will work directly with the source code. This simplifies its implementation a little.
First steps
Let's start by writing an interpreter in regular Python. Here is a sketch of the main run loop.
def mainloop(program):
    tape = Tape()
    pc = 0
    while pc < len(program):
        code = program[pc]
        if code == ">":
            tape.advance()
        elif code == "<":
            tape.devance()
        elif code == "+":
            tape.inc()
        elif code == "-":
            tape.dec()
        elif code == ".":
            sys.stdout.write(chr(tape.get()))
        elif code == ",":
            tape.set(ord(sys.stdin.read(1)))
        elif code == "[" and tape.get() == 0:
            pass  # Skip forward to the matching ]
        elif code == "]" and tape.get() != 0:
            pass  # Skip back to the matching [
        pc += 1
As you can see, the instruction counter (pc) holds the index of the current instruction. The first statement in the loop fetches the instruction, and a chain of conditionals decides how to execute it.
The implementation of the "[" and "]" operators is omitted here; they must set the instruction counter to the position of the matching bracket.
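For illustration, one way to find a matching bracket is a linear scan. This is a hypothetical helper of my own, not code from the article; the full example1.py instead looks matches up in a precomputed bracket_map, as described below.

```python
def find_matching_bracket(program, pc):
    # Scan forward from an opening '[' at index pc to its matching ']',
    # tracking nesting depth as we go.
    depth = 0
    for i in range(pc, len(program)):
        if program[i] == '[':
            depth += 1
        elif program[i] == ']':
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced brackets")
```

For example, in the program "[-[+]]" the bracket at index 0 matches the one at index 5, and the bracket at index 2 matches the one at index 4. A scan like this costs time on every jump, which is exactly why precomputing the matches pays off.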
And now the implementation of the Tape class, which stores a tape of integers and a pointer to the current number.
class Tape(object):
    def __init__(self):
        self.thetape = [0]
        self.position = 0

    def get(self):
        return self.thetape[self.position]

    def set(self, val):
        self.thetape[self.position] = val

    def inc(self):
        self.thetape[self.position] += 1

    def dec(self):
        self.thetape[self.position] -= 1

    def advance(self):
        self.position += 1
        if len(self.thetape) <= self.position:
            self.thetape.append(0)

    def devance(self):
        self.position -= 1
As you can see, the tape grows on demand. Strictly speaking, we should also add a check that the pointer never goes negative, but for now it doesn't matter.
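As a quick sanity check, here is a trimmed copy of the class with the negative-pointer guard just mentioned added to devance (the guard is my own addition; example1.py leaves it out):

```python
class Tape(object):
    # Trimmed copy of the class above, reproduced so this demo runs standalone.
    def __init__(self):
        self.thetape = [0]
        self.position = 0

    def get(self):
        return self.thetape[self.position]

    def inc(self):
        self.thetape[self.position] += 1

    def advance(self):
        self.position += 1
        if len(self.thetape) <= self.position:
            self.thetape.append(0)  # grow the tape on demand

    def devance(self):
        # The error check the text mentions: refuse to move before cell 0.
        if self.position == 0:
            raise IndexError("pointer moved before the start of the tape")
        self.position -= 1

tape = Tape()
tape.inc()      # tape: [1], pointer on cell 0
tape.advance()  # tape: [1, 0], pointer on cell 1
tape.inc()
tape.inc()      # tape: [1, 2]
```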
If the program contains a lot of comments, the interpreter will have to skip over them one byte at a time, so it is better to preprocess the source code in advance. While we're at it, we will build a dictionary of bracket positions, so that the match for any given bracket can be looked up directly.
def parse(program):
    parsed = []
    bracket_map = {}
    leftstack = []
    pc = 0
    for char in program:
        if char in ('[', ']', '<', '>', '+', '-', ',', '.'):
            parsed.append(char)
            if char == '[':
                leftstack.append(pc)
            elif char == ']':
                left = leftstack.pop()
                right = pc
                bracket_map[left] = right
                bracket_map[right] = left
            pc += 1
    return "".join(parsed), bracket_map
This function returns a string containing only the command characters, plus a dictionary mapping each bracket to its match.
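A quick check of what parse produces, with the function reproduced so the demo runs standalone:

```python
def parse(program):
    # Same function as above, reproduced for a self-contained demo.
    parsed = []
    bracket_map = {}
    leftstack = []
    pc = 0
    for char in program:
        if char in ('[', ']', '<', '>', '+', '-', ',', '.'):
            parsed.append(char)
            if char == '[':
                leftstack.append(pc)
            elif char == ']':
                left = leftstack.pop()
                bracket_map[left] = pc
                bracket_map[pc] = left
            pc += 1
    return "".join(parsed), bracket_map

program, bracket_map = parse("comment! ++[->+<] more text")
# program keeps only the eight commands: "++[->+<]"
# bracket_map pairs the '[' at index 2 with the ']' at index 7
```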
All that remains is to put the pieces together, and we get a working brainfuck interpreter.
def run(input):
    program, map = parse(input.read())
    mainloop(program, map)

if __name__ == "__main__":
    import sys
    run(open(sys.argv[1], 'r'))
The full code of the interpreter, including the implementation of the square brackets, can be seen in the first example: example1.py.
Now you can try running the interpreter in Python to make sure that it works.
$ python example1.py 99bottles.b
Translating with PyPy
But our goal was not only to write a brainfuck interpreter. What needs to be done in order for PyPy to create a superfast executable file from this code?
In the PyPy sources, in the pypy/translator/goal folder, there are simple examples that come in handy. To get started, take a look at targetnopstandalone.py, a simple hello world for PyPy.
The important thing is that the module contains a function named target that returns the entry point. Translation starts from that entry point.
import os
import sys

def run(fp):
    program_contents = ""
    while True:
        read = os.read(fp, 4096)
        if len(read) == 0:
            break
        program_contents += read
    os.close(fp)
    program, bm = parse(program_contents)
    mainloop(program, bm)

def entry_point(argv):
    try:
        filename = argv[1]
    except IndexError:
        print "You must supply a filename"
        return 1
    run(os.open(filename, os.O_RDONLY, 0777))
    return 0

def target(*args):
    return entry_point, None

if __name__ == "__main__":
    entry_point(sys.argv)
The entry_point function will accept command line arguments when you run the resulting executable.
RPython
Let's talk about RPython. PyPy cannot translate arbitrary Python code, because Python is a little too dynamic. There are restrictions on the standard library and on the syntax you may use, so that PyPy can translate the code. I will not list them all; it is better to look them up in the documentation.
In the example above, several things had to change. Now I have to use low-level file descriptors with the os.open and os.read functions instead of file objects. The implementation of "." and "," is also slightly modified. These are all the changes needed for PyPy to digest the code.
It wasn't too complicated, right? I still use dictionaries, growable lists, and even classes with objects. And if file descriptors are too low-level for you, there are some useful abstractions in the rlib.streamio module that ships with RPython's standard library.
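The low-level reading pattern used by run() above can be exercised in plain Python as well. This is a standalone sketch of mine, written for modern Python where os.read returns bytes (the article's Python 2 code gets str):

```python
import os
import tempfile

def read_all(fd):
    # Read the whole file through a raw descriptor, 4096 bytes at a
    # time, the same pattern the RPython run() uses.
    contents = b""
    while True:
        chunk = os.read(fd, 4096)
        if len(chunk) == 0:
            break
        contents += chunk
    os.close(fd)
    return contents

# Demo: write a tiny brainfuck program to a temp file and read it back.
tmp = tempfile.NamedTemporaryFile(suffix=".b", delete=False)
tmp.write(b"+++.")
tmp.close()
source = read_all(os.open(tmp.name, os.O_RDONLY))
os.remove(tmp.name)
```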
The full code now looks like this: example2.py
Translation
If you have not already done so, clone the latest version of PyPy from the repository on bitbucket.org:
$ hg clone https://bitbucket.org/pypy/pypy
The script to run is pypy/translator/goal/translate.py. It takes the module to be translated as a parameter.
$ python ./pypy/pypy/translator/goal/translate.py example2.py
For faster translation, you can use PyPy instead of Python.
The result is an executable file: the brainfuck interpreter. The repository contains a fractal generator written in brainfuck, which takes about 45 seconds to run on my machine. Try it yourself.
$ ./example2-c mandel.b
Now compare the speed with the same interpreter running under Python:
$ python example2.py mandel.b
So, we have written an interpreter in RPython and translated it with the PyPy toolkit.
Add JIT
Translating RPython to C is cool, but one of PyPy's main features is the ability to generate a just-in-time compiler (JIT) for your interpreter. Given just a few hints about how your interpreter works, PyPy will generate a JIT compiler that translates the interpreted brainfuck code into machine code at runtime.
For this to happen, PyPy must know where the main execution loop of your interpreter begins. This lets it track which instructions of the interpreted brainfuck program are being executed.
We must also describe the structure of the execution loop. Since our language has no stack, we only need to say which variables identify a position in the program code and which belong to its data. These are called green and red variables, respectively.
Back to the second example.
Four variables are used in our main loop: pc, program, bracket_map, and tape. Naturally, pc, program, and bracket_map are green variables: they define the position in the interpreted program. tape is red: it changes as the interpreted program executes.
Let's give PyPy this information. We start by importing the JitDriver class and creating an instance:
from pypy.rlib.jit import JitDriver
jitdriver = JitDriver(greens=['pc', 'program', 'bracket_map'],
                      reds=['tape'])
And add this line to the very beginning of the execution loop:
jitdriver.jit_merge_point(pc=pc, tape=tape, program=program,
                          bracket_map=bracket_map)
We also need to define JitPolicy.
def jitpolicy(driver):
    from pypy.jit.codewriter.policy import JitPolicy
    return JitPolicy()
Full text of the example: example3.py
Now we'll translate the code again, but with the --opt=jit flag:
$ python ./pypy/pypy/translator/goal/translate.py --opt=jit example3.py
Translation will take significantly longer, almost 8 minutes on my machine, and the resulting executable will be much larger. When it finishes, run the fractal generation program again. The difference is huge: about 12 seconds versus 45 for the previous version!
As you can see, the JIT compiler really is replacing interpretation with machine code. The first few lines of the picture appear fairly quickly; then the program warms up and the rest is drawn even faster.
A bit about tracing JIT compilers
It's worth talking about how tracing JIT compilers work in general. Your interpreter runs as usual. When the JIT detects that a loop in the interpreted language (brainfuck) is executed frequently, that loop is marked for tracing. The next time the loop is entered, every executed instruction is logged.
The resulting log is sent to the optimizer, and the result is compiled to machine code, which is used for subsequent runs of the loop.
The machine code is only correct under the conditions that held when it was recorded. Therefore, before it is used, those conditions are checked. If a check fails, execution falls back to the interpreter instead of the machine code.
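The guard-and-fallback idea can be modeled in a few lines of plain Python. This is a toy illustration of the concept, not how PyPy is actually implemented:

```python
class GuardFailure(Exception):
    """Raised when a trace's guard condition no longer holds."""
    pass

def interpret_square(x):
    # The slow, general path: the interpreter.
    return x * x

def compile_trace(observed_x):
    # "Compile" a trace that is only valid for the value observed
    # while tracing; the guard protects that assumption.
    def compiled(x):
        if x != observed_x:               # guard check
            raise GuardFailure()
        return observed_x * observed_x    # body specialized to observed_x
    return compiled

compiled = compile_trace(7)

def execute(x):
    try:
        return compiled(x)            # fast path while the guard holds
    except GuardFailure:
        return interpret_square(x)    # guard failed: back to the interpreter
```

Here execute(7) takes the fast compiled path, while execute(8) trips the guard and falls back to the interpreter; both return the correct square.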
More information can be found on Wikipedia.
Debugging and trace logs
Can we see what JIT does?
Let's add the get_printable_location function, which will be used to output debugging information.
def get_location(pc, program, bracket_map):
    return "%s_%s_%s" % (
        program[:pc], program[pc], program[pc+1:]
    )
jitdriver = JitDriver(greens=['pc', 'program', 'bracket_map'], reds=['tape'],
                      get_printable_location=get_location)
This function takes the green variables and returns a string. We print the brainfuck code with the currently executing instruction surrounded by underscores, so it is easy to spot in the logs.
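A quick check of what the hook produces, with the function reproduced standalone (bracket_map is unused here, but the signature must accept all the green variables):

```python
def get_location(pc, program, bracket_map):
    # Mark the instruction at pc by surrounding it with underscores.
    return "%s_%s_%s" % (
        program[:pc], program[pc], program[pc+1:]
    )

print(get_location(2, "+>[-]<", {}))  # -> +>_[_-]<
```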
The modified example code is in example4.py.
Now run the test program (test.b just prints the letter “A” about 15 times) with the output of the trace logs.
$ PYPYLOG=jit-log-opt:logfile ./example4-c test.b
The logfile file contains logs of all the traces produced and lets you see which instructions were compiled into machine code. It is useful for spotting unnecessary instructions and opportunities for optimization.
Each trace starts with a line like this:
[3c091099e7a4a7] {jit-log-opt-loop
And ends with this line:
[3c091099eae17d jit-log-opt-loop}
Immediately after the trace header is a comment with a serial number and the number of operations. In my case, the first trace looks like this.
1: [3c167c92b9118f] {jit-log-opt-loop
2: # Loop 0 : loop with 26 ops
3: [p0, p1, i2, i3]
4: debug_merge_point('+<[>[_>_+<-]>.[<+>-]<<-]++++++++++.', 0)
5: debug_merge_point('+<[>[>_+_<-]>.[<+>-]<<-]++++++++++.', 0)
6: i4 = getarrayitem_gc(p1, i2, descr=)
7: i6 = int_add(i4, 1)
8: setarrayitem_gc(p1, i2, i6, descr=)
9: debug_merge_point('+<[>[>+_<_-]>.[<+>-]<<-]++++++++++.', 0)
10: debug_merge_point('+<[>[>+<_-_]>.[<+>-]<<-]++++++++++.', 0)
11: i7 = getarrayitem_gc(p1, i3, descr=)
12: i9 = int_sub(i7, 1)
13: setarrayitem_gc(p1, i3, i9, descr=)
14: debug_merge_point('+<[>[>+<-_]_>.[<+>-]<<-]++++++++++.', 0)
15: i10 = int_is_true(i9)
16: guard_true(i10, descr=) [p0]
17: i14 = call(ConstClass(ll_dict_lookup__dicttablePtr_Signed_Signed), ConstPtr(ptr12), 90, 90, descr=)
18: guard_no_exception(, descr=) [i14, p0]
19: i16 = int_and(i14, -9223372036854775808)
20: i17 = int_is_true(i16)
21: guard_false(i17, descr=) [i14, p0]
22: i19 = call(ConstClass(ll_get_value__dicttablePtr_Signed), ConstPtr(ptr12), i14, descr=)
23: guard_no_exception(, descr=) [i19, p0]
24: i21 = int_add(i19, 1)
25: i23 = int_lt(i21, 114)
26: guard_true(i23, descr=) [i21, p0]
27: guard_value(i21, 86, descr=) [i21, p0]
28: debug_merge_point('+<[>[_>_+<-]>.[<+>-]<<-]++++++++++.', 0)
29: jump(p0, p1, i2, i3, descr=)
30: [3c167c92bc6a15] jit-log-opt-loop}
I trimmed the overly long debug_merge_point lines a bit.
This trace takes four parameters: two object pointers (p0 and p1) and two integers (i2 and i3).
The first operator, ">", begins on line 4. It executes without any real operations and looks completely optimized away. This loop always works with the same part of the tape, so the pointer to the current cell stays constant.
Lines 5 through 8 are the "+" operator. First the array element at index i2 is loaded from the pointer p1 (line 6), one is added and the result stored in i6 (line 7), and then the result is written back into the array (line 8).
Line 9 corresponds to the "<" instruction, but it too requires no operations. Apparently i2 and i3 are the two tape positions, computed in advance. It also looks like p1 is the tape array. What p0 is remains unclear.
Lines 10 through 13 execute the "-" operator: they load the array element, subtract one, and store it back.
On line 14 we come to the "]" operator. Lines 15 and 16 check whether i9 is true (i.e., not equal to zero); i9 is the value we just decremented and stored in the tape. Line 16 is the guard check: if the condition does not hold, execution of the machine code is abandoned and a fallback is called, to which one parameter is passed, p0.
If the check passes, lines 17 through 23 retrieve the address of the instruction to jump to from the bracket_map dictionary. I'm not sure exactly what each of these lines does, but it is clear that they contain two external calls and three guards. This is far too wasteful, given that bracket_map never changes, so the lookup will always produce the same jump target. PyPy doesn't know that, but we do, so we can optimize this spot.
Line 24 increments the instruction pointer obtained from bracket_map. Lines 25 and 26 check that it has not gone past the end of the program.
In addition, line 27 checks that the pointer is exactly 86, to make sure the jump really leads back to the start of this very loop.
Finally, line 28 closes the loop, and line 29 jumps back to its start with the parameters p0, p1, i2, i3.
Optimization
As noted, every iteration of this loop performs a dictionary lookup to find the matching bracket. This is terribly inefficient, because the jump target does not change from one iteration to the next.
We need to give the translator one more hint: that a dictionary lookup will always return the same value for the same key.
To do this, we move the dictionary access into a separate function and wrap it with pypy.rlib.jit.purefunction:
@purefunction
def get_matching_bracket(bracket_map, pc):
    return bracket_map[pc]
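Outside of RPython the decorator is just a promise to the JIT, so a no-op stand-in is enough to try the refactored lookup in plain Python (the purefunction here is my stub, not the real pypy.rlib.jit import):

```python
def purefunction(func):
    # Stand-in for pypy.rlib.jit.purefunction: in plain Python it changes
    # nothing; under the RPython JIT it tells the tracer that calls with
    # constant arguments can be constant-folded out of the trace.
    return func

@purefunction
def get_matching_bracket(bracket_map, pc):
    return bracket_map[pc]

bracket_map = {2: 7, 7: 2}  # from parsing "++[->+<]"
```

Because bracket_map is never mutated after parse(), the promise holds: the same (bracket_map, pc) pair always yields the same answer, so the JIT may replace the call with its result.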
This version can be found in example5.py .
Translate this example. The Mandelbrot generator now finishes in 6 seconds instead of 12!
Let's take a look at the new trace log.
1: [3c29fad7b792b0] {jit-log-opt-loop
2: # Loop 0 : loop with 15 ops
3: [p0, p1, i2, i3]
4: debug_merge_point('+<[>[_>_+<-]>.[<+>-]<<-]++++++++++.', 0)
5: debug_merge_point('+<[>[>_+_<-]>.[<+>-]<<-]++++++++++.', 0)
6: i4 = getarrayitem_gc(p1, i2, descr=)
7: i6 = int_add(i4, 1)
8: setarrayitem_gc(p1, i2, i6, descr=)
9: debug_merge_point('+<[>[>+_<_-]>.[<+>-]<<-]++++++++++.', 0)
10: debug_merge_point('+<[>[>+<_-_]>.[<+>-]<<-]++++++++++.', 0)
11: i7 = getarrayitem_gc(p1, i3, descr=)
12: i9 = int_sub(i7, 1)
13: setarrayitem_gc(p1, i3, i9, descr=)
14: debug_merge_point('+<[>[>+<-_]_>.[<+>-]<<-]++++++++++.', 0)
15: i10 = int_is_true(i9)
16: guard_true(i10, descr=) [p0]
17: debug_merge_point('+<[>[_>_+<-]>.[<+>-]<<-]++++++++++.', 0)
18: jump(p0, p1, i2, i3, descr=)
19: [3c29fad7ba32ec] jit-log-opt-loop}
Much better! Each iteration is now just one addition, one subtraction, two array loads, two array stores, and the exit guard. The code no longer has to touch the instruction counter at all.
This optimization was suggested to me by Armin Rigo on the pypy-dev mailing list. Carl Friedrich Bolz has several articles on interpreter optimization that also proved useful.
Conclusion
I hope this article has convinced you that PyPy is more than just a fast Python interpreter.
For those of you who want to learn more about the PyPy JIT compiler, I recommend reading the article Tracing the Meta-Level: PyPy's Tracing JIT Compiler .