Writing an x86-64 JIT compiler from scratch in stock Python

Original author: Christian Stigen Larsen
In this article, I will show how to write a rudimentary, native x86-64 just-in-time compiler (JIT) in CPython using only built-in modules.

The code targets UNIX systems such as macOS and Linux, but it should be straightforward to port to other systems such as Windows. All of the code is published at github.com/cslarsen/minijit.

The goal is to generate new versions of the assembly code below at runtime and execute them.

48 b8 ed ef be ad de  movabs $0xdeadbeefed, %rax
00 00 00
48 0f af c7           imul   %rdi,%rax
c3                    retq

Essentially, we will deal with the left-hand side of the listing: the byte sequence 48 b8 ed ... and so on. Those 15 bytes of machine code make up an x86-64 function that multiplies its argument by the constant 0xdeadbeefed. The JIT step will create functions with different such constants. This contrived form of specialization is enough to demonstrate the basic mechanics of JIT compilation.

The main strategy is to use the built-in Python module ctypes to load the standard C library. From there we get access to the system functions for interacting with the virtual memory manager. We use mmap to get a block of memory aligned on a page boundary. The alignment is required for the code to be executable. That is why we cannot use the usual malloc function, since it may return memory that crosses page boundaries.

We use the function mprotect to mark the memory block as read-only and executable. After that, we should be able to call our freshly compiled block of code through ctypes.

Template part

Before doing anything, you need to load the standard C library.

import ctypes
import sys

if sys.platform.startswith("darwin"):
    # use_errno=True is needed later for ctypes.get_errno() to work
    libc = ctypes.CDLL("libc.dylib", use_errno=True)
    # ...
elif sys.platform.startswith("linux"):
    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    # ...
else:
    raise RuntimeError("Unsupported platform")

There are other ways to do this, for example:

>>> import ctypes
>>> import ctypes.util
>>> libc = ctypes.CDLL(ctypes.util.find_library("c"))
>>> libc

To determine the page size, we call sysconf(_SC_PAGESIZE). The constant _SC_PAGESIZE is 29 on macOS but 30 on Linux. We simply hardcode them in our program. You can find the values by digging through the system header files or by writing a small C program that prints them. A more robust and elegant solution would be to use the cffi module instead of ctypes, because it can automatically parse header files. However, since the goal is to stick to the standard CPython distribution, we will continue with ctypes.
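If you only need the page size itself, there is no need to hardcode anything: the CPython standard library can already report it. This small cross-check sketch uses only built-in modules:

```python
# Cross-check the hardcoded _SC_PAGESIZE value without reading header files:
# CPython's standard library already knows the page size.
import mmap
import os

print(mmap.PAGESIZE)               # page size baked in at interpreter build time
print(os.sysconf("SC_PAGE_SIZE"))  # page size obtained via sysconf(3)
```

Both values should agree on the same machine; we still call sysconf through ctypes below because we need libc loaded for mmap and mprotect anyway.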

We need a few more constants for mmap and friends. They are listed below. For other UNIX variants you may have to look the values up yourself.

import ctypes
import sys

if sys.platform.startswith("darwin"):
    # use_errno=True is needed later for ctypes.get_errno() to work
    libc = ctypes.CDLL("libc.dylib", use_errno=True)
    _SC_PAGESIZE = 29
    MAP_ANONYMOUS = 0x1000
    MAP_PRIVATE = 0x0002
    PROT_EXEC = 0x04
    PROT_NONE = 0x00
    PROT_READ = 0x01
    PROT_WRITE = 0x02
    MAP_FAILED = -1 # voidptr actually
elif sys.platform.startswith("linux"):
    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    _SC_PAGESIZE = 30
    MAP_ANONYMOUS = 0x20
    MAP_PRIVATE = 0x0002
    PROT_EXEC = 0x04
    PROT_NONE = 0x00
    PROT_READ = 0x01
    PROT_WRITE = 0x02
    MAP_FAILED = -1 # voidptr actually
else:
    raise RuntimeError("Unsupported platform")

Although it is not strictly required, it is very useful to tell ctypes the types of the functions we are going to use. That way, exceptions will be raised if we mix up types. For example:

# Set up sysconf
sysconf = libc.sysconf
sysconf.argtypes = [ctypes.c_int]
sysconf.restype = ctypes.c_long

This tells ctypes that sysconf takes a four-byte integer and returns a long integer. After that, you can get the current page size like this:
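To see the benefit, here is a small sketch of how declared argtypes turn a type mix-up into an immediate Python exception (it loads libc via the portable find_library helper rather than the platform test above):

```python
import ctypes
import ctypes.util

# Load libc portably and declare the sysconf prototype
libc = ctypes.CDLL(ctypes.util.find_library("c"))
sysconf = libc.sysconf
sysconf.argtypes = [ctypes.c_int]
sysconf.restype = ctypes.c_long

try:
    sysconf("not an int")  # wrong argument type: a str instead of c_int
except ctypes.ArgumentError as error:
    print("caught:", error)  # ctypes rejects the call before it reaches C
```

Without argtypes, ctypes would instead marshal the argument with default rules, which can silently pass garbage to the C function.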

pagesize = sysconf(_SC_PAGESIZE)

The machine code we are going to generate will be treated as unsigned 8-bit bytes, so we declare an unsigned 8-bit pointer type for it:

# 8-bit unsigned pointer type
c_uint8_p = ctypes.POINTER(ctypes.c_uint8)

The remaining function types we will use are declared below. For error reporting, it is good to have strerror available. We use munmap to destroy the machine-code block when we are done with it, so that the operating system can reuse the memory.

strerror = libc.strerror
strerror.argtypes = [ctypes.c_int]
strerror.restype = ctypes.c_char_p
mmap = libc.mmap
mmap.argtypes = [ctypes.c_void_p,
                 ctypes.c_size_t,
                 ctypes.c_int,
                 ctypes.c_int,
                 ctypes.c_int,
                 # Below is actually off_t, which is 64-bit on macOS
                 ctypes.c_int64]
mmap.restype = c_uint8_p
munmap = libc.munmap
munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
munmap.restype = ctypes.c_int
mprotect = libc.mprotect
mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
mprotect.restype = ctypes.c_int

At this stage, it is hard to justify using Python over C: in C we would not need any of the boilerplate above. But Python will give us much more freedom to experiment later.

Now we are ready to write a wrapper around mmap.

def create_block(size):
    # fd must be -1 for anonymous mappings on some systems (e.g. macOS)
    ptr = mmap(None, size, PROT_WRITE | PROT_READ,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
    if ptr == MAP_FAILED:
        raise RuntimeError(strerror(ctypes.get_errno()))
    return ptr

This function uses mmap to allocate page-aligned memory. We mark the region as readable and writable with the PROT flags, and as private and anonymous with the MAP flags. The latter means that other processes cannot see this region of memory and that it is not backed by a file. The Linux mmap manual covers the topic in more detail (just be sure to read the manual for your own system). If the mmap call fails, we raise a Python error.
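For comparison, CPython's built-in mmap module can create an anonymous, private mapping too. A minimal sketch (the flags below are available on Linux and macOS) shows the same MAP_PRIVATE | MAP_ANONYMOUS combination without any ctypes:

```python
import mmap

# An anonymous, private, read/write mapping of one page (fd = -1 means
# no file backing, just like MAP_ANONYMOUS in the ctypes version).
buf = mmap.mmap(-1, 4096,
                flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                prot=mmap.PROT_READ | mmap.PROT_WRITE)
buf[0] = 0xC3       # write a byte (the retq opcode, as it happens)
print(hex(buf[0]))  # read it back
buf.close()
```

We still go through ctypes in this article because we need the raw pointer for mprotect and for casting the block to a callable function.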

To mark memory as executable:

def make_executable(block, size):
    if mprotect(block, size, PROT_READ | PROT_EXEC) != 0:
        raise RuntimeError(strerror(ctypes.get_errno()))

With this mprotect call, we mark the region as readable and executable. If we wanted, we could also make it writable, but some systems will refuse to execute code in memory that is open for writing. This is sometimes called the W^X security feature.

To destroy a memory block, do the following:

def destroy_block(block, size):
    if munmap(block, size) == -1:
        raise RuntimeError(strerror(ctypes.get_errno()))

The fun part

Now we are finally ready to write insanely simple JIT code!

Recall the assembly listing at the beginning of the article: it is a small function (with no local stack frame) that multiplies its input by a constant. In Python, we would write it like this:

def create_multiplication_function(constant):
    return lambda n: n * constant
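For reference, the pure-Python equivalent behaves like this (the definition is repeated so the snippet is self-contained):

```python
def create_multiplication_function(constant):
    # Returns a closure that multiplies its argument by the captured constant
    return lambda n: n * constant

mul101 = create_multiplication_function(101)
print(mul101(8))  # -> 808
```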

This is a truly contrived example, but it does meet the definition of a JIT: we really do create native code at runtime and execute it. It is easy to imagine more advanced examples, such as JIT-compiling Brainfuck to x86-64 native code, or using AVX instructions for lightning-fast vectorized math.

The disassembly of machine code at the beginning of the article was actually done by compiling and disassembling the following C code:

#include <stdint.h>

uint64_t multiply(uint64_t n)
{
  return n*0xdeadbeefedULL;
}

If you want to compile it yourself, use something like

$ gcc -Os -fPIC -shared -fomit-frame-pointer \
    -march=native multiply.c -olibmultiply.so

Here I optimized for size (-Os) to generate minimal machine code, made it position-independent (-fPIC) to avoid jumps to absolute addresses, omitted frame pointers (-fomit-frame-pointer) to strip unnecessary stack-setup code (although more advanced functions may need it), and used the native instruction set of the current processor (-march=native).

We could have passed -S to get an assembly listing, but we are interested in the machine code, so instead we use a tool like objdump:

$ objdump -d libmultiply.so
0000000000000f71 <_multiply>:
 f71:   48 b8 ed ef be ad de    movabs $0xdeadbeefed,%rax
 f78:   00 00 00 
 f7b:   48 0f af c7             imul   %rdi,%rax
 f7f:   c3                      retq

If you are not very familiar with assembly, here is how the function works. First, movabs simply puts an immediate value into the RAX register. "Immediate" is assembler jargon for something encoded directly in the machine code; in other words, it is an inline argument to the movabs instruction. So RAX now holds the constant 0xdeadbeefed.

Also, by the AMD64 calling convention, the first integer argument is passed in the RDI register and the return value in RAX. So RDI holds the multiplier. That is exactly what the imul instruction does: it multiplies RAX by RDI, putting the result in RAX. Finally, retq pops a 64-bit return address off the stack and jumps to it. At this level, it is easy to imagine how continuation-passing style could be implemented.

Note that the constant 0xdeadbeefed must be written in reverse byte order (little-endian). We have to remember to do the same in our code.
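Python can do this conversion for us. A quick sketch with int.to_bytes reproduces the exact byte sequence we saw in the disassembly:

```python
# Encode the constant as 8 little-endian bytes, exactly as the
# movabs immediate appears in the machine code.
constant = 0xdeadbeefed
encoded = constant.to_bytes(8, byteorder="little")
print(" ".join("%02x" % byte for byte in encoded))  # ed ef be ad de 00 00 00
```

The function below does the same thing by hand, with explicit masks and shifts, which makes the encoding easier to follow byte by byte.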

Now we are ready to put everything in a Python function.

def make_multiplier(block, multiplier):
    # Encoding of: movabs <multiplier>, rax
    block[0] = 0x48
    block[1] = 0xb8
    # Little-endian encoding of multiplication constant
    block[2] = (multiplier & 0x00000000000000ff) >>  0
    block[3] = (multiplier & 0x000000000000ff00) >>  8
    block[4] = (multiplier & 0x0000000000ff0000) >> 16
    block[5] = (multiplier & 0x00000000ff000000) >> 24
    block[6] = (multiplier & 0x000000ff00000000) >> 32
    block[7] = (multiplier & 0x0000ff0000000000) >> 40
    block[8] = (multiplier & 0x00ff000000000000) >> 48
    block[9] = (multiplier & 0xff00000000000000) >> 56
    # Encoding of: imul rdi, rax
    block[10] = 0x48
    block[11] = 0x0f
    block[12] = 0xaf
    block[13] = 0xc7
    # Encoding of: retq
    block[14] = 0xc3
    # Return a ctypes function prototype with the right signature:
    # one uint64 argument, one uint64 return value
    return ctypes.CFUNCTYPE(ctypes.c_uint64, ctypes.c_uint64)

At the bottom, we return a ctypes function prototype matching this code. The placement is somewhat arbitrary, but I thought it would be nice to keep it next to the machine code.

Final part

Now we have all the main parts and can combine them. First, allocate one page of memory:

pagesize = sysconf(_SC_PAGESIZE)
block = create_block(pagesize)

Then we generate the machine code. As the multiplier, we choose the number 101.

mul101_signature = make_multiplier(block, 101)

Now mark the memory location as executable and read-only:

make_executable(block, pagesize)

We take the address of the first byte in the memory block and cast it to a callable ctypes function with the correct type:

address = ctypes.cast(block, ctypes.c_void_p).value
mul101 = mul101_signature(address)

To get the address of the block, we use ctypes to cast it to a void pointer and read its value. Finally, we instantiate an actual function from that address with the mul101_signature constructor.

Voila! Now we have a piece of native code that can be called from Python. If you are in a REPL environment, you can do this directly:

>>> print(mul101(8))
808

Note that this little multiplication function is slower than the equivalent native Python calculation. That is mainly because of ctypes, a foreign function library that carries a lot of call overhead: every time you call the function, it has to check which dynamic types you are passing, unpack and convert them, and then do the same with the return value. So hand-written assembly only pays off if you have access to some new Intel instructions, or if you compile something like Brainfuck to native code.
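The overhead is easy to observe. This sketch (assuming libc is reachable via find_library) times a trivial C call through ctypes against the equivalent pure-Python builtin; the exact numbers vary by machine, but the ctypes call is typically many times slower:

```python
import ctypes
import ctypes.util
import timeit

# Declare the prototype of labs(3), a trivial C function from libc
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.labs.argtypes = [ctypes.c_long]
libc.labs.restype = ctypes.c_long

# The same trivial operation, via ctypes and via a Python builtin
t_ctypes = timeit.timeit(lambda: libc.labs(-42), number=100_000)
t_python = timeit.timeit(lambda: abs(-42), number=100_000)
print("ctypes: %.4fs  python: %.4fs" % (t_ctypes, t_python))
```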

Finally, if you want, you can let the system reclaim the memory where the function lives. Keep in mind that after doing so, the process will probably crash if you try to call the code again. So it is probably best to also delete all Python references to it:

destroy_block(block, pagesize)
del block
del mul101

If you run the code in its full form from the GitHub repository , then you can specify the constant for multiplication directly on the command line:

$ python mj.py 101
Pagesize: 4096
Allocating one page of memory
JIT-compiling a native mul-function w/arg 101
Making function block executable
Testing function
OK mul(0) = 0
OK mul(1) = 101
OK mul(2) = 202
OK mul(3) = 303
OK mul(4) = 404
OK mul(5) = 505
OK mul(6) = 606
OK mul(7) = 707
OK mul(8) = 808
OK mul(9) = 909
Deallocating function

Debugging the JIT code

If you want to keep exploring with this little program, you will soon want to disassemble the generated machine code. You can just use gdb or lldb, but you need to know where to start. A simple trick is to print the hex value of the block's address and then wait for a key press:

print("address: 0x%x" % address)
print("Press ENTER to continue")
input()

Then run the program in a debugger and, while it is paused, disassemble that memory region. Of course, you can also single-step through the assembly if you want to see what is happening. A sample lldb session:

$ lldb python
(lldb) run mj.py 101
(lldb) c
Process 19329 resuming
address 0x1002fd000
Press ENTER to continue

At this point, press CTRL+C to drop back into the debugger, then disassemble the code at that address:

(lldb) x/3i 0x1002fd000
    0x1002fd000: 48 b8 65 00 00 00 00 00 00 00  movabsq $0x65, %rax
    0x1002fd00a: 48 0f af c7                    imulq  %rdi, %rax
    0x1002fd00e: c3                             retq

Note that 0x65 is 101 in decimal, i.e. the command-line argument we passed above.

If you need a pure-Python disassembler, I recommend the Capstone module.

What's next?

A good exercise would be to JIT-compile Brainfuck programs to native code. If you want to jump right in, I have published a GitHub repository at github.com/cslarsen/brainfuck-jit. I even made a presentation for it on Speaker Deck. It shows JIT compilation and optimizations, but it uses GNU Lightning to compile native code instead of the approach in this article. Still, it should be exceptionally easy to adapt the examples without GNU Lightning, or to emit your own code. An interesting observation from the Brainfuck project: if you just JIT every Brainfuck instruction one by one, you do not get much of a speedup, even in native code. The whole performance gain comes from the code optimization stage, where runs of integer operations are collapsed into one or a few x86 instructions. Another candidate for this kind of compilation is Forth.

Before seriously extending this JIT compiler, take a look at the PeachPy project. It is far more advanced than ours: it includes a disassembler and appears to support the entire x86-64 instruction set, up to and including AVX.

As mentioned above, calling functions through ctypes carries a lot of overhead. The cffi module can eliminate some of it, but the fact remains: if you want to call very small JIT-compiled functions many times, it is usually faster to just do the work in pure Python.

What other notable use cases are there? I have seen some Python math libraries that switch to vector operations for performance, but I can imagine other things: tools for compressing and decompressing native code, access to virtualization primitives, and so on. I know some BPF tools and regex modules also JIT-compile queries for faster data processing.

What is interesting about this exercise is the territory it opens up beyond plain assembly. For example, it highlights how different instructions can disassemble to the same mnemonics. The RETQ instruction has a different opcode from the ordinary RET instruction, because it operates on 64-bit values. This may not matter when programming in assembly, since such details are not always significant, but the difference is worth remembering. I have seen gcc, lldb, and objdump produce slightly different disassembly listings of RETQ and MOVABSQ instructions for the same code.

One more remark. I mentioned that the naive Brainfuck compiler I made did not initially produce very fast code; it needed an optimization pass. That is, code does not automatically get faster just because you have AVX, CUDA, or anything else. The harsh truth is that the gcc compiler contains a large catalogue of optimizations that are nearly impossible to reproduce by hand. For more on this, I recommend Felix von Leitner's classic lecture on source-code optimization.

What about the actual compilation?

Several people commented that they expected a more detailed treatment of the stage where the compilation actually happens. That is a fair point. As I said, this is a very limited form of compilation, where we do practically nothing with the code at runtime except patch in a constant. Maybe I will write a follow-up article focusing purely on the compilation stage. Stay tuned!
