Is the Python GIL Really Dead?

Original author: Anthony Shaw



Back in 2003, Intel released the new Pentium 4 "HT" processor. It was clocked at 3 GHz and supported hyper-threading technology.



In the following years, Intel and AMD battled for the best desktop performance by increasing bus speeds, enlarging the L2 cache, and shrinking the die size to minimize latency. In 2004, the 3 GHz HT model was succeeded by the "Prescott" 580 model, clocked up to 4 GHz.



It seemed that the way forward was simply to keep raising the clock speed, but the new processors suffered from high power consumption and heat dissipation.

Does your desktop processor run at 4 GHz today? Unlikely, because the path to better performance ultimately ran through higher bus speeds and more cores. In 2006, the Intel Core 2 replaced the Pentium 4 at a much lower clock speed.

Besides the release of multi-core processors to a wide audience, something else happened in 2006: Python 2.5 finally saw the light of day! It shipped with a beta version of the with statement, which you all know and love.

Python 2.5 had one major limitation when it came to using an Intel Core 2 or AMD Athlon X2.
It was the GIL.

What is the GIL?


The GIL (Global Interpreter Lock) is a boolean value in the Python interpreter, protected by a mutex. The lock is used in the main CPython bytecode evaluation loop to determine which thread is currently executing instructions.

CPython supports multiple threads within a single interpreter, but threads must request access to the GIL in order to perform low-level operations. In turn, this means Python developers can use asynchronous code and multithreading without having to worry about acquiring locks on variables or processes crashing from deadlocks.

GIL simplifies multi-threaded Python programming.
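You can observe the cost of this simplification with a minimal experiment (an illustrative sketch, not from the original article): run a CPU-bound countdown on one thread, then split the same work across two. On a multi-core machine, the two-thread version takes roughly as long as the single-thread one, because only the thread holding the GIL makes progress:

import threading
import time

def countdown(n):
    # Pure-Python, CPU-bound work: the GIL allows only one thread
    # to execute this bytecode at any moment
    while n > 0:
        n -= 1

COUNT = 50_000_000

start = time.perf_counter()
countdown(COUNT)
print(f"one thread:  {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
threads = [threading.Thread(target=countdown, args=(COUNT // 2,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"two threads: {time.perf_counter() - start:.2f}s")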



The GIL also means that while CPython can be multi-threaded, only one thread can execute at any given time. So your quad-core processor does something like this (minus the blue screen, hopefully).

The current version of the GIL was written in 2009 to support asynchronous functions, and it has survived virtually untouched even after many attempts to remove it outright or to relax its requirements.

Every proposal to remove the GIL has had to satisfy one requirement: removing the global interpreter lock must not degrade the performance of single-threaded code. Anyone who tried enabling hyper-threading back in 2003 will understand why.

Ditching the GIL in CPython


If you want truly parallel code in CPython, you have to use multiple processes.

In CPython 2.6, the multiprocessing module was added to the standard library. Multiprocessing wraps the spawning of CPython processes (each with its own GIL):

from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()


Processes are spawned, sent commands via compiled modules and Python functions, and then rejoined to the main process.

Multiprocessing also supports sharing variables through a queue or a pipe. It provides a Lock object for locking objects in the main process and writing from other processes.
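A minimal sketch of those primitives (the Lock here simply demonstrates serialized writes; a Queue is process-safe on its own):

from multiprocessing import Process, Queue, Lock

def worker(q, lock, value):
    # The Lock serializes the critical section across processes
    with lock:
        q.put(value * 2)

if __name__ == '__main__':
    q = Queue()
    lock = Lock()
    procs = [Process(target=worker, args=(q, lock, i)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(sorted(q.get() for _ in range(4)))  # [0, 2, 4, 6]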

Multiprocessing has one major drawback: it carries significant overhead, in both processing time and memory usage. CPython's startup time, even with no-site, is 100-200 ms (check out https://hackernoon.com/which-is-the-fastest-version-of-python-2ae7c61a6b2b to learn more).
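You can get a rough feel for that startup cost with a quick measurement (a sketch; figures vary by machine and Python build):

import subprocess
import sys
import time

# Time a fresh interpreter that does nothing; -S skips `site`
# initialization, which is what "no-site" refers to
for flags in ([], ["-S"]):
    start = time.perf_counter()
    subprocess.run([sys.executable, *flags, "-c", "pass"], check=True)
    elapsed_ms = (time.perf_counter() - start) * 1000
    label = " ".join(flags) or "(default)"
    print(f"python {label}: {elapsed_ms:.0f} ms")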

As a result, you can have parallel code in CPython, but you still need to carefully plan how long-running processes share objects between themselves.

Another alternative may be to use a third-party package such as Twisted.

PEP 554 and the death of the GIL?


So, to recap: threading in CPython is easy, but it isn't true parallelism, while multiprocessing is parallel but carries significant overhead.

What if there were a better way?
The key to bypassing the GIL lies in its name: the global interpreter lock is part of the global interpreter state. CPython processes can have multiple interpreters, and therefore multiple locks, but this feature is rarely used because it is exposed only through the C-API.

One of the features proposed for CPython 3.8 is PEP 554, an implementation of sub-interpreters along with an API: a new interpreters module in the standard library.

This makes it possible to create multiple interpreters from Python within a single process. Another Python 3.8 change is that each interpreter will have its own GIL.



Since interpreter state contains the memory allocation arena, a collection of all pointers to Python objects (local and global), sub-interpreters in PEP 554 cannot access the global variables of other interpreters.

As with multiprocessing, sharing objects between interpreters means serializing them and passing them through a form of IPC (network, disk, or shared memory). There are many ways to serialize objects in Python: the marshal module, the pickle module, or more standardized approaches such as json and simplexml. Each has pros and cons, and all of them impose a computational cost.
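To get a feel for the trade-offs, here is a rough comparison sketch of those serializers on a small payload (timings are illustrative and machine-dependent):

import json
import marshal
import pickle
import timeit

data = {"values": list(range(1000)), "label": "payload"}

for name, dumps in [
    ("marshal", marshal.dumps),  # fast, but CPython-version-specific format
    ("pickle", pickle.dumps),    # general-purpose, handles most Python objects
    ("json", json.dumps),        # interoperable text format, limited types
]:
    per_call_us = timeit.timeit(lambda: dumps(data), number=1000) * 1000
    print(f"{name:8s} {per_call_us:7.1f} µs/call, size {len(dumps(data))}")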

It would be best to have a shared memory space that can be mutated and controlled by a specific process. That way, objects could be sent by the main interpreter and received by another interpreter: a managed memory space of PyObject pointer references, accessible by each interpreter, with the main process controlling the locks.
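Python 3.8 does ship a building block in this spirit for processes: the multiprocessing.shared_memory module. It is not the sub-interpreter API, but it illustrates the shared-buffer idea, with no serialization on either side:

from multiprocessing import Process, shared_memory

def worker(name):
    # Attach to the existing block by name and mutate it in place
    shm = shared_memory.SharedMemory(name=name)
    shm.buf[0] = 42
    shm.close()

if __name__ == '__main__':
    shm = shared_memory.SharedMemory(create=True, size=16)
    p = Process(target=worker, args=(shm.name,))
    p.start()
    p.join()
    print(shm.buf[0])  # 42, written by the other process, no serialization
    shm.close()
    shm.unlink()  # release the block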



An API for this is still being developed, but it will probably look something like this:

import _xxsubinterpreters as interpreters
import threading
import textwrap as tw
import marshal
# Create a sub-interpreter
interpid = interpreters.create()
# If you had a function that generated some data
arry = list(range(0,100))
# Create a channel
channel_id = interpreters.channel_create()
# Pre-populate the interpreter with a module
interpreters.run_string(interpid, "import marshal; import _xxsubinterpreters as interpreters")
# Define a function to run inside the sub-interpreter
def run(interpid, channel_id):
    interpreters.run_string(interpid,
                            tw.dedent("""
        arry_raw = interpreters.channel_recv(channel_id)
        arry = marshal.loads(arry_raw)
        result = [1,2,3,4,5] # where you would do some calculating
        result_raw = marshal.dumps(result)
        interpreters.channel_send(channel_id, result_raw)
        """),
                            shared=dict(
                                channel_id=channel_id
                            ),
                            )
inp = marshal.dumps(arry)
interpreters.channel_send(channel_id, inp)
# Run inside a thread
t = threading.Thread(target=run, args=(interpid, channel_id))
t.start()
# Sub interpreter will process. Feel free to do anything else now.
output = interpreters.channel_recv(channel_id)
interpreters.channel_release(channel_id)
output_arry = marshal.loads(output)
print(output_arry)


This example sends a list of numbers over the channel: the list is serialized with the marshal module, and the sub-interpreter then processes the data (under its own GIL), which is why CPU-bound concurrency problems are ideal candidates for sub-interpreters.

It looks inefficient


The marshal module is really fast, but not as fast as sharing objects directly from memory.

PEP 574 introduces a new pickle protocol (v5), which supports handling memory buffers separately from the rest of the pickle stream. For large data objects, serializing them in one go and deserializing them in the sub-interpreter would add a lot of overhead.
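Out-of-band buffers in protocol 5 look like this (this is the pickle API from PEP 574; PickleBuffer wraps any buffer-backed object so its bytes stay out of the stream):

import pickle

payload = bytearray(b"x" * 10_000_000)

buffers = []
# Wrapping the data in a PickleBuffer lets pickle hand the raw
# buffer to buffer_callback instead of copying it into the stream
stream = pickle.dumps(pickle.PickleBuffer(payload),
                      protocol=5, buffer_callback=buffers.append)
print(len(stream))  # tiny: the stream carries a reference, not the data

# The buffers can travel out-of-band (e.g. via shared memory) and are
# supplied back at load time
restored = pickle.loads(stream, buffers=buffers)
print(restored == payload)  # True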

The new API could be hooked in (purely hypothetically) as follows:

import _xxsubinterpreters as interpreters
import threading
import textwrap as tw
import pickle
# Create a sub-interpreter
interpid = interpreters.create()
# If you had a function that generated a list of data
arry = [5,4,3,2,1]
# Create a channel
channel_id = interpreters.channel_create()
# Pre-populate the interpreter with a module
interpreters.run_string(interpid, "import pickle; import _xxsubinterpreters as interpreters")
buffers=[]
# Define a function to run inside the sub-interpreter
def run(interpid, channel_id):
    interpreters.run_string(interpid,
                            tw.dedent("""
        arry_raw = interpreters.channel_recv(channel_id)
        arry = pickle.loads(arry_raw)
        print(f"Got: {arry}")
        result = arry[::-1]
        result_raw = pickle.dumps(result, protocol=5)
        interpreters.channel_send(channel_id, result_raw)
        """),
                            shared=dict(
                                channel_id=channel_id,
                            ),
                            )
inp = pickle.dumps(arry, protocol=5, buffer_callback=buffers.append)
interpreters.channel_send(channel_id, inp)
# Run inside a thread
t = threading.Thread(target=run, args=(interpid, channel_id))
t.start()
# Sub interpreter will process. Feel free to do anything else now.
output = interpreters.channel_recv(channel_id)
interpreters.channel_release(channel_id)
output_arry = pickle.loads(output)
print(f"Got back: {output_arry}")

It looks like a lot of boilerplate


In essence, this example uses the low-level sub-interpreters API. If you have used the multiprocessing library, some of these problems will look familiar. It is not as simple as threading: you cannot just say, run this function with this list of input data in separate interpreters (for now).

Once this PEP is merged, I think we will see some new APIs on PyPI that build on it.

How much overhead does a sub-interpreter have?


Short answer: more than a thread, less than a process.
Long answer: the interpreter has its own state, so although PEP 554 makes it easy to create sub-interpreters, each one still needs to clone and initialize the following:

  • Modules in the __main__ namespace and importlib;
  • The contents of the sys dict;
  • Built-in functions (print(), assert, etc.);
  • Threads;
  • Core configuration


The core configuration can be cloned easily from memory, but imported modules are not so simple. Importing modules in Python is slow, so if creating a sub-interpreter means importing modules into a separate namespace every time, the benefit shrinks.
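A rough way to see that import cost (a sketch: evicting a module from sys.modules and re-importing it re-runs its top-level code, though cached sub-modules make this an underestimate):

import importlib
import sys
import time

def time_fresh_import(name):
    # Drop the cached module so its top-level code executes again
    sys.modules.pop(name, None)
    start = time.perf_counter()
    importlib.import_module(name)
    return (time.perf_counter() - start) * 1000

for mod in ("json", "decimal", "asyncio"):
    print(f"import {mod}: {time_fresh_import(mod):.2f} ms")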

What about asyncio?


The existing implementation of the asyncio event loop in the standard library creates frames to be evaluated, but shares state within the main interpreter (and therefore shares the GIL).

Once PEP 554 is merged, probably as early as Python 3.9, an alternative event loop implementation could be built (though nobody has done it yet) that runs async methods in sub-interpreters, and therefore in parallel.

Sounds great, sign me up!


Well, not quite.
Because CPython has run on a single interpreter for so long, many parts of the code base use "Runtime State" instead of "Interpreter State", so if PEP 554 were merged in its current form, there would still be many problems.

For example, the garbage collector state (in 3.7 and earlier) belongs to the runtime.

During the PyCon sprints, changes began moving the garbage collector state to the interpreter, so that each sub-interpreter will have its own garbage collector (as it should).

Another problem is that some "global" variables have lingered in the CPython code base, along with many C extensions. So when people suddenly started parallelizing their code correctly, problems appeared.

Another problem is that file handles belong to the process, so if you have a file open for writing in one interpreter, a sub-interpreter will not be able to access that file (without further changes to CPython).

In short, there are still many problems that need to be addressed.

Conclusion: Is the GIL dead?


The GIL will still exist for single-threaded applications. So even once PEP 554 lands, your single-threaded code will not suddenly become parallel.
If you want to write parallel code in Python 3.8 and you have CPU-bound concurrency problems, this could be your ticket to the future!

When?


Pickle v5 and shared memory for multiprocessing will most likely land in Python 3.8 (October 2019), and sub-interpreters will arrive somewhere between 3.8 and 3.9.
If you want to play around with the examples shown here, I have created a separate branch with all the necessary code: https://github.com/tonybaloney/cpython/tree/subinterpreters.

What do you think about this? Share your thoughts in the comments.
