How the GIL works in Python

Original author: David Beazley
  • Translation
Why can your program become almost twice as slow after you parallelize it?
Why does Ctrl-C stop working once you create a thread?
I present a translation of David Beazley's article "Inside the Python GIL". It discusses some of the subtleties of threading and signal handling in Python.


Introduction


As you know, Python uses the Global Interpreter Lock (GIL), which imposes certain restrictions on threads: in particular, a single Python process cannot use multiple processors at the same time. This is a well-worn topic for Python flame wars, along with tail-call optimization, lambdas, whitespace, and so on.

Disclaimer


I am not deeply outraged by the GIL in Python. For parallel computing on multiple CPUs I prefer message passing and interprocess communication over threads anyway. However, I am interested in the unexpected behavior of the GIL on multi-core processors.

Performance test


Consider a trivial CPU-bound function (i.e., a function whose execution speed is limited mainly by the processor):
def count(n):
  while n > 0:
    n -= 1

First, run it twice in turn:
count(100000000)
count(100000000)

Now run it in parallel in two threads:
from threading import Thread

t1 = Thread(target=count, args=(100000000,))
t1.start()
t2 = Thread(target=count, args=(100000000,))
t2.start()
t1.join(); t2.join()

The following results were obtained on a dual-core MacBook:
  • sequential run - 24.6 s
  • two threads in parallel - 45.5 s (almost 2 times slower!)
  • two threads in parallel with one core disabled - 38.0 s
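
These numbers come from the original talk. A minimal sketch along these lines (Python 2 style, like the rest of the article, reusing the count() function shown above) could be used to reproduce the comparison on your own machine:
import time
from threading import Thread

def count(n):              # same CPU-bound function as above
  while n > 0:
    n -= 1

start = time.time()
count(100000000)           # sequential: two calls, one after another
count(100000000)
print "sequential: %.1f s" % (time.time() - start)

start = time.time()
t1 = Thread(target=count, args=(100000000,))
t2 = Thread(target=count, args=(100000000,))
t1.start(); t2.start()     # parallel: the same work in two threads
t1.join(); t2.join()
print "two threads: %.1f s" % (time.time() - start)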

I do not like unexplained magic. As part of a project I started back in May, I began digging into the GIL implementation to understand why I got these results. I worked through every layer, starting with Python scripts and ending with the source code of the pthreads library (yes, maybe I should get out more often). So, let's figure it out step by step.

More about threads


Python threads are real threads (POSIX threads or Windows threads), fully managed by the OS. Let's look at how threading works inside the interpreter process (which is written in C). When a thread is created, it simply executes the run() method of a Thread object, or any callable you pass in:
import time
import threading

class CountdownThread(threading.Thread):
  def __init__(self, count):
    threading.Thread.__init__(self)
    self.count = count

  def run(self):
    while self.count > 0:
      print "Counting down", self.count
      self.count -= 1
      time.sleep(5)
    return
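
For illustration (this usage snippet is not in the original article), such a thread is started like any other Thread object:
t = CountdownThread(10)   # create the thread object
t.start()                 # launches a new OS thread, which calls run()
t.join()                  # wait for the countdown to finish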

In fact, quite a bit more happens. Python creates a small data structure (PyThreadState) that holds: the current stack frame in the Python code, the current recursion depth, the thread ID, and some exception state. The structure takes up less than 100 bytes. Then a new OS thread (a pthread) is started, in which C code calls PyEval_CallObject, which runs whatever Python callable was specified.

The interpreter keeps a pointer to the currently active thread in a global variable, and everything it does depends on that variable:
/* Python/pystate.c */
...
PyThreadState *_PyThreadState_Current = NULL;

Notorious GIL


Here is the catch: at any moment only one Python thread can execute. The global interpreter lock (the GIL) carefully controls thread execution. The GIL guarantees each thread exclusive access to the interpreter's internals (it also makes the corresponding calls into C extensions behave correctly).

The principle of operation is simple. Threads hold the GIL while they run, but release it when they block for I/O. Every time a thread is forced to wait, other threads that are ready to run get their chance.


When dealing with CPU-bound threads that never perform I/O, the interpreter periodically runs a check ("the periodic check") so that other threads can be switched in.


By default this happens every 100 ticks, although the interval can be changed with sys.setcheckinterval(). The check interval is a global counter and is completely independent of the thread scheduling order.
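
A minimal sketch of that knob (the Python 2 era API; the exact effect on thread switching is up to the interpreter and the OS):
import sys

print sys.getcheckinterval()   # 100 by default
sys.setcheckinterval(1000)     # run the periodic check ten times less often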


During the periodic check, signal handlers (if there are any pending signals) are run in the main thread. Then the GIL is released and immediately reacquired. It is at this point that CPU-bound threads can get switched (during the brief release of the GIL, other threads get a chance to run).
/* Python/ceval.c */
...
if (--_Py_Ticker < 0) {
  ...
  _Py_Ticker = _Py_CheckInterval;
  ...
  if (things_to_do) {
    if (Py_MakePendingCalls() < 0) {
      ...
    }
  }
  if (interpreter_lock) {
    /* give another thread a chance */
    ...
    PyThread_release_lock(interpreter_lock);
    /* other threads may run now */
    PyThread_acquire_lock(interpreter_lock, 1);
    ...
  }
}

Ticks roughly correspond to the execution of interpreter instructions. They are not time based . In fact, a long operation can block everything:
>>> nums = xrange(100000000)
>>> -1 in nums # 1 tick (6.6 s)
False
>>>

A tick cannot be interrupted; Ctrl-C will not stop the program in this case.

Signals


Let's talk about Ctrl-C. A very common problem is that a program with multiple threads cannot be stopped with a keyboard interrupt. This is very annoying (you end up reaching for kill -9 in a separate window). (Translator's note: I was able to kill such programs with Ctrl+F4 in the terminal window.) So why doesn't Ctrl-C work?

When a signal arrives, the interpreter runs the check after every tick until the main thread gets to run. Since signal handlers can only run in the main thread, the interpreter keeps releasing and reacquiring the GIL over and over, hoping the main thread will be scheduled.


Thread scheduler


Python has no mechanism of its own for deciding which thread should run next. There are no priorities, no preemptive multitasking, no round-robin scheduling, and so on. That job belongs entirely to the operating system. This is one of the reasons signals behave so strangely: the interpreter has no control over which thread is scheduled, it simply switches threads as often as possible, hoping the main thread will eventually run.

Ctrl-C often does not work in multithreaded programs because the main thread is usually blocked on an uninterruptible thread join or lock acquisition. While it is blocked it never runs, and therefore never gets to execute the signal handler.
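
A common workaround (a sketch of the idea, not something from the original article) is to avoid the uninterruptible blocking join and poll with a timeout instead, so the main thread periodically returns to the interpreter and can run the KeyboardInterrupt handler:
from threading import Thread

def count(n):
  while n > 0:
    n -= 1

t = Thread(target=count, args=(100000000,))
t.start()
# instead of a plain t.join(), wake the main thread up twice a second;
# between the join() calls it executes Python code and can handle Ctrl-C
while t.is_alive():
  t.join(0.5)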

As an added bonus, the interpreter stays in a state where it tries to switch threads after every tick. Not only can you not interrupt the program, it also runs more slowly.

GIL implementation


The GIL is not an ordinary mutex. It is either an unnamed POSIX semaphore or a pthreads condition variable. The interpreter lock works by signaling (a simplified model is sketched after this list):
  • To acquire the GIL, check whether it is free. If it is not, wait for a signal.
  • To release the GIL, free it and send a signal.
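
Roughly, this protocol can be modeled in Python with a condition variable (a simplified illustration only; the real GIL lives in C inside the interpreter, as described above):
import threading

class SimplifiedGIL(object):
  def __init__(self):
    self.locked = False
    self.cond = threading.Condition()

  def acquire(self):
    with self.cond:
      while self.locked:      # the GIL is taken...
        self.cond.wait()      # ...sleep until somebody signals
      self.locked = True

  def release(self):
    with self.cond:
      self.locked = False
      self.cond.notify()      # send a signal to one waiting thread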

Thread switching involves more subtleties than programmers usually realize.


The delay between sending the signal and the thread actually starting can be quite significant; it depends on the operating system, and the OS also takes thread priorities into account. I/O-bound tasks get higher priority than CPU-bound ones. If the signal is sent to a low-priority thread while the processor is busy with more important tasks, that thread will not run for quite a while.

As a result, there is an enormous amount of GIL-related signaling: every 100 ticks the interpreter locks a mutex and signals a condition variable or semaphore on which another thread is always waiting.

Let's measure the number of system calls:
  • sequential execution - 736 (Unix), 117 (Mac)
  • two threads - 1149 (Unix), 3.3 million (Mac)
  • two threads on a dual-core system - 1149 (Unix), 9.5 million (Mac)

On a multi-core system, CPU-bound threads are scheduled simultaneously (on different cores) and then fight over the GIL: a waiting thread may make hundreds of futile attempts to acquire it.

This is a battle between two mutually exclusive goals. Python just wants to run no more than one thread at a time, while the operating system ("Ooooh, lots of cores!") generously switches threads, trying to squeeze the most out of all the cores.

Even a single CPU-bound thread causes problems: it increases the response time of I/O-bound threads.


The final example is a bizarre form of the priority inversion problem: a CPU-bound thread (low priority) blocks the execution of an I/O-bound one (high priority). This happens only on multi-core machines, because the I/O thread cannot wake up fast enough to grab the GIL before the CPU-bound thread does.
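
A rough experiment in the same spirit (a sketch, not from the original talk; time.sleep() stands in for blocking I/O, since it also releases the GIL) measures how late the "I/O" thread wakes up while a CPU-bound thread is spinning:
import time
from threading import Thread

def cpu_bound():
  n = 100000000
  while n > 0:
    n -= 1

def io_bound():
  for _ in range(10):
    before = time.time()
    time.sleep(0.01)                    # "I/O": the GIL is released here
    late = time.time() - before - 0.01  # extra time spent waiting to run again
    print "woke up %.3f s late" % max(late, 0.0)

t1 = Thread(target=cpu_bound)
t2 = Thread(target=io_bound)
t1.start(); t2.start()
t1.join(); t2.join()

On a multi-core machine the delays reported here can be noticeably larger than when io_bound() runs on its own, which is exactly the effect described above.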

Conclusion


The GIL implementation has not changed much over the past 10 years: the corresponding code in Python 1.5.2 looks almost the same as in Python 3.0. I do not know whether the GIL's behavior has been studied well enough (especially on multi-core processors). Removing the GIL altogether would be better than tinkering with it, but it seems to me that this subject deserves further study. If the GIL stays with us, its behavior should be fixed.

How do we get rid of this problem? I have a few vague ideas, but they are all "hard". Python would need its own thread scheduler (or at least a mechanism for cooperating with the OS scheduler). But that requires nontrivial interaction between the interpreter, the OS scheduler, the thread library and, worst of all, the C extension modules.

Is it worth it? Fixing the GIL's behavior would make thread execution (even with the GIL) more predictable and less resource-hungry. It might improve performance and reduce application response time. And hopefully it would avoid a complete rewrite of the interpreter.

Afterword from the translator


The original was a slide deck, so I slightly rearranged the order of the narrative to make the article easier to read. I also left out the interpreter traces; if you are interested, look at the original.

Fellow Habr readers, please suggest interesting English-language articles about Python that would be worth translating. I have a couple in mind, but I would like more options.
