How the GIL works in Ruby. Part 3: Does the GIL make your code thread-safe?

Translation

Translations of the previous two parts:
First part
Second part

This is an article by Jesse Storimer. He teaches the Unix fu workshop, an online class for Ruby developers who want to learn amazing Ruby hacks and improve their server stack development skills. The number of participants is limited, so hurry while seats are still available. He is also the author of the books “Working with Unix Processes”, “Working with TCP Sockets” and “Working with Threads in Ruby”.

There are some misconceptions in the Ruby community about the GIL in MRI. If you want the answer to this article's main question without reading it, here it is: the GIL does not make your Ruby code thread-safe.

But you should not take my word for it.

This series began as an attempt to understand what the GIL is at a technical level. The first part explained how a race condition can arise in the C code that implements MRI. Yet the GIL seemed to prevent it, at least for the Array#<< method we looked at.

The second part confirmed that the GIL does, in fact, make the implementations of MRI's built-in methods atomic. In other words, it rules out race conditions inside them. However, this applies only to the built-in methods of MRI itself, not to your Ruby code. So we were left with the question: "Does the GIL provide any guarantee that your Ruby code will be thread-safe?"

I have already answered this question above. Now I want to put the misconception to rest.

Race conditions revisited

A race condition can occur when some data is shared by several threads that try to work with it at the same time. When this happens without any synchronization, such as a lock, your program may start to do unexpected things, and data may be lost.

Let's take a step back and remember how a race condition can arise. We will use the following Ruby code example for this part of the article:

class Sheep
  def initialize
    @shorn = false
  end

  def shorn?
    @shorn
  end

  def shear!
    puts "shearing..."
    @shorn = true
  end
end

There is nothing new in this class. A sheep is not shorn at birth. The shear! method performs the shearing and marks the sheep as shorn.

sheep = Sheep.new

5.times.map do
  Thread.new do
    unless sheep.shorn?
      sheep.shear!
    end
  end
end.each(&:join)

This code creates a new sheep object and spawns 5 threads. Each of them checks whether the sheep has been shorn and, if not, calls the shear! method.

Here is the result I get by running this code several times on MRI 2.0:

$ ruby check_then_set.rb
shearing...
$ ruby check_then_set.rb
shearing...
shearing...
$ ruby check_then_set.rb
shearing...
shearing...

Sometimes one sheep is sheared twice!

If you were sure that the GIL would let your code “just work” with multiple threads, this should dispel that belief. The GIL gives you no such guarantee. Note that the first run of the script produced the expected result, but the following runs did not. If you keep running this example, you will see a few more variations.

These unexpected results are caused by a race condition in your Ruby code. In fact, this is a common enough error pattern that it has its own name: the “check-then-set race condition”. In it, two or more threads check a value and then set some value based on what they saw. With nothing to ensure atomicity, it is entirely possible that two threads both get through the “check” phase and then both perform the “set” phase.

Recognizing race conditions

Before we look at how to fix this, I want you to understand how to recognize it. I am indebted to @brixen for introducing me to the terminology of interleavings in the context of concurrency. It is really helpful.

Remember that a context switch can occur on any line of your code. Each time execution switches from one thread to another, imagine your program being cut into separate blocks. The resulting sequence of blocks is an interleaving.

At one extreme, a context switch could occur after every line of code! Such an interleaving has one line of code in each block. At the other extreme, there may be no context switch at all inside the thread body. In that case each block contains the full code of one thread. Between these extremes there are many ways your program can be sliced into interleaved blocks.

Some of these interleavings are fine. Not every line of code leads to a race condition. But picturing your program as a set of possible interleavings can help you see when conflicts occur. I will use a series of diagrams to show how this code can be executed by two threads.

To keep the diagrams simple, I replaced the shear! method call with its body.

Consider this diagram. The blocks of thread A are highlighted in red, and those of thread B in blue.

Now let's see how this code can be interleaved by simulating context switches. In the simplest case, neither thread is interrupted during execution, so there is no race condition and we get the expected result. It might look like this:

Now I have arranged the diagram so that you can see the sequential order of events. Remember that the GIL locks around executing code, so two threads cannot truly run in parallel. Events here occur sequentially, from top to bottom.

In this interleaving, thread A does all of its work, and then the scheduler switches context to thread B. Since thread A has already shorn the sheep and updated the state variable, thread B does nothing.

But it is not always that simple. Remember that the scheduler can switch context at any time. This time we were lucky.

Let's look at a nastier example that produces an unexpected result.

In this interleaving, the context switch occurs at exactly the point that causes trouble. Thread A checks the condition and starts shearing. Then the scheduler switches context and thread B starts to run. Although thread A has already shorn the sheep, it has not yet updated the state flag, so thread B knows nothing about it.

Thread B checks the condition, concludes that the sheep is not shorn, and shears it again. After that, the context switches back to thread A, which finishes its execution. Even though thread B has already set the state flag, thread A sets it again, since it simply resumes from where it was interrupted.
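The unlucky interleaving described above can be reproduced on demand. The following is only a sketch under stated assumptions: the two signalling queues (a_started, b_finished) and the shearings log are hypothetical helpers that stand in for the scheduler's context switches, and a plain local flag replaces the Sheep object. Real context switches are not controlled this way; the queues merely force the order A checks, A shears, B checks, B shears, B sets, A sets.

```ruby
require 'thread'

shorn = false
shearings = Queue.new    # thread-safe record of every shearing
a_started = Queue.new    # signal: A has checked and begun shearing
b_finished = Queue.new   # signal: B has run to completion

thread_a = Thread.new do
  unless shorn             # A checks: not shorn yet
    shearings << :a        # A starts shearing
    a_started << true      # stand-in for the context switch to B
    b_finished.pop         # A is paused until B is done
    shorn = true           # A sets the flag, too late
  end
end

thread_b = Thread.new do
  a_started.pop            # B runs only once A is mid-shear
  unless shorn             # B checks: the flag is still false
    shearings << :b        # B shears the same sheep
    shorn = true
  end
  b_finished << true       # stand-in for the switch back to A
end

[thread_a, thread_b].each(&:join)
puts "sheared #{shearings.size} times"
```

Because the order of events is forced, this script records two shearings on every run, which is exactly the outcome that shows up only occasionally in the original example.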

A sheep being shorn twice may not seem like a big deal, but replace the sheep with an invoice and charge a fee for each shearing, and you will get unhappy customers!

I will share one more example to show the non-deterministic nature of these things.

Here we simply added more context switches, so each thread runs a little at a time, several times over. The point is that a context switch is possible on any line of the program. The switches can occur at different points each time the code is executed, so you may get the expected result on one run and an unexpected one on the next.

This is a really good way to think about race conditions. When you write multi-threaded code, think about how the program can be chopped into blocks and consider the effect of the various interleavings. If it looks like some of them lead to incorrect results, you should rethink your approach or introduce synchronization with a mutex.
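To make the idea of "considering the various interleavings" concrete, here is a small sketch that enumerates them for two threads. It is a model, not a measurement: it assumes each thread consists of exactly three steps (check, shear, set) and that a context switch may fall between any two steps. The helper names interleavings and shearings_for are made up for this illustration.

```ruby
STEPS = 3 # check, shear, set

# All orderings of thread A's and B's steps that keep each thread's own order.
def interleavings(a_left, b_left)
  return [[]] if a_left.zero? && b_left.zero?

  result = []
  result += interleavings(a_left - 1, b_left).map { |rest| [:a] + rest } if a_left > 0
  result += interleavings(a_left, b_left - 1).map { |rest| [:b] + rest } if b_left > 0
  result
end

# Replays one schedule and returns how many times the sheep was sheared.
def shearings_for(schedule)
  shorn = false
  count = 0
  pc = { a: 0, b: 0 }               # next step of each thread
  skipped = { a: false, b: false }  # true once a thread saw a shorn sheep

  schedule.each do |thread|
    unless skipped[thread]
      case pc[thread]
      when 0 then skipped[thread] = shorn   # check
      when 1 then count += 1                # shear
      when 2 then shorn = true              # set
      end
    end
    pc[thread] += 1
  end
  count
end

schedules = interleavings(STEPS, STEPS)
unsafe = schedules.select { |schedule| shearings_for(schedule) > 1 }
puts "#{unsafe.size} of #{schedules.size} interleavings shear the sheep twice"
```

In this model only the two schedules in which one thread finishes completely before the other starts are safe. That does not mean the bug appears most of the time in practice, since real schedules are far from equally likely, but it shows how few interleavings the code actually tolerates.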

This is terrible!

Now might seem like the right time to tell you that you can make this code thread-safe by simply adding a mutex. Yes, you really can do that, but I cooked up this example specifically to prove a point: it is a terrible way to write multithreaded code. You should not write code like this.

Any time you have several threads that hold a reference to the same object and modify it, and you have no lock in the right place to keep a context switch from landing in the middle of the modification, you are headed for trouble.
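For reference, this is roughly what "a lock in the right place" would look like for the sheep example. It is a sketch, not a recommendation: CountedSheep and its shear_count counter are hypothetical additions that replace the puts call so the result can be inspected. The important detail is that the check and the set sit inside the same synchronize block.

```ruby
require 'thread'

# Hypothetical variant of Sheep that counts shearings instead of printing.
class CountedSheep
  attr_reader :shear_count

  def initialize
    @shorn = false
    @shear_count = 0
  end

  def shorn?
    @shorn
  end

  def shear!
    @shear_count += 1
    @shorn = true
  end
end

sheep = CountedSheep.new
lock = Mutex.new

5.times.map do
  Thread.new do
    # Check and set happen while holding the lock, so no other thread
    # can run its own check between them.
    lock.synchronize do
      sheep.shear! unless sheep.shorn?
    end
  end
end.each(&:join)

puts "sheared #{sheep.shear_count} time(s)"
```

Locking only around shear!, or only around shorn?, would leave the race in place, which is one reason this style is easy to get wrong.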

However, you can avoid the race condition without explicit locks in your code. Here is one solution using a queue:

require 'thread'

class Sheep
  # ...
end

sheep = Sheep.new
sheep_queue = Queue.new
sheep_queue << sheep

5.times.map do
  Thread.new do
    begin
      sheep = sheep_queue.pop(true)
      sheep.shear!
    rescue ThreadError
      # raised by Queue#pop in the threads
      # that don't pop the sheep
    end
  end
end.each(&:join)

I left out the implementation of the Sheep class, since it is exactly the same as before. Now, instead of the threads sharing one sheep and racing to shear it, a queue provides the synchronization.

If you run this code on MRI, or on any of the truly parallel Ruby implementations, it will produce the expected result every time. We eliminated the race condition in this code. Even though all the threads may call Queue#pop at more or less the same time, that method uses a mutex internally so that only one thread at a time can receive the sheep.

Once that one thread gets the sheep, the race condition disappears. There is just one thread, and nobody left for it to compete with!

The reason I suggest a queue instead of a lock is that a queue is harder to use incorrectly. Locks, as you know, are easy to get wrong. Used incorrectly, they bring new problems such as deadlocks and performance degradation. Using a data structure is like using an abstraction: it constrains the tricky parts and gives you a simpler API.

Lazy initialization

I will just quickly point out that lazy initialization is another form of the “check-then-set race condition”. The ||= operator expands to:

@logger ||= Logger.new

# expands to

if @logger == nil
  @logger = Logger.new
end
@logger

Look at the expanded version and think about where the problem might occur. With multiple threads and no synchronization, it is quite possible for @logger to be initialized more than once. Initializing @logger twice may not be a problem in this case, but I have seen bugs like this cause problems in real code.
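One possible way to guard lazy initialization is to put the ||= inside a mutex. This is a minimal sketch under assumptions: the Service class and its @logger_lock are hypothetical names, and Logger is taken to be the standard library logger writing to $stderr. The lock itself is created eagerly in initialize, because creating it lazily would reintroduce the same race.

```ruby
require 'logger'

# Hypothetical class that lazily creates its logger.
class Service
  def initialize
    @logger_lock = Mutex.new   # created eagerly, before any threads use it
    @logger = nil
  end

  def logger
    # The nil check and the assignment run under the same lock,
    # so only one thread can ever create the logger.
    @logger_lock.synchronize do
      @logger ||= Logger.new($stderr)
    end
  end
end

service = Service.new
loggers = 10.times.map { Thread.new { service.logger } }.map(&:value)
puts "distinct loggers: #{loggers.uniq.size}"
```

When the object is cheap to build, simply creating it in initialize avoids the question altogether.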

Reflections

In closing, I want to leave you with a few lessons.

4 out of 5 programmers will agree that multi-threaded programming is hard to get right.

In the end, all the GIL guarantees is that the method implementations built into MRI are atomic (and even that has caveats). This behavior can sometimes help us, but the GIL is really there to protect MRI's own internals, not to serve as a reliable API for Ruby developers.

So the GIL will not solve thread-safety issues. As I said, writing multithreaded programs correctly is hard, but we solve hard problems every day. One way we deal with a hard problem is abstraction.

For example, when I need to make an HTTP request in my code, I need a socket. But I usually do not use one directly, as that is cumbersome and error-prone. Instead, I use an abstraction. An HTTP client provides a more constrained and simpler API, hiding the socket work and saving me from unnecessary mistakes.

If multithreading is hard to get right, then maybe you should not be using threads directly.

“If you added a new thread to your program, then you probably added 5 new bugs.” Mike Perham

We are seeing more and more abstractions around threads. One approach that has caught on in the Ruby community is the actor model, whose most popular implementation is Celluloid. It gives us a great abstraction that ties concurrency primitives to the Ruby object model. Celluloid does not guarantee that your code will be thread-safe or free of race conditions, but it embodies best practices in this area. I urge you to give it a try.

The issues we are talking about are not specific to Ruby or MRI. They are the reality of multi-core programming. The number of cores in our devices keeps growing, and MRI has yet to come up with an answer to that. Despite the guarantees it provides, using a GIL for multi-threaded programming seems wrong. This is part of MRI's growing pains. Other implementations, such as JRuby and Rubinius, run threads truly in parallel and have no GIL.

We see many new languages with concurrency abstractions built in. Ruby has none, at least not yet. Another advantage of abstractions is that their implementation can improve while your code stays unchanged. For example, if the queue implementation stops relying on locks, your code will reap the benefits without any changes.

For now, Ruby programmers have to learn to solve these problems themselves! Learn about concurrency. Know what causes race conditions. Think of your code as interleavings; it will help you reason through problems.

Lastly, I will leave you with a quote that captures much of today's work on concurrency:

“Don't communicate by sharing state; share state by communicating.”

Using a data structure for synchronization supports this idea. So does the actor model. It lies at the heart of concurrency in languages such as Go, Erlang, and others.
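As a rough illustration of sharing state by communicating, the sketch below gives the state to a single owner thread and lets the other threads talk to it only through a Queue. The names requests, shearer and the :shear and :stop messages are invented for this example; it is a hand-rolled approximation of the idea, not how Celluloid or any actor library is implemented.

```ruby
require 'thread'

requests = Queue.new
shorn = false
shear_count = 0

# The shearer is the only thread that ever reads or writes the state.
shearer = Thread.new do
  while (message = requests.pop) != :stop
    next if shorn          # later requests find the sheep already shorn
    shear_count += 1
    shorn = true
  end
end

# Worker threads never touch the state; they only send messages.
5.times.map { Thread.new { requests << :shear } }.each(&:join)

requests << :stop
shearer.join
puts "sheared #{shear_count} time(s)"
```

Since the check and the set both happen on one thread, there is no interleaving of them to worry about, no matter how the senders are scheduled.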

Ruby needs to look at what works in other languages and bring it into the fold. As a Ruby developer, you can do something about this today: just try out and support one of these approaches. With more people on board, these approaches could become the new standard for Ruby.

Thanks to Brian Shirai for reviewing a draft of this article.
