But how does multithreading work? Part II: memory ordering


    The knowledge about flow control that we gained in the previous installment is great, of course, but plenty of questions remain. For example: “How does happens-before work?”, “Is it true that volatile means flushing caches?”, “Why did we even need a memory model? Everything was fine, why did all this start?”

    Like the previous article, this one follows the principle of “first we briefly describe what should happen in theory, and then we go to the sources and see how it actually happens there.” The first part is thus largely applicable not only to Java, so developers on other platforms may find something useful for themselves too.


    Theoretical minimum

    The ever-increasing performance of hardware does not come out of nowhere. Engineers who design, say, processors invent many different optimizations that squeeze even more performance out of your code. However, there is no such thing as free performance, and the price here is the possible counterintuitiveness of how your code executes. A great many hardware features are hidden from us by abstractions. To those who haven't done so yet, I recommend watching Sergey Walrus Kuksenko's talk “Quantum Performance Effects”, which perfectly demonstrates how leaky your abstractions can turn out to be. We won't go far for an example; let's take a look at caches.

    How caches work


    A request to “main memory” is an expensive operation that can take hundreds of nanoseconds even on modern machines. In that time the processor could have executed a great many instructions. To avoid the misery of standing idle forever, caches are used. Simply put, the processor keeps right next to itself copies of frequently used contents of main memory. More complex words about the different kinds of caches and their hierarchies can be found here , but we are more interested in how the freshness of the data in the cache is guaranteed. With a single processor (or core; from now on “processor” will be used for both) there is obviously no problem, but with several cores (YAY MULTITHREADING!) questions start to arise.
    How can processor A know that processor B has changed some value if A has it cached?

    Or, in other words, how to ensure cache coherency?

    In order for different processors to have a consistent picture of the world, they must somehow communicate with each other. The rules they follow in this communication are called a cache coherence protocol.

    Cache Coherence Protocols


    There are many different protocols; they vary not only from one hardware vendor to another but are constantly evolving even within a single vendor. Still, despite the vastness of the protocol world, most of them share common ground, and so, somewhat sacrificing generality, we will consider the MESI protocol . There are, of course, approaches fundamentally different from it, such as Directory Based ones, but those are not covered in this article.

    In MESI, each cell in the cache can be in one of four states:
    • Invalid: the cell holds no valid value
    • Exclusive: the value is only in this cache, and it has not yet been changed
    • Modified: the value has been changed by this processor, and it is not yet in main memory or in any other processor's cache
    • Shared: the value is present in the caches of more than one processor


    Transitions between states happen via an exchange of messages, whose format is also part of the protocol. Rather ironic, by the way, that at such a low level state changes occur precisely through message passing. Problem, actor model haters?
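    To make the state machine a bit more tangible, here is a toy single-threaded sketch of what a write to a Shared line involves. This is my own simplification for illustration only, not a model of any real hardware or of the actual protocol messages:

    enum CacheLineState { INVALID, EXCLUSIVE, MODIFIED, SHARED }

    class ToyCacheLine {
        CacheLineState state = CacheLineState.SHARED;
        long value;

        // Writing to a Shared line: all other copies must be invalidated first.
        void write(long newValue, Iterable<ToyCacheLine> otherCopies) {
            if (state == CacheLineState.SHARED) {
                for (ToyCacheLine other : otherCopies) {
                    // In real hardware this is an Invalidate message plus a wait
                    // for Invalidate Acknowledge; here it is just a direct call.
                    other.state = CacheLineState.INVALID;
                }
            }
            value = newValue;
            state = CacheLineState.MODIFIED; // we now hold the only up-to-date copy
        }
    }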

    To keep the article shorter and to nudge the reader toward independent study, I will not describe the messaging in detail. Those interested can find this information, for example, in the excellent article Memory Barriers: a Hardware View for Software Hackers . Traditionally, deeper thoughts on the topic from cheremin can be found on his blog .

    Having cleverly skipped the description of the messages themselves, we will make two remarks about them. First, messages are not delivered instantly, so a state change incurs latency. Second, some messages require special processing, causing the processor to stall. Both lead to all sorts of scalability and performance issues.

    Optimizations for MESI and the problems they cause


    Store buffers


    To write something to a memory cell that is in the Shared state, you must send an Invalidate message and wait for everyone to acknowledge it. All that time the processor stands idle, which is incredibly sad, since the time for a message to arrive is typically several orders of magnitude longer than a simple instruction takes. To avoid such a senseless and merciless waste of processor time, Store Buffers were invented. The processor places the values it wants to write into this buffer and goes on executing instructions. When the necessary Invalidate Acknowledge messages are finally received, the data is at last written out.

    Of course, there are a couple of hidden rakes to step on here. The first is quite obvious: if the same processor tries to read a value before it has left the buffer, it will not see what it just wrote. This is solved by Store Forwarding: every load first checks the store buffer, and if the requested value is there, it is taken from there.
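    In code, that logic would look roughly like this. Again, a toy sketch of my own, heavily simplified; a real store buffer is a small fixed-size hardware queue, not a map:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // A toy model of a store buffer with store forwarding.
    class ToyStoreBuffer {
        // Pending stores in issue order: address -> value.
        private final Map<Long, Long> pending = new LinkedHashMap<>();

        // A store does not wait for Invalidate Acknowledge: it is just buffered.
        void store(long address, long value) {
            pending.put(address, value);
        }

        // Store forwarding: a load first checks the buffer, then goes to the cache.
        long load(long address, Cache cache) {
            Long buffered = pending.get(address);
            return (buffered != null) ? buffered : cache.read(address);
        }

        interface Cache {
            long read(long address);
        }
    }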

    But the second rake is far more interesting: nobody guarantees that values placed into the store buffer in one order will be written out in the same order. Consider the following piece of pseudocode:

    void executedOnCpu0() {
        value = 10;         // may linger in the store buffer...
        finished = true;    // ...and reach memory after this store does
    }
    void executedOnCpu1() {
        while(!finished);   // wait until the flag is raised
        assert value == 10; // may fail!
    }
    

    It would seem, what could go wrong here? Contrary to expectations, quite a lot. For example, if at the start of execution finished is in the Exclusive state on Cpu0 while value is, say, Invalid, then value will leave the store buffer later than finished. And it is entirely possible that Cpu1 will read finished as true while value turns out to be not equal to 10. This phenomenon is called reordering. Of course, reordering happens not only in this case: the compiler, for reasons of its own, may also well swap some instructions.

    Invalidate queues


    As you can easily guess, store buffers are not infinite, so they tend to overflow, and then you still have to wait for the Invalidate Acknowledge. And those can sometimes take a very long time if the other processor and its cache are busy. This problem can be solved by introducing a new entity: the Invalidate Queue. All incoming invalidation requests for memory cells are placed into this queue, and the acknowledgment is sent back immediately. The values themselves are invalidated later, when it is convenient for the processor. Meanwhile the processor promises to behave: it will not send any messages concerning that cache line until it has processed the invalidation. Do you feel the catch?
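    The queue itself is trivial; the catch is entirely in when it gets drained. A toy sketch of my own, heavily simplified, just to fix the idea:

    import java.util.ArrayDeque;
    import java.util.Queue;

    // A toy model of an invalidate queue.
    class ToyInvalidateQueue {
        private final Queue<Long> pending = new ArrayDeque<>();

        // An invalidate request is acknowledged immediately but processed later,
        // so local loads may keep returning the stale value in the meantime.
        void onInvalidate(long address) {
            pending.add(address);
            // invalidate_ack goes out right here, before the line is actually invalidated!
        }

        // Runs "when the processor is comfortable".
        void drain(Lines cache) {
            Long address;
            while ((address = pending.poll()) != null) {
                cache.invalidate(address);
            }
        }

        interface Lines {
            void invalidate(long address);
        }
    }

    Now let's get back to our code.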

    void executedOnCpu0() {
        value = 10;
        finished = true;
    }
    void executedOnCpu1() {
        while(!finished);
        assert value == 10;
    }
    


    Suppose we got lucky (or used some secret knowledge), and Cpu0 wrote the memory cells in the order we need. Does that guarantee they will land in Cpu1's cache in the same order? As you may already suspect, no. We will also assume that the cell value is currently in Cpu1's cache in the Shared state. The sequence of events may then turn out as follows (in the table, “←” marks sending a message, “→” receiving one, and “-” an internal operation):

    #  | Cpu0                                                    | Cpu0: value                    | Cpu0: finished    | Cpu1                                                                     | Cpu1: value                       | Cpu1: finished
    0  | (...)                                                   | 0 (Shared)                     | false (Exclusive) | (...)                                                                    | 0 (Shared)                        | (Invalid)
    1  | value = 10; - store_buffer(value); ← invalidate(value)  | 0 (Shared), 10 in store buffer | false (Exclusive) |                                                                          |                                   |
    2  |                                                         |                                |                   | while (!finished); ← read(finished)                                      | 0 (Shared)                        | (Invalid)
    3  | finished = true;                                        | 0 (Shared), 10 in store buffer | true (Modified)   |                                                                          |                                   |
    4  |                                                         |                                |                   | → invalidate(value); ← invalidate_ack(value); - invalidate_queue(value)  | 0 (Shared), in invalidation queue | (Invalid)
    5  | → read(finished); ← read_resp(finished)                 | 0 (Shared), 10 in store buffer | true (Shared)     |                                                                          |                                   |
    6  |                                                         |                                |                   | → read_resp(finished)                                                    | 0 (Shared), in invalidation queue | true (Shared)
    7  |                                                         |                                |                   | assert value == 10; (assertion fails)                                    | 0 (Shared), in invalidation queue | true (Shared)
    N  |                                                         |                                |                   | - invalidate(value)                                                      | (Invalid)                         | true (Shared)


    Multithreading is simple and straightforward, isn't it? The problem is in steps (4)-(6). Receiving the invalidate in (4), we do not process it but put it in the queue. And in step (6) we receive the read_response to the read request that was sent earlier, in (2). This, however, does not force us to invalidate value, and so the assertion fails. If operation (N) had happened earlier, we would still have had a chance, but now that damned optimization has ruined everything! On the other hand, it is so fast and gives us ultra-low-latency™! That's the dilemma. Hardware designers cannot magically know in advance when an optimization is permissible and when it can break something. And so they pass the problem on to us, adding: “It's dangerous to go alone. Take this!”

    Hardware Memory Model


    The magic sword handed out by the designers who went off to fight dragons is not really a sword at all, but rather the rules of the game: they describe which values a processor can see when it or another processor performs certain actions. The thing that really looks like a sword is the Memory Barrier. In the MESI example we are considering, there are these swords:

    Store Memory Barrier (a.k.a. ST, SMB, smp_wmb) - an instruction that forces the processor to complete all stores already in the store buffer before executing the stores that follow this instruction

    Load Memory Barrier (a.k.a. LD, RMB, smp_rmb) - an instruction that forces the processor to apply all invalidates already in the queue before executing any subsequent loads


    With new weapons at our disposal, we can easily fix our example:

    void executedOnCpu0() {
        value = 10;
        storeMemoryBarrier();
        finished = true;
    }
    void executedOnCpu1() {
        while(!finished);
        loadMemoryBarrier();
        assert value == 10;
    }
    


    Great, everything works, and we're satisfied! Now we can go and write cool, fast, correct multi-threaded code. Or wait...

    But where is Java in all this?


    Write Once @ Run Anywhere


    All these cache coherence protocols, membars, cache flushes and other platform-specific things should, in theory, not concern those who write Java code. Java is platform-independent, right? And indeed, there is no such concept as reordering in the Java Memory Model.
    NB : If this phrase confuses you, do not read any further until you understand why. And read, for example, this .
    Sounds interesting: there is no concept of “reordering”, yet the reordering itself exists. The authorities are clearly hiding something! But even if we abandon the conspiracy view of the surrounding reality, we are left with curiosity and a desire to know. Let's satisfy it! Take a simple class illustrating our recent example: [ github ]

    public class TestSubject {
        private volatile boolean finished;
        private int value = 0;
        void executedOnCpu0() {
            value = 10;
            finished = true;
        }
        void executedOnCpu1() {
            while(!finished);
            assert value == 10;
        }
    }
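    To actually exercise this class you need two threads. A minimal harness might look like the sketch below; it is my own, not from the article's repository, assumes it sits next to TestSubject in the same package, and uses Java 8 syntax for brevity even though the article dissects JDK 7:

    // Run with assertions enabled: java -ea TestSubjectRunner
    public class TestSubjectRunner {
        public static void main(String[] args) throws InterruptedException {
            for (int i = 0; i < 100_000; i++) {
                final TestSubject subject = new TestSubject();
                Thread writer = new Thread(subject::executedOnCpu0);
                Thread reader = new Thread(subject::executedOnCpu1);
                reader.start();
                writer.start();
                writer.join();
                reader.join(); // with volatile in place, the assert never fires;
                               // an AssertionError would be printed by the default handler
            }
        }
    }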
    


    Traditionally, there are several approaches to finding out what happens next. You can have fun with PrintAssembly, you can look at what the interpreter does, you can squeeze secret knowledge out of those who already have it. Or you can declare with a mysterious look that the caches get flushed there, and leave it at that.

    Last time we looked at an interpreter that is not actually used in production . This time we will look at how the client compiler (C1) works. For my purposes I used openjdk-7u40-fcs-src-b43-26_aug_2013 .

    For someone who has never opened the OpenJDK sources before (and for someone who has, too), finding where the interesting things happen can be a challenge. One of the simplest ways is to look at the bytecode, learn the name of the instruction you care about, and then search for it.

    $ javac TestSubject.java && javap -c TestSubject
    void executedOnCpu0();
      Code:
         0: aload_0          // Push this onto the stack
         1: bipush        10 // Push 10 onto the stack
         3: putfield      #2 // Store the value on top of the stack (10) into the second field of this (value)
         6: aload_0          // Push this onto the stack
         7: iconst_1         // Push 1 onto the stack
         8: putfield      #3 // Store the value on top of the stack (1) into the third field of this (finished)
        11: return
    void executedOnCpu1();
      Code:
         0: aload_0          // Push this onto the stack
         1: getfield      #3 // Load the third field of this (finished) onto the stack
         4: ifne          10 // If it is non-zero, jump to label 10 (the loop is over)
         7: goto          0  // Jump back to the start of the loop
        10: getstatic     #4 // Fetch the static synthetic field $assertionsDisabled:Z
        13: ifne          33 // If assertions are disabled, jump to label 33 (the end)
        16: aload_0          // Push this onto the stack
        17: getfield      #2 // Load the second field of this (value) onto the stack
        20: bipush        10 // Push 10 onto the stack
        22: if_icmpeq     33 // If the two top stack elements are equal, jump to label 33 (the end)
        25: new           #5 // Create a new java/lang/AssertionError
        28: dup              // Duplicate the value on top of the stack
        29: invokespecial #6 // Invoke the constructor (the <init> method)
        32: athrow           // Throw whatever is on top of the stack
        33: return
    

    NB : Do not try to determine the exact behavior of the program in runtime using the bytecode. After the JIT compiler does its job, everything can change very much.
    What interesting things do we notice here? The first small thing many people forget: assertions are disabled by default; you enable them at runtime with the -ea flag. But that's a trifle. What we actually came for are the names of the instructions: getfield and putfield. Are you thinking what I'm thinking? (Of course, Gleb! But how do we build a Dyson sphere out of bacon, a plunger and two bras?!)

    Down the rabbit hole


    Noting that the same instructions are used for both fields, let's find out where the information that a field is volatile is kept. Field metadata is held by the class share/vm/ci/ciField.hpp. We are interested in the method (line 176):

    bool is_volatile    () { return flags().is_volatile(); }
    To find out what C1 does with volatile field accesses, we can look up all uses of this method. After wandering the dungeons a little and collecting a few Scrolls of Ancient Knowledge, we end up in the file share/vm/c1/c1_LIRGenerator.cpp. As its name hints, it generates the low-level intermediate representation ( LIR , Low-Level Intermediate Representation) of our code.

    An example: C1 intermediate representation for putfield


    When C1 builds its IR, our putfield instruction ultimately gets processed here. Looking at the special actions performed for volatile fields, we quickly stumble upon familiar words (lines 1734-1736):

    if (is_volatile && os::is_MP()) {
        __ membar_release();
    }

    Here __ is a macro that expands to gen()->lir()->, and the method membar_release is defined in share/vm/c1/c1_LIR.hpp (line 1958):

    void membar_release()                          { append(new LIR_Op0(lir_membar_release)); }

    In effect, this line appends a membar_release instruction to the intermediate representation of our code. After that, the following happens (lines 1747-1749):

    if (is_volatile && !needs_patching) {
        volatile_field_store(value.result(), address, info);
    }
    The implementation of volatile_field_store is platform-dependent. On x86 ( cpu/x86/vm/c1_LIRGenerator_x86.cpp), for example, the actions are quite simple: check whether the field is 64-bit, and if so, use Black Magic to guarantee that the write is atomic. You do remember that without the volatile modifier, fields of type long and double may be written non-atomically?
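    The classic illustration of that permission (JLS 17.7) is the long-tearing race below. This is my own sketch, and it is realistically reproducible only where 64-bit writes actually get split in two, e.g. on a 32-bit JVM; a 64-bit JVM will usually write longs atomically anyway:

    // Without volatile, a reader may observe a mix of the halves of two
    // different writes, e.g. 0x00000000FFFFFFFF.
    public class LongTearing {
        static long value; // not volatile: non-atomic treatment is allowed

        public static void main(String[] args) {
            Thread writer = new Thread(() -> {
                while (true) {
                    value = 0L;
                    value = -1L; // all 64 bits set
                }
            });
            writer.setDaemon(true);
            writer.start();

            while (true) {
                long observed = value;
                if (observed != 0L && observed != -1L) {
                    System.out.printf("torn read: %016x%n", observed);
                    return;
                }
            }
        }
    }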

    And at the very end, one more membar is placed, this time without the release (lines 1759-1761):

    if (is_volatile && os::is_MP()) {
        __ membar();
    }

    where (line 1956):

    void membar()                                  { append(new LIR_Op0(lir_membar)); }

    NB : Of course, I have slyly concealed part of what goes on here, for example the GC-related manipulations. The reader is invited to study them as an independent exercise.


    Converting the IR into assembly


    So far we had seen only ST and LD, and now new kinds of barriers appear. The thing is, what we saw earlier were barriers for low-level MESI; here we have moved to a higher level of abstraction, and the terminology has changed somewhat. Suppose there are two kinds of memory operations, Store and Load. Then there are four ordered combinations of two operations: Load and Load, Load and Store, Store and Load, Store and Store. Two of these categories, StoreStore and LoadLoad, are exactly the barriers we saw when discussing MESI. The other two are also easy enough to digest: all loads issued before a LoadStore barrier must complete before any store that follows it; with StoreLoad, all stores issued before the barrier must complete before any load that follows it. You can read more about this in the JSR-133 Cookbook , for example.

    In addition, there are the notions of operations with Acquire semantics and operations with Release semantics. Release applies to writes and guarantees that any memory operations preceding it must complete before it begins; in other words, an operation with write-release semantics cannot be reordered with any memory operation that precedes it in program order. This semantics can be provided by the combination of LoadStore + StoreStore barriers. Acquire, as you might guess, has the mirror-image semantics for reads, and can be expressed as LoadStore + LoadLoad.
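    In the pseudocode style of our earlier examples (the barrier functions are hypothetical, as before), the placement looks like this:

    void writeRelease() {
        // ... earlier loads and stores ...
        loadStoreBarrier();   // earlier loads finish before the store below
        storeStoreBarrier();  // earlier stores finish before the store below
        finished = true;      // the releasing write
    }
    void readAcquire() {
        boolean f = finished; // the acquiring read
        loadLoadBarrier();    // later loads cannot move up past the read
        loadStoreBarrier();   // later stores cannot move up past the read
        // ... later loads and stores ...
    }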

    Now we understand which membars the JVM inserts. However, so far we have only seen them in the LIR, which, though low-level, is still not the native code that the JIT must generate for us. Studying exactly how C1 turns LIR into native code is beyond the scope of this article, so without further ado we will go straight to the file share/vm/c1/c1_LIRAssembler.cpp, where the whole conversion of IR into assembly code takes place. For example, here is how lir_membar_release is handled (lines 665-667):

    case lir_membar_release:
          membar_release();
          break;
    The method called here is again platform-dependent; its x86 source lives in cpu/x86/vm/c1_LIRAssembler_x86.cpp (lines 3733-3736):

    void LIR_Assembler::membar_release() {
      // No x86 machines currently require store fences
      // __ store_fence();
    }
    Great! Thanks to the strict hardware memory model (TSO, Total Store Order), on this architecture all writes already have release semantics. With the second membar things are a little more complicated (lines 3723-3726):
    void LIR_Assembler::membar() {
      // QQQ sparc TSO uses this,
      __ membar( Assembler::Membar_mask_bits(Assembler::StoreLoad));
    }

    Here the macro __ expands to _masm->, and the method membar lives in cpu/x86/vm/assembler_x86.hpp (lines 1247-1263) and looks like this:
    void membar(Membar_mask_bits order_constraint) {
      if (os::is_MP()) {
        // We only have to handle StoreLoad
        if (order_constraint & StoreLoad) {
            // All usable chips support "locked" instructions which suffice
            // as barriers, and are much faster than the alternative of
            // using cpuid instruction. We use here a locked add [esp],0.
            // This is conveniently otherwise a no-op except for blowing
            // flags.
            // Any change to this code may need to revisit other places in
            // the code where this idiom is used, in particular the
            // orderAccess code.
            lock();
            addl(Address(rsp, 0), 0);// Assert the lock# signal here
        }
      }
    }
    So on x86, each volatile write costs us an expensive StoreLoad barrier in the form of lock addl $0x0,(%rsp). The operation is expensive because it forces all stores in the buffer to be completed. But it buys us exactly the behavior we expect from volatile: all other threads will see at least the value that was current at the moment of its execution.
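    You can get a rough feel for that cost with a deliberately naive loop. This is my own sketch, for illustration only; for real numbers use a proper harness such as JMH, since JIT compilation, on-stack replacement and dead-code elimination can easily distort timings like these:

    public class VolatileWriteCost {
        static int plain;
        static volatile int vol;

        public static void main(String[] args) {
            final int n = 100_000_000;
            long t0 = System.nanoTime();
            for (int i = 0; i < n; i++) plain = i;
            long t1 = System.nanoTime();
            for (int i = 0; i < n; i++) vol = i; // each write drains the store buffer
            long t2 = System.nanoTime();
            System.out.printf("plain:    %.2f ns/write%n", (t1 - t0) / (double) n);
            System.out.printf("volatile: %.2f ns/write%n", (t2 - t1) / (double) n);
        }
    }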

    It also turns out that a volatile read on x86 should be just an ordinary read. A quick inspection of LIRGenerator::do_LoadField tells us that after the read, as we expected, a membar_acquire is placed, which on x86 looks like this (lines 3728-3731):

    void LIR_Assembler::membar_acquire() {
      // No x86 machines currently require load fences
      // __ load_fence();
    }
    This does not mean, however, that a volatile read has no overhead compared to an ordinary read. Even though nothing extra appears in the native code, the mere presence of a barrier in the IR prohibits the compiler from reordering certain instructions (otherwise you could catch some funny bugs ). Using volatile has many other effects as well. You can read about this, for example, in this article .

    Putting it to the test


    PrintAssembly


    Sitting and divining over sources is a noble pursuit, worthy of any self-respecting philosopher. But just in case, let's still look at PrintAssembly. To do so, we call the methods of our test subject many times in a loop, disable inlining (to make it easier to find our way around the generated code) and start in the client VM, not forgetting to enable assertions:

    $ java -client -ea -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:MaxInlineSize=0 TestSubject
    ...
      # {method} 'executedOnCpu0' '()V' in 'TestSubject'
    ...
      0x00007f6d1d07405c: movl   $0xa,0xc(%rsi)
      0x00007f6d1d074063: movb   $0x1,0x10(%rsi)
      0x00007f6d1d074067: lock addl $0x0,(%rsp)     ;*putfield finished
                                                    ; - TestSubject::executedOnCpu0@8 (line 15)
    ...
      # {method} 'executedOnCpu1' '()V' in 'TestSubject'
    ...
      0x00007f6d1d061126: movzbl 0x10(%rbx),%r11d   ;*getfield finished
                                                    ; - TestSubject::executedOnCpu1@1 (line 19)
      0x00007f6d1d06112b: test   %r11d,%r11d
    ...

    Nice, everything looks exactly as we predicted. Now it remains to check whether something can really go wrong in the absence of volatile. Earlier, TheShade demonstrated broken Double-Checked Locking , but we want to get our hands dirty and break everything ourselves. Well, or almost ourselves.

    Demonstration of failure without volatile


    The trouble with demonstrating such a reordering is that in the general case the probability of it occurring is not very high, and on some hardware memory models it cannot be observed at all. So we either need to get hold of an Alpha, or rely on reordering done by the compiler. And, on top of that, run everything many, many, many times. It's good that we don't have to reinvent the wheel here: we will use the wonderful utility jcstress . Put very simply, it repeatedly executes some code and gathers statistics on the outcomes, doing all the dirty work for us, including work many don't even suspect is necessary.

    Moreover, the test we need has already been written for us . Or rather, a slightly more elaborate one, but it demonstrates what is happening perfectly:

    static class State {
        int x;
        int y; // acq/rel var
    }
    @Override
    public void actor1(State s, IntResult2 r) {
        s.x = 1;
        s.x = 2;
        s.y = 1;
        s.x = 3;
    }
    @Override
    public void actor2(State s, IntResult2 r) {
        r.r1 = s.y;
        r.r2 = s.x;
    }
    


    We have two threads: one changes the state, and the other reads the state and records what it saw. The framework aggregates the results for us and checks them against certain rules . We are interested in two outcomes the second thread can observe: [1, 0] and [1, 1]. In those cases we read y == 1, but either saw no writes to x at all (x == 0), or saw a value that was not the latest at the moment y was written, that is, x == 1. According to our theory, such outcomes should occur. Let's check:

    $ java -jar tests-all/target/jcstress.jar -v -t ".*UnfencedAcquireReleaseTest.*"
    ...
    Observed state Occurrence      Expectation                                            Interpretation
     [0, 0]          32725135        ACCEPTABLE       Before observing releasing write to, any value is OK for $x.
     [0, 1]             15           ACCEPTABLE       Before observing releasing write to, any value is OK for $x.
     [0, 2]             36           ACCEPTABLE       Before observing releasing write to, any value is OK for $x.
     [0, 3]           10902          ACCEPTABLE       Before observing releasing write to, any value is OK for $x.
     [1, 0]           65960    ACCEPTABLE_INTERESTING Can read the default or old value for $x after $y is observed.
     [1, 3]          50929785        ACCEPTABLE       Can see a released value of $x if $y is observed.
     [1, 2]             7            ACCEPTABLE       Can see a released value of $x if $y is observed.
    


    Here we see that in 65960 cases out of 83731840 (about 0.08%) we observed y == 1 && x == 0, which clearly indicates reordering. Hooray, we can call it a day.

    By now the reader should understand what is going on well enough to answer the questions posed at the beginning of the article. As a reminder:

    • How does happens-before work?
    • Is it true that volatile means flushing caches?
    • Why was there any kind of memory model?

    Well, has everything fallen into place? If not, try digging into the relevant section of the article once more. If that doesn't help, welcome to the comments!

    And one more thing ©

    Not only the hardware but the entire runtime environment may transform the source code, and to meet the requirements of the JMM, restrictions are imposed on every component where something can change. For example, the compiler may in general reorder some instructions, but the JMM forbids it many optimizations.
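    A classic example: if finished is a plain, non-volatile field, the compiler is allowed to hoist its read out of a spin loop, turning a spin-wait into a potentially infinite loop. A sketch of that legal transformation:

    // What the programmer wrote:
    while (!finished) {
        // spin
    }

    // What the optimizer may legally turn it into: the field is read once,
    // and if it was false, the loop never terminates.
    if (!finished) {
        while (true) {
            // spin
        }
    }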

    Of course, the server compiler (C2) is significantly smarter than the C1 we examined, and some things in it are very different; for example, it treats memory semantics in a completely different way.

    In many places, the multithreading guts of OpenJDK use the os::is_MP() check, which improves performance on single-processor machines by skipping certain operations. If you use the Forbidden Arts to trick the JVM at startup into thinking it is running on a single processor, it will not live long.

    Many thanks to the valiant TheShade , cheremin and artyushov for reading the article before publication and making sure that what I bring to the masses is the light, and not some nonsense riddled with silly jokes and blunders.


