Common mistakes when developing lockfree algorithms and their solutions

    On a habr already there were some articles about lock-free algorithms. This post is a translation of an article by my colleague that we plan to publish on our corporate blog. By type of activity, we write a huge number of lock-free algorithms and data structures, and this article wants to show how interesting and difficult it is at the same time.

    This article is largely similar to this article , but that article does not address all the problems that can be encountered when developing lock-free data structures, and very little attention is paid to solving these problems. In this article, I would like to dwell in detail on some of the solutions that we use in the actual implementation of lock-free data structures in our product, and pay more attention to evaluating performance.

    By definition, the lock-free algorithm ensures that the time between two consecutive completed operations is limited from above by a quantity that does not depend on how the operating system distributes these threads over time. Such a guarantee sounds very nice, because threads can access the same memory location at the same time, and even if the operating system decides to temporarily euthanize one or more threads in the middle of some operation with this memory location, the rest of the threads will continue to work. No matter how two or more threads intersect each other in time, at least one of them will always end in a finite time (unlike wait-free, lock-free does not guarantee that they will all end in a finite time).

    Lock-free algorithms are usually contrasted with the traditional approach to writing multi-threaded applications using locks around code that accesses common areas of memory. When a thread wants to access memory, it blocks access to it from other threads. If some other thread has blocked access before, then the current thread waits until the lock is released. If the operating system decides to temporarily euthanize the thread that owns the lock, the entire system stops without the ability to change the shared memory area.

    Instead of using locks, lock-free algorithms use a command known as compare and swap (CAS). Its logic can be described by the following code fragment, with the caveat that CAS is performed atomically:

    1 bool CompareAndSwap(Value* addr, Value oldVal, Value newVal){
    2     if(*addr == oldVal){
    3         *addr = newVal;
    4         return true;
    5     }else{
    6         return false;
    7     }
    8 }

    The main problems with lock-free algorithms are the following three:

    1. Often, a lock-free implementation is less practical than an implementation with locks.
    2. Writing lock-free code is not easy.
    3. Writing the right lock-free code is incredibly difficult.

    To prove point three, consider a simple example - implementing a stack on linked lists. If you ask a person who is not familiar with lock-free to write a stack, then its first implementation will be approximately as follows:

    For the operation of adding to the stack, create a new element of the linked list, a pointer to the next element of which points to the current top of the stack. Then try using CAS to change the top of the stack to a new element. If CAS succeeds, the item is inserted. If the CAS returns with an error (which means that the top of the stack has changed between the way we read it and the way the CAS was executed), we will repeat everything from the beginning.

    For the delete operation from the stack, remember the current top of the stack, and, using CAS, try to change it to the value of its pointer to the next element. If CAS was successful, then the element from the stack was removed, otherwise the top of the stack was changed by another thread, and we are trying to delete it again.

    In C ++, it will look something like this:

     1 template 
     2 class LockFreeStack{
     3     struct Node{
     4         Entry* entry;
     5         Node* next;
     6     };
     8     Node* m_head;
    10     void Push(Entry* e){
    11         Node* n = new Node;
    12         n->entry = e;
    13         do{
    14             n->next = m_head;
    15         }while(!CompareAndSwap(&m_head, n->next, n));
    16     }
    18     Entry* Pop(){
    19         Node* old_head;
    20         Entry* result;
    21         do{
    22             old_head = m_head;
    23             if(old_head == NULL){
    24                 return NULL;
    25             }
    26         }while(!CompareAndSwap(&m_head, old_head, old_head->next));
    28         result = old_head->entry;
    29         delete old_head;
    30         return result;
    31     }
    32 }

    Unfortunately, this seemingly reasonable, implementation of the lock-free stack contains a huge number of errors.


    Even the simplest test of such a stack will fall quickly with Segmentation Fault. The reason is that on line 22 we read the pointer to the current top of the stack. In line 26, we turn to her next field. Between the time we read the pointer and the way we accessed its next field, a pop that was running at the same time could delete this element (line 29), thereby leading to the fact that the memory read in line 26 was already freed.
    For the algorithm to work correctly, some kind of secure memory management algorithm is needed.

    Item loss

    A successful CAS does not guarantee that the memory has not been changed during its execution. He guarantees that its value at the moment when he rewrote it, was equal to the value that was given to him as expected. In other words, if you call CAS (ptr, 5, 6) at the moment when the value in ptr is five, then another thread changes the value in ptr to seven, and then back to five, CAS will succeed, re-setting five to six. This can lead to a problem known as an ABA problem. Consider the following example:

    Assume there are two elements on the stack, A -> C.

    Stream 1 calls Pop, reads m_head (A) on line 22, reads old_head-> next (C) on line 26, and then the operating system euthanizes the stream .
    Thread 2 calls Pop, and successfully removes item A.
    Thread 2calls Push, and successfully inserts the element B.
    Stream 2 calls Push again, and inserts the element located at the same address as A (either A was released and then another element was started at the same address, or the user decided to insert the element which he just picked up from the stack, back)
    Thread 1 wakes up and calls CAS. CAS is successful, despite the fact that during this time, m_head has changed three times! As a result, element B is lost.

    Not lock-free

    The C ++ standard does not guarantee that new and delete are lock-free. What is the point of developing a lock-free algorithm if it calls library functions that are not lock-free? In order for the algorithm to be lock-free, working with memory must also be lock-free.

    Simultaneous memory access

    If the current value in some memory location is 100, and one of the threads is currently writing 200 to it, and the other thread is reading its value (suppose that there are no other threads in the system besides these two threads). What value will the second stream read? It seems reasonable that he will read either 100 or 200. However, this is not so. In C ++, the described scenario leads to undefined behavior, and the second thread could theoretically read 42. Prior to C ++ 11, people used volatile for variables that could be accessed simultaneously, but in fact volatile should never be used for this purpose .
    In our implementation of Push and Pop, both read and write m_head, thus each reading of m_head leads to undefined behavior, and can return a value that has never been written to m_head.

    Memory access order

    It is well known that both the compiler and the processor can change the order of execution of commands. Consider this example. Let x and y both be zero, and then two threads execute the following code:

    Stream 1: print x; x is 1; print y;
    Stream 2: print y; y = 1; print x;

    Suppose that both threads print two zeros. It follows that the first thread deduced y before the second thread assigned one to it, which happened before the second thread deduced x, which happened before the first thread assigned it one, which happened before the first thread deduced y. That is, the first thread printed y before it printed y, which makes no sense. However, such a scenario is possible, because both the compiler and the processor could change the reading and writing order of x and y while the streams are in use.
    In our implementation of the lock-free stack, we can see the value in the pointer to the next element, which was changed long ago, and, as a result, we can access the memory that was freed.

    So how do you write a lock-free stack?

    Most of the problems described have more than one solution. Here are the solutions that we use in our work both for the stack and for other data structures.


    Before you delete a piece of memory, you need to make sure that no one is reading this memory. The first thought that comes to mind is to use a reference counter, but this solution does not work: between how we got the address and how we increased the reference counter, memory could be freed. We are using a solution called Hazard Pointers. In the case of the stack, each thread has a special pointer visible to everyone (let's call it the “local pointer”), where the address of the element with which this thread is currently working is stored. Before deleting any element, the deletion thread makes sure that this element is not contained in any of the local pointers. If it is contained in local pointers, you cannot delete an item. You can either wait and try again,
    Each thread that wants to perform some operation on the top of the stack first saves the top of the stack to its local pointer, then makes sure that the element stored in the local pointer is still the top of the stack (it could be removed and memory could be freed while we saved it to local pointer). After that, the thread can be sure that the memory occupied by the element will not be freed.
    A big drawback may be that to remove an item, you need to scan an array whose size is equal to the number of threads in the system. One of the optimization options may be the following: instead of deleting an element, always put it in the queue, and only when the queue collects a sufficiently large number of elements to delete (comparable to the number of threads), delete them all at once. In this case, you can copy the values ​​of all local pointers at that time into some data structure with a quick search (for example, a hash table), and use this data structure to check whether it is possible to delete an element.
    For data structures, it is more complicated than a stack (linked lists, skiplists), more than one local stream pointer may be required.

    Item loss

    To ensure that the ABA problem never occurs, it is enough to ensure that m_head will never take the same value twice. To do this, you can store a 64-bit number with a pointer (let's call it a version), which increases by one with each operation. In order to change the pointer and version at the same time, we need Double CAS, which is available on all modern processors.

    Not lock-free

    To solve the problem with the institution of memory, you just can not start the memory. In our code, we store a pointer to the next element directly in the element structure. But this is not always an acceptable solution, because it consumes 8 bytes per element, even if the element is not on the stack.

    Simultaneous memory access

    We use boost :: atomic for variables that can be accessed from multiple threads. Although the compiler that we use (g ++ 4.6) already has an implementation of std :: atomic, it is significantly less efficient than boost :: atomic, because it uses memory_barrier after each operation, even if memory_barrier is not needed after it.

    Memory access order

    In C ++ 11, a new memory model and memory access rules for atomic variables were introduced. For each read or write operation, you can specify which guarantees are required. For calls to the top of the lock-free stack, we are forced to use CompareAndSwap with the guarantee of sequential consistency (this is the strictest guarantee, and, as a result, the slowest). Sequential consistency means that for all read and write operations performed in a certain period of time by all threads, there is an order such that each thread sees these operations as if they had occurred in that order.
    If you understand the C ++ 11 memory model, and try to analyze what guarantees should be used to work with local pointers (hazard pointers), you can assume that aqcuire / release is sufficient.

    What is acquire / release
    The acquire / release guarantee can be described by the following scenario: if Stream 1 changed the variable A with the release guarantee, and then the variable B with the release guarantee, and Stream 2 read B with the guarantee guaranteed, and saw the value recorded by Stream 1, then it is guaranteed that if then Stream 2 reads A with a guarantee of acquire, then it will see the change made by Stream 1.

    We used acquire / release for a while, until in one day the server crashed on a successfully delivered assert. A detailed analysis of the error showed that the following scenario is possible when using acquire / release:

    Stream 1 prepares to execute Pop and reads the top of the stack
    Stream 1 writes the top of the stack to its local pointer (using release guarantee)
    Stream 1 reads the top of the stack again, it has not changed
    Stream 2 removes the top of the stack, and decides to delete it (or adds it to the deletion queue, and another thread decides to delete it - let's call the thread that removes the element, Stream 3 )
    Stream 3reads an array of local pointers with a guarantee of acquire. Here a breakdown occurs - it is not guaranteed that we see the change made by Stream 1 .
    Thread 3 deletes memory.
    Thread 1 dereferences the stored top of the stack that has been deleted and crashes with SegFault.

    If you use Sequential Consistency both to access local pointers and to work with the top of the stack, such a scenario is not possible. Since the order of operations for all streams is the same, either Stream 2 removed the element from the top of the stack before Stream 1 checked it after writing to the local pointer (then Stream 1 will understand that the top of the stack has changed and will start again), or Stream 1successfully compared its local pointer to the top of the stack before Stream 2 removed the item from the top of the stack, which means that Stream 3 will see the deleted top of the stack in the local pointer.


    For this article, we wrote two implementations of the stack: one lock-free, the other using std :: mutex. The test program started several threads, and called Push and Pop with equal probability in an infinite loop. For measurements used Dell Precision T5500. The results are shown in the graph below. The X axis is the number of threads, the Y axis is the number of operations per second.

    It doesn’t look very impressive.
    One of the problems is that by its nature the stack has a very bottleneck - its top. All threads are fighting to change it first, and performance will be as high as the theoretical ability to change top. The operations of Push and Pop are so simple that even when using a single thread, they rest against the limit of the number of changes to the top of the stack per unit time. Adding new threads only slows down overall performance, because now you need to synchronize the changes in the top of the stack between threads.
    Another problem is that when several threads simultaneously try to change the top of the stack, only one of them will do this successfully, and all other threads will go to a new iteration. This is problem. As you know, every problem has a quick, beautiful but wrong solution. Such a solution for this problem would be to add usleep (250) after a failed CAS. This is a simple but not optimal way to make threads intersect less when the top of the stack changes. The result may seem surprising - adding usleep (250) improves performance by 10 times! In practice, we use slightly more sophisticated approaches to reduce the number of failed CASs, but as you can see, even a simple approach like usleep gives excellent results.

    It is also interesting to look at how much resources various implementations consume:
    This is how HTOP looks like for launching with a lock-free stack without usleep in 16 threads:

    As you can see, the processor is 100% busy. This is due to the fact that all threads in the loop are trying to change the top of the stack, again and again performing a new iteration, completely consuming the processor time available to them.

    This is how HTOP looks to run with a lock-free stack with usleep in 16 threads:

    The processor is almost idle. Threads that fail to immediately change the vertex sleep without occupying the processor. This is an amazing observation - the addition of usleep not only increased useful performance by 10 times, but also significantly reduced CPU consumption.

    This is how HTOP looks to run with a stack with locks in 16 threads:

    Here you can clearly see that the processor is completely occupied by system calls. In other words, almost all the processor time was spent working with locks.


    Lock-free ensures that the system will always do useful work, even if some of the threads have been suspended by the operating system. But lock-free does not guarantee that this will be achieved efficiently. In addition, lock-free algorithms often contain errors that are not always easy to notice or reproduce. The simple implementation of the stack at the beginning of this article contained five different errors, and was inferior in performance to the implementation with locks. If you want to use lock-free algorithms, make sure that they are worth it, both in terms of performance and in terms of implementation complexity. Try not to invent your own lock-free algorithms, but to find ready-made implementations or scientific works.

    The code

    With locks:

     1 #include 
     2 #include 
     4 template
     5 class LockedStack{
     6 public:
     7     void Push(T* entry){
     8         std::lock_guard lock(m_mutex);
     9         m_stack.push(entry);
    10     }
    12     // For compatability with the LockFreeStack interface,
    13     // add an unused int parameter.
    14     //
    15     T* Pop(int){
    16         std::lock_guard lock(m_mutex);
    17         if(m_stack.empty()){
    18             return nullptr;
    19         }
    20         T* ret =;
    21         m_stack.pop();
    22         return ret;
    23     }
    25 private:
    26     std::stack m_stack;
    27     std::mutex m_mutex;
    28 };

    No locks (there is no logic in the code responsible for clearing the memory)

     1 class LockFreeStack{
     2 public:
     3     // Элементы стека должны наследоваться от Node
     4     //
     5     struct Node{
     6         boost::atomic next;
     7     };
    10     // К сожалению, нет платформо-независимого способа написать этот класс.
    11     // Приведенный код работает для gcc на архитектуре x86_64.
    12     //
    13     class TaggedPointer{
    14     public:
    15         TaggedPointer(): m_node(nullptr), m_counter(0) {}
    17         Node* GetNode(){
    18             return m_node.load(boost::memory_order_acquire);
    19         }
    21         uint64_t GetCounter(){
    22             return m_counter.load(boost::memory_order_acquire);
    23         }
    25         bool CompareAndSwap(Node* oldNode, uint64_t oldCounter, Node* newNode, uint64_t newCounter){
    26             bool cas_result;
    27             __asm__ __volatile__
    28             (
    29                 "lock;"
    30                 "cmpxchg16b %0;"
    31                 "setz       %3;"
    33                 : "+m" (*this), "+a" (oldNode), "+d" (oldCounter), "=q" (cas_result)
    34                 : "b" (newNode), "c" (newCounter)
    35                 : "cc", "memory"
    36             );
    37             return cas_result;
    38         }
    39     private:
    40         boost::atomic m_node;
    41         boost::atomic m_counter;
    42     }
    43     // Выравнивание по 16 байтам необходимо для правильной работы double cas
    44     //
    45     __attribute__((aligned(16)));
    48     bool TryPushStack(Node* entry){
    49         Node* oldHead;
    50         uint64_t oldCounter;
    52         oldHead = m_head.GetNode();
    53         oldCounter = m_head.GetCounter();
    54         entry->, boost::memory_order_relaxed);
    55         return m_head.CompareAndSwap(oldHead, oldCounter, entry, oldCounter + 1);
    56     }
    58     bool TryPopStack(Node*& oldHead, int threadId){
    59         oldHead = m_head.GetNode();
    60         uint64_t oldCounter = m_head.GetCounter();
    61         if(oldHead == nullptr){
    62             return true;
    63         }
    64         m_hazard[threadId*8].store(oldHead, boost::memory_order_seq_cst);
    65         if(m_head.GetNode() != oldHead){
    66             return false;
    67         }
    68         return m_head.CompareAndSwap(oldHead, oldCounter, oldHead->next.load(boost::memory_order_acquire), oldCounter + 1);
    69     }
    71     void Push(Node* entry){
    72         while(true){
    73             if(TryPushStack(entry)){
    74                 return;
    75             }
    76             usleep(250);
    77         }
    78     }
    80     Node* Pop(int threadId){
    81         Node* res;
    82         while(true){
    83             if(TryPopStack(res, threadId)){
    84                 return res;
    85             }
    86             usleep(250);
    87         }
    88     }
    90 private:
    91     TaggedPointer m_head;
    92     // Умножение на восемь позволяет каждому локальному указателю быть в
    93     //   своей собственной кеш-линии, что заметно ускоряет работу с ними.
    94     //
    95     boost::atomic m_hazard[MAX_THREADS*8];
    96 };

    The benchmark code is also available on github (if you want to run the test locally):


    At the suggestion, lostmsu added a stack with spinlock instead of std :: mutex to the benchmark. For the purity of the experiment, I used spinlock, which, like the lockfree implementation, sleeps 250 milliseconds if it is not possible to capture the lock on the first try. Not unexpectedly, such an implementation turned out to be more productive than the lock-free implementation, and the implementation with std :: mutex. CPU consumption is visually the same as the lock-free implementation with usleep (250).

    The github repository has also been updated with a new implementation.

    Also popular now: