Multithreaded associative containers in C++. Yandex Report

    In this talk, senior developer Sergey Murylyov describes a multi-threaded associative container being developed for the standard library within WG21. Sergey covers the pros and cons of popular approaches to the problem and the path the developers chose.


    - As you probably guessed from the title, today's talk is about how we, within Working Group 21, are making a container similar to std::unordered_map, but for a multi-threaded environment.

    Many programming languages (Java, C#, Python) already have such a container, and it is quite widely used. But our beloved, fastest and most productive C++ does not. So we put our heads together and decided to make one.



    Before adding something to the standard, you need to think about how people will use it, and only then design a proper interface, one the committee is likely to adopt, preferably without amendments. Otherwise you end up setting out to build one thing and shipping another.

    The most famous and widely used scenario is caching the results of large, heavy computations. There is a fairly well-known service, Memcached, that simply caches web server responses in memory. Clearly, you can do roughly the same thing inside your own application if you have a concurrent associative container. Both approaches have their pros and cons, but if you don't have such a container, you have to either roll your own or use something like Memcached. A hand-rolled version might look like the sketch below.
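    For reference, here is a minimal sketch of the kind of cache people hand-roll today: one mutex around a std::unordered_map. The class and method names are mine, not from any proposal; a real concurrent container would split this single lock into per-segment locks, as described later in the talk.

```cpp
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

// A minimal hand-rolled cache: a single mutex protecting an ordinary map.
// Values are copied out, so no references to the internals ever escape.
class PageCache {
public:
    std::optional<std::string> find(const std::string& url) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = map_.find(url);
        if (it == map_.end()) return std::nullopt;
        return it->second;                   // copy out under the lock
    }
    void insert(const std::string& url, std::string page) {
        std::lock_guard<std::mutex> lock(mutex_);
        map_.emplace(url, std::move(page));  // first writer wins; repeat inserts are no-ops
    }
private:
    mutable std::mutex mutex_;
    std::unordered_map<std::string, std::string> map_;
};
```

    On a cache miss the caller computes the value and publishes it with insert; if two threads race, the losing insert simply has no effect, which is harmless for a cache.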

    Another popular use case is event deduplication. I think many people in this room write distributed systems of various kinds and know that components often communicate through distributed queues, such as Apache Kafka and Amazon Kinesis. Their distinguishing feature is that they can deliver the same message to the same consumer several times. This is called at-least-once delivery: a guarantee of delivery at least once; more is possible, less is not.

    Consider this in real-life terms. Imagine the backend of some chat or social network where messages are exchanged. Redelivery can mean that someone writes a message and someone else later receives the same push notification on their phone several times. Clearly, users who see this will not be happy about it. The claim is that this problem can be solved with just such a wonderful multi-threaded container, as in the sketch below.
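    A minimal sketch of the idea, again hand-rolled with a single mutex (the names are illustrative): the first thread to insert a message id processes the message, and redeliveries see the id already present and skip it.

```cpp
#include <mutex>
#include <string>
#include <unordered_set>

// At-least-once deduplication: remember every message id we have already seen.
class Deduplicator {
public:
    // Returns true exactly once per message id, even under concurrent redelivery.
    bool first_time(const std::string& message_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        return seen_.insert(message_id).second;  // .second is false for a duplicate
    }
private:
    std::mutex mutex_;
    std::unordered_set<std::string> seen_;
};
```

    The push-notification code then becomes: if first_time(msg_id) returns true, send the push, otherwise drop it. (In production the set would also need expiry, which is omitted here.)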

    The next, less frequent case is when we simply need to store some per-user metadata on the server side. For example, we can record the time the user last authenticated, to know when to next ask them for their username and password.

    The last option on this slide is statistics. A real-life example is the virtual machine from Facebook: they built a whole VM to optimize PHP, and in it, for all built-in functions, they record into a multi-threaded hash table the arguments the functions were called with. If the result is already in the cache, they try to return it right away without recomputing anything.


    Link from the slide

    Adding something big and complicated to the standard is neither easy nor fast. As a rule, anything big goes through a Technical Specification. Right now the committee is doing a lot to expand multithreading support in the standard library, and specifically, proposal P0260R3 on concurrent queues is going through this process. That proposal is about a data structure very similar to ours, and our design decisions are alike in many respects.

    One of the basic concepts of their design is that the interface differs from the standard std::queue. What is the point? To extract an element from the standard queue, you need two operations: a front call to read the element and a pop call to remove it. Under multithreading, some other operation on the queue can happen between these two calls, and it may turn out that we read one element and delete another, which seems conceptually wrong. Therefore those two calls were merged into one there, and a few more calls along the lines of try_push and try_pop were added; a minimal sketch follows below. The general idea is that a multi-threaded container will not have the same interface as a regular one.
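    A minimal locked sketch of the combined operation (the name try_pop echoes P0260, but the signature here is simplified and the single mutex is just for illustration):

```cpp
#include <mutex>
#include <optional>
#include <queue>

// front() followed by pop() is a race under concurrency: another thread can
// pop in between. The fix is one atomic "read and remove" operation.
template <class T>
class BasicConcurrentQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(value));
    }
    std::optional<T> try_pop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty()) return std::nullopt;  // nothing to take, don't block
        T value = std::move(queue_.front());
        queue_.pop();
        return value;                             // read and remove as one step
    }
private:
    std::mutex mutex_;
    std::queue<T> queue_;
};
```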

    Concurrent queues also have many different implementations solving several different problems. The most common is the producer-consumer problem: some threads produce tasks, and other threads take tasks off the queue and process them.

    The second case is when we just need some kind of synchronized queue. In the producer-consumer case we get a queue bounded from above and below: if we try to extract from an empty queue, we go into a waiting state, and roughly the same happens if we try to add something that does not fit into the queue's capacity. A sketch of such a bounded blocking queue follows.
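    A sketch of a bounded blocking queue built on a mutex and two condition variables (the names are illustrative; P0260 spells its waiting operations differently): pop blocks while the queue is empty, push blocks while it is full.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

template <class T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(T value) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [&] { return queue_.size() < capacity_; });  // wait while full
        queue_.push(std::move(value));
        not_empty_.notify_one();   // wake one waiting consumer
    }

    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [&] { return !queue_.empty(); });           // wait while empty
        T value = std::move(queue_.front());
        queue_.pop();
        not_full_.notify_one();    // wake one waiting producer
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable not_empty_, not_full_;
    std::queue<T> queue_;
    const std::size_t capacity_;
};
```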

    The proposal also describes a separately provided interface that lets you distinguish what kind of implementation is inside: locking or lock-free. Note that what is everywhere called lock-free in the proposal is actually what the books call wait-free: it means the algorithm completes in a fixed number of operations. Lock-free there means something slightly different, but that is not the point.



    Let's look at a typical diagram of what a lock-based implementation of our hash table might look like. It is divided into a number of segments. Each segment, as a rule, contains some kind of lock, such as a mutex, a spinlock, or something even more clever. And besides the mutex or spinlock, it contains an ordinary hash table protected by that lock.

    In this picture the hash table is built on linked lists. In fact, in our reference implementation we wrote a hash table with open addressing, for performance reasons. The reasoning is essentially the same as why std::vector is faster than std::list: a vector is stored contiguously in memory, so walking it is sequential access, which caches well. If we use lists instead, we get all sorts of jumps between different regions of memory, and the whole thing usually ends with lost performance.

    When we want to find something in this hash table, we first take the hash code of the key. We can take it modulo the number of segments (perhaps with some extra mixing) to get the segment number, and then search within that segment as in a regular hash table, while, of course, holding that segment's lock, as in the sketch below.
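    A compilable sketch of that segmented design (the names are mine; std::unordered_map stands in for each segment's table, whereas the talk's reference implementation uses open addressing inside):

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <optional>
#include <unordered_map>

// N segments, each an ordinary hash table guarded by its own mutex.
// Threads hitting different segments never contend with each other.
template <class Key, class Value, std::size_t Segments = 16>
class SegmentedMap {
public:
    std::optional<Value> find(const Key& key) const {
        const Segment& s = segment_for(key);
        std::lock_guard<std::mutex> lock(s.mutex);  // lock only this segment
        auto it = s.table.find(key);
        if (it == s.table.end()) return std::nullopt;
        return it->second;                          // copy out; no reference escapes
    }

    void insert(const Key& key, Value value) {
        Segment& s = segment_for(key);
        std::lock_guard<std::mutex> lock(s.mutex);
        s.table.emplace(key, std::move(value));
    }

private:
    struct Segment {
        mutable std::mutex mutex;
        std::unordered_map<Key, Value> table;
    };

    // Hash the key, then reduce modulo the segment count to pick a segment.
    const Segment& segment_for(const Key& key) const {
        return segments_[std::hash<Key>{}(key) % Segments];
    }
    Segment& segment_for(const Key& key) {
        return segments_[std::hash<Key>{}(key) % Segments];
    }

    std::array<Segment, Segments> segments_;
};
```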


    Link from the slide

    Now, the main ideas of our design. Of course, we too made an interface that does not match std::unordered_map. The reason is this: ordinary std::unordered_map has such a wonderful thing as iterators. First, not all implementations can support them properly. And those that can are, as a rule, lock-free implementations that require either garbage collection or some kind of smart pointers to clean up the memory behind them.

    On top of that, there are several other kinds of operations we threw out. Iterators do in fact exist in many places, for example in Java. But oddly enough, as a rule they are not synchronized there: if you try to touch them from different threads, they can easily end up in an invalid state. In Java you can get an exception; if you write the same thing in C++, it will most likely be undefined behavior, which is even worse. And speaking of undefined behavior: as far as I know, our comrades at Facebook did exactly that in their library folly, published as open source on GitHub. They simply copied the interface from Java and got those same wonderful iterators.

    We also don't do memory reclamation: if we added something to the container and took memory for it, we will not give it back, even if we delete everything. Another prerequisite for this decision is that we have a hash table with open addressing inside. It works on a linear data structure which, like vector, does not give memory back.

    The next conceptual point is that under no circumstances will we ever hand out references or pointers to internal objects. This was done precisely to avoid the need for garbage collection and to avoid forcing smart pointers on anyone. Clearly, if we store large enough objects, keeping them by value inside will be wasteful. But in that case we can always wrap them in a smart pointer of our choice, as in the sketch below. And if we want, for example, some kind of synchronization on the values, we can wrap them in something like boost::synchronized_value.
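    A small sketch of the wrapping idea, reusing the SegmentedMap sketch from above (BigBlob and the other names are illustrative): since values only ever leave the container by copy, storing a shared_ptr makes that copy cheap regardless of the object's size.

```cpp
#include <iostream>
#include <memory>
#include <string>

struct BigBlob {
    std::string payload;  // stands in for a genuinely large object
};

int main() {
    // Values are shared_ptrs, so find() copies a pointer, not the blob itself.
    SegmentedMap<std::string, std::shared_ptr<const BigBlob>> map;
    map.insert("k", std::make_shared<const BigBlob>(BigBlob{"large payload"}));
    if (auto p = map.find("k"))
        std::cout << (*p)->payload << '\n';  // *p is the shared_ptr we copied out
}
```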

    We looked at how, some time ago, shared_ptr had a method removed that returned the number of active references to the object. And we came to the conclusion that we likewise need to throw out several functions, namely size, count, empty, and bucket_count: as soon as we return such a value from a function, it ceases to be valid, because someone may change it at that very moment.

    At one of the previous meetings we were asked to add a kind of mode that lets the container be accessed in single-threaded fashion, like a regular std::unordered_map. Such semantics make it possible to clearly distinguish where we are working in multi-threaded mode and where we are not, and to avoid the unpleasant situation where people take a concurrent container, expect everything to be fine, grab iterators, and then suddenly discover that everything is bad. A hypothetical sketch of such a mode follows.
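    This is a purely illustrative sketch of what that mode might look like; the names are mine, not the proposal's. While the view object exists, the caller promises there is no concurrent access, so the full single-threaded interface (iterators, size, and so on) becomes safe again:

```cpp
#include <cstddef>
#include <unordered_map>

template <class Key, class Value>
class toy_concurrent_map {
public:
    using underlying = std::unordered_map<Key, Value>;

    class single_threaded_view {
    public:
        explicit single_threaded_view(underlying& m) : map_(m) {}
        // Iteration is safe here: no other thread may touch the map.
        typename underlying::iterator begin() { return map_.begin(); }
        typename underlying::iterator end()   { return map_.end(); }
        std::size_t size() const { return map_.size(); }  // meaningful again in this mode
    private:
        underlying& map_;
    };

    // Caller guarantees exclusive access for the lifetime of the view.
    single_threaded_view make_single_threaded_view() {
        return single_threaded_view(storage_);
    }

private:
    underlying storage_;  // the concurrent machinery is elided in this toy
};
```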


    Link from the slide

    At the meeting in Hawaii, a whole proposal was written against us. :) We were told that such things have no place in the standard, because there are many ways to build a multi-threaded associative container.

    Each has its pros and cons, and it is unclear how whatever we end up with would actually be used. As a rule, such a container is reached for when some kind of extreme performance is needed, and supposedly our boxed solution won't do: you have to optimize for each specific case.

    The second point of that proposal was that our interface is incompatible with what Facebook has published on GitHub out of its own library.

    In fact, there was no real problem there. It was simply a case of "haven't read it, but condemn it anyway": they just saw that the interface was different and concluded it was incompatible.

    The proposal's idea was also that such problems should be solved with so-called policy-based design: you add an extra template parameter taking a bit mask of which features to enable and which to turn off. Frankly, this sounds a little wild, and it leads to roughly 2^n implementations, where n is the number of different features.



    In code it looks something like this: there is a template parameter and some number of predefined constants that can be passed into it; a reconstruction is sketched below. Oddly enough, that proposal was rejected.
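    A reconstruction of the idea from the slide (the exact names are not from the rejected proposal): a bitmask non-type template parameter toggles features at compile time, so every combination is a distinct instantiation.

```cpp
#include <cstdint>

enum map_features : std::uint32_t {
    mf_none            = 0,
    mf_lock_free_reads = 1u << 0,
    mf_memory_reclaim  = 1u << 1,
    mf_iterators       = 1u << 2,
};

template <class Key, class Value, std::uint32_t Features = mf_none>
class concurrent_unordered_map {
    // Each feature bit changes the implementation, which is where the
    // "roughly 2^n implementations for n features" objection comes from.
};

// Usage: pick the feature set at compile time.
concurrent_unordered_map<int, int, mf_lock_free_reads | mf_memory_reclaim> fast_map;
```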



    Because, in fact, the committee had already taken the position that such things belong in the standard back when the concurrent queue passed through SG1. There was not even a vote on that question. But two questions were put to a vote.

    First: many people wanted our reference implementation to support reads without taking any locks at all, so that reading is completely non-blocking. This really makes sense: as a rule, the most popular use case is caching, and having fast reads pays off there.

    Second: everyone had just heard the previous speaker talk about policy-based design, and everyone had the same thought: let me pitch my idea too! So everyone voted for having policy-based design. Though I should say this whole story has been going on for quite a while, and before us our colleagues from Intel, Arch Robison and Anton Malakhov, were working on this. They already had several proposals on the subject. They proposed adding a lock-free implementation based on the split-ordered list algorithm, and they also had policy-based design with memory reclamation.

    The Library Evolution Working Group did not like that proposal. It was turned down on the grounds that the number of words in the standard would simply grow without bound, making it impossible to adequately review, let alone implement in code.

    There are no comments on the ideas themselves; the comments are, for the most part, about the reference implementation, and of course we have a couple of ideas for making it cleaner. But essentially, next time we will come with the same proposal. We sincerely hope it will not take us five meetings with the same paper, as happened with modules. We believe in ourselves, that we will be allowed to move on to the next stage, and that the Library Evolution Working Group will stand by its opinion and not let policy-based design be pushed through. Because our opponent is not giving up; he has decided to prove to everyone that it is necessary. But, as they say, time will tell. That's all from me, thank you for your attention.
