Boost::lockfree: testing message throughput and latency

Not long ago a whole new library, lockfree, appeared in boost 1.53; it implements non-blocking queues and a stack.
For the past few years I have been working with so-called lock-free data structures: we wrote them ourselves, tested them ourselves, used them, and were quietly proud of them. Naturally, the question immediately arose: should we switch from our home-grown libraries to boost, and if so, when?
That is when the idea first occurred to me to apply some of the techniques we used to test our own code to boost::lockfree. Fortunately, we do not have to test the algorithm itself and can concentrate on measuring performance.
I will try to make the article interesting for everyone. Those who have not yet dealt with such problems will get a look at what these algorithms are capable of and, most importantly, where they should and should not be used. Those with experience in developing non-blocking queues may find it interesting to compare quantitative measurements; I for one have not seen such publications.

Introduction: what non-blocking data structures and algorithms are

The concept of multithreading has become firmly established in modern programming, but working with threads is impossible without synchronization primitives, so the mutex, the semaphore, the condition variable and their later descendants appeared. The first standard primitives, however, were rather heavy and slow, and moreover they were implemented inside the kernel, that is, they required a context switch on every call. The switch time depends only weakly on the CPU, so the faster processors became, the larger the relative cost of synchronizing threads. Then the idea arose that, with minimal hardware support, one could build data structures that remain consistent while several threads work with them simultaneously. For those who would like to know more about this, I recommend this series of publications.
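To give a feel for what that "minimal hardware support" is, here is a minimal sketch of my own (not from the article): a counter incremented by several threads without a mutex, using the compare-and-swap operation that std::atomic exposes and on which lock-free structures are built.

#include <atomic>

// A shared counter incremented without a mutex: the hardware compare-and-swap
// either applies our update atomically or reports that another thread got
// there first, in which case we simply retry with the refreshed value.
std::atomic<int> counter{0};

void increment()
{
    int old = counter.load();
    while (!counter.compare_exchange_weak(old, old + 1))
        ;   // 'old' has been reloaded with the current value, try again
}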
The basic algorithms were developed and then shelved; their time had not yet come. They got a second life when message processing time (latency) became almost more important than raw processing speed. What is this about?
Here is a simple example:
Suppose I have a server that receives messages, processes them and sends back a response. Suppose it received 1 million messages and dealt with them in 2 seconds, that is 2 microseconds per transaction, and this suits me fine. This is what is called throughput, and it is not the right measure when processing messages. Later I am surprised to learn that every client who sent me a message got a response no earlier than a second later, which does not suit them at all. What happened? One possible scenario: the server quickly receives all the messages and puts them into a buffer; then it processes them in parallel, spending a whole second on each, but manages to finish them all together in just 2 seconds; and quickly sends the responses back. This is an example of good overall throughput combined with unacceptably high latency.
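To make the distinction concrete, here is a small illustrative sketch (the names Message, throughput_ns and latency_ns are mine, not from any library): throughput needs only one pair of timestamps around the whole batch, while latency requires a timestamp carried inside every message.

#include <chrono>
#include <cstddef>

using Clock = std::chrono::steady_clock;

// Each message carries the moment it was sent.
struct Message { Clock::time_point sent; /* ... payload ... */ };

// Throughput: total wall time around the whole batch divided by the number of messages.
double throughput_ns(Clock::time_point start, Clock::time_point finish, std::size_t n)
{
    return std::chrono::duration<double, std::nano>(finish - start).count() / n;
}

// Latency: measured per message, from the moment it was sent
// to the moment it is actually handled.
double latency_ns(const Message& m)
{
    return std::chrono::duration<double, std::nano>(Clock::now() - m.sent).count();
}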

You can read more in the presentation of the Herb Sutter interview; he discusses this problem in a slightly different context, but very temperamentally. Intuitively it seems that throughput and latency are the same thing viewed from two sides: the larger the first, the smaller the second. A closer look reveals, however, that they are independent and even anti-correlated. What does this have to do with non-blocking structures? Everything: for latency, any attempt to slow down or stop a thread is fatal. It is easy to put a thread to sleep, but you cannot wake it up yourself; only the operating-system kernel can wake it with a tender kiss, and it does so strictly on schedule and with lunch breaks.
Try explaining to someone that your program, which according to the technical requirements must respond within 200 nanoseconds, has just fallen asleep for 10 milliseconds (a typical figure for *nix systems) and had better not be disturbed. This is where lock-free data structures come to the rescue: they do not require stopping a thread to synchronize with other threads.
We’ll talk about one such structure.

The first attempt

I will work with only one of the structures, boost::lockfree::queue, which implements a unidirectional queue with an arbitrary number of writer and reader threads. The structure comes in two versions: one that allocates memory as needed and therefore has unlimited capacity, and one with a fixed internal buffer. Strictly speaking, neither is truly non-blocking: the first because the system memory allocator is not lock-free, the second because sooner or later the buffer overflows and the writing threads are forced to wait indefinitely until there is room to write. Let's start with the first option; toward the end I will compare it with the results for a fixed buffer.
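For reference, this is roughly how the two variants are declared; the element type int is just an illustration and the sizes are arbitrary.

#include <boost/lockfree/queue.hpp>

// Allocating variant: the constructor argument only pre-allocates internal
// nodes; push() falls back to the system allocator when they run out.
boost::lockfree::queue<int> dynamic_queue(128);

// Fixed-capacity variant: the buffer size is a compile-time constant and
// push() simply returns false when the internal buffer is full.
boost::lockfree::queue<int, boost::lockfree::capacity<128>> fixed_queue;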
I will also add that I am on a 4-core machine running Linux Mint 15.
Let's take the code straight from here and try to run it; here is the result:
boost::lockfree::queue is lockfree
produced 40,000,000 objects.
consumed 40,000,000 objects.
real 0m15.332s
user 1m0.376s
sys 0m0.064s


That is, taking the naive view, about 400 ns per message, which is quite satisfactory. This example passes an int and starts 4 writer and 4 reader threads.
Let's modify the code a little: I want to run an arbitrary number of threads, and I also like to see statistics. What does the distribution look like if the test is run 100 times in a row?


Here it is, and it looks quite reasonable. The X axis shows the total execution time in nanoseconds divided by the number of transmitted messages, the Y axis the number of runs with that value.

And here is the result for a different number of writers / readers:


Things are not so rosy here; any broadening of the distribution suggests that something is not working optimally. In this case the reason is fairly clear: the reading threads in this test never yield control, and as soon as their number approaches the number of cores, the system simply has to suspend them.

The second attempt

Let's make one more improvement to the test: instead of passing a useless int, let the writing thread send the current time with nanosecond precision. The receiving thread can then compute the latency of every message. Build and run:

threads: 1 write, 1 read
failed: 0 pushes, 3267 pops
bandwidth: 177.864 ns
latency: 1.03614e+08 ns

We now also count the number of failed attempts to pop a message from the queue and to push one into it (the push failures will of course always be zero here, since this is the allocating variant).
But what is this? The latency, which we intuitively assumed would be of the same order, around 200 ns, suddenly exceeds 100 milliseconds, half a million times larger! That simply cannot be.
But since we now know the delay of every message, let's plot it and see how it behaves in real time. Here are the results of several identical runs, so that you can see how random the process is:


this is with one writer and one reader thread; and with four of each it looks like this:


What is going on? At some arbitrary moment the system sends part of the reading threads to rest. The queue starts to grow rapidly; messages sit in it waiting to be processed. After a while the situation reverses, fewer writing threads are running than reading ones, and the queue slowly drains. Such oscillations occur with a period ranging from milliseconds to seconds, and the queue works in batch mode: a million messages are written, then a million are read. Overall throughput remains very high, but an individual message can spend several milliseconds in the queue.
What do we do? First of all, let's admit that the test in this form is clearly inadequate: half of the active threads do nothing but push messages into the queue, which simply cannot happen in a real system. In other words, the test is built so that the input traffic deliberately exceeds the machine's capacity.
We have to limit the input traffic, so let's simply insert usleep(0) after each push into the queue; on my machine this gives a delay of 50 μs with good accuracy.
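The throttled writer loop then looks roughly like this (a sketch only; the full test program, with the error counters and the command-line switch for the delay, is at the end of the article):

#include <time.h>
#include <unistd.h>
#include <boost/lockfree/queue.hpp>

// The queue carries the send timestamp itself, so the consumer can compute latency.
boost::lockfree::queue<timespec> queue(65534);

// Writer thread: usleep(0) after every push keeps the input rate realistic
// (about one message per 50 microseconds on my machine).
void producer(unsigned iterations)
{
    timespec t;
    for (unsigned i = 0; i != iterations; ++i) {
        clock_gettime(CLOCK_MONOTONIC, &t);   // the timestamp travels inside the message
        while (!queue.push(t))
            ;                                 // retry if the push did not succeed
        usleep(0);                            // throttle the input traffic
    }
}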


Let's see the result: the red line is the initial test without the delay, the green one is with the delay.
That is a completely different matter; now we can compute meaningful statistics.

Here is the result for several combinations of the numbers of writer and reader threads; to keep an acceptable scale on the X axis, the largest 1% of the samples are discarded:

Note that latency stays confidently within 300 ns and only the distribution tail extends further and further.

And here are the results for one and four writing threads, respectively:


There is a noticeable increase in latency, mainly due to the sharp growth of the tail. Once again we see that four (== number of CPUs) threads continuously spinning idle while burning through their time slices generate a large number of uncontrolled slowdowns. Although the average delay confidently stays within 600 ns, for some tasks this is already on the verge of acceptable, for example if your requirements explicitly state that 99.9% of messages must be delivered within a given time (this has happened to me).
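If the requirement is phrased in terms of percentiles, the collected per-message latencies make the check straightforward; here is a minimal sketch (the function name and the use of nth_element are my choice, not part of the test program):

#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the latency below which the fraction q of all messages falls,
// e.g. q = 0.999 for a "99.9% of messages within X ns" requirement.
double percentile(std::vector<double> samples, double q)
{
    std::size_t k = static_cast<std::size_t>(q * (samples.size() - 1));
    std::nth_element(samples.begin(), samples.begin() + k, samples.end());
    return samples[k];
}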
Also note how much the total execution time has grown, a full 150 times. This demonstrates the claim I made at the very beginning: minimum latency and maximum throughput are not achieved at the same time, a peculiar sort of uncertainty principle.

That is actually everything we managed to extract from the tests, and it is not so little. We measured the latency with good accuracy and showed that in a number of modes the average latency grows by many orders of magnitude, or, more precisely, the very notion of an average loses its meaning.
Let us finally consider the last question.

What about the fixed-capacity queue?

fixed_capacity is the other variant of boost::lockfree::queue, built on an internal buffer of fixed size. On the one hand this avoids calls to the system allocator, on the other hand, once the buffer is full the writing thread will also have to wait. For some kinds of tasks this is completely unacceptable.
Here we will work by the same method. First, taught by experience, let's look at the dynamics of the delays:


The red graph corresponds to the capacity of 128 elements used in the boost example, the green one to the maximum possible 65534.
By the way, the documentation says that the maximum size is 65535; do not believe it, you will get a core dump.

We did not insert an artificial delay, so it is natural that the queue works in batch mode, filling up and draining in large portions. However, the fixed buffer capacity introduces a certain order, and it is clearly visible that an average of the delay at least exists now. Another conclusion, surprising for fans of huge buffers: the size of the buffer does not affect overall throughput in any way. That is, if a latency of 32 microseconds suits you (and for many applications it is quite enough), you can use the fixed_capacity lockfree::queue with a tiny memory footprint and still get excellent speed.
Nevertheless, let's see how this variant behaves in a multi-threaded program:


It was a little unexpected to see such a clear separation into two groups: where the readers' speed exceeds the writers' we get our desired hundreds of nanoseconds; where it is the other way around, the latency jumps up to 30-40 microseconds, which appears to be the context-switch time on my machine. This is the result for a 128-element buffer; for 64K it looks very similar, only the right-hand group creeps far, far to the right, into tens of milliseconds.
Is this good or bad? It depends on the task. On the one hand, we can confidently guarantee that the delay will never exceed 40 μs under any conditions, and that is good; on the other hand, if we need to guarantee some maximum delay below this value, we will have a hard time. Any change in the balance of readers and writers, for example due to a slight change in message processing, can lead to an abrupt change in latency.

Recall, however, that we are generating messages deliberately faster than our system can process them (see the section on the dynamic queue above), so let's again insert a plausible delay:


This is already quite good: the two groups have not merged completely, but the right one has moved so close that the maximum latency does not exceed 600 ns. Take my word for it, the statistics for the large 64K buffer look absolutely the same, not the slightest difference.

Time to draw conclusions

I hope that those with relevant experience will be able to extract something useful from the test results on their own. Here is what I think myself:
  • If you are only interested in throughput, all the variants are roughly equivalent and give an average time on the order of hundreds of nanoseconds per message. The fixed_capacity queue is more lightweight in this case, since it occupies a fixed amount of memory. However, there are applications, a logger for example, where it is critically important to release the writing thread as quickly as possible; there the allocating queue is better, although on the other hand it can consume memory without limit.
  • If latency, the processing time of each individual message, has to be minimized, the situation is more complicated. For applications where the writing thread must never block (loggers), the allocating variant is definitely the right choice. For the case of limited memory, fixed_capacity is the obvious fit; the buffer size will have to be chosen from the traffic statistics.
  • In any case the algorithm is unstable with respect to the intensity of the data stream. When a certain critical threshold is exceeded, the latency jumps by several orders of magnitude and the queue effectively (though not formally) becomes blocking. As a rule, fine-tuning is required to make the system work without falling into the blocking mode.
  • A complete decoupling of the input and output threads is possible only in the allocating variant; this is achieved, however, at the price of uncontrolled memory consumption and uncontrolled long delays.
  • fixed_capacity allows fast data transfer while limiting the maximum latency to some reasonable bound. The fixed_capacity queue itself is essentially a very lightweight structure. The main minus is that the writing threads block if the readers cannot keep up or freeze for any reason. Large buffer sizes, in my opinion, are rarely needed; they merely give transitional dynamics, something between this and the allocating queue.
  • A very unpleasant surprise for me was the strongly negative impact of continuously spinning idle reader threads on the dynamics. Even when the total number of threads is <= the number of CPUs, adding another thread consuming 100% CPU does not improve but worsens the dynamics. It seems that the "big server" strategy, where each important thread gets its own dedicated core, does not always work.
  • Related to this is a problem not mentioned above and still not solved: how to make efficient use of a thread that is waiting for an event. Putting it to sleep fatally ruins your latency karma; using it for other tasks raises the problem of switching between tasks quickly. I think a good approximation to the ideal would be the ability to attach a reading thread to boost::io_service, so that at least rare events could be served efficiently. I would be glad to hear if anyone has ideas; one common compromise is sketched right after this list.
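One common compromise, which the article does not test, is a bounded busy-wait: spin on the empty queue for a short while in the hope that a message arrives, then yield the CPU and accept the latency penalty for that pass. A sketch under these assumptions (the template names and the spin budget are mine):

#include <atomic>
#include <thread>

// Consumer loop: spin briefly on an empty queue before yielding, so we pay
// with CPU for low latency only while traffic is actually flowing.
template <class Queue, class T, class Handler>
void consume(Queue& q, std::atomic<bool>& done, Handler handle)
{
    T item;
    while (!done) {
        int spin = 1000;                       // tuning knob: spin budget per pass
        while (!q.pop(item) && --spin > 0)
            ;                                  // busy-wait while the queue is empty
        if (spin > 0)
            handle(item);                      // got a message within the budget
        else
            std::this_thread::yield();         // queue stayed empty: give up the slice
    }
}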


For those who need the code
#include <boost/thread/thread.hpp>
#include <boost/lockfree/queue.hpp>
#include <boost/atomic.hpp>
#include <atomic>
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

// The template arguments in the original listing were lost (stripped as HTML);
// the types below are a plausible reconstruction.
std::atomic<unsigned> producer_count(0);
std::atomic<unsigned> consumer_count(0);
std::atomic<unsigned> push_fail_count(0);
std::atomic<unsigned> pop_fail_count(0);

// fixed_sized<true> gives the fixed-capacity variant (maximum 65534, see above);
// drop it to get the allocating variant used in the first part of the article.
#if 1
boost::lockfree::queue<timespec, boost::lockfree::fixed_sized<true>> queue(65534);
#else
boost::lockfree::queue<timespec, boost::lockfree::fixed_sized<true>> queue(128);
#endif

unsigned stat_size=0, delay=0;
std::atomic<unsigned long>* stat=0;   // per-message latencies, in nanoseconds
std::atomic<unsigned> idx(0);         // next free slot in stat[]

void producer(unsigned iterations)
{
    timespec t;
    for (unsigned i=0; i != iterations; ++i) {
        ++producer_count;
        clock_gettime(CLOCK_MONOTONIC, &t);   // the message is its own send timestamp
        while (!queue.push(t))
            ++push_fail_count;
        if(delay) usleep(0);                  // optional throttling of the input traffic
    }
}

boost::atomic<bool> done(false);

void consumer(unsigned iterations)
{
    (void)iterations;                         // unused, kept for symmetry with producer()
    timespec t, v;
    while (!done) {
        while (queue.pop(t)) {
            ++consumer_count;
            clock_gettime(CLOCK_MONOTONIC, &v);
            unsigned i=idx++;
            v.tv_sec-=t.tv_sec;
            v.tv_nsec-=t.tv_nsec;
            stat[i]=v.tv_sec*1000000000L+v.tv_nsec;
        }
        ++pop_fail_count;
    }
    // drain whatever is left after the producers have finished
    while (queue.pop(t)) {
        ++consumer_count;
        clock_gettime(CLOCK_MONOTONIC, &v);
        unsigned i=idx++;
        v.tv_sec-=t.tv_sec;
        v.tv_nsec-=t.tv_nsec;
        stat[i]=v.tv_sec*1000000000L+v.tv_nsec;
    }
}

int main(int argc, char* argv[])
{
    boost::thread_group producer_threads, consumer_threads;
    int indexed=0, quiet=0;
    int producer_thread=1, consumer_thread=1;
    int opt;
    while((opt=getopt(argc,argv,"idqr:w:")) !=-1)
    switch(opt) {
        case 'r': consumer_thread=atol(optarg); break;   // number of reader threads
        case 'w': producer_thread=atol(optarg); break;   // number of writer threads
        case 'd': delay=1; break;                        // throttle writers with usleep(0)
        case 'i': indexed=1; break;                      // print sample index with each latency
        case 'q': quiet=1; break;                        // print the summary only
        default : return 1;
    }
    int iterations=6000000/producer_thread/consumer_thread;
    stat_size=iterations*producer_thread*consumer_thread;
    stat=new std::atomic<unsigned long>[stat_size];
    timespec st, fn;
    clock_gettime(CLOCK_MONOTONIC, &st);
    for (int i=0; i != producer_thread; ++i)
        producer_threads.create_thread([=](){ producer(stat_size/producer_thread); });
    for (int i=0; i != consumer_thread; ++i)
        consumer_threads.create_thread([=](){ consumer(stat_size/consumer_thread); });
    producer_threads.join_all();
    done=true;
    consumer_threads.join_all();
    clock_gettime(CLOCK_MONOTONIC, &fn);
    std::cerr << "threads  : " << producer_thread << " write, "
              << consumer_thread << " read" << std::endl;
    std::cerr << "failed   : " << push_fail_count << " pushes, "
              << pop_fail_count << " pops" << std::endl;
    fn.tv_sec-=st.tv_sec;
    fn.tv_nsec-=st.tv_nsec;
    std::cerr << "bandwidth: " << (fn.tv_sec*1e9+fn.tv_nsec)/stat_size << " ns" << std::endl;
    double ct=0;
    for(unsigned i=0; i < stat_size; ++i)
        ct+=stat[i];
    std::cerr << "latency  : " << ct/stat_size << " ns" << std::endl;
    // The tail of the original listing was cut off; the output below is a
    // plausible reconstruction that dumps every latency sample to stdout.
    if(!quiet) {
        if(indexed) for(unsigned i=0; i < stat_size; ++i)
            std::cout << i << " " << stat[i] << std::endl;
        else for(unsigned i=0; i < stat_size; ++i)
            std::cout << stat[i] << std::endl;
    }
    delete[] stat;
    return 0;
}

