Lock-free stack for Windows


    It is customary not to love Windows. Yet the phrase “I haven't read the author's book, but I condemn it anyway” often describes this attitude very well. Despite the deep-rooted contempt for Windows, some things in it are implemented very well, and we would like to write about one of them. Individual parts of WinAPI, although implemented long ago, have for various reasons, and often undeservedly, fallen out of the sight of a wide audience.
    This article covers the OS's built-in implementation of a lock-free stack and compares its performance with cross-platform counterparts.

    So, for quite some time WinAPI has included an implementation of a non-blocking stack based on a singly linked list (Interlocked Singly Linked Lists), abbreviated SList. Operations for initializing such a list and stack primitives over it are provided. Without going into the implementation details of SList, Microsoft only states that it uses a non-blocking algorithm to provide atomic synchronization, boost performance, and avoid lock problems.
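
    For reference, the raw API looks roughly like this (a minimal sketch; the my_item payload type and the slist_demo function are our own illustration, and the alignment and layout requirements it follows are discussed below):

    	#include <windows.h>
    	#include <malloc.h>

    	// Illustrative payload type; the SLIST_ENTRY must be the first field
    	struct my_item
    	{
    		SLIST_ENTRY entry;
    		int         payload;
    	};

    	void slist_demo()
    	{
    		// The list header (like every entry) must be aligned
    		// on a MEMORY_ALLOCATION_ALIGNMENT boundary
    		PSLIST_HEADER head = (PSLIST_HEADER)_aligned_malloc(
    			sizeof(SLIST_HEADER), MEMORY_ALLOCATION_ALIGNMENT);
    		InitializeSListHead(head);

    		my_item* item = (my_item*)_aligned_malloc(
    			sizeof(my_item), MEMORY_ALLOCATION_ALIGNMENT);
    		item->payload = 42;
    		InterlockedPushEntrySList(head, &item->entry);

    		my_item* popped = (my_item*)InterlockedPopEntrySList(head); // NULL if empty
    		if (popped)
    			_aligned_free(popped);
    		_aligned_free(head);
    	}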

    Non-blocking singly linked lists can, of course, be implemented by hand, and Maxim Khizhinsky (khizmax) has written about this in great detail in his monumental series of articles on lock-free algorithms on Habr. However, before Windows 8 there was no 128-bit CAS operation in WinAPI, which sometimes created problems when implementing such algorithms in 64-bit applications. SList thus helps to solve this problem.
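
    For completeness, here is roughly what the 128-bit CAS documented for Windows 8 and later looks like (a minimal sketch; the cas128 wrapper and the g_dest variable are our own illustration):

    	#include <windows.h>

    	// The destination of a 128-bit CAS must be 16-byte aligned
    	__declspec(align(16)) volatile LONG64 g_dest[2] = { 0, 0 };

    	bool cas128(LONG64 newLow, LONG64 newHigh, LONG64* expected /* [low, high] */)
    	{
    		// Atomically: if g_dest equals *expected, store {newLow, newHigh}
    		// and return true; otherwise *expected receives the current value
    		return InterlockedCompareExchange128(g_dest, newHigh, newLow, expected) != 0;
    	}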

    Implementation


    A peculiarity of the SList implementation is the requirement that memory for list items be aligned on a MEMORY_ALLOCATION_ALIGNMENT boundary. Similar requirements are imposed on other interlocked operations in WinAPI. For us, this means using _aligned_malloc / _aligned_free when working with the memory of list items.

    Another feature is the requirement that the pointer to the next list item, of type SLIST_ENTRY, be located at the very beginning of the structure: before our own fields.

    The following is an implementation of a C++ template that wraps the native WinAPI functions for working with SList:

    C++ Template Code
    	#include <windows.h>
    	#include <malloc.h>
    	#include <new>

    	template <typename T>
    	class SList
    	{
    	public:
    		SList()
    		{
    			// Let Windows initialize an SList head
    			m_stack_head = (PSLIST_HEADER)_aligned_malloc(sizeof(SLIST_HEADER),
    				MEMORY_ALLOCATION_ALIGNMENT);
    			InitializeSListHead(m_stack_head); //UPD: 22.05.2014, thx to @gridem
    		}
    		~SList()
    		{
    			clear();
    			_aligned_free(m_stack_head);
    		}
    		bool push(const T& obj)
    		{
    			// Allocate an SList node
    			node* pNode = alloc_node();
    			if (!pNode)
    				return false;
    			// Call the object's copy constructor
    			init_obj(&pNode->m_obj, obj);
    			// Push the node into the stack
    			InterlockedPushEntrySList(m_stack_head, &pNode->m_slist_entry);
    			return true;
    		}
    		bool pop(T& obj)
    		{
    			// Pop an SList node from the stack
    			node* pNode = (node*)InterlockedPopEntrySList(m_stack_head);
    			if (!pNode)
    				return false;
    			// Retrieve the node's data
    			obj = pNode->m_obj;
    			// Call the destructor
    			free_obj(&pNode->m_obj);
    			// Free the node's memory
    			free_node(pNode);
    			return true;
    		}
    		void clear()
    		{
    			for (;;)
    			{
    				// Pop every SList node from the stack
    				node* pNode = (node*)InterlockedPopEntrySList(m_stack_head);
    				if (!pNode)
    					break;
    				// Call the destructor
    				free_obj(&pNode->m_obj);
    				// Free the node's memory
    				free_node(pNode);
    			}
    		}
    	private:
    		PSLIST_HEADER m_stack_head;
    		struct node
    		{
    			// The SList entry must be the first field
    			SLIST_ENTRY m_slist_entry; 
    			// User type follows
    			T m_obj;
    		};
    		node* alloc_node()
    		{
    			return (node*)_aligned_malloc(sizeof(node), MEMORY_ALLOCATION_ALIGNMENT);
    		}
    		void free_node(node* pNode)
    		{
    			_aligned_free(pNode);
    		}
    		T* init_obj(T* p, const T& init)
    		{
    			return new (static_cast<void*>(p)) T(init);
    		}
    		void free_obj(T* p)
    		{
    			p->~T();
    		}
    	};
    


    Performance test


    To test the algorithm, a standard “producers and consumers” test was used. In addition, on each test run we varied the number of consumer and producer threads. The total number of tasks changed along with it, since each “producer”, by the test conditions, always produces the same number of tasks (iterations), equal to 1 million in this case. Thus, with N producer threads, the total number of jobs was N * 1M.

    SList Test Code
    class test
    {
    private:
    	static UMS::SList<int64_t> stack;
    public:
    	static const char* get_name() { return "MS SList"; }
    	static void producer(void)
    	{
    		for (int i = 0; i != iterations; ++i)
    		{
    			if (!stack.push(++producer_count))
    				return;
    		}
    	}
    	static void consumer(void)
    	{
    		int64_t value;
    		while (WaitForSingleObject(hEvtDone, 10) != WAIT_OBJECT_0)
    		{
    			while (stack.pop(value))
    			{
    				++consumer_count;
    			}
    		}
    		while (stack.pop(value))
    		{
    			++consumer_count;
    		}
    	}
    };
    

    To keep consumer threads from running idle and from falling into a guaranteed sleep, we used Windows synchronization objects of the Event type, so that the consumers would drain the stack after the producers finished their work. The “consumers” start simultaneously with the “producers”, and by the time the producers stop and we signal the hEvtDone event, they have already managed to pick some of the tasks off the stack.
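
    The creation of hEvtDone is not shown in the listings; a minimal sketch, assuming a manual-reset event (which is what the ResetEvent/SetEvent pair in the driver below implies):

    	// Manual-reset (TRUE), initially non-signaled (FALSE): once signaled,
    	// the event stays set, so every waiting consumer thread wakes up
    	HANDLE hEvtDone = CreateEvent(NULL, TRUE, FALSE, NULL);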

    Below is a function that calls our test with the required number of threads:

    How we call the test
    template <typename T>
    void run_test(int producers,   // Number of producer threads
                  int consumers    // Number of consumer threads
    			 )
    {
    	using namespace std;
    	boost::thread_group producer_threads, consumer_threads;
    	// Initiate a timer to measure performance
    	boost::timer::cpu_timer timer;
    	cout << T::get_name() << "\t" << producers << "\t" << consumers << "\t";
    	// Reset the counters after the previous test
    	producer_count = consumer_count = 0;
    	done = false;
    	ResetEvent(hEvtDone);
    	// Start all the producer threads with a given thread proc
    	for (int i = 0; i != producers; ++i)
    		producer_threads.create_thread(T::producer);
    	// Start all the consumer threads with a given thread proc
    	for (int i = 0; i != consumers; ++i)
    		consumer_threads.create_thread(T::consumer);
    	// Waiting for the producers to complete
    	producer_threads.join_all();
    	done = true;
    	SetEvent(hEvtDone);
    	// Waiting for the consumers to complete
    	consumer_threads.join_all();
    	// Report the time of execution
    	auto nanoseconds = boost::chrono::nanoseconds(timer.elapsed().user + timer.elapsed().system);
    	auto seconds = boost::chrono::duration_cast<boost::chrono::seconds>(nanoseconds);
    	auto time_per_item = nanoseconds.count() / producer_count;
    	cout << time_per_item << "\t" << seconds.count() << endl;
    }
    

    The test was run under the following conditions:
    • OS: Windows 8 64-bit
    • CPU: 4x Intel Core i7-3667U @ 2.00GHz
    • RAM: 8GB
    • Compiler: Microsoft C/C++ Optimizing Compiler Version 18.00.21005.1
    • Configuration: Release, Static Runtime (/MT), Optimize Speed (/Ox), x64 Architecture
    • boost: version 1.55
    • libcds: version 1.5.0



    Varying the parameters in two dimensions (consumers, producers) gives us the function t(p, c), whose graph is shown in the image on the left.

    So as not to count zeros in the results by eye, instead of reporting the number of tasks per second we give the time to complete one task in nanoseconds, calculated as the total test time divided by the total number of tasks (for example, 8 producers at 1M tasks each, finished in 4 seconds, would give 4·10⁹ ns / 8·10⁶ tasks = 500 ns per task). The lower this value, the faster the algorithm works.

    The number of threads of each type was varied over the sequence:
    int NumThreads[] = { 2, 4, 8, 12, 16, 20, 24, 32, 36, 48, 56, 64, 72 };


    Pay attention to this surface. The algorithm speeds up significantly when the number of consumers is small (the area of the plot painted green). A further increase in the number of threads of both types does not lead to a noticeable change in speed; the indicator “floats” somewhat within a narrow corridor, but the graph keeps its calm orange tone.

    If we look at the same graph from a different angle, this border is even more clearly visible. In the image on the right, a green-blue strip is clearly distinguishable, marking the entire region with four “consumers” and an arbitrary number of “producers”; four, incidentally, coincides with the number of cores in the experiment. It will be shown below that the other implementations exhibit similar dynamics, which suggests that the algorithm used by Microsoft is similar to those used in the third-party libraries.

    It is gratifying to see the lock-free approach shine here in all its glory: it is hard to imagine 72 (+72) threads, with 1 million tasks each, hanging on a lock. But then, articles about lock-free usually start with exactly that point.

    Comparison


    An identical test was run on the same computer for two other implementations of non-blocking containers, taken from the well-known libraries boost::lockfree and libcds, in the following loop:

    	int NumThreads[] = { 2, 4, 8, 12, 16, 20, 24, 32, 36, 48, 56, 64, 72 };
    	for (int p : NumThreads)
    	for (int c : NumThreads)
    	{
    		// Three runs per point; the reported results are the average
    		run_test<test>(p, c);
    		run_test<test>(p, c);
    		run_test<test>(p, c);
    	}
    


    Despite a certain similarity between the implementations, below are the test code for each of these libraries and the results each of them produced.

    Boost.Lockfree Library


    This library was included in boost relatively recently. It consists of three containers: a queue, a stack, and a ring buffer. Their classes are convenient to use; there is documentation and even examples.
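
    For reference, a minimal usage sketch (the element type and the capacity argument of 128 pre-allocated nodes are our own illustration):

    	#include <boost/lockfree/stack.hpp>
    	#include <cstdint>

    	boost::lockfree::stack<int64_t> s(128); // reserve 128 nodes up front

    	void demo()
    	{
    		s.push(42);
    		int64_t v;
    		if (s.pop(v))
    		{
    			// v == 42
    		}
    	}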

    Below is the code for the analogous test using boost::lockfree::stack.
    Boost test
    class test
    {
    private:
    	static ::boost::lockfree::stack<int64_t> stack;
    public:
    	static const char* get_name() { return "boost::lockfree"; }
    	static void producer(void)
    	{
    		for (int i = 0; i != iterations; ++i)
    		{
    			while (!stack.push(++producer_count));
    		}
    	}
    	static void consumer(void)
    	{
    		int64_t value;
    		while (WaitForSingleObject(hEvtDone, 10) != WAIT_OBJECT_0)
    		{
    			while (stack.pop(value))
    			{
    				++consumer_count;
    			}
    		}
    		while (stack.pop(value))
    		{
    			++consumer_count;
    		}
    	}
    };
    


    We present the results of this test in the form of graphs:



    Libcds library


    This library is often referenced by khizmax in his articles. Its consumer qualities aside, it seemed to us somewhat cumbersome and poorly documented (most of the information we managed to find was here on Habr). In addition, every thread that uses its lock-free containers must be attached to the library's “engine” (probably because of TLS?), then detached, and the Hazard Pointer singleton must also be initialized somewhere.

    Despite the unthinkable number of lock-free containers implemented for every taste, this library can hardly be called beautiful: you need to get used to it.
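
    As a rough sketch of that ceremony (header names and the exact call order are assumptions based on the libcds documentation for version 1.5.0):

    	#include <cds/init.h>
    	#include <cds/gc/hp.h>

    	int main()
    	{
    		cds::Initialize();                           // start the libcds "engine"
    		{
    			cds::gc::HP hpGC;                        // the Hazard Pointer singleton
    			cds::threading::Manager::attachThread(); // attach the main thread
    			// ... work with the lock-free containers ...
    			cds::threading::Manager::detachThread();
    		}
    		cds::Terminate();
    		return 0;
    	}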

    Below is the code for the analogous test using cds::container::TreiberStack:
    Libcds test
    class xxxstack : public cds::container::TreiberStack<cds::gc::HP, int64_t>
    {
    public:
    	cds::gc::HP hzpGC;
    	xxxstack()
    	{
    		cds::Initialize(0);
    	}
    };
    class test
    {
    private:
    	static xxxstack stack;
    public:
    	static const char* get_name() { return "libcds tb"; }
    	static void producer(void)
    	{
    		// Attach the thread to libcds infrastructure
    		cds::threading::Manager::attachThread();
    		for (int i = 0; i != iterations; ++i)
    		{
    			//int value = ++producer_count;
    			while (!stack.push(++producer_count));
    		}
    		// Detach thread when terminating
    		cds::threading::Manager::detachThread();
    	}
    	static void consumer(void)
    	{
    		// Attach the thread to libcds infrastructure
    		cds::threading::Manager::attachThread();
    		int64_t value;
    		while (WaitForSingleObject(hEvtDone, 10) != WAIT_OBJECT_0)
    		{
    			while (stack.pop(value))
    			{
    				++consumer_count;
    			}
    		}
    		while (stack.pop(value))
    		{
    			++consumer_count;
    		}
    		// Detach thread when terminating
    		cds::threading::Manager::detachThread();
    	}
    };
    


    We present the results of this test in the form of graphs:



    Performance comparison


    Even though SList is a native solution while the other two are “almost” cross-platform, we believe the comparison given below is valid, since all tests were carried out under the same conditions and demonstrate the behavior of the libraries under those conditions.

    Because the number of tasks in the test grows linearly with the number of threads, a full run over all the combinations takes a decent amount of time. To stabilize the result, three passes were made, so the results presented above are averages over three runs.

    In the three-dimensional graphs above, the diagonal (argument values {p = c}) looks almost like a straight line, so for a visual comparison of the three libraries we selected the results along this diagonal.

    On the left is what we got.

    It can be seen that boost loses to the other two implementations, although it demonstrates greater resilience to changes in the input parameters.

    The libcds and SList implementations differ by less, but they do so across the entire input range.

    Conclusions


    It must be acknowledged that this time Microsoft has produced a very good implementation (albeit of just one container) that can successfully compete with the algorithms from well-known libraries. Although the implementation described is not cross-platform, it may well be useful to Windows application developers.

    You can download an archive with the source code of the Visual Studio project here

    Used materials

    The picture at the beginning of the article
    MSDN description of SList (Interlocked Singly Linked Lists)
    The boost.lockfree library
    The libcds library
    Lock-free data structures. Stack evolution
    Information on the implementation details of SList

    UPD:
    At the request of mickey99, we ran one more test: this time we took an ordinary std::stack, access to which was guarded by a regular std::mutex.
    Mutex test
    class test
    {
    private:
    	static std::stack<int64_t> stack;
    	static std::mutex lock;
    public:
    	static const char* get_name() { return "mutex"; }
    	static void producer(void)
    	{
    		for (int i = 0; i != iterations; ++i)
    		{
    			lock.lock();
    			stack.push(++producer_count);
    			lock.unlock();
    		}
    	}
    	static void consumer(void)
    	{
    		int64_t value;
    		while (WaitForSingleObject(hEvtDone, 10) != WAIT_OBJECT_0)
    		{
    			lock.lock();
    			if (!stack.empty())
    			{
    				value = stack.top();
    				stack.pop();
    				++consumer_count;
    			}
    			lock.unlock();
    		}
    		bool isEmpty = false;
    		while (!isEmpty)
    		{
    			lock.lock();
    			if (!(isEmpty = stack.empty()))
    			{
    				value = stack.top();
    				stack.pop();
    				++consumer_count;
    			}
    			lock.unlock();
    		}
    	}
    };
    


    Let's say right away: we had to wait a long, a very long time. The number of tasks per thread was then reduced from 1 million to 100K, which of course makes the results less accurate (though hardly much precision is required with numbers like these), and the set of thread counts was changed so that there were fewer points to compute:
    int NumThreads[] = { 2, 4, 8, 16, 32, 64};


    Here are the results. The outcome is very telling: with more than 4 threads of either type, the time per task jumps by orders of magnitude. Such a line is hard to plot on the same chart as the first three; a logarithmic scale would probably make it clearer.




