Lock-free stack for Windows


    It is customary not to love Windows. Yet the phrase “I haven't read the author's book, but I condemn it anyway” often describes this attitude very well. Despite the deep-rooted contempt for Windows, some things in it are implemented very well, and we would like to write about one of them. Individual parts of WinAPI, although implemented long ago, have for various reasons, and often undeservedly, fallen out of the sight of a wide audience.
    This article covers the OS's built-in implementation of a lock-free stack and compares its performance with cross-platform counterparts.

    So, for quite some time WinAPI has included an implementation of a non-blocking stack based on a singly linked list (Interlocked Singly Linked Lists), abbreviated SList. Operations for initializing such a list and stack primitives over it are provided. Without going into the implementation details of SList, Microsoft only states that it uses a non-blocking algorithm to provide atomic synchronization, boost performance, and avoid lock problems.
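
    For reference, the raw API looks roughly like this (a minimal sketch; the my_item payload type and the slist_demo function are our own illustration, and the alignment and layout requirements it follows are discussed below):

    	#include <windows.h>
    	#include <malloc.h>

    	// Illustrative payload type; the SLIST_ENTRY must be the first field
    	struct my_item
    	{
    		SLIST_ENTRY entry;
    		int         payload;
    	};

    	void slist_demo()
    	{
    		// The list header (like every entry) must be aligned
    		// on a MEMORY_ALLOCATION_ALIGNMENT boundary
    		PSLIST_HEADER head = (PSLIST_HEADER)_aligned_malloc(
    			sizeof(SLIST_HEADER), MEMORY_ALLOCATION_ALIGNMENT);
    		InitializeSListHead(head);

    		my_item* item = (my_item*)_aligned_malloc(
    			sizeof(my_item), MEMORY_ALLOCATION_ALIGNMENT);
    		item->payload = 42;
    		InterlockedPushEntrySList(head, &item->entry);

    		my_item* popped = (my_item*)InterlockedPopEntrySList(head); // NULL if empty
    		if (popped)
    			_aligned_free(popped);
    		_aligned_free(head);
    	}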

    Non-blocking singly linked lists can, of course, be implemented by hand, and Maxim Khizhinsky (khizmax) has written about this in great detail in his monumental series of articles on lock-free algorithms on Habr. However, before Windows 8 there was no 128-bit CAS operation in WinAPI, which sometimes created problems when implementing such algorithms in 64-bit applications. SList thus helps to solve this problem.
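
    For completeness, here is roughly what the 128-bit CAS documented for Windows 8 and later looks like (a minimal sketch; the cas128 wrapper and the g_dest variable are our own illustration):

    	#include <windows.h>

    	// The destination of a 128-bit CAS must be 16-byte aligned
    	__declspec(align(16)) volatile LONG64 g_dest[2] = { 0, 0 };

    	bool cas128(LONG64 newLow, LONG64 newHigh, LONG64* expected /* [low, high] */)
    	{
    		// Atomically: if g_dest equals *expected, store {newLow, newHigh}
    		// and return true; otherwise *expected receives the current value
    		return InterlockedCompareExchange128(g_dest, newHigh, newLow, expected) != 0;
    	}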

    Implementation


    A peculiarity of the SList implementation is the requirement that memory for list items be aligned on a MEMORY_ALLOCATION_ALIGNMENT boundary. Similar requirements are imposed on other interlocked operations in WinAPI. For us, this means using _aligned_malloc / _aligned_free when working with the memory of list items.

    Another feature is the requirement that the pointer to the next list item, of type SLIST_ENTRY, be located at the very beginning of the structure: before our own fields.

    The following is an implementation of a C++ template that wraps the native WinAPI functions for working with SList:

    C++ Template Code
    	#include <windows.h>
    	#include <malloc.h>
    	#include <new>

    	template <typename T>
    	class SList
    	{
    	public:
    		SList()
    		{
    			// Let Windows initialize an SList head
    			m_stack_head = (PSLIST_HEADER)_aligned_malloc(sizeof(SLIST_HEADER),
    				MEMORY_ALLOCATION_ALIGNMENT);
    			InitializeSListHead(m_stack_head); //UPD: 22.05.2014, thx to @gridem
    		}
    		~SList()
    		{
    			clear();
    			_aligned_free(m_stack_head);
    		}
    		bool push(const T& obj)
    		{
    			// Allocate an SList node
    			node* pNode = alloc_node();
    			if (!pNode)
    				return false;
    			// Call the object's copy constructor
    			init_obj(&pNode->m_obj, obj);
    			// Push the node into the stack
    			InterlockedPushEntrySList(m_stack_head, &pNode->m_slist_entry);
    			return true;
    		}
    		bool pop(T& obj)
    		{
    			// Pop an SList node from the stack
    			node* pNode = (node*)InterlockedPopEntrySList(m_stack_head);
    			if (!pNode)
    				return false;
    			// Retrieve the node's data
    			obj = pNode->m_obj;
    			// Call the destructor
    			free_obj(&pNode->m_obj);
    			// Free the node's memory
    			free_node(pNode);
    			return true;
    		}
    		void clear()
    		{
    			for (;;)
    			{
    				// Pop every SList node from the stack
    				node* pNode = (node*)InterlockedPopEntrySList(m_stack_head);
    				if (!pNode)
    					break;
    				// Call the destructor
    				free_obj(&pNode->m_obj);
    				// Free the node's memory
    				free_node(pNode);
    			}
    		}
    	private:
    		PSLIST_HEADER m_stack_head;
    		struct node
    		{
    			// The SList entry must be the first field
    			SLIST_ENTRY m_slist_entry; 
    			// User type follows
    			T m_obj;
    		};
    		node* alloc_node()
    		{
    			return (node*)_aligned_malloc(sizeof(node), MEMORY_ALLOCATION_ALIGNMENT);
    		}
    		void free_node(node* pNode)
    		{
    			_aligned_free(pNode);
    		}
    		T* init_obj(T* p, const T& init)
    		{
    			return new (static_cast<void*>(p)) T(init);
    		}
    		void free_obj(T* p)
    		{
    			p->~T();
    		}
    	};
    


    Performance test


    To test the algorithm, a standard “producers and consumers” test was used. In addition, on each test run we varied the number of consumer and producer threads. The total number of tasks changed along with it, since each “producer”, by the test conditions, always produces the same number of tasks (iterations), equal to 1 million in this case. Thus, with N producer threads, the total number of jobs was N * 1M.

    SList Test Code
    class test
    {
    private:
    	static UMS::SList<int64_t> stack;
    public:
    	static const char* get_name() { return "MS SList"; }
    	static void producer(void)
    	{
    		for (int i = 0; i != iterations; ++i)
    		{
    			if (!stack.push(++producer_count))
    				return;
    		}
    	}
    	static void consumer(void)
    	{
    		int64_t value;
    		while (WaitForSingleObject(hEvtDone, 10) != WAIT_OBJECT_0)
    		{
    			while (stack.pop(value))
    			{
    				++consumer_count;
    			}
    		}
    		while (stack.pop(value))
    		{
    			++consumer_count;
    		}
    	}
    };
    

    To keep consumer threads from running idle and from falling into a guaranteed sleep, we used Windows synchronization objects of the Event type, so that the consumers would drain the stack after the producers finished their work. The “consumers” start simultaneously with the “producers”, and by the time the producers stop and we signal the hEvtDone event, they have already managed to pick some of the tasks off the stack.
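
    The creation of hEvtDone is not shown in the listings; a minimal sketch, assuming a manual-reset event (which is what the ResetEvent/SetEvent pair in the driver below implies):

    	// Manual-reset (TRUE), initially non-signaled (FALSE): once signaled,
    	// the event stays set, so every waiting consumer thread wakes up
    	HANDLE hEvtDone = CreateEvent(NULL, TRUE, FALSE, NULL);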

    Below is a function that calls our test with the required number of threads:

    How we call the test
    template <typename T>
    void run_test(int producers,   // Number of producer threads
                  int consumers    // Number of consumer threads
    			 )
    {
    	using namespace std;
    	boost::thread_group producer_threads, consumer_threads;
    	// Initiate a timer to measure performance
    	boost::timer::cpu_timer timer;
    	cout << T::get_name() << "\t" << producers << "\t" << consumers << "\t";
    	// Reset the counters after the previous test
    	producer_count = consumer_count = 0;
    	done = false;
    	ResetEvent(hEvtDone);
    	// Start all the producer threads with a given thread proc
    	for (int i = 0; i != producers; ++i)
    		producer_threads.create_thread(T::producer);
    	// Start all the consumer threads with a given thread proc
    	for (int i = 0; i != consumers; ++i)
    		consumer_threads.create_thread(T::consumer);
    	// Waiting for the producers to complete
    	producer_threads.join_all();
    	done = true;
    	SetEvent(hEvtDone);
    	// Waiting for the consumers to complete
    	consumer_threads.join_all();
    	// Report the time of execution
    	auto nanoseconds = boost::chrono::nanoseconds(timer.elapsed().user + timer.elapsed().system);
    	auto seconds = boost::chrono::duration_cast<boost::chrono::seconds>(nanoseconds);
    	auto time_per_item = nanoseconds.count() / producer_count;
    	cout << time_per_item << "\t" << seconds.count() << endl;
    }
    

    The test was run under the following conditions:
    • OS: Windows 8 64-bit
    • CPU: 4x Intel Core i7-3667U @ 2.00GHz
    • RAM: 8GB
    • Compiler: Microsoft C/C++ Optimizing Compiler Version 18.00.21005.1
    • Configuration: Release, Static Runtime (/MT), Optimize Speed (/Ox), x64 Architecture
    • boost: version 1.55
    • libcds: version 1.5.0



    Varying the parameters in two dimensions (consumers, producers) gives us the function t(p, c), whose graph is shown in the image on the left.

    So as not to count zeros in the results by eye, instead of reporting the number of tasks per second we give the time to complete one task in nanoseconds, calculated as the total test time divided by the total number of tasks (for example, 8 producers at 1M tasks each, finished in 4 seconds, would give 4·10⁹ ns / 8·10⁶ tasks = 500 ns per task). The lower this value, the faster the algorithm works.

    The number of threads of each type was varied over the sequence:
    int NumThreads[] = { 2, 4, 8, 12, 16, 20, 24, 32, 36, 48, 56, 64, 72 };


    Pay attention to this surface. The algorithm speeds up significantly when the number of consumers is small (the area of the plot painted green). A further increase in the number of threads of both types does not lead to a noticeable change in speed; the indicator “floats” somewhat within a narrow corridor, but the graph keeps its calm orange tone.

    If we look at the same graph from a different angle, this border is even more clearly visible. In the image on the right, a green-blue strip is clearly distinguishable, marking the entire region with four “consumers” and an arbitrary number of “producers”; four, incidentally, coincides with the number of cores in the experiment. It will be shown below that the other implementations exhibit similar dynamics, which suggests that the algorithm used by Microsoft is similar to those used in the third-party libraries.

    It is gratifying to see the lock-free approach shine here in all its glory: it is hard to imagine 72 (+72) threads, with 1 million tasks each, hanging on a lock. But then, articles about lock-free usually start with exactly that point.

    Comparison


    An identical test was run on the same computer for two other implementations of non-blocking containers, taken from the well-known libraries boost::lockfree and libcds, in the following loop:

    	int NumThreads[] = { 2, 4, 8, 12, 16, 20, 24, 32, 36, 48, 56, 64, 72 };
    	for (int p : NumThreads)
    	for (int c : NumThreads)
    	{
    		// Three runs per point; the reported results are the average
    		run_test<test>(p, c);
    		run_test<test>(p, c);
    		run_test<test>(p, c);
    	}
    


    Despite a certain similarity between the implementations, below are the test code for each of these libraries and the results each of them produced.

    Boost.Lockfree Library


    This library was included in boost relatively recently. It consists of three containers: a queue, a stack, and a ring buffer. Their classes are convenient to use; there is documentation and even examples.
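
    For reference, a minimal usage sketch (the element type and the capacity argument of 128 pre-allocated nodes are our own illustration):

    	#include <boost/lockfree/stack.hpp>
    	#include <cstdint>

    	boost::lockfree::stack<int64_t> s(128); // reserve 128 nodes up front

    	void demo()
    	{
    		s.push(42);
    		int64_t v;
    		if (s.pop(v))
    		{
    			// v == 42
    		}
    	}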

    Below is the code for the analogous test using boost::lockfree::stack.
    Boost test
    class test
    {
    private:
    	static ::boost::lockfree::stack<int64_t> stack;
    public:
    	static const char* get_name() { return "boost::lockfree"; }
    	static void producer(void)
    	{
    		for (int i = 0; i != iterations; ++i)
    		{
    			while (!stack.push(++producer_count));
    		}
    	}
    	static void consumer(void)
    	{
    		int64_t value;
    		while (WaitForSingleObject(hEvtDone, 10) != WAIT_OBJECT_0)
    		{
    			while (stack.pop(value))
    			{
    				++consumer_count;
    			}
    		}
    		while (stack.pop(value))
    		{
    			++consumer_count;
    		}
    	}
    };
    


    We present the results of this test in the form of graphs:



    Libcds library


    This library is often referenced by khizmax in his articles. Its consumer qualities aside, it seemed to us somewhat cumbersome and poorly documented (most of the information we managed to find was here on Habr). In addition, every thread that uses its lock-free containers must be attached to the library's “engine” (probably because of TLS?), then detached, and the Hazard Pointer singleton must also be initialized somewhere.

    Despite the unthinkable number of lock-free containers implemented for every taste, this library can hardly be called beautiful: you need to get used to it.
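
    As a rough sketch of that ceremony (header names and the exact call order are assumptions based on the libcds documentation for version 1.5.0):

    	#include <cds/init.h>
    	#include <cds/gc/hp.h>

    	int main()
    	{
    		cds::Initialize();                           // start the libcds "engine"
    		{
    			cds::gc::HP hpGC;                        // the Hazard Pointer singleton
    			cds::threading::Manager::attachThread(); // attach the main thread
    			// ... work with the lock-free containers ...
    			cds::threading::Manager::detachThread();
    		}
    		cds::Terminate();
    		return 0;
    	}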

    Below is the code for the analogous test using cds::container::TreiberStack:
    Libcds test
    class xxxstack : public cds::container::TreiberStack<cds::gc::HP, int64_t>
    {
    public:
    	cds::gc::HP hzpGC;
    	xxxstack()
    	{
    		cds::Initialize(0);
    	}
    };
    class test
    {
    private:
    	static xxxstack stack;
    public:
    	static const char* get_name() { return "libcds tb"; }
    	static void producer(void)
    	{
    		// Attach the thread to libcds infrastructure
    		cds::threading::Manager::attachThread();
    		for (int i = 0; i != iterations; ++i)
    		{
    			//int value = ++producer_count;
    			while (!stack.push(++producer_count));
    		}
    		// Detach thread when terminating
    		cds::threading::Manager::detachThread();
    	}
    	static void consumer(void)
    	{
    		// Attach the thread to libcds infrastructure
    		cds::threading::Manager::attachThread();
    		int64_t value;
    		while (WaitForSingleObject(hEvtDone, 10) != WAIT_OBJECT_0)
    		{
    			while (stack.pop(value))
    			{
    				++consumer_count;
    			}
    		}
    		while (stack.pop(value))
    		{
    			++consumer_count;
    		}
    		// Detach thread when terminating
    		cds::threading::Manager::detachThread();
    	}
    };
    


    We present the results of this test in the form of graphs:



    Performance comparison


    Even though SList is a native solution while the other two are “almost” cross-platform, we believe the comparison given below is valid, since all tests were carried out under the same conditions and demonstrate the behavior of the libraries under those conditions.

    Because the number of tasks in the test grows linearly with the number of threads, a full run over all the combinations takes a decent amount of time. To stabilize the result, three passes were made, so the results presented above are averages over three runs.

    In the three-dimensional graphs above, the diagonal (argument values {p = c}) looks almost like a straight line, so for a visual comparison of the three libraries we selected the results along this diagonal.

    On the left is what we got.

    It can be seen that boost loses to the other two implementations, although it demonstrates greater resilience to changes in the input parameters.

    The libcds and SList implementations differ by less, but they do so across the entire input range.

    Conclusions


    It must be acknowledged that this time Microsoft has produced a very good implementation (albeit of just one container) that can successfully compete with the algorithms from well-known libraries. Although the implementation described is not cross-platform, it may well be useful to Windows application developers.

    You can download an archive with the source code of the Visual Studio project here

    Used materials

    The picture at the beginning of the article
    MSDN description of SList (Interlocked Singly Linked Lists)
    The boost.lockfree library
    The libcds library
    Lock-free data structures. Stack evolution
    Information on the implementation details of SList

    UPD:
    At the request of mickey99, we ran one more test: this time we took an ordinary std::stack, access to which was guarded by a regular std::mutex.
    Mutex test
    class test
    {
    private:
    	static std::stack<int64_t> stack;
    	static std::mutex lock;
    public:
    	static const char* get_name() { return "mutex"; }
    	static void producer(void)
    	{
    		for (int i = 0; i != iterations; ++i)
    		{
    			lock.lock();
    			stack.push(++producer_count);
    			lock.unlock();
    		}
    	}
    	static void consumer(void)
    	{
    		int64_t value;
    		while (WaitForSingleObject(hEvtDone, 10) != WAIT_OBJECT_0)
    		{
    			lock.lock();
    			if (!stack.empty())
    			{
    				value = stack.top();
    				stack.pop();
    				++consumer_count;
    			}
    			lock.unlock();
    		}
    		bool isEmpty = false;
    		while (!isEmpty)
    		{
    			lock.lock();
    			if (!(isEmpty = stack.empty()))
    			{
    				value = stack.top();
    				stack.pop();
    				++consumer_count;
    			}
    			lock.unlock();
    		}
    	}
    };
    


    Let's say right away: we had to wait a long, a very long time. The number of tasks per thread was then reduced from 1 million to 100K, which of course makes the results less accurate (though hardly much precision is required with numbers like these), and the set of thread counts was changed so that there were fewer points to compute:
    int NumThreads[] = { 2, 4, 8, 16, 32, 64};


    Here are the results. The outcome is very telling: with more than 4 threads of either type, the time per task jumps by orders of magnitude. Such a line is hard to plot on the same chart as the first three; a logarithmic scale would probably make it clearer.




