Prefer SRW locks to critical sections
This article explains why the Slim Reader/Writer Lock (SRWL) is often preferred over classic critical sections when developing Win32 applications.
Lightweight
An SRWL object occupies only 8 bytes of memory on the x64 architecture, while a critical section takes 40 bytes. A critical section requires initialization and deinitialization through calls to OS functions, while an SRWL is initialized by simply assigning the constant SRWLOCK_INIT to it, and there is no cleanup cost at all. Using SRWL produces more compact code and uses less RAM at run time.
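A minimal sketch of the difference (the struct names are illustrative; sizes assume x64):

```cpp
#include <windows.h>

struct WithCriticalSection {
    CRITICAL_SECTION lock;   // 40 bytes on x64
    int value = 0;

    WithCriticalSection()  { InitializeCriticalSection(&lock); }  // explicit setup
    ~WithCriticalSection() { DeleteCriticalSection(&lock); }      // explicit teardown
};

struct WithSrwLock {
    SRWLOCK lock = SRWLOCK_INIT;   // 8 bytes on x64, no cleanup needed
    int value = 0;
};
```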
If you have 100,000 objects that each require some internal synchronization, the memory savings alone are substantial. The performance gain from avoiding unnecessary cache misses is even more noticeable. In modern processors (starting with Intel Nehalem, released in 2008), one cache line is 64 bytes. If you spend 40 of them on the synchronization object, access to small objects in your software will suffer noticeably.
Speed
First of all, keep in mind that the SRWL implementation in the Windows kernel has been substantially reworked in recent years. If you read some benchmark on the Internet comparing the speed of various synchronization primitives in Windows, pay attention to when it was written.
Both the critical section and SRWL spin for a while in user mode and only then fall back to waiting in the kernel. Only the critical section lets you configure that user-mode spin count.
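For illustration, a critical section's spin count can be set at initialization or adjusted later; the values below are arbitrary and should be tuned by profiling:

```cpp
#include <windows.h>

CRITICAL_SECTION g_cs;

void InitLocks() {
    // Spin up to 4000 iterations in user mode before falling back to a kernel wait.
    InitializeCriticalSectionAndSpinCount(&g_cs, 4000);

    // The spin count can also be changed afterwards:
    SetCriticalSectionSpinCount(&g_cs, 1000);
}
```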
I have not dug deeper into the implementation details. Nor have I ever tried to build a proper benchmark that compares the speed of critical sections and SRWL completely correctly; it is very hard to make a benchmark that is both theoretically sound and practically useful.
But I replaced critical sections with SRWL in my applications about 20 times in various scenarios. SRWL was always faster (or at least not slower) and often gave a visible performance boost.
I will not give specific numbers here. The amount of work done while the lock is held, the granularity of the locks, the level of parallelism, the ratio of reads to writes, cache usage, processor load, and other factors have too much influence on the final result.
I will not argue that SRWL is absolutely always faster than critical sections. In each case, profiling is necessary to clarify the whole picture.
Lack of reentrancy in SRWL
This is not a bug, but a feature.
The absence of lock reentrancy immediately leads to more transparent public contracts and requires careful thought about when locks are acquired and released, which ultimately helps avoid deadlocks. Well, at least until you do something foolish, such as invoking callbacks while holding a lock.
Reentrant locking is, of course, also useful - for example, when you are trying to add parallelism to some old code and do not want to dive too deeply into refactoring it. The original POSIX mutex was made reentrant by accident. One can only imagine how many problems with parallel code and locks could have been happily avoided if reentrant synchronization primitives had not become mainstream.
A thread that tries to acquire the same SRWL twice for writing will deadlock against itself. This kind of error is easy to identify and fix the first time it appears: just look at the stuck thread - all the necessary information is right there, and neither timing nor other threads play any part.
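A minimal sketch of such a self-deadlock (the function names are illustrative; in practice the second exclusive acquire simply hangs the thread):

```cpp
#include <windows.h>

SRWLOCK g_lock = SRWLOCK_INIT;

void Inner() {
    AcquireSRWLockExclusive(&g_lock);   // second exclusive acquire by the same thread: hangs here
    // ...
    ReleaseSRWLockExclusive(&g_lock);
}

void Outer() {
    AcquireSRWLockExclusive(&g_lock);
    Inner();                            // never returns
    ReleaseSRWLockExclusive(&g_lock);
}
```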
A recursive read lock also used to cause a deadlock - well, at least I am about 90% sure of that :). If I am not mistaken, Microsoft quietly changed this behavior, either in some update or in the transition from Windows 8 to Windows 10, and now there is no deadlock. Unfortunately, that made reentrancy errors harder to find. Mistakenly nested read locks lead to unpleasant bugs when the inner release drops the lock too early. Perhaps even worse, the outer release can drop a lock that is held by another reader. Microsoft's SAL annotations for locks can in theory help detect this kind of problem at compile time, but I have personally never tried them in practice.
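For what it is worth, here is a rough sketch of what those annotations can look like; I have not verified this in practice, and the struct and function names below are made up:

```cpp
#include <windows.h>
#include <sal.h>

struct Counter {
    SRWLOCK lock = SRWLOCK_INIT;
    _Guarded_by_(lock) long value = 0;   // value may only be touched while lock is held
};

_Requires_lock_held_(c->lock)
void Increment(Counter* c) {
    c->value += 1;                       // /analyze warns if a caller does not hold c->lock
}
```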
Concurrent reads
Concurrent reads happen quite often in practice. A critical section provides no concurrency at all in this case, while an SRWL acquired in shared mode lets any number of readers proceed simultaneously.
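A minimal sketch of the shared/exclusive SRWL API (the protected container here is just an example):

```cpp
#include <windows.h>
#include <numeric>
#include <vector>

SRWLOCK g_lock = SRWLOCK_INIT;
std::vector<int> g_data;

int Sum() {
    AcquireSRWLockShared(&g_lock);       // any number of readers may hold the lock at once
    int sum = std::accumulate(g_data.begin(), g_data.end(), 0);
    ReleaseSRWLockShared(&g_lock);
    return sum;
}

void Append(int v) {
    AcquireSRWLockExclusive(&g_lock);    // a writer gets exclusive access
    g_data.push_back(v);
    ReleaseSRWLockExclusive(&g_lock);
}
```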
Writer progress issues
The flip side of the benefit of concurrent reads is that a write lock cannot be obtained until all read locks are released. Moreover, SRWL gives a pending write-lock request no preference or fairness whatsoever in the order in which the lock is granted: new read locks can keep being acquired while the write lock continues to wait. Critical sections are no better in this respect on the one hand (there is no way to prioritize read or write acquisition there either), but on the other hand, since concurrent read acquisition is impossible, the problem shows up less often.
The Windows thread scheduler does provide some fairness in giving CPU time to all threads. This helps the remaining threads complete their user-mode wait loops while some resource is locked by one thread. But since the scheduler's algorithm is not part of any public contract, it is not worth writing code that relies on its current implementation.
If guaranteed writer progress is important, then neither the critical section nor SRWL is suitable as the synchronization mechanism. Other constructs, such as a reader-writer queue, may be preferable.
Concurrency Runtime
concurrency::reader_writer_lock provides stricter priority guarantees than SRWL and is designed specifically for heavily contended scenarios. That comes at a price: in my experience this primitive is noticeably slower than both critical sections and SRWL, and it also occupies more memory (72 bytes).
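A minimal usage sketch, assuming the Concurrency Runtime headers that ship with Visual C++ (the guarded variable is illustrative):

```cpp
#include <concrt.h>

concurrency::reader_writer_lock g_rwLock;
int g_value = 0;

int Read() {
    // Shared acquisition for the lifetime of the scope.
    concurrency::reader_writer_lock::scoped_lock_read guard(g_rwLock);
    return g_value;
}

void Write(int v) {
    // Exclusive acquisition; this lock favors pending writers.
    concurrency::reader_writer_lock::scoped_lock guard(g_rwLock);
    g_value = v;
}
```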
Personally, I think it is overkill to involve whole runtime tasks (jobs) just to try to acquire a lock, but it may well suit someone.
False sharing
False sharing is much more likely for SRWL than for critical sections - again because of the size difference (8 bytes versus 40). When a critical section lands in a cache line, its 40 bytes occupy most of the 64-byte line, which leaves no room for another critical section in it. If you create an array of locks, try to take your platform's cache-line size into account, as sketched below.
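A minimal sketch of such padding, assuming 64-byte cache lines (the array size and its purpose are illustrative):

```cpp
#include <windows.h>

// Pad each SRWL to a full cache line so that two hot locks never share one.
struct alignas(64) PaddedSrwLock {
    SRWLOCK lock = SRWLOCK_INIT;
};
static_assert(sizeof(PaddedSrwLock) == 64, "one lock per cache line");

PaddedSrwLock g_bucketLocks[256];   // e.g. one lock per hash-table bucket
```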
However, do not obsess over this prematurely. Even for SRWLs, false sharing rarely becomes a real problem; it only shows up when a very large number of threads simultaneously modify a relatively small number of objects. If you have, say, several thousand small objects, it is hardly worth inflating their size just because a lock might share a cache line - the game is usually not worth the candle. Of course, that can only be stated with certainty after profiling each particular situation.
OS kernel bug
I should mention a bug in the Windows kernel that made me lose a little faith in SRWL, and in Windows in general. A few years ago my colleagues and I began to notice strange bugs where some threads occasionally could not acquire certain SRWLs. This happened mostly on dual-core processors, but sometimes, very rarely, on single-core ones too. Debugging showed that at the moment of the acquisition attempt no other thread held the lock. Even more surprisingly, an instant later an attempt to acquire the same lock in the same thread succeeded. After a long investigation I managed to reduce the reproduction time for this bug from a few days to half an hour. In the end I proved that it was a problem in the OS kernel, one that also affects IOCP.
Eight months passed from the discovery of the bug to the hotfix, and of course it took some more time for the update to reach users' PCs.
Conclusions
Most locks protect the data inside some object from accidental simultaneous access from different threads. The key word here is “accidental”, since a requirement of precisely simultaneous access is rarely programmed deliberately. Both the critical section and SRWL perform well when acquiring and releasing a lock that is currently free. In that case the overall size of the protected object comes to the fore: if the object is small enough to fit into the same cache line as its lock, that alone gives a performance boost. Being 32 bytes smaller is the main reason to use SRWL for this purpose.
For code where, in most cases, the lock is already taken at the moment of the acquisition attempt, no such unambiguous conclusion can be drawn; every optimization has to be measured. But in that case the speed of acquiring and releasing the lock itself is unlikely to be the bottleneck. The focus shifts to reducing the time spent inside the lock. Everything that can be done before or after the lock should be moved there. Consider using several separate locks instead of one. Try to touch the data you need before taking the lock (this gives the code inside the lock a chance to run faster because the data will already be in cache). Do not allocate memory from the global heap inside the lock (try an allocator with pre-allocated memory).
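A minimal sketch of that advice (the list structure and names are purely illustrative): the node is allocated before the lock is taken, so only a cheap pointer update happens while the lock is held.

```cpp
#include <windows.h>

struct Node { int payload; Node* next; };

SRWLOCK g_listLock = SRWLOCK_INIT;
Node*   g_head = nullptr;

void PushFront(int payload) {
    Node* n = new Node{payload, nullptr};   // allocation happens outside the lock

    AcquireSRWLockExclusive(&g_listLock);
    n->next = g_head;                       // only the cheap linkage is protected
    g_head  = n;
    ReleaseSRWLockExclusive(&g_listLock);
}
```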
And finally, non-reentrant locks make code much easier to read. Reentrant locks are a kind of “goto of concurrency”: they complicate understanding of the current lock state and of the reasons deadlocks appear.