Mutexes and Closure Capture in Swift
A translation of Matt Gallagher's article.
This article discusses the lack of threading and thread-synchronization tools in Swift. We will look at the proposal for bringing concurrency to Swift, and at how, until that feature arrives, threaded programming in Swift means using traditional mutexes and shared mutable state.
Using a mutex in Swift is not particularly difficult, but it highlights a subtle performance nuance in Swift: dynamic memory allocation during closure capture. We want our mutex to be fast, yet passing a closure to execute inside the mutex can make it 10 times slower due to the memory overhead. Let's look at several ways to solve this problem.
Lack of threading in Swift
When Swift was first announced in June 2014, it had two obvious omissions:
- error handling,
- threading and thread synchronization.
Error handling was implemented in Swift 2 and was one of the key features of this release.
Threading, for the most part, is still ignored by Swift. Instead of language-level threading tools, Swift includes the `Dispatch` module (libdispatch, a.k.a. Grand Central Dispatch) on all platforms, and implicitly suggests that we use `Dispatch` rather than expect help from the language. Delegating the job to a bundled library seems especially odd compared to other modern languages, such as Go and Rust, where threading primitives and strict thread safety (respectively) are core language features. Even `@synchronized` and `atomic` properties in Objective-C look like a generous offering next to the absence of anything similar in Swift. What is the reason for such an obvious omission?
Future Multithreading in Swift
The answer is briefly discussed in the concurrency proposal in the Swift repository.
I mention this proposal to emphasize that the Swift developers would like to do something about multithreading in the future, but keep in mind what Swift developer Joe Groff says: "this document is just a proposal, not an official statement of development direction."
The proposal appears to describe a situation where, as in Cyclone or Rust, references cannot be shared between threads. Whether or not the result resembles those languages, it seems Swift plans to eliminate shared memory between threads, except for types that implement `Copyable` and are passed through strictly controlled channels (called `Stream`s in the proposal). Coroutines (called `Task`s in the proposal) will also appear, behaving like asynchronous dispatch blocks that can be paused and resumed. The proposal goes on to state that the most common language-level threading tools can be implemented in libraries on top of the `Stream`/`Task`/`Copyable` primitives (similar to `chan` in Go, `async`/`await` in .NET, actors in Erlang). Sounds good, but when should we expect multithreading in Swift? Swift 4? Swift 5? Not soon.
So for now, this doesn't help us; if anything, it works against us.
The impact of future features on the present language
The problem is that Swift avoids including simple multithreading primitives in the language, or thread-safe versions of language features, on the grounds that they would be replaced or made obsolete by some future facility.
You can find clear evidence of this on the Swift-Evolution mailing list:
- Object references (both strong and weak) are undefined "if there is a read/write, write/write or anything/destroy race on the variable". There is no intention to change this behavior or to propose an integrated "atomic" approach, since this is "one of the few vague behavioral rules we adopt". A possible "fix" for this vague behavior would be a new multithreading model.
- Result types (or other means of throwing (`throw`) outside of function interfaces) would be useful for numerous continuation-passing-style algorithms and were thoroughly discussed, but were ultimately set aside until Swift "provides proper language support [for coroutines or asynchronous promises]" as part of the multithreading changes.
Trying to find a fast, general-purpose mutex
In short: if we need multithreaded behavior, we have to build it ourselves using the existing threading and mutex facilities.
The standard Swift mutex advice is: use a `DispatchQueue` and call `sync` on it. I like libdispatch, but in most cases using `DispatchQueue.sync` as a mutex is the slowest way to solve the problem, more than an order of magnitude slower than the alternatives, because of the unavoidable closure-capture cost of the closure passed to `sync`. The closure must capture the surrounding state (in particular, a reference to the protected resource), and that capture implies a heap-allocated closure context. Until Swift gains the ability to optimize non-escaping closures onto the stack, the only way to avoid the extra heap allocation is to make sure the closure is inlined. Unfortunately, that isn't possible across module boundaries, such as the boundary of the Dispatch module, and this makes `DispatchQueue.sync` an unnecessarily slow mutex in Swift.
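For reference, the pattern in question looks something like this (a minimal sketch; the class, queue label, and `count` variable are illustrative, not from the original article):

```swift
import Dispatch

final class QueueProtected {
    // A serial queue acting as a mutex around `count`.
    private let queue = DispatchQueue(label: "com.example.mutex")
    private var count = 0

    func increment() {
        // The closure must capture `self` to reach `count`; it is this
        // capture that can force a heap-allocated closure context.
        queue.sync {
            self.count += 1
        }
    }
}
```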
The next most commonly suggested option is `objc_sync_enter`/`objc_sync_exit`. It is 2 to 3 times faster than libdispatch, but still a little slower than ideal (because it is always a re-entrant mutex), and it depends on the Objective-C runtime (so it is limited to Apple platforms).
The fastest mutex option is `OSSpinLock`, more than 20 times faster than `dispatch_sync`. Beyond the general limitations of spinlocks (high CPU usage when multiple threads try to enter at the same time), it has serious problems on iOS that make it completely unsuitable on that platform. Accordingly, it can only be used on the Mac.
If you are targeting iOS 10 or macOS 10.12 or newer, you can use `os_unfair_lock_t`. Its performance should be close to `OSSpinLock`, without the latter's most serious problems. However, this lock is not FIFO: the mutex is handed to an arbitrary waiter (hence "unfair"). You'll need to decide whether that is a problem for your program, but in general it means this shouldn't be your first choice for a general-purpose mutex.
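For completeness, a minimal sketch of what an `os_unfair_lock`-based wrapper might look like (requires iOS 10/macOS 10.12; note that production code should heap-allocate the lock to guarantee a stable address, which this sketch glosses over):

```swift
import os.lock

@available(iOS 10.0, macOS 10.12, *)
final class UnfairLock {
    private var unfairLock = os_unfair_lock()

    func sync<R>(execute: () throws -> R) rethrows -> R {
        os_unfair_lock_lock(&unfairLock)
        defer { os_unfair_lock_unlock(&unfairLock) }
        return try execute()
    }
}
```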
All of these problems make `pthread_mutex_lock`/`pthread_mutex_unlock` the only reasonable, performant, and portable option.
Mutexes and the pitfalls of closure capture
Like most things in plain C, `pthread_mutex_t` has a rather clumsy interface, so it helps to use a Swift wrapper (particularly for construction and automatic cleanup). It is also useful to have a "scoped" mutex, one that takes a function and executes it inside the mutex, providing balanced "lock" and "unlock" calls on either side of the function. Let's call our wrapper `PThreadMutex`. Here is the implementation of a simple scoped mutex function on that wrapper:

```swift
public func sync<R>(execute: () -> R) -> R {
    pthread_mutex_lock(&m)
    defer { pthread_mutex_unlock(&m) }
    return execute()
}
```
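For context, the wrapper around this function might look something like the following sketch (a simplified version of what the CwlUtils implementation provides; the real one offers more options):

```swift
import Foundation

public final class PThreadMutex {
    // `m` is the field that the `sync` function above locks and unlocks.
    // A class (reference type) is used so the pthread_mutex_t is never
    // copied and keeps a stable address.
    public var m = pthread_mutex_t()

    public init() {
        pthread_mutex_init(&m, nil)
    }

    deinit {
        pthread_mutex_destroy(&m)
    }
}
```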
This should be fast, but it isn't. Can you see why?
The problem arises when this kind of reusable function is implemented in a separate module, such as CwlUtils. That leads to exactly the same problem as with `DispatchQueue.sync`: closure capture causes a heap allocation. Due to this overhead, the function runs more than 10 times slower than it needs to (3.124 seconds for 10 million calls, versus an ideal 0.263 seconds). What exactly is being "captured"? Consider the following example:

```swift
mutex.sync { doSomething(&protectedMutableState) }
```
To do anything useful inside the mutex, a reference to `protectedMutableState` must be stored in the "closure context", which is heap-allocated data. This may seem harmless enough (after all, capturing is what closures do). But if the `sync` function cannot be inlined into its caller (because it is in another module, or in another file with whole-module optimization turned off), then the capture forces a heap allocation. We don't want that. To avoid it, we can pass the relevant state into the closure as a parameter instead of capturing it.
WARNING: The next few code examples get increasingly ridiculous, and in most cases I suggest not following them. I am doing this to demonstrate the depth of the problem. Skip ahead to the "Another approach" section to see what I actually use in practice.
```swift
public func sync_2<T>(_ p: inout T, execute: (inout T) -> Void) {
    pthread_mutex_lock(&m)
    defer { pthread_mutex_unlock(&m) }
    execute(&p)
}
```
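Called with the protected state passed in as an `inout` parameter (reusing the hypothetical names from the earlier example):

```swift
mutex.sync_2(&protectedMutableState) { doSomething(&$0) }
```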
That's better: now the function runs at full speed (0.282 seconds for the 10-million-call test).
We solved the problem for values passed into the function. A similar problem occurs with returning a result. The following function:

```swift
public func sync_3<T, R>(_ p: inout T, execute: (inout T) -> R) -> R {
    pthread_mutex_lock(&m)
    defer { pthread_mutex_unlock(&m) }
    return execute(&p)
}
```

shows the same kind of slowness as the original, even when the closure captures nothing (1.371 seconds; better than full capture, but still far from ideal). To handle its result, the closure performs a heap allocation.
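For illustration, a call like this (same hypothetical names) still allocates, purely because the closure returns a value:

```swift
let result = mutex.sync_3(&protectedMutableState) { doSomething(&$0) }
```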
We can fix this by passing the result out through an `inout` parameter as well:

```swift
public func sync_4<T, U>(_ p1: inout T, _ p2: inout U, execute: (inout T, inout U) -> Void) {
    pthread_mutex_lock(&m)
    defer { pthread_mutex_unlock(&m) }
    execute(&p1, &p2)
}
```
and call it like this:

```swift
// Assumes `mutableState` and `result` are valid, mutable values in the current scope
mutex.sync_4(&mutableState, &result) { $1 = doSomething($0) }
```
We are back to full speed, or close enough to it (0.307 seconds for 10 million calls).
Another approach
One of the advantages of a scoped mutex with a closure is how lightweight it feels. Captured values have the same name inside and outside the closure, and the connection between them is obvious. When we avoid closure capture and instead try to pass all values in as parameters, we are forced either to rename all of our variables or to give them shadowed names (which doesn't aid comprehension), and we still risk accidentally capturing a variable and degrading performance again.
Let's put everything aside and solve the problem differently.
We can write a free `sync` function in our own file that takes the mutex as a parameter:

```swift
private func sync<R>(mutex: PThreadMutex, execute: () throws -> R) rethrows -> R {
    pthread_mutex_lock(&mutex.m)
    defer { pthread_mutex_unlock(&mutex.m) }
    return try execute()
}
```
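Usage then looks like this (hypothetical names again); the important constraint is that the call site and the `sync` function above live in the same file:

```swift
// Must be in the same file as the private sync(mutex:execute:) above.
let result = sync(mutex: mutex) { doSomething(&protectedMutableState) }
```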
When this function is placed in the same file as its caller, it almost works: we get rid of the heap allocation, and the time drops from 3.043 to 0.374 seconds. But we still haven't reached the 0.263 seconds of direct `pthread_mutex_lock`/`pthread_mutex_unlock` calls. What's wrong now?
It turns out that even though the private function is in the same file, where Swift can fully inline it, Swift does not eliminate the redundant retains and releases of the `PThreadMutex` parameter (which is a class, so that the `pthread_mutex_t` inside it isn't broken by copying). We can get the compiler to avoid those retains and releases by making the function an extension on `PThreadMutex` rather than a free function:

```swift
extension PThreadMutex {
    private func sync<R>(execute: () throws -> R) rethrows -> R {
        pthread_mutex_lock(&m)
        defer { pthread_mutex_unlock(&m) }
        return try execute()
    }
}
```
This forces Swift to treat the `self` parameter as `@guaranteed`, eliminating the retain/release costs, and we finally get down to 0.264 seconds.
Semaphores, not mutexes?
Why not use a `dispatch_semaphore_t`? The advantage of `dispatch_semaphore_wait` and `dispatch_semaphore_signal` is that no closure is required: they are separate, unscoped calls. You can use a `dispatch_semaphore_t` to build a mutex-like construct:

```swift
public struct DispatchSemaphoreWrapper {
    let s = DispatchSemaphore(value: 1)
    init() {}
    func sync<R>(execute: () throws -> R) rethrows -> R {
        _ = s.wait(timeout: DispatchTime.distantFuture)
        defer { s.signal() }
        return try execute()
    }
}
```
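Usage mirrors the mutex version (hypothetical names):

```swift
let semaphoreMutex = DispatchSemaphoreWrapper()
let result = semaphoreMutex.sync { doSomething(&protectedMutableState) }
```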
It turns out this is about a third faster than a `pthread_mutex_lock`/`pthread_mutex_unlock` mutex (0.168 seconds versus 0.244). But despite the speed increase, a semaphore is not the best choice for a general-purpose mutex.
Semaphores are prone to a number of errors and problems, the most serious of which are forms of priority inversion. Priority inversion is the same kind of problem that made `OSSpinLock` unusable on iOS, but for semaphores the problem is a little more involved.
With a spinlock, priority inversion means:
- A high-priority thread is active, spinning, waiting for a lock held by a lower-priority thread.
- The low-priority thread never releases the lock because it is starved by the higher-priority thread.
With a semaphore, priority inversion means:
- A high-priority thread waits on a semaphore.
- A medium-priority thread does not depend on the semaphore.
- A low-priority thread is expected to signal the semaphore so the high-priority thread can continue.
The medium-priority thread will starve the low-priority thread (this is normal for thread priorities). But since the high-priority thread is waiting for the low-priority thread to signal the semaphore, the high-priority thread is also starved by the medium-priority thread. Ideally, this should never happen.
If a true mutex were used instead of the semaphore, the high-priority thread's priority would be donated to the lower-priority thread while the high-priority thread waits on the mutex held by the low-priority thread, letting the low-priority thread finish its work and unblock the high-priority one. Semaphores, however, are not "held" by a thread, so no priority donation can occur.
Ultimately, semaphores are a good way to communicate completion notifications between threads (something that isn't easy with mutexes), but semaphore designs are complicated and carry risks, so their use should be limited to situations where you know all the involved threads and their priorities in advance, and where the waiting thread's priority is known to be equal to or lower than the signaling thread's.
All this may sound a little academic, since you probably don't deliberately create threads with different priorities in your programs. However, the Cocoa frameworks add a complication: they use dispatch queues everywhere, and every dispatch queue has a "QoS class", which can cause the queue to run at a different thread priority. If you don't know the QoS of every task in your program (including user-interface tasks and anything enqueued by the Cocoa frameworks), you may suddenly find yourself in a multi-priority scenario. This is best avoided.
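As a minimal illustration of where differing priorities come from (the labels and QoS choices are illustrative):

```swift
import Dispatch

// Two queues whose blocks may run at very different thread priorities.
let background = DispatchQueue(label: "com.example.work", qos: .utility)
let interactive = DispatchQueue(label: "com.example.ui", qos: .userInteractive)

// If a block on `interactive` waits on a semaphore that a block on
// `background` is expected to signal, the high-QoS work can be starved
// by anything running between those two priorities.
```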
Appendix
A project containing implementations of `PThreadMutex` and `DispatchSemaphore` is available on Github. The CwlMutex.swift file is fully self-contained, so you can simply copy it if that's all you need.
Otherwise, the ReadMe.md file has details on cloning the whole repository and adding the framework it builds to your own projects.
Conclusion
The best and safest mutex option in Swift, on both Mac and iOS, remains `pthread_mutex_t`. In the future, Swift will probably gain the ability to optimize non-escaping closures onto the stack, or to inline calls across module boundaries. Either feature would fix the inherent problems of `Dispatch.sync`, probably making it the better option, but for now it is too inefficient.
While semaphores and other "light" locks are reasonable approaches in some scenarios, they are not general-purpose mutexes and bring additional design considerations and risks.
Whichever mutex construct you choose, you need to take care that inlining actually happens to get maximum performance, otherwise redundant closure captures can make the mutex 10 times slower. In the current version of Swift, that may mean copying and pasting the code into the file where it is used.
Threading, inlining, and optimization are all areas where we can expect significant change beyond Swift 3. But current Swift users have to work in Swift 2.3 and Swift 3, and this article describes the current behavior in those versions when trying to get maximum performance from a scoped mutex.
Addendum: performance numbers
A simple loop was run 10 million times: enter the mutex, increment a counter, exit the mutex. The "slow" versions of DispatchSemaphore and PThreadMutex were compiled as part of a dynamic framework, separate from the test code.
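The loop was roughly of the following shape (a reconstruction, not the exact test code; see CwlMutexPerformanceTests.swift for the real version):

```swift
// Increment a counter 10 million times inside the mutex; the closure
// captures the local `count`, which is the "capture by closure" path
// being measured in the first table row.
var count = 0
let mutex = PThreadMutex()
for _ in 0..<10_000_000 {
    mutex.sync { count += 1 }
}
```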
Results:
| Mutex variant | Seconds (Swift 2.3) | Seconds (Swift 3) |
|---|---|---|
| PThreadMutex.sync (closure capture) | 3.043 | 3.124 |
| DispatchQueue.sync | 2.330 | 3.530 |
| PThreadMutex.sync_3 (returning a result) | 1.371 | 1.364 |
| objc_sync_enter | 0.869 | 0.833 |
| sync(PThreadMutex) (function in the same file) | 0.374 | 0.387 |
| PThreadMutex.sync_4 (dual inout parameters) | 0.307 | 0.310 |
| PThreadMutex.sync_2 (single inout parameter) | 0.282 | 0.284 |
| PThreadMutex.sync (inlined, non-capturing) | 0.264 | 0.265 |
| Direct pthread_mutex_lock/unlock calls | 0.263 | 0.263 |
| OSSpinLockLock | 0.092 | 0.108 |
The test code is part of the linked CwlUtils project, but the file containing these performance tests (CwlMutexPerformanceTests.swift) is not attached to the test module by default and must be enabled deliberately.