Futex basics

Transfer

Futex (futex - short for Fast userspace mutex) is a mechanism proposed by Linux developers from IBM in 2002 and entered the kernel at the end of 2003. The main idea was to provide a more efficient way to synchronize user threads with a minimum number of calls to the OS kernel.

In this article, we will review the futexes, try to understand the principles of their work, and also use them as building blocks for building higher-level (and familiar to us) synchronization objects.

An important point: futexes are quite a low-level tool, it’s worth using directly only when developing fundamental libraries, like the standard C / C ++ library. It is very unlikely that you will need to use futexes in a regular application.

Motivation

Before the emergence of futexes to control access to shared resources from multiple threads, it was necessary to make system calls each time (using, for example, semop ), which, as is well known, is resource intensive, since each call requires switching the context from user mode to kernel mode. With the increase in the number of cores in modern processors and the increase in the number of threads in the application software, this has become a significant problem. It is all the more “offensive”, given that all these calls do not carry any application function, do not implement any business logic, but only guarantee the correct operation of the rest of the code.

The proposal to add to the OS a new concept of “futex” was based on a simple observation: in most cases, an attempt to capture the synchronization object succeeds the first time. Programmers write software in such a way that it takes as little time as possible from capturing a lock to its release, which means there are very high chances that an attempt to capture by another stream will not encounter obstacles. When a thread reaches such a “free” synchronization object, we can capture it without performing a system call, using relatively cheap atomic operations. And there is a very big chance that the atomic operation will work successfully.

In the rare case when we are still trying to gain access to a resource blocked by another thread, an atomic operation will return an error. In this case, we have two options. We can either spin in some user mode spin-lock, waiting for the resource to be released (which will eat up the CPU resources), or ask the kernel to put us into sleep mode, waiting for the resource to be released. This is where futexes come on the scene.

Simple use of futexes - wait and wake up

The futex system call combines quite diverse functionality. We will not consider complex options here (some of them are so elaborate that they are not even described in the official documentation), but focus on the operations FUTEX_WAIT and FUTEX_WAKE. The description in the official documentation will serve as a good base:

The futex () system call provides programs with a method to wait until a certain condition is true. Typically, this system call uses a blocking construct in the context of shared memory synchronization. When using futexes, basic synchronization operations are performed in user space. User-space programs execute the futex () system call only when it is necessary for the program to enter standby mode for a long time until the condition becomes true. Also, futex () can be used to wake up processes or threads waiting for a specific condition.

Simply put, a futex is a kernel-based construct that helps user code synchronize threads when certain events occur. Some processes (or threads) can wait for events in the FUTEX_WAIT call, while others can trigger these events with FUTEX_WAKE. Waiting works efficiently - waiting threads are suspended by the kernel and do not use processor resources until they are awakened when an expected event occurs.

Do not be lazy to read the documentation in full. Or at least read the sections on FUTEX_WAIT and FUTEX_WAKE.

Let's look at a simple example that demonstrates the basic use of futexes to coordinate the work of two processes.

Child process:

Waiting for the value 0xA to appear in the general memory slot.
Writes 0xB value to this slot.

Parent process at this time:

Writes 0xA to shared memory slot.
Waiting for the value 0xB to appear in it.

Such a "handshake" between the two processes. Here is the code:

intmain(int argc, char** argv){
  int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666);
  if (shm_id < 0) {
    perror("shmget");
    exit(1);
  }
  int* shared_data = shmat(shm_id, NULL, 0);
  *shared_data = 0;
  int forkstatus = fork();
  if (forkstatus < 0) {
    perror("fork");
    exit(1);
  }
  if (forkstatus == 0) {
    // дочерний процессprintf("child waiting for A\n");
    wait_on_futex_value(shared_data, 0xA);
    printf("child writing B\n");
    // записываем 0xB в разделяемый слот памяти и ждём ответа родителя
    *shared_data = 0xB;
    wake_futex_blocking(shared_data);
  } else {
    // родительский процессprintf("parent writing A\n");
    // Записываем 0xA в разделяемый слот памяти и ждём ответа ребёнка
    *shared_data = 0xA;
    wake_futex_blocking(shared_data);
    printf("parent waiting for B\n");
    wait_on_futex_value(shared_data, 0xB);
    // Wait for the child to terminate.
    wait(NULL);
    shmdt(shared_data);
  }
  return0;
}

Note the POSIX calls for allocating shared memory between processes. We could not use ordinary memory allocation here, since even the same pointer addresses in different processes would in fact indicate different blocks of memory (unique for each process).

It should be noted that this example somewhat deviates from the canons, because the futex was originally created to wait for a change in some value "from something concrete to anything," and not "from anything to something concrete." I gave this example in order to demonstrate this possibility, and below we will consider the basic variant (on it we implement the mutex).

And here is the wait_on_futex_value function code:

voidwait_on_futex_value(int* futex_addr, int val){
  while (1) {
    int futex_rc = futex(futex_addr, FUTEX_WAIT, val, NULL, NULL, 0);
    if (futex_rc == -1) {
      if (errno != EAGAIN) {
        perror("futex");
        exit(1);
      }
    } elseif (futex_rc == 0) {
      if (*futex_addr == val) {
        // здесь мы просыпаемсяreturn;
      }
    } else {
      abort();
    }
  }
}

The main objective of this function (in addition to the futex system call itself) is a cycle in which we run during a false (not interested in) awakening. This can happen when installing a new, but not expected value into a shared memory slot. Well, or in the case when another process was awakened before ours (this cannot happen in our particular case, but more generally it is possible).

Futex semantics is tricky enough! The FUTEX_WAIT call will return immediately if the value at the futex address is not equal to the argument val passed. In our case, this can happen if the child process waited before the parent wrote the value 0xA in the slot. Futex in this case will return the value of EAGAIN.

And here is the wake_futex_blocking function code:

voidwake_futex_blocking(int* futex_addr){
  while (1) {
    int futex_rc = futex(futex_addr, FUTEX_WAKE, 1, NULL, NULL, 0);
    if (futex_rc == -1) {
      perror("futex wake");
      exit(1);
    } elseif (futex_rc > 0) {
      return;
    }
  }
}

This is a blocking wrapper over FUTEX_WAKE, which will quickly work out and return value, no matter how many listeners expect it. In our example, this is used as part of a handshake, but other uses are possible.

Futexes are kernel queues for custom code.

Simply put, futex is a queue managed by the kernel to solve user code problems. It allows the user code to request the kernel to suspend the execution of its thread until a certain event occurs, and another thread at the same time to signal this event and wake up all the threads waiting for it. Earlier we mentioned the possibility to organize spin-lok in user mode, waiting for the fulfillment of some condition. However, the queue in the core is a much better alternative, because it saves us from the wasted burned processor instructions that are executed in the wait cycle.

Here is the diagram from the “A futex overview and update” article on LWN:

In the Linux kernel code, futexes are implemented in kernel / futex.c. The kernel stores a hash table, where the keys are addresses - to quickly search for the desired queue and add the calling process to it. Everything, of course, is not so simple - after all, the kernel itself also needs to synchronize access to the data inside, plus support any additional options of futexes.

Timeout with FUTEX_WAIT

The futex system call has a timeout parameter that allows the user to specify how long he is willing to wait. Here is a complete example of where this is implemented, but its key part:

printf("child waiting for A\n");
structtimespectimeout = {.tv_sec = 0, .tv_nsec = 500000000};
while (1) {
  unsignedlonglong t1 = time_ns();
  int futex_rc = futex(shared_data, FUTEX_WAIT, 0xA, &timeout, NULL, 0);
  printf("child woken up rc=%d errno=%s, elapsed=%llu\n", futex_rc,
         futex_rc ? strerror(errno) : "", time_ns() - t1);
  if (futex_rc == 0 && *shared_data == 0xA) {
    break;
  }
}

If the wait is delayed for 500 ms, then the futex function will end, and in the next iteration of the cycle we can somehow react to this (display something on the screen, write to the log, continue the wait or stop).

Use futex for implementing mutex

We began this article by saying that futexes have practical benefits when implementing higher-level synchronization objects. Let's try using them (and also atomics) to implement a classic mutex. The implementation below is based on the code from the article “Futexes are Tricky” written by Ulrich Drepper.

For this example, I use C ++, mainly to be able to use atomics from the C ++ 11 standard. You can find the full code here , but the most important part of it is:

classMutex {public:
  Mutex() : atom_(0) {}
  voidlock(){
    int c = cmpxchg(&atom_, 0, 1);
    // If the lock was previously unlocked, there's nothing else for us to do.// Otherwise, we'll probably have to wait.if (c != 0) {
      do {
        // If the mutex is locked, we signal that we're waiting by setting the// atom to 2. A shortcut checks is it's 2 already and avoids the atomic// operation in this case.if (c == 2 || cmpxchg(&atom_, 1, 2) != 0) {
          // Here we have to actually sleep, because the mutex is actually// locked. Note that it's not necessary to loop around this syscall;// a spurious wakeup will do no harm since we only exit the do...while// loop when atom_ is indeed 0.
          syscall(SYS_futex, (int*)&atom_, FUTEX_WAIT, 2, 0, 0, 0);
        }
        // We're here when either:// (a) the mutex was in fact unlocked (by an intervening thread).// (b) we slept waiting for the atom and were awoken.//// So we try to lock the atom again. We set teh state to 2 because we// can't be certain there's no other thread at this exact point. So we// prefer to err on the safe side.
      } while ((c = cmpxchg(&atom_, 0, 2)) != 0);
    }
  }
  voidunlock(){
    if (atom_.fetch_sub(1) != 1) {
      atom_.store(0);
      syscall(SYS_futex, (int*)&atom_, FUTEX_WAKE, 1, 0, 0, 0);
    }
  }
private:
  // 0 means unlocked// 1 means locked, no waiters// 2 means locked, there are waiters in lock()std::atomic<int> atom_;
};

In this code, the cmpxhg function is a simple wrapper for more convenient use of atomics:

// An atomic_compare_exchange wrapper with semantics expected by the paper's// mutex - return the old value stored in the atom.intcmpxchg(std::atomic<int>* atom, int expected, int desired){
  int* ep = &expected;
  std::atomic_compare_exchange_strong(atom, ep, desired);
  return *ep;
}

This code example contains many comments explaining the logic of its operation. This does not hurt, because there is a significant risk that you will want to write a slightly simpler, but completely wrong version of it. As for this code, it is also not perfect at all. For example, he tries to make an assumption about the internal structure of type std :: atomic, casting its contents to int * to pass to the futex call. This is generally not the case. The code compiles and runs on Linux x64, but we have no guarantee of compatibility with other platforms. To get it, we need to add a layer of platform dependency for atomics. Since this is not the topic of this article (and also because it is very unlikely that you will mix C ++ and futexes in one module), we will omit this implementation. This is just a demonstration!

Glibc mutexes and low-level locks

And here we come to the way glibc implements POSIX threads, of which the pthread_mutex_t type is a part. As I said at the beginning of this article, futexes are not exactly the thing that an ordinary developer will need. They are used by runtime libraries or by something very specialized for implementing higher level synchronization primitives. In this context, it is interesting to look at the NPTL mutex implementation . In the glibc code, this is the nptl / pthread_mutex_lock.c file.

The code is rather complicated due to the need to support various types of mutexes, but we can find quite familiar blocks if we wish. You can also take a look at the files sysdeps / unix / sysv / linux / x86_64 / lowlevellock.h and nptl / lowlevellock.c. The code is somewhat confusing, but the combination of comparison-and-exchange and futex calls is easy.

The initial comment of the systeds / nptl / lowlevellock.h file should already be well understood:

/* Low-level locks use a combination of atomic operations (to acquire and
   release lock ownership) and futex operations (to block until the state
   of a lock changes).  A lock can be in one of three states:
   0:  not acquired,
   1:  acquired with no waiters; no other threads are blocked or about to block
       for changes to the lock state,
   >1: acquired, possibly with waiters; there may be other threads blocked or
       about to block for changes to the lock state.
   We expect that the common case is an uncontended lock, so we just need
   to transition the lock between states 0 and 1; releasing the lock does
   not need to wake any other blocked threads.  If the lock is contended
   and a thread decides to block using a futex operation, then this thread
   needs to first change the state to >1; if this state is observed during
   lock release, the releasing thread will wake one of the potentially
   blocked threads.
 ..
 */

Runes in Runtime Go

Runtime Go does not use libc (in most cases). Thus, it cannot rely on the implementation of POSIX threads. Instead, it directly calls the underlying system calls. This makes it a good example of the use of futexes. Since there is no way to call pthread_mutex_t, you have to write your replacement. Let's see how this is done, let's start with the sync.Mutex type visible to the user (in src / sync / mutex.go).

The Lock method of this type tries to use an atomic exchange operation (swap) to quickly capture a lock. If it turns out that you have to wait, it calls runtime_SemacquireMutex, which calls runtime.lock. This function is defined in src / runtime / lock_futex.go and it declares several constants that you might already find familiar:

const (
  mutex_unlocked = 0
  mutex_locked   = 1
  mutex_sleeping = 2
...
)
// Possible lock states are mutex_unlocked, mutex_locked and mutex_sleeping.// mutex_sleeping means that there is presumably at least one sleeping thread.

runtime.lock also tries to lock a lock with an atomic function. This makes sense, since runtime.lock is called in many places of Go runtime, but it seems to me that we could somehow optimize the code by removing two consecutive atom-function calls when calling runtime.lock from Mutex.lock.

If it turns out that you need to wait, the platform-specific function futexsleep is called, which is defined for Linux in the src / runtime / os_linux.go file. This function makes a futex system call with the code FUTEX_WAIT_PRIVATE (in this case, this is appropriate since the Go runtime lives in the same process).

Tags:

futex