io_submit: an alternative to epoll that you have never heard of

Recently, the author's attention was drawn to an LWN article about a new kernel polling interface. It discusses a new polling mechanism in the Linux AIO API (the interface for asynchronous file I/O) that was added in kernel version 4.18. The idea is quite interesting: the author of the patch proposes using the Linux AIO API for network I/O.

But wait! Linux AIO was created for asynchronous disk I/O! Disk files are not the same as network connections. Is it even possible to use the Linux AIO API for network I/O at all?

It turns out, yes, it is possible! This article explains how to use the strengths of the Linux AIO API to create faster and better network servers.

But let's start by clarifying what Linux AIO is.

Introduction to Linux AIO

Linux AIO provides an asynchronous disk I/O interface for user-space software.

Historically, all disk operations on Linux have been blocking. If you call open(), read(), write() or fsync(), the calling thread is stopped until the data and metadata appear in the disk cache. This is usually not a problem. If you do not perform many I/O operations and have enough memory, system calls will gradually fill the cache, and everything will work quickly enough.

The performance of I/O degrades when there are many operations in flight, for example with databases and proxy servers. For such applications it is unacceptable to stop the entire process just to wait on a single read() system call.

To solve this problem, applications can use three methods:

  1. Use thread pools and call blocking functions from separate threads. This is how POSIX AIO works in glibc (not to be confused with Linux AIO). See the IBM documentation for details. This is how we solved the problem at Cloudflare: for read() and open() we use a thread pool.
  2. Warm up the disk cache with posix_fadvise(2) and hope for the best.
  3. Use Linux AIO together with the XFS file system, opening files with the O_DIRECT flag and steering clear of the undocumented pitfalls.

However, none of these methods is perfect. Even Linux AIO, used carelessly, can block in the io_submit() call. This was recently mentioned in another LWN article:
“The asynchronous I/O interface in Linux has many critics and few defenders, but most people at least expect it to actually be asynchronous. In reality, an AIO operation can block in the kernel for a number of reasons, in situations where the calling thread cannot afford it.”
Now that we know about the weaknesses of the Linux AIO API, let's look at its strengths.

A simple program using Linux AIO

In order to use Linux AIO, you first have to define all five necessary system calls yourself: glibc does not provide wrappers for them.

  1. First you need to call io_setup() to initialize an aio_context structure. The kernel returns an opaque pointer to the structure.
  2. After that, you can call io_submit() to queue a vector of "I/O control blocks" in the form of struct iocb for processing.
  3. Now, finally, we can call io_getevents() and wait for it to return a vector of struct io_event: the results of each of the iocb blocks.
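Since glibc provides no wrappers for these calls, a minimal set of wrappers over syscall(2) might look like this (a sketch; errors come back as -1 with errno set, as usual):

```c
#include <linux/aio_abi.h>   /* aio_context_t, struct iocb, struct io_event */
#include <sys/syscall.h>
#include <time.h>            /* struct timespec */
#include <unistd.h>

/* glibc does not wrap the Linux AIO system calls, so we go
 * through syscall(2) directly. */
static int io_setup(unsigned nr_events, aio_context_t *ctx) {
    return syscall(__NR_io_setup, nr_events, ctx);
}

static int io_destroy(aio_context_t ctx) {
    return syscall(__NR_io_destroy, ctx);
}

static int io_submit(aio_context_t ctx, long nr, struct iocb **iocbpp) {
    return syscall(__NR_io_submit, ctx, nr, iocbpp);
}

static int io_cancel(aio_context_t ctx, struct iocb *iocb, struct io_event *result) {
    return syscall(__NR_io_cancel, ctx, iocb, result);
}

static int io_getevents(aio_context_t ctx, long min_nr, long max_nr,
                        struct io_event *events, struct timespec *timeout) {
    return syscall(__NR_io_getevents, ctx, min_nr, max_nr, events, timeout);
}
```

The remaining examples in this article assume wrappers of this shape.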

There are eight commands that can be used in an iocb: two for reading, two for writing, two fsync variants, and the POLL command added in kernel version 4.18 (the eighth command is NOOP):

IOCB_CMD_POLL = 5,   /* from 4.18 */
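For reference, the full opcode set as declared in the kernel's uapi header <linux/aio_abi.h> (slot 4 belonged to the experimental IOCB_CMD_PREADX, which is commented out in current headers):

```c
/* The iocb command opcodes, as in <linux/aio_abi.h>: */
enum {
    IOCB_CMD_PREAD = 0,
    IOCB_CMD_PWRITE = 1,
    IOCB_CMD_FSYNC = 2,
    IOCB_CMD_FDSYNC = 3,
    /* slot 4 belonged to the experimental IOCB_CMD_PREADX */
    IOCB_CMD_POLL = 5,      /* from 4.18 */
    IOCB_CMD_NOOP = 6,
    IOCB_CMD_PREADV = 7,
    IOCB_CMD_PWRITEV = 8,
};
```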

The iocb structure that is passed to io_submit is fairly large and designed for disk I/O. Here is a simplified version:

struct iocb {
  __u64 data;           /* user data */
  __u16 aio_lio_opcode; /* see IOCB_CMD_ above */
  __u32 aio_fildes;     /* file descriptor */
  __u64 aio_buf;        /* pointer to buffer */
  __u64 aio_nbytes;     /* buffer size */
};

And the complete io_event structure that io_getevents returns:

struct io_event {
  __u64  data;  /* user data */
  __u64  obj;   /* pointer to request iocb */
  __s64  res;   /* result code for this event */
  __s64  res2;  /* secondary result */
};

Example. A simple program that reads the /etc/passwd file using the Linux AIO API:

fd = open("/etc/passwd", O_RDONLY);
aio_context_t ctx = 0;
r = io_setup(128, &ctx);
char buf[4096];
struct iocb cb = {.aio_fildes = fd,
                  .aio_lio_opcode = IOCB_CMD_PREAD,
                  .aio_buf = (uint64_t)buf,
                  .aio_nbytes = sizeof(buf)};
struct iocb *list_of_iocb[1] = {&cb};
r = io_submit(ctx, 1, list_of_iocb);
struct io_event events[1] = {{0}};
r = io_getevents(ctx, 1, 1, events, NULL);
bytes_read = events[0].res;
printf("read %lld bytes from /etc/passwd\n", bytes_read);

The full source code is, of course, available on GitHub. Here is the strace output of this program:

openat(AT_FDCWD, "/etc/passwd", O_RDONLY)
io_setup(128, [0x7f4fd60ea000])
io_submit(0x7f4fd60ea000, 1, [{aio_lio_opcode=IOCB_CMD_PREAD, aio_fildes=3, aio_buf=0x7ffc5ff703d0, aio_nbytes=4096, aio_offset=0}])
io_getevents(0x7f4fd60ea000, 1, 1, [{data=0, obj=0x7ffc5ff70390, res=2494, res2=0}], NULL)

Everything went well, but the read from disk was not asynchronous: the io_submit call blocked and did all the work, while io_getevents returned instantly. We could try to read asynchronously, but that requires the O_DIRECT flag, which makes disk operations bypass the cache.

Let's better illustrate how io_submit blocks on regular files. Here is a similar example showing the strace output while reading a 1 GB block from /dev/zero:

io_submit(0x7fe1e800a000, 1, [{aio_lio_opcode=IOCB_CMD_PREAD, aio_fildes=3, aio_buf=0x7fe1a79f4000, aio_nbytes=1073741824, aio_offset=0}]) \
    = 1 <0.738380>
io_getevents(0x7fe1e800a000, 1, 1, [{data=0, obj=0x7fffb9588910, res=1073741824, res2=0}], NULL) \
    = 1 <0.000015>

The kernel spent 738 ms in the io_submit call and only 15 µs in io_getevents. It behaves the same way with network connections: all the work is done by io_submit.

Photo: Helix84, CC BY-SA 3.0

Linux AIO and network

The io_submit implementation is rather conservative: if the file descriptor passed to it was not opened with the O_DIRECT flag, the function simply blocks and performs the requested action. With network connections this means that:

  • for blocking sockets, IOCB_CMD_PREAD will wait for a packet to arrive;
  • for non-blocking sockets, IOCB_CMD_PREAD will return -11 (EAGAIN).

These are the same semantics as the ordinary read() system call, so when working with network connections io_submit is no smarter than the good old read()/write() calls.

It is important to note that iocb requests are executed by the kernel sequentially.

Despite the fact that Linux AIO does not help us with asynchronous operations, it can be used to combine system calls into batches.

If a web server needs to send and receive data over hundreds of network connections, using io_submit can be a great idea: it avoids hundreds of send and recv calls. This improves performance, since the round trip from user space into the kernel and back is not free, especially after the mitigations for Spectre and Meltdown.

                          One buffer                    Multiple buffers
One file descriptor       read()                        readv()
Many file descriptors     io_submit + IOCB_CMD_PREAD    io_submit + IOCB_CMD_PREADV

To illustrate batching of system calls with io_submit, let's write a small program that forwards data from one TCP connection to another. In its simplest form (without Linux AIO), it looks like this:

while True:
  d = sd1.read(4096)
  sd2.write(d)

We can express the same functionality through Linux AIO. The code in this case will be as follows:

struct iocb cb[2] = {{.aio_fildes = sd2,
                      .aio_lio_opcode = IOCB_CMD_PWRITE,
                      .aio_buf = (uint64_t)&buf[0],
                      .aio_nbytes = 0},
                     {.aio_fildes = sd1,
                      .aio_lio_opcode = IOCB_CMD_PREAD,
                      .aio_buf = (uint64_t)&buf[0],
                      .aio_nbytes = BUF_SZ}};
struct iocb *list_of_iocb[2] = {&cb[0], &cb[1]};
while(1) {
  r = io_submit(ctx, 2, list_of_iocb);
  struct io_event events[2] = {};
  r = io_getevents(ctx, 2, 2, events, NULL);
  cb[0].aio_nbytes = events[1].res;
}

This code submits two requests to io_submit: first a write to sd2, then a read from sd1. After the read completes, the code adjusts the size of the write buffer and repeats the loop. There is one trick: the first write is performed with a buffer of size 0. This is necessary because we can combine write+read in a single io_submit call (but not read+write).

Is this code faster than ordinary read()/write()? Not yet. Both versions use two system calls: read+write versus io_submit+io_getevents. But, fortunately, the code can be improved.

Getting rid of io_getevents

During io_setup(), the kernel allocates several pages of memory for the process. Here is how this block of memory looks in /proc/<pid>/maps:

marek:~$ cat /proc/`pidof -s aio_passwd`/maps
7f7db8f60000-7f7db8f63000 rw-s 00000000 00:12 2314562     /[aio] (deleted)

The [aio] memory block (12 KB in this case) was allocated by io_setup. It is used as a ring buffer where completion events are stored. In most cases there is no reason to call io_getevents at all: completion data can be read from the ring buffer without switching into kernel mode. Here is a revised version of the code:

int io_getevents(aio_context_t ctx, long min_nr, long max_nr,
                 struct io_event *events, struct timespec *timeout)
{
    int i = 0;

    struct aio_ring *ring = (struct aio_ring*)ctx;
    if (ring == NULL || ring->magic != AIO_RING_MAGIC) {
        goto do_syscall;
    }

    while (i < max_nr) {
        unsigned head = ring->head;
        if (head == ring->tail) {
            /* There are no more completions */
            break;
        } else {
            /* There is another completion to reap */
            events[i] = ring->events[head];
            ring->head = (head + 1) % ring->nr;
            i++;
        }
    }

    if (i == 0 && timeout != NULL && timeout->tv_sec == 0 && timeout->tv_nsec == 0) {
        /* Requested non blocking operation. */
        return 0;
    }

    if (i && i >= min_nr) {
        return i;
    }

do_syscall:
    return syscall(__NR_io_getevents, ctx, min_nr-i, max_nr-i, &events[i], timeout);
}

The full version of the code is available on GitHub. The interface of this ring buffer is poorly documented; the author adapted the code from the axboe/fio project.
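For context, the code above casts the context pointer to a ring header whose layout is not shown in the article; the following definition is adapted from fio (it mirrors the kernel's fs/aio.c), so treat it as a sketch of that undocumented interface:

```c
#include <linux/aio_abi.h>   /* struct io_event */

#define AIO_RING_MAGIC 0xa10a10a1

/* User-space view of the shared completion ring (layout from fio). */
struct aio_ring {
    unsigned id;              /* kernel internal index number */
    unsigned nr;              /* number of io_events in the ring */
    unsigned head;            /* next event to reap */
    unsigned tail;            /* next slot the kernel will fill */

    unsigned magic;           /* must equal AIO_RING_MAGIC */
    unsigned compat_features;
    unsigned incompat_features;
    unsigned header_length;   /* size of struct aio_ring */

    struct io_event events[]; /* the completions themselves */
};
```

The magic check in io_getevents above guards against kernels that do not map the ring this way, falling back to the real system call.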

After this change, our Linux AIO version of the code needs only one system call per loop iteration, which makes it slightly faster than the original read+write code.

Photo: Train Photos, CC BY-SA 2.0

Alternative to epoll

With the addition of IOCB_CMD_POLL in kernel version 4.18, io_submit can also be used as a replacement for select/poll/epoll. For example, this code waits for data on a network connection:

struct iocb cb = {.aio_fildes = sd,
                  .aio_lio_opcode = IOCB_CMD_POLL,
                  .aio_buf = POLLIN};
struct iocb *list_of_iocb[1] = {&cb};
r = io_submit(ctx, 1, list_of_iocb);
r = io_getevents(ctx, 1, 1, events, NULL);

The full code is available on GitHub. Here is its strace output:

io_submit(0x7fe44bddd000, 1, [{aio_lio_opcode=IOCB_CMD_POLL, aio_fildes=3}]) \
    = 1 <0.000015>
io_getevents(0x7fe44bddd000, 1, 1, [{data=0, obj=0x7ffef65c11a8, res=1, res2=0}], NULL) \
    = 1 <1.000377>

As you can see, this time the asynchrony worked: io_submit returned instantly and io_getevents blocked for one second while waiting for data. This can be used instead of the epoll_wait() system call.

Moreover, working with epoll normally requires epoll_ctl system calls, and application developers try hard to avoid calling it frequently: to understand why, it is enough to read about the EPOLLONESHOT and EPOLLET flags in the manual. Using io_submit to poll connections avoids these difficulties and the extra system calls. Simply add your connections to an iocb vector, call io_submit once, and wait for completions. It is as simple as that.


In this post, we looked at the Linux AIO API. This API was originally designed for disk I/O, yet it also works with network connections. However, unlike ordinary read()+write() calls, io_submit lets you batch system calls and thereby increase performance.

Starting with kernel version 4.18, io_submit and io_getevents can be used with network connections to wait for POLLIN and POLLOUT events. This is an alternative to epoll().

I can imagine a network service that uses only io_submit and io_getevents instead of the standard set of read, write, epoll_ctl and epoll_wait. In that case, batching system calls in io_submit could give a big advantage; such a server would be noticeably faster.

Unfortunately, even after the recent improvements to the Linux AIO API, the debate about its usefulness continues. It is well known that Linus hates it:

“AIO is a terrible example of ad-hoc design, where the main excuse is: ‘other, less gifted people came up with it, so we must stay compatible so that database developers (who rarely have any taste) can use it.’ But AIO has always been really, really ugly.”

Several attempts have been made to create a better interface for batching and asynchrony, but they lacked a common vision. For example, the recently added sendto(MSG_ZEROCOPY) allows truly asynchronous transmission but no batching, while io_submit provides batching but not asynchrony. Even worse, there are currently three ways to deliver asynchronous events in Linux: signals, io_getevents, and MSG_ERRQUEUE.

In any case, it is great that there are new ways to speed up the work of network services.
