epoll and Windows IO Completion Ports: The Practical Difference

Introduction

In this article we will look at how the epoll mechanism differs in practice from Windows I/O Completion Ports (IOCP). This may be of interest to system architects designing high-performance network services and to programmers porting network code from Windows to Linux or vice versa.

Both of these technologies are highly efficient for handling a large number of network connections.

They differ from other methods in the following points:

There are no restrictions (except for the total system resources) on the total number of observed descriptors and event types.
Scaling works well: if you already monitor N descriptors, moving to N + 1 costs very little time and resources.
It's easy enough to use a thread pool for parallel processing of current events.
There is no point in using them for a handful of network connections. The benefits begin to show at 1000+ connections.

In other words, both technologies are designed for building network services that handle many incoming client connections. There is nevertheless a significant difference between them, and it is important to understand it when developing such services.

(Upd: this article is a translation)

Type of Notifications

The first and most important difference between epoll and IOCP is how you are notified of an event that has happened.

epoll tells you when a descriptor is ready for an operation: "you can start reading the data now"
IOCP tells you when a requested operation has completed: "you asked to read the data, and here it is"

When using epoll, the application:

Decides what operation it wants to perform on a descriptor (read, write, or both)
Sets the appropriate mask using epoll_ctl
Calls epoll_wait, which blocks the current thread until at least one expected event occurs (or the timeout expires)
Enumerates the received events and takes a pointer to the context (from the data.ptr field)
Starts handling the events according to their type (read, write, or both)
After the operation completes (which should happen immediately), goes back to waiting for data to receive / send
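
As an illustration, the epoll steps above can be sketched in Python, whose select.epoll object wraps epoll_ctl and epoll_wait on Linux. This is a minimal sketch under stated assumptions, not a production loop: the names wait_and_read, demo and contexts are made up for the example, and because the Python wrapper returns (fd, event mask) pairs rather than the data.ptr field, a dictionary keyed by descriptor stands in for the context pointer.

```python
import select
import socket


def wait_and_read(ep, contexts, timeout=1.0):
    """One epoll_wait round: read from every descriptor reported readable."""
    received = {}
    for fd, events in ep.poll(timeout):        # epoll_wait
        conn = contexts[fd]                    # stand-in for the data.ptr context
        if events & select.EPOLLIN:
            # The descriptor is ready, so recv returns immediately.
            received[fd] = conn.recv(4096)
    return received


def demo():
    """Hypothetical driver: a local socket pair plays the role of a connection."""
    a, b = socket.socketpair()
    a.setblocking(False)
    ep = select.epoll()
    try:
        contexts = {a.fileno(): a}
        ep.register(a.fileno(), select.EPOLLIN)   # epoll_ctl(EPOLL_CTL_ADD)
        b.sendall(b"ping")
        return wait_and_read(ep, contexts)[a.fileno()]
    finally:
        ep.close()
        a.close()
        b.close()
```

Note that the buffer passed to recv only comes into play after the readiness notification arrives; epoll itself never sees it.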

When using IOCP, the application:

Initiates an operation (ReadFile or WriteFile) on a handle, passing a non-null OVERLAPPED argument. The operating system queues the request itself, and the called function returns immediately, without waiting for the operation to complete.
Calls GetQueuedCompletionStatus(), which blocks the current thread until one of the previously queued requests completes. If several have completed, only one of them is returned per call.
Processes the received completion notification using the completion key and the pointer to OVERLAPPED.
Goes back to waiting for data to receive / send

The difference in the type of notifications makes it possible (and rather trivial) to emulate IOCP using epoll. For example, the Wine project does just that. However, doing the opposite is not so easy. Even if you succeed, it will probably lead to a loss of performance.

Data availability

If you plan to read data, your code needs a buffer into which the data will be read. If you plan to send data, there must be a buffer holding the data ready to be sent.

epoll doesn't care about these buffers and doesn't use them at all
IOCP needs these buffers. The whole point of using IOCP is to work in the style of "read me 256 bytes from this socket into this buffer." You form such a request, hand it to the OS, and wait for the completion notification (and you must not touch the buffer in the meantime!)

A typical network service operates on connection objects, which include descriptors and associated buffers for reading / writing data. Typically, these objects are destroyed when the corresponding socket is closed. And this imposes some limitations when using IOCP.

IOCP works by adding read and write requests to a queue; these requests are executed in queue order (that is, at some later time). In both cases, the buffers passed in must continue to exist until the requested operations complete. Moreover, you cannot even modify the data in these buffers while waiting. This imposes important limitations:

You cannot use local variables (placed on the stack) as a buffer. The buffer must remain valid until the read / write operation completes, whereas the stack frame is destroyed when the current function returns
You cannot reallocate the buffer on the fly (for example, when it turns out that you need to send more data and want to enlarge the buffer). You can only create a new buffer and a new send request.
If you are writing something like a proxy, where the same data is read and then sent on, you will have to use two separate buffers. You cannot ask the OS to read data into a buffer in one request and to send data from that same buffer in another
You need to think carefully about how your connection manager class will destroy each specific connection. You must have a full guarantee that, at the moment the connection is destroyed, no outstanding read / write request still uses the buffers of this connection.

IOCP operations also require passing a pointer to an OVERLAPPED structure, which must likewise continue to exist (and not be reused) until the operation completes. This means that if you need to read and write data at the same time, you cannot simply inherit from the OVERLAPPED structure (an idea that often comes to mind). Instead, you need to store two OVERLAPPED structures as members of a class of your own, passing one of them to read requests and the other to write requests.

epoll does not use any buffers passed to it from user code, so none of these problems concern it.

Changing waiting conditions

Adding a new type of expected event (for example, we were waiting to read data from a socket and now also want to be able to send data) is possible and simple enough with both epoll and IOCP. epoll lets you change the mask of expected events (at any time, even from another thread), and IOCP lets you start another operation that waits for the new type of event.

Changing or removing events you are already waiting for is a different matter. epoll still lets you modify the condition by calling epoll_ctl (including from other threads). With IOCP everything is more complicated. If an I/O operation has been scheduled, you can cancel it by calling the CancelIo() function. Worse, CancelIo() only cancels operations started by the calling thread, so designs with a separate control thread run into this limitation (CancelIoEx(), available since Windows Vista, can cancel operations started by other threads). In addition, even after calling CancelIo() we cannot be sure that the operation will be canceled immediately (it may already be running, using the OVERLAPPED structure and the supplied read / write buffer). We still have to wait for the operation to complete (its result will be returned by the GetOverlappedResult() function), and only then can we free the buffer.

Another problem with IOCP is that once an operation has been scheduled, it can no longer be changed. For example, you cannot change a pending ReadFile request to say that you want to read only 10 bytes instead of 8192. You need to cancel the current operation and start a new one. This is not a problem for epoll: when you start waiting, it has no idea how much data you will want to read once the notification arrives that data can be read.
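
On the epoll side, the mask changes described in this section each amount to a single call. The sketch below uses Python's select.epoll wrapper on Linux (register, modify and unregister correspond to EPOLL_CTL_ADD, EPOLL_CTL_MOD and EPOLL_CTL_DEL); the function name mask_demo and the socket pair are assumptions made for the illustration.

```python
import select
import socket


def mask_demo():
    """Hypothetical demo: returns (fd, events seen after each mask change)."""
    a, b = socket.socketpair()
    ep = select.epoll()
    fd = a.fileno()
    seen = []
    try:
        ep.register(fd, select.EPOLLIN)                   # EPOLL_CTL_ADD: reads only
        seen.append(ep.poll(0))                           # nothing to read yet
        ep.modify(fd, select.EPOLLIN | select.EPOLLOUT)   # EPOLL_CTL_MOD: add writes
        seen.append(ep.poll(0))                           # the socket is writable
        ep.modify(fd, select.EPOLLIN)                     # drop write interest again
        seen.append(ep.poll(0))
        ep.unregister(fd)                                 # EPOLL_CTL_DEL
    finally:
        ep.close()
        a.close()
        b.close()
    return fd, seen
```

No operation has to be canceled and no buffer has to outlive the change, because nothing was handed to the kernel except the mask.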

Non-blocking connect

Some network service implementations (related services, FTP, p2p) require outgoing connections. Both epoll and IOCP support non-blocking connection requests, but in different ways.

When using epoll, the code is generally the same as for select or poll. You create a non-blocking socket, call connect() on it, and wait for a notification that it has become writable.
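
A minimal sketch of that sequence in Python on Linux might look as follows. The function name connect_nonblocking and its timeout parameter are hypothetical; the check of SO_ERROR after the socket becomes writable is the usual way to learn whether the connection attempt actually succeeded.

```python
import errno
import os
import select
import socket


def connect_nonblocking(address, timeout=2.0):
    """Start a TCP connect without blocking, then wait for writability."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    err = sock.connect_ex(address)          # typically returns EINPROGRESS
    if err not in (0, errno.EINPROGRESS):
        sock.close()
        raise OSError(err, os.strerror(err))
    ep = select.epoll()
    try:
        ep.register(sock.fileno(), select.EPOLLOUT)
        if not ep.poll(timeout):            # no event before the timeout
            sock.close()
            raise TimeoutError("connect timed out")
        # Writable does not mean connected: SO_ERROR holds the real outcome.
        err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        if err:
            sock.close()
            raise OSError(err, os.strerror(err))
        return sock
    finally:
        ep.close()
```

In a real service the socket would be registered with the long-lived epoll instance instead of a temporary one; the temporary instance only keeps the sketch self-contained.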

When using IOCP, you need to use the separate ConnectEx function, since connect() does not accept an OVERLAPPED structure and therefore cannot later generate a notification about a change in the state of the socket. So the connection initiation code will not only differ from the code using epoll, it will differ even from Windows code using select or poll. However, the changes can be considered minimal.

Interestingly, accept() works with IOCP as usual. There is also the AcceptEx function, but its role is completely unrelated to a non-blocking connection. This is not a “non-blocking accept,” as one might think by analogy with connect / ConnectEx.

Event monitoring

Often, after an event is triggered, additional data arrives very quickly. For example, we were waiting for input on a socket using epoll or IOCP, received an event for the first few bytes of data, and while we were reading them another hundred bytes arrived. Can we read them without restarting event monitoring?

With epoll, this is possible. You receive the event “something can now be read” and you read everything that can be read from the socket (until you get an EAGAIN error). The same is true for sending data: having received a signal that the socket is ready to send, you can keep writing to it until the write function returns EAGAIN.
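
The read side of this can be sketched as a small helper, here in Python, where EAGAIN on a non-blocking socket surfaces as BlockingIOError. The name drain and the chunk size are assumptions for the illustration; the helper would be called from the event loop once epoll reports the descriptor as readable.

```python
import socket


def drain(sock, chunk=4096):
    """Read everything currently buffered on a non-blocking socket."""
    parts = []
    while True:
        try:
            data = sock.recv(chunk)
        except BlockingIOError:     # EAGAIN / EWOULDBLOCK: nothing left for now
            break
        if not data:                # empty read: the peer closed the connection
            break
        parts.append(data)
    return b"".join(parts)
```

The loop issues as many reads as needed without going back to epoll_wait in between, which is exactly what the completion-based model does not allow.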

With IOCP this will not work. If you asked to read or send 10 bytes on a socket, that is exactly what will be read / sent (even if more could already be handled). For each subsequent block you need to make a separate request using ReadFile or WriteFile and then wait until it completes. This can create an extra level of complexity. Consider the following example:

The socket class created a read request by calling ReadFile. Threads A and B are waiting for the result by calling GetQueuedCompletionStatus()
The read operation completed; thread A received the notification and called a socket class method to process the received data
The socket class decided that this data is not enough and that it must wait for more. It places another read request.
This request completes immediately (the data has already arrived, so the OS can hand it over at once). Thread B receives the notification, reads the data, and passes it to the socket class.
At this point the data-reading function of the socket class is being called from both thread A and thread B, which leads either to a risk of data corruption (without synchronization objects) or to additional pauses (with synchronization objects)

Synchronization objects are a difficult matter here in general. It is fine if there is only one. But if we have 100,000 connections and each has its own synchronization object, this can put serious pressure on system resources. And if you need two per connection (to handle read and write requests separately), it gets even worse.

The usual solution here is to create a connection manager class that will be responsible for calling ReadFile or WriteFile for the connection class. This works better, but makes the code more complex.

Conclusions

Both epoll and IOCP are suitable (and used in practice) for writing high-performance network services that handle a large number of connections. The two technologies handle events in fundamentally different ways. The differences are so significant that it is hardly worth trying to build both on a common code base (the amount of shared code will be minimal). I have tried several times to bring the two approaches under a universal solution, and each time the result was worse in complexity, readability, and maintainability than two independent implementations. Each time the universal solution had to be abandoned.

When porting code from one platform to another, it is usually easier to port IOCP code to epoll than the other way around.

Tips:

If your task is to develop a cross-platform network service, you should start with the Windows implementation using IOCP. Once everything is ready and debugged, add a trivial epoll backend.
You should not try to write common Connection and ConnectionMgr classes that implement the logic of epoll and IOCP at the same time. This looks bad in terms of code architecture and results in a pile of #ifdefs with different logic inside them. It is better to make base classes and derive separate implementations from them. The base classes can hold any common methods or data, if there are any.
Carefully track the lifetime of objects of the Connection class (or whatever you call the class that holds the buffers for received / sent data). Such an object must not be destroyed until the scheduled read / write operations that use its buffers have completed.
