The whole truth about Linux epoll

    Well, or almost all ...

    I think the problem with the modern Internet is an overabundance of information of varying quality. Finding material on a topic of interest is not a problem; the problem is distinguishing good material from bad when you have little experience in the field. I see a picture where there is a lot of overview information "at the top" (almost at the level of a simple listing), very few in-depth articles, and almost no transitional articles leading from simple to complex. Yet it is precisely knowledge of the features of a particular mechanism that allows us to make an informed choice during development.

    In this article I will try to show what fundamentally distinguishes epoll from other mechanisms and what makes it unique, and to point out articles that simply must be read for a deeper understanding of the possibilities and problems of epoll .


    I assume that the reader is familiar with epoll , having at least read the man page. Enough has been written about epoll , poll and select that everyone who has developed under Linux has heard of them at least once.

    Many fd

    When people talk about epoll , I mostly hear the thesis that its "performance is higher when there are many file descriptors".

    I immediately want to ask: how many is "many"? How many connections are needed, and most importantly, under what conditions does epoll start to give a tangible performance gain?

    For those who have studied epoll (there is plenty of material, including scientific articles), the answer is obvious: it wins if and only if the number of connections "waiting for an event" significantly exceeds the number "ready for processing". The mark at which the gain becomes too significant to ignore is considered to be 10k connections [4].

    The assumption that most connections will be idle comes both from sound logic and from load monitoring of servers in active operation.

    If the number of active connections tends toward the total, there will be no significant gain: the gain comes from the fact that epoll returns only the descriptors that need attention, while poll returns all the descriptors that were added for observation.

    Obviously, in the latter case we spend time walking through all the descriptors, plus the overhead of copying the array of events from the kernel.
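This difference can be seen in a tiny experiment (a sketch of mine, not from the original article; the helper names `poll_scan_cost` and `epoll_scan_cost` are made up): three pipes are watched, only one becomes ready, and we count how many descriptors user space has to examine after a wakeup.

```c
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

#define NFDS 3

/* poll(): after a wakeup the caller must scan the whole pollfd array.
 * Returns how many descriptors user space examined. */
int poll_scan_cost(void) {
    int pipes[NFDS][2];
    struct pollfd pfds[NFDS];
    for (int i = 0; i < NFDS; i++) {
        pipe(pipes[i]);
        pfds[i].fd = pipes[i][0];
        pfds[i].events = POLLIN;
    }
    write(pipes[1][1], "x", 1);          /* only one pipe becomes ready */
    poll(pfds, NFDS, -1);
    int examined = 0;
    for (int i = 0; i < NFDS; i++) {     /* must walk the entire array */
        examined++;
        if (pfds[i].revents & POLLIN)
            read(pfds[i].fd, (char[1]){0}, 1);
    }
    for (int i = 0; i < NFDS; i++) { close(pipes[i][0]); close(pipes[i][1]); }
    return examined;                     /* always NFDS */
}

/* epoll: the kernel hands back only the ready descriptors. */
int epoll_scan_cost(void) {
    int pipes[NFDS][2];
    int efd = epoll_create1(0);
    for (int i = 0; i < NFDS; i++) {
        pipe(pipes[i]);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipes[i][0] };
        epoll_ctl(efd, EPOLL_CTL_ADD, pipes[i][0], &ev);
    }
    write(pipes[1][1], "x", 1);
    struct epoll_event out[NFDS];
    int n = epoll_wait(efd, out, NFDS, -1);   /* only the ready fds */
    for (int i = 0; i < n; i++)
        read(out[i].data.fd, (char[1]){0}, 1);
    for (int i = 0; i < NFDS; i++) { close(pipes[i][0]); close(pipes[i][1]); }
    close(efd);
    return n;
}
```

With thousands of mostly idle descriptors the per-wakeup scan is exactly the cost that epoll avoids.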

    Indeed, the initial performance measurement attached to the patch [9] does not emphasize this point, and you can only guess at it from the deadcon utility mentioned there (unfortunately, the code of the pipetest.c utility is lost). In other sources [6, 8], on the other hand, it is very hard to miss, since this fact practically leaps out at you.

    The question immediately arises: if you do not plan to serve that many file descriptors, does that mean you do not need epoll ?

    Although epoll was originally created precisely for such situations [5, 8, 9], that is not its only distinguishing feature.


    First of all, let us see what the difference is between edge-triggered and level-triggered operation; there is a very good explanation of this topic in the article Edge Triggered Vs Level Triggered interrupts by Venkatesh Yadav:

    A level-triggered interrupt is like a baby. If the baby is crying, you have to drop whatever you were doing and run to feed him. Then you put him back in the crib. If he cries again, you don't go anywhere; you try to calm him down. As long as the baby is crying, you don't leave him for a moment, and you return to work only when he calms down. But suppose you went out to the garden (interrupts disabled) when the baby started crying; when you return home (interrupts enabled), the first thing you do is check on the baby. But you will never know that he cried while you were in the garden.

    An edge-triggered interrupt is like a baby monitor for deaf parents. As soon as the baby starts crying, a red light on the device comes on and stays on until you press a button. Even if the baby cried briefly and fell back asleep, you still know that he was crying. But if he started to cry and you pressed the button (acknowledged the interrupt), the light will not come back on even if he keeps crying. The sound level in the room has to drop and then rise again for the light to come on.

    While level-triggered epoll (like poll / select ) unblocks if the descriptor is in the specified state and considers it active until that state is cleared, edge-triggered epoll unblocks only on a change of the given state.

    This lets you handle the event later rather than immediately upon receipt (almost a direct analogy with the top half and bottom half of an interrupt handler).

    A concrete example with epoll:

    Level triggered

    • the descriptor is added to epoll with the EPOLLIN flag
    • epoll_wait() blocks waiting for an event
    • 19 bytes are written to the file descriptor
    • epoll_wait() unblocks with an EPOLLIN event
    • we do nothing with the data that arrived
    • epoll_wait() unblocks again with an EPOLLIN event

    And so it will continue until we read or discard all the data from the descriptor.
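The level-triggered sequence above can be sketched in a few lines (my own minimal demo, not code from the article; the pipe and the `lt_wakeups` helper are assumptions): with 19 unread bytes sitting in the descriptor, every epoll_wait() call reports it ready again.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Returns the number of consecutive epoll_wait() wakeups observed
 * while 19 unread bytes sit in the pipe (level-triggered mode). */
int lt_wakeups(void) {
    int p[2];
    pipe(p);
    int efd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN };  /* level-triggered */
    epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev);
    write(p[1], "0123456789012345678", 19);         /* 19 bytes, never read */
    int wakeups = 0;
    struct epoll_event out;
    for (int i = 0; i < 2; i++)
        if (epoll_wait(efd, &out, 1, 100) == 1 && (out.events & EPOLLIN))
            wakeups++;                              /* fires every time */
    close(p[0]); close(p[1]); close(efd);
    return wakeups;
}
```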

    Edge triggered

    • the descriptor is added to epoll with the EPOLLIN | EPOLLET flags
    • epoll_wait() blocks waiting for an event
    • 19 bytes are written to the file descriptor
    • epoll_wait() unblocks with an EPOLLIN event
    • we do nothing with the data that arrived
    • epoll_wait() blocks waiting for a new event
    • another 19 bytes are written to the file descriptor
    • epoll_wait() unblocks with a new EPOLLIN event
    • epoll_wait() blocks waiting for a new event
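The same sequence as a minimal sketch (my own demo, distinct from the article's epollet_socket.c; the `et_wakeups` helper is a made-up name): in edge-triggered mode the unread data does not re-trigger epoll_wait(), but a fresh write does.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Counts wakeups in edge-triggered mode: one for the initial write,
 * none for the unread leftovers, one for the new write. */
int et_wakeups(void) {
    int p[2];
    pipe(p);
    int efd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET };
    epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev);
    write(p[1], "0123456789012345678", 19);
    int wakeups = 0;
    struct epoll_event out;
    /* first wait: the empty -> non-empty edge fires */
    if (epoll_wait(efd, &out, 1, 100) == 1) wakeups++;
    /* second wait: no new edge, times out even though data is unread */
    if (epoll_wait(efd, &out, 1, 100) == 1) wakeups++;
    write(p[1], "x", 1);                  /* new data produces a new edge */
    if (epoll_wait(efd, &out, 1, 100) == 1) wakeups++;
    close(p[0]); close(p[1]); close(efd);
    return wakeups;   /* initial edge + the new write */
}
```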

    simple example: epollet_socket.c

    This mechanism exists to prevent epoll_wait() from returning because of an event that is already being handled.

    If in the level-triggered case the kernel checks on every epoll_wait() call whether the fd is in the given state, in the edge-triggered case it skips this check and immediately puts the calling process to sleep.

    In fact, EPOLLET is what makes epoll an O(1) event multiplexer.

    A clarification is needed about EAGAIN and EPOLLET : the recommendation to read until EAGAIN does not strictly apply to byte streams; there, the danger only arises if you did not read the descriptor to the end and no new data arrived. Then the tail will hang in the descriptor and you will not receive a new notification. With accept() the situation is different: you must keep calling it until accept() returns EAGAIN ; only then is correct operation guaranteed.

    // TCP socket (byte stream)
    // reading fd returned by epoll_wait with an EPOLLIN event in edge-triggered mode
    int len = read(fd, buffer, BUFFER_LEN);
    if (len < BUFFER_LEN) {
        // all good, everything has been read
    } else {
        // no guarantee that no data is left in the descriptor;
        // if something is left, we will hang on epoll_wait
        // unless new data arrives
    }

    // accept
    // processing listenfd returned by epoll_wait with an EPOLLIN event in edge-triggered mode
    event.events = EPOLLIN | EPOLLET;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, listenfd, &event);
    sleep(5); // during this time more than one client connects to us

    // bad scenario
    while (epoll_wait(...)) {
        newfd = accept(listenfd, ...); // accept the connection from the first client
        // however many clients connect after this,
        // we will get no more events for listenfd from epoll_wait
    }

    // good scenario
    while (epoll_wait(...)) {
        while ((newfd = accept(listenfd, ...)) > 0) {
            // do something useful
        }
        if (newfd == -1 && errno == EAGAIN) {
            // all good: the descriptor state has been reset,
            // we will be notified on the next connection
        }
    }

    With this property, it is quite easy to get starvation:

    • packets arrive on the descriptor
    • we read the packets into a buffer
    • another packet arrives
    • we read the packets into a buffer
    • a small portion arrives
    • ...

    Thus we will not see EAGAIN soon, and we may not see it at all.

    Meanwhile, other file descriptors get no processing time: we are busy reading the constantly arriving small portions of data.
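One common mitigation, sketched here under my own naming (`drain_bounded` is a hypothetical helper, not from the article), is to cap the number of bytes consumed per wakeup: instead of reading one descriptor to EAGAIN in a single go, the caller spends a bounded budget and re-queues the descriptor if it may still hold data, so the other descriptors get their turn.

```c
#include <errno.h>
#include <unistd.h>

/* Reads at most `budget` bytes from a non-blocking fd.
 * Returns 1 if the descriptor may still hold data (the caller should
 * queue it for another round), 0 if it was drained to EAGAIN or EOF. */
int drain_bounded(int fd, char *buf, size_t buflen, size_t budget) {
    size_t consumed = 0;
    while (consumed < budget) {
        size_t want = buflen < budget - consumed ? buflen : budget - consumed;
        ssize_t n = read(fd, buf, want);
        if (n > 0) { consumed += (size_t)n; continue; }
        if (n == -1 && errno == EAGAIN)
            return 0;           /* fully drained; EPOLLET will fire again */
        return 0;               /* EOF or error: nothing left to read */
    }
    return 1;                   /* budget spent, data may remain */
}
```

The returned flag replaces the implicit "read until EAGAIN" contract: the tail left in the descriptor is now tracked by the application rather than silently lost.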

    thundering nerd herd

    To get to the last flag, we need to understand why it was created, and one of the problems that the developers faced as hardware and software evolved.

    Thundering herd problem


    Imagine a huge number of processes waiting for an event. When the event arrives, they are all awakened and a struggle for resources begins, although only one process is needed to handle the event that occurred. The remaining processes go back to sleep.

    IT terminology - Vasily Alekseenko

    Here we are interested in this problem for accept() and read() distributed across threads in combination with epoll .


    Actually, with a blocking accept() call there is no problem any more: the kernel itself makes sure that only one process is woken for the event, and incoming connections are serialized.

    But with epoll this trick does not work: if we listen() on a non-blocking socket, then when a connection arrives, every epoll_wait() waiting for an event on that descriptor will be woken.

    Of course, only one thread will succeed with accept() ; the rest will get EAGAIN , but this is a waste of resources.

    Moreover, EPOLLET does not help us either, since we do not know how many connections are in the backlog queue. As we remember, with EPOLLET the socket must be processed until accept() returns the error code EAGAIN , so there is a chance that all the connections will be accepted by one thread while the others get nothing.

    Which again brings us to a situation where a neighboring thread was woken in vain.

    We can also get starvation of another kind: only one thread will carry the load, while the rest receive no connections to process.


    Prior to kernel 4.5, the only correct way to handle a non-blocking listen() descriptor with a subsequent accept() call, distributed across threads with epoll , was to set the EPOLLONESHOT flag, which again meant that accept() was processed by only one thread at a time.

    In short, with EPOLLONESHOT the event associated with a given descriptor fires only once, after which the flags must be set again with epoll_ctl() .


    This is where EPOLLEXCLUSIVE and level-triggered come to the rescue .

    EPOLLEXCLUSIVE wakes only one waiting epoll_wait() at a time per event.

    The scheme is quite simple (actually not):

    • We have N threads waiting for a connection event.
    • The first client connects to us.
    • Thread 0 is woken and starts processing; the rest remain blocked.
    • A second client connects; if thread 0 is still busy processing, thread 1 is woken.
    • We continue until the thread pool is exhausted (nobody is waiting for an event in epoll_wait() ).
    • Another client connects.
    • It will be handled by the first thread to call epoll_wait() again.
    • The next client will be handled by the next thread to call epoll_wait() .

    Thus, connections are distributed evenly across the threads.
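The core of the scheme can be sketched as follows (my own minimal demo, separate from the epollexclusive.c example; a pipe stands in for the listening socket and all names are made up). Each worker registers the shared descriptor in its own epoll instance with EPOLLEXCLUSIVE (kernel 4.5+). Note the kernel only promises to wake "one or more" sleeping waiters; one is the typical case.

```c
#include <pthread.h>
#include <sys/epoll.h>
#include <unistd.h>

static int shared_fd;   /* pipe read end standing in for listenfd */

static void *worker(void *arg) {
    (void)arg;
    /* each worker has its OWN epoll instance watching the shared fd */
    int efd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };
    epoll_ctl(efd, EPOLL_CTL_ADD, shared_fd, &ev);
    struct epoll_event out;
    long woken = (epoll_wait(efd, &out, 1, 500) == 1);
    close(efd);
    return (void *)woken;
}

/* Two workers sleep in epoll_wait(); one event arrives.
 * Returns how many workers were woken (typically 1, never 0). */
int exclusive_wakeups(void) {
    int p[2];
    pipe(p);
    shared_fd = p[0];
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    usleep(100 * 1000);        /* let both workers block in epoll_wait */
    write(p[1], "x", 1);       /* one "connection" arrives */
    int woken = 0;
    for (int i = 0; i < 2; i++) {
        void *r;
        pthread_join(t[i], &r);
        woken += (int)(long)r;
    }
    close(p[0]); close(p[1]);
    return woken;              /* without EPOLLEXCLUSIVE both would wake */
}
```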

    $ ./epollexclusive --help  
        -i, --ip=ADDR specify ip address  
        -p, --port=PORT specify port  
        -n, --threads=NUM specify number of threads to use # number of server threads; clients: n*8
        -t, --thunder not adding EPOLLEXCLUSIVE # with this flag the thundering herd is reproduced
        -h, --help prints this message
    $ sudo  taskset -c 0-7 ./epollexclusive -i -p 40000 -n 8 2>&1

    example code: epollexclusive.c (works only with kernel version 4.5 or later)

    We get a pre-fork model on epoll. This scheme works well for short-lived TCP connections.


    But with read() on a byte stream, neither EPOLLEXCLUSIVE nor EPOLLET will help us.

    For obvious reasons, without EPOLLEXCLUSIVE we cannot use level-triggered mode at all. With EPOLLEXCLUSIVE things are no better, since a single message can end up spread across threads, with an unknown ordering of the received bytes.

    With EPOLLET, the situation is the same.

    Here the way out is EPOLLONESHOT with re-arming when the work is done, so that exactly one thread at a time works with a given file descriptor and buffer:

    • the descriptor is added to epoll with the EPOLLONESHOT | EPOLLET flags
    • we wait on epoll_wait()
    • we read from the socket into the buffer until read() returns EAGAIN
    • we re-arm the descriptor with the EPOLLONESHOT | EPOLLET flags via epoll_ctl()
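The steps above can be sketched on a pipe (my own demo; `oneshot_cycle` is a made-up name): after the first notification the descriptor is muted, even when new data arrives, until it is re-armed with EPOLL_CTL_MOD.

```c
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns the number of events delivered across the oneshot cycle:
 * one before the descriptor is muted, one after re-arming. */
int oneshot_cycle(void) {
    int p[2];
    pipe(p);
    fcntl(p[0], F_SETFL, O_NONBLOCK);
    int efd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET | EPOLLONESHOT };
    epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev);
    int fired = 0;
    struct epoll_event out;
    char buf[64];

    write(p[1], "x", 1);
    if (epoll_wait(efd, &out, 1, 100) == 1) fired++;   /* first event */
    while (read(p[0], buf, sizeof buf) > 0)            /* drain to EAGAIN */
        ;
    write(p[1], "y", 1);                               /* new data... */
    if (epoll_wait(efd, &out, 1, 100) == 1) fired++;   /* ...but fd is muted */
    epoll_ctl(efd, EPOLL_CTL_MOD, p[0], &ev);          /* re-arm */
    write(p[1], "z", 1);                               /* new edge after re-arm */
    if (epoll_wait(efd, &out, 1, 100) == 1) fired++;
    close(p[0]); close(p[1]); close(efd);
    return fired;
}
```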

    struct epoll_event

    typedef union epoll_data {
        void     *ptr;
        int       fd;
        uint32_t  u32;
        uint64_t  u64;
    } epoll_data_t;

    struct epoll_event {
        uint32_t     events; /* Epoll events */
        epoll_data_t data;   /* User data variable */
    };

    This item is probably the only purely personal opinion in the article. The ability to pass a pointer or a number is useful: for example, a pointer stored in the epoll event lets you pull a trick like this:

    #define container_of(ptr, type, member) ({                  \
        const typeof( ((type *)0)->member ) *__mptr = (ptr);    \
        (type *)( (char *)__mptr - offsetof(type, member) ); })

    struct epoll_client {
        /* some useful associated data... */
        struct epoll_event event;
    };

    struct epoll_client *to_epoll_client(struct epoll_event *event)
    {
        return container_of(event, struct epoll_client, event);
    }

    struct epoll_client ec;
    ec.event.events = EPOLLIN;
    ec.event.data.ptr = &ec.event; // so the pointer comes back to us with the event
    epoll_ctl(efd, EPOLL_CTL_ADD, fd, &ec.event);
    epoll_wait(efd, events, 1, -1);
    struct epoll_client *ec_ = to_epoll_client(events[0].data.ptr);

    I think everyone recognizes where this trick came from.


    I hope I have managed to open up the epoll topic. Anyone who wants to use this mechanism consciously simply must read the articles in the reference list [1, 2, 3, 5].

    Based on this material (better yet, after a thoughtful reading of the references), you can build a multi-threaded, pre-fork (processes spawned in advance), lock-free server, or revise existing strategies in light of the special properties of epoll .

    epoll is one of those unique mechanisms that anyone who has chosen the path of Linux programming should know about: it gives a significant advantage over other operating systems and may even let you abandon cross-platform support for a specific case (let it run only under Linux, but run well).

    Reasoning about the "specificity" of the problem

    Before someone starts talking about the "specificity" of these flags and usage patterns, I want to ask a question:

    "Are we really discussing the 'specificity' of a mechanism that was created for specific tasks in the first place [9, 11]? Or has serving even 1k connections become an everyday task for a programmer?"

    I do not understand the concept of "task specificity"; it reminds me of all the debates about the usefulness and uselessness of the various disciplines being taught. By allowing ourselves to reason this way, we claim the right to decide for others what information is useful to them and what is useless, while, mind you, not participating in the education process at all.

    For skeptics, a couple of links:

    Increasing performance with SO_REUSEPORT in NGINX 1.9.1 - VBart
    Learning from Unicorn: the accept() thundering herd non-problem - Chris Siebenmann
    Serializing accept(), AKA Thundering Herd, AKA the Zeeg Problem - Roberto De Ioris
    How does the epoll's EPOLLEXCLUSIVE mode interact with level-triggering?


    1. Select is fundamentally broken - Marek Majkowski
    2. Epoll is fundamentally broken 1/2 - Marek Majkowski
    3. Epoll is fundamentally broken 2/2 - Marek Majkowski
    4. The C10K problem - Dan Kegel
    5. Poll vs Epoll, once again - Jacques Mattheij
    6. epoll - I/O event notification facility - man page
    7. The method to epoll's madness - Cindy Sridharan



    Epoll evolution



    Many thanks to Sergey ( dlinyj ) and Peter Ovchenkov for valuable discussions, comments and help!
