Poorly documented Linux features

    Sighing, she said:
    “How long have I been sleeping!”
    Once, when I first met Unix, I was fascinated by the logical harmony and completeness of the system. For several years after that I furiously studied kernel internals and system calls, reading everything I could get my hands on. Little by little the hobby faded, more pressing matters came up, and then, starting from some point, I began to discover one feature after another that I had not known about before. The process is natural, but too often such incidents share one thing: the lack of an authoritative source of documentation. Often the answer comes in the form of the third comment from the top on Stack Overflow; often you have to piece together two or three sources to get the answer to exactly the question you asked. I want to collect here a small set of such poorly documented features. None of them are new, some are in fact quite old, but each one cost me several hours at some point, and for many I still know of no systematic description.

    All examples relate to Linux, although many of them are valid for other *nix systems; I simply took as a basis the most actively developed OS, and the one that is in front of my eyes, where I can quickly check the proposed code.

    Please note that the heading says "poorly documented," not "little-known," so those who are in the know are welcome to post links to coherent documentation in the comments; I will gladly add them to the list at the end.

    Does freed memory return to the OS?

    This question, posed by a colleague whom I respect completely, served as the trigger for this article. For half an hour afterwards I dragged it through the mud, explaining what the classics taught: memory in Unix is allocated through the sbrk() system call, which simply raises the upper limit of available addresses; memory is usually handed out in large chunks; and although it is, of course, technically possible to lower the limit and return memory to the OS for other processes, it is far too expensive for an allocator to track all the used and unused fragments, so returning memory is not provided for by design. This classic mechanism works fine in most cases, with the exception of a server that sits quietly idle for hours or months, suddenly requests a lot of pages to process an event, and falls asleep again (though swap helps in that case). Then, having satisfied my FAQ duty, I went, as an honest person, to confirm my opinion on the Internet, and was surprised to find that Linux, starting from 2.4, can use both sbrk() and mmap() to allocate memory, depending on the requested size, and that memory allocated through mmap() is fully returned to the OS after free()/delete. After such a blow I had only one choice: to apologize humbly and find out what exactly this mysterious size limit equals. Since I found no information, I had to measure it by hand. It turned out that on my system (3.13.0) it was only 120 bytes. The source code, for those who want to measure for themselves, is here.

    What is the minimum interval a process/thread can sleep for?

    The same Maurice Bach taught: the process scheduler in the kernel is activated by any interrupt. Having received control, the scheduler walks the list of sleeping processes, moves those that have woken up (received the requested data from a file or socket, had their sleep() interval expire, etc.) to the "ready to run" list, and then exits the interrupt into the current process. On the system timer interrupt, which used to fire once every 100 ms and, as CPUs got faster, every 10 ms, the scheduler additionally puts the current process at the end of the "ready to run" list and starts the first process from the head of that list. Thus, if I called sleep(0), or fell asleep for an instant for any other reason, so that my process was moved off the "ready to run" list, it had no chance of running again in less than 10 ms, even if it was alone in the system. In principle the kernel can be rebuilt with a smaller interval, but that causes unreasonably high CPU overhead, so it is not an option. This well-known limitation poisoned the lives of developers of fast-response systems for many years, and to a large extent it stimulated the development of real-time systems and non-blocking (lock-free) algorithms.

    And then one day I repeated this experiment (I was actually interested in subtler things, such as the probability distribution) and suddenly saw that the process wakes up after sleep(0) in 40 μs, 250 times faster. The same goes for yield(), std::mutex::lock() and all other blocking calls. What is going on?!

    The search fairly quickly led to the Completely Fair Scheduler, introduced in 2.6.23, but for a long time I could not understand how exactly this mechanism leads to such fast switching. As I finally figured out, the difference lies in the algorithm of the default scheduling class itself, the one under which all processes start. Unlike the classical implementation, here every running process/thread has a dynamic priority, so the priority of the currently running process gradually decreases relative to the others waiting to execute. Thus the scheduler can decide to start another process immediately, without waiting for the end of a fixed interval, and process selection itself has become cheap enough (CFS keeps runnable tasks in a red-black tree ordered by received CPU time) to be performed much more often.

    This change has surprisingly far-reaching consequences: the gap between a real-time and an ordinary system has in fact almost disappeared, and the 40-microsecond delay shown above is quite small for most applications. The same can be said of non-blocking algorithms: classical blocking data structures built on mutexes have become very competitive.

    And what are these scheduling policies?

    This topic is more or less described, so I will not repeat it; nevertheless, let us open one and then the other authoritative book at the corresponding page and compare them. In places they repeat each other almost verbatim, and both diverge somewhat from what man -s2 sched_setscheduler says. A symptom, however.

    Let's just play with some code instead. I create several threads with different scheduling policies and priorities, suspend them all on a mutex, and wake them all at once, naturally expecting them to wake up in strict order of priority:

    iBolit# ./sche -d0 -i0 -b0 -f1 -r2 -f3 -i0 -i0 -i0 -d0
    6 SCHED_FIFO[3]
    5 SCHED_RR[2]
    4 SCHED_FIFO[1]
    1 SCHED_OTHER[0]
    2 SCHED_IDLE[0]
    3 SCHED_BATCH[0]
    7 SCHED_IDLE[0]
    8 SCHED_IDLE[0]
    9 SCHED_IDLE[0]
    10 SCHED_OTHER[0]

    The number at the beginning of each line is the order in which the thread was created. As you can see, the two real-time classes SCHED_FIFO and SCHED_RR always take precedence over the three ordinary classes SCHED_OTHER, SCHED_BATCH and SCHED_IDLE, and are ranked strictly by priority, which is what was required. But, for example, the fact that all three ordinary classes are equal at start-up is not mentioned anywhere at all; even SCHED_IDLE, which is far inferior in rights to the default SCHED_OTHER, runs ahead of it if it happens to be first in the queue on the mutex. Well, at least everything works, but
    Solaris has a hole in this place
    A few years ago I ran this test under Solaris and found that thread priorities were completely ignored: the threads woke up in a completely random order. I contacted Sun technical support, but received a surprisingly slurred and meaningless answer (before that they had cooperated with us willingly). Two weeks later, Sun was gone. I sincerely hope my request was not the reason.

    For those who want to experiment with priorities and classes, the source code is in the same place.

    Delayed TCP packets

    If the previous examples can be counted as pleasant surprises, this one is hardly pleasant.
    The story began a few years ago, when we suddenly discovered that one of our servers, sending clients a continuous stream of data, experienced periodic 40-millisecond delays. This happened infrequently, but we could not afford such a luxury, so a ritual dance with a sniffer and subsequent analysis was performed. Attention: in discussions on the Internet this problem is usually blamed on the Nagle algorithm, and that is wrong; by our results, the problem occurs on Linux when delayed ACK and slow start interact. Let us recall another classic, Richard Stevens, to refresh our memory.
    Delayed ACK is an algorithm that postpones sending the ACK for a received packet by several tens of milliseconds, in the expectation that a response packet will be sent immediately and the ACK can be piggybacked onto it, with the obvious purpose of reducing the traffic of empty datagrams on the network. This mechanism operates in an interactive TCP session, and by 1994, when TCP/IP Illustrated came out, it was already a standard part of the TCP/IP stack. Important for what follows: the delay can be interrupted, in particular, by the arrival of the next data packet, in which case a cumulative ACK for both datagrams is sent immediately.
    Slow start is a no less venerable algorithm, designed to protect intermediate routers from an overly aggressive source. At the beginning of a session the sender may send only one packet and must wait for an ACK from the receiver, after which it may send two, four, and so on, until it runs into the other regulating mechanisms. This mechanism obviously operates during heavy traffic and, significantly, it kicks in at the start of the session and after every forced retransmission of a lost datagram.
    TCP sessions can be divided into two large classes: interactive (such as telnet) and bulk (such as ftp). It is easy to see that the demands placed on traffic-regulating algorithms in these cases are often opposite; in particular, "hold the ACK" and "wait for the ACK" obviously contradict each other. In a steady TCP session the condition mentioned above saves the day: receipt of the next packet interrupts the delay, and an ACK for both segments is sent without waiting for a data packet to piggyback on. However, if one of the packets suddenly gets lost, the sending side immediately enters slow start: it sends one datagram and waits for a response; the receiving side receives one datagram and delays the ACK, since no data is being sent in reply; and the whole exchange hangs for 40 ms. Voilà.
    The effect arises precisely in Linux-to-Linux TCP connections; I have not seen it on other systems, so they apparently do something different in their implementations. And how to deal with it? Well, in principle Linux offers the (non-standard) TCP_QUICKACK option, which disables the delayed ACK, but the option is not sticky: it turns itself off, so the flag has to be set before every read()/write(). There are also tunables under /proc/sys/net/ipv4, in particular /proc/sys/net/ipv4/tcp_low_latency, but whether it does what I suspect it should do is unknown. Besides, that flag would apply to all TCP connections on the machine, which is no good.
    Any suggestions?

    From the mists of time

    And finally, the very first incident in Linux history, just to complete the picture.
    From the very beginning, Linux had a non-standard system call, clone(). It works like fork(), that is, it creates a copy of the current process, but the address space remains shared. It is easy to guess what it was invented for, and indeed this elegant solution immediately put Linux in the front ranks of operating systems for implementing multithreading. However, there is always a caveat...

    The thing is that when a process is cloned, all file descriptors, including sockets, are cloned as well. Before, there was a well-worn scheme: a socket is opened and handed to other threads, everyone works together sending and receiving data, then one of the threads decides to close the socket; all the others immediately see that the socket is closed, and the other end of the connection (in the case of TCP) sees it too. What happens now? If one of the threads decides to close its socket, the other threads know nothing about it, because they are actually separate processes with their own copies of that socket, and they keep working. Moreover, the other end of the connection also considers the connection open. It is a thing of the past now, but at the time this innovation broke the mental model of many network programmers, and quite a lot of code had to be rewritten for Linux.


    1. Maurice J. Bach. The Design of the UNIX Operating System
    2. Robert Love. Linux Kernel Development
    3. Daniel P. Bovet, Marco Cesati. Understanding the Linux Kernel
    4. W. Richard Stevens. TCP/IP Illustrated, Volume 1: The Protocols
    5. W. Richard Stevens. UNIX Network Programming
    6. W. Richard Stevens. Advanced Programming in the UNIX Environment
    7. Uresh Vahalia. UNIX Internals: The New Frontiers

    This is where your link on the topics covered could appear.

    The first arrivals:

    And I am really curious how much else I have slept through and how far behind I have fallen. Let me run a little poll.


    How little-known are the issues raised?

    • 2.4% Everything is trivial and well-known, that's exactly how everything should work. 28
    • 2.2% In principle, everything is known, but these are all really deviations from the classical Unix 26
    • 53.4% Learned something new. Interesting 621
    • 19.5% Never heard anything like this 227
    • 9.9% I use high-level libraries and frameworks. Such problems never occur 115
    • 1.2% What language is it written in? Use <....> and there will be no problems 14
    • 11.1% What is Linux? 130
