In defence of swap [on Linux]: common misconceptions

Original author: Chris Down
Translator's note: this fascinating article, which explains what swap is for on Linux and pushes back against a common misconception about it, was written by Chris Down, an SRE at Facebook who, among other things, works on new kernel metrics that help analyse memory pressure. He opens with a concise TL;DR...



TL;DR


  • Swap is an essential part of a well-functioning system. Without it, sane memory management becomes much harder to achieve.
  • Swap is not so much about getting extra memory in an emergency as it is about making memory reclamation egalitarian and efficient. Treating it as "emergency memory" is, in fact, generally harmful.
  • Disabling swap does not prevent disk I/O under memory contention; it merely shifts that disk I/O from anonymous pages to file pages. Not only can this be less efficient, because there is a smaller pool of pages left to reclaim from, it can itself contribute to that contention.

Foreword


While working on improving cgroup v2 and helping people adopt it, I have talked with many engineers about their attitudes to memory management, particularly about how applications behave under memory pressure and the heuristics the operating system uses "under the hood" to manage memory.

A recurring theme in these discussions has been swap. Swap is a hotly contested and poorly understood topic, even among people who have worked with Linux for many years. Many see it as useless or actively harmful: a relic of a time when memory was scarce and disks were a necessary evil providing much-needed extra space. Over the last few years I have kept running into arguments over this claim, and I have had plenty of discussions with colleagues, friends and fellow engineers trying to explain why swap is still a useful concept on modern machines with far more physical memory than in the old days.

There is a widespread misunderstanding of swap's purpose: many people see it only as "slow extra memory" for use in emergencies, without appreciating its contribution to keeping the operating system as a whole running well under normal load.

Many of us have heard the usual folklore about memory: "Linux uses too much memory", "swap should be twice the size of physical memory" and so on. Those misconceptions are easy to dispel, and discussion around them has become more precise in recent years, but the myth of "useless" swap is tied much more closely to heuristics and internals that cannot be explained with a simple analogy; discussing it requires a deeper understanding of memory management.

This post is aimed mostly at those who administer Linux systems and want to hear the arguments against running with no swap (or too little swap), or against setting vm.swappiness to 0.

Introduction


It is hard to explain why having swap and paging memory out to it is a good thing during normal operation without a shared understanding of some of the basic memory management mechanisms in Linux, so let's make sure we are speaking the same language.

Types of memory


Linux has many different types of memory, and each type has its own properties. Understanding their nuances is key to understanding why swap matters.

For example, there are pages ("blocks" of memory, usually 4 KiB each) that hold the code of every process running on the machine. There are also pages that cache data and metadata for the files those programs access, to speed up future access. These are part of the page cache, and from here on I will refer to them as file memory.

There are also pages that back memory allocations made inside that code, for example when malloc hands out new memory to write to, or when mmap is used with the MAP_ANONYMOUS flag. These are "anonymous" pages, so called because they are not backed by anything, and from here on I will refer to them as anonymous (anon) memory.

There are other types of memory as well: shared memory, slab memory, kernel stack memory, buffers and so on, but anonymous memory and file memory are the best known and the easiest to reason about, so they are the ones used in the examples below, which nevertheless apply equally to the other types.
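To make the distinction concrete, here is a minimal sketch (my own illustration, not from the original article) contrasting the two kinds of allocation: a file-backed mapping whose pages live in the page cache, and anonymous memory obtained via mmap with MAP_ANONYMOUS or via malloc. It assumes an ordinary Linux system where /etc/hostname exists; error handling is omitted for brevity.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    size_t len = 16 * 4096;

    /* File-backed mapping: these pages belong to the page cache
     * ("file" memory). Clean copies can simply be dropped on reclaim. */
    int fd = open("/etc/hostname", O_RDONLY);
    void *file_mem = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);

    /* Anonymous mapping: no backing file ("anon" memory). Without swap,
     * these pages cannot be reclaimed without losing their contents. */
    void *anon_mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(anon_mem, 0x2a, len);   /* touch the pages so they are really allocated */

    /* malloc also ends up as anonymous memory (via brk or mmap). */
    char *heap = malloc(len);
    memset(heap, 0x2a, len);

    free(heap);
    munmap(anon_mem, len);
    munmap(file_mem, 4096);
    close(fd);
    return 0;
}
```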

Reclaimable and unreclaimable memory


When thinking about a given type of memory, one of the central questions is whether it can be reclaimed. "Reclaim" here means that the system can remove pages of that type from physical memory without losing data.

For some page types this is easy. For example, for clean (unmodified) page cache memory, we are simply caching something that already exists on disk for performance, so the page can be dropped without any special work.

For some page types it is possible but not trivial. For example, for dirty (modified) page cache memory, we cannot simply drop the page, because it contains modifications that have not yet reached the disk; we must either decline to reclaim it, or write the changes back to disk before freeing the memory.

For some page types it is not possible at all. For example, the anonymous pages mentioned earlier exist only in memory and have no backing store, so they have to stay where they are (that is, in memory itself).

The nature of swap


If you look for an explanation of why Linux needs swap, you will inevitably find endless discussion of it purely as an extension of physical RAM for emergencies. Here, for example, is a random post I pulled from the first Google results for "what is swap":

"At its core, swap is emergency memory: spare space for those times when the system temporarily needs more physical memory than it has in RAM. It is considered 'bad' in the sense that it is slow and inefficient, and if the system constantly needs to use swap it obviously does not have enough memory. [..] If you have enough RAM to cover all your needs and do not expect to exceed it, you can run perfectly well without swap space."

Let me be clear that I am not blaming the author of that comment for its contents at all: it is a "well-known fact" accepted by many Linux sysadmins, and probably one of the most likely answers you will get if you ask about swap. Unfortunately, it is also a misconception about swap's purpose and use, especially on modern systems.

As I wrote above, reclaiming anonymous pages is "impossible", because by their nature they have no backing store to fall back on when their data is evicted from memory; reclaiming them would mean losing that data outright. But what if we could create such a store for these pages?

That is exactly what swap is. Swap is a storage area for these seemingly "unreclaimable" pages, which lets us push them out to a storage device on demand. It means they can now be treated as just as reclaimable as their simpler friends (such as clean file pages), allowing free physical memory to be used more efficiently.

Swap is first and foremost a mechanism for equality of reclamation, not for emergency "extra memory". It is not swap that slows your application down; the slowdown comes from the onset of overall memory contention.

So under this "equality of reclamation", in which situations would we genuinely choose to reclaim anonymous pages? Here, abstractly, are some not-so-rare scenarios:

  1. During initialisation, a long-running program may allocate and use many pages. The same pages may be used again during shutdown/cleanup, but are not needed once the program is "running" (in the application's own sense). This is fairly common for daemons with significant initialisation work.
  2. During a program's normal operation we may allocate memory that is then only rarely used. It may be better for overall system performance to use that memory for something more important and take a major page fault later, bringing the data for that page back from disk when it is actually needed.

What happens with and without swap


Let's look at some typical situations and how they play out with and without swap. I talk about metrics for "memory contention" in my talk on cgroup v2.

Under little or no memory contention


  • With swap: we can swap out anonymous memory that is rarely used and only needed during a small part of the process's life cycle, and use the freed memory to improve cache hit rates and enable other optimisations.
  • Without swap: rarely-used anonymous memory cannot be swapped out, because it has nowhere to go but RAM. That is not necessarily an immediate problem, but on some workloads performance degrades because stale anonymous pages take space away from more important uses.

Under moderate to high memory contention


  • With swap: all memory types have an equal chance of being reclaimed. This means a better chance of reclaiming pages successfully, i.e. reclaiming pages that will not be faulted straight back in (which would be thrashing).
  • Without swap: anonymous pages are pinned in memory, since they have nowhere else to go. The chance of successful long-term reclamation is lower, because reclaim is only available for certain memory types, and the risk of thrashing is higher. The casual reader might think this is still preferable, since it avoids disk I/O, but that is not so: we simply shift the disk I/O from swapping to dropping hot page cache and code segments that we are about to need again.

Under temporary spikes in memory usage


  • With swap: we are more resilient to temporary spikes, but in a severe memory shortage the period from the onset of thrashing until the OOM killer acts can be longer. We get more visibility into the causes of memory pressure, can react to them more sensibly, and can perform a controlled intervention.
  • Without swap: the OOM killer fires sooner, because anonymous pages are pinned in memory and cannot be reclaimed. Thrashing is more likely, but the time between thrashing and OOM is shorter. Whether this is better or worse depends on the application; for example, a queue-based application may actively want this quick transition from thrashing to OOM. Even so, it happens too late to be useful: the OOM killer is only invoked under the most extreme memory shortage. Rather than relying on this behaviour, it is better to handle things more opportunistically and kill processes yourself as soon as memory contention appears.

OK, so I want swap on my systems, but how do I tune it for specific applications?


You didn't think this article would get through without mentioning cgroup v2, did you?

Obviously a generic heuristic cannot get it right every time, so it is important to be able to give the kernel direction. Historically the only knob available, and only at the system level, was vm.swappiness, and it has two problems: vm.swappiness is extremely hard to apply sensibly, because it is only a small part of a much larger heuristic system, and it applies to the whole system rather than to a particular set of processes.

You can also use mlock to pin pages in memory, but that requires either changing the program's code, fun with LD_PRELOAD, or horrible dances with a debugger at run time. In languages that run on virtual machines it also works poorly, because you usually have no control over allocation and end up resorting to mlockall, which has no precision at all for the pages you actually care about.
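For context, here is a minimal sketch of what the mlock approach looks like (my own illustration; the buffer and its size are arbitrary). It pins a buffer's pages in RAM so they can never be swapped out, which is also why it is such a blunt instrument: it needs CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK, and it protects exactly the range you name, nothing more.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1 << 20;            /* 1 MiB of "important" state */
    char *buf = malloc(len);
    memset(buf, 0, len);

    /* Pin these pages in RAM so they are never swapped out.
     * Fails without CAP_IPC_LOCK or enough RLIMIT_MEMLOCK. */
    if (mlock(buf, len) != 0) {
        /* handle failure: insufficient privileges or limits */
    }

    /* ... work with buf, confident it stays resident ... */

    munlock(buf, len);
    free(buf);
    return 0;
}
```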

cgroup v2 has a per-cgroup setting, memory.low, which tells the kernel to prefer reclaiming from other applications until this cgroup's usage falls below a given threshold. It is not a guarantee that the kernel will never swap out parts of the application, but under memory contention it will prefer to reclaim from others. Under normal conditions the kernel's swap logic is generally good enough, and allowing it to opportunistically push pages to swap generally improves system performance. Thrashing on swap under heavy memory contention is not ideal, but it is more a property of the memory-shortage situation than a problem with swap itself; in situations where memory pressure keeps growing, you usually want non-critical processes to shut themselves down quickly instead.
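As a rough sketch of how that knob is set (assuming the cgroup2 hierarchy is mounted at /sys/fs/cgroup and that a cgroup named "myservice" already exists; both names are illustrative), protecting a service below roughly 512 MiB could look like this. In practice you would more often let your service manager write it for you, for example via a directive such as systemd's MemoryLow=.

```c
#include <stdio.h>

int main(void) {
    /* "myservice" is an illustrative cgroup name; cgroup2 is assumed
     * to be mounted at /sys/fs/cgroup. */
    const char *path = "/sys/fs/cgroup/myservice/memory.low";

    FILE *f = fopen(path, "w");
    if (!f) {
        perror("fopen");
        return 1;
    }

    /* Prefer reclaiming from other cgroups while this one uses less
     * than 512 MiB. Best-effort protection, not a hard guarantee. */
    fprintf(f, "%llu\n", 512ULL * 1024 * 1024);
    fclose(f);
    return 0;
}
```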

And you cannot simply rely on the OOM killer for this. The OOM killer is invoked only in the most extreme situations, when the system is already deeply unhealthy and may have been so for some time. You need to resolve the situation yourself, opportunistically, well before the OOM killer ever enters the picture.

Detecting memory pressure with traditional Linux memory counters, however, is quite difficult. We have things that are somehow related to the problem, but only tangentially: memory usage, page scan counts and so on, and from those metrics alone it is very hard to tell an efficient memory configuration from one that is sliding into memory contention. A group of us at Facebook, led by Johannes, is working on new metrics that make memory pressure easier to see; this should help in the future. You can learn more in my talk on cgroup v2, where I start to describe one of these metrics.

Tuning


How much swap do I need, then?


In general, the minimum amount of swap needed for optimal memory management depends on how many anonymous pages an application pins in memory that it rarely accesses again, and on the value of reclaiming those anonymous pages. The latter is mostly a question of which pages will no longer be evicted in order to make room for those rarely-accessed anonymous pages.

If you have plenty of disk space and a reasonably recent kernel (4.0+), more swap is almost always better than less. On older kernels kswapd, one of the kernel processes responsible for managing swap, was historically over-eager about pushing memory to swap, and the more swap you had, the more eagerly it did so. Recently the swapping behaviour with large swap spaces has improved considerably, so on a 4.0+ kernel a large swap will not lead to excessive swapping. Broadly speaking, on modern kernels it is fine to have several gigabytes of swap if you have the space.

If disk space is tighter, the answer depends on the trade-offs you are willing to make and on the environment. Ideally you should have enough swap for the system to behave optimally under both normal and peak (memory) load. I recommend setting up a few test systems with 2-3 GB of swap or more and watching what happens over a week or so under different (memory) load conditions. If during that week there was never a severe memory shortage (which would make the test not very useful anyway), you will probably end up with only a modest number of megabytes of swap occupied, and it would then be sensible to provision at least that much swap, plus a small buffer for changes in load. atop in logging mode can also show, in its SWAPSZ column, which applications have pages in swap, so if you are not already running it on your servers to record history, it may be worth adding it to the test machines as part of this experiment (in logging mode). It will also tell you when an application started swapping pages out, which you can then correlate with events in your logs or other key metrics.
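If you want to spot-check a single process directly rather than go through atop, the VmSwap field in /proc/&lt;pid&gt;/status reports how much of that process's memory currently sits in swap. The sketch below (my own illustration, not part of the original article) just prints that field for a given PID, or for itself by default.

```c
#include <stdio.h>
#include <string.h>

/* Print the VmSwap line from /proc/<pid>/status, i.e. how much of the
 * process's memory currently resides in swap. */
int main(int argc, char **argv) {
    const char *pid = argc > 1 ? argv[1] : "self";
    char path[64], line[256];

    snprintf(path, sizeof(path), "/proc/%s/status", pid);
    FILE *f = fopen(path, "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmSwap:", 7) == 0)
            fputs(line, stdout);      /* e.g. "VmSwap:     1234 kB" */
    }
    fclose(f);
    return 0;
}
```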

Another thing worth thinking about is the kind of storage that backs the swap. Reads from swap tend to be very random, because you cannot reliably predict which pages will be faulted in, or when. On an SSD this hardly matters, but on a spinning disk random I/O is very expensive, since it requires physically moving the head. File page faults, on the other hand, are usually less random, because the files belonging to a single running application tend to be less fragmented. This may mean that on a spinning disk you want to bias towards reclaiming file pages rather than swapping out anonymous pages, but, again, you need to test and measure how the balance works out for your workload.

Laptop and desktop users who want to use swap for hibernation also need to take this into account, since in that case the swap file must be at least the size of physical RAM.

What should I set swappiness to?


First, it is important to understand what vm.swappiness actually does. vm.swappiness is a sysctl that biases reclaim towards anonymous pages or towards file pages. It works through two attributes: file_prio (our willingness to reclaim file pages) and anon_prio (our willingness to reclaim anonymous pages). vm.swappiness plays into these by becoming the value of anon_prio, while file_prio is the default value of 200 minus vm.swappiness; so vm.swappiness = 50 means anon_prio is 50 and file_prio is 150 (the exact numbers matter less than their weight relative to each other).

This means that vm.swappiness is essentially a measure of how expensive it is, on your hardware and workload, to reclaim anonymous memory and fault it back in compared with file memory. The lower the value, the more you are telling the kernel that rarely-accessed anonymous pages are costly to move in and out of swap on your hardware; the higher the value, the more you are telling the kernel that swapping anonymous and file pages costs about the same. The memory management subsystem will still decide between file and anonymous pages mostly based on how hot the memory is, but swappiness tips the balance towards more swapping or towards dropping more filesystem cache when both options are available. On an SSD these costs are nearly equal, so setting vm.swappiness = 100 (full parity) may work well. On spinning disks swapping can be considerably more expensive, since it generally implies random reads, so you will most likely want to bias towards a lower value.
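As a small illustration of the split described above (the 200-point budget reflects how vmscan treats these priorities on recent kernels, but treat the snippet as a sketch of the ratio, not as a stable kernel interface), this reads the current sysctl and prints the implied weights:

```c
#include <stdio.h>

int main(void) {
    int swappiness = 60;  /* kernel default, used if the sysctl cannot be read */

    FILE *f = fopen("/proc/sys/vm/swappiness", "r");
    if (f) {
        if (fscanf(f, "%d", &swappiness) != 1)
            swappiness = 60;
        fclose(f);
    }

    /* vm.swappiness becomes anon_prio; file_prio gets the rest of the 200. */
    int anon_prio = swappiness;
    int file_prio = 200 - swappiness;

    printf("vm.swappiness = %d -> anon_prio = %d, file_prio = %d\n",
           swappiness, anon_prio, file_prio);
    return 0;
}
```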

The reality is that most people have no real idea how these costs compare on their hardware, so tuning this value on instinct alone is difficult: it is something that needs testing with different values for your workload. You can also analyse the memory composition of your system and of your main applications, and how they behave when a little of their memory is reclaimed.

Speaking of vm.swappiness, there is an extremely important recent change to consider, made by Satoru Moriya to vmscan in 2012, which significantly alters the behaviour of vm.swappiness = 0.

The patch essentially means that with vm.swappiness = 0 we refuse to scan (and therefore reclaim) any anonymous pages at all until the system is under heavy memory contention. As discussed earlier, that is generally not what you want, because it postpones equality of reclamation until the moment of extreme memory pressure, which can itself actually help bring that extreme pressure about. vm.swappiness = 1 is therefore the lowest value you should choose if you want to avoid the special anonymous-page scanning behaviour introduced by that patch.

The kernel default is vm.swappiness = 60. That is a reasonable value for most workloads, but it is hard to have one default that suits everyone. A valuable addition to the tuning described in "How much swap do I need, then?" is therefore to test systems with different vm.swappiness values and watch your application and system metrics under heavy (memory) load. In the near future, once the kernel has a decent implementation of refault detection (see also "refault distance-based file cache sizing" - transl.), you should be able to pick this value fairly independently of the workload by looking at page refault metrics in cgroup v2.

Conclusion


  • Swap is a useful tool that makes page reclamation more equal, but its purpose is frequently misunderstood, which has given it a bad reputation in the industry. If you use swap for what it was designed for, namely as a way of making reclamation more equal, you will find it a useful facility rather than some kind of problem.
  • Disabling swap does not prevent disk I/O under memory contention; it merely shifts that disk I/O from anonymous pages to file pages. Not only can this be less efficient, because there is a smaller pool of pages left to reclaim from, it can itself contribute to that contention.
  • Swap can delay the system's call to the OOM killer, since it provides one more, slower source of memory to thrash through in out-of-memory situations; the OOM killer is only meant as the kernel's last resort, once every other option has been exhausted. What those options are depends on the specific system:
    • You can adjust the system's workload to your needs based on local (cgroup) or global memory pressure. This helps avoid such situations in the first place, although throughout Unix history there have never been adequate metrics for measuring memory pressure; the hope is that this will soon be fixed with the arrival of refault detection.
    • You can bias reclamation (and therefore swapping) away from particular processes per-cgroup using memory.low, keeping swap available for your critical daemons without disabling it entirely.
