How Linux works with memory. Yandex Workshop
Hi. My name is Vyacheslav Biryukov, and I lead the search operations team at Yandex. I recently gave a lecture on working with memory in Linux to students of the Yandex Information Technology Courses. Why memory? The main answer is that I like working with it. Besides, there is not much information about it, and what exists is usually out of date, because this part of the Linux kernel changes quickly and the changes do not make it into books in time. I will talk about the x86_64 architecture and Linux kernel version 2.6.32, with kernel 3.x mentioned in places.
This lecture will be useful not only for system administrators but also for developers of high-load systems. It will help them understand exactly how a program interacts with the operating system kernel.
Terms
Resident memory is the amount of memory that currently resides in the physical RAM of the server, computer or laptop.
Anonymous memory is memory that is neither file cache nor backed by a file on disk.
Page fault is a trap raised on a memory access. It is a regular mechanism of working with virtual memory.
Presentation slides: http://www.slideshare.net/yandex/linux-44775898
Pages of memory
Memory is managed in pages. There is usually a lot of memory and every byte is addressable, but it is inconvenient for the operating system and the hardware to work with each address separately, so all memory is divided into pages. The page size is 4 KB. There are also pages of other sizes: the so-called Huge Pages of 2 MB and pages of 1 GB (we will not talk about them today).
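To make the page arithmetic concrete, here is a minimal Python sketch. The helper name pages_needed is hypothetical and not from the lecture; mmap.PAGESIZE is the page size reported by the system the script runs on, which is typically 4096 bytes on x86_64.

```python
import mmap


def pages_needed(size_bytes, page_size=mmap.PAGESIZE):
    """Return how many whole pages are needed to hold size_bytes."""
    # Round up: even a single extra byte occupies a full page.
    return (size_bytes + page_size - 1) // page_size


if __name__ == "__main__":
    print("page size on this machine:", mmap.PAGESIZE)
    print("pages needed for 1 MiB:", pages_needed(1024 * 1024))
```

The point of rounding up is that memory is accounted for in whole pages, so a 4097-byte request costs two 4 KB pages.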
Virtual memory is the address space of a process. A process works not with physical memory directly but with virtual memory. This abstraction makes application code easier to write: you do not have to worry about accidentally accessing the wrong addresses or the addresses of another process. It also lets you use more memory than the size of the main RAM, thanks to the mechanisms described below. Virtual memory consists of main memory and a swap device, so its total size can, in principle, be unlimited.
Virtual memory is governed by the overcommit parameter (the vm.overcommit_memory sysctl), which keeps us from allocating more memory than we should. It can take three values:
- 0 is the default. The kernel uses a heuristic that prevents a process from allocating far more virtual memory than the system has;
- 1 means the amount of allocated memory is not tracked at all. This is useful, for example, for computational programs that allocate large arrays and work with them in a special way;
- 2 strictly limits the amount of virtual memory that can be allocated.
You can see how much memory has been committed and how much can be allocated in total in the Committed_AS and CommitLimit lines of the file /proc/meminfo.
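As an illustration, the two counters can be read with a few lines of Python. This is only a sketch: the function names parse_meminfo and commit_headroom_kb are made up for this example, and the parser assumes the usual "Name: value kB" layout of /proc/meminfo.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def commit_headroom_kb(info):
    """CommitLimit minus Committed_AS; a negative result means overcommit."""
    return info["CommitLimit"] - info["Committed_AS"]
```

On a Linux machine it could be used as commit_headroom_kb(parse_meminfo(open("/proc/meminfo").read())). Keep in mind that the kernel enforces CommitLimit only in the strict mode (value 2), so in the other modes a negative headroom is not an error.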
Memory Zones and NUMA
In modern systems, all memory is divided into NUMA nodes. We used to have computers with one processor and one memory bank. This architecture was called UMA (SMP). Everything was simple: one system bus connected all the components. Later this became inconvenient and began to hold back the development of the architecture, so NUMA was invented.
As the slide shows, there are two processors that communicate with each other over some channel, and each has its own bus to its own memory banks. In the picture, the latency from CPU 1 to RAM 1 within a NUMA node is half the latency from CPU 1 to RAM 2. This and other information is available from the numactl --hardware command. We see that the server has two nodes, along with information about them (how much physical memory is free on each node). Memory is allocated on each node separately, so you can use up all the free memory on one node while leaving the other underloaded. To prevent this (it is typical of databases), you can start the process with numactl --interleave=all, which spreads memory allocations evenly across the nodes. Otherwise, the kernel picks the node on which the process was scheduled to run (CPU scheduling) and always tries to allocate memory there.
Memory is also divided into Memory Zones. Each NUMA node is split into several such zones. They exist to support hardware that cannot address the entire range of memory. For example, ZONE_DMA covers the first 16 MB of addresses and ZONE_DMA32 the first 4 GB. The zones and their state can be inspected in the file /proc/zoneinfo.
Page Cache
By default, all read and write operations in Linux go through Page Cache. Its size is dynamic, which means it is Page Cache that will eat all your memory if the memory is free. As the old joke goes, if you need free memory in a server, just pull it out of the server. Page Cache splits every file we read into pages (a page, as we said, is 4 KB). You can check whether pages of a particular file are in Page Cache with the mincore() system call, or with the vmtouch utility, which is built on that system call.
How does writing work? A write does not go to disk immediately; it goes to Page Cache, and that happens almost instantly. This produces an interesting "anomaly": writing to disk appears much faster than reading. When reading (if the page of the file is not in Page Cache), we go to the disk and wait for the response synchronously, whereas a write completes as soon as it lands in the cache.
The downside of this behavior is that the data has not actually been written anywhere: it is only in memory and will have to be flushed to disk at some point. On write, each page is marked with a flag (it is called dirty). Such dirty pages accumulate in Page Cache. When there are many of them, the system decides it is time to flush them to disk, because otherwise they can be lost (if the power suddenly goes out, our data is gone too).
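If a program cannot afford to lose data that is still sitting in dirty pages, it can ask the kernel to flush them explicitly. The following Python sketch shows one way to do that; write_durably is a hypothetical helper name, and the example assumes a POSIX system where os.fsync() is available.

```python
import os


def write_durably(path, data):
    """Write data to path and return only after fsync() has completed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:
            # os.write() returns once the bytes are in Page Cache;
            # the affected pages are now dirty.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to write this file's dirty pages to the device now.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Without the os.fsync() call the function would usually return faster, which is exactly the "anomaly" described above: the write is only as fast as copying into memory.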
Process memory
A process consists of the following segments. There is a stack that grows downward; it has a limit beyond which it cannot grow.
Then comes the mmap region, which holds all the memory-mapped files of the process that we opened or created with the mmap() system call. Next comes a large unallocated area of virtual address space that we can use. The heap, an area of anonymous memory, grows from the bottom up. Below it are the segments of the binary we are running.
Within a process, working with individual pages is also inconvenient: memory is usually allocated in blocks. It is rarely necessary to allocate just one or two pages; usually you need a whole range of pages at once. That is why Linux has the concept of a virtual memory area (VMA), which describes a range of addresses inside the virtual address space of the process. Each VMA has its own permissions (read, write, execute) and scope: it can be private or shared (that is, shared with other processes in the system).
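The VMAs of a process can be inspected through /proc/<pid>/maps, where each line describes one area. Below is a minimal Python sketch of a parser for that format. The names Vma and parse_maps are invented for this example, and the parser assumes the usual "start-end perms offset dev inode [path]" layout.

```python
from typing import NamedTuple


class Vma(NamedTuple):
    start: int
    end: int
    perms: str  # e.g. "rw-p": read, write, no execute, private
    path: str   # "" for anonymous mappings

    @property
    def shared(self):
        # The fourth permission character is "s" for shared, "p" for private.
        return self.perms.endswith("s")


def parse_maps(text):
    """Parse /proc/<pid>/maps-style text into a list of Vma records."""
    vmas = []
    for line in text.splitlines():
        parts = line.split(None, 5)
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        start, end = (int(x, 16) for x in parts[0].split("-"))
        path = parts[5].strip() if len(parts) > 5 else ""
        vmas.append(Vma(start, end, parts[1], path))
    return vmas
```

On Linux, parse_maps(open("/proc/self/maps").read()) lists the areas of the running interpreter, including entries such as [heap] and [stack].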
Memory allocation
Memory allocation falls into four cases: memory can be private or shared with someone, and it can be anonymous or backed by a file on disk. The most common memory allocation functions are malloc and free. glibc malloc() allocates anonymous memory in an interesting way: it uses the heap for small allocations (less than 128 KB) and mmap() for large ones. This split makes memory use more efficient and makes it easy to return memory to the system. If the heap does not have enough room for an allocation, the brk() system call is made to extend the heap boundary. The mmap() system call maps the contents of a file into the address space, and munmap() in turn removes the mapping. mmap() takes flags that control the visibility of changes and the access level.
In fact, Linux does not allocate all the requested memory at once. The allocation process, Demand Paging, begins when we request a page of memory from the kernel and the page enters the Only Allocated state. The kernel tells the process: here is your page, you can use it. Nothing else happens, and no physical memory is allocated. That happens only when we try to write to the page. At that moment the access goes to the Page Table, the structure that translates the virtual addresses of the process into physical RAM addresses. Two hardware units are also involved, the MMU and the TLB, as the figure shows. They perform the translation of virtual addresses into physical ones and speed it up.
Once it turns out that nothing in the Page Table corresponds to this page, that is, it has no link to physical memory, we get a Page Fault, in this case a minor one, because no disk access is needed. After that the system can write to the allocated page. All of this is transparent to the process, and we can observe the minor Page Fault counter of the process increase by one. There is also the major Page Fault, which occurs when the disk has to be accessed for the contents of the page (as with mmap()). One of the tricks of Linux memory management, Copy On Write, makes it possible to create processes (fork) very quickly.
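Demand Paging can be observed from user space. The Python sketch below maps anonymous memory, writes one byte to each page and reports how much the minor Page Fault counter of the process grew. The function name minor_faults_when_touching is hypothetical. The exact number is not predictable: it depends on whether transparent huge pages are in use and on the interpreter's own allocations, so treat the result as an indication rather than a measurement.

```python
import mmap
import resource


def minor_faults_when_touching(size_bytes):
    """Map anonymous memory, write to every page, return the minor-fault delta."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    # Private anonymous mapping: address space is reserved,
    # but no physical pages are assigned yet.
    region = mmap.mmap(-1, size_bytes,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        for offset in range(0, size_bytes, mmap.PAGESIZE):
            # The first write to a page is what makes the kernel back it
            # with physical memory.
            region[offset] = 1
    finally:
        region.close()
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    return after - before


if __name__ == "__main__":
    print("minor faults for 64 MiB:", minor_faults_when_touching(64 * 1024 * 1024))
```

With 4 KB pages one would expect roughly one fault per touched page; with huge pages the number is much smaller.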
Work with files and memory
The memory subsystem and the file subsystem are closely related. Since working with a disk directly is very slow, the kernel uses RAM as an intermediate layer. Working with a file through a buffer allocated with malloc() uses more memory, because the data is copied into user space. It also consumes more CPU and causes more context switches than working with the file through mmap().
What conclusions can be drawn? We can work with files as if they were memory. We get lazy loading: we can map a very large file, and it will be loaded into the process memory through Page Cache only as needed. Everything is also faster because we make fewer system calls, and in the end it saves memory. It is also worth noting that when the program exits, the memory does not go anywhere: it stays in Page Cache.
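Working with a file as with memory can be sketched in Python using the standard mmap module. The helper name read_slice_via_mmap is made up for this example; it assumes a non-empty regular file, because mapping an empty file with length 0 raises an error.

```python
import mmap


def read_slice_via_mmap(path, offset, length):
    """Return length bytes at offset; the kernel pages the file in on demand."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file. Nothing is read from disk yet.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Only the pages covering this slice have to be in Page Cache.
            return mapped[offset:offset + length]
```

Even for a very large file, a call such as read_slice_via_mmap(path, 10, 5) touches only the page that holds those five bytes, which is the lazy loading described above.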
At the beginning I said that all reads and writes go through Page Cache, but sometimes there is a reason to move away from this behavior. Some software works this way, for example MySQL with InnoDB.
You can tell the kernel that we will not be working with a file in the near future, and force its pages out of Page Cache, using special system calls:
- posix_fadvise();
- madvise();
- mincore().
The vmtouch utility can also evict a file's pages from Page Cache with the -e option.
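As an illustration of the first call in the list, Python exposes posix_fadvise() through the os module on Unix. In this sketch the function name drop_file_from_page_cache is hypothetical. The call is only a hint: the kernel may keep some pages, and pages that are still dirty are not dropped until they have been written back.

```python
import os


def drop_file_from_page_cache(path):
    """Hint to the kernel that cached pages of this file are no longer needed."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and length=0 together mean "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

The file itself is untouched; only its cached copy in memory is affected, so the next read may have to go to disk again.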
Readahead
Let's talk about Readahead. If files are read from disk through Page Cache one page at a time, we get a lot of Page Faults and go to the disk for data often. That is why the size of Readahead can be controlled: if we have read the first and second pages, the kernel assumes we will most likely need the third. And since going to disk is expensive, it can read a bit more ahead of time, loading the file into Page Cache in advance and serving later reads from there. This replaces future expensive major Page Faults with minor ones.
So we have handed memory out to everyone, all the processes are happy, and suddenly memory runs out. Now we need to free some of it. The process by which the kernel finds and frees memory is called Page Reclaiming. Some pages in memory cannot be taken away: locked pages. Besides them there are four more categories of pages: Kernel pages, which should not be unloaded because that would slow down the whole system; Swappable pages, pages of anonymous memory that can be unloaded only to a swap device; Syncable pages, which can be synchronized with the disk and, if the file is open read-only, simply dropped from memory; and Discardable pages, which can simply be discarded.
Sources for replenishing the Free List
Put simply, the kernel has one large Free List (in reality it is not that simple) that holds the memory pages that can be handed out to processes. The kernel tries to keep this list non-empty so that it can allocate memory to processes quickly. The list is replenished from four sources: Page Cache, Swap, Kernel Memory and the OOM Killer.
We have to distinguish between hot and cold areas of memory and replenish our Free Lists at the expense of the cold ones. Page Cache is built on the LRU/2Q principle. There is an active list of pages (Active List) and an inactive list (Inactive List), and pages move between them. Memory allocation requests are served from the Free List: the system hands out pages from its head, and pages from the tail of the inactive list are added to its tail. When we read a file through Page Cache, new pages always enter at the head of the inactive list and drift toward its tail unless they are accessed at least once more. If such an access happens while the page is anywhere in the inactive list, the page goes straight to the head of the active list and starts moving toward its tail. If it is accessed again at that point, it jumps back to the head of the list. This is how the system balances the lists: the hottest data always stays in Page Cache in the active list, and the Free List is never replenished at its expense.
One more interesting behavior is worth noting: pages that have moved from the inactive list to the Free List but have not yet been handed out for allocation can be returned to the inactive list (in this case, to its head).
In total we get five such lists: Active Anon, Inactive Anon, Active File, Inactive File and Unevictable. These lists are created for each NUMA node and for each Memory Zone.
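The interplay of the active and inactive lists is easier to follow in a toy model. The Python sketch below is a deliberate simplification and not the kernel's algorithm: the class name TwoListCache and the fixed list sizes are assumptions for illustration, as is the rule that a page pushed off the tail of the active list is demoted to the head of the inactive list.

```python
from collections import OrderedDict


class TwoListCache:
    """Toy model of active/inactive page lists (not the real kernel logic)."""

    def __init__(self, inactive_size, active_size):
        self.inactive_size = inactive_size
        self.active_size = active_size
        # In both dicts the last item is the head (most recently added).
        self.inactive = OrderedDict()
        self.active = OrderedDict()
        self.freed = []  # pages released toward the free list, oldest first

    def access(self, page):
        if page in self.active:
            # Accessed again while active: back to the head of the active list.
            self.active.move_to_end(page)
        elif page in self.inactive:
            # Second access: promote to the head of the active list.
            del self.inactive[page]
            self.active[page] = True
            if len(self.active) > self.active_size:
                demoted, _ = self.active.popitem(last=False)  # tail of active
                self._add_inactive(demoted)
        else:
            # First access: the page enters at the head of the inactive list.
            self._add_inactive(page)

    def _add_inactive(self, page):
        self.inactive[page] = True
        while len(self.inactive) > self.inactive_size:
            victim, _ = self.inactive.popitem(last=False)  # tail of inactive
            self.freed.append(victim)
```

For example, after the accesses a, b, a, c, d, e with both sizes set to 2, page a sits in the active list because it was touched twice, while b and c, which were read only once, have already been released.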
A few words about cgroups
With cgroups we can limit a group of processes by various parameters. Here we are interested in memory: we can limit memory alone, without swap, or memory plus swap. Each group can have its own Out Of Memory Killer. cgroups also make it easy to get memory usage statistics for a process or a group of processes, broken down into anonymous and non-anonymous memory, Page Cache usage and more (/sys/fs/cgroup/memory/memory.stat). When cgroups with a memory limit are used, Page Reclaiming can be of two types:
- Global Reclaiming, when memory is sought for the whole system and the system Free Lists are replenished;
- Target Reclaiming, when memory is freed in one particular cgroup because it has run short of memory.
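The statistics file mentioned above is plain text with one "name value" pair per line, so it is easy to read from a script. A minimal Python sketch follows; the function name parse_memory_stat is hypothetical, and the field names in the usage note (cache, rss) are those of the cgroup v1 memory controller, which matches the path given above.

```python
def parse_memory_stat(text):
    """Parse memory.stat-style text ("name value" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats
```

With cgroup v1, stats["cache"] is the Page Cache usage of the group in bytes and stats["rss"] covers its anonymous memory, so comparing the two shows where the group's limit is being spent.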
Books
For those who want to dig deeper into how Linux memory management is designed and works, I recommend reading:
- Systems Performance: Enterprise and the Cloud;
- Linux System Programming: Talking Directly to the Kernel and C Library;
- Linux Kernel Development (3rd Edition).