On the proper use of memory in NUMA systems running Linux

  • Tutorial
Recently, an article about NUMA systems appeared on our blog, and I would like to continue the topic by sharing my experience with Linux. Today I will talk about what happens when memory is used incorrectly on a NUMA system and how to diagnose such a problem using performance counters.

So, let's start with a simple example:
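The original listing is not preserved in this copy, so below is a minimal sketch of what such a test looks like (built, say, with icc -openmp or gcc -fopenmp); the array size and the repetition count are my assumptions, not the original values:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const long n = 64L * 1024 * 1024;      /* 64M doubles, about 512 MB (assumed size) */
    const int repeats = 100;               /* assumed repetition count, to make the timing measurable */
    double *a = malloc(n * sizeof(double));
    double sum = 0.0;

    /* serial initialization: every page is first touched by the master thread */
    for (long i = 0; i < n; i++)
        a[i] = 1.0;

    double t = omp_get_wtime();
    for (int r = 0; r < repeats; r++) {
        /* the measured part: sum the array in parallel */
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < n; i++)
            sum += a[i];
    }
    t = omp_get_wtime() - t;

    printf("sum = %.1f, time = %.3f s\n", sum, t);
    free(a);
    return 0;
}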



This is a simple test that sums the elements of an array in a loop. We run it in several threads on a dual-socket server with a quad-core processor in each socket. Below is a graph of the program's execution time depending on the number of threads:



We see that the execution time on eight threads is only 1.16 times shorter than on four, although the gain when going from two to four threads is noticeably higher. Now let's make a simple code transformation: we will add a parallelization directive before the array initialization:
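The original listing is missing here as well; the change amounts to a single OpenMP directive in front of the initialization loop of the sketch above:

    /* parallel initialization: each thread first-touches, and therefore places
       on its own node, the part of the array it will later sum; this works
       because both loops use the default static schedule, so the index ranges match */
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        a[i] = 1.0;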



And we will collect the run times again:



And now, on eight threads, performance has almost doubled. Thus, our application scales almost linearly over the entire range of thread counts.
So what happened? How did simply parallelizing the initialization loop lead to an almost twofold speedup? Consider a dual-processor server with NUMA support:



Each quad-core processor is assigned a certain amount of physical memory, which it accesses through an integrated memory controller and data bus. Such a processor + memory pair is called a node. In NUMA (Non-Uniform Memory Access) systems, access to the memory of another node takes much longer than access to the memory of the local node. When an application first accesses memory, virtual pages are mapped to physical ones. But on NUMA systems running Linux this process has a peculiarity: the physical pages backing the virtual ones are allocated on the node from which the first access occurred. This is the so-called first-touch policy. That is, if some memory is first accessed from the first node, its virtual pages will be mapped to physical pages that are also allocated on the first node. This is why it is important to initialize data correctly: application performance depends on which nodes the data ends up attached to. Returning to the first example, the entire array was initialized from one node, which pinned all the data to the first node; half of that array was then read by the other node, and that is what led to the poor performance.
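If you want to see the first-touch policy with your own eyes, one way (my illustration, not part of the original article) is to ask the kernel where a page actually ended up via the move_pages system call from the numactl/libnuma package; called with a NULL node list, it only queries placement:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <numaif.h>          /* move_pages(); link with -lnuma */

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char *p = aligned_alloc(page, page);

    p[0] = 1;                /* the first touch: the page is allocated on the current node */

    void *pages[1] = { p };
    int status[1];
    /* with nodes == NULL, move_pages() moves nothing and only reports in
       status[] the node on which each page currently resides */
    if (move_pages(0, 1, pages, NULL, status, 0) == 0)
        printf("page resides on node %d\n", status[0]);

    free(p);
    return 0;
}

Pin the program to cores of different nodes (for example with numactl --physcpubind) and the reported node will follow the core that performed the first touch.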

The attentive reader has probably already asked: "Isn't allocating memory through malloc the first access?". In this particular case it is not. Here is the thing: when allocating large blocks of memory on Linux, the glibc malloc function (as well as calloc and realloc) by default calls the mmap kernel service. mmap only records how much memory has been requested; the physical allocation happens only on the first access to the pages. This mechanism is implemented through page-fault exceptions and copy-on-write, as well as through mapping to the "zero" page. Those interested in the details can read the book Understanding the Linux Kernel. In general, a situation is possible where the glibc calloc function performs the first access itself in order to zero the memory. But this only happens if calloc decides to hand previously freed heap memory back to the user, since such memory is already backed by physical pages. Therefore, to avoid unnecessary guesswork, it is recommended to use so-called NUMA-aware memory managers (for example, TCMalloc), but that is another topic.
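As an aside (my addition, not from the original text): besides NUMA-aware allocators such as TCMalloc, libnuma itself lets you bypass first touch and bind an allocation to a chosen node explicitly:

#include <stdio.h>
#include <numa.h>            /* numa_available(), numa_alloc_onnode(), numa_free(); link with -lnuma */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    size_t size = 1 << 20;
    /* the pages of this buffer are placed on node 1 (memory permitting),
       no matter which thread touches them first */
    double *buf = numa_alloc_onnode(size, 1);
    buf[0] = 42.0;
    numa_free(buf, size);
    return 0;
}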

Now let's answer the main question of this article: "How do I know whether an application works with memory correctly on a NUMA system?". This question is always the first and foremost for us when adapting applications to servers with NUMA support, regardless of the operating system.

To answer it, we need VTune Amplifier, which can collect events for two performance counters: OFFCORE_RESPONSE_0.ANY_REQUEST.LOCAL_DRAM and OFFCORE_RESPONSE_0.ANY_REQUEST.REMOTE_DRAM. The first counts all requests for which the data was found in the RAM of the local node, the second those satisfied from the memory of another node. Just in case, you can also collect the cache counters OFFCORE_RESPONSE_0.ANY_REQUEST.LOCAL_CACHE and OFFCORE_RESPONSE_0.ANY_REQUEST.REMOTE_CACHE: it may turn out that the data is not in memory at all but in a processor cache on a remote node.
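As a side note (an assumption of mine, not from the original article): if VTune is not at hand, the same Nehalem events can in principle be programmed through perf using the raw offcore-response encoding (event 0xB7, umask 0x01, with the request/response mask passed in the offcore_rsp field), provided your kernel exposes the offcore_rsp format for the cpu PMU. The masks below follow the encoding discussed at the end of this article (0x20FF for ANY_REQUEST.REMOTE_DRAM; 0x40FF assumes LOCAL_DRAM is bit 14 of the response, per the same Intel table):

perf stat -a -A \
  -e cpu/event=0xB7,umask=0x01,offcore_rsp=0x40FF,name=local_dram/ \
  -e cpu/event=0xB7,umask=0x01,offcore_rsp=0x20FF,name=remote_dram/ \
  ./a.out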

So let's run our application on eight threads, without the parallelized initialization, under VTune and count the number of events for the counters above:



We see that the thread running on cpu 0 worked mostly with its own node, although from time to time the vmlinux module on this core for some reason reached into the remote node. The thread on cpu 1 did the opposite: only 0.13% of all its requests found data on its own node. Here I should explain how cores are assigned to nodes. Cores 0, 2, 4, 6 belong to the first node, and cores 1, 3, 5, 7 to the second. The topology can be obtained with the numactl utility:

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6
node 0 size: 12277 MB
node 0 free: 10853 MB
node 1 cpus: 1 3 5 7
node 1 size: 12287 MB
node 1 free: 11386 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

Note that these are logical numbers: physically, cores 0, 2, 4, 6 belong to one quad-core processor, and cores 1, 3, 5, 7 to the other.
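The same mapping can be cross-checked directly in sysfs (standard Linux paths; my addition):

cat /sys/devices/system/cpu/cpu*/topology/physical_package_id

Each line is the package (socket) number of the corresponding logical CPU; on this machine the even-numbered CPUs should report one package and the odd-numbered CPUs the other.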

Now let's look at the counter values for the example with parallelized initialization:



The picture is almost perfect: we see that all the cores work mostly with their own nodes. Accesses to the remote node make up no more than half a percent of all requests, with the exception of cpu 6, which sends about 4.5% of its requests to the remote node. Since accessing the remote node takes roughly twice as long as accessing the local one, 4.5% of such requests do not noticeably degrade performance. So we can say that the application now works with memory correctly.

Thus, with these counters you can always determine whether an application can be sped up on a NUMA system. In practice I have seen cases where correct data initialization made an application twice as fast, and in some applications I had to parallelize all the loops, slightly degrading performance on a regular SMP system.

For those interested in where the 4.5% comes from, let's dig further. The Nehalem processor and its descendants have a rich set of counters for analyzing the activity of the memory subsystem. All of these counters begin with the name OFFCORE_RESPONSE. It may even seem that there are too many of them, but on closer inspection they are all combinations of composite requests and responses. Each composite request or response consists of basic requests and responses specified by a bit mask.

Below are the bit-mask values for the composite requests and responses:



This is how the counter OFFCORE_RESPONSE_0 is formed in the Nehalem processor:



Let's take, for example, our counter OFFCORE_RESPONSE_0.ANY_REQUEST.REMOTE_DRAM. It consists of the composite request ANY_REQUEST and the composite response REMOTE_DRAM. The ANY_REQUEST request has the value xxFF, which means tracking all events: from demand data reads (bit 0, Demand Data Rd in the table) to instruction-cache prefetches (bit 6, PF Ifetch) and the remaining odds and ends (bit 7, OTHER). The REMOTE_DRAM response is 20xx, which means tracking only the requests for which the data was found in the memory of the remote node (bit 13, L3_MISS_REMOTE_DRAM). Full information on these counters can be found in the "Intel 64 and IA-32 Architectures Optimization Reference Manual" on intel.com, section "B.2.3.5 Measuring Core Memory Access Latency".
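To make the encoding concrete, here is the same arithmetic spelled out as C constants. The bit positions for the request byte follow the order of the eight basic requests named in the next paragraph and in the Intel table; treat the exact values as coming from the manual rather than from me:

/* request byte: bits 0-7 of the offcore-response mask */
#define DEMAND_DATA_RD       0x0001   /* bit 0 */
#define DEMAND_RFO           0x0002   /* bit 1 */
#define DEMAND_IFETCH        0x0004   /* bit 2 */
#define COREWB               0x0008   /* bit 3 */
#define PF_DATA_RD           0x0010   /* bit 4 */
#define PF_RFO               0x0020   /* bit 5 */
#define PF_IFETCH            0x0040   /* bit 6 */
#define OTHER                0x0080   /* bit 7 */
#define ANY_REQUEST          0x00FF   /* all eight requests -> the "xxFF" above */

/* response byte: bits 8-15 of the mask */
#define L3_MISS_REMOTE_DRAM  0x2000   /* bit 13 -> the "20xx" above */

/* OFFCORE_RESPONSE_0.ANY_REQUEST.REMOTE_DRAM */
#define ANY_REQUEST_REMOTE_DRAM  (ANY_REQUEST | L3_MISS_REMOTE_DRAM)   /* = 0x20FF */

/* the PREFETCH composite request used in the next step */
#define PREFETCH             (PF_DATA_RD | PF_RFO | PF_IFETCH)         /* = 0x0070 */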

To understand exactly who is sending requests to the remote node, ANY_REQUEST has to be decomposed into the composite requests DEMAND_DATA_RD, DEMAND_RFO, DEMAND_IFETCH, COREWB, PF_DATA_RD, PF_RFO, PF_IFETCH, OTHER, and events collected for each of them separately. This is how the culprit was found:

OFFCORE_RESPONSE_0.PREFETCH.REMOTE_DRAM
cpu 0: 6405
cpu 1: 597190
cpu 2: 2503
cpu 3: 229271
cpu 4: 2035
cpu 5: 190549
cpu 6: 19364266
cpu 7: 2280

But why did the prefetcher on core 6 reach into the remote node while the prefetchers of the other cores worked with their own nodes? The point is that before running the example with parallel initialization I had additionally set up a tight binding of threads to cores, as follows:

export KMP_AFFINITY=granularity=fine,proclist=[0,2,4,6,1,3,5,7],explicit,verbose
./a.out

OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (8 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 1 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 1 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 1 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 1 core 3
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {2}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {4}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {6}
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {1}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {3}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {5}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {7}

With this binding, the first four threads run on the first node and the other four on the second. It follows that core 6 is the last core belonging to the first node (0,2,4,6). A prefetcher always tries to fetch memory well ahead of (or behind, depending on the direction in which the program walks through memory) the address currently being accessed. In our case, the prefetcher of the sixth core was fetching memory that lay ahead of the chunk Internal thread 3 was working on at that moment. That memory partly belonged to the first core of the remote node (1,3,5,7), which is where the remote accesses came from, and that is what produced the 4.5% of requests to the remote node.

Note: the test program was compiled with the Intel compiler using the -no-vec option to obtain scalar rather than vector code. This was done to get "clean" data that makes the theory easier to follow.
