
NUMA, and what does vSphere know about it?
I think many people have already managed to drop by and read this article in English on my blog, but for those who are still more comfortable reading in their native language than in a foreign one (as they would say on dirty.ru, in anti-Mongolian), I am translating my latest article.
You probably already know that NUMA stands for Non-Uniform Memory Access. This technology is currently featured in Intel Nehalem and AMD Opteron processors. Honestly, as mostly a practicing network engineer, I had always been sure that all processors compete equally for access to memory, but in the case of NUMA processors that picture of mine is badly outdated.

This is what it looked like before the advent of a new generation of processors.
In the new architecture, each processor socket has direct access only to certain memory slots, and together they form a NUMA node. That is, with 4 processors and 64 GB of memory, you will have 4 NUMA nodes, each with 16 GB of memory.
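The arithmetic above can be sketched in a few lines. This is a toy illustration only, assuming a symmetric layout where every socket forms one NUMA node and memory is split evenly:

```python
def numa_layout(sockets, total_memory_gb):
    """Return (number of NUMA nodes, memory per node in GB),
    assuming one node per socket and an even memory split."""
    return sockets, total_memory_gb / sockets

# The example from the text: 4 processors, 64 GB total.
nodes, per_node = numa_layout(4, 64)
print(nodes, per_node)  # 4 nodes of 16 GB each
```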

As I understand it, this new approach to memory access was invented because modern servers are so densely packed with processors and memory that it has become technologically and economically impractical to funnel all memory access through a single shared bus. That, in turn, leads to contention for bandwidth between processors and limits the scalability of server performance. The new approach introduces two concepts: local memory and remote memory. A processor accesses its local memory directly, but has to reach remote memory the old-fashioned way, over a shared bus, which means higher latency. It also means that to use the new architecture effectively, the OS must understand that it is running on NUMA hardware and manage its applications/processes accordingly; otherwise it risks a situation where an application runs on the processor of one node while its address space is located on another node. A quick search showed that the NUMA architecture has been supported by Microsoft since Windows Server 2003, and by VMware since at least ESX Server 2.
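The cost of the local-versus-remote split can be sketched with a back-of-the-envelope model. The latency figures below are purely illustrative assumptions, not measurements from any particular platform; the point is only that average latency degrades as more pages land on a remote node:

```python
def avg_latency_ns(local_fraction, local_ns=70, remote_ns=110):
    """Weighted average memory latency for a given fraction of
    local accesses (latency numbers are illustrative assumptions)."""
    return local_fraction * local_ns + (1 - local_fraction) * remote_ns

print(avg_latency_ns(1.0))  # all accesses local
print(avg_latency_ns(0.5))  # half the pages on a remote node
```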
I'm not sure whether the NUMA node data can be seen anywhere in the GUI, but it can definitely be seen in esxtop.

So, here we can observe that our server has 2 NUMA nodes, each with 48 GB of memory. According to this document, the first value is the amount of local memory in the NUMA node, and the second, in brackets, is the amount of free memory. However, a couple of times on my production servers I observed the second value being higher than the first, and I could not find any explanation for this.
So, as soon as the ESX server detects that it is running on a server with a NUMA architecture, it immediately enables the NUMA scheduler, which in turn takes care of the virtual machines and ensures that all the vCPUs of each machine stay within a single NUMA node. In previous versions of ESX (up to 4.1), for effective operation on NUMA systems, the maximum number of vCPUs of a virtual machine was always limited by the number of cores on one processor. Otherwise, the NUMA scheduler simply ignored such a VM and its vCPUs were evenly distributed across all available cores. However, ESX 4.1 introduced a new technology called Wide VM. It allows us to assign a VM more vCPUs than there are cores on a processor. According to the VMware documentation, the scheduler splits our "wide virtual machine" into several NUMA clients, and then each NUMA client is handled according to the standard scheme, within one NUMA node. However, the memory will still be fragmented across the NUMA nodes selected for this Wide VM, on which the vCPUs of the virtual machine are running. This is because it is almost impossible to predict which part of the memory a particular vCPU NUMA client will access. Despite this, Wide VMs still provide a significantly better memory access mechanism than the standard "smearing" of a virtual machine across all NUMA nodes.
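The splitting described above can be sketched as simple arithmetic. This is a simplified model of the behavior the VMware documentation describes, not the actual scheduler logic: each NUMA client gets at most as many vCPUs as a node has cores, and vCPUs are spread as evenly as possible:

```python
import math

def numa_clients(vcpus, cores_per_node):
    """Split a wide VM's vCPUs into NUMA clients, each fitting
    within one node (simplified sketch of the documented behavior)."""
    clients = math.ceil(vcpus / cores_per_node)
    base, extra = divmod(vcpus, clients)
    return [base + 1] * extra + [base] * (clients - extra)

# An 8-vCPU VM on a host with 6 cores per node:
print(numa_clients(8, 6))  # two clients of 4 vCPUs each
```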
Another great feature of the NUMA scheduler is that it not only decides where to place a virtual machine when it starts, but also constantly monitors the VM's ratio of local to remote memory. And if this value drops below a threshold (according to unconfirmed info, 80%), the scheduler starts migrating the VM to another NUMA node. Moreover, ESX controls the migration rate to avoid overloading the shared bus through which all NUMA nodes communicate.
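The locality check described above can be sketched like this. Note that the 80% threshold is the unconfirmed figure mentioned in the text, not a documented value, and the function names are hypothetical:

```python
# Unconfirmed threshold from the text, not a documented VMware value.
LOCALITY_THRESHOLD = 0.80

def should_migrate(local_mb, remote_mb, threshold=LOCALITY_THRESHOLD):
    """True if the VM's local-memory ratio has fallen below the
    threshold, making it a candidate for migration to another node."""
    total = local_mb + remote_mb
    if total == 0:
        return False
    return local_mb / total < threshold

print(should_migrate(3000, 1000))  # 75% local -> candidate for migration
```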
It is also worth noting that when installing memory in the server, you must populate the correct slots, because the distribution of memory between NUMA nodes is determined not by the NUMA scheduler but by the physical architecture of the server.
And finally, some useful information that you can get from esxtop.

Summary of values:
NHN — NUMA home node number of the VM
NMIG — number of virtual machine migrations between NUMA nodes
NRMEM — amount of remote memory used by the VM
NLMEM — amount of local memory used by the VM
N%L — percentage of the VM's memory that is local
GST_ND(X) — amount of guest memory allocated to the VM on node X
OVD_ND(X) — amount of memory spent on overhead on node X
I would like to note that, as usual, this whole article is just a compilation of what seemed interesting to me from recent reading on the blogs of Frank Denneman and Duncan Epping, as well as the official VMware docs.