
Memory management basics in vSphere 4.1
Logically, this article should have appeared first, before the article on Transparent Page Sharing, since it is the foundation from which any dive into memory resource management in vSphere 4.1 should begin.
In my English-language blog, when I was just starting to study this topic, I split it into two parts - it was easier for me to digest information that was completely new to me. But since the audience on Habr is serious and experienced, I decided to combine the material into one article.
We will start with the most basic element, the Memory Page. It is defined as a contiguous block of data of fixed size used for memory allocation. Typically, a page is either 4 KB (Small Page) or 2 MB (Large Page). For each application, the OS allocates its own virtual address space (2 GB per process on a typical 32-bit OS), which belongs to that application alone. So that the OS knows which page of physical memory (Physical Address, PA) corresponds to a particular page of virtual memory (Virtual Address, VA), it keeps track of all memory pages in a Page Table. This is where all the VA-to-PA mappings are stored.
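To make the VA-to-PA mapping concrete, here is a minimal sketch in Python (the page numbers are made up for illustration, this is not how any real OS stores its page tables): the virtual address is split into a page number and an offset, the page number is looked up in the table, and the physical frame plus the offset gives the physical address.

```python
PAGE_SIZE = 4 * 1024  # 4 KB small pages

# Hypothetical page table: virtual page number -> physical frame number
page_table = {
    0: 42,     # virtual page 0 lives in physical frame 42
    1: 7,
    5: 1030,
}

def translate(virtual_address):
    """Translate a virtual address (VA) into a physical address (PA)."""
    vpn = virtual_address // PAGE_SIZE      # virtual page number
    offset = virtual_address % PAGE_SIZE    # position inside the page
    if vpn not in page_table:
        raise LookupError("page fault: no PA mapped for this VA")
    pfn = page_table[vpn]                   # physical frame number
    return pfn * PAGE_SIZE + offset

print(hex(translate(0x1234)))  # VA in page 1 -> frame 7 -> 0x7234
```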
Next, we need something that, whenever an application requests memory, can find the required VA-PA pair in the Page Table. That something is the Memory Management Unit (MMU). Finding a VA-PA pair is not always fast, considering that a 2 GB virtual address space can hold up to 524,288 pages of 4 KB each. To speed up the search, the MMU relies on the Translation Lookaside Buffer (TLB), which caches recently used VA-PA pairs. Every time the application makes a memory request, the MMU first checks the TLB for the VA-PA pair. If it is there, great: the PA is handed to the processor, and this is called a TLB hit. If nothing is found in the TLB (a TLB miss), the MMU has to walk the entire Page Table, and as soon as the required pair is found, it is placed in the TLB and the application is notified.
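That hit/miss logic can be sketched roughly like this, continuing the toy example above (the cache size and the LRU eviction are arbitrary choices for the sketch, not what real hardware does):

```python
from collections import OrderedDict

class TinyTLB:
    """Toy TLB: a small cache of recent VPN -> PFN translations."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = OrderedDict()   # vpn -> pfn, kept in least-recently-used order

    def lookup(self, vpn, page_table):
        if vpn in self.entries:        # TLB hit: answer comes straight from the cache
            self.entries.move_to_end(vpn)
            return self.entries[vpn]
        # TLB miss: walk the full page table, then cache the result
        pfn = page_table[vpn]
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        self.entries[vpn] = pfn
        return pfn
```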
If the desired page has been swapped out, the page is first brought back from swap into memory, then the VA-PA pair is written to the TLB, and only then does the application access the memory. Visually, it looks like this.

The TLB is quite limited in size. In Nehalem processors the first-level TLB holds 64 entries for 4 KB pages or 32 entries for 2 MB pages, while the second-level TLB works only with small pages and holds 512 entries. From this we can expect that using large pages will lead to noticeably fewer TLB misses, simply because the TLB then covers far more memory: with small pages it reaches (64 x 4 KB) + (512 x 4 KB) = 2304 KB, versus 32 x 2 MB = 64 MB with large pages.
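The same arithmetic as a quick sanity check, using the Nehalem entry counts quoted above:

```python
# How much memory the TLB can cover ("TLB reach") with each page size
small_reach = (64 + 512) * 4 * 1024        # L1 + L2 entries x 4 KB pages
large_reach = 32 * 2 * 1024 * 1024         # L1 entries x 2 MB pages (no L2 help)

print(small_reach // 1024, "KB")            # 2304 KB
print(large_reach // (1024 * 1024), "MB")   # 64 MB
```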
I cannot resist quoting the cost of a TLB miss from Wikipedia, which describes a sort of averaged TLB:
Size: 8 - 4,096 entries
Hit time: 0.5 - 1 clock cycle
Miss penalty: 10 - 100 clock cycles
Miss rate: 0.01 - 1%
If a TLB hit takes one CPU cycle, a TLB miss costs 30 cycles, and the average TLB miss rate is 1%, then the average number of cycles per memory access comes out to 1.3.
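Or, spelled out with the figures used above:

```python
hit_cost, miss_penalty, miss_rate = 1, 30, 0.01
avg_cycles = hit_cost + miss_rate * miss_penalty   # 1 + 0.01 * 30 = 1.3 cycles per access
print(avg_cycles)
```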
All these considerations and calculations hold when our OS runs on a physical server. But when the OS runs in a virtual machine, there is one more level of memory translation. When an application inside a virtual machine makes a memory request, it uses a virtual address (VA), which must be translated into the physical address of the virtual machine (PA), which in turn must be translated into the physical address of the ESXi host's memory (HA). That means we now need two Page Tables: one for the VA-PA pairs and another for the PA-HA translations.
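Sticking with the toy tables from earlier, that extra level of translation can be sketched like this (the numbers are invented, and real ESXi obviously does nothing of the sort in Python):

```python
PAGE_SIZE = 4 * 1024

guest_page_table = {0: 3, 1: 9}     # guest OS: VA page -> PA page (the VM's "physical" memory)
vmm_page_table   = {3: 120, 9: 77}  # VMM: PA page -> HA page (real host memory)

def guest_va_to_host_pa(virtual_address):
    """Two-step translation: VA -> PA (guest page table) -> HA (VMM page table)."""
    vpn, offset = divmod(virtual_address, PAGE_SIZE)
    pa_page = guest_page_table[vpn]     # first translation, owned by the guest OS
    ha_page = vmm_page_table[pa_page]   # second translation, owned by the VMM / ESXi
    return ha_page * PAGE_SIZE + offset
```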
If we try to display it graphically, we get something like this.

I don't want to get into the deep historical jungle of memory virtualization technologies from the early days of ESX, so we will look at the two most recent approaches: Software Memory Virtualization and Hardware Assisted Memory Virtualization. Here I need to introduce one more player, the Virtual Machine Monitor (VMM). The VMM takes part in executing all the instructions the virtual machine issues to the processor and memory, and there is a separate VMM process for each virtual machine.
Now that we have refreshed the basics, we can go on to some details.
Software Memory Virtualization
As the name implies, this is a software implementation of memory virtualization. Here the VMM maintains a Shadow Page Table, into which the following translations are copied:
1. VA - PA: these address pairs are copied directly from the virtual machine's Page Table, which the guest OS is responsible for.
2. PA - HA: the VMM itself is responsible for these address pairs.
In other words, it is a single table that combines the double translation. Every time an application in a virtual machine accesses memory (VA), the MMU looks in the Shadow Page Table and finds the corresponding VA - HA pair, so that the physical processor works with the real physical address of the host memory (HA). At the same time the Shadow Page Table is kept isolated from the virtual machine, so that one virtual machine cannot gain access to the memory of another.
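Conceptually, building the Shadow Page Table is just composing the two mappings into one VA -> HA table, which the VMM then has to keep in sync whenever either source table changes. A minimal sketch with made-up toy tables:

```python
def build_shadow_page_table(guest_page_table, vmm_page_table):
    """Collapse VA -> PA and PA -> HA into a single VA -> HA table."""
    shadow = {}
    for vpn, pa_page in guest_page_table.items():
        if pa_page in vmm_page_table:      # the PA must itself be backed by host memory
            shadow[vpn] = vmm_page_table[pa_page]
    return shadow

# Toy tables: guest VA page -> PA page, and PA page -> HA page
shadow = build_shadow_page_table({0: 3, 1: 9}, {3: 120, 9: 77})
print(shadow)   # {0: 120, 1: 77}: the MMU can now resolve a VA in a single lookup
```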
According to the VMware documentation, this type of address translation is quite comparable in speed to address translation on an ordinary physical server.
The question is, why bother inventing new technologies then? It turns out there is a catch: not every type of memory access from a virtual machine can be executed at the same speed as on a physical server. For example, every time the Page Table changes inside the virtual machine's OS (that is, a VA - PA pair changes), the VMM has to intercept the operation and update the corresponding entry in the Shadow Page Table (that is, the resulting VA - HA pair). Another good example is an application's very first access to a particular memory location: since the VMM has never heard of this VA before, it has to create a new entry in the Shadow Page Table, again adding latency to the memory access. Finally, although it is not critical, the Shadow Page Table itself also consumes memory. This technique is used when vSphere runs on processors released before the Nehalem / Barcelona families appeared on the market.
Hardware Assisted Memory Virtualization
There are currently two main MMU virtualization technologies on the market. The first was introduced by Intel in the Nehalem processor family and is called Extended Page Tables (EPT). The second was introduced by AMD in the Barcelona processor family and is called Rapid Virtualization Indexing (RVI). In principle, both technologies provide the same functionality and differ only in very deep technical details, which I chose not to study because they are insignificant for my work.
The main advantage of both technologies is that the new MMU can now run two address lookups at once. One looks up the VA - PA pair in the virtual machine's own Page Table, the other looks up the PA - HA pair in the Page Table controlled by the VMM. That second Page Table is called the Extended (sometimes Nested) Page Table. Once both pairs are found, the MMU writes the resulting VA - HA pair to the TLB. Since the MMU can now reach both tables itself, the Shadow Page Table is no longer needed. Another important point is that, with the two tables separated, the virtual machine can freely manage its own Page Table without any intervention from the VMM.
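A schematic of that combined lookup, heavily simplified (a real EPT/RVI walk goes through the nested table at every level of the guest page table walk, not just once at the end):

```python
def hardware_assisted_translate(vpn, guest_page_table, nested_page_table, tlb):
    """EPT/RVI-style lookup: walk both tables, cache the VA -> HA result in the TLB."""
    if vpn in tlb:                        # TLB already holds the combined VA -> HA result
        return tlb[vpn]
    pa_page = guest_page_table[vpn]       # walk 1: the guest's own page table (VA -> PA)
    ha_page = nested_page_table[pa_page]  # walk 2: the extended/nested page table (PA -> HA)
    tlb[vpn] = ha_page                    # the TLB caches the end-to-end translation
    return ha_page

tlb = {}
ha = hardware_assisted_translate(0, {0: 3, 1: 9}, {3: 120, 9: 77}, tlb)
print(ha, tlb)   # 120 {0: 120}; the next access to page 0 is a TLB hit
```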
Another significant feature of the Nehalem architecture is the new Virtual Processor ID (VPID) field in the TLB. Older processors did not have this field, so when the processor switched from the context of one virtual machine to another, the entire TLB was flushed for security reasons. With VPID this flush can be avoided, which again reduces the number of TLB misses.
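The effect of VPID can be pictured as keying TLB entries by (VM, page) instead of just the page, so a switch between virtual machines no longer requires a flush. A toy illustration, not how the silicon actually stores it:

```python
tlb = {}   # (vpid, vpn) -> ha_page: entries from different VMs coexist safely

def tlb_insert(vpid, vpn, ha_page):
    tlb[(vpid, vpn)] = ha_page

def tlb_lookup(vpid, vpn):
    # A VM can only hit entries tagged with its own VPID, so nothing
    # needs to be flushed when the CPU switches between VMs.
    return tlb.get((vpid, vpn))

tlb_insert(vpid=1, vpn=0, ha_page=120)   # VM 1's translation
tlb_insert(vpid=2, vpn=0, ha_page=555)   # VM 2's translation for the same guest page
print(tlb_lookup(1, 0), tlb_lookup(2, 0))   # 120 555
```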
The only noted drawback of this solution is the higher cost of a TLB miss, for one simple reason: whenever the MMU does not find the required address pair in the TLB, it has to walk two page tables instead of one. That is why, when vSphere detects that it is running on a Nehalem processor, it uses large pages by default. Next week I will try to finish disabling large page support on all of our ESXi hosts and publish the performance results - in my first article I already mentioned the results of disabling Large Pages on one of the production servers.
For a number of reasons, ranging from laziness and a desire to keep the material accessible to plain ignorance on my part, I left out many nuances and details: the various types of TLBs and processors, the additional translation step from Linear Address to Virtual Address, differences in how various operating systems work with memory, and so on. Criticism of the material, notes on technical inaccuracies, tricky questions, and indeed any lively interest are welcome, because they will help all of us fill our knowledge gaps.
The main sources of my inspiration and information were Wikipedia and this document, for which I would like to thank their many authors.