MMU in pictures (part 1)
I want to talk about the memory management unit (MMU). As you, of course, know, the main function of the MMU is hardware support for virtual memory. The Dictionary of Cybernetics edited by Academician Glushkov tells us that virtual memory is imaginary memory allocated by the operating system to host a user program, its working fields, and its information arrays.
Systems with virtual memory have four main properties:
- User processes are isolated from each other and, when they die, do not take the whole system down with them
- User processes are isolated from physical memory, that is, they don’t know how much RAM you actually have or at which addresses it is located
- The operating system is much more complicated than on systems without virtual memory
- You can never know in advance how long it will take to execute the next processor instruction
The benefits of all of the above items are obvious: millions of crooked-handed application programmers, thousands of operating system developers, and countless embedded developers are grateful to virtual memory for the fact that they are still in business.
Unfortunately, for some reason, all of the above comrades do not treat the MMU with enough respect, and their acquaintance with virtual memory usually begins and ends with a study of paged memory and the Translation Lookaside Buffer (TLB). The most interesting part stays behind the scenes.
I don’t want to repeat Wikipedia, so if you have forgotten what paged memory is, it's time to follow the link. About the TLB there will be a couple of lines below.
How the MMU works
Now let's get down to business. This is what a processor without virtual memory support looks like:

All addresses used in a program on such a processor are real, “physical” ones, i.e. the programmer, when linking the program, must know at which addresses the RAM is located. If you have soldered in 640 kB of RAM mapped to addresses 0x02300000-0x0239FFFF, then all addresses in your program must fall into this range.
If a programmer wants to believe that he always has four gigabytes of memory (we are, of course, talking about a 32-bit processor) and that his program is the only thing distracting the processor from its sleep, then we need virtual memory. To add support for virtual memory, it is enough to place an MMU between the processor and RAM, which will translate virtual addresses (the addresses used in the program) into physical ones (the addresses that go to the inputs of the memory chips):

This arrangement is very convenient - the MMU is used only when the processor accesses memory (for example, on a cache miss), and the rest of the time it sits idle and saves energy. In addition, in this case the MMU has almost no effect on processor performance.
Here is what happens inside the MMU:

The translation process works like this:
- The processor feeds a virtual address to the MMU input
- If the MMU is turned off, or if the virtual address falls into an untranslated area, the physical address is simply set equal to the virtual one
- If the MMU is turned on and the virtual address falls into a translated area, the address is translated, that is, the virtual page number is replaced with the number of the corresponding physical page (the offset within the page stays the same):
- If an entry with the desired virtual page number is present in the TLB, the physical page number is taken from it
- If there is no such entry in the TLB, it has to be looked up in the page tables, which the operating system places in an untranslated area of RAM (so that handling one TLB miss cannot cause another). The lookup can be implemented either in hardware or in software - through an exception handler known as a page fault handler. The found entry is added to the TLB, after which the instruction that caused the TLB miss is executed again. A rough code sketch of this flow is given below.
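To make this flow concrete, here is a minimal, hypothetical sketch in C. The numbers (256-byte pages, an eight-entry TLB, a deliberately dumb replacement policy) come from the toy example used in this article, not from any real MMU, and `page_table_lookup` merely stands in for the hardware or OS page-table walk:

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy model of the translation flow above: 256-byte pages (8-bit offset),
 * a 24-bit virtual page number and an 8-entry TLB -- the numbers from the
 * example in this article, not from any real MMU. */
#define PAGE_SHIFT   8
#define PAGE_MASK    0xFFu
#define TLB_ENTRIES  8

typedef struct {
    bool     valid;
    uint32_t vpn;   /* virtual page number   */
    uint32_t pfn;   /* physical frame number */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Assumed helper: the page-table walk performed by hardware or by the OS
 * in a miss handler; it would raise a page fault for an unmapped page. */
extern uint32_t page_table_lookup(uint32_t vpn);

uint32_t translate(uint32_t vaddr, bool mmu_enabled)
{
    if (!mmu_enabled)
        return vaddr;                      /* physical address == virtual address */

    uint32_t vpn    = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & PAGE_MASK;

    /* 1. Look for an entry with this VPN in the TLB. */
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].pfn << PAGE_SHIFT) | offset;

    /* 2. TLB miss: walk the page tables, refill the TLB (here with a
     *    dumb "always overwrite entry 0" policy), and redo the access. */
    uint32_t pfn = page_table_lookup(vpn);
    tlb[0] = (tlb_entry_t){ .valid = true, .vpn = vpn, .pfn = pfn };
    return (pfn << PAGE_SHIFT) | offset;
}
```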
Lyrical digression about MPU
By the way, do not confuse the MMU with the MPU (Memory Protection Unit), which is often used in microcontrollers. Roughly speaking, an MPU is a greatly simplified MMU that, of all the MMU's functions, provides only memory protection. Accordingly, it has no TLB. When an MPU is used, the processor's address space is divided into several pages (for example, 32 pages of 128 megabytes each for a 32-bit processor), and individual access rights can be set for each of them.
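As a rough sketch of that idea (assuming the 32 fixed 128 MB pages mentioned above, with made-up names), an MPU check boils down to looking up the rights of the region an address falls into - no translation, no TLB:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical MPU check: the 32-bit address space is cut into 32 fixed
 * regions (the "pages" above) of 128 MB each, each with its own access
 * rights. No translation and no TLB -- only a permission check. */
#define MPU_REGIONS   32
#define REGION_SHIFT  27                  /* 128 MB = 2^27 bytes */

enum { MPU_READ = 1, MPU_WRITE = 2, MPU_EXEC = 4 };

static uint8_t mpu_rights[MPU_REGIONS];   /* filled in by the OS / startup code */

bool mpu_access_allowed(uint32_t addr, uint8_t requested)
{
    uint32_t region = addr >> REGION_SHIFT;               /* 0..31 */
    return (mpu_rights[region] & requested) == requested; /* all requested rights granted? */
}
```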
Let's look at how the TLB works with a simple example. Suppose we have two processes, A and B. Each of them exists in its own address space and has all addresses from zero to 0xFFFFFFFF available to it. The address space of each process is divided into pages of 256 bytes (I pulled this number out of thin air - usually the page size is at least one kilobyte), i.e. the address of the first page of each process is zero, the second is 0x100, the third is 0x200, and so on up to the last page at 0xFFFFFF00. Of course, a process does not have to occupy all the space available to it. In our case, Process A takes up only two pages, and Process B takes up three. Moreover, one of the pages is shared by both processes.
We also have 1536 bytes of physical memory, divided into six pages of 256 bytes (the page size of virtual and physical memory is always the same), and this memory is mapped into the processor's physical address space starting at address 0x40000000 (that is simply how it is wired to the processor).

In our case, the first page of Process A is located in physical memory starting at address 0x40000500. We will not go into the details of how this page gets there - it is enough to know that the operating system loads it. The operating system also adds an entry to the page table (but not to the TLB), linking this physical page with the corresponding virtual page, and transfers control to the process. The first instruction executed by Process A will cause a TLB miss, as a result of which a new entry will be added to the TLB.
When Process A needs access to its second page, the operating system will load it into some free spot in physical memory (say, at 0x40000200). Another TLB miss will occur, and the corresponding entry will again be added to the TLB. If there is no room in the TLB, one of the older entries will be overwritten.
After that, the operating system can suspend Process A and start Process B. It will load its first page at physical address 0x40000000. However, unlike Process A, the first instruction of Process B will not cause a TLB miss, since there is already an entry for virtual address zero in the TLB. As a result, Process B will start executing the code of Process A! Interestingly, this is exactly how the ARM9 processor, widely known in narrow circles, worked.
The easiest way to solve this problem is to invalidate the TLB when switching contexts, that is, to mark all of its entries as invalid. This is not a good idea, because:
- At this moment only two of the eight TLB entries are occupied, that is, a new entry would fit in without any problem
- By removing entries belonging to other processes from the TLB, we force those processes to take repeated TLB misses when the operating system runs them again. If the TLB has not eight entries but a thousand, system performance can drop significantly.
A more complicated way is to link all the programs so that they use different parts of the processor's virtual address space. For example, Process A may occupy the lower half (0x0-0x7FFFFFFF) and Process B the upper half (0x80000000-0xFFFFFFFF). Obviously, in this case there is no question of any isolation of the processes from each other, but this method is sometimes used in embedded systems. For general-purpose systems it is, for obvious reasons, unsuitable.
The third way is a development of the second. Instead of sharing the processor's four gigabytes of virtual address space between several processes, why not simply enlarge it? Say, 256 times? And, to keep the processes isolated, make sure that each process still has exactly four gigabytes available to it?
It turned out to be very simple. The virtual address was expanded to 40 bits, with the top eight bits being unique for each process and stored in a special register - the process identifier (PID). On a context switch, the operating system writes a new value into the PID register (the process itself cannot change its PID).
If for our Process A the PID is one, and for Process B it is two, then the virtual addresses that are identical from the point of view of processes, for example 0x00000100, turn out to be different from the point of view of the processor - 0x0100000100 and 0x0200000100, respectively.
Obviously, our TLB now has to store 32-bit virtual page numbers instead of 24-bit ones. For convenience, the upper eight bits are kept in a separate field - the Address Space IDentifier (ASID).
Now, when the processor sends a virtual address to the MMU input, the TLB is searched using the combination of VPN and ASID, so the first instruction of Process B will cause a TLB miss even without any prior TLB invalidation.
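To make the lookup rule concrete, here is a minimal, hypothetical sketch of a TLB entry extended with an ASID: a hit now requires both the virtual page number and the ASID of the currently running process to match (the eight-entry size and field widths follow the toy example in this article, not any real processor):

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 8                 /* the eight-entry toy TLB from the example */

typedef struct {
    bool     valid;
    uint8_t  asid;                    /* top 8 bits of the extended 40-bit virtual address */
    uint32_t vpn;                     /* virtual page number */
    uint32_t pfn;                     /* physical frame number */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];
static uint8_t current_asid;          /* written by the OS on a context switch */

/* Returns true on a TLB hit; a matching VPN with a different ASID is a miss. */
bool tlb_lookup(uint32_t vpn, uint32_t *pfn_out)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid &&
            tlb[i].asid == current_asid &&
            tlb[i].vpn  == vpn) {
            *pfn_out = tlb[i].pfn;
            return true;
        }
    }
    return false;
}
```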

In modern processors, ASIDs are most often either eight or 16 bits wide. For example, in all ARM processors with an MMU, starting with ARM11, the ASID is eight bits, and support for 16-bit ASIDs was added in the ARMv8 architecture.
By the way, if the processor supports virtualization, then in addition to the ASID it may also have a VSID (Virtual address Space IDentifier), which expands the processor's virtual address space even further and contains the number of the virtual machine running on it.
Even after adding the ASID, there are still situations when you need to invalidate one or several entries, or even the whole TLB (a sketch of such invalidation helpers follows the list):
- If a physical page is swapped out of RAM to disk - because this page may later be loaded back into memory at a completely different address, that is, the virtual address stays the same but the physical one changes
- If the operating system has changed the PID of the process - because the ASID will be different
- If the operating system has terminated the process
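As a rough illustration of these cases, here is a hypothetical sketch of three such invalidation operations over the same kind of toy TLB: drop a single page of one address space, drop everything belonging to one ASID, or drop the whole TLB. Real processors expose similar operations as dedicated instructions or system registers; the names here are made up:

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 8                 /* same toy TLB as in the earlier sketch */

typedef struct { bool valid; uint8_t asid; uint32_t vpn, pfn; } tlb_entry_t;
static tlb_entry_t tlb[TLB_ENTRIES];

/* A physical page moved (e.g. swapped out and back in): drop its entry. */
void tlb_invalidate_page(uint8_t asid, uint32_t vpn)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].asid == asid && tlb[i].vpn == vpn)
            tlb[i].valid = false;
}

/* The OS changed or retired a PID: drop every entry of that address space. */
void tlb_invalidate_asid(uint8_t asid)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].asid == asid)
            tlb[i].valid = false;
}

/* The heavy hammer: drop everything. */
void tlb_invalidate_all(void)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        tlb[i].valid = false;
}
```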
This could complete the story about the MMU, if not for one nuance. The fact is that between the processor and RAM, in addition to the MMU, there is also a cache memory.
Those who have forgotten what cache memory is and how it works can refresh knowledge here and there .
As an example, let's take a two-way set-associative cache one kilobyte in size with a cache line (or line of cache - whichever you prefer) of 64 bytes. It does not matter right now whether it is an instruction cache, a data cache or a unified cache. Since the memory page size is 256 bytes, four cache lines fit on each page.
Since the cache sits between the processor and the MMU, it is obvious that only virtual addresses are used for indexing and tag comparison (physical addresses appear only at the output of the MMU). Such a cache is called a Virtually Indexed, Virtually Tagged (VIVT) cache.
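For concreteness, here is a minimal, hypothetical model of exactly this geometry: 1 KB, two ways, 64-byte lines, hence 16 lines in 8 sets. Bits 5:0 of the virtual address are the offset within the line, bits 8:6 select the set, and bits 31:9 form the tag - and on a hit the MMU is never consulted:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical VIVT cache with the geometry from the example:
 * 1 KB, 2-way set-associative, 64-byte lines -> 16 lines in 8 sets.
 * Offset = bits 5:0, set index = bits 8:6, tag = bits 31:9,
 * all taken from the *virtual* address. */
#define LINE_SIZE  64
#define NUM_SETS   8
#define NUM_WAYS   2

typedef struct {
    bool     valid;
    uint32_t tag;                     /* virtual tag, bits 31:9 */
    uint8_t  data[LINE_SIZE];
} cache_line_t;

static cache_line_t cache[NUM_SETS][NUM_WAYS];

static inline uint32_t line_offset(uint32_t vaddr) { return vaddr & (LINE_SIZE - 1); }
static inline uint32_t set_index(uint32_t vaddr)   { return (vaddr >> 6) & (NUM_SETS - 1); }
static inline uint32_t line_tag(uint32_t vaddr)    { return vaddr >> 9; }

bool cache_read_byte(uint32_t vaddr, uint8_t *byte_out)
{
    uint32_t set = set_index(vaddr);
    uint32_t tag = line_tag(vaddr);

    for (int way = 0; way < NUM_WAYS; way++) {
        if (cache[set][way].valid && cache[set][way].tag == tag) {
            *byte_out = cache[set][way].data[line_offset(vaddr)];
            return true;              /* hit: the MMU is never consulted */
        }
    }
    return false;                     /* miss: only now do we go to the MMU and memory */
}
```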
Homonyms in the VIVT cache
What, in fact, is the problem? It's time to consider another example. Take the same two processes A and B:
- Assume that Process A runs first. As it executes, lines with its instructions or data are loaded into the cache one after another (Lines 0-4).
- The operating system stops Process A and transfers control to Process B.
- In theory, at this moment, the processor should load the first line from the first page of Process B into the cache and start executing commands from there. In fact, the processor sends a virtual address 0x0 to the cache input, after which the cache replies that it does not need to load anything, because the desired line is already cached.
- Process B begins to cheerfully execute the code of Process A.
Wait, you say, didn't we just solve this problem by adding the ASID to the TLB? But our cache sits before the MMU - as long as there is no cache miss we never even get to the MMU, and we have no idea what the ASID there is.

This is the first problem with the VIVT cache - the so-called homonyms, when the same virtual address can be mapped to different physical addresses (in our case, the virtual address 0x00000000 is mapped to the physical address 0x40000500 for Process A and 0x40000000 for Process B).
There are many options for solving the problem of homonyms:
- “Flush” the cache (that is, write its modified contents back to memory) and invalidate it (mark all lines as invalid) on every context switch. If we have separate instruction and data caches, the instruction cache is easy enough to deal with - it only needs to be invalidated, because it is read-only for the processor and there is nothing to flush
- Add the ASID to the cache tag (a rough sketch of this follows the list)
- Go to the MMU on every cache access, and not only on a miss - then the existing logic for comparing the PID with the ASID can be reused. Goodbye, energy savings and performance!
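Here is a hypothetical sketch of the second option, with the same assumed geometry as the cache model above: each line additionally remembers the ASID it was filled under, and a hit requires that stored ASID to match the current one, so Process B no longer hits on Process A's lines:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical fix for homonyms (option 2): the line also remembers the ASID
 * it was filled under, and a hit requires the ASID to match as well.
 * Same assumed geometry as before: 8 sets, 2 ways, 64-byte lines. */
#define NUM_SETS  8
#define NUM_WAYS  2

typedef struct {
    bool     valid;
    uint8_t  asid;                    /* which address space filled this line */
    uint32_t tag;                     /* virtual tag, bits 31:9 */
} tagged_line_t;

static tagged_line_t cache[NUM_SETS][NUM_WAYS];
static uint8_t current_asid;          /* updated by the OS on a context switch */

bool cache_hit(uint32_t vaddr)
{
    uint32_t set = (vaddr >> 6) & (NUM_SETS - 1);
    uint32_t tag = vaddr >> 9;

    for (int way = 0; way < NUM_WAYS; way++)
        if (cache[set][way].valid &&
            cache[set][way].asid == current_asid &&
            cache[set][way].tag  == tag)
            return true;              /* same virtual address *and* same address space */

    return false;                     /* Process B no longer hits on Process A's lines */
}
```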
Which option do you like? I would probably lean toward the second. ARM9 and some other processors with VIVT caches implement the third.
Synonyms in VIVT Cache
If homonyms in the cache were the only problem, then processor developers would be the happiest people in the world. There is a second problem - synonyms, aliases, when several virtual addresses are mapped to the same physical address.
Let's go back to our processes A and B. Suppose that we have solved the problem of homonyms in some decent way, because flushing the cache every time is really very expensive!
So:
- First, Process A runs - Lines A0-A3 and Common line 0 are loaded into the cache one after another (remember that processes A and B have one page in common)
- Then the operating system switches the context without flushing the cache
- Process B starts to run. Lines B0-B7 are loaded into the cache.
- Finally, Process B accesses Common line 0. This line has already been loaded into the cache by Process A, but the processor does not know that, because it sits in the cache under a different virtual address (recall that there are no physical addresses in a VIVT cache)
- A cache miss occurs. The virtual address 0x00000200 is translated to the physical address 0x40000200, and Common line 0 is loaded into the cache a second time. Its location is determined by the virtual address - and the address 0x00000200 corresponds to Index 0 (bits 8-6) and Tag 0x1 (bits 31-9).
- Since both ways of the set with index 0 are already occupied, one of the already loaded lines (A0 or B0) has to be thrown out. Using the LRU (Least Recently Used) algorithm, the cache evicts Line A0 and puts Common line 0 in its place.

As a result, the same piece of physical memory ends up in the cache in two different places. Now, if both processes modify their copy and then want to write it back to memory, one of them is in for a surprise.
Obviously, this problem is relevant only for the data cache or a unified cache (again, the instruction cache is read-only), but it is much harder to solve than the homonym problem:
- The easiest way, as we have already found out, is to flush the cache on every context switch (by the way, to counter synonyms, unlike homonyms, you do not need to invalidate the cache). However, firstly, it is expensive, and secondly, it will not help if a process wants to have several copies of the same physical page in its own address space
- You can identify synonyms in hardware:
- Or, on each miss, run through the entire cache, translating the address behind each tag and comparing the resulting physical addresses with the one obtained by translating the address that caused the miss (and envy the dead!) - a rough sketch of this follows the list
- Or add a new block to the processor that will perform the reverse translation - the conversion of the physical address to virtual. After that, for each miss, translate the address that caused it, and then use the reverse translation to convert this physical address to virtual and compare it with all the tags in the cache. By golly, you'd better just flush the cache!
- The third way is to avoid synonyms in software. For example, do not use shared pages. Or go back to linking all programs into a single shared address space.
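For the first of the hardware options above, here is a hypothetical sketch of what "run through the entire cache on a miss" means: rebuild each valid line's virtual address from its tag and set index, translate it, and compare physical line numbers with the line that missed. The geometry is the same assumed one as before, and `mmu_translate` is an assumed helper standing in for the TLB/page-table walk:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical synonym search: on a miss, walk the whole cache, rebuild each
 * valid line's virtual address from its tag and set index, translate it, and
 * compare physical line numbers with the line that caused the miss.
 * Same assumed geometry: 8 sets, 2 ways, 64-byte lines. */
#define NUM_SETS  8
#define NUM_WAYS  2

typedef struct { bool valid; uint32_t tag; } line_t;
static line_t cache[NUM_SETS][NUM_WAYS];

/* Assumed helper standing in for the TLB / page-table walk. */
extern uint32_t mmu_translate(uint32_t vaddr);

/* Returns true (and where) if the physical line containing miss_paddr is
 * already cached under some other virtual address. */
bool find_synonym(uint32_t miss_paddr, int *set_out, int *way_out)
{
    uint32_t miss_line = miss_paddr >> 6;         /* physical line number */

    for (int set = 0; set < NUM_SETS; set++) {
        for (int way = 0; way < NUM_WAYS; way++) {
            if (!cache[set][way].valid)
                continue;
            /* Rebuild the line's virtual address: tag | set index | zero offset. */
            uint32_t vaddr = (cache[set][way].tag << 9) | ((uint32_t)set << 6);
            if ((mmu_translate(vaddr) >> 6) == miss_line) {
                *set_out = set;
                *way_out = way;
                return true;          /* synonym found: flush it or reuse it */
            }
        }
    }
    return false;
}
```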
Coherence and VIVT Cache
But that's not all! We live in the era of multi-core processors, each core with its own L1 cache, and these caches strive to become incoherent. Tricky protocols were invented to combat cache incoherence. But here is the problem: these protocols snoop the external bus, and what addresses travel on it? Physical ones! And the cache uses virtual ones.

How, then, do we find out which cache line urgently needs to be written back to memory because a neighboring processor is waiting for it? Well, we can't. Perhaps that is a topic for a separate article.
Advantages of VIVT Cache
The VIVT cache has one significant plus: if a physical page is evicted from memory, the corresponding cache lines need to be flushed, but they do not need to be invalidated. Even if this physical page is later loaded at a different location, this will not affect the contents of the cache, because physical addresses are not used when accessing a VIVT cache, and the processor does not care whether they have changed or not.
About twenty years ago there were more pluses. VIVT caches were actively used in the days when the MMU was an external, separate chip. Accessing such an MMU took much more time than accessing an MMU located on the same chip as the processor. But since the MMU only had to be consulted on a cache miss, this still allowed acceptable system performance.
What to do?
So, VIVT caches have too many disadvantages and too few advantages (by the way, if I have forgotten any other sins - write in the comments). Smart people thought about it and decided: let's move the cache behind the MMU. Let the indices and tags come from physical addresses, not virtual ones. And so they did. Whether it worked, you will learn in the next part.