Rings, privilege levels and protection in x86

Original author: Gustavo Duarte

You have probably guessed intuitively that applications running on Intel x86 computers are limited in what they can do, and that some actions can be performed only by the operating system. But do you know how this really works? In this post we'll look at x86 privilege levels, the mechanism by which the OS and the processor cooperate to restrict what user mode applications can do.


Gustavo Duarte article translation: CPU Rings, Privilege, and Protection

There are four privilege levels, numbered from 0 (the most privileged) to 3 (the least privileged), and three kinds of resources protected by the processor: memory, I/O ports, and the ability to execute certain instructions. At any given moment an x86 processor runs at a specific privilege level, and that level determines what the code can and cannot do. Privilege levels are often called protection rings and drawn as nested circles, with the most privileged level as the innermost circle. Most modern x86 kernels use only two privilege levels, 0 and 3.

About 15 machine instructions, out of the dozens available, can be executed only in ring 0. Many others have restrictions on their operands. Without these restrictions the protection mechanisms could not work, because such instructions could bypass them or otherwise cause chaos, so they are reserved for kernel code. Attempting to execute them outside ring 0 raises a general-protection exception (#GP), the same exception that occurs, for example, when a program tries to access an invalid memory address. Access to memory and I/O ports is restricted in a similar way, depending on the privilege level.

Before we look at the protection mechanisms, let's see how the processor keeps track of the current privilege level. Segment selectors, which we examined in a previous post, are directly involved. Here they are:

image

A programmer uses ordinary instructions to load a selector into any of the data segment registers (for example, SS or DS). This loads the full contents of the selector, including the Requested Privilege Level (RPL) field, whose purpose is described below. The CS register works a little differently. First, you cannot load a selector into it directly with an instruction such as mov. Its contents change only as a result of an instruction that alters the flow of execution, such as a far call. Second, and this is the important part for us, instead of an RPL field whose value is chosen by the programmer, the selector in CS has a Current Privilege Level (CPL) field whose value is set and maintained by the processor itself. The two-bit CPL field in CS always equals the privilege level at which the processor is currently running. The Intel docs waver a little on this point, and documents found online sometimes confuse the issue further, but this is the hard and fast rule: at any moment, no matter what is going on in the processor, looking at the CPL field in CS tells you the privilege level with which the code is executing.
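
To make the selector layout concrete, here is a minimal sketch that splits a 16-bit selector into its fields. The names `Selector` and `decode_selector` are made up for this illustration, and the two sample values (0x60 and 0x73) are the kernel and user code selectors that 32-bit Linux kernels of that era appear to use; treat them as illustrative rather than authoritative.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int  # bits 15..3: index of the descriptor in the GDT or LDT
    table: str  # bit 2 (TI): "GDT" when clear, "LDT" when set
    rpl: int    # bits 1..0: RPL, or the CPL when the selector is in CS


def decode_selector(value: int) -> Selector:
    """Split a 16-bit segment selector into its three fields."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("a segment selector is 16 bits wide")
    return Selector(
        index=value >> 3,
        table="LDT" if value & 0b100 else "GDT",
        rpl=value & 0b11,
    )


# Assumed sample values: kernel code selector and user code selector.
print(decode_selector(0x60))
print(decode_selector(0x73))
```

The low two bits of the second value are 3, which is exactly what the processor reports as CPL while user mode code is running.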

Note that the processor's current privilege level has nothing to do with operating system user privileges. It doesn't matter which account you work under: root, administrator, guest, or regular user. All user application code runs in ring 3, while all kernel code runs in ring 0. Sometimes tasks typical of the kernel are moved to user space, as is done for some drivers in Windows Vista, but this is just a special case in which a special kind of process does some work for the kernel. Such processes can usually be killed without serious consequences.

Because of the restrictions on access to memory and I/O ports, a program running in user mode cannot affect the outside world and can do almost nothing without the help of the kernel. It cannot open a file, send a network packet, print a line of text, or allocate memory for itself. User processes run in a severely limited sandbox set up for them by the "gods of ring zero". This is why, by design, a memory leak cannot outlive the process that caused it, and files opened by a process cannot remain open after it exits. All the data structures that control such things (allocated memory, open files) are inaccessible to user code, and as soon as the program finishes, the kernel tears down its sandbox. That is why servers can have 600 days of uptime: as long as the hardware and the kernel don't fail, everything can run forever. This is also why Windows 95/98 crashed so often: not because "M$ sucks", but because some important data structures were left accessible to user mode applications for the sake of backward compatibility. It was probably a reasonable trade-off at the time, but it came at a very high price.

The processor protects memory at two strategic points: when a segment selector is loaded into a register, and when a memory page is accessed. Protection thus mirrors the two stages of address translation, segmentation and paging. When a segment selector is loaded, the following check is performed:

image

The larger the number, the lower the privilege level it denotes. The MAX() function therefore picks the less privileged of CPL and RPL, and the result is compared with the privilege level of the target descriptor (DPL). If the DPL is numerically greater than or equal to that value, access is allowed. The point of including RPL in this formula is that it lets kernel code deliberately load a segment with lowered privilege. For example, the kernel can use a selector with RPL set to 3 to make sure that some operation works only with user mode data segments. The check for loading a selector into the SS register is different: it succeeds only if CPL, RPL and DPL are all equal.
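
The two checks described above can be sketched in a few lines. This is only a model of the rule as stated in the text for data and stack segments; the function names are hypothetical, and details such as conforming code segments are deliberately left out.

```python
def may_load_data_segment(cpl: int, rpl: int, dpl: int) -> bool:
    """Check applied when a selector is loaded into a data segment register.

    The less privileged (numerically larger) of CPL and RPL must not
    exceed the DPL of the target descriptor.
    """
    return max(cpl, rpl) <= dpl


def may_load_stack_segment(cpl: int, rpl: int, dpl: int) -> bool:
    """Stricter check for SS: all three levels must be equal."""
    return cpl == rpl == dpl


# User code (CPL 3) cannot reach a kernel data segment (DPL 0).
print(may_load_data_segment(cpl=3, rpl=0, dpl=0))
# The kernel (CPL 0) lowers its own privilege with RPL 3: a kernel
# segment is now refused, a user segment (DPL 3) is still accepted.
print(may_load_data_segment(cpl=0, rpl=3, dpl=0))
print(may_load_data_segment(cpl=0, rpl=3, dpl=3))
```

The second and third calls show the purpose of RPL: with RPL 3 the kernel is treated as if it were user code for that particular load.
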

In practice, segment-level protection plays little role, because modern kernels use a flat memory model in which the user mode segments span the entire linear address space. The useful memory protection happens in the paging unit, when a linear address is converted to a physical one. Each memory page is a block of bytes described by a page table entry. Two fields in this entry relate to protection: the supervisor flag and the read/write flag. The supervisor flag is the primary memory protection mechanism used by modern kernels. When a page is marked supervisor-only (the U/S bit in its entry is clear), it cannot be accessed from ring 3.

Although the read/write flag matters less for enforcing privilege, it still has interesting uses. When a program is loaded into memory for execution, the pages holding its executable image are marked read-only. This catches some pointer errors, namely those where an attempt is made to write to these pages. The read/write flag is also used to implement copy-on-write when a child process is created with the fork() system call on Unix-like operating systems. When a fork is performed, the parent's memory pages are marked read-only, and the child initially shares the same pages with the parent. If either process tries to write to such a page, the processor raises a fault, and the kernel handles it by creating a private copy of the page for the writing process and marking that copy read/write.
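
As a rough illustration of the page-level check, the sketch below models a single page table entry with its three low flag bits (present, read/write, user/supervisor). It is a simplification under stated assumptions: a real processor combines the flags of the page directory and page table entries, and later features such as no-execute are ignored. The `wp` parameter stands for the CR0.WP bit, which decides whether ring 0 is itself bound by read-only pages; the function name is made up.

```python
PRESENT = 1 << 0   # P: the page is mapped
WRITABLE = 1 << 1  # R/W: 0 = read-only, 1 = read/write
USER = 1 << 2      # U/S: 0 = supervisor-only, 1 = accessible from ring 3


def page_access_ok(entry: int, cpl: int, write: bool, wp: bool = True) -> bool:
    """Return True if the access passes, False if it would page-fault."""
    if not entry & PRESENT:
        return False  # unmapped page: fault regardless of privilege
    user_mode = cpl == 3
    if user_mode and not entry & USER:
        return False  # ring 3 touching a supervisor page
    if write and not entry & WRITABLE:
        # Ring 3 always faults; rings 0-2 fault only when CR0.WP is set.
        return not user_mode and not wp
    return True


kernel_page = PRESENT | WRITABLE       # supervisor-only, read/write
shared_after_fork = PRESENT | USER     # user page, read-only (copy-on-write)

print(page_access_ok(kernel_page, cpl=3, write=False))
print(page_access_ok(shared_after_fork, cpl=3, write=False))
print(page_access_ok(shared_after_fork, cpl=3, write=True))
```

The last call is the copy-on-write case: the write faults, and it is then up to the kernel to copy the page and retry.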

Moving on. We need a mechanism that lets the CPU switch between privilege levels. If code running in ring 3 could transfer control to arbitrary places in the kernel, it would be easy to subvert the operating system's protection simply by doing a jmp to the wrong (or right?) address. To prevent this, a mechanism for controlled transfer of execution is needed. It is implemented with gate descriptors and with the sysenter instruction. A gate descriptor is a segment descriptor of type "system", and it comes in four varieties: call-gate, interrupt-gate, trap-gate and task-gate descriptors. Through a call-gate descriptor the kernel can provide an entry point that is usable with the ordinary far call and far jump instructions. Call gates are rarely used, so we will not discuss them. Task gates are of little interest to us either (Linux uses them only for handling double faults, which are usually caused by kernel or hardware problems).

That leaves two far more interesting types of descriptor: interrupt gates and trap gates, which are used to handle hardware interrupts (keyboard, timers, disks) and exceptions (page faults, division by zero). I will loosely refer to both kinds of event as "interrupts". These descriptors are stored in the interrupt descriptor table (IDT). Each interrupt is assigned a number from 0 to 255, called a vector. To decide which descriptor to use for handling an interrupt, the processor uses the vector as an index into the IDT. The formats of interrupt-gate and trap-gate descriptors are virtually identical. The format, together with the privilege check performed when an interrupt occurs, is shown in the figure. I filled in some of the descriptor fields with the values used by the Linux kernel to make things concrete.

image
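
For readers who prefer bytes to pictures, here is a hedged sketch of the 8-byte layout of a 32-bit interrupt-gate or trap-gate descriptor, packed and unpacked with the standard `struct` module. The helper names are invented for this example, and the handler address 0xC0103ABC is made up; only the field layout follows the Intel format (offset split into two halves, selector, and an access byte holding the present bit, the DPL and the gate type).

```python
import struct
from typing import NamedTuple

# Low five bits of the access byte (S bit plus type), 32-bit gates only.
GATE_TYPES = {0x0E: "interrupt-gate", 0x0F: "trap-gate"}


class Gate(NamedTuple):
    kind: str
    selector: int  # code segment selector of the handler
    offset: int    # 32-bit entry point within that segment
    dpl: int       # privilege required to reach the gate via INT n
    present: bool


def pack_gate(offset: int, selector: int, dpl: int,
              gate_type: int = 0x0E, present: bool = True) -> bytes:
    """Build the 8 raw bytes of a gate descriptor (little-endian)."""
    access = (int(present) << 7) | (dpl << 5) | gate_type
    return struct.pack("<HHBBH", offset & 0xFFFF, selector, 0,
                       access, offset >> 16)


def unpack_gate(raw: bytes) -> Gate:
    """Decode 8 raw descriptor bytes back into named fields."""
    off_lo, selector, _reserved, access, off_hi = struct.unpack("<HHBBH", raw)
    return Gate(
        kind=GATE_TYPES.get(access & 0x1F, "other"),
        selector=selector,
        offset=(off_hi << 16) | off_lo,
        dpl=(access >> 5) & 0b11,
        present=bool(access & 0x80),
    )


# A user-callable trap gate pointing at a hypothetical kernel handler.
raw = pack_gate(offset=0xC0103ABC, selector=0x60, dpl=3, gate_type=0x0F)
print(raw.hex())
print(unpack_gate(raw))
```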

Access is controlled by the current CPL together with the DPLs of the gate and of the target segment, while the entry point is determined by the selector and offset fields of the gate descriptor. In modern kernels, the segment selector in the gate descriptor usually selects the kernel's code segment. The interrupt mechanism is designed so that it can never transfer control from a more privileged ring to a less privileged one. The privilege level must either stay the same (when the kernel itself is interrupted) or become more privileged (when a user mode application is interrupted). In either case the new CPL equals the DPL of the target code segment. If the CPL changes, the stack is switched automatically as well. If the interrupt is triggered by software, for example by an INT n instruction, one more check is performed: the gate DPL must be numerically equal to or greater than the current CPL. This check keeps user code from invoking arbitrary interrupt handlers. On Linux, all interrupt handlers run in ring 0.
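
The rules in this paragraph fit into a small decision function. The sketch below is a model of the description above, not of every case in the Intel manuals (conforming code segments, task gates and so on are ignored), and all names in it are hypothetical.

```python
from typing import Tuple


class GeneralProtectionFault(Exception):
    """Stands in for the #GP exception raised by the processor."""


def deliver_interrupt(cpl: int, gate_dpl: int, target_code_dpl: int,
                      software: bool) -> Tuple[int, bool]:
    """Return (new CPL, whether the stack is switched) or raise #GP."""
    # Extra check for INT n: the gate must be reachable from the caller.
    if software and cpl > gate_dpl:
        raise GeneralProtectionFault("gate DPL is more privileged than CPL")
    # Control may never move to a less privileged ring.
    if target_code_dpl > cpl:
        raise GeneralProtectionFault("target segment is less privileged")
    new_cpl = target_code_dpl
    return new_cpl, new_cpl != cpl


# INT n from ring 3 through a gate with DPL 3 into a ring 0 handler.
print(deliver_interrupt(cpl=3, gate_dpl=3, target_code_dpl=0, software=True))
# A hardware interrupt while in ring 3: the gate DPL is not consulted.
print(deliver_interrupt(cpl=3, gate_dpl=0, target_code_dpl=0, software=False))
# The same gate invoked with INT n from ring 3 is refused.
try:
    deliver_interrupt(cpl=3, gate_dpl=0, target_code_dpl=0, software=True)
except GeneralProtectionFault as fault:
    print("#GP:", fault)
```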

During initialization, the Linux kernel function setup_idt() first builds an IDT whose descriptors do not yet point to real entry points. The descriptors are then filled in by code in include/asm-x86/desc.h and arch/x86/kernel/traps_32.c. In Linux terminology, a gate descriptor with "system" in its name is accessible from user mode, and its gate DPL is set to 3. A "system gate" is an Intel trap gate accessible from user mode. Otherwise the terminology matches Intel's. Hardware interrupt gates are not set up here, but in the corresponding drivers.

Three gates are accessible from user mode. Vectors 3 and 4 are used for debugging and for checking numeric overflows, respectively. Then comes the system gate, whose vector is the value of the SYSCALL_VECTOR constant, 0x80 on the x86 architecture. This used to be the main mechanism for transferring control to the kernel to make a system call. Back in the day I even wanted a vanity license plate reading "int 0x80". Starting with the Pentium Pro, the sysenter instruction was added as a faster way to make system calls. It relies on special-purpose registers that hold the code segment, entry point and other information for the kernel's system call handler. When sysenter executes, no privilege check is performed: the processor switches straight to CPL 0 and loads new values into the code and stack registers (CS, EIP, SS and ESP). The registers used by sysenter can be written only from ring 0, which is done by the enable_sep_cpu() function.
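
The register loading performed by sysenter can be modelled as below. This is a sketch based on the instruction's documented behaviour in 32-bit protected mode: CS comes from the IA32_SYSENTER_CS register with its low two bits cleared, SS is that selector plus 8, and EIP and ESP come from two further registers. The function name and the sample values passed to it are assumptions for illustration only.

```python
from typing import Dict


def sysenter(sysenter_cs: int, sysenter_eip: int,
             sysenter_esp: int) -> Dict[str, int]:
    """Model the register state right after a sysenter instruction."""
    if sysenter_cs & 0xFFFC == 0:
        # A null selector means the kernel never set the mechanism up.
        raise RuntimeError("#GP(0): IA32_SYSENTER_CS is not initialized")
    cs = sysenter_cs & 0xFFFC  # low two bits cleared, so CPL becomes 0
    return {
        "cs": cs,
        "eip": sysenter_eip,
        "ss": cs + 8,          # stack segment is the next descriptor
        "esp": sysenter_esp,
        "cpl": cs & 0b11,
    }


# Hypothetical values a kernel might have written from ring 0.
state = sysenter(sysenter_cs=0x60, sysenter_eip=0xC0100000,
                 sysenter_esp=0xC0200000)
print({name: hex(value) for name, value in state.items()})
```

Note that nothing in this function looks at the caller's privilege level, which mirrors the absence of a privilege check in the instruction itself.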

Finally, when it is time to return to ring 3, the kernel uses an IRET or SYSEXIT instruction to return from an interrupt or a system call, respectively. We leave ring 0, and user mode code resumes with a CPL of 3. Vim tells me that the word count is already close to 1900, so let's leave I/O ports for another time. Thank you all for your attention!
