Zircon highlight: vDSO (virtual Dynamic Shared Object)

Original Author: Google Inc., depletionmode

Translation

Zircon? What is it?

In August 2016, without any official announcement from Google, the source code of a new operating system, Fuchsia, was discovered. This OS is based on a microkernel called Zircon, which in turn is based on LK (Little Kernel).

Fuchsia is not Linux

Translator's notes

I am not ~~a real welder~~ a Zircon developer and/or expert. The text below the cut is a compilation of partial translations: the official Zircon vDSO documentation and the article Admiring the Zircon Part 1: Understanding Minimal Process Creation by @depletionmode, with a bit of my own commentary added (tucked away under spoilers). Constructive suggestions for improving the article are, as always, welcome.

What will be discussed in the article?

In Zircon, the vDSO is the only means of accessing system calls (syscalls).

But can't we execute the processor's SYSENTER/SYSCALL instructions directly from our own code? No: these processor instructions are not part of the system ABI, and user code is prohibited from executing them directly.

Those who want to learn more about this architectural decision are invited to read on below the cut.

Zircon vDSO (virtual Dynamic Shared Object)

The abbreviation vDSO stands for virtual Dynamic Shared Object:

Dynamic Shared Object is a term for shared libraries in the ELF format (.so files).
The object is virtual because it is not loaded from a separate file that exists on the file system. The vDSO image is provided directly by the kernel.

Kernel support

The kernel enforces the vDSO as the only supported ABI for user-mode applications in two ways:

Restricting how the virtual memory object (VMO) of the vDSO can be mapped.

When zx_vmar_map is called with the vDSO VMO and the arguments request ZX_VM_PERM_EXECUTE, the kernel requires that the offset and size exactly match the vDSO executable segment. Among other things, this guarantees that the vDSO code is mapped into the process memory only once. After the first successful mapping of the vDSO into a process, the mapping can no longer be removed. An attempt to map the vDSO into the process a second time, to unmap the vDSO, or to map it with the wrong offset and/or size fails with ZX_ERR_ACCESS_DENIED.
The offset and size of the vDSO code are extracted from the ELF file at compile time and then used in the kernel code to perform the checks described above. After the first successful mapping of the vDSO, the kernel remembers the address for the target process to speed up the checks.
Checking the code addresses from which system calls are made.

When user-mode code enters the kernel, the low-level system call number is passed in a register. Low-level system calls are the internal (private) interface between the vDSO and the Zircon kernel. Most of them correspond directly to system calls of the public ABI, but some do not.
For each low-level system call, there is a fixed set of locations in the vDSO code that make this call. The source code of the vDSO defines internal symbols identifying each such location. At compile time, these locations are extracted from the vDSO symbol table and used to generate kernel code that defines a code-address validity predicate for each low-level system call. These predicates make it possible to quickly check whether the calling code is valid, given its offset from the beginning of the vDSO code segment.
If the predicate determines that the calling code is not allowed to make the system call, a synthetic exception is generated, just as if the calling code had tried to execute an undefined or privileged instruction.
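
The two checks described above can be modeled in a few lines. This is only a toy sketch, not the Zircon implementation: the segment offset and size, the table of valid offsets, the class and the exception names are all hypothetical, and subtracting the recorded base address from the caller's address is an assumption about how the offset is obtained.

```python
# Toy model of the two kernel-side checks. All numbers and names are hypothetical.
PAGE = 0x1000
VDSO_CODE_OFFSET = 2 * PAGE  # offset of the executable segment inside the vDSO VMO
VDSO_CODE_SIZE = 4 * PAGE    # size of the executable segment

# For each low-level system call number: the code offsets allowed to make that call.
VALID_PC_OFFSETS = {0: {0x010}, 1: {0x040, 0x07C}}


class AccessDenied(Exception):
    """Stands in for ZX_ERR_ACCESS_DENIED."""


class SyntheticException(Exception):
    """Stands in for the exception delivered to an invalid caller."""


class Process:
    def __init__(self):
        self.vdso_code_base = None  # remembered after the first valid mapping

    def map_vdso_code(self, address, offset, size):
        # Only one executable mapping is allowed, with the exact offset and size.
        if self.vdso_code_base is not None:
            raise AccessDenied("the vDSO code is already mapped")
        if (offset, size) != (VDSO_CODE_OFFSET, VDSO_CODE_SIZE):
            raise AccessDenied("offset/size do not match the code segment")
        self.vdso_code_base = address

    def unmap(self, address, size):
        # A range that overlaps the vDSO code mapping cannot be unmapped.
        if self.vdso_code_base is not None:
            start = self.vdso_code_base
            end = start + VDSO_CODE_SIZE
            if address < end and address + size > start:
                raise AccessDenied("the vDSO code cannot be unmapped")

    def enter_kernel(self, syscall_number, pc):
        # Validity predicate: is `pc` a known call site for this system call?
        if self.vdso_code_base is None:
            raise SyntheticException("no vDSO is mapped in this process")
        offset = pc - self.vdso_code_base
        if offset not in VALID_PC_OFFSETS.get(syscall_number, set()):
            raise SyntheticException("system call made from an invalid address")
        return True
```

In this model, a call such as `Process().enter_kernel(1, 0x1234)` raises `SyntheticException`, because no vDSO has been mapped yet and the address cannot belong to a known call site.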

vDSO when creating a new process

To start execution of the first thread of a newly created process, the zx_process_start system call is used. The last parameter of this system call (see arg2 in the documentation) is an argument for the first thread of the process being created. By convention, the program loader maps the vDSO into the address space of the new process (at a random location chosen by the system) and passes the base address of the mapping in arg2 to the first thread of the new process. This address is the address of the ELF file header, which can be used to find the named functions needed to make system calls.
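
As a rough illustration of the first step a new thread could take with that base address, the sketch below checks that an ELF header really sits there and reads the fields that locate the program headers. It assumes a 64-bit little-endian image and models process memory as a plain bytes object; `read_vdso_header` and `make_demo_memory` are hypothetical helpers, not part of any Zircon API.

```python
import struct

# ELF64 file header: e_ident[16], e_type, e_machine, e_version, e_entry, e_phoff,
# e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx.
ELF64_EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
ET_DYN = 3  # ELF type of a shared object


def read_vdso_header(memory, base):
    """Validate the ELF header found at `base` and return program header info."""
    fields = ELF64_EHDR.unpack_from(memory, base)
    e_ident, e_type = fields[0], fields[1]
    if e_ident[:4] != b"\x7fELF":
        raise ValueError("no ELF header at the given base address")
    if e_type != ET_DYN:
        raise ValueError("the image is not a shared object")
    return {"phoff": fields[5], "phentsize": fields[9], "phnum": fields[10]}


def make_demo_memory(base):
    """Build fake 'process memory' with a synthetic ELF header placed at `base`."""
    e_ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)  # 64-bit, little-endian, v1
    header = ELF64_EHDR.pack(e_ident, ET_DYN, 62, 1, 0, 64, 0, 0, 64, 56, 2, 0, 0, 0)
    return bytes(base) + header
```

For example, `read_vdso_header(make_demo_memory(0x3000), 0x3000)` yields the offset, entry size and count of the program headers in the synthetic image. Finding the named functions would additionally require walking the dynamic symbol table, which is left out here.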

vDSO memory layout

The vDSO is a regular ELF shared library that can be inspected like any other. However, only a small subset of the full ELF format is intentionally used for the vDSO. This has several advantages:

Mapping such an ELF image into a process is simple and involves none of the complicated corner cases that full support for ELF programs requires.
Using the vDSO does not require full-featured ELF dynamic linking. In particular, the vDSO has no dynamic relocations. Mapping the PT_LOAD segments of the ELF file is the only action required.
The vDSO code is stateless and reentrant. It works exclusively with processor registers and the stack. This makes it suitable for use in a wide variety of contexts with minimal restrictions, as befits the mandatory ABI of the operating system. It also simplifies analyzing and verifying the code for reliability and security.

All vDSO memory is represented by two consecutive segments, each of which contains aligned whole pages:

The first segment is read-only and includes the ELF headers as well as constant data.
The second segment is executable and contains the vDSO code.

The entire vDSO image consists of only the pages of these two segments. Only two values extracted from the ELF headers are required to map the vDSO memory: the number of pages in each segment.
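
The sketch below shows how those two page counts could be derived from the PT_LOAD program headers. It assumes a 64-bit little-endian ELF image and a 4 KiB page size; `load_segment_pages` is a hypothetical helper, and the sizes in the usage note are made up.

```python
import struct

# ELF64 program header: p_type, p_flags, p_offset, p_vaddr, p_paddr,
# p_filesz, p_memsz, p_align.
ELF64_PHDR = struct.Struct("<IIQQQQQQ")
PT_LOAD = 1
PF_X, PF_W, PF_R = 1, 2, 4
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages


def load_segment_pages(phdr_table, phnum):
    """Return a (flags, page_count) pair for every PT_LOAD segment, in order."""
    segments = []
    for index in range(phnum):
        (p_type, p_flags, _p_offset, p_vaddr,
         _p_paddr, _p_filesz, p_memsz, _p_align) = ELF64_PHDR.unpack_from(
            phdr_table, index * ELF64_PHDR.size)
        if p_type != PT_LOAD:
            continue  # other program header types do not describe mapped pages
        if p_vaddr % PAGE_SIZE or p_memsz % PAGE_SIZE:
            raise ValueError("segment does not consist of whole aligned pages")
        segments.append((p_flags, p_memsz // PAGE_SIZE))
    return segments
```

For a table with a two-page read-only segment followed by a four-page executable segment, the function returns `[(PF_R, 2), (PF_R | PF_X, 4)]`, which is everything a loader in this model needs in order to map the image.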

OS boot-time constants

Some system calls simply return values that are constant (the values must be queried at run time and cannot be compiled into user-mode code). These values are either fixed in the kernel at compile time or determined by the kernel at boot time (from boot parameters and hardware parameters). Examples are zx_system_get_version(), zx_system_get_num_cpus() and zx_ticks_per_second(). The return value of the last function, for example, is affected by a kernel command line parameter.

Wait, is the number of CPUs a constant?

Interestingly, the description of the zx_system_get_num_cpus() function also explicitly states that the OS does not support changing the number of processors on the fly:

This number cannot change during a run of the system, only at boot time.

This indicates, at least indirectly, that the OS is not positioned as a server OS.

Since these values are constant, it makes no sense to pay for real system calls into the OS kernel. Instead, they are implemented as simple C++ functions that return data read from the vDSO constant segment. Values fixed at compile time (such as the system version string) are simply compiled into the vDSO.

For values determined at boot time, the kernel must modify the contents of the vDSO. This is done by early-stage code that builds the vDSO VMO before the kernel starts the first user process (and passes the VMO handle to it). At compile time, the offset of the constants within the vDSO image (vdso_constants) is extracted from the ELF file and then embedded in the kernel. At boot time, the kernel temporarily maps the pages covering vdso_constants into its own address space to initialize the structure with the correct values (for the current boot of the system).
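
The mechanism can be sketched as follows. This is a simplified model, not the real code: the layout of the constants structure, its offset and the function names are hypothetical, and the vDSO image is represented by a byte buffer.

```python
import struct

# Hypothetical layout of the constants block: num_cpus (u32), reserved (u32),
# ticks_per_second (u64). The real vdso_constants structure is different.
VDSO_CONSTANTS = struct.Struct("<IIQ")
VDSO_CONSTANTS_OFFSET = 0x400  # hypothetical offset, extracted at build time


def kernel_init_constants(image, num_cpus, ticks_per_second):
    """Boot-time step: the kernel writes the values into the image it still owns."""
    VDSO_CONSTANTS.pack_into(image, VDSO_CONSTANTS_OFFSET,
                             num_cpus, 0, ticks_per_second)


def system_get_num_cpus(image):
    """Model of a vDSO function: a plain memory read instead of a kernel entry."""
    return VDSO_CONSTANTS.unpack_from(image, VDSO_CONSTANTS_OFFSET)[0]


def ticks_per_second(image):
    """Model of a vDSO function that reads the boot-time tick rate."""
    return VDSO_CONSTANTS.unpack_from(image, VDSO_CONSTANTS_OFFSET)[2]


# Usage: fill in the constants once, then hand out a read-only copy of the image.
writable_image = bytearray(0x1000)
kernel_init_constants(writable_image, num_cpus=8, ticks_per_second=1_000_000_000)
read_only_image = bytes(writable_image)
```

After this setup, `system_get_num_cpus(read_only_image)` returns 8 without any transition into the kernel, which is the point of placing such values in the read-only segment.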

Why all this headache?

One of the most important reasons is security. If an attacker manages to execute arbitrary code (shellcode), they will have to use the vDSO functions to make system calls. The first obstacle is the above-mentioned randomization of the vDSO load address for each created process. And since the VMO (virtual memory object) of the vDSO is controlled by the OS kernel, the kernel can choose to map a completely different vDSO into a specific process, thereby prohibiting system calls that are dangerous (and not needed by that process). For example, drivers can be prevented from spawning child processes or from mapping MMIO regions. This is a great tool for reducing the attack surface.

Note: support for multiple vDSOs is currently under active development. A proof-of-concept implementation and simple tests already exist, but more work is needed to make the implementation more robust and to determine which variants will be available. The current concept provides vDSO image variants that export only a subset of the full vDSO system call interface.
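
The idea of vDSO variants can be pictured as different export tables. The sketch below is purely conceptual: the variant names and the choice of exported functions are hypothetical and do not reflect the actual variants under development.

```python
# Conceptual model: each vDSO variant exports only a subset of the full interface.
FULL_VDSO = frozenset({
    "zx_process_create",
    "zx_process_start",
    "zx_vmar_map",
    "zx_system_get_num_cpus",
})

# Hypothetical variants; a "driver" process here cannot spawn child processes.
VDSO_VARIANTS = {
    "full": FULL_VDSO,
    "driver": FULL_VDSO - {"zx_process_create", "zx_process_start"},
}


def resolve(variant, symbol):
    """Look up a function in the vDSO variant mapped into a process."""
    if symbol not in VDSO_VARIANTS[variant]:
        raise LookupError(f"{symbol} is not exported by the '{variant}' vDSO")
    return symbol
```

Because the vDSO is the only way to reach the kernel, a symbol missing from the mapped variant means the process has no valid way to make that system call at all.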

What about other operating systems?

It should be noted that similar techniques are already used successfully in other operating systems. For example, Windows has ProcessSystemCallDisablePolicy:

Win32k System Call Disable Restriction to restrict ability to use NTUser and GDI
