Assembler for simulation tasks. Part 2: the core of the simulation

    HCF, n. Mnemonic for 'Halt and Catch Fire', any of several undocumented and semi-mythical machine instructions with destructive side-effects <...>
    Jargon File
    In a previous post, I began discussing where assembly language is used in the development of software models of computer systems, i.e. simulators. I described how a software decoder works, and also discussed an approach to testing a simulator with unit tests.
    This article explains why a programmer needs to understand the structure of machine code when creating an equally important component of a simulator: the core, which is responsible for modeling individual instructions.
    So far, the discussion has mostly focused on guest assembler. It's time to talk about host assembler.

    With assembler in its heart - the core of the simulator


    A serious simulation product should have a multi-chambered "heart": several ways to execute guest code, with the most effective one in use at any given moment.
    Three technologies are generally distinguished: interpretation, binary translation, and direct execution. Each of them has a place for machine code and assembler.


    Interpreter and intrinsic


    The simplest interpreter-based simulator is written in a portable, high-level language. This means that every procedure that describes an instruction simply implements its logic in C.
    A large proportion of machine instructions have fairly simple semantics that are easily expressed in C: add two numbers, compare with a third, shift left or right, etc.
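    For illustration, here is what such a procedure might look like: a minimal sketch in C, where the guest state structure, register count, and instruction width are all invented for the example.

```c
#include <stdint.h>

/* A hypothetical guest CPU state -- the names and sizes here are
   illustrative, not taken from any real simulator. */
typedef struct {
    uint32_t regs[16];
    uint32_t pc;
    int carry;
} cpu_t;

/* Interpreter procedure for a guest "add rd, rs1, rs2":
   simple semantics like these are easily expressed in plain C. */
static void exec_add(cpu_t *cpu, int rd, int rs1, int rs2) {
    uint64_t wide = (uint64_t)cpu->regs[rs1] + cpu->regs[rs2];
    cpu->regs[rd] = (uint32_t)wide;
    cpu->carry = (int)((wide >> 32) & 1); /* model the carry flag */
    cpu->pc += 4;                         /* fixed 4-byte instructions assumed */
}
```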
    Privileged instructions are usually more complex due to the need to perform a variety of access checks and throw exceptions. However, they are relatively few in number.
    Then the difficulties begin. Take the instructions that work with IEEE 754 numbers, i.e. floating point. You will have to correctly handle several formats of these numbers, from float16 through float32 and float64, sometimes the semi-standard float80 and even float82; it seems that no architecture supports float128 directly yet, although the standard describes it. You have to support NaNs ("not-a-number" values), denormalized numbers, rounding modes, and exception signaling, and also implement all kinds of arithmetic, such as sines, square roots, and reciprocals.
    The open Berkeley SoftFloat library, which implements a large part of the standard, is of some help here.
    Another example of a class of instructions that are hard to simulate are vector (SIMD) instructions. They perform one operation on a whole vector of same-type arguments at once. First, they too often work with floating point, although many take integer operands. Second, there are very many such instructions because of a combinatorial effect: for each operation there are several vector lengths, element formats, and mask formats, plus the optional use of masking, broadcast, gather/scatter, etc.
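    As a sketch of one point in this combinatorial space, here is a scalar loop emulating a hypothetical masked add over eight int32 elements; the function name and the merge-masking behavior (disabled lanes keep their old value) are assumptions made for the example.

```c
#include <stdint.h>

/* Scalar emulation of a masked vector add: element format int32,
   vector length 8 -- just one of the many length/format/mask
   combinations mentioned in the text. */
static void emu_masked_add_i32x8(int32_t *dst, const int32_t *a,
                                 const int32_t *b, uint8_t mask) {
    for (int i = 0; i < 8; i++) {
        if (mask & (1u << i))
            dst[i] = a[i] + b[i]; /* lane enabled by the mask */
        /* disabled lanes keep their previous value (merge masking) */
    }
}
```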
    Having successfully implemented emulation procedures for all the required guest instructions, the creator of the model is likely to run into extremely low interpreter speed. This is not surprising: what a real machine does in a single instruction is represented in the model as a procedure with a loop inside and non-trivial logic handling all the corner cases! If only something would implement the semantics of the instructions for us, and do it quickly!..
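    That per-instruction overhead is easy to see in a minimal fetch-decode-execute loop. A toy sketch, with all opcodes and names invented for the example: every guest instruction costs a fetch, a table lookup, and an indirect call, i.e. dozens of host instructions.

```c
#include <stdint.h>

/* A toy guest machine -- opcodes and state are made up for the sketch. */
typedef struct {
    uint32_t acc;
    uint32_t pc;
    int halted;
} toy_cpu_t;

typedef void (*handler_t)(toy_cpu_t *);

enum { OP_INC, OP_DBL, OP_HALT, OP_COUNT };

static void op_inc(toy_cpu_t *c)  { c->acc += 1; }
static void op_dbl(toy_cpu_t *c)  { c->acc *= 2; }
static void op_halt(toy_cpu_t *c) { c->halted = 1; }

static const handler_t handlers[OP_COUNT] = { op_inc, op_dbl, op_halt };

/* The interpreter's hot loop: fetch, decode, indirect call --
   this overhead is paid for every single guest instruction. */
static void interp_run(toy_cpu_t *c, const uint8_t *code) {
    while (!c->halted) {
        uint8_t op = code[c->pc++]; /* fetch */
        handlers[op](c);            /* decode + execute */
    }
}
```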
    Wait a minute: the host processor probably has exactly the same, or at least very similar, instructions! Maybe not for all of the guest instructions, but at least for some. Moreover, popular compilers provide an interface for including machine instructions in code: intrinsics (from "intrinsic", i.e. internal), declarations of functions that wrap machine instructions. Here is an example intrinsic description for the LZCNT instruction from the Intel SDM:
    Intel C/C++ Compiler Intrinsic Equivalent
    LZCNT:
    unsigned __int32 _lzcnt_u32 (unsigned __int32 src);
    LZCNT:
    unsigned __int64 _lzcnt_u64 (unsigned __int64 src);


    The same intrinsics work in GCC. Below is a little experiment:
    $ cat lzcnt1.c
    #include <immintrin.h>
    #include <stdint.h>
    int main(int argc, char **argv) {
            int64_t src = argc;
            int64_t dst = _lzcnt_u64(src);
            return (int)dst;
    }
    $ gcc -O3 -mlzcnt lzcnt1.c # Explicitly enabling the extension, since my CPU does not support LZCNT
    $ objdump -d a.out
    <...skipped...>
    Disassembly of section .text:
    00000000004003c0 <main>:
      4003c0:  48 63 c7              movslq %edi,%rax
      4003c3:  f3 48 0f bd c0        lzcnt  %rax,%rax
      4003c8:  c3                    retq
      4003c9:  90                    nop
      4003ca:  90                    nop
      4003cb:  90                    nop
    <...skipped...>


    With the -O3 optimization flag, the compiler did everything flawlessly: the _lzcnt_u64() "function" left behind no prologue or epilogue, only the machine instruction we needed.
    As with machine instructions, there are usually many intrinsics (but still fewer than instructions). Each compiler provides its own set, partly similar to and partly different from the rest.
    • The intrinsics present in Microsoft compilers are described separately for x86 and x64 in MSDN.
    • The Intel C/C++ compiler documentation has for several years been available in a convenient interactive format on a web page. It is quite convenient to filter intrinsics there by extension class (SSE2, SSE3, AVX, etc.) and by functionality (bit operations, logical, cryptographic, etc.), and also to get information about their semantics and speed (in cycles).
    • The intrinsics of the GCC compiler for IA-32 are basically the same as those described for ICC.
    • For Clang, I did not find any clear documentation of the available intrinsics for any architecture. If readers have information on this, please share it in the comments.


    Compared to handwritten inline assembler sections, intrinsics have the following advantages.
    1. A function call is much more familiar: it is easier to understand and harder to get wrong. Intrinsics hand the work of allocating input and output registers over to the compiler, and also let it perform syntax checking, type matching, and other useful things, reporting problems when necessary. With inline assembler, the diagnostics will be much more cryptic. Anyone who regularly has to write clobber specifications for GNU as (and make mistakes in them) will agree with me.
    2. To the compiler, intrinsics are not the "black boxes" that inline assembler blocks are, inside which registers and memory are updated without its knowledge. Its register allocation algorithms can therefore take them into account when processing the procedure's code, and faster code is easier to obtain.
    3. Intrinsics are, if weakly, portable between compilers (but not between host architectures). In the worst case, you can write your own fallback implementation if the host architecture does not support the instruction directly. An example from practice: the SSE2 instruction CVTSI2SD xmm, r/m64 has no valid encoding in 32-bit processor mode. Accordingly, the intrinsic does not exist there either, whereas in 64-bit mode, for which the tool was originally developed, it did, and the code used it. When compiling the code on a 32-bit host, an error was raised. Since the procedure tied to this intrinsic was not "hot" (the application's speed depended on it only slightly), an own implementation of _mm_cvtsi64_sd() was written in C and substituted in the 32-bit build.
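    A fallback like the one described might look as follows. This is a portable sketch, not the actual code from that project: a plain two-field struct stands in for the SSE __m128d type, and the helper name is invented. The semantics mirror what CVTSI2SD does: convert a 64-bit integer to double in the low lane, preserving the high lane.

```c
#include <stdint.h>

/* Hypothetical stand-in for __m128d, used so the sketch stays portable. */
typedef struct { double lo, hi; } xmm_d;

/* Plain-C replacement for _mm_cvtsi64_sd(): convert a 64-bit integer
   to double into the low lane, keeping the high lane intact. */
static xmm_d my_cvtsi64_sd(xmm_d a, int64_t b) {
    a.lo = (double)b;
    return a;
}
```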

    For these or other reasons, Microsoft dropped inline assembler support for the x64 architecture in MS Visual Studio 2010 and later. To embed machine code in C/C++ files there, only intrinsics remain available.
    However, I would be lying if I said that intrinsics are a panacea. You still need to keep an eye on the code the compiler generates, especially when you want to squeeze maximum performance out of it.


    Binary Translator and Code Generation


    The binary translator (hereinafter BT) usually works faster than the interpreter, because it converts whole blocks of guest machine code into equivalent blocks of host machine code which, for hot code, are then run repeatedly. The interpreter (unless caching is implemented in it) has to process every guest instruction it encounters from scratch, even if it recently worked on the same one.
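    The caching that gives a BT its speed can be sketched as a table mapping a guest PC to already-translated host code. Here is a minimal direct-mapped version; the table size and all names are invented for the example.

```c
#include <stdint.h>
#include <stddef.h>

/* A minimal direct-mapped translation cache: guest PC -> pointer to
   the translated host code block. Sizes and names are illustrative. */
#define TC_SIZE 1024u

typedef struct {
    uint64_t guest_pc;  /* tag */
    void    *host_code; /* NULL = empty slot */
} tc_entry_t;

static tc_entry_t tcache[TC_SIZE];

/* Return the translated block for pc, or NULL on a cache miss. */
static void *tc_lookup(uint64_t pc) {
    tc_entry_t *e = &tcache[pc % TC_SIZE];
    return (e->host_code && e->guest_pc == pc) ? e->host_code : NULL;
}

/* Register a freshly translated block, evicting any previous occupant. */
static void tc_insert(uint64_t pc, void *code) {
    tc_entry_t *e = &tcache[pc % TC_SIZE];
    e->guest_pc = pc;
    e->host_code = code;
}
```

    On a hit, the simulator jumps straight into the cached host code; only on a miss does it pay the cost of translation.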
    Unlike the interpreter, which can be written from start to finish without delving into the particulars of the host architecture, a BT requires knowledge of both the assembler and the machine instruction encodings of the host. When porting the simulator to a new host system, a significant part of it (the part responsible for code generation) will have to be rewritten. This is the price of speed.
    In this article I will describe one simple approach: building a so-called template translator. If there is interest, some other time I will try to describe more advanced methods of binary translation.
    Having received information about a guest instruction from the decoder, the BT generates a piece of host machine code for it, called a capsule. For several instructions executed in sequence, a translation block is created, consisting of their capsules laid out one after another. As a result, when control reaches the first translated instruction in the guest system, simulating it and the subsequent instructions only requires executing the code of the translation block.
    How do we generate the code for a guest instruction knowing its opcode and operand values? Based on the opcode, the simulator selects a template: a blank of host code implementing the required semantics. It differs from the procedures normally produced by a compiler in having no prologue or epilogue, since such templates are "glued" directly into a single translation block. However, this is still not enough to mark the translation block as ready.
    One task remains: to pass the operand values into the template as arguments, specializing it and turning it into a capsule. Most often the operands need to be filled in at translation time, when they are already known. That is, they have to be "sewn" directly into the host code of the capsule. This will not work for implicit operands (for example, values lying on the stack); those, of course, will have to be processed at simulation time, at a cost in speed.
    If the number of combinations of explicit operand values is small, they can be "sewn" into a group of templates for this instruction, one per combination. Then, for each guest opcode, one of the N templates is chosen according to the operand values in each particular case.
    Unfortunately, it is not that simple. In practice it is often impossible to generate templates for all possible operand values because of the combinatorial explosion of their number. A three-operand instruction on an architecture with 32 registers would require 32 × 32 × 32 = 2¹⁵ blocks of code. And if the guest architecture has 32-bit literal operands (and every important architecture does), you would have to store 2³² capsule variants. Something else is needed.
    In fact, there is no need to store heaps of nearly identical templates: they all contain the same host instructions. As the guest operands vary, only some of the host operands change (and sometimes the instruction length, see my previous post), describing where the simulated state is stored or which literal is passed. To form a capsule from a template, it is enough to "patch" the bits or bytes at the corresponding offsets:

    Question for experts: which architectures in the example above are used as the guest and the host?

    Thus, for each guest instruction, a simulator with a BT needs only one host code template and one procedure that patches the placeholder operands into the correct ones. Naturally, to patch a template correctly, you need to know the offsets of all operands relative to its beginning, that is, you need to understand the instruction encoding of the host system. In practice, you must either implement your own encoder or somehow learn to extract the necessary information from the output of a third-party tool.
    The overall template translation process is shown in the following figure.
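    As a minimal sketch of such a patching procedure (the function name is invented): the x86 template `mov eax, imm32` consists of the bytes B8 xx xx xx xx, so its 32-bit literal sits at offset 1, which is exactly the kind of offset the translator must know. The direct memcpy of the integer assumes a little-endian host, matching the byte order of x86 immediates.

```c
#include <stdint.h>
#include <string.h>

/* Template for a capsule loading a 32-bit literal into a host register:
   x86 "mov eax, imm32" is B8 xx xx xx xx; the immediate is at offset 1. */
static const uint8_t mov_eax_imm32_tpl[5] = { 0xB8, 0, 0, 0, 0 };

/* Specialize the template into a capsule by patching the operand in. */
static void make_capsule(uint8_t *out, uint32_t imm) {
    memcpy(out, mov_eax_imm32_tpl, sizeof mov_eax_imm32_tpl);
    /* assumes a little-endian host, matching x86 immediate byte order */
    memcpy(out + 1, &imm, sizeof imm);
}
```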


    Direct execution and virtualization


    The third simulation mechanism I will consider is direct execution. Its operating principle follows directly from the name: simulate guest code by running it on the host without modification. Obviously, this method potentially gives the highest simulation speed; however, it is also the most capricious. The following requirements must be met.
    1. The guest and host architectures must match. In other words, you cannot directly execute ARM code on MIPS or vice versa; in any case, that would no longer be direct execution.
    2. The host architecture must satisfy the conditions of efficient virtualization.


    Assume that these conditions are satisfied, for example by Intel IA-32 / Intel 64 with the Intel® VT-x extensions. The next task when adding direct execution support to a simulator is writing an operating system kernel module (driver). You cannot do without it: the simulator will need to execute privileged instructions and manipulate system resources, such as page tables, physical memory, interrupts, and more, which cannot be reached from user space. On the other hand, "burying" everything in the kernel is harmful: programming and debugging drivers consumes far more time and nerves than writing application programs. Therefore, only the bare minimum of simulator functionality, accessed through system call interfaces, is usually moved into the kernel.
    Since a kernel module is written for a specific OS, you need to understand that porting the application to another OS means rewriting it, possibly quite substantially. This is another reason to minimize its size.
    In principle, using assembler in the kernel is justified under roughly the same conditions as in userland: when you cannot do without it. Virtual machines work with system structures such as the VMCS (virtual machine control structure) and with control, debug, and model-specific registers, which are accessible only through specialized instructions. It would be most reasonable to use intrinsics for them, but...
    Not all machine instructions have ready-made intrinsics. Compilers intended mainly for building user code tend to forget the needs of driver writers. To reach such instructions, you have to use inline assembler. The source code of the KVM virtual machine, for example, contains this definition for the function reading VMCS fields:
    #define ASM_VMX_VMREAD_RDX_RAX    ".byte 0x0f, 0x78, 0xd0"
    static __always_inline unsigned long vmcs_readl(unsigned long field)
    {
            unsigned long value;
            asm volatile (__ex_clear(ASM_VMX_VMREAD_RDX_RAX, "%0")
                          : "=a"(value) : "d"(field) : "cc");
            return value;
    }
    

    Honestly, I expected to see VMREAD invoked here by its mnemonic, vmread, but for some reason its raw representation as bytes is used instead. Perhaps the authors wanted to support building with assemblers unaware of this instruction.
    By the way, the LZCNT intrinsic example above can be rewritten in inline assembler as follows. In this simple case, the generated machine code is the same.
    #include <stdint.h>
    int main(int argc, char **argv) {
            int64_t src = argc;
            int64_t dst;
            __asm__ volatile(
                    "lzcnt %1, %0\n"
                    :"=r"(dst)
                    :"r"(src)
                    :"cc"
            );
            return (int)dst;
    }
    

    Although I originally planned to describe the GNU inline assembler format in detail in this article, I decided against it, since there is plenty of information on this topic on the Internet. If the need arises, I may do it in a future article.
    Sometimes it is more profitable to put all the assembler into a separate file than to try to squeeze it in among the C code. I did not find examples of this in KVM, but there are some in Xen. I note that assembler proper makes up no more than a quarter of such a file by volume; the rest is preprocessor directives and comments documenting what the code does and what its interface is.

    Summary


    Assembly language plays a key role in the development of simulation solutions. It is used in various components of models, as well as in the process of testing them.
    In a complex project that also uses high-level languages, assembler code itself can appear in three forms.
    1. Intrinsics - wrappers for individual machine instructions with the interface of ordinary C/C++ functions.
    2. Inline assembler blocks - fragments of assembler code, specific to the chosen compiler/assembler, coordinated with the high-level code surrounding them.
    3. Files written entirely in assembler - used in those (rare) cases when it is more convenient to express a sequence of actions wholly in assembler. They interact with the outside world either through a function interface (implementing the ABI of the target platform themselves) or not at all (in the case of standalone unit tests).


