There is nothing easier than calling a function; I have done it many times myself

    The previous article on exceptions in C++ left a number of dark corners. The
    main open question: how exactly is control transferred when an exception is thrown?
    With SJLJ everything is clear, but this technique is said to have been almost entirely
    superseded by a table-driven mechanism that costs nothing as long as no exceptions are thrown.
    What this mechanism is and how it works is what we will look at below.

    This article grew out of preparing a talk for C++ Siberia, which turned up some details that may well be useful to someone other than the author, who is known for his tediousness.


    It all started with a simple wish to find out the size of the buffer used by the setjmp/longjmp functions:
    • sizeof(jmp_buf) == 64 bytes (MSVC 2013, win32)
    • sizeof(jmp_buf) == 256 bytes (MSVC 2013, x64)
    • sizeof(jmp_buf) == 200 bytes (GCC 4.8.4, Ubuntu 14.04 x64)
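    These figures are easy to reproduce. A minimal sketch, assuming a hosted C++11 compiler; the helper names are ours, and the value depends on the compiler, the C library and the target:

```cpp
#include <cassert>
#include <csetjmp>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: the size of the buffer that setjmp fills in.
std::size_t jmp_buf_size() {
    return sizeof(std::jmp_buf);
}

// Prints the size in the same form as the list above.
void print_jmp_buf_size() {
    std::printf("sizeof(jmp_buf) == %zu bytes\n", jmp_buf_size());
}
```

    Calling print_jmp_buf_size() from main prints a single line for the current toolchain.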
    The processor state is supposedly saved in this structure.
    How does that square with the number of registers (AMD x86-64)?

    The MMX and x87 FPU registers are aliased onto each other.
    In 32-bit mode: 32 + 128 + 80 + 12 (eip, flags) = 252 bytes
    In 64-bit mode: 128 + 256 + 80 + 24 (...) = 488 bytes
    Something does not add up.
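    The arithmetic can be spelled out. A small sketch that simply adds up the register files by width; the constant and function names are ours, and the 12 and 24 bytes for the instruction pointer and flags are taken from the estimate above:

```cpp
#include <cassert>
#include <cstddef>

// 8 x87/MMX registers, 80 bits (10 bytes) each.
constexpr std::size_t kX87Bytes = 8 * 10;

// 32-bit mode: 8 GPRs of 4 bytes, 8 XMM registers of 16 bytes.
constexpr std::size_t state_bytes_32() {
    return 8 * 4 + 8 * 16 + kX87Bytes + 12;
}

// 64-bit mode: 16 GPRs of 8 bytes, 16 XMM registers of 16 bytes.
constexpr std::size_t state_bytes_64() {
    return 16 * 8 + 16 * 16 + kX87Bytes + 24;
}

static_assert(state_bytes_32() == 252, "32 + 128 + 80 + 12");
static_assert(state_bytes_64() == 488, "128 + 256 + 80 + 24");
```

    Either total is well above the 64 and 256 bytes that MSVC reports for jmp_buf, so the buffer cannot be holding the full register state.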
    We read the documentation:
    Calling setjmp saves the current stack position, the non-volatile registers and the flags.
    What are non-volatile registers? We read the documentation again:
    When a function is called, the caller is responsible for preserving the contents of some of the registers; these are the so-called volatile registers. The contents of the remaining registers are the callee's responsibility; these are the non-volatile registers.
    So what is this split, and why is it needed?
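    For reference, the setjmp/longjmp pattern in question looks roughly like this. A minimal sketch, not taken from the previous article; all function and variable names are hypothetical, and the skipped frames contain no objects with destructors, since longjmp does not run them:

```cpp
#include <cassert>
#include <csetjmp>

static std::jmp_buf g_env;     // saved context: stack position, non-volatile registers
static int g_error_code = 0;   // carries the error across the jump

// Deep in the call chain: abandon the current frames and jump back.
void fail_with(int code) {
    g_error_code = code;
    std::longjmp(g_env, 1);
}

void middle_layer(int code) {
    fail_with(code);           // never returns normally
}

// Returns 0 if nothing went wrong, otherwise the code passed to fail_with.
int run_guarded(int code) {
    if (setjmp(g_env) == 0) {  // first pass: context saved, setjmp returns 0
        if (code != 0) middle_layer(code);
        return 0;
    }
    return g_error_code;       // second pass: we arrived here via longjmp
}
```

    The price of this scheme is that setjmp fills the buffer on every pass through run_guarded, whether or not anything ever fails.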

    Division of registers into volatile and non-volatile

    This is about optimizing function calls. The split has existed for a long time.
    In the 68K architecture (1979), at any rate, two each of the 8 general-purpose registers and the 7 address registers were considered volatile; the rest were preserved by the callee.

    In the 88K architecture (1988), 12 of the 32 general-purpose registers were preserved by the callee (plus the stack and frame pointers).

    The IBM S/360 (1964) has 16 general-purpose integer registers (32-bit) and 4 floating-point registers, but no hardware stack. Before a function call, all registers are stored in a special save area. A recursive call has to request memory from the OS dynamically. Parameters are passed as a pointer to a list of pointers to the parameter values. In effect, 11 registers are non-volatile.

    Architectures with fewer registers have no trouble saving them. The PDP-11 (1970) has only 6 general-purpose registers, and parameters are passed on the stack. This is in fact where the C calling convention (cdecl) appeared, and it is here that the C language took shape.

    The 8086: where would we be without it. The processor has 8 general-purpose registers, but only BX carries no architectural duties. There were several calling conventions; they dealt with parameter passing, while all registers were considered volatile.

    We will not dwell on IA-32 and return to x86-64 instead. Here there are too many registers to save all of them before every call. For a full-sized function this will have to be done one way or another, but for small "functionlets" it is wasteful. A compromise is needed.
    Who decides which class a particular register belongs to? The compiler itself; the architecture is only indirectly involved. Here is what one of the GCC developers wrote about this in 2003:
    The decision about which category to assign a particular register to was not easy. AMD64 has 15 general-purpose registers (translator's note: %rsp is not counted, it is saved anyway), and using 8 of them (the so-called extended registers) in an instruction requires a REX prefix, which increases the instruction size. In addition, the %rax, %rdx, %rcx, %rsi and %rdi registers are used implicitly by some IA-32 instructions. We decided to make these registers volatile to avoid restrictions on the use of instructions.
    Thus only %rbx, %rbp and the extended registers could be made non-volatile. A series of tests showed that the smallest code is obtained when the non-volatile registers are %rbx, %rbp and %r12-%r15.

    Initially, we wanted to make 6 SSE registers volatile. However, difficulties arose: these registers are 128 bits wide and usually only 64 bits are used to hold data, so saving them is more expensive for the caller than for the callee.
    Various experiments were carried out, and we came to the conclusion that the most compact and fastest code is obtained when all SSE registers are declared volatile.

    This is still the case, see the AMD64 ABI spec, page 21.
    The same ABI is also used on OS X.

    Microsoft decided differently, and its split is as follows:
    • volatile: RAX, RCX, RDX, R8:R11, XMM0:XMM5, YMM0:YMM5
    • non-volatile: RSI, RDI, RBX, RBP, RSP, R12:R15, XMM6:XMM15, YMM6:YMM15
    Well, at least all of this fits into the size of jmp_buf.
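    A quick check of that claim. This is only a sketch: the names are ours, and the SSE registers are counted at their 128-bit XMM width, which is an assumption about what actually gets saved:

```cpp
#include <cassert>
#include <cstddef>

// Non-volatile registers in the Microsoft x64 convention, as listed above.
constexpr std::size_t kNonVolatileGpr = 9;   // RSI, RDI, RBX, RBP, RSP, R12..R15
constexpr std::size_t kNonVolatileXmm = 10;  // XMM6..XMM15

// Assumption: SSE registers are counted at 16 bytes (XMM width) each.
constexpr std::size_t nonvolatile_bytes() {
    return kNonVolatileGpr * 8 + kNonVolatileXmm * 16;
}

constexpr std::size_t kJmpBufBytes = 256;    // sizeof(jmp_buf), MSVC 2013, x64

static_assert(nonvolatile_bytes() == 232, "9 * 8 + 10 * 16");
static_assert(nonvolatile_bytes() <= kJmpBufBytes, "fits into jmp_buf");
```

    That leaves 24 of the 256 bytes for everything else the buffer has to hold.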

    What about more register-rich architectures?
    This is how things stand with the OS X 64-bit compiler for PowerPC, which has 32 integer registers and as many floating-point registers:
    • volatile: GPR0, GPR2:GPR10, GPR12, FPR0:FPR13, 11 + 14 in total
    • non-volatile: GPR1, GPR11 (*), GPR13:GPR31, FPR14:FPR31, 21 + 18 in total
    (*) GPR11 is non-volatile for leaf functions (those that make no further calls)

    To sum up, splitting the registers into two classes provides a general-purpose optimization of function calls:
    • some of the registers are used to pass arguments, which is faster than going through the stack (and some are reserved for housekeeping)
    • the number of such registers is chosen by the compiler developers based on statistics and their notion of typical code
    • the contents of the remaining registers are saved only when necessary, so a small, non-greedy function may save nothing at all
    • when a full-sized function is called, the contents of all the registers get saved, but that cost is negligible compared with the running time of the function body

    Register windows

    This is an alternative approach; the processors that use it descend from the Berkeley RISC project (1980-1984).

    The Intel i960 (1989) is a 32-bit processor with 16 local and 16 global general-purpose registers. Parameters are passed in global registers; on a function call, all local registers are saved by a special instruction. In effect all local registers are non-volatile, but they are saved unconditionally in the hope that hardware support makes this faster. By today's standards, though, that is just one cache line.

    The AMD 29K (1988) is a 32-bit processor with 192 (sic!) registers:
    • 64 global and 128 local integer registers
    • the local registers form the top of a stack that continues in RAM; they are addressed by offsets from the top of that stack (held in one of the global registers)
    • function input parameters are passed in local registers, return values in global ones
    • there is also a real stack in memory for data that does not fit into 16 words, and for anything whose address may be taken, such as local arrays or any object that has a this pointer.

    SPARC (1987) can have a varying number of registers (the S in the name stands for Scalable):
    • a typical processor has 128 general-purpose registers
    • only 32 of them are visible at a time: 8 global and 24 local ones, which form a window
    • the window consists of 8 input registers (arguments), 8 local registers and 8 output registers (for the next call)
    • the local registers form a circular buffer; on a function call the window shifts by 16 registers, so the caller's 8 output registers become the callee's input registers
    • when registers are evicted from the buffer, they go to the stack
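    The window shift is easy to model in software. The following is a toy sketch, not a description of real SPARC hardware: the class and function names are ours, the ring is assumed to hold 128 windowed registers, the global registers are left out, and spilling to the stack on overflow is not modeled:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy model of a windowed register file arranged as a ring.
class WindowedRegisters {
public:
    std::uint64_t& in(std::size_t i)    { return at(0 + i); }
    std::uint64_t& local(std::size_t i) { return at(8 + i); }
    std::uint64_t& out(std::size_t i)   { return at(16 + i); }

    // A call shifts the window by 16: the old out part becomes the new in part.
    void call() { base_ = (base_ + 16) % kSize; }
    // A return shifts it back.
    void ret()  { base_ = (base_ + kSize - 16) % kSize; }

private:
    static constexpr std::size_t kSize = 128;
    std::uint64_t& at(std::size_t offset) { return regs_[(base_ + offset) % kSize]; }

    std::array<std::uint64_t, kSize> regs_{};
    std::size_t base_ = 0;
};

// Demo: the caller's out register 3 is visible as the callee's in register 3.
std::uint64_t pass_argument(std::uint64_t value) {
    WindowedRegisters r;
    r.out(3) = value;              // caller places an argument
    r.call();
    std::uint64_t seen = r.in(3);  // callee reads it, nothing was copied
    r.ret();
    assert(r.out(3) == value);     // the caller's view is back in place
    return seen;
}
```

    Passing an argument costs no memory traffic here: call() only changes the base index, and the same physical register is reached under a different name.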

    Itanium (2001) carries on where SPARC left off.
    • 128 general-purpose integer registers (64-bit) in total, and as many floating-point ones
    • 32 of them are considered global
    • 96 are local and form the top of the register stack
    • the processor itself takes care of spilling and reloading them, creating the illusion of an endless register stack (RSE, Register Stack Engine)
    • when a function is called, a register window is created for it with the special alloc instruction, and the compiler must set its size explicitly
    • the window is laid out as in SPARC, but the sizes of its parts are flexible and are also set by the compiler; the total size is at most 96 registers
      • the in part holds the function's input parameters, at most 8
      • the local part holds local data
      • the out part holds the parameters of functions called from this one, at most 8
    • when one function calls another, the register window shifts and, with a flick of the wrist, the out part becomes the in part
    • a regular stack is present as well; the compiler puts there everything that did not fit on the register stack

    Itanium definitely wins the audience award; it is a pity this processor never took off.

    Function call

    So, having considered all this architectural splendor, we can draw the following conclusions about calling functions.

    Yes, after the optimizer has done its work, the body of a function sometimes resembles primordial soup, where it is not always clear why a given instruction is needed or where to find the value of a variable.

    However, by the time a child function is called, all this boisterous activity freezes.

    Regardless of the processor architecture, the current data in the registers is protected from loss one way or another. On register-window architectures this happens naturally. On the others, volatile registers are saved to memory; if a register holds a temporary value that has no home in memory, the value will have to be recomputed. Non-volatile registers either stay unchanged or have their values restored.

    Suppose an exception is thrown in a function further down the call chain and we want to transfer control to one of the catch blocks of some function. All the information needed to restore the execution context is already on the stack or in registers. The try block need not make any effort: it does not have to allocate stack space or save anything there. Everything has already been saved. The problem now is how to record where all that information was put.
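    To make the scenario concrete, here is a minimal sketch; the names Guard, inner, middle and run_unwind_demo are hypothetical. The exception crosses two frames that contain no try block, and their locals are still destroyed on the way to the handler:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

static std::vector<std::string> g_trace;  // records the order of events

struct Guard {
    std::string name;
    explicit Guard(std::string n) : name(std::move(n)) {}
    ~Guard() { g_trace.push_back("~Guard(" + name + ")"); }
};

void inner() {
    Guard g("inner");
    g_trace.push_back("throw");
    throw std::runtime_error("boom");
}

void middle() {
    Guard g("middle");  // no try block here, nothing is set up in advance
    inner();
}

// Runs the scenario and returns the recorded sequence of events.
std::vector<std::string> run_unwind_demo() {
    g_trace.clear();
    try {
        middle();
    } catch (const std::runtime_error&) {
        g_trace.push_back("catch");
    }
    return g_trace;
}
```

    The recorded order is throw, ~Guard(inner), ~Guard(middle), catch: the runtime has to know, for every frame it passes, what to destroy and what to restore.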

    Fortunately, this information is static and known at compile time. The compiler collects it into tables, and the result is a table-driven mechanism that costs nothing until an exception is thrown.
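    The idea of such a table can be sketched in a few lines. This is purely an illustration and not the format used by MSVC or GCC: the structure, the addresses and the function name are all made up. The table is constant data, and it is only consulted once an exception is in flight:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>

// Hypothetical table row: code range [begin, end) and where to resume.
struct RangeEntry {
    std::uint32_t begin;
    std::uint32_t end;
    std::uint32_t landing_pad;  // 0 means "no handler in this range"
};

// Constant data sorted by begin; the normal execution path never touches it.
static const RangeEntry kTable[] = {
    {0x1000, 0x1040, 0x2000},
    {0x1040, 0x10A0, 0},
    {0x10A0, 0x1100, 0x2080},
};

// Looks up the landing pad for a program counter; returns 0 if there is none.
std::uint32_t find_landing_pad(std::uint32_t pc) {
    // Find the first entry that starts after pc, then step back one entry.
    auto it = std::upper_bound(
        std::begin(kTable), std::end(kTable), pc,
        [](std::uint32_t value, const RangeEntry& e) { return value < e.begin; });
    if (it == std::begin(kTable)) return 0;
    --it;
    return pc < it->end ? it->landing_pad : 0;
}
```

    The cost is moved entirely to the throwing path: a binary search per frame instead of bookkeeping on every entry into a try block.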

    Let's see how this is implemented in the MSVC (x64) and GCC (x64) compilers.

    MSVC (x64)

    MSVC creates a prologue and an epilogue for each function, and the RSP value stays unchanged between them. RBP is treated as an ordinary register until someone uses alloca. Let's take some non-trivial function for our experiments and look in the debugger at the part of its prologue that matters to us:
    000000013F440850 mov rax, rsp
    000000013F440853 push rbp
    000000013F440854 push rdi
    000000013F440855 push r12
    000000013F440857 push r14
    000000013F440859 push r15
    000000013F44085B lea rbp, [rax-0B8h] # initialization
    000000013F440862 sub rsp, 190h
    000000013F440869 mov qword ptr [rbp+20h], 0FFFFFFFFFFFFFFFEh # initialization
    000000013F440871 mov qword ptr [rax+10h], rbx
    000000013F440875 mov qword ptr [rax+18h], rsi
    000000013F440879 mov rax, qword ptr [__security_cookie (013F4C5020h)] # the function body starts here
    Where possible, the optimizer interleaves the prologue code with initialization instructions.

    And here is the epilogue, in which the non-volatile registers are restored to their original state:
    000000013F4410C2 lea r11, [rsp+190h]
    000000013F4410CA mov rbx, qword ptr [r11+38h]
    000000013F4410CE mov rsi, qword ptr [r11+40h]
    000000013F4410D2 mov rsp, r11
    000000013F4410D5 pop r15
    000000013F4410D7 pop r14
    000000013F4410D9 pop r12
    000000013F4410DB pop rdi
    000000013F4410DC pop rbp
    000000013F4410DD ret
    The compiler collects the stack unwinding information in the .pdata section. Each function has a RUNTIME_FUNCTION structure, which refers to an unwind table. Its contents can be extracted with the link utility using the -dump -unwindinfo options. For the same function we find:
    00001D70 00020880 0002110E 000946C0 write_table_header ...
    Unwind version: 1
    Unwind flags: EHANDLER UHANDLER
    Size of prologue: 0x3A
    Count of codes: 11
    Unwind codes:
    29: SAVE_NONVOL, register=rsi offset=0x1D0
    25: SAVE_NONVOL, register=rbx offset=0x1C8
    19: ALLOC_LARGE, size=0x190
    0B: PUSH_NONVOL, register=r15
    09: PUSH_NONVOL, register=r14
    07: PUSH_NONVOL, register=r12
    05: PUSH_NONVOL, register=rdi
    04: PUSH_NONVOL, register=rbp
    Handler: 0006BFD0 __GSHandlerCheck_EH
    EH Handler Data: 00087578
    GS Unwind flags: UHandler
    Cookie Offset: 00000188
    We are interested in the Unwind codes: they describe the actions that have to be rolled back when an exception is thrown.
    • the number at the start of the line is the offset, from the beginning of the function, of the instruction that follows the one being described. If an exception occurs in the middle of the prologue (which would be very strange), only the changes actually made get rolled back
    • next comes the operation type: ALLOC_LARGE means allocating a given amount of memory on the stack, SAVE_NONVOL means saving a register into already allocated memory, PUSH_NONVOL means saving a register on the stack while decrementing RSP
    • the codes are listed in reverse order, mirroring the actions of the epilogue
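    How such codes might be replayed can be shown on a toy machine. This is a sketch under stated assumptions, not the actual Windows unwinder: the types and function names are ours, only the three operations mentioned above are modeled, and the demo uses a shortened prologue with made-up offsets:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

enum class Op { PushNonvol, AllocLarge, SaveNonvol };

struct UnwindCode {
    std::uint32_t offset;  // offset of the instruction after the described one
    Op op;
    std::string reg;       // register name, empty for AllocLarge
    std::uint64_t value;   // allocation size, or save offset from RSP
};

// Toy machine: named registers and a sparse stack of 8-byte slots.
struct Machine {
    std::map<std::string, std::uint64_t> regs;
    std::map<std::uint64_t, std::uint64_t> stack;
};

// Undoes the prologue effects recorded in codes (listed last-executed first).
// pc_offset is the offset into the function where the exception occurred.
void unwind_frame(Machine& m, const std::vector<UnwindCode>& codes,
                  std::uint32_t pc_offset) {
    std::uint64_t& rsp = m.regs["rsp"];
    for (const UnwindCode& c : codes) {
        if (c.offset > pc_offset) continue;  // that instruction never ran
        switch (c.op) {
        case Op::SaveNonvol: m.regs[c.reg] = m.stack[rsp + c.value]; break;
        case Op::AllocLarge: rsp += c.value; break;
        case Op::PushNonvol: m.regs[c.reg] = m.stack[rsp]; rsp += 8; break;
        }
    }
}

// Demo: run a shortened prologue, clobber the registers, unwind, and
// report whether the caller's register values came back.
bool demo_unwind() {
    Machine m;
    m.regs = {{"rsp", 0x1000}, {"rbp", 1}, {"rbx", 2}, {"rsi", 3}};
    const std::map<std::string, std::uint64_t> caller_regs = m.regs;

    // Shortened prologue: push rbp; sub rsp, 190h; store rbx and rsi.
    std::uint64_t& rsp = m.regs["rsp"];
    rsp -= 8;
    m.stack[rsp] = m.regs["rbp"];
    rsp -= 0x190;
    m.stack[rsp + 0x1A8] = m.regs["rbx"];
    m.stack[rsp + 0x1B0] = m.regs["rsi"];

    // The function body is free to overwrite the non-volatile registers.
    m.regs["rbp"] = m.regs["rbx"] = m.regs["rsi"] = 0xDEAD;

    const std::vector<UnwindCode> codes = {
        {0x14, Op::SaveNonvol, "rsi", 0x1B0},
        {0x10, Op::SaveNonvol, "rbx", 0x1A8},
        {0x0C, Op::AllocLarge, "",    0x190},
        {0x04, Op::PushNonvol, "rbp", 0},
    };
    unwind_frame(m, codes, 0x40);  // exception somewhere in the body
    return m.regs == caller_regs;
}
```

    The offset check in unwind_frame is what makes a rollback from the middle of the prologue possible: codes for instructions that have not executed yet are simply skipped.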

    GCC (x64)

    In the same way, let's examine the prologue and epilogue of the same function as generated by GCC.
    .cfi_personality 0x9b,DW.ref.__gxx_personality_v0
    .cfi_lsda 0x1b,.LLSDA11339
    pushq %r15
    .cfi_def_cfa_offset 16
    .cfi_offset 15, -16
    pushq %r14
    .cfi_def_cfa_offset 24
    .cfi_offset 14, -24
    pushq %r13
    .cfi_def_cfa_offset 32
    .cfi_offset 13, -32
    movq %rdi, %r13
    pushq %r12
    .cfi_def_cfa_offset 40
    .cfi_offset 12, -40
    pushq %rbp
    .cfi_def_cfa_offset 48
    .cfi_offset 6, -48
    pushq %rbx
    .cfi_def_cfa_offset 56
    .cfi_offset 3, -56
    subq $456, %rsp
    .cfi_def_cfa_offset 512
    # ... function body ...
    addq $456, %rsp
    .cfi_def_cfa_offset 56
    popq %rbx
    .cfi_def_cfa_offset 48
    popq %rbp
    .cfi_def_cfa_offset 40
    popq %r12
    .cfi_def_cfa_offset 32
    popq %r13
    .cfi_def_cfa_offset 24
    popq %r14
    .cfi_def_cfa_offset 16
    popq %r15
    .cfi_def_cfa_offset 8
    The CFI prefix stands for Call Frame Information; these directives tell the assembler how to record additional information for stack unwinding. This information is collected in the .eh_frame section, and you can view it in readable form using the dwarfdump utility with the -F switch:
    # prologue
    <0><0x00000e08:0x00000f4a><><fde offset 0x00000e00 length: 0x00000060><eh aug data len 0x0>
    0x00000e08: <off cfa=08(r7)> <off r16=-8(cfa)>
    0x00000e0a: <off cfa=16(r7)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000e0c: <off cfa=24(r7)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000e0e: <off cfa=32(r7)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000e10: <off cfa=40(r7)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000e11: <off cfa=48(r7)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000e12: <off cfa=56(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    # body
    0x00000e19: <off cfa=64(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000e51: <off cfa=56(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000e52: <off cfa=48(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000e53: <off cfa=40(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000e55: <off cfa=32(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000e57: <off cfa=24(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000e59: <off cfa=16(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000e5b: <off cfa=08(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000e60: <off cfa=64(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000f08: <off cfa=56(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000f09: <off cfa=48(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000f0a: <off cfa=40(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000f0c: <off cfa=32(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000f0e: <off cfa=24(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000f10: <off cfa=16(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000f12: <off cfa=08(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    0x00000f18: <off cfa=64(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> <off r12=-40(cfa)> <off r13=-32(cfa)> <off r14=-24(cfa)> <off r15=-16(cfa)> <off r16=-8(cfa)>
    What we see here:
    • the first number is the address of an instruction
    • each entry applies to the interval from its address up to the address of the next entry
    • an entry consists of a descriptor saying which register serves as the frame pointer, <off cfa=48(r7)> (r7 is %rsp, see dwarfdump.conf),
    • and a list of register descriptors; for example, <off r3=-56(cfa)> means that the %rbx register is stored at offset -56 from the frame pointer
    • the prologue matches the assembly listing; a register r16 has been added, which in the x86-64 DWARF numbering is the column for the return address
    • there is no description of the epilogue; apparently the compiler assumes that no exceptions can occur while the epilogue executes
    • we see several stretches of code in which the cfa value decreases monotonically. Why this happens is unclear; perhaps the compiler inlines functions and places their temporaries on the stack, and saves on restoring the stack pointer as long as everything fits in the red zone.
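    Reading one row of this table can be sketched as follows. It is an illustration only, not the libgcc unwinder: the types and function names are ours, the stack is a sparse map, and the stack contents in the demo are invented. The row itself is the one for address 0x00000e12 in the dump above, and register 16 is taken to be the return address:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// One row of the table: how to find the CFA and where registers were saved.
struct CfiRow {
    std::int64_t cfa_offset;               // CFA = %rsp + cfa_offset
    std::map<int, std::int64_t> saved_at;  // DWARF register -> offset from CFA
};

using Stack = std::map<std::uint64_t, std::uint64_t>;  // sparse 8-byte slots

// Computes the CFA for a frame whose current %rsp is given.
std::uint64_t cfa_of(const CfiRow& row, std::uint64_t rsp) {
    return rsp + static_cast<std::uint64_t>(row.cfa_offset);
}

// Reads the caller's value of a register saved by this frame.
std::uint64_t saved_register(const CfiRow& row, const Stack& stack,
                             std::uint64_t rsp, int dwarf_reg) {
    const std::uint64_t address =
        cfa_of(row, rsp) + static_cast<std::uint64_t>(row.saved_at.at(dwarf_reg));
    return stack.at(address);
}

// Demo with the row <off cfa=56(r7)> <off r3=-56(cfa)> <off r6=-48(cfa)> ...
bool demo_cfi() {
    const CfiRow row{56, {{3, -56}, {6, -48}, {16, -8}}};
    const std::uint64_t rsp = 0x7000;
    // Invented stack contents: rbx on top, rbp above it, return address last.
    const Stack stack{{0x7000, 0xB}, {0x7008, 0xA}, {0x7030, 0x401234}};
    return cfa_of(row, rsp) == 0x7038
        && saved_register(row, stack, rsp, 3) == 0xB
        && saved_register(row, stack, rsp, 6) == 0xA
        && saved_register(row, stack, rsp, 16) == 0x401234;
}
```

    With cfa=56 and r3=-56, the saved %rbx sits exactly at %rsp, which is what one expects right after pushq %rbx.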


    So we have reached the end. Along the way it turned out that there is no magic. To be able to restore the state after an exception is caught, no extra work is needed in advance; everything gets saved by itself in the natural course of executing the code.
    Actually restoring the state does take a little help from the compiler, but it is all quite modest, with no frills.

    In general, exception handling in modern compilers is an excellent example of how a very hard problem can be solved calmly, without fuss, by thoroughly down-to-earth methods. Respect to the developers.
