Implementing hot reload of C++ code on Linux and macOS: digging deeper


    * The link to the library and the demo video are at the end of the article. To understand what is happening here, and who all these people are, I recommend reading the previous article.


    In the previous article, we looked at an approach that makes hot reloading of C++ code possible. “Code” here means functions, data, and the way they work together. Functions cause no particular trouble: we redirect execution from the old function to the new one, and everything works. The problem is data (static and global variables), or more precisely the strategy for keeping it synchronized between the old and the new code. In the first implementation this strategy was very crude: simply copy the values of all static variables from the old code to the new code, so that the new code, referring to its new variables, works with the values from the old code. That is not correct in general, and today we will try to fix this flaw, solving a number of small but interesting problems along the way.


    The article leaves out the mechanical work, such as reading symbols and relocations from ELF and Mach-O files. The emphasis is on the subtle points that I ran into during implementation and that may be useful to someone who, like me not long ago, is looking for answers.


    The essence


    Let's imagine that we have a class (the examples are synthetic, so please do not look for meaning in them; only the code matters):


    // Entity.hpp
    #include <string>

    class Entity
    {
    public:
        Entity(const std::string& description);
        ~Entity();
        void printDescription();
        static int getLivingEntitiesCount();
    private:
        static int m_livingEntitiesCount;
        std::string m_description;
    };

    // Entity.cpp
    #include <iostream>

    int Entity::m_livingEntitiesCount = 0;

    Entity::Entity(const std::string& description)
        : m_description(description)
    {
        m_livingEntitiesCount++;
    }
    Entity::~Entity()
    {
        m_livingEntitiesCount--;
    }
    int Entity::getLivingEntitiesCount()
    {
        return m_livingEntitiesCount;
    }
    void Entity::printDescription()
    {
        std::cout << m_description << std::endl;
    }

    Nothing special, apart from the static variable. Now imagine that we want to change the printDescription() method to:


    void Entity::printDescription()
    {
        std::cout << "DESCRIPTION: " << m_description << std::endl;
    }

    What happens after the code is reloaded? Besides the methods of the Entity class, the library with the new code will also contain the static variable m_livingEntitiesCount. Nothing bad happens if we simply copy the value of this variable from the old code to the new one and keep using the new variable, forgetting about the old one, because every method that uses this variable directly lives in the library with the new code.


    C++ is very flexible and rich. And even though the elegance of some C++ solutions borders on code smell, I love this language. For example, imagine that RTTI is not used in your project, yet you need an Any class with a reasonably type-safe interface:


    class Any
    {
    public:
        template <typename T>
        explicit Any(T&& value) { ... }
        template <typename T>
        bool is() const { ... }
        template <typename T>
        T& as() { ... }
    };

    We will not go into the implementation details of this class. What matters is that the implementation needs some mechanism that maps a type (a compile-time entity) unambiguously to the value of a variable, for example a uint64_t (a run-time entity); in other words, it needs to "number" the types. With RTTI we would have type_info and, even more suitable, type_index at our disposal. But we have no RTTI. In that case a fairly common hack (or an elegant solution?) is a function like this:


    template <typename T>
    uint64_t typeId()
    {
        static char someVar;
        return reinterpret_cast<uint64_t>(&someVar);
    }

    Then the implementation of the Any class will look something like this:


    class Any
    {
    public:
        template <typename T>
        explicit Any(T&& value)
            : m_typeId(typeId<typename std::decay<T>::type>())
        {
            // copy or move value somewhere
        }
        template <typename T>
        bool is() const { return m_typeId == typeId<typename std::decay<T>::type>(); }
        template <typename T>
        T& as() { ... }
    private:
        uint64_t m_typeId = 0;
    };

    The function is instantiated exactly once for each type, so each instantiation has its own static variable with its own unique address. What happens when we reload code that uses this function? Calls to the old version of the function are redirected to the new one. The new one has its own static variable, already initialized (we copied the value and the guard variable). But we do not care about the value; we only use the address, and the address of the new variable is different. The data has thus become inconsistent: the Any instances created earlier store the address of the old static variable, the is() method compares it with the address of the new one, and "this Any will not be the same Any" ©.
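The mismatch is easy to reproduce without any reloading machinery. The sketch below imitates the two copies of the code with two namespaces, oldCode and newCode (hypothetical names introduced only for this illustration); each gets its own instantiation of typeId and therefore its own static variable.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the code that was loaded first.
namespace oldCode {
template <typename T>
std::uint64_t typeId()
{
    static char someVar;
    return reinterpret_cast<std::uint64_t>(&someVar);
}
}

// Stand-in for the reloaded code: same source text, but a separate static variable.
namespace newCode {
template <typename T>
std::uint64_t typeId()
{
    static char someVar;
    return reinterpret_cast<std::uint64_t>(&someVar);
}
}

// Imitates an Any created before the reload and queried with is<int>() after it.
bool storedIdSurvivesReload()
{
    const std::uint64_t stored = oldCode::typeId<int>();

    // Within one copy of the code the id is stable and unique per type...
    assert(stored == oldCode::typeId<int>());
    assert(stored != oldCode::typeId<float>());

    // ...but the new copy compares against the address of its own variable.
    return stored == newCode::typeId<int>();
}
```

Here storedIdSurvivesReload() returns false: two distinct objects never share an address, which is why copying values cannot help in this case.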


    Plan


    Solving this problem takes something smarter than plain copying. After a couple of evenings of googling and reading documentation, source code, and system APIs, I came up with the following plan:


    1. After building the new code, we walk through its relocations.
    2. From these relocations we find every place in the code that uses static (and sometimes global) variables.
    3. At each relocation site, we substitute the address of the old version of the variable for the address of the new version.

    This way nothing refers to the new data, and the entire application keeps working with the old versions of the variables, down to the exact address. That should work. It cannot fail.


    Relocation


    When the compiler generates machine code, at every place where a function is called or a variable's address is loaded it inserts a few bytes, enough to hold the real address of that function or variable, and also emits a relocation. It cannot write the real address right away, because at this stage it does not know it. After linking, functions and variables can end up in different sections and at different offsets within them, and in the end the sections can be loaded at different addresses at run time.


    A relocation contains the following information:


    • The address at which the address of the function or variable must be written
    • Which function or variable's address must be written
    • The formula by which this address is calculated
    • How many bytes are reserved for this address

    Different operating systems represent relocations differently, but in the end they all work on the same principle. For example, in ELF (Linux), relocations live in special .rela sections (.rel in the 32-bit version), each of which refers to the section whose addresses need fixing (for example, .rela.text holds the relocations applied to the .text section), and each entry stores information about the symbol whose address must be written at the relocation site. In Mach-O (macOS) it is roughly the other way around: there is no separate section for relocations; instead, each section contains a pointer to the relocation table that applies to it, and each record of that table refers to the symbol being relocated.
    For example, for the following code (compiled with -fPIC):


    int globalVariable = 10;
    int veryUsefulFunction() {
        static int functionLocalVariable = 0;
        functionLocalVariable++;
        return globalVariable + functionLocalVariable;
    }

    the compiler produces the following relocation section on Linux:


    Relocation section '.rela.text' at offset 0x1a0 contains 4 entries:
        Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
    0000000000000007  0000000600000009 R_X86_64_GOTPCREL      0000000000000000 globalVariable - 4
    000000000000000d  0000000400000002 R_X86_64_PC32          0000000000000000 .bss - 4
    0000000000000016  0000000400000002 R_X86_64_PC32          0000000000000000 .bss - 4
    000000000000001e  0000000400000002 R_X86_64_PC32          0000000000000000 .bss - 4

    and the following relocation table on macOS:


    RELOCATION RECORDS FOR [__text]:
    000000000000001b X86_64_RELOC_SIGNED __ZZ18veryUsefulFunctionvE21functionLocalVariable
    0000000000000015 X86_64_RELOC_SIGNED _globalVariable
    000000000000000f X86_64_RELOC_SIGNED __ZZ18veryUsefulFunctionvE21functionLocalVariable
    0000000000000006 X86_64_RELOC_SIGNED __ZZ18veryUsefulFunctionvE21functionLocalVariable

    And here is what the function veryUsefulFunction() looks like in the object file (on Linux):


    0000000000000000 <_Z18veryUsefulFunctionv>:
       0:   55                      push   rbp
       1:   48 89 e5                mov    rbp,rsp
       4:   48 8b 05 00 00 00 00    mov    rax,QWORD PTR [rip+0x0]
       b:   8b 0d 00 00 00 00       mov    ecx,DWORD PTR [rip+0x0]
      11:   83 c1 01                add    ecx,0x1
      14:   89 0d 00 00 00 00       mov    DWORD PTR [rip+0x0],ecx
      1a:   8b 08                   mov    ecx,DWORD PTR [rax]
      1c:   03 0d 00 00 00 00       add    ecx,DWORD PTR [rip+0x0]
      22:   89 c8                   mov    eax,ecx
      24:   5d                      pop    rbp
      25:   c3                      ret    

    and here it is after the object file has been linked into a dynamic library:


    00000000000010e0 <_Z18veryUsefulFunctionv>:
        10e0:   55                      push   rbp
        10e1:   48 89 e5                mov    rbp,rsp
        10e4:   48 8b 05 05 21 00 00    mov    rax,QWORD PTR [rip+0x2105]
        10eb:   8b 0d 13 2f 00 00       mov    ecx,DWORD PTR [rip+0x2f13]
        10f1:   83 c1 01                add    ecx,0x1
        10f4:   89 0d 0a 2f 00 00       mov    DWORD PTR [rip+0x2f0a],ecx
        10fa:   8b 08                   mov    ecx,DWORD PTR [rax]
        10fc:   03 0d 02 2f 00 00       add    ecx,DWORD PTR [rip+0x2f02]
        1102:   89 c8                   mov    eax,ecx
        1104:   5d                      pop    rbp
        1105:   c3                      ret    

    It contains 4 places, each with 4 bytes reserved for the address of the actual variable.
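The three listings above can be cross-checked with a little arithmetic. The sketch below assumes the usual x86-64 ELF conventions: r_info packs the symbol index into its upper 32 bits and the relocation type into its lower 32 bits, and R_X86_64_PC32 is computed as S + A - P (symbol address plus addend minus the address of the patched place). Rela64 mirrors the layout of Elf64_Rela from <elf.h>; it and the helper names are made up for this example so that the code builds on any platform.

```cpp
#include <cassert>
#include <cstdint>

// Same layout as Elf64_Rela from <elf.h>.
struct Rela64 {
    std::uint64_t offset;  // where in the section the value must be written
    std::uint64_t info;    // symbol index (upper 32 bits) and type (lower 32 bits)
    std::int64_t addend;   // constant term of the formula
};

const std::uint32_t kR_X86_64_PC32 = 2;  // value of R_X86_64_PC32 in <elf.h>

std::uint32_t symbolIndex(const Rela64& r)
{
    return static_cast<std::uint32_t>(r.info >> 32);
}

std::uint32_t relocationType(const Rela64& r)
{
    return static_cast<std::uint32_t>(r.info & 0xffffffffu);
}

// What the linker writes into the 4 reserved bytes for R_X86_64_PC32: S + A - P.
// sectionAddr is the address at which the relocated code ended up after linking.
std::int32_t resolvePc32(const Rela64& r, std::uint64_t symbolAddr, std::uint64_t sectionAddr)
{
    assert(relocationType(r) == kR_X86_64_PC32);
    const std::int64_t place = static_cast<std::int64_t>(sectionAddr + r.offset);
    const std::int64_t value = static_cast<std::int64_t>(symbolAddr) + r.addend - place;
    assert(value >= INT32_MIN && value <= INT32_MAX);  // must fit into 4 bytes
    return static_cast<std::int32_t>(value);
}
```

Take the second entry of the .rela.text listing (offset 0xd, info 0x0000000400000002, addend -4), the function placed at 0x10e0, and the variable at 0x4004 (the address the linked disassembly implies: 0x10f1 + 0x2f13). resolvePc32 then yields 0x2f13, the displacement seen at 0x10eb in the linked code.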


    Each system has its own set of possible relocations. Linux on x86-64 has as many as 40 relocation types, while macOS on x86-64 has only 9. All relocation types can be divided into 2 groups:


    1. Link-time relocations - relocations used in the process of linking object files to an executable file or dynamic library
    2. Load-time relocations - relocations used at the time of loading the dynamic library into the process memory

    The second group includes relocations of exported functions and variables. When a dynamic library is loaded into the process memory, for every dynamic relocation (including relocations of global variables) the dynamic linker looks up the symbol definition in all libraries that are already loaded, including the program itself, and uses the address of the first matching symbol. So we do not need to do anything with these relocations: the linker itself finds the variable from our application, because the application comes earlier in the list of loaded libraries and programs, and substitutes its address into the new code, ignoring the new version of the variable.


    There is a subtle point related to macOS and its dynamic linker. macOS implements the so-called two-level namespace mechanism. Roughly speaking, when a dynamic library is loaded, the linker first looks for symbols in that library itself, and only if it does not find them there does it search the others. This is done for performance, so that relocations resolve quickly, which is logical enough. But it breaks our scheme for global variables. Fortunately, ld on macOS has a special flag, -flat_namespace; if the library is built with this flag, the symbol lookup algorithm becomes identical to the one on Linux.


    The first group includes the relocations of static variables, which is exactly what we need. The only problem is that these relocations are absent from the built library, because the linker has already resolved them. So we will read them from the object files the library was built from.
    The set of possible relocation types is also limited by whether the code is built as position-independent or not. Since we build our code in PIC (position-independent code) mode, only relative relocations are used. In total, the relocations we are interested in are:


    • Relocations from the .rela.text section on Linux, and the relocations referenced by the __text section on macOS, and
    • Which use symbols from the .data and .bss sections on Linux, and from __data, __bss and __common on macOS, and
    • Which have the type R_X86_64_PC32 or R_X86_64_PC64 on Linux, and X86_64_RELOC_SIGNED, X86_64_RELOC_SIGNED_1, X86_64_RELOC_SIGNED_2 or X86_64_RELOC_SIGNED_4 on macOS
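On the Linux side, the three conditions above boil down to a simple predicate. This is only a sketch: RelocInfo is a hypothetical, already-parsed record (the real data comes from the ELF section headers, the symbol table and the .rela.text entries), and the numeric constants are the values of R_X86_64_PC32, R_X86_64_PC64 and R_X86_64_GOTPCREL from <elf.h>.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, already-parsed view of one relocation entry.
struct RelocInfo {
    std::string relocatedSection;  // section that the relocation patches, e.g. ".text"
    std::string symbolSection;     // section in which the target symbol is defined
    std::uint32_t type;            // relocation type taken from r_info
};

const std::uint32_t kR_X86_64_PC32 = 2;
const std::uint32_t kR_X86_64_GOTPCREL = 9;
const std::uint32_t kR_X86_64_PC64 = 24;

// True for link-time relocations of static data referenced from code.
bool isStaticDataRelocation(const RelocInfo& r)
{
    const bool patchesCode = r.relocatedSection == ".text";
    const bool targetsData = r.symbolSection == ".data" || r.symbolSection == ".bss";
    const bool isRelative = r.type == kR_X86_64_PC32 || r.type == kR_X86_64_PC64;
    return patchesCode && targetsData && isRelative;
}

// Usage: a PC32 reference to .bss is kept, a GOTPCREL reference is skipped
// because the dynamic linker resolves it at load time.
inline void filterExample()
{
    assert(isStaticDataRelocation(RelocInfo{".text", ".bss", kR_X86_64_PC32}));
    assert(!isStaticDataRelocation(RelocInfo{".text", ".data", kR_X86_64_GOTPCREL}));
}
```

A macOS variant would follow the same shape with the __text, __data, __bss and __common sections and the X86_64_RELOC_SIGNED family of types.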

    There is a subtle point about the __common section. Linux has a similar section, *COM*, where global variables can end up. But while I was testing and compiling a bunch of code fragments, on Linux the relocations of symbols from *COM* were always dynamic, just as for ordinary global variables. On macOS, by contrast, such symbols were sometimes relocated at link time, when the function and the symbol were in the same file. So on macOS it makes sense to take this section into account when reading symbols and relocations.


    Great, now we have the set of all relocations we need. What do we do with them? The logic is simple. When the linker links the library, it writes at the relocation address the symbol address computed by a certain formula. For our relocations, on both platforms, this formula contains the symbol address as a term. So the computed value, already written into the function body, has the form:


    resultAddr = newVarAddr + addend - relocAddr

    We also know the addresses of both versions of the variable: the old one, already living in the application, and the new one. All that remains is to adjust the value by the formula:


    resultAddr = resultAddr - newVarAddr + oldVarAddr

    and write it back at the relocation address. After that, all functions in the new code use the already existing versions of the variables, and the new variables just lie there doing nothing. Exactly what we need! But there is one subtle point.
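As a sketch, the adjustment for a single 4-byte relocation might look like this. retargetRel32 is a hypothetical helper, not the library's actual code; it works on a plain byte buffer, whereas in a real process the page holding the code would first have to be made writable. The range check hints at the subtle point: the adjusted value still has to fit into the 4 reserved bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Re-targets one already-resolved 32-bit PC-relative relocation from the new
// copy of a variable to the old one. relocPlace points at the 4 displacement
// bytes inside the new code. Returns false, leaving the bytes untouched, if
// the adjusted displacement does not fit into a signed 32-bit integer.
bool retargetRel32(std::uint8_t* relocPlace, std::uint64_t newVarAddr, std::uint64_t oldVarAddr)
{
    assert(relocPlace != nullptr);

    std::int32_t current = 0;
    std::memcpy(&current, relocPlace, sizeof(current));

    // resultAddr = resultAddr - newVarAddr + oldVarAddr, done in 64-bit
    // arithmetic (user-space addresses fit into a signed 64-bit integer).
    const std::int64_t patched = static_cast<std::int64_t>(current)
        - static_cast<std::int64_t>(newVarAddr)
        + static_cast<std::int64_t>(oldVarAddr);
    if (patched < INT32_MIN || patched > INT32_MAX) {
        return false;
    }

    const std::int32_t value = static_cast<std::int32_t>(patched);
    std::memcpy(relocPlace, &value, sizeof(value));
    return true;
}
```

When the old and the new variable are close to each other, the call succeeds and rewrites the displacement; when one lives near 0x00400000 and the other near 0x7fd3829bd000, it returns false.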


    Loading the library with the new code


    When the system loads a dynamic library into the process memory, it is free to place it anywhere in the virtual address space. On Ubuntu 18.04 my application is loaded at address 0x00400000, while our dynamic libraries are loaded right after ld-2.27.so, at addresses around 0x7fd3829bd000. The distance between the load addresses of the program and the library is far larger than what fits into a signed 32-bit integer, and link-time relocations reserve only 4 bytes for the addresses of target symbols.


    After studying the compiler and linker documentation, I decided to try the -mcmodel=large option. It makes the compiler generate code without any assumptions about the distance between symbols, so all addresses are 64-bit. But this option does not get along with PIC: -mcmodel=large cannot be used together with -fPIC, at least on macOS. I still do not understand what the problem is; perhaps macOS has no suitable relocations for this situation.


    In the library for Windows, this problem is solved as follows. A chunk of virtual memory, large enough to hold the required sections of the library, is allocated by hand near the place where the application is loaded. Then the sections are loaded into it by hand, the necessary permissions are set on the memory pages of each section, all relocations are resolved by hand, and everything else is patched. I am lazy. I really did not want to do all this work with load-time relocations, especially on Linux. And why do something that the dynamic linker already knows how to do? After all, the people who wrote it know far more than I do.


    Fortunately, the documentation turned up the options that let you specify where to load our dynamic library:


    • Apple ld: -image_base 0xADDRESS
    • LLVM lld: --image-base=0xADDRESS
    • GNU ld: -Ttext-segment=0xADDRESS

    These options must be passed to the linker when linking the dynamic library. There are 2 difficulties.
    The first one concerns GNU ld. For these options to take effect:


    • The region where we want to load the library must be free at the time the library is loaded
    • The address specified in the option must be a multiple of the page size (0x1000 on x86-64 Linux and macOS)
    • At least on Linux, the address specified in the option must be a multiple of the alignment of the PT_LOAD segment

    That is, if the linker has set the alignment to 0x10000000, the library cannot be loaded at address 0x10001000, even though that address is aligned to the page size. If any of these conditions is not met, the library is loaded "as usual". My system has GNU ld 2.30, and, unlike LLVM lld, it defaults to a PT_LOAD segment alignment of 0x200000, which stands out from the overall picture. To work around this, you need to pass -z max-page-size=0x1000 in addition to -Ttext-segment=.... It took me a day to figure out why the library was not being loaded where it should be.
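The conditions and options above can be captured in a small helper. The sketch assumes the linker is invoked through the compiler driver, hence the -Wl, prefix; the Linker enum and both function names are invented for the example, and whether the region is actually free still has to be checked separately.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

enum class Linker { AppleLd, LlvmLld, GnuLd };

// Checks the two alignment conditions: the address must be a multiple of the
// page size (0x1000 on x86-64) and of the PT_LOAD segment alignment.
bool isUsableImageBase(std::uint64_t address, std::uint64_t segmentAlignment)
{
    assert(segmentAlignment != 0);
    const std::uint64_t pageSize = 0x1000;
    return address % pageSize == 0 && address % segmentAlignment == 0;
}

// Builds the driver flags that ask the linker to place the library at address.
std::string imageBaseFlags(Linker linker, std::uint64_t address)
{
    std::ostringstream out;
    out << std::hex;
    switch (linker) {
        case Linker::AppleLd:
            out << "-Wl,-image_base,0x" << address;
            break;
        case Linker::LlvmLld:
            out << "-Wl,--image-base=0x" << address;
            break;
        case Linker::GnuLd:
            // GNU ld also needs the segment alignment lowered to the page size.
            out << "-Wl,-Ttext-segment=0x" << address << " -Wl,-z,max-page-size=0x1000";
            break;
    }
    return out.str();
}
```

For instance, isUsableImageBase(0x10001000, 0x10000000) is false, which matches the example above, while the same address passes once the segment alignment is lowered to 0x1000.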


    The second difficulty is that the load address must be known at the time the library is linked. This is not hard to arrange. On Linux it is enough to parse the pseudo-file /proc/<pid>/maps, find the unmapped region closest to the program that is large enough for the library, and use the address of that region when linking. The size of the future library can be estimated roughly from the sizes of the object files, or by parsing them and summing the sizes of all sections. In the end we do not need an exact number, only an approximate size with some margin.
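A minimal sketch of the Linux part, assuming the standard maps line format in which each line starts with begin-end in hexadecimal and lines are sorted by address. findFreeRegion and findFreeRegionInText are hypothetical helpers; the first reads from any std::istream, so it can be fed /proc/self/maps as well as a test string, and both return 0 when no suitable gap is found.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Rounds value up to a multiple of alignment (alignment must be a power of two).
std::uint64_t alignUp(std::uint64_t value, std::uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

// Returns the first aligned address at or above minAddr that is followed by
// at least size unmapped bytes before the next mapping, or 0 if there is none.
std::uint64_t findFreeRegion(std::istream& maps, std::uint64_t minAddr,
                             std::uint64_t size, std::uint64_t alignment)
{
    std::uint64_t candidate = alignUp(minAddr, alignment);
    std::string line;
    while (std::getline(maps, line)) {
        const std::string::size_type dash = line.find('-');
        const std::string::size_type space = line.find(' ');
        if (dash == std::string::npos || space == std::string::npos || dash > space) {
            continue;  // not a "begin-end ..." line
        }
        const std::uint64_t begin = std::stoull(line.substr(0, dash), nullptr, 16);
        const std::uint64_t end = std::stoull(line.substr(dash + 1, space - dash - 1), nullptr, 16);

        if (end <= candidate) {
            continue;  // this mapping lies entirely below the candidate
        }
        if (begin >= candidate && begin - candidate >= size) {
            return candidate;  // the gap in front of this mapping is big enough
        }
        candidate = alignUp(end, alignment);  // try right after this mapping
    }
    return 0;
}

// Convenience wrapper for text that has already been read into memory.
std::uint64_t findFreeRegionInText(const std::string& mapsText, std::uint64_t minAddr,
                                   std::uint64_t size, std::uint64_t alignment)
{
    std::istringstream in(mapsText);
    return findFreeRegion(in, minAddr, size, alignment);
}
```

Passing the end address of the program's last mapping as minAddr and the estimated library size (with a margin) as size gives a candidate for the image base option.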


    macOS has no /proc/*; instead, the vmmap utility is suggested. The output of vmmap -interleaved <pid> contains the same information as /proc/<pid>/maps. But another difficulty arises here. If an application creates a child process that runs this command with the <pid> of the current process, the program hangs for good. As I understand it, vmmap stops the process in order to read its memory mappings, and apparently something goes wrong when that is the calling process. In this case you need to pass the additional flag -forkCorpse, so that vmmap creates an empty child process from our process, takes the mappings from it, and kills it, without interrupting the program.


    In general, that's all we need to know.


    Putting it all together


    With these modifications, the final code reload algorithm looks like this:


    1. Compile the new code into object files
    2. Estimate the size of the future library from the object files
    3. Read the relocations from the object files
    4. Find a free region of virtual memory next to the application
    5. Link the dynamic library with the necessary options and load it with dlopen
    6. Patch the code according to the link-time relocations
    7. Patch the functions
    8. Copy the static variables that did not take part in step 6

    Step 8 covers only the guard variables of static variables, so they can be copied safely (thereby preserving the "initialized" state of the static variables themselves).


    Conclusion


    Since this is purely a development tool, not intended for any production use, the worst that can happen if the next library with new code does not fit into memory, or accidentally loads at a different address, is a restart of the application being debugged. When the tests run, 31 libraries with updated code are loaded into memory one after another.


    To be complete, the implementation still lacks 3 fairly substantial pieces:


    1. Right now the library with the new code is loaded next to the program, even though it may contain code from another dynamic library that was loaded far away. To fix this, we need to track which library or program each translation unit belongs to, and split the library with new code when necessary.
    2. Reloading code in a multithreaded application is still unreliable (only code that runs on the same thread as the library's run loop can be reloaded safely). To fix this, part of the implementation has to be moved into a separate program, which, before patching, stops the process with all its threads, performs the patching, and resumes it. I do not know how to do this without an external program.
    3. Preventing accidental application crashes after a code reload. Having changed the code, you can accidentally dereference an invalid pointer in the new code, after which the application has to be restarted. No big deal, but still. It sounds like black magic, and I am still thinking about it.

    But even the current implementation has already started to pay off for me personally; it is good enough to use in my main job. It takes some getting used to, but so far so good.
    If I get around to these three points and find enough of interest in their implementation, I will definitely share it.


    Demo


    Since the implementation lets you add new translation units on the fly, I decided to record a short video in which I write from scratch an indecently simple game about a spaceship roaming the universe and shooting square asteroids. I tried not to write in the “everything in one file” style but, where possible, to keep things neatly separated, which produced many small files (that is why there turned out to be so much typing). Of course, a framework is used for drawing, input, windowing, and so on, but the code of the game itself was written from scratch.
    The main point is that I launched the application only 3 times: at the very beginning, when it contained nothing but an empty scene, and twice more after crashes caused by my own carelessness. The whole game was added incrementally while I was writing the code. Real time is about 40 minutes. Enjoy.



    As always, I welcome any criticism. Thanks!


    Link to the implementation

