Implementing hot reloading of C++ code on Linux


* A link to the library is at the end of the article. The article itself outlines the mechanisms implemented in the library, at a medium level of detail. The macOS implementation is not yet complete, but it differs little from the Linux one, so this article focuses mainly on Linux.

Browsing GitHub one Saturday afternoon, I came across a library that implements updating C++ code on the fly on Windows. I myself moved off Windows a few years ago, without the slightest regret, and now all my programming happens either on Linux (at home) or on macOS (at work). After a little googling, I found that the approach from the library above is quite popular, and MSVC uses the same technique for the "Edit and Continue" feature in Visual Studio. The only problem is that I have not found a single implementation for non-Windows platforms (did I look badly?). I asked the author of the library above whether he would make a port for other platforms; the answer was negative.

I will say right away that I was only interested in an approach in which I would not have to change the existing project code (unlike, for example, RCCPP or cr, where all potentially reloadable code must live in a separate dynamically loaded library).

"How so?" I thought, and started digging into the docs.

Why?

I mainly do game development. Most of my working time I spend writing game logic and laying out all sorts of visuals. In addition, I use imgui for auxiliary utilities. My code cycle, as you probably guessed, is Write -> Compile -> Run -> Repeat. Everything happens pretty quickly (incremental builds, ccache and the like). The problem is that this cycle has to be repeated quite often. For example, say I am writing a new game mechanic, let it be a "Jump", a nice, controllable Jump:

1. Wrote a draft implementation based on an impulse, built, launched. Saw that I accidentally apply the impulse every frame, not just once.

2. Fixed, built, launched, now it works. But the magnitude of the impulse should be bigger.

3. Fixed, built, launched, works. But it somehow feels wrong. Should try an implementation based on force instead.

4. Wrote a draft implementation based on force, built, launched, works. Only need to change the instantaneous velocity at the moment of the jump.

10. Fixed, built, launched, works. But still not quite it. Probably need to try an implementation based on changing gravityScale.

20. Great, looks great! Now expose all the parameters in the editor for the game designers, test and ship.

30. Jump is ready.

And at each iteration, you need to build the code and, in the running application, get to the place where you can jump. This usually takes at least 10 seconds. And what if jumping is only possible in open areas, which you still have to reach? And what if I need to be able to jump onto blocks N units high? Then I already need to build a test scene, which also needs to be debugged and which also takes time. For such iterations, a hot reload of the code would be ideal. Of course, it is not a panacea, it is far from suitable for everything, and even after a reload you sometimes need to re-create part of the game world, which must be taken into account. But in many cases it can be useful and can save concentration and a lot of time.

Requirements and problem statement

  • When changing the code, the new version of all functions should replace the old versions of the same functions.
  • This should work on Linux and macOS.
  • This should not require changes to the existing application code.
  • Ideally, this should be a library, statically or dynamically linked to the application, without third-party utilities.
  • It is desirable that this library does not greatly affect the performance of the application.
  • It is enough if it works with cmake + make / ninja
  • It is enough if it works with debug builds (no optimizations, no stripped symbols, and so on)

This is the minimum set of requirements that an implementation must satisfy. Looking ahead, I will briefly describe what was implemented additionally:

  • Transferring the values of static variables to the new code (see the "Static variable transfer" section to find out why this is important)
  • Dependency-based reloading (change a header -> all dependent files, up to half the project, are rebuilt)
  • Reload code from dynamic libraries


Up to this point, I was very far from this problem domain, so I had to collect and absorb information from scratch.

At a high level, the mechanism looks like this:

  • Monitor the file system for changes in the source
  • When a source file changes, the library rebuilds it using the compile command with which this file was originally built.
  • All resulting object files are linked into a dynamically loaded library.
  • The library is loaded into the process address space.
  • All functions from the library replace the same functions in the application.
  • Static variable values ​​are transferred from application to library.

Let's start with the most interesting thing - the mechanism for reloading functions.

Reloading functions

Here are 3 more or less popular ways of replacing functions at (or almost at) runtime:

  • The LD_PRELOAD trick - allows you to build a dynamically loaded library with, for example, a strcpy function, and have the application pick up your version of strcpy instead of the library one at startup
  • Modifying the PLT and GOT tables - allows you to "override" exported functions
  • Function hooking - allows you to redirect the flow of execution from one function to another

The first 2 options are obviously not suitable, since they only work with exported functions, and we do not want to mark all the functions of our application with special attributes. Therefore, function hooking is our option!

In short, hooking works like this:

  • The address of the function is located.
  • The first few bytes of the function are overwritten by unconditional transfer to the body of another function.
  • ...
  • Profit!
In MSVC there are 2 flags for this: /hotpatch and /FUNCTIONPADMIN. The first writes 2 bytes at the beginning of each function that do nothing, for later overwriting with a "short jump". The second leaves empty space in front of the body of each function in the form of nop instructions, enough for a "long jump" to the required place, so in 2 jumps you can get from the old function to the new one. You can read more about how this is implemented in Windows and MSVC, for example, here.

Unfortunately, clang and gcc have nothing similar (at least on Linux and macOS). In fact, this is not such a big problem: we will simply write directly over the beginning of the old function. In this case, we risk trouble if our application is multi-threaded. Where in a multi-threaded environment we usually restrict access to data by one thread while another thread modifies it, here we need to prevent one thread from executing code while another thread modifies that code. I have not figured out how to do this yet, so the implementation will behave unpredictably in a multithreaded environment.

There is one subtle point. On a 32-bit system, 5 bytes are enough for us to "jump" to any place. On a 64-bit system, if we don't want to clobber registers, we need 14 bytes. The problem is that 14 bytes at machine-code scale is quite a lot, and if there is some stub function with an empty body in the code, it is likely to be shorter than 14 bytes. I don't know the whole truth, but I spent some time in the disassembler while I thought, wrote and debugged the code, and I noticed that all functions are aligned on a 16-byte boundary (debug build without optimizations; not sure about optimized code). This means that between the beginnings of any two functions there are at least 16 bytes, which is enough for us to hook them. Superficial googling led here; however, I don't know for sure whether I am simply lucky or whether all compilers do this today. In any case, if in doubt, simply declare a couple of variables at the beginning of the stub function so that it becomes large enough.

So, we have the first bit - the mechanism for redirecting functions from the old version to the new one.

Finding functions in a built program

Now we need to somehow get the addresses of all (not only exported) functions of our program or of an arbitrary dynamic library. This can be done quite simply using the system API, provided that symbols are not stripped from your application. On Linux this is the API from elf.h and link.h, on macOS - loader.h and nlist.h.

  • Using dl_iterate_phdr, walk through all the loaded libraries and, in fact, the program itself
  • Find the address at which each library is loaded
  • From the .symtab section, get all the information about the symbols, namely the name, the type, the index of the section in which the symbol lies, and the size; also compute each symbol's "real" address from its virtual address and the library load address

There is one subtlety. When loading an ELF file, the system does not load the .symtab section (correct me if I am wrong), and the .dynsym section does not suit us, because we cannot extract symbols with visibility STV_INTERNAL and STV_HIDDEN from it. Simply put, we will not see functions like this:

// some_file.cpp
namespace
{
    int someUsefulFunction(int value)  // <-----
    {
        return value * 2;
    }
}

and such variables:

// some_file.cpp
void someDefaultFunction()
{
    static int someVariable = 0;  // <-----
}

Thus, in the 3rd step we work not with the program image given to us by dl_iterate_phdr, but with the file we loaded from disk and parsed with some ELF parser (or with the bare API). This way we won't miss anything. On macOS the procedure is similar, only the function names in the system API are different.

After that we filter all the symbols and keep only:

  • Functions that can be reloaded - symbols of type STT_FUNC located in the .text section that have a non-zero size. Such a filter only passes functions whose code is actually contained in this program or library.
  • Static variables whose values need to be transferred - symbols of type STT_OBJECT located in the .bss section

Translation units

To reload the code, we need to know where to get the source code files and how to compile them.

In the first implementation, I read this information from the .debug_info section, which holds debug information in the DWARF format. In order for each translation unit (TU) within DWARF to contain the compile command of that TU, you have to pass the -grecord-gcc-switches flag when compiling. I parsed DWARF itself with the libdwarf library that comes bundled with libelf. Besides the compile command, you can also get information about the dependencies of our TUs on other files from DWARF. But I abandoned this implementation for several reasons:

  • The libraries are quite heavyweight
  • Parsing the DWARF of an application built from ~500 TUs, including dependency parsing, took a little more than 10 seconds

10 seconds at application startup is too much. After some deliberation, I rewrote the DWARF parsing logic to parse compile_commands.json instead. This file can be generated by simply adding set(CMAKE_EXPORT_COMPILE_COMMANDS ON) to your CMakeLists.txt. This way we get all the information we need.
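For reference, each entry in compile_commands.json looks roughly like this (paths and flags are illustrative), which gives us both the file and the exact command it was built with:

```json
[
  {
    "directory": "/home/user/project/build",
    "command": "/usr/bin/c++ -g -MD -I../include -o CMakeFiles/app.dir/src/main.cpp.o -c ../src/main.cpp",
    "file": "/home/user/project/src/main.cpp"
  }
]
```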

Dependency handling

Since we have abandoned DWARF, we need to find another way to handle dependencies between files. We really don't want to parse the files by hand searching for includes, and who knows more about dependencies than the compiler itself?

In clang and gcc there are a number of options that generate so-called depfiles almost for free. The make and ninja build systems use these files to resolve dependencies between files. Depfiles have a very simple format:

CMakeFiles/lib_efsw.dir/libs/efsw/src/efsw/DirectorySnapshot.cpp.o: \
  /home/ddovod/_private/_projects/jet/live/libs/efsw/src/efsw/base.hpp \
  /home/ddovod/_private/_projects/jet/live/libs/efsw/src/efsw/sophist.h \
  /home/ddovod/_private/_projects/jet/live/libs/efsw/include/efsw/efsw.hpp \
  /usr/bin/../lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/string \
  /usr/bin/../lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/x86_64-linux-gnu/c++/7.3.0/bits/c++config.h \
  /usr/bin/../lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/x86_64-linux-gnu/c++/7.3.0/bits/os_defines.h \

The compiler puts these files next to the object file of each TU; we need to parse them and put them into a hashmap. In total, parsing compile_commands.json plus the depfiles for the same ~500 TUs takes a little more than 1 second. For all this to work, we need to globally add the -MD flag to the compile options of all project files.

There is one subtlety associated with ninja. This build system generates depfiles for its own needs regardless of the -MD flag. But after they are generated, it translates them into its own binary format and deletes the source files. Therefore, when running ninja, you must pass the -d keepdepfile flag. Also, for reasons unknown to me, with make (and the -MD option) the file is named some_file.cpp.d, while with ninja it is called some_file.cpp.o.d. Therefore, you need to check for both variants.

Static variable transfer

Suppose we have code like this (a very synthetic example):

// Singleton.hpp
class Singleton
{
public:
    static Singleton& instance();
};

int veryUsefulFunction(int value);

// Singleton.cpp
Singleton& Singleton::instance()
{
    static Singleton ins;
    return ins;
}

int veryUsefulFunction(int value)
{
    return value * 2;
}
We want to change the function veryUsefulFunction to this:

int veryUsefulFunction(int value)
{
    return value * 3;
}

When reloading, the dynamic library with the new code will contain, besides veryUsefulFunction, both the static variable static Singleton ins; and the method Singleton::instance. As a result, the program will start calling the new versions of both functions. But the static ins in this library has not yet been initialized, so on first access to it the Singleton constructor will be called again. We certainly do not want this. Therefore, the implementation transfers the values of all such variables that it finds in the compiled dynamic library from the old code into this very dynamic library with the new code, along with their guard variables.

There is one subtle and, in general, unsolvable moment.
Suppose we have a class:

class SomeClass  // class name is illustrative
{
public:
    void calledEachUpdate() { m_someVar1++; }

private:
    int m_someVar1 = 0;
};

The method calledEachUpdate is called 60 times per second. We change the class by adding a new field:

class SomeClass
{
public:
    void calledEachUpdate() { m_someVar1++; m_someVar2++; }

private:
    int m_someVar1 = 0;
    int m_someVar2 = 0;
};

If an instance of this class lives in dynamic memory or on the stack, the application will most likely crash after the code is reloaded. The allocated instance contains only the variable m_someVar1, but after a reload the method calledEachUpdate will try to modify m_someVar2, changing memory that does not actually belong to this instance, which leads to unpredictable consequences. In this case, the responsibility for transferring the state falls on the programmer, who must somehow save the state of the object and delete the object itself before reloading the code, and create a new object after reloading. The library provides events for this in the form of the onCodePreLoad and onCodePostLoad delegate methods, which the application can handle.

I do not know how (or whether it is even possible) to resolve this situation in a general way; I will think about it. For now, this case works "more or less normally" only for static variables, where the following logic is used:

void* oldVarPtr = ...;
void* newVarPtr = ...;
size_t oldVarSize = ...;
size_t newVarSize = ...;
memcpy(newVarPtr, oldVarPtr, std::min(oldVarSize, newVarSize));

This is not very correct, but it is the best that I came up with.

As a result, the code will behave unpredictably if the set or layout of fields in data structures changes at runtime. The same applies to polymorphic types.

Putting it all together

How it all works together:

  • The library iterates over the headers of all libraries dynamically loaded into the process and, in fact, over the program itself, parsing and filtering the symbols.
  • Then the library tries to find the compile_commands.json file in the application directory and recursively in its parent directories, and extracts all the necessary information about the TUs from it.
  • Knowing the paths to the object files, the library loads and parses the depfiles.
  • After that, the deepest common directory of all the program's source files is computed, and recursive monitoring of this directory is started.
  • When a file changes, the library checks whether it is in the dependency hashmap, and if so, starts several background compilation processes for the modified files and their dependents, using the compile commands from compile_commands.json.
  • When the program asks to reload the code (in my application this is bound to Ctrl+r), the library waits for the compilation processes to finish and links all the new object files into a dynamic library.
  • Then this library is loaded into the process address space with the dlopen function.
  • Symbol information is loaded from this library, and over the entire intersection of the set of symbols from this library with the symbols already living in the process, functions are hooked (if it is a function) or values are transferred (if it is a static variable).

It all works very well, especially when you know what is under the hood and what to expect, at least at a high level.

Personally, I was very surprised by the lack of such a solution for Linux; is nobody really interested in this?

I will be glad to any criticism, thanks!

Link to the implementation
