OutOfLine - a memory placement pattern for high-performance C++ applications

Original author: Patrick Moran
  • Translation

While working at Headlands Technologies, I had the good fortune to write several utilities that simplify the creation of high-performance C++ code. This article gives a general overview of one of these utilities, OutOfLine.


Let's start with an illustrative example. Suppose you have a system that deals with a large number of file system objects. These can be ordinary files, named UNIX sockets, or pipes. For some reason, you open a lot of file descriptors at startup, then work intensively with them, and at the end close the descriptors and remove the links to the files (translator's note: this refers to the unlink function).


The initial (simplified) version may look like this:


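#include <fcntl.h>   // open, O_RDWR
#include <unistd.h>  // close, unlink
#include <string>

class UnlinkingFD {
  std::string path;  // cold: only needed in the destructor
 public:
  int fd;            // hot: used all the time
  UnlinkingFD(const std::string& p) : path(p) {
    fd = open(p.c_str(), O_RDWR, 0);
  }
  ~UnlinkingFD() { close(fd); unlink(path.c_str()); }
  UnlinkingFD(const UnlinkingFD&) = delete;
};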

And this is a good, logical design. It relies on RAII to automatically release the descriptor and unlink the file. You can create a large array of such objects, work with them, and when the array ceases to exist, the objects clean up everything themselves.


But what about performance? Suppose fd is used very often, while path is used only when an object is destroyed. The array now consists of 40-byte objects, of which only 4 bytes are used most of the time. This means more cache misses, because 90% of the data fetched has to be skipped over.


One common solution to this problem is to switch from an array of structures to a structure of arrays. This provides the desired performance, but at the cost of giving up RAII. Is there an option that combines the advantages of both approaches?
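As a rough illustration of the two layouts, here is a minimal sketch. The names FileEntry, FileTable, add and sum_fds are hypothetical and not part of the article's code; the sketch only shows where the hot and cold fields end up in memory, and it deliberately leaves out the cleanup that RAII used to do for us.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Array of structures: every element drags its cold `path` along.
struct FileEntry {  // hypothetical name, for illustration only
  int fd;
  std::string path;
};

// Structure of arrays: hot and cold fields live in separate arrays and
// are linked only by index. Closing and unlinking is now the caller's job.
struct FileTable {  // hypothetical name, for illustration only
  std::vector<int> fds;
  std::vector<std::string> paths;

  std::size_t add(int fd, std::string path) {
    fds.push_back(fd);
    paths.push_back(std::move(path));
    return fds.size() - 1;  // index shared by both arrays
  }
};

// A hot loop over the table touches only the densely packed `fds` array.
long long sum_fds(const FileTable& table) {
  long long sum = 0;
  for (int fd : table.fds) sum += fd;
  return sum;
}
```

With the second layout, nothing ties the lifetime of a path to the lifetime of its descriptor any more, which is exactly the loss of RAII mentioned above.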


A simple compromise is to replace the 32-byte std::string with a std::unique_ptr<std::string>, which is only 8 bytes in size. This shrinks our object from 40 bytes to 16 bytes, which is a great improvement. But this solution still loses to the use of several arrays.
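The sizes quoted above depend on the platform and the standard library. A small sketch like the following lets you check them on your own toolchain; the struct names are hypothetical. On a typical 64-bit Linux build with libstdc++ the three values should come out as 40, 16 and 4, but other standard libraries use a different std::string layout, so treat the exact numbers as an assumption.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical layouts mirroring the three variants discussed in the text.
struct InlinePath  { int fd; std::string path; };                   // cold data inline
struct PointerPath { int fd; std::unique_ptr<std::string> path; };  // cold data behind a pointer
struct FdOnly      { int fd; };                                     // cold data stored elsewhere

constexpr std::size_t kInlineSize  = sizeof(InlinePath);
constexpr std::size_t kPointerSize = sizeof(PointerPath);
constexpr std::size_t kFdOnlySize  = sizeof(FdOnly);

// Only the ordering is asserted; the absolute sizes vary between platforms.
static_assert(kFdOnlySize <= kPointerSize, "pointer variant cannot beat fd-only");
static_assert(kPointerSize <= kInlineSize, "inline string should not be smaller");
```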


OutOfLine is a tool that lets you move rarely used (cold) fields entirely outside the object without giving up RAII. OutOfLine is used as a CRTP base class, so the first template argument must be the derived class. The second argument is the type of the rarely used (cold) data associated with the frequently used (main) object.


struct UnlinkingFD : private OutOfLine<UnlinkingFD, std::string> {
  int fd;
  UnlinkingFD(const std::string& p) : OutOfLine<UnlinkingFD, std::string>(p) {
    fd = open(p.c_str(), O_RDWR, 0);
  }
  ~UnlinkingFD();
  UnlinkingFD(const UnlinkingFD&) = delete;
};

So what is this class like?


template <class FastData, class ColdData>
class OutOfLine {

The basic idea of the implementation is to use a global associative container that maps pointers to main objects to pointers to the objects holding their cold data.


  inline static std::map<OutOfLine const*, std::unique_ptr<ColdData>> global_map_;

OutOfLine can be used with any type of cold data; an instance of it is created and associated with the main object automatically.


  template <class... TArgs>
  explicit OutOfLine(TArgs&&... args) {
    global_map_[this] = std::make_unique<ColdData>(std::forward<TArgs>(args)...);
  }

Deleting the main object automatically deletes the associated cold object:


  ~OutOfLine() { global_map_.erase(this); }

When the main object is moved (by the move constructor or the move assignment operator), the corresponding cold object is automatically re-associated with the new main object. Consequently, you must not access the cold data of a moved-from object.


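  // std::move is required here: `other` is an lvalue inside the body, so a
  // plain `*this = other` would select the deleted copy assignment operator.
  explicit OutOfLine(OutOfLine&& other) { *this = std::move(other); }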
  OutOfLine& operator=(OutOfLine&& other) {
    global_map_[this] = std::move(global_map_[&other]);
    return *this;
  }

In this example implementation, OutOfLine is made non-copyable for simplicity. If needed, copy operations are easy to add: they just have to create and link a copy of the cold object.



  OutOfLine(OutOfLine const&) = delete;
  OutOfLine& operator=(OutOfLine const&) = delete;

Now, for this to be really useful, we need access to the cold data. A class that inherits from OutOfLine gets const and non-const cold() methods:


  ColdData& cold() noexcept { return *global_map_[this]; }
  ColdData const& cold() const noexcept { return *global_map_[this]; }

They return a reference to the cold data with the matching constness.


That's almost all. This version of UnlinkingFD is 4 bytes in size, provides cache-friendly access to the fd field, and retains the advantages of RAII. All work related to the object's life cycle is fully automated. When the frequently used main object is moved, the rarely used cold data moves with it. When the main object is destroyed, the corresponding cold object is destroyed as well.
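To see the life cycle in action without touching the file system, here is a minimal sketch that condenses the fragments above into one compilable unit (C++17, because of the inline static member). The derived type Tagged and the function lifecycle_demo are hypothetical names introduced only for this illustration, and the size check assumes the compiler applies the empty base optimization, which mainstream compilers do for a layout like this.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Condensed copy of the fragments shown in the article.
template <class FastData, class ColdData>
class OutOfLine {
  inline static std::map<OutOfLine const*, std::unique_ptr<ColdData>> global_map_;

 public:
  template <class... TArgs>
  explicit OutOfLine(TArgs&&... args) {
    global_map_[this] = std::make_unique<ColdData>(std::forward<TArgs>(args)...);
  }
  ~OutOfLine() { global_map_.erase(this); }

  explicit OutOfLine(OutOfLine&& other) { *this = std::move(other); }
  OutOfLine& operator=(OutOfLine&& other) {
    global_map_[this] = std::move(global_map_[&other]);  // re-link cold data
    return *this;
  }
  OutOfLine(OutOfLine const&) = delete;
  OutOfLine& operator=(OutOfLine const&) = delete;

  ColdData& cold() noexcept { return *global_map_[this]; }
  ColdData const& cold() const noexcept { return *global_map_[this]; }
};

// Hypothetical derived type: one hot int, one cold std::string.
struct Tagged : private OutOfLine<Tagged, std::string> {
  int value;
  Tagged(int v, const std::string& label)
      : OutOfLine<Tagged, std::string>(label), value(v) {}
  Tagged(Tagged&&) = default;  // the base move constructor re-links the label
  const std::string& label() const { return cold(); }
};

// The base class has no non-static members, so only the int contributes.
static_assert(sizeof(Tagged) == sizeof(int), "cold data is stored out of line");

bool lifecycle_demo() {
  Tagged a(7, "/tmp/example.sock");
  Tagged b(std::move(a));  // the cold string now belongs to b
  assert(b.label() == "/tmp/example.sock");
  return b.value == 7;     // a's (now empty) map entry is erased on scope exit
}
```

Note that the defaulted move constructor in Tagged is needed only because this sketch wants a movable type; the UnlinkingFD example in the article declares a destructor and a deleted copy constructor, so it would need its own move operations to be movable.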


Sometimes, however, your data conspires to complicate your life, and you face a situation in which the main data must be created first, for example because it is needed to construct the cold data. The objects then have to be created in the reverse order from the one OutOfLine provides. For such cases, we need an escape hatch to control the order of initialization and deinitialization.


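  struct TwoPhaseInit {};
  OutOfLine(TwoPhaseInit) {}  // creates no map entry and no cold data
  template <class... TArgs>
  void init_cold_data(TArgs&&... args) {
    // operator[] creates the entry; find() would return end() here because
    // the TwoPhaseInit constructor never inserted one.
    global_map_[this] = std::make_unique<ColdData>(std::forward<TArgs>(args)...);
  }
  void release_cold_data() { global_map_[this].reset(); }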

This is another OutOfLine constructor that derived classes can use; it accepts the tag type TwoPhaseInit. If you construct OutOfLine this way, the cold data is not initialized and the object remains half-constructed. To complete the two-phase construction, call the init_cold_data method, passing the arguments needed to create an object of type ColdData. Remember that you must not call .cold() on an object whose cold data has not yet been initialized. Similarly, the cold data can be destroyed early, before the ~OutOfLine destructor runs, by calling release_cold_data.


}; // end of class OutOfLine
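A minimal sketch of how two-phase initialization might be used follows. It trims OutOfLine down to the members relevant here and stores the cold data with operator[] in init_cold_data, since the tag constructor does not create a map entry. The Session type, its id member and two_phase_demo are hypothetical and exist only to show a case where the cold data is derived from the hot data.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Trimmed to the members that matter for two-phase initialization.
template <class FastData, class ColdData>
class OutOfLine {
  inline static std::map<OutOfLine const*, std::unique_ptr<ColdData>> global_map_;

 public:
  struct TwoPhaseInit {};
  explicit OutOfLine(TwoPhaseInit) {}  // phase one: no cold data yet
  ~OutOfLine() { global_map_.erase(this); }
  OutOfLine(OutOfLine const&) = delete;
  OutOfLine& operator=(OutOfLine const&) = delete;

  template <class... TArgs>
  void init_cold_data(TArgs&&... args) {
    global_map_[this] = std::make_unique<ColdData>(std::forward<TArgs>(args)...);
  }
  void release_cold_data() { global_map_[this].reset(); }
  ColdData& cold() noexcept { return *global_map_[this]; }
};

// Hypothetical example: the cold description is built from the hot id,
// so the hot member has to be initialized before the cold data.
struct Session : private OutOfLine<Session, std::string> {
  using Base = OutOfLine<Session, std::string>;
  int id;

  explicit Session(int i) : Base(Base::TwoPhaseInit{}), id(i) {
    init_cold_data("session-" + std::to_string(id));  // phase two
  }
  ~Session() {
    // Hot members are still alive here, so cleanup may use both.
    release_cold_data();  // cold data goes away before ~OutOfLine runs
  }
  std::string& description() { return cold(); }
};

std::string two_phase_demo() {
  Session s(42);
  return s.description();  // copies the cold string out before s is destroyed
}
```

The design choice is the usual one for tag-dispatched constructors: the tag costs nothing at run time, but it makes the half-constructed state explicit at the call site.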

Now that's it. So what do these 29 lines of code give us? They represent another possible tradeoff between performance and ease of use. When you have an object some of whose members are used much more often than others, OutOfLine can be an easy-to-use way to optimize cache usage, at the cost of significantly slower access to the rarely used data.


We have been able to apply this technique in several places. Quite often, heavily used working data needs to be supplemented with additional metadata that is only required at shutdown or in rare or unexpected situations. Whether it is information about the users who established a connection, the trading terminal an order came from, or the handle of a hardware accelerator processing exchange data, OutOfLine will keep the cache clean while you are on the critical path.


I prepared a test so you can see and appreciate the difference.


Scenario                                        Time (ns)
Cold data in the main object (initial version)  34684547
Cold data completely deleted (best scenario)    2938327
Using OutOfLine                                 2947645

With OutOfLine I got a speedup of about 10 times. Obviously, this test was designed to show off OutOfLine's potential, but it also shows how large an effect cache optimization can have on performance, and that OutOfLine makes this optimization attainable. Keeping the cache free of rarely used data can also yield a broad, hard-to-measure improvement in the rest of the code. As always with optimization, trust measurements over assumptions; nevertheless, I hope OutOfLine will be a useful tool in your utility toolbox.
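The article does not include the source of the test. As a rough sketch of how a comparable measurement could be set up on your own machine, the code below times a pass over the hot field for two element layouts. Everything in it (Fat, Slim, make_items, measure, Timing) is hypothetical and is not the author's benchmark; it makes no claim about the numbers you will see, which depend on the array size, compiler flags and hardware.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical element layouts: cold data inline vs. hot data only.
struct Fat  { int fd = 0; std::string path; };
struct Slim { int fd = 0; };

template <class T>
std::vector<T> make_items(std::size_t n) {
  std::vector<T> items(n);
  for (std::size_t i = 0; i < n; ++i) items[i].fd = static_cast<int>(i);
  return items;
}

struct Timing {
  std::int64_t sum;          // returned so the loop cannot be optimized away
  std::int64_t nanoseconds;  // wall-clock time of one pass
};

// Times a single pass that reads only the hot `fd` field of every element.
template <class T>
Timing measure(const std::vector<T>& items) {
  const auto start = std::chrono::steady_clock::now();
  std::int64_t sum = 0;
  for (const T& item : items) sum += item.fd;
  const auto stop = std::chrono::steady_clock::now();
  const auto ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
  return Timing{sum, static_cast<std::int64_t>(ns)};
}
```

To get a meaningful comparison, call measure on make_items<Fat>(n) and make_items<Slim>(n) with an n large enough that the arrays do not fit in cache, build with optimizations enabled, and repeat each measurement several times.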


Note from the translator


The code given in the article serves to demonstrate the idea and is not representative of production code.

