Fast C/C++ Cache, Thread Safety

Comparative testing of multi-threaded caches implemented in C/C++, and a description of how the LRU/MRU caches of the OnCache**RU series are structured


Over the decades, many caching algorithms have been developed: LRU, MRU, ARC, and others. However, when a cache was needed for multi-threaded work, googling the topic turned up one and a half options, and a question on StackOverflow remained unanswered. I found a solution from Facebook that relies on thread-safe containers from the Intel TBB repository. Intel TBB also has a multi-threaded LRU cache, but it is still in beta testing, so to use it you must explicitly define in your project:


#define TBB_PREVIEW_CONCURRENT_LRU_CACHE true

Otherwise, the compiler will report an error, since the Intel TBB code contains this check:


#if ! TBB_PREVIEW_CONCURRENT_LRU_CACHE
    #error Set TBB_PREVIEW_CONCURRENT_LRU_CACHE to include concurrent_lru_cache.h
#endif
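
For reference, here is a minimal sketch of how that preview cache is used, assuming the preview API of the time (the constructor takes a value-producing functor plus a capacity, and operator[] returns a handle that pins the value while it is alive):

#define TBB_PREVIEW_CONCURRENT_LRU_CACHE true
#include "tbb/concurrent_lru_cache.h"

#include <string>

static std::string load_value(int key) {  // invoked once per missing key
  return "value for key " + std::to_string(key);
}

int main() {
  // value-producing function + number of LRU history items
  tbb::concurrent_lru_cache<int, std::string, std::string (*)(int)>
      cache(&load_value, 1024);
  auto h = cache[42];          // miss: load_value(42) is computed and stored
  std::string &v = h.value();  // the handle keeps the value alive
  (void)v;
  return 0;
}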

I wanted to compare the performance of caches somehow: which one should I choose? Or should I write my own? Earlier, when comparing single-threaded caches (link), I received suggestions to try other conditions with other keys, and realized that a more convenient, extensible test bench was required. To make it easier to add competing algorithms to the tests, I decided to wrap them in a standard interface:


class IAlgorithmTester {
 public:
  IAlgorithmTester() = default;
  virtual ~IAlgorithmTester() { }
  // IConfig is presumed here; the shared_ptr template argument
  // follows the naming of the bench's other interfaces
  virtual void onStart(std::shared_ptr<IConfig> &cfg) = 0;
  virtual void onStop() = 0;
  virtual void insert(void *elem) = 0;
  virtual bool exist(void *elem) = 0;
  virtual const char *get_algorithm_name() = 0;
 private:
  IAlgorithmTester(const IAlgorithmTester &) = delete;
  IAlgorithmTester &operator=(const IAlgorithmTester &) = delete;
};
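
As an illustration of how a competitor plugs in, here is a hypothetical adapter (not from the repository; the class name, IConfig, and the use of the first key word are my assumptions) wrapping a mutex-guarded std::unordered_set:

#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_set>

class UnorderedSetTester : public IAlgorithmTester {
 public:
  void onStart(std::shared_ptr<IConfig> &cfg) override { (void)cfg; }
  void onStop() override { storage_.clear(); }
  void insert(void *elem) override {
    // for test case 1, elem points at uint64_t keyArray[3];
    // the first word serves as a stand-in key here
    std::lock_guard<std::mutex> lock(mtx_);
    storage_.insert(*reinterpret_cast<uint64_t *>(elem));
  }
  bool exist(void *elem) override {
    std::lock_guard<std::mutex> lock(mtx_);
    return storage_.count(*reinterpret_cast<uint64_t *>(elem)) != 0;
  }
  const char *get_algorithm_name() override { return "unordered_set+mutex"; }
 private:
  std::mutex mtx_;
  std::unordered_set<uint64_t> storage_;
};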

Working with the operating system, reading settings, test cases, and so on are wrapped in interfaces the same way. The sources are available in the repository. The bench contains two test cases: insert/search of up to 1,000,000 elements with a key of randomly generated numbers, and up to 100,000 elements with a string key (taken from 10 MB of wiki.train.tokens lines). To evaluate execution time, each tested cache is first warmed up to the target volume without any time measurement; then a semaphore releases the threads to start inserting and searching data. The number of threads and the test case settings are set in assets/settings.json. Step-by-step compilation instructions and a description of the JSON settings are in the repository Wiki. Time is measured from the moment the semaphore is released until the last thread stops. Here is what happened:


Test case 1 - a key as an array of random numbers, uint64_t keyArray[3]:


(chart: TestCase1, N threads)


Test case 2 - a key as a string:


(chart: TestCase2, N threads)


Please note that the volume of inserted/searched data increases tenfold at each step of a test case. I then divide the time spent processing each volume by 10, 100, 1000, respectively... If the algorithm behaves as O(n) in time complexity, the timeline will remain roughly parallel to the X axis. Now I will reveal the sacred knowledge of how the OnCache**RU series of algorithms achieved a 3-5x advantage over the Facebook cache when working with a string key:


  1. The hash function, instead of reading every letter of the string, simply casts the pointer to the string data to uint64_t keyArray[3] and sums the three integers. That is, it works like the "Guess the Melody" game show: I guess the melody from 3 notes... 3 * 8 = 24 letters if the text is Latin, which already scatters strings across the hash buckets fairly well (see the sketch after this list). Yes, many strings can collect in one hash bucket, and this is where the algorithm starts to accelerate:
  2. A skip list in each bucket makes it possible to jump quickly, first across different hashes (bucket id = hash % number of buckets, so different hashes can land in the same bucket), then within one hash along its vertices:
    (diagram: skip list inside a bucket)
  3. The nodes that store the keys and data are taken from a pre-allocated array of nodes whose size equals the cache capacity. An atomic identifier indicates which node to take next; when it reaches the end of the node pool, it starts over from 0, so the allocator goes in a circle, overwriting old nodes (the LRU cache in OnCacheMLRU; also sketched below).
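
A minimal sketch of tricks 1 and 3 (names are mine, and I use memcpy rather than the raw pointer cast so strings shorter than 24 bytes are handled safely):

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <cstring>
#include <string>

// Trick 1: hash only the first 3 machine words (24 bytes) of the string.
uint64_t first_words_hash(const std::string &s) {
  uint64_t keyArray[3] = {0, 0, 0};
  std::memcpy(keyArray, s.data(), std::min(s.size(), sizeof(keyArray)));
  return keyArray[0] + keyArray[1] + keyArray[2];
}

// Trick 3: a circular allocator over a pre-allocated node pool.
struct Node { /* key, value, skip-list links ... */ };

class RingPool {
 public:
  explicit RingPool(uint32_t capacity)
      : capacity_(capacity), nodes_(new Node[capacity]) {}
  ~RingPool() { delete[] nodes_; }
  // Wraps around the pool, overwriting the oldest nodes (LRU by age).
  Node *acquire() {
    uint32_t id = next_.fetch_add(1, std::memory_order_relaxed) % capacity_;
    return &nodes_[id];
  }
 private:
  uint32_t capacity_;
  Node *nodes_;
  std::atomic<uint32_t> next_{0};
};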

For cases where the cache should keep the elements most popular in search queries, there is a second class, OnCacheMMRU. The algorithm is as follows: besides the cache capacity, the class constructor takes a second parameter, uint32_t uselessness, a popularity threshold. If the number of search requests that hit the current node from the cyclic pool is below the uselessness threshold, the node is reused for the next insert operation (it is evicted). If, on this circle, the node's popularity (std::atomic<uint32_t> used{0}) is high, then at the moment the allocator requests it from the cyclic pool the node survives, but its popularity counter is reset to 0. The node thus lives through one more circle of the allocator's pass over the node pool and gets a chance to earn popularity again in order to keep existing (see the sketch below). That is, it is a mixture of the MRU algorithm (where the most popular elements stay in the cache forever) and MQ (where lifetime is tracked). The cache is constantly refreshed, and at the same time it works very fast: instead of 10 servers you can put up 1.
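
A hedged sketch of that eviction decision (field and function names here are mine, not the repository's):

#include <atomic>
#include <cstdint>

struct Node {
  std::atomic<uint32_t> used{0};  // incremented by successful searches
  // key, value, links ...
};

// Picks the next node for an insert: nodes popular enough survive one
// more circle, but their counters are reset so they must earn it again.
Node *acquire_mmru(Node *nodes, uint32_t capacity,
                   std::atomic<uint32_t> &next, uint32_t uselessness) {
  for (;;) {
    uint32_t id = next.fetch_add(1, std::memory_order_relaxed) % capacity;
    Node &n = nodes[id];
    if (n.used.exchange(0, std::memory_order_relaxed) < uselessness)
      return &n;  // unpopular: evict and reuse for the new element
    // popular: counter is now 0, the node lives another circle
  }
}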


By and large, the caching algorithm spends time on the following:


  1. Maintaining the cache infrastructure (containers, allocators, tracking the lifetime and popularity of elements)
  2. Hash calculation and key comparison operations when adding/searching data
  3. Search algorithms: red-black tree, hash table, skip list, ...

It remained to reduce the running time of each of these items, bearing in mind that the algorithm that is simplest in time complexity is often the most efficient, since any logic costs CPU cycles. That is, whatever you write must pay for itself in time compared to plain enumeration: while the next function call is being dispatched, an enumeration could have walked another hundred or two nodes. In this light, multi-threaded caches will always lose to single-threaded ones when run in single-threaded mode, since protecting buckets through std::shared_mutex and nodes through std::atomic_flag in_use is not free; a rough shape of that synchronization is sketched below. Therefore, for serving responses I use the single-threaded OnCacheSMRU cache in the main thread of the Epoll server (only long operations involving the DBMS, disk, and cryptography are moved out to parallel worker threads). For a comparative assessment, a single-threaded version of the test cases is used; the results follow the sketch:
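
A minimal sketch of the per-bucket and per-node protection the multi-threaded variants pay for (the structure layout is my assumption, following the description above):

#include <atomic>
#include <shared_mutex>

struct Node {
  std::atomic_flag in_use = ATOMIC_FLAG_INIT;  // pins a node so the circular
                                               // allocator cannot reuse it mid-read
  // key, value, skip-list links ...
};

struct Bucket {
  std::shared_mutex guard;  // shared for searches, exclusive for inserts
  Node *head = nullptr;     // entry point of the bucket's skip list
};

// Searches take the bucket lock in shared mode, so readers run in
// parallel; none of these costs exist in the single-threaded OnCacheSMRU.
bool exist_in_bucket(Bucket &b /*, const Key &key */) {
  std::shared_lock<std::shared_mutex> lock(b.guard);
  // ... walk the skip list, pinning nodes via in_use while reading them ...
  return false;  // placeholder
}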


Test case 1 - a key as an array of random numbers, uint64_t keyArray[3]:


(chart: TestCase1, 1 thread)


Test case 2 - a key as a string:


(chart: TestCase2, 1 thread)


In conclusion, I want to mention what else of interest can be extracted from the test bench sources:


  • The JSON parsing library, which consists of a single file, specjson.h, is a small, simple, fast algorithm for those who do not want to pull a few megabytes of someone else's code into their project just to parse a settings file or incoming JSON of a known format.
  • An approach that injects the implementation of platform-specific operations as (class LinuxSystem : public ISystem {...}) instead of the traditional (#ifdef _WIN32). It is more convenient to wrap, for example, semaphores, work with dynamically loaded libraries, and services this way: each class contains only the code and headers of one specific operating system. For another operating system, you inject another implementation (class WindowsSystem : public ISystem {...}), roughly as sketched below:
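    A minimal sketch of the shape of this injection (the method set is my assumption; only signal_semaphore actually appears in the snippets in this article):

    class ISystem {  // platform-neutral operations
     public:
      virtual ~ISystem() = default;
      virtual void signal_semaphore(int count) = 0;
      // ... dynamic libraries, services, etc.
    };

    class LinuxSystem : public ISystem {  // only POSIX code and headers here
     public:
      void signal_semaphore(int count) override {
        while (count--) { /* sem_post(&sem_); */ }
      }
     private:
      // sem_t sem_;  // initialized in the constructor
    };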
  • The bench is built with CMake; it is convenient to open the CMakeLists.txt project in Qt Creator or Microsoft Visual Studio 2017. Working with the project through CMakeLists.txt also automates some preparatory operations, for example, copying test case files and configuration files to the installation directory:

     # Copying assets (TODO: on any change, rerun CMake to copy):
     FILE(MAKE_DIRECTORY ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/assets)
     FILE(GLOB_RECURSE SpecAssets
         ${CMAKE_CURRENT_SOURCE_DIR}/assets/*.*
         ${CMAKE_CURRENT_SOURCE_DIR}/assets/*
     )
     FOREACH(file ${SpecAssets})
         FILE(RELATIVE_PATH
             ITEM_PATH_REL
             ${CMAKE_CURRENT_SOURCE_DIR}/assets
             ${file}
         )
         GET_FILENAME_COMPONENT(dirname ${ITEM_PATH_REL} DIRECTORY)
         FILE(MAKE_DIRECTORY ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/assets/${dirname})
         FILE(COPY ${file} DESTINATION ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/assets/${dirname})
     ENDFOREACH()

  • For those exploring the new features of C++17, this is an example of working with std::shared_mutex, std::allocator, and static thread_local in templates (there are nuances: where to allocate?), and of launching a large number of tests in threads in different ways while measuring the execution time:

    // Prepare insert threads:
    for (i = cnt_insert_threads; i; --i) {
      // the template arguments of promise/packaged_task below are presumed
      // (size_t is a stand-in; see the repository for the actual result type)
      std::promise<size_t> prom;
      fut_insert_results.emplace_back(prom.get_future());
      threads.emplace_back(std::thread(&TestCase2::insert_in_thread,
          this, curSize, std::move(prom), p_tester));
    }  // for insert
    // Prepare find threads:
    for (i = cnt_find_threads; i; --i) {
      std::packaged_task<size_t (TestCase2 *, int, IAlgorithmTester *)> ta(
            [](TestCase2 *i, int count, IAlgorithmTester *p_tester) {
        return i->find_in_thread(count, p_tester);
      });
      fut_find_results.emplace_back(ta.get_future());
      threads.emplace_back(
        std::thread(std::move(ta), this, curSize, p_tester));
    }  // for find
    // Banzai!!!
    auto start = std::chrono::high_resolution_clock::now();
    l_cur_system.get()->signal_semaphore(cnt_find_threads + cnt_insert_threads);
    std::for_each(threads.begin(), threads.end(), std::mem_fn(&std::thread::join));
    auto end = std::chrono::high_resolution_clock::now();

  • Step-by-step instructions on how to compile, configure, and run this test bench - see the Wiki.
    If there are no step-by-step instructions yet for your preferred operating system, I would be grateful for a contribution to the repository: an ISystem implementation and step-by-step compilation instructions (for the Wiki)... Or just write to me - I will try to find the time to spin up a virtual machine and describe the build steps.
