How cache affects multithreaded applications

    The theoretical component.

    It so happened, the bottleneck in many computers is the RAM - processor interface. This is due to the significant time it takes to form a request to memory and the fact that the memory frequency is lower than the processor frequency. In the general case, while the data is being received from memory, program execution is stopped. To improve the situation, a specialized high-speed memory is used - the cache.

    If there is a cache, if the processor needs to access a specific byte in the RAM, then the memory area containing this byte (the size of the area depends on the processor) is selected in the cache, and then the processor reads the data from the cache. If the processor writes data to memory, then it first goes to the cache, and then to RAM.

    In multi-core and multiprocessor systems, each core or processor has its own personal cache. If the data in the cache is modified by one processor, the changes must be synchronized with other processors that use the overlapping address space.
    If changes occur frequently and data areas overlap, performance will decline.

    I have a question how much things can be bad. In addition, it was interesting to see how the length of the data affects.


    Experiment.
    For verification, 2 multithreaded C ++ programs were written for two cases of intersection of workspaces:

    • The areas do not overlap and are on the stack of each thread;
    • The workspace for threads is common, it is an array of 1, 4 and 8 byte words, with a total length of 1024 bytes, each of the threads modifies its words (for example, only even or only odd).


    There are 2 working threads, each, changing its words, goes through the entire array and, having reached the border, returns to the beginning and so on. Each of the threads makes 100 million changes.

    UPD Thank you mikhanoid for the typo in the name of the CPU.
    For testing, we used the HP xw4600 workstation, CPU Core 2 Quad (Q9300), OS Linux Slamd64 12.1, kernel 2.6.24.5, and gcc 4.2.3 compiler. The result was a 64-bit program, therefore, any arithmetic operation for words 1.4 and 8 bytes long was performed in 1 clock cycle.

    The execution time was measured with the time (1) command, and averaged over 5 experiments.

    At first, a program was written that simply increments the word and puts it back, but it turned out that in the case of an array on the stack, the -O2 optimization does something indecent. Therefore, a second program was written that does something more complex with the data.

    Results.
    UPD: Thnx for commenting on mt_ and Google Charts for api charts. The result is in graphical form.
    Those who want to see the numbers can scroll further.

    On the Y axis, the real time of program execution in seconds. On the X axis, the word length (in pairs).

    Program 1.

    image

    Program 2.

    image

    Program 2 + optimization -O2.

    image

    conclusions
    We can say for sure that in the case of obviously non-overlapping areas it works faster, and significantly. Optimization somehow affects, but the trend continues.

    But the dependence on the length of the word is somehow not very noticeable.

    The idea of ​​this opus is inspired by the article www.ddj.com/architect/208200273 .

    UPD Trying to bring Perfect Code to my blog, it seems to me that this is the right place.

    The results are in tabular form.
    Results are presented in seconds.

    Program 1 (shared memory).
    Word length 64 32 8
    Real time 1,297 1,532 0.846
    User Time regime 2,461 3,049 1,664


    Program 1 (stack)
    Word length64328
    Real time0.4530.4310.441
    User Time regime0.8510.8420.866


    Program 2 (shared memory).
    Word length64328
    Real time9,1919,0338,921
    User Time regime18,36518,03917,824


    Program 2 (stack)
    Word length64328
    Real time with8,4328,4128,808
    User Time Mode c16,84316,79617,548


    Program 2 + optimization -O2 (shared memory).
    Word length64328
    Real time4,2474,3803,473
    User Time regime8,4158,9606,781


    Program 2 + optimization -O2 (stack)
    Word length64328
    Real time3,2823,7183,308
    User Time regime6,5507,3845,565


    Source.

    Program 1. Program 2.

    #include

    #include
    #include

    #include

    #define TBYTES 1024
    //
    #define TOTAL 2
    // определяет тип слова u_int8_t u_int32_t u_int64_t
    #define MTYPE u_int8_t
    //
    #define MAXAR TBYTES/sizeof(MTYPE)
    #define DOIT 100000000

    MTYPE *array;

    pthread_mutex_t mux = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    int workers = 2;

    // Определяет используется ли общая память
    //#define SHARED 1

    void*
    runner(void* args){
      unsigned int which = *(unsigned int*)args; delete (unsigned int*)args;
      unsigned int index = which;
      MTYPE array1[MAXAR];
      //
      for ( unsigned int i = 0; i < DOIT; ++i){
        if ( index >= MAXAR)
          index = which;
        //
    #if defined(SHARED)
        array[index] = array[index] + 1;
    #else
        array1[index] = array1[index] + 1;
    #endif
        //
        index += TOTAL;
      }
      //
      pthread_mutex_lock(&mux);
      -- workers;
      pthread_mutex_unlock(&mux);
      pthread_cond_broadcast(&cond);
      //
      return 0;
    };  

    int
    main() {
      //
      array = (MTYPE*)calloc(MAXAR, sizeof(MTYPE));
      //
      unsigned int *which;
      pthread_t t1;
      pthread_attr_t attr;
      //
      pthread_attr_init(&attr);
      pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
      //
      which = new unsigned int(0);
      pthread_create(&t1, &attr, runner, (void*)which);
      //
      which = new unsigned int(1);
      pthread_create(&t1, &attr, runner, (void*)which);
      //
      pthread_mutex_lock(&mux);
      while(!pthread_cond_wait(&cond, &mux)){
        if ( !workers)
          break;
      };
      pthread_mutex_unlock(&mux);
      //
      return 0;
    };
    //  

    * This source code was highlighted with Source Code Highlighter.





    #include

    #include
    #include

    #include

    #define TBYTES 1024
    //
    #define TOTAL 2
    // определяет тип слова u_int8_t u_int32_t u_int64_t
    #define MTYPE u_int64_t
    //
    #define MAXAR TBYTES/sizeof(MTYPE)
    //#define DOIT 100000000 * (sizeof(u_int64_t) / sizeof(MTYPE))
    #define DOIT 100000000

    MTYPE *array;

    pthread_mutex_t mux = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    int workers = 2;

    template
    RTYPE scramble(u_int16_t *seed, RTYPE a){
      int d;
      //
      for ( d = 0; d < 8; ++d){
        u_int16_t mask;
        //
        mask = (*seed & 0x1) ^ ( (*seed >> 1) &0x1);
        mask = mask & 0x1;
        *seed = (mask << 15) | (*seed >> 1);
        //
        a = a ^ ((RTYPE)mask << d);  
      }
      //
      return a;
    };

    // Определяет используется ли общая память
    //#define SHARED 1

    void*
    runner(void* args){
      unsigned int which = *(unsigned int*)args; delete (unsigned int*)args;
      unsigned int index = which;
      u_int16_t seed = 0x4a80;
      MTYPE array1[MAXAR];
      //
      for ( unsigned int i = 0; i < DOIT; ++i){
        MTYPE data;
        if ( index >= MAXAR)
          index = which;
        //
    #if defined(SHARED)
        data = array[index];
        array[index] = scramble(&seed, data);
    #else
        data = array1[index];
        array1[index] = scramble(&seed, data);
    #endif
        //
        index += TOTAL;
      }
      //
      pthread_mutex_lock(&mux);
      -- workers;
      pthread_mutex_unlock(&mux);
      pthread_cond_broadcast(&cond);
      //
      return 0;
    };  

    int
    main() {
      //
      array = (MTYPE*)calloc(MAXAR, sizeof(MTYPE));
      //
      unsigned int *which;
      pthread_t t1;
      pthread_attr_t attr;
      //
      pthread_attr_init(&attr);
      pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
      //
      which = new unsigned int(0);
      pthread_create(&t1, &attr, runner, (void*)which);
      //
      which = new unsigned int(1);
      pthread_create(&t1, &attr, runner, (void*)which);
      //
      pthread_mutex_lock(&mux);
      while(!pthread_cond_wait(&cond, &mux)){
        if ( !workers)
          break;
      };
      pthread_mutex_unlock(&mux);
      //
      return 0;
    };
    //  

    * This source code was highlighted with Source Code Highlighter.



    Also popular now: