Understanding Shared Bank Conflicts in NVIDIA CUDA

    Shared (shared) memory is a very effective means of optimization due to very fast access (100 times faster than global memory). However, if used improperly, bank conflicts are possible, which significantly slow down the performance. This article will discuss how these conflicts arise and how to avoid them.

    How do shared memory conflicts arise?



    Conflicts arise when 2 or more streams from the same warp (for devices version 2.0) or half of the warp (for device version 1.3 and lower) access bytes that belong to different 32 bit words located in the same memory bank. In case of conflict, access is sequential. The number of threads accessing the bank is called the degree of conflict. If the degree of conflict is N, then access is N times slower than if there was no conflict.

    Broadcast access mechanism


    On devices of version 1.x, a conflict can be avoided if several threads access the same word belonging to the same bank, and only if this request is single - in this case, the broadcast access mechanism is used.

    On devices of version 2.x, there can be several such requests and they will be carried out in parallel (different streams can access different bytes of the word).

    Access features on devices version 2.0


    With 64 bit access, a bank conflict occurs only if 2 or more streams from any of the half of the warp access the addresses belonging to the same bank.

    With 128-bit access, conflicts of banks of the second degree usually arise.

    Access with a capacity greater than 32 is broken into requests with a capacity of 32, 64 and 128 bits.

    How memory is allocated to banks


    The memory is distributed among banks in such a way that each 32 bit word in a sequence is sequentially assigned to one of 32 banks in case of device version 2.0 and 16 banks in case of device version 1.3 and lower. Accordingly, the bank number can be calculated by the following formula:

    Bank number = (Address in bytes / 4)% 32 - for device version 2.0
    Bank number = (Address in bytes / 4)% 16 - for device version 1.x

    Conflicting Memory Access Examples



    For devices version 1.x

    1. 8 and 16 bit access In this example, the first 4 bytes are in the same bank, so the first 4 bytes will conflict during access. The problem is solved by adding redundant data (padding) and changing the access scheme: For 16-bit access : In this example, the first 2 shorts are in the same bank, so the first 2 flows will conflict during access. The problem is solved similarly to 8-bit access: 2. 32-bit access For this type of access, bank conflicts are less obvious, but can occur when, for example such a scheme to stupa: In this case, the 0th and 8th stream are read from 0 and 1 banks, respectively, thus creating a conflict of the 2nd degree.

    __shared__ char shmem8[32];
    char data = shmem8[threadIdx.x];






    __shared__ char shmem8[32*4];
    char data = shmem8[threadIdx.x*4];




    __shared__ short shmem16[32];
    short data = shmem16[threadIdx.x];






    __shared__ short shmem16[32*2];
    short data = shmem16[threadIdx.x*2];






    __shared__ int shmem32[64];
    int data1 = shmem32[threadIdx.x*2];
    int data2 = shmem32[threadIdx.x*2+1];




    This problem can be solved, for example, as follows: For devices of version 2.0 Due to the features of broadcast access, 8 and 16 bit access schemes on these devices do not cause bank conflicts, however, a conflict can arise in the following case: The conflict occurs if s is even. If s is odd, but no conflicts arise.

    __shared__ int shmem32_1[32];
    __shared__ int shmem32_2[32];
    int data1 = shmem32_1[threadIdx.x];
    int data2 = shmem32_2[threadIdx.x];






    __shared__ int shared[64];
    int data = shared[threadIdx.x*s];



    Bank Conflict Tracking



    NVIDIA Banck Checker


    Conflicts can be tracked by using the CUT_BANK_CHECKER (array, index) macro, which is part of the CUDA Utility Toolkit. To do this, you must use this macro to access memory and run the application in emulation mode. When the application is completed, a conflict report will be printed.

    For example, like this:

    __shared__ int shared[64];
    int data = CUT_BANK_CHECKER(shared, threadIdx.x*s);


    CUDA Profiler


    You can also use a profiler to track conflicts. This information is displayed in the warp serialize section . This counter shows the number of warps that need to serialize their access when addressing constant or shared memory, in other words, this counter shows bank conflicts.

    Conclusion



    In conclusion, I note that the most effective method for resolving bank conflicts is to develop access schemes that minimize their occurrence and subsequent analysis of the application by the profiler (which is never superfluous).

    Also popular now: