New x86 SIMD intrinsic library - immintrin debug

    With each new generation of Intel processors, new and increasingly complex vector instructions appear. Although the vector length (512 bits) will not grow in the near future, new data types and new kinds of instructions will keep arriving. For example, who can tell at a glance what the following intrinsic (and the corresponding processor instruction) does?

    Bitwise ternary logic that provides the capability to implement any three-operand binary function; the specific binary function is specified by value in imm8.

    __m512i _mm512_mask_ternarylogic_epi32 (__m512i src, __mmask16 k, __m512i a, __m512i b, int imm8)
    FOR j := 0 to 15
        i := j*32
        IF k[j]
            FOR h := 0 to 31
                index[2:0] := (src[i+h] << 2) OR (a[i+h] << 1) OR b[i+h]
                dst[i+h]   := imm8[index[2:0]]
            ENDFOR
        ELSE
            dst[i+31:i] := src[i+31:i]
        FI
    ENDFOR
    dst[MAX:512] := 0
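
    To make the role of imm8 concrete, here is a minimal C sketch of the pseudocode above for a single 32-bit lane, without the mask. The names ternlog32, make_imm8, bit_select and bit_xor3 are made up for this illustration and are not part of any library. imm8 is simply an 8-entry truth table: building it from a one-bit function shows, for example, that 0xCA means "src ? a : b" and 0x96 means "src ^ a ^ b".

```c
#include <stdint.h>

/* One 32-bit lane of ternary logic: at every bit position the three
   input bits form a 3-bit index into the 8-entry truth table imm8. */
static uint32_t ternlog32(uint32_t src, uint32_t a, uint32_t b, uint8_t imm8)
{
    uint32_t dst = 0;
    for (int h = 0; h < 32; h++) {
        unsigned index = (((src >> h) & 1u) << 2) |
                         (((a   >> h) & 1u) << 1) |
                          ((b   >> h) & 1u);
        dst |= (uint32_t)((imm8 >> index) & 1u) << h;
    }
    return dst;
}

/* Build imm8 from an arbitrary one-bit function f(src, a, b):
   bit number (src<<2 | a<<1 | b) of imm8 holds f(src, a, b). */
static uint8_t make_imm8(unsigned (*f)(unsigned, unsigned, unsigned))
{
    uint8_t imm8 = 0;
    for (unsigned index = 0; index < 8; index++) {
        unsigned s = (index >> 2) & 1u;
        unsigned a = (index >> 1) & 1u;
        unsigned b = index & 1u;
        imm8 |= (uint8_t)((f(s, a, b) & 1u) << index);
    }
    return imm8;
}

/* Two example one-bit functions (hypothetical helpers). */
static unsigned bit_select(unsigned s, unsigned a, unsigned b) { return s ? a : b; }
static unsigned bit_xor3(unsigned s, unsigned a, unsigned b)   { return s ^ a ^ b; }
```

    With these helpers, ternlog32(src, a, b, make_imm8(bit_select)) picks each bit from a where the src bit is 1 and from b where it is 0.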
    

    OK, let's say we have figured out how it works. The next level of difficulty is debugging code that makes heavy use of such intrinsics.

    Anyone who regularly uses intrinsics knows a very useful site: the Intel Intrinsics Guide. If you look carefully at how it works, you will notice that the JavaScript front end downloads a data-3.x.x.xml file that describes every intrinsic in detail, using pseudocode that resembles Matlab. (For example, the listing I copied at the top of this post.)

    But when we use intrinsics to speed up code, we write in C and C++, not in Matlab! Three months ago a client asked me whether there was a C implementation of the vector intrinsics that could be used for debugging, and I decided to write a parser that translates the code from the Intrinsics Guide into C. The result is a library that implements almost all intrinsics in such a way that you can step into them with a debugger (or add a debug printf).

    For example, the operation shown at the top of this post turns into:

    for (int j = 0; j <= 15; j++) {
      if (k & (1 << j)) {
        for (int h = 0; h <= 31; h++) {
          int index =  ((((src_vec[j] & (1 << h)) >> h) << 2) |
                       (((a_vec[j] & (1 << h)) >> h) << 1) |
                       ((b_vec[j] & (1 << h)) >> h)) & 0x7;
          dst_vec[j] = (dst_vec[j] & ~(1 << h)) |
                       ((((imm8 & (1 << index)) >> index)) << h);
        }
      } else {
        dst_vec[j] = src_vec[j];
      }
    }
    

    Much easier to understand, isn't it? Not really? Well, I deliberately picked a complicated function as an example. Usually, when you debug code with intrinsics (DSP code, for example), you have to keep in mind both the algorithm and the peculiarities of each instruction. The instructions operate on long vectors, and DSP algorithms are often built on serious mathematics, so my head cannot cope: there is not enough short-term memory and concentration. I suspect I am not alone. Several times I even thought I had found a bug in an instruction. Each time, of course, it turned out that I was wrong, and I never got to report a new FDIV bug. But had I been able to step through the inside of the instructions in those cases, I would have seen immediately under what conditions an unexpected value appears in a component of my vector.
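
    As a rough illustration of what "stepping inside an instruction" buys you, here is a small C sketch of a lane-by-lane emulation of a masked 32-bit add (the semantics of _mm512_mask_add_epi32), plus a helper that reports the first lane that differs from an expected vector. The type vec512_i32 and the functions dbg_mask_add_epi32 and first_bad_lane are hypothetical names for this sketch only; the real library may be organized quite differently.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a 512-bit register: 16 lanes of 32 bits. */
typedef struct { int32_t lane[16]; } vec512_i32;

/* Scalar emulation of a masked 32-bit add: lanes whose mask bit is set
   receive a + b, all other lanes are copied from src. A breakpoint or
   printf inside the loop shows exactly which branch each lane takes. */
static vec512_i32 dbg_mask_add_epi32(vec512_i32 src, uint16_t k,
                                     vec512_i32 a, vec512_i32 b)
{
    vec512_i32 dst;
    for (int j = 0; j < 16; j++) {
        if (k & (1u << j)) {
            /* Unsigned arithmetic gives the same wrap-around as the hardware. */
            dst.lane[j] = (int32_t)((uint32_t)a.lane[j] + (uint32_t)b.lane[j]);
        } else {
            dst.lane[j] = src.lane[j];
        }
    }
    return dst;
}

/* Return the index of the first lane that differs from the expected
   vector and print it, or return -1 if all 16 lanes match. */
static int first_bad_lane(vec512_i32 got, vec512_i32 expected)
{
    for (int j = 0; j < 16; j++) {
        if (got.lane[j] != expected.lane[j]) {
            printf("lane %d: got %d, expected %d\n",
                   j, (int)got.lane[j], (int)expected.lane[j]);
            return j;
        }
    }
    return -1;
}
```

    Because every lane is an ordinary array element, a conditional breakpoint on a single value of j is enough to watch one component of the vector.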

    Customers have told me that they use this library to debug individual functions written with AVX-512 intrinsics on a laptop that only supports AVX2. Of course, Intel SDE is much better suited for this, because it emulates all instruction sets very accurately. I have a set of unit tests (also generated automatically) that, for each intrinsic in the library, compare its result with the result of executing the corresponding assembler instruction. As befits unit tests, most of them pass. But some floating-point debug intrinsics (both double and single precision) do not always give exactly the right result. I would say the behavior is sometimes a bit like -ffast-math. And there are the different rounding modes too! IEEE 754 has many subtleties ...
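
    One generic way such floating-point discrepancies can arise (not necessarily the cause in this library) is the order of operations: floating-point addition is not associative, so a scalar loop and a tree-shaped vector reduction can legitimately disagree. The sketch below, with the made-up helpers sum_sequential and sum_pairwise, adds the same four floats in two different orders.

```c
/* Left-to-right sum of four floats: ((x0 + x1) + x2) + x3.
   volatile forces every intermediate to be rounded to float. */
static float sum_sequential(const float x[4])
{
    volatile float acc = x[0];
    acc = acc + x[1];
    acc = acc + x[2];
    acc = acc + x[3];
    return acc;
}

/* Pairwise (tree) sum of the same values: (x0 + x1) + (x2 + x3). */
static float sum_pairwise(const float x[4])
{
    volatile float lo = x[0] + x[1];
    volatile float hi = x[2] + x[3];
    return lo + hi;
}
```

    For the input {1e8f, 1.0f, -1e8f, 1.0f} the sequential sum is 1.0f, while the pairwise sum is 0.0f, because 1.0f is lost when it is added to a value as large as 1e8f. A bit-exact comparison against hardware therefore has to reproduce the order of operations as well as the operations themselves.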

    There is another important point about using immintrin debug instead of SDE (which I do not endorse in any way, but cannot prevent). If you compile with gcc or clang using an option such as -march=nehalem, they return 512-bit vectors from functions on the stack, whereas ICC still returns them in ZMM0. So the Intel compiler cannot be used in this mode. gcc also has a useful option, -Og, which helps with debugging in general and with immintrin debug in particular.
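
    The gcc/clang behavior described above can be tried out with the compilers' generic vector extension. This is only a sketch under the assumption that a 64-byte vector_size type is a fair stand-in for a 512-bit vector; the type name v16i32 and both functions are invented here, and nothing below is taken from immintrin_dbg.h.

```c
#include <stdint.h>

/* GCC/Clang vector extension: a 64-byte (512-bit) vector of sixteen
   32-bit integers. The name v16i32 is made up for this sketch. */
typedef int32_t v16i32 __attribute__((vector_size(64)));

/* Returns a 512-bit vector by value. When AVX-512 is not enabled,
   gcc and clang split the addition into narrower operations and
   return the result through memory rather than in a ZMM register
   (gcc may print a -Wpsabi note about this). */
static v16i32 add_v16i32(v16i32 a, v16i32 b)
{
    return a + b;
}

/* Sum of all 16 lanes, using element access on the vector type. */
static int32_t hsum_v16i32(v16i32 v)
{
    int32_t s = 0;
    for (int j = 0; j < 16; j++)
        s += v[j];
    return s;
}
```

    Compiling this with, say, -march=nehalem -Og -g produces code that runs on a machine without AVX-512 and can be stepped through lane by lane.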

    There are several intrinsics whose main effect is to change, for example, the contents of a register or the flags. I did not implement those. Also, my parser is not quite finished yet, so about 10% of the intrinsics are not implemented so far.

    Using immintrin debug is very simple: you do not need to change your code, apart from adding conditional compilation that includes immintrin_dbg.h instead of immintrin.h in debug builds.
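
    The conditional include might look like the fragment below. The macro name USE_IMMINTRIN_DBG is an assumption made for this example; any macro that your debug build defines will do.

```c
/* Pick the emulated intrinsics in debug builds and the real ones otherwise.
   USE_IMMINTRIN_DBG is a hypothetical macro name, defined for example
   on the compiler command line of the debug configuration. */
#ifdef USE_IMMINTRIN_DBG
#include "immintrin_dbg.h"
#else
#include <immintrin.h>
#endif
```

    A debug build with gcc could then add something like -DUSE_IMMINTRIN_DBG -Og -g to the compiler flags, while the release build stays unchanged.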

    You can download it from GitHub.
