The developer at the crossroads: how to vectorize?!


    A lot of interesting things have been written on the topic of vectorization. There is, for example, an excellent post that explains a great deal about how auto-vectorization works; I would highly recommend reading it. But I am interested in another question. Developers today have a large number of ways to create "vector" code, from pure assembler all the way to that same auto-vectorizer. Which way should one settle on? How do we find the balance between the necessary and the sufficient? This is what we will talk about.

    So, you can get at the "cherished" vector instructions in several ways. Schematically, from the most manual to the most automatic, they line up like this:

      - pure assembler
      - intrinsic functions
      - SIMD intrinsic classes
      - compiler auto-vectorization, optionally helped by hints (#pragma ivdep, #pragma vector always)
      - #pragma omp simd (OpenMP 4.0) and #pragma simd (Intel Cilk Plus)
      - Cilk Plus array notation

    If we are experienced gurus and can afford to write in pure assembler, then this way will perhaps give us 100% confidence that our code uses the maximum: we write directly in the necessary instructions and exploit all the capabilities of the processor. The catch is that the result is "sharpened" for a specific instruction set, and therefore for specific hardware. The release of the next new instructions (and progress does not stand still) will require a global rework and new labor costs. Obviously, something easier to use is worth considering. And at the next step, intrinsic functions appear.
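
    Just for illustration, a hedged sketch of what this route can look like when driven from C/C++ via GCC-style inline assembler (the add4 helper is hypothetical; assumes an AVX-capable CPU):

    // Adds four doubles from a and b, stores the result into c
    static void add4(const double *a, const double *b, double *c) {
        __asm__ volatile(
            "vmovupd (%0), %%ymm0\n\t"          // load 4 doubles from a
            "vaddpd  (%1), %%ymm0, %%ymm0\n\t"  // add 4 doubles from b
            "vmovupd %%ymm0, (%2)\n\t"          // store 4 results to c
            :
            : "r"(a), "r"(b), "r"(c)
            : "xmm0", "memory");
    }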

    This is no longer pure assembler; however, rewriting code this way still takes a lot of time. Say, a simple loop that adds two arrays will look like this:

    #include <immintrin.h>
    // The aligned _mm256_load/store_pd require 32-byte alignment
    alignas(32) double A[100], B[100], C[100];
    for (int i = 0; i < 100; i += 4) {    // 4 doubles per 256-bit register
              __m256d a = _mm256_load_pd(&A[i]);
              __m256d b = _mm256_load_pd(&B[i]);
              __m256d c = _mm256_add_pd(a, b);  // C[i..i+3] = A[i..i+3] + B[i..i+3]
              _mm256_store_pd(&C[i], c);
    }
    

    In this case we use the AVX intrinsic functions, and thus we guarantee the generation of the corresponding AVX instructions; that is, we are again tied to specific hardware. The labor costs have gone down, but we will not be able to use this code forever either: sooner or later it will have to be rewritten again. And this will always be the case as long as we choose the instructions explicitly, "prescribing" them in the source code, whether it is pure assembler, intrinsic functions or SIMD intrinsic classes. The latter, by the way, are an interesting thing in themselves, representing the next level of abstraction.

    The same example rewritten with these classes looks like this:

    #include <dvec.h>  // Intel C++ SIMD class library
    // 4 elements per vector * 25 = 100 elements
    F64vec4 A[25], B[25], C[25];
    for(int i = 0; i < 25; i++)
      C[i] = A[i] + B[i];   // operator+ compiles to an AVX vector add
    

    In this case we no longer need to know which functions to call. The code itself looks quite elegant, and it is enough for the developer to declare data of the desired class. In this example F64 means a 64-bit floating-point type (double), and vec4 means four elements per vector, which corresponds to Intel AVX (vec2 corresponds to SSE).
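
    For comparison, a minimal sketch of the same loop in the SSE flavor of the classes (again assuming Intel's dvec.h header):

    #include <dvec.h>
    // 2 elements per vector * 50 = 100 elements
    F64vec2 A[50], B[50], C[50];
    for (int i = 0; i < 50; i++)
      C[i] = A[i] + B[i];   // operator+ compiles to an SSE2 vector add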

    I think everyone understands why even this method cannot be called the best in terms of the price/quality ratio. That's right: portability is still imperfect. Therefore, a reasonable solution is to let the compiler solve such problems. Simply by rebuilding our code with the appropriate target option (say, -mavx with GCC or -xAVX with the Intel compiler), we can create binaries for whatever architecture we need and use the latest instruction sets. In doing so, we need to make sure that the compiler really did vectorize the code, for example by checking its vectorization report.

    So far we have been walking down the table discussing the "complex" ways of vectorizing code. Now let's talk about the simpler ones.
    Obviously, the easiest is to shift all the responsibility onto the compiler and enjoy life. But it is not that simple. No matter how smart the compiler is, there are still many cases where it is powerless to do anything with a loop without additional data or hints. Moreover, in some cases code that was successfully vectorized by one version of the compiler is no longer vectorized by another. It all comes down to tricky compiler heuristics, so you cannot rely on auto-vectorization 100 percent, although it is certainly a useful thing. For example, a modern compiler can vectorize code like this:

    double A[1000], B[1000], C[1000], D[1000], E[1000];
    for (int i = 0; i < 1000; i++)
      E[i] = (A[i] < B[i]) ? C[i] : D[i];
    

    If we tried to write equivalent code with intrinsic functions, guaranteeing vectorization, we would get something like this:
    #include <emmintrin.h>
    // The aligned _mm_load/store_pd require 16-byte alignment
    alignas(16) double A[1000], B[1000], C[1000], D[1000], E[1000];
    for (int i = 0; i < 1000; i += 2) {    // 2 doubles per 128-bit register
      __m128d a = _mm_load_pd(&A[i]);
      __m128d b = _mm_load_pd(&B[i]);
      __m128d c = _mm_load_pd(&C[i]);
      __m128d d = _mm_load_pd(&D[i]);
      __m128d mask = _mm_cmplt_pd(a, b);   // all-ones lanes where A[i] < B[i]
      __m128d e = _mm_or_pd(               // select C where mask, D elsewhere
              _mm_and_pd (mask, c),
              _mm_andnot_pd(mask, d));
      _mm_store_pd(&E[i], e);
    }
    

    It's good when the compiler can do all this for us! It's a pity that it can't always... and in those cases when the compiler fails, the developer can help it. For this there exist special "tricks" in the form of directives. For example, #pragma ivdep tells the compiler that there are no dependencies in the loop, and #pragma vector always lets it ignore the "efficiency policy" of vectorization (often, if the compiler believes that vectorizing a loop is inefficient, say because of the memory access pattern, it simply does not do it). But these directives are from the "maybe it will help" category: if the compiler is sure that there are dependencies, it will not vectorize the loop even with #pragma ivdep in place.
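
    A small sketch of how these hints are typically applied; the function and its offset parameter are hypothetical, for illustration (#pragma ivdep and #pragma vector always are honored by the Intel compiler):

    // Without a hint the compiler cannot prove that dst and src
    // do not overlap, so it may refuse to vectorize this loop.
    void copy_shifted(double *dst, const double *src, int n, int k) {
    #pragma ivdep          // promise: no loop-carried dependencies
    #pragma vector always  // vectorize even if heuristics say "not worth it"
      for (int i = 0; i < n; i++)
        dst[i] = src[i + k];
    }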

    That is why I singled out one more way, which is also based on directives, but on somewhat different principles of operation. These are the directives from the new OpenMP 4.0 standard and from Intel Cilk Plus: #pragma omp simd and #pragma simd, respectively. They let the compiler completely "forget" about its own checks and rely entirely on what the developer says. The responsibility in this case naturally rests on the developer's shoulders and head, so one needs to act carefully. Hence the need for yet one more method.
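
    A minimal sketch of the directive in action (the saxpy function is a hypothetical example; requires a compiler with OpenMP 4.0 support, e.g. GCC with -fopenmp or the Intel compiler):

    // With #pragma omp simd the compiler skips its own dependence checks
    // and vectorizes unconditionally -- correctness is on the developer.
    void saxpy(float *y, const float *x, float a, int n) {
    #pragma omp simd   // Intel Cilk Plus analogue: #pragma simd
      for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
    }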

    How can we make sure that the checks still remain, but the code is guaranteed to be vectorized? With the syntax that exists in standard C/C++ there is, unfortunately, no way. But it becomes possible with the special syntax for working with arrays (array notation) that is part of Cilk Plus (see the previous post for an idea of everything it includes). Moreover, the syntax is very simple, somewhat reminiscent of Fortran, and has the following form:

    base[first:length:stride]
    

    Specify the array name, the starting index, the number of elements and, optionally, the stride, and off you go. The previous example, rewritten with this syntax, looks like this:

    double A[1000], B[1000], C[1000], D[1000], E[1000];
    E[:] = (A[:] < B[:]) ? C[:] : D[:];
    

    The colon alone means that we are referring to all the elements. More complex manipulations can be performed as well. Say, this code

    int A[10], B[10];
    for (int i = 0; i < 5; i++)
      A[(i*2) + 1] = B[(i*1) + 1];
    

    can be written much more compactly and, most importantly, with guaranteed vectorization:

    // A[1], A[3], A[5], A[7], A[9] = B[1], B[2], B[3], B[4], B[5]
    A[1:5:2] = B[1:5];
    
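
    Array notation also covers reductions. A small sketch, assuming the Intel compiler's Cilk Plus support with its __sec_reduce_add built-in:

    double A[1000];
    // Vectorized sum of all elements of A
    double total = __sec_reduce_add(A[:]);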

    Thus, we see that there really are a lot of ways to achieve vectorization. As for the balance: since we lined up all the approaches in the table from the most manual to the most automatic, from the point of view of effort spent versus result obtained, the golden mean converges on Cilk Plus. But that does not mean everything is so clear-cut. If a patient is sick, they are not always given antibiotics right away, right? So it is here: for some, auto-vectorization may be quite sufficient, and for others the ivdep and vector always directives will be a perfectly reasonable solution. What matters more is involving the compiler, so that your head does not ache every time new instructions and hardware are released, and here Intel always has something to offer. Until the next posts, friends!
