Help the compiler with vectorization? It's better not to interfere.

Published on October 25, 2011


    This is a free translation of my recent post on the English version of Intel Software Network, for those who like Victoria Zhislina more than vikky13. Those who have already seen that post can go straight to the first and last paragraphs, which are not in the original.

    - Hello everyone, I need a translator from Russian into C++ program code. That is, I write down a task, and the translator implements its solution in C++. Where can I find one? If there is none for C++, maybe there is one for other languages?

    - Yes, it's called the head of a development department. You write the task in Russian, hand it to your subordinates, and that's it, the code is ready! In C, in Delphi, even in Java. I've checked, it works!


    They say this is not a joke but a real question from a programming forum. They also say that a human is much smarter than a machine, and so can help it out by sharing some of that intelligence. But there are many cases when this is definitely not worth doing, and the result will be the opposite of what was expected.

    Here is a good example from the well-known open-source OpenCV library:
    
    template<typename T, class Op> static void
    cvtScale_( const Mat& srcmat, Mat& dstmat, double _scale, double _shift )
    {
        Op op;
        typedef typename Op::type1 WT;
        typedef typename Op::rtype DT;
        Size size = getContinuousSize( srcmat, dstmat, srcmat.channels() );
        WT scale = saturate_cast<WT>(_scale), shift = saturate_cast<WT>(_shift);
        for( int y = 0; y < size.height; y++ )
        {
            const T* src = (const T*)(srcmat.data + srcmat.step*y);
            DT* dst = (DT*)(dstmat.data + dstmat.step*y);
            int x = 0;
            for( ; x <= size.width - 4; x += 4 )
            {
                DT t0, t1;
                t0 = op(src[x]*scale + shift);
                t1 = op(src[x+1]*scale + shift);
                dst[x] = t0; dst[x+1] = t1;
                t0 = op(src[x+2]*scale + shift);
                t1 = op(src[x+3]*scale + shift);
                dst[x+2] = t0; dst[x+3] = t1;
            }
            for( ; x < size.width; x++ )
                dst[x] = op(src[x]*scale + shift);
        }
    }

    This is a simple template function working with char, short, float and double.
    Its authors decided to help the compiler with SSE vectorization by unrolling the inner loop by a factor of 4 and processing the remaining "tail" of the data separately.
    Do you think modern compilers (for Windows) will generate optimized code in accordance with the authors' intention?
    Let's check by compiling this code with Intel Compiler 12.0 and the /QxSSE2 switch (it has been verified that the other SSEx and AVX options give the same result).

    And the result is quite unexpected. The assembler listing produced by the compiler conclusively shows that the unrolled loop is NOT vectorized. The compiler generates SSE instructions, but only scalar ones, not vector ones. Meanwhile the rest of the data, the "tail" containing only 1-3 elements in a plain, non-unrolled loop, is fully vectorized!
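    For reference, here is what the scalar-versus-vector distinction means for the float case. This is a hypothetical sketch with SSE intrinsics, not the compiler's actual output: the scalar path corresponds to mulss/addss instructions that process one float at a time, while the vector path corresponds to mulps/addps that process four floats per instruction.

    ```cpp
    #include <xmmintrin.h>  // SSE intrinsics
    #include <cassert>

    int main()
    {
        float src[4] = { 1.f, 2.f, 3.f, 4.f };
        float dst[4];
        const float scale = 2.f, shift = 1.f;

        // Scalar SSE: each iteration maps to mulss/addss, handling
        // ONE float per instruction - four elements, four multiply-adds.
        for( int x = 0; x < 4; x++ )
            dst[x] = src[x]*scale + shift;

        // Vector SSE: mulps/addps handle FOUR floats per instruction.
        __m128 vsrc   = _mm_loadu_ps(src);
        __m128 vscale = _mm_set1_ps(scale);
        __m128 vshift = _mm_set1_ps(shift);
        __m128 vdst   = _mm_add_ps(_mm_mul_ps(vsrc, vscale), vshift);

        float out[4];
        _mm_storeu_ps(out, vdst);

        // Both paths compute src[x]*2 + 1; only the instruction count differs.
        for( int x = 0; x < 4; x++ )
            assert(out[x] == dst[x]);
        return 0;
    }
    ```

    The whole point of auto-vectorization is that the compiler emits the second form for you from a plain loop.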

    If we remove the loop unrolling:
    
    for( int y = 0; y < size.height; y++ )
    {
        const T* src = (const T*)(srcmat.data + srcmat.step*y);
        DT* dst = (DT*)(dstmat.data + dstmat.step*y);
        int x = 0;
        for( ; x < size.width; x++ )
            dst[x] = op(src[x]*scale + shift);
    }

    ... and look at the assembler again (I won't scare you with it), we find that the loop is now fully vectorized for all data types, which undoubtedly improves performance.
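    The simplified inner loop, pulled out into a standalone function for the float case, looks like this. The name scaleShift is mine, not OpenCV's; the point is that this plain form is exactly the pattern auto-vectorizers recognize:

    ```cpp
    #include <cassert>
    #include <vector>

    // The plain, non-unrolled scale-and-shift loop: the pattern the
    // Intel compiler auto-vectorizes for every element type.
    static void scaleShift( const float* src, float* dst, int n,
                            float scale, float shift )
    {
        for( int x = 0; x < n; x++ )
            dst[x] = src[x]*scale + shift;
    }

    int main()
    {
        std::vector<float> src = { 1.f, 2.f, 3.f, 4.f, 5.f };
        std::vector<float> dst(src.size());

        scaleShift(src.data(), dst.data(), (int)src.size(), 2.f, 0.5f);

        for( size_t i = 0; i < src.size(); i++ )
            assert(dst[i] == src[i]*2.f + 0.5f);
        return 0;
    }
    ```

    Notice there is nothing clever here; the compiler does the clever part itself.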

    Conclusion: more work, less performance. Less work, more performance. If only it were always like that.

    Note that the Microsoft compiler in Visual Studio 2010 and 2008, with the /arch:SSE2 switch, does NOT vectorize the above code in either the unrolled or the plain form. The code it produces is very similar in both appearance and performance in both cases. That is, if unrolling the loop is harmful for the Intel compiler, for the Microsoft one it is simply useless :).

    But what if you still want to keep the loop unrolled (perhaps it is dear to you as a memory), yet you also want vectorization?

    Then use Intel compiler pragmas as shown below:

    #pragma simd
            for( x = 0; x <= size.width - 4; x += 4 )
            {
                DT t0, t1;
                t0 = op(src[x]*scale + shift);
                t1 = op(src[x+1]*scale + shift);
                dst[x] = t0; dst[x+1] = t1;
                t0 = op(src[x+2]*scale + shift);
                t1 = op(src[x+3]*scale + shift);
                dst[x+2] = t0; dst[x+3] = t1;
            }
    #pragma novector
            for( ; x < size.width; x++ )
                dst[x] = op(src[x]*scale + shift);


    And one last thing. Loop unrolling by itself can have a positive effect on performance. But, first, the potential gain from vectorization will still outweigh that effect, and, second, the unrolling can be entrusted to the compiler, so that vectorization does not suffer from it. Among other things, I plan to touch on this topic in a webinar on October 27.
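    Entrusting the unrolling to the compiler can look like this. The Intel compiler accepts #pragma unroll(n) as a hint on the loop that follows; the loop body stays in its simple, vectorizable form, and the compiler is free to combine the requested unrolling with SSE vectorization. Compilers that do not know this pragma simply ignore it (possibly with a warning), so the code below is a sketch of the idea rather than a portable recipe:

    ```cpp
    #include <cassert>

    int main()
    {
        const int n = 8;
        float src[n], dst[n];
        for( int i = 0; i < n; i++ )
            src[i] = (float)i;

        const float scale = 3.f, shift = 1.f;

        // Keep the simple loop and ask the compiler to unroll it by 4.
        // Unlike manual unrolling, this does not obscure the pattern
        // the vectorizer needs to see.
    #pragma unroll(4)
        for( int x = 0; x < n; x++ )
            dst[x] = src[x]*scale + shift;

        for( int x = 0; x < n; x++ )
            assert(dst[x] == x*3.f + 1.f);
        return 0;
    }
    ```

    The result is the best of both: the source stays readable, and the compiler decides whether unrolling, vectorization, or both pay off on the target hardware.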