# Vectorization Advisor, another example: dissecting a fractal

We recently wrote about the new Vectorization Advisor; what it is and why you need it is covered in the first article. This post is devoted to analyzing a concrete example of optimizing an application with this tool.


The application is taken from the Intel Threading Building Blocks (Intel TBB) library examples. It draws a Mandelbrot fractal and is parallelized across threads with Intel TBB; that is, it already takes advantage of a multi-core processor. Let's see how things stand with vector instructions.

Test system:

• Notebook with Intel Core i5-4300U (Haswell, 2 cores, 4 hardware threads)

• IDE: Microsoft Visual Studio 2013

• Compiler: Intel Composer 2015 update 2

• Intel Advisor XE 2016 Beta

• Application: Mandelbrot fractal, slightly modified. See `<tbb_install_dir>\examples\task_priority`

# 1. Assess the situation

So, the first step: we measure the baseline version; its run time is 4.51 seconds. Then we run the Survey analysis in Intel Advisor XE.

The “Loop Type” column shows that all loops are scalar, i.e. they use no SIMD instructions. The most expensive loop is in the file fractal.cpp on line 74: it accounts for more than 13 seconds of CPU time, so that is the one we will work on. The “Why No Vectorization” column contains a message saying that the compiler could not estimate the loop's trip count and therefore did not vectorize it. Double-click the highlighted loop to look at the code:

```
while ( ((x*x + y*y) <= 4) && (iter < max_iterations) ) {
    xtemp = x*x - y*y + fx0;
    y = 2 * x*y + fy0;
    x = xtemp;
    iter++;
}
```

It becomes clear why the number of iterations is not known at compile time. Before diving into the details of the algorithm and trying to rewrite the while loop, let's try a simpler route and see where our loop is called from, using the Top Down tab:

# 2. We force a vectorization

The next loop up the call stack is on line 164. It is an ordinary-looking for loop, which the compiler did not vectorize either (“Scalar” in the Loop Type column), probably judging it not worthwhile. The diagnostic message suggests forcing vectorization, for example with the SIMD directive:

Follow the advice:

```
#pragma simd // forced vectorization
for (int y = y0; y < y1; ++y) {
    fractal_data_array[x - x0][y - y0] = calc_one_pixel(x, y, tmp_max_iterations, tmp_size_x, tmp_size_y, tmp_magn, tmp_cx, tmp_cy, 255);
}
```

We rebuild the program and run Survey again. With *#pragma simd*, the loop is no longer reported as “Scalar” but as “Remainder”. SIMD loops, as you know, can be divided into peel, body, and remainder parts. The body is the actual vectorized part, where one instruction performs several iterations at once. A peel appears when the data is not aligned to the vector length; the remainder is the handful of iterations left over at the end.

We only have a remainder. This means the loop was vectorized, but the body never executes. The Recommendations tab points out how inefficient this is: more iterations need to run in the body, since the remainder here is an ordinary sequential loop. Let's check how many iterations there are using Trip Counts analysis:

# 3. Increase the number of iterations

Our loop has only 8 iterations, which is quite few. If there were more, the vectorized body would have something to execute. (The line number has changed after the code modifications; it is now line 179.) Let's see what determines the number of iterations:

```
tbb::parallel_for(tbb::blocked_range2d<int>(y0, y1, inner_grain_size, x0, x1, inner_grain_size),
    [&] (tbb::blocked_range2d<int> &r) {
        int x0 = r.cols().begin(),
            y0 = r.rows().begin(),
            x1 = r.cols().end(),
            y1 = r.rows().end();
        ...
        for (int x = x0; x < x1; ++x) {
            for (int y = y0; y < y1; ++y) { // the loop on line 179
                fractal_data_array[x - x0][y - y0] = calc_one_pixel(x, y, tmp_max_iterations, tmp_size_x, tmp_size_y, tmp_magn, tmp_cx, tmp_cy, 255);
            }
        }
        ...
```

The loop we are interested in is called from a parallel loop of the Intel Threading Building Blocks (Intel TBB) library. Iterations of the outer loop are distributed across threads, each of which receives an object of type *tbb::blocked_range2d*: its own local iteration space. How small that space can become depends on the *inner_grain_size* parameter. That is, the value of *r.rows().end()*, which determines the trip count of the loop on line 179, depends on *inner_grain_size*. We find this grain size in the code (there are two of them, for the two nested parallel loops):

```
int grain_size = 64;
int inner_grain_size = 8;
```

We try increasing inner_grain_size to 32. Nothing bad should come of this; the work will simply be divided among Intel TBB threads at a coarser granularity. Let's look at the result:

A vectorized body now appears in the loop; we have finally achieved real use of SIMD instructions, and the loop is vectorized. But it is too early to celebrate: performance has not improved at all.

# 4. Vectorize the function

We look at the Advisor XE recommendations; one of them mentions a call to a serialized (sequential) function. The point is that if a vectorized loop calls a function, it needs a vectorized version of that function, one that can take its parameters as vectors rather than ordinary scalar variables. If the compiler could not generate such a vector function, the usual sequential version is used, and it is called sequentially, nullifying the whole vectorization.

Again, look at the loop code:

```
for (int y = y0; y < y1; ++y) { // the loop on line 179
    fractal_data_array[x - x0][y - y0] = calc_one_pixel(x, y, tmp_max_iterations, tmp_size_x, tmp_size_y, tmp_magn, tmp_cx, tmp_cy, 255);
}
```

Fortunately, there is only one function call: *calc_one_pixel*. Since the compiler could not generate a vector version on its own, we will try to help it. But first let's look at the memory access patterns; this is useful for explicit vectorization. Memory Access Patterns analysis of our loop (together with the called function) tells us that all read and write accesses to external data have stride 0; that is, each iteration reads or writes the same variable:

This knowledge is useful for manually vectorizing the function. We will use the *#pragma omp declare simd* directive from the OpenMP 4 standard. It has clauses that describe the access pattern of the parameters. For example, “linear” is used for monotonically increasing quantities, mostly loop variables. For us, “uniform” fits: access to the same data.

```
#pragma omp declare simd uniform(x0, max_iterations, size_x, size_y, magn, cx, cy, gpu)
color_t fractal::calc_one_pixel(float x0, float y0, int max_iterations, int size_x, int size_y, float magn, float cx,
```

By adding the directive to the function definition, we enabled the compiler to generate a vector version of it, one that can process arrays of data instead of the declared scalars. We get a noticeable speedup: the loop runs in 7.5 seconds of CPU time instead of 12:

# 5. Align the data

Let's try to address the other sources of inefficiency in the vectorized loop. The presence of a peel part indicates that the data is not aligned. We add *__declspec(align(32))* before the definition of the array that the *calc_one_pixel* results are written to:

```
__declspec(align(32)) color_t fractal_data_array[delta_x][(delta_y / 16 + 1) * 16]; // aligned data
for (int x = x0; x < x1; ++x) {
    #pragma simd
    for (int y = y0; y < y1; ++y) {
        fractal_data_array[x - x0][y - y0] = calc_one_pixel(x, y, tmp_max_iterations, tmp_size_x, tmp_size_y, tmp_magn, tmp_cx, tmp_cy, 255);
    }
}
```

After that, peel disappears:

# 6. Disable loop unrolling

One more thing can be noted in the Advisor XE diagnostics table: the “Transformations” column says that the compiler unrolled the loop with a factor of 2:

That is, as an optimization the number of iterations is halved. This is not bad in itself, but in step 3 we deliberately tried to increase the iteration count, and here the compiler does the opposite. Let's try disabling unrolling with *#pragma nounroll*:

```
#pragma simd
#pragma nounroll // disable unrolling
for (int y = y0; y < y1; ++y) {
    fractal_data_array[x - x0][y - y0] = calc_one_pixel(x, y, tmp_max_iterations, tmp_size_x, tmp_size_y, tmp_magn, tmp_cx, tmp_cy, 255);
}
```

After that, the number of iterations expectedly doubled:

# 7. Increase the number of iterations even more

Manipulating the unrolling increased the number of iterations, but performance did not improve. Let's see what happens if the grain sizes are increased even further, as in step 3; we simply search for the optimal values empirically:

```
int grain_size = 256; // increased from 64
int inner_grain_size = 64; // increased from 8
```

This squeezes out a little more, though only a little:

# 8. Results

After all these manipulations, the running time of the test application dropped from 4.51 to 1.92 seconds, a speedup of about 2.3×. Recall that our computational test was already partially optimized: it is thread-parallel with Intel TBB, achieving good speedup and scalability on multi-core processors. But because vectorization was not fully exploited, we were still losing half of the attainable performance. The first conclusion is that **vectorization can be of great importance**; do not neglect it.

Now let's look at the effect of the individual changes:

Forcing vectorization and increasing the iteration count (steps 2 and 3) did not speed up the program by themselves. The biggest gain came in step 4, after vectorizing the function, but it is really the cumulative effect of steps 2–4: to truly vectorize the loop we had to increase the iteration count, force vectorization, and vectorize the called function, all together.

The subsequent steps had no particular effect and, in our case, could have been skipped entirely. Step 7 can more or less be counted a success, and only because in step 3 we did not immediately set a large enough iteration count. Still, the purpose of this post was to show the capabilities of Advisor XE on a concrete example, so all the steps taken are described.