Bubbles, caches, and branch predictors
This article grew out of a curious post and a short comment by its author, thanks to which I was able to understand what is going on in more detail. The idea is to compare two variations of the bubble sort algorithm. The first is an ordinary bubble sort with a small optimization: the inner loop can finish a little earlier, since we know the tail of the array is already sorted.
In the second variant, the inner loop passes over the other part of the array, but algorithmically this variant is equivalent to the first (details below).
Run the code, for example, for N = 100,000 on an array of ints, and we get about 30 seconds in the first case and less than 10 seconds in the second: a threefold difference! So where does this difference come from?
The first variant:

for (i = 0; i < N-1; i++)
    for (j = 0; j < N-1-i; j++)
        if (a[j] > a[j+1])
            swap(a[j], a[j+1]);
The second variant:

for (i = 0; i < N-1; i++)
    for (j = i; j >= 0; j--)
        if (a[j] > a[j+1])
            swap(a[j], a[j+1]);
First, let us check that this is not a matter of compiler optimizations. That is easy to verify by generating and comparing the assembly output in both cases, or by turning optimizations off, or by rewriting the code in assembly (for example, as inline assembly). In every case the effect persists.
Next, how do we verify that the two variants are equivalent? For example, we can print the compared values a[j], a[j+1] together with the corresponding index j, sort the resulting lists, and then compare them line by line (of course, it is reasonable to do this for small N). It turns out the lists are identical, which means both variants perform the same comparisons and the same assignments; the only difference is the order in which these operations happen.
Yes, some will say, but we know other examples where execution time depends on the order of operations. For instance, there is a difference between sequential and random access to memory, even when we read the same data. The point there is that a memory access is a rather long process, during which the processor sits almost idle although it could have executed hundreds of instructions. With sequential access, however, these delays are amortized by the processor caches, into which a large block of data is copied at once and which are much faster to access.
But note that in our case the data volume is not that big: 100,000 ints are only 400 KB, while modern processors have L2 caches of a megabyte or more. Moreover, both variants access the array elements sequentially, so memory access delays are amortized once again. Finally, the same threefold difference is observed if we decrease N. So although caches play a role here, they are not the main cause of the observed effect.
To understand what is happening and to check the previous claim, let us analyze both variants using the processor's hardware counters. These counters track the number of executed instructions, cache accesses, branch events, and many other parameters. To collect the readings I used a trial version of the Intel VTune Performance Analyzer (no, it is not free, and googling for a freebie will not help), but there are other ways. We run the analysis and get the following readings (P4 Prescott, 16 KB L1, 2 MB L2; first variant vs. second):
Instructions retired: 40 vs. 35 (·10^9)
L1 cache misses: 1.6 vs. 0.4 (·10^9)
L2 cache misses: 1.1 vs. 1.8 (·10^6)
So what follows from this? First, there are very few first-level cache misses per instruction, so the corresponding delays are not significant. Second, there are still quite a few second-level cache misses, but it turns out that in the second variant those misses happen even more often than in the first! The results on other processors may differ somewhat (in particular, there may be an L3 cache, which my processor lacks), but I think the overall picture stays roughly the same.
So it is not about the caches. But look at the additional analysis results:
Branches: 1.0 vs. 1.0 (·10^10)
Mispredicted branches: 0.16 vs. 0.00009 (·10^10)
A small digression for those who, like me until recently, are not familiar with such an interesting thing as the branch predictor; oddly enough, I found no posts about it on Habr.
You probably know that today's processors, thanks to the instruction pipeline, do not execute instructions strictly one after another but work on several of them in parallel and then combine the results. However, which instructions come next may depend on the result of a conditional branch; in our example, if a[j] > a[j+1], the elements are swapped, otherwise they are not. This is where the branch predictor comes to the rescue: it tries to guess whether the branch will be taken and, accordingly, which instructions to fetch into the pipeline.
A branch predictor can be static or dynamic. A dynamic predictor tries to predict a branch from the history of previous branches; if no such history has accumulated yet, a static predictor is used, but that case is not important here. On static and dynamic predictors I liked this article (in English), although in reality everything is, of course, more complicated.
How important is the branch predictor and its misses? Wikipedia reports that on modern processors the misprediction penalty is tens of clock cycles, which by itself is not that much (leotsarev). Beyond that, however, the absence of mispredictions brings a very real benefit: thanks to the long pipeline, the processor can "peek" many instructions ahead. You can verify this with the following code:
for (i = 0; i < K; i++)    /* repeat the scan many times */
    for (j = 0; j < M; j++)
        if (p[j]) c++;
Here p[] is an array that determines whether the conditional branch is taken, and the counter c simply serves to distinguish the two outcomes. If all the values p[j] are the same, then after a few iterations the branch is predicted very well. If p[j] is generated in some, say, periodic manner, predictability depends on the predictor implementation. But if the array is filled randomly, the next branch is, within certain limits, impossible to predict. Note that the size M of the array matters: if it is too large, the array caches poorly; if it is too small, the branch becomes predictable.
On my computer the execution time of this code varies by a factor of 4-6 depending on how randomly the array is filled. If something more complex than incrementing a counter is done, say swapping array elements, the difference shrinks, but not by much. Thus, the presence or absence of branch mispredictions alone can cause a severalfold difference in execution time, which is exactly what we observe in the original problem.
According to the analysis results above, in the first sorting variant the predictor misses in 16% of cases, and in the second, a thousand times less often. Why is that? It becomes clear once we consider how the sorting proceeds in each case.
In the first variant, for small i the inner loop over j runs almost to the end of the array, skipping only the sorted "tail". Initially the data in this part of the array is disordered, so the condition a[j] > a[j+1] holds almost at random. As i grows, the elements become somewhat ordered by the swaps, but a large share of randomness remains (animation). The branch is therefore hard to predict, which leads to a large number of predictor misses.
In the second variant, for i = 0 the inner loop sorts only a[0] and a[1], for i = 1 it adds a[2], and so on. In effect this is insertion sort (though not binary insertion sort): at the i-th step, the element a[i+1] is inserted into the already sorted subarray a[0..i] (animation). Since the element is inserted from the end of the subarray, in most cases it is moved step by step toward its proper position, and the outcome of the condition a[j] > a[j+1] stays the same until it gets there. Thus, after a few iterations the branch becomes easy to predict, much to the branch predictor's delight.