Finding and solving scalability problems on multi-core Intel Core 2 processors (part 1)
- Translation
Adapting software to make effective use of all available processors is critical in light of the multi-core future of modern computing. Among the obstacles that may be encountered along this path are problems caused by sharing the limited bandwidth of existing platforms and processors. Correct use of Intel Core 2 processor performance events makes it possible to pinpoint the exact reason that keeps an application from fully using all the cores in the system.
What is scalability?
Before we start studying software scalability problems in a parallel computing environment, we need to define the term itself. Perfect scalability means that the total execution time of a task decreases linearly as the number of cores used by the application grows. However, in accordance with Amdahl's law, parallel execution can shorten only the part of the total execution time spent in code that has actually been parallelized. Thus, the first step in our search is to determine the degree of parallelism the algorithms actually achieve.
The performance events of processors based on the Intel Core 2 microarchitecture provide convenient metrics for this purpose. Each core has its own Performance Monitoring Unit (PMU), and the CPU_CLK_UNHALTED.CORE event counts the unhalted (actively working) cycles on each core. If the VTune™ Performance Analyzer is used to collect the data, this number is summed over all cores that executed the process of interest; that is, it is the number of cycles effectively spent executing the application. We will call this number "effective_cycles".
A useful feature of the PMU is the ability to compare the per-cycle event count with a given threshold (cmask) and to count a cycle only if the count is greater than or equal to the threshold (inv = 0) or, alternatively, less than it (inv = 1). When such a condition is specified, the PMU counts only the cycles on which it holds. However, this is possible only on the general-purpose counters; the fixed counters, which count the events CPU_CLK_UNHALTED.CORE, INST_RETIRED.ANY and CPU_CLK_UNHALTED.REF, do not support this operation. If the event CPU_CLK_UNHALTED.CORE_P (the general-purpose counter version of the unhalted-cycles event) is compared against the value 2 with the "less than" condition (inv = 1), the counter counts every cycle. Summing this number over all processes and dividing by the number of cores in the system gives the total execution time of the process; call it "total_cycles".
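The original article collects these counts with VTune; as a minimal sketch of the same cmask/inv mechanism, the following (not from the article) assumes a Linux system, the perf_event_open syscall, and the standard raw encoding of CPU_CLK_UNHALTED.CORE_P (event 0x3C, umask 0x00), with cmask and inv placed in their usual bit positions of the event-select value:

```c
/* Sketch: program a general-purpose counter with cmask = 2, inv = 1 so that the
   "< 2" comparison holds on every cycle and the counter counts all cycles.
   Assumes Linux perf_event_open and Intel raw event encoding. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    /* event 0x3C, umask 0x00, inv = bit 23, cmask = bits 24-31 */
    attr.config = 0x3C | (0x00 << 8) | (1ULL << 23) | (2ULL << 24);
    attr.disabled = 1;

    int fd = (int)perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... workload under measurement ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("cycles counted with cmask=2, inv=1: %llu\n",
           (unsigned long long)count);
    close(fd);
    return 0;
}
```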
To ensure the values obtained are correct, the SpeedStep feature must be disabled in the BIOS and the OS; otherwise idle cores may run at a lower frequency, which would distort the measured total time. The ratio effective_cycles / total_cycles equals the number of cores used for perfectly scalable code and equals 1 for completely serial code, and the result does not depend on which subset of the system's cores was used during execution. However, the value of this technique can be undermined if the code contains long busy-wait (spin) loops, so wait loops must be implemented carefully. For a more detailed analysis, Intel Thread Profiler is recommended.
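To make the ratio concrete, here is a toy calculation with made-up counts (not measured data); the variable names are only for illustration:

```c
/* Toy illustration: how effective_cycles / total_cycles indicates scaling. */
#include <stdio.h>

int main(void)
{
    /* Hypothetical counts for a run on a 4-core system. */
    double effective_cycles   = 7.6e9; /* CPU_CLK_UNHALTED.CORE summed over cores    */
    double sum_over_processes = 8.0e9; /* CORE_P with cmask=2, inv=1, over processes */
    int    cores_in_system    = 4;

    double total_cycles = sum_over_processes / cores_in_system; /* 2.0e9 */
    double scaling      = effective_cycles / total_cycles;      /* ~3.8  */

    printf("scaling factor: %.2f (4.0 = perfect on 4 cores, 1.0 = serial)\n",
           scaling);
    return 0;
}
```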
Deviations from ideal scalability can be caused by the structure of the code and its synchronization logic, but they can also be caused by heavy contention for shared system resources. The purpose of this article is precisely the systematic search for such problems and their resolution.
Limited resources
First of all, we need to list the resources for which heavy contention is possible during parallel execution. These include:
- System bus bandwidth
- Memory bandwidth
- Inter-socket bandwidth
- Disk subsystem utilization
- Total cache capacity
- Data Translation Lookaside Buffer (DTLB) capacity
- Exclusive access to cache lines:
  - shared elements within a line
  - shared lines without shared elements (false sharing)
The last item on the list affects performance somewhat differently from all the others, because thread and process synchronization relies on locked access to cache lines. The difference is most visible in that a faster or larger cache cannot always mitigate the resulting loss of scalability, as it can for the other items on this list.
With these limitations in mind, there are two main application-scaling scenarios, discussed below: dividing a fixed amount of work among the cores, and growing the total amount of work linearly with the number of cores. The two scenarios raise somewhat different scalability issues and therefore differ in the attributes we look for.
Bandwidth
In order to understand the contribution of bus traffic to the performance drop, we need to determine what traffic actually travels on the bus during program execution (in total and broken down by component) and what the peak bandwidth of the platform under analysis is. Peak platform bandwidth depends on a large number of factors: the configuration of the hardware prefetchers, the number, type and placement of RAM modules in their slots, the front-side bus frequency, the processor frequency, and the conditions required to maintain cache coherence. Therefore, no single metric can determine the bandwidth of an Intel Core 2 based platform; the only reliable way to determine it is to run a synthetic bandwidth test. A single long loop of the TRIAD kernel (as in the STREAM benchmark) is well suited for this purpose. Since the bandwidth limit for multi-core execution is likely to differ from the single-core limit, the test should support parallel execution, either with threads or by splitting the work into separate processes.
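As a minimal sketch of such a test (not the article's own code), the following assumes OpenMP and a compiler such as gcc with -fopenmp; the array size, repetition count and scaling factor q are illustrative:

```c
/* TRIAD bandwidth micro-benchmark sketch: a[i] = b[i] + q * c[i]
   Compile, e.g.: gcc -O2 -fopenmp triad.c -o triad */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N    (1L << 25)   /* ~33M doubles per array, far larger than any cache */
#define REPS 10

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double q = 3.0;
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++) {
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + q * c[i];           /* the TRIAD kernel */
    }
    double t1 = omp_get_wtime();

    /* Three 8-byte arrays are streamed per element (2 reads + 1 write,
       not counting the write-allocate read of the destination). */
    double bytes = 3.0 * 8.0 * (double)N * REPS;
    printf("TRIAD: %.2f GB/s with %d threads (a[N/2]=%g)\n",
           bytes / (t1 - t0) / 1e9, omp_get_max_threads(), a[N / 2]);

    free(a); free(b); free(c);
    return 0;
}
```

Running it with 1, 2, 4, ... threads (or as several independent processes) gives the single-core and multi-core bandwidth limits mentioned above.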
The system bandwidth limit affects performance somewhat differently than most slowdown factors. The impact of most factors grows linearly with their defining metric, as with cache misses at the last cache level, where the performance impact is the corresponding number of stall cycles. Here, instead, performance drops as if hitting a wall that is not noticeable until the application has exhausted the platform's resources. That is, the dependence of performance on bandwidth consumption is step-like rather than linear, as it is for other events. For example, memory access time increases nonlinearly once the bandwidth limit is reached. You can observe this effect by measuring bus access latency while increasing the number of simultaneously running copies of the TRIAD loop. Bus access latency can be measured with performance events in counting mode (as opposed to sampling mode) as the ratio:
Bus Latency = BUS_REQUEST_OUTSTANDING.SELF / (BUS_TRANS_BRD.SELF - BUS_TRANS_IFETCH.SELF)
In most cases the ifetch (instruction fetch) component is negligible, especially near peak bandwidth, and can therefore be ignored.
Many sources contribute to system bus traffic. The performance events of Intel Core 2 processors offer many ways to determine both the total bus load and the contributions of its individual components. Bus saturation is very easy to determine: it can be expressed as the fraction of bus cycles used to transfer data:
BUS_DRDY_CLOCKS.ALL_AGENTS / CPU_CLK_UNHALTED.BUS
Or directly, as the number of bytes transferred across the bus in the form of cache lines:
Cacheline_Bandwidth (bytes/s) ≈ 64 * BUS_TRANS_BURST.ALL_AGENTS * core_freq / CPU_CLK_UNHALTED.CORE
A convenient metric in this case is simply the number of cache lines per cycle, since the bus appetite of N parallel copies of the application or thread will be N times this value. So if the platform limit is expressed in the same units, the likely bus load under parallel execution can easily be estimated from the analysis of a single thread.
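A small sketch of this arithmetic (the event counts and core frequency below are placeholders, not measured values):

```c
/* Cache-line bandwidth and lines-per-cycle from Core 2 bus events. */
#include <stdio.h>

int main(void)
{
    double bus_trans_burst = 5.0e8;  /* BUS_TRANS_BURST.ALL_AGENTS (hypothetical) */
    double unhalted_core   = 4.0e9;  /* CPU_CLK_UNHALTED.CORE (hypothetical)      */
    double core_freq_hz    = 2.4e9;  /* core clock frequency                      */

    double lines_per_cycle = bus_trans_burst / unhalted_core;
    double bytes_per_sec   = 64.0 * lines_per_cycle * core_freq_hz;

    printf("cache lines per cycle : %.4f\n", lines_per_cycle);
    printf("cache-line bandwidth  : %.2f GB/s\n", bytes_per_sec / 1e9);
    /* N parallel copies will demand roughly N times the lines per cycle. */
    printf("estimated demand of 4 copies: %.4f lines/cycle\n",
           4.0 * lines_per_cycle);
    return 0;
}
```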
The BUS_TRANS_* events can be used hierarchically to isolate the components of bus traffic. A brief description of them is given in the table below; they are also described in detail in the VTune Performance Analyzer online help.
| Event | Description |
|---|---|
| BUS_TRANS_ANY | All bus transactions: Mem, IO, Deferred, Partial |
| BUS_TRANS_MEM | All cache line, partial and invalidate transactions |
| BUS_TRANS_BURST | All full cache line transactions: Brd, RFO, WB, combined writes |
| BUS_TRANS_BRD | All cache line reads: data, ifetch |
| BUS_TRANS_IFETCH | All cache line instruction fetches |
| BUS_TRANS_RFO | All cache line reads for ownership (requests for exclusive use) |
| BUS_TRANS_WRB | All cache line writebacks (modified cache lines) |
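As a sketch of the hierarchical decomposition these events allow (the counts below are placeholders chosen only to be mutually consistent), the full-line traffic can be split into data reads, instruction fetches, reads for ownership and writebacks:

```c
/* Break BUS_TRANS_BURST traffic into its components. */
#include <stdio.h>

int main(void)
{
    double burst  = 6.0e8;  /* BUS_TRANS_BURST: all full cache line transfers */
    double brd    = 3.5e8;  /* BUS_TRANS_BRD: line reads (data + ifetch)      */
    double ifetch = 0.1e8;  /* BUS_TRANS_IFETCH: instruction-fetch lines      */
    double rfo    = 1.5e8;  /* BUS_TRANS_RFO: reads for ownership             */
    double wrb    = 1.0e8;  /* BUS_TRANS_WRB: writebacks of modified lines    */

    printf("data reads : %4.1f %% of line traffic\n", 100.0 * (brd - ifetch) / burst);
    printf("ifetch     : %4.1f %%\n", 100.0 * ifetch / burst);
    printf("RFO        : %4.1f %%\n", 100.0 * rfo / burst);
    printf("writebacks : %4.1f %%\n", 100.0 * wrb / burst);
    return 0;
}
```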
There are many standard ways to reduce bus traffic that can be applied once you run into the platform limit. They include the following:
- Use all the bytes of every cache line while it is in the cache
- Arrange structure members in decreasing order of size to avoid padding inserted by the compiler (see the layout sketch after this list)
- Define structures according to how they are actually used, not for the sake of object orientation or thematic grouping
- Make the array index that strides through contiguous memory the index of the innermost of the nested loops
- Wherever possible, cache the structure members being worked on
- Place structure members that are used together next to each other
- Use streaming (non-temporal) stores that bypass the cache for large assignment loops
- Do not prefetch cache lines that will not be used
- Wherever possible, combine bandwidth-bound loops with compute-bound loops. This is usually easy when the loops have the same number of iterations, but it is feasible in principle even when they do not
- Try to block the data so that it is reused as much as possible while it resides in the cache
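The structure-ordering item above can be illustrated with a minimal sketch (assuming a typical LP64 ABI; the member names are arbitrary): declaring members in decreasing size order removes the padding the compiler would otherwise insert, so more elements fit per cache line and less bus traffic is generated for large arrays of structures.

```c
/* Structure layout: declaration order vs. decreasing-size order. */
#include <stdio.h>

struct unordered {       /* members in "natural" declaration order          */
    char   flag;         /* 1 byte + 7 bytes padding                        */
    double value;        /* 8 bytes                                         */
    short  id;           /* 2 bytes + 2 bytes padding                       */
    int    count;        /* 4 bytes -> 24 bytes total                       */
};

struct ordered {         /* same members, decreasing size order             */
    double value;        /* 8 bytes                                         */
    int    count;        /* 4 bytes                                         */
    short  id;           /* 2 bytes                                         */
    char   flag;         /* 1 byte + 1 byte padding -> 16 bytes total       */
};

int main(void)
{
    printf("unordered: %zu bytes, ordered: %zu bytes\n",
           sizeof(struct unordered), sizeof(struct ordered));
    return 0;
}
```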
The above does not pretend to be an exhaustive list of possible options; it is rather a list of the most obvious ones. Anyone who has actually tried will tell you that the last item in particular is easier said than done.
I apologize for the wall of text and a certain amount of confusion, but the material really is somewhat dry. I did the translation last year, so I also apologize for some loss of relevance.
Continuation of the article: part 2 , part 3 , part 4