Improving software performance with Intel developer tools: numerical modeling of astrophysical objects
We begin a series of articles describing situations in which the use of Intel developer tools significantly increased software speed and improved its quality.
Our first story comes from Novosibirsk State University, where researchers developed a software tool for the numerical simulation of magnetohydrodynamic problems during hydrogen ionization. The work was carried out as part of AstroPhi, a large project for modeling astrophysical objects, with Intel Xeon Phi processors as the hardware platform. Using Intel Advisor and Intel Trace Analyzer and Collector, computing performance increased threefold, and the time to solve one problem dropped from a week to two days.
Task description
Mathematical modeling plays an important role in modern astrophysics, as it does in any science: it is a universal tool for studying nonlinear evolutionary processes in the Universe. High-resolution modeling of complex astrophysical processes requires enormous computational resources. Within the AstroPhi project, NSU is developing astrophysical simulation code for supercomputers based on Intel Xeon Phi processors. Students learn to write simulation programs for a massively parallel runtime environment, gaining important knowledge that they will later need when working with other supercomputers.
The numerical modeling method used in the project had a number of important advantages:
- no artificial viscosity,
- Galilean invariance,
- guaranteed non-decrease of entropy,
- simple parallelization,
- potentially unlimited extensibility.
The first three factors are key to realistic modeling of significant physical effects in astrophysical problems.
The research team created a new modeling tool for massively parallel architectures based on Intel Xeon Phi. Its main goals were to avoid bottlenecks in data exchange between nodes and to keep further code refinement as simple as possible. Parallelization is handled with MPI; for vectorization, Intel Advanced Vector Extensions 512 (Intel AVX-512) instructions add support for 512-bit SIMD and allow the program to pack 8 double-precision (64-bit) or 16 single-precision (32-bit) floating-point numbers into 512-bit vectors. Thus, twice as many data elements are processed per instruction as with AVX/AVX2, and four times as many as with SSE.
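A minimal sketch of this packing (illustrative only, not the project's code) is a loop that adds two arrays of doubles with AVX-512 intrinsics, so that each instruction processes 8 doubles; the array name and length are assumptions made for the example.

    #include <immintrin.h>
    #include <stdio.h>

    #define N 1024  /* assumed array length, a multiple of 8 */

    /* c[i] = a[i] + b[i]: each AVX-512 instruction handles 8 doubles. */
    static void add_arrays(const double *a, const double *b, double *c)
    {
        for (int i = 0; i < N; i += 8) {
            __m512d va = _mm512_loadu_pd(&a[i]);  /* load 8 doubles */
            __m512d vb = _mm512_loadu_pd(&b[i]);
            _mm512_storeu_pd(&c[i], _mm512_add_pd(va, vb));
        }
    }

    int main(void)
    {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2.0 * i; }
        add_arrays(a, b, c);
        printf("c[10] = %.1f\n", c[10]);  /* expect 30.0 */
        return 0;
    }

In practice the same effect is usually obtained by letting a compiler with AVX-512 support auto-vectorize a plain loop rather than writing intrinsics by hand.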
The picture before optimization. Each point is a loop: the larger and redder the point, the longer the loop runs and the more noticeable the effect of optimizing it would be. The red dot lies well below the DRAM bandwidth limit and runs at less than 1 GFLOPS; it has very large potential for improvement.
Code optimization
Before optimization, the code had problems with dependencies and vector sizes. The goal of the optimization was to remove vector dependencies and to improve memory load operations by using vector and array sizes optimal for Xeon Phi. For the optimization, the team used Intel Advisor and Intel Trace Analyzer and Collector, two tools from Intel Parallel Studio XE.
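To make "removing vector dependencies" concrete, here is a small sketch (an illustration with assumed names, not the AstroPhi code): if the compiler cannot prove that the output and input arrays do not overlap, it assumes a dependency and keeps the loop scalar; restrict qualifiers and an OpenMP simd pragma tell it the iterations are independent.

    /* Without restrict, the compiler must assume out and in may alias,
     * so it refuses to vectorize. With restrict and #pragma omp simd it
     * can safely emit 512-bit vector instructions. */
    void scale_add(double *restrict out, const double *restrict in,
                   double s, int n)
    {
        #pragma omp simd
        for (int i = 0; i < n; ++i)
            out[i] += s * in[i];
    }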
Intel Advisor is, as its name implies, an adviser: a software tool that evaluates how well code is optimized in terms of vectorization (use of AVX or SIMD instructions) and parallelization for maximum performance. Using this tool, the team was able to survey the loops, highlight those with low performance, see the potential for improvement, and determine what could be improved and whether it was worth the effort. Intel Advisor sorted the loops by potential gain and mapped compiler messages to the source code for better readability of the compiler report. It also provided important information such as loop execution times, data dependencies, and memory access patterns needed for safe and efficient vectorization.
Intel Trace Analyzer and Collector offers another way to optimize the code. It provides MPI communication profiling and analysis functionality for improving weak and strong scaling. This graphical tool helped the team understand the application's MPI behavior, quickly find bottlenecks and, most importantly, increase performance on the Intel architecture.
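A typical bottleneck that such an MPI trace exposes is a blocking halo exchange that leaves ranks idle. Below is a minimal sketch of overlapping communication with computation using non-blocking MPI calls; the buffer names, neighbor ranks, and tags are assumptions for illustration, not the project's actual exchange scheme.

    #include <mpi.h>

    /* Exchange boundary layers with the left and right neighbours while
     * the interior of the domain can still be computed. */
    void halo_exchange(double *left_halo, double *right_halo,
                       double *left_edge, double *right_edge,
                       int count, int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[4];

        MPI_Irecv(left_halo,  count, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Irecv(right_halo, count, MPI_DOUBLE, right, 1, comm, &reqs[1]);
        MPI_Isend(right_edge, count, MPI_DOUBLE, right, 0, comm, &reqs[2]);
        MPI_Isend(left_edge,  count, MPI_DOUBLE, left,  1, comm, &reqs[3]);

        /* ... interior cells can be updated here, overlapping with communication ... */

        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
        /* boundary cells can now be updated using the received halos */
    }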
The picture after optimization. During optimization of the red loop, vectorization dependencies were removed, memory load operations were optimized, and vector and array sizes were adapted to Intel Xeon Phi and the AVX-512 instructions. Performance increased to 190 GFLOPS, i.e. roughly 200 times. The dot now lies above the DRAM limit and is most likely bounded by the characteristics of the L2 cache.
Result
After all the improvements and optimizations, the team achieved 190 GFLOPS of performance with an arithmetic intensity of 0.3 FLOP/byte, 100% utilization, and 573 GB/s of memory bandwidth.
Optimized code snippet
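The original article shows the optimized snippet as an image, which is not reproduced here. As a purely illustrative sketch of the kind of loop the optimizations above produce (64-byte alignment for AVX-512 and an array length that is a multiple of 8 doubles; all names and sizes are assumptions, not the authors' code):

    #include <stdlib.h>

    #define NCELLS 4096  /* assumed grid size, a multiple of 8 doubles */

    /* Update the density field from fluxes. The arrays are 64-byte
     * aligned, so every 512-bit load/store is aligned, and NCELLS is a
     * multiple of 8, so the loop splits into whole AVX-512 vectors with
     * no scalar remainder. */
    void update_density(double *restrict rho, const double *restrict flux,
                        double dt_over_dx)
    {
        #pragma omp simd aligned(rho, flux : 64)
        for (int i = 0; i < NCELLS; ++i)
            rho[i] -= dt_over_dx * flux[i];
    }

    int main(void)
    {
        double *rho  = aligned_alloc(64, NCELLS * sizeof(double));
        double *flux = aligned_alloc(64, NCELLS * sizeof(double));
        for (int i = 0; i < NCELLS; ++i) { rho[i] = 1.0; flux[i] = 0.01 * i; }
        update_density(rho, flux, 0.5);
        free(rho);
        free(flux);
        return 0;
    }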