Behind the scenes of the TOP-1 supercomputer
This is the story of how we slowed down, er, accelerated calculations on the most powerful supercomputer in the world.
In April of this year, our team took part in the finals of the Asia Supercomputer Challenge 2017, one of whose tasks was to accelerate the Masnum-Wave ocean-wave modeling program on the Chinese Sunway TaihuLight supercomputer.
It all started with a qualifying round in February: we got access to a supercomputer and met our new friend for the next couple of months.
The computing nodes are arranged in two ovals with the networking hardware between them. The nodes are built on Shenwei 26010 processors. Each processor consists of 4 heterogeneous core groups, each containing one management core and 64 computing cores with a clock frequency of 1.45 GHz and 64 KB of local memory per core.
All of this is cooled by a water cooling system located in a separate building.
The software at our disposal consisted of Fortran and C compilers with “support” for OpenACC 2.0 and athread (an analogue of POSIX Threads), plus a job scheduler that combines the capabilities of a regular scheduler and mpirun.
Access to the cluster went through a special VPN whose client plug-ins were available only for Windows and Mac OS. All of this added a special charm to the work.
The task was to accelerate Masnum-Wave on this supercomputer. We were given the Masnum-Wave source code, a few tiny readme files describing the basics of working on the cluster, and data for measuring the speedup.
Masnum-Wave is a program for simulating wave motion across the globe. It is written in Fortran and uses MPI. In a nutshell, on every iteration it reads the external forcing data, synchronizes the boundary regions between MPI processes, computes the wave propagation, and saves the results. We were given a workload of 8 model months with a time step of 7.5 minutes.
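Schematically, the main loop can be pictured like this (a simplified sketch with invented subroutine names, not the actual Masnum-Wave source):

program wave_loop_sketch
  use mpi
  implicit none
  integer :: ierr, step, nsteps

  call MPI_Init(ierr)
  nsteps = 8 * 30 * 24 * 8                 ! roughly 8 model months in 7.5-minute steps

  do step = 1, nsteps
    call read_forcing()                    ! read the external forcing (wind fields etc.)
    call exchange_halo()                   ! synchronize boundary regions between MPI ranks
    call propagate()                       ! advance the wave field by one time step
    if (mod(step, 1000) == 0) call write_checkpoint()
  end do

  call MPI_Finalize(ierr)

contains
  ! empty stubs: in Masnum-Wave the real work lives in separate modules
  subroutine read_forcing();     end subroutine
  subroutine exchange_halo();    end subroutine
  subroutine propagate();        end subroutine
  subroutine write_checkpoint(); end subroutine
end program wave_loop_sketch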
On the very first day we found a paper online, “The Sunway TaihuLight supercomputer: system and applications” by Haohuan Fu et al., describing the acceleration of Masnum-Wave on the Sunway TaihuLight architecture using pipelining. The authors of the paper used the full power of the cluster (10,649,600 cores), while we had 64 computing groups (4,160 cores) at our disposal.
Masnum-Wave consists of several modules.
It was the first time we had encountered so much code generated by science. The code is a mixture of Fortran 90 and Fortran 77, which sometimes tripped up even their own compiler. It is full of wonderful goto constructs, pieces of commented-out code and, of course, comments in Chinese.
Short code example for clarity:
do 1201 j=1,jl
js=j
j11=jp1(mr,j)
j12=jp2(mr,j)
j21=jm1(mr,j)
j22=jm2(mr,j)
!****************************************************************
! do 1202 ia=ix1,ix2
! do 1202 ic=iy1,iy2
eij=e(ks,js,ia,ic)
!if (eij.lt.1.e-20) goto 1201
if (eij.lt.1.e-20) cycle
ea1=e(kp ,j11,ia,ic)
ea2=e(kp ,j12,ia,ic)
! ...
eij2=eij**2
zua=2.*eij/al31
ead1=sap/al11+sam/al21
ead2=-2.*sap*sam/al31
! fcen=fcnss(k,ia,ic)*enh(ia,ic)
fcen=fconst0(k,ia,ic)*enh(ia,ic)
ad=cwks17*(eij2*ead1+ead2*eij)*fcen
adp=ad/al13
adm=ad/al23
delad =cwks17*(eij*2.*ead1+ead2) *fcen
deladp=cwks17*(eij2/al11-zua*sam)*fcen/al13
deladm=cwks17*(eij2/al21-zua*sap)*fcen/al23
!* nonlinear transfer
se(ks ,js )= se(ks ,js )-2.0*ad
se(kp2,j11)= se(kp2,j11)+adp*wp11
se(kp2,j12)= se(kp2,j12)+adp*wp12
se(kp3,j11)= se(kp3,j11)+adp*wp21
se(kp3,j12)= se(kp3,j12)+adp*wp22
se(im ,j21)= se(im ,j21)+adm*wm11
se(im ,j22)= se(im ,j22)+adm*wm12
se(im1,j21)= se(im1,j21)+adm*wm21
se(im1,j22)= se(im1,j22)+adm*wm22
!...
! 1202 continue
! 1212 continue
1201 continue
First of all, we identified the bottlenecks in the code and the candidates for optimization by printing the execution time of each function. Of greatest interest to us were the warcor module, which is responsible for the numerical solution of the model equation, and the checkpoint-writing function.
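In practice this can be as simple as wrapping the suspect calls in timers; a minimal sketch, assuming MPI_Wtime and using an invented warcor_step stand-in for the real call:

program timing_sketch
  use mpi
  implicit none
  integer :: ierr, rank
  double precision :: t0, t_warcor

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  t0 = MPI_Wtime()
  call warcor_step()                        ! stand-in for the call being measured
  t_warcor = MPI_Wtime() - t0

  write (*,'(a,i5,a,f10.3,a)') 'rank ', rank, ': warcor took ', t_warcor, ' s'

  call MPI_Finalize(ierr)

contains
  subroutine warcor_step(); end subroutine  ! empty stub for the sketch
end program timing_sketch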
Having studied the scarce documentation for the Chinese compilers, we decided to go with OpenACC, since it is a standard from Nvidia with examples and specifications. Besides, the athread example code from the readme seemed needlessly complicated to us and simply did not compile. How wrong we were.
One of the first ideas that comes to mind when optimizing code for accelerators is using local memory. In OpenACC this can be done with several different directives, but the result should always be the same: the data must be copied into local memory before the computation starts.
To check and choose the right directive, we wrote several small Fortran test programs and made sure that they worked and that you really can get a speedup this way. Next we put these directives into Masnum-Wave, listing in them the most heavily used variables. After compilation the variables began to appear in the logs, the accompanying messages in Chinese were not highlighted in red, and we decided that everything was being copied and was working.
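The test programs were along these lines (a minimal sketch with made-up array names, using standard OpenACC 2.0 clauses):

program acc_copy_test
  implicit none
  integer, parameter :: n = 1024
  real :: a(n), b(n)
  integer :: i

  a = 1.0
  b = 0.0

  !$acc data copyin(a) copyout(b)    ! ask for explicit copies to/from the accelerator
  !$acc parallel loop
  do i = 1, n
    b(i) = 2.0 * a(i)
  end do
  !$acc end parallel loop
  !$acc end data

  print *, 'checksum = ', sum(b)     ! expect 2048.0 if the copies really happened
end program acc_copy_test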
But it turned out not to be so simple. The OpenACC compiler did not copy the arrays in Masnum-Wave, although it worked fine on the test programs. After spending a couple of days with Google Translate, we realized that it does not copy objects declared in files pulled in via include directives!
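Schematically, the failing pattern boiled down to something like this (the file and array names are invented; it is the same toy kernel as above, only with the declarations moved into an include file):

! Contents of the (invented) include file wave_state.inc:
!     real :: a(1024), b(1024)
program include_copy_problem
  implicit none
  include 'wave_state.inc'          ! the arrays are declared in the include file
  integer :: i

  a = 1.0
  b = 0.0

  !$acc data copyin(a) copyout(b)   ! on the Sunway compiler these copies were
  !$acc parallel loop               ! silently skipped for include-declared arrays
  do i = 1, 1024
    b(i) = 2.0 * a(i)
  end do
  !$acc end parallel loop
  !$acc end data

  print *, 'checksum = ', sum(b)    ! a wrong checksum betrays the missing copy
end program include_copy_problem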
Over the next week we moved the Masnum-Wave code out of the include files (there are more than 30 of them) into the main source files, making sure everything was declared and linked in the right order. But since none of us had any prior Fortran experience and it all came down to “change it and see what happens,” we also could not avoid replacing some basic functions with crutch versions.
And so, when all the modules had been shoveled over and we launched the freshly compiled code hoping to get our modest speedup, we got another batch of disappointment: directives written according to all the canons of the OpenACC 2.0 standard produced runtime errors. At this point the suspicion began to creep into our heads that this wonderful supercomputer supports its own special standard.
Then we asked the organizers of the competition for documentation, and on the third attempt they provided it to us.
A couple of hours with Google Translate confirmed our fears: the standard they support is called OpenACC 0.5, and it is fundamentally different from the OpenACC 2.0 that ships with the PGI compiler.
For example, our main idea was to reuse data on the accelerator. To do this in the 2.0 standard, you wrap the parallel code in a data region. Here is how it is done in the examples from Nvidia:
!$acc data copy(A, Anew)
do while ( error .gt. tol .and. iter .lt. iter_max )
  error=0.0_fp_kind

  !$omp parallel do shared(m, n, Anew, A) reduction( max:error )
  !$acc kernels
  do j=1,m-2
    do i=1,n-2
      Anew(i,j) = 0.25_fp_kind * ( A(i+1,j ) + A(i-1,j ) + &
                                   A(i ,j-1) + A(i ,j+1) )
      error = max( error, abs(Anew(i,j)-A(i,j)) )
    end do
  end do
  !$acc end kernels
  !$omp end parallel do

  if(mod(iter,100).eq.0 ) write(*,'(i5,f10.6)'), iter, error
  iter = iter +1

  !$omp parallel do shared(m, n, Anew, A)
  !$acc kernels
  do j=1,m-2
    do i=1,n-2
      A(i,j) = Anew(i,j)
    end do
  end do
  !$acc end kernels
  !$omp end parallel do
end do
!$acc end data
But on the cluster this code does not compile, because in their standard this operation is done by specifying an index for each data block:
#include <stdio.h>
#include <stdlib.h>
#define NN 128
int A[NN],B[NN],C[NN];
int main()
{
    int i;
    for (i=0;i<NN;i++)
Cursing ourselves for choosing OpenACC, we nevertheless kept going, since only a couple of days remained. On the last day of the qualifying round we finally managed to run our “accelerated” code. We got a 3.5x slowdown. There was nothing left for us to do but write in the report for the task everything we thought about their OpenACC implementation, in censored form. Despite all this, we got a lot of positive emotions out of it. When else do you get to remotely debug code on the most powerful computer in the world?
P.S.: In the end we still made it to the final round of the competition and traveled to China.
The last task of the final was a presentation describing our solutions. The best result was achieved by a local team, which wrote its own library in C using athread, because, in their words, OpenACC does not work.