ASC'18: Perseverance and regular training as a way to achieve a goal
Student supercomputer competitions are held annually in various points of the world, and aim to attract young talent in the field of high-performance computing in industry and science. This year, our team took part in Asian competitions, and in this article we will discuss the experiences and impressions received at this event.
Tasks and the progress of the qualifying stage
This year, for the first time, there were no tasks to be solved using the equipment provided by the organizers: all the tasks had to be run on their hardware. Thanks to the persistence and perseverance of the professors, shortly before the start of the qualifying stage, our team had access to several nodes with NVIDIA P100 and P6000 video cards, which greatly helped us in the preparation. Tasks differed little from last year . They are described below.
- Build a cluster configuration and describe why certain components were selected.
- Measure cluster performance with Lynpack and HPCG. The difference from last year was only that last year Lynpak had to be optimized for the cluster provided by the organizers with Intel Xeon Phi processors, and in this for any available cluster.
- Optimize Relion (software for image recognition from a cryoelectron microscope) under a video card.
- Build a neural network to respond to user search queries using the CNTK framework and MS MARCO dataset .
Linpack and HPCG. Due to the emergence of new nodes with video cards and Vadim, who was engaged only in performance tests, we have advanced significantly in the first and second assignments. Vadim was able to make as many test runs as needed to confidently fit the parameters to a specific system. Also, on the new nodes it became possible to regulate power consumption, which made it possible to select the cluster configuration taking into account changes in the frequency of the processor and the graphics chip. The emergence of new nodes was the biggest event for the team.
Relion. The code written by biomedical chemists did not differ in a well-thought-out architecture and contained difficult-to-read files of several thousand lines of code. Synchronization was provided by a system call
sleep(). There were dozens of gigabytes of input data, even more days off, one iteration took an average of forty minutes, and it was impossible to understand how to optimize all this. After two weeks of searching, my own memory allocator for a video card was written; Fourier transform and some other subprograms were transferred to video cards. Due to the complexity of the code and the limited time, the remaining optimizations were made after the preliminary stage.
CNTK. As usual, in the task of machine learning the basic configuration of the neural network was given, from which it is worth repelling. The framework and the network itself out of the box did not work. CNTK required a special version of OpenMP, in the utility for checking the result, the functions had incompatible types of parameters, and their number did not match. When everything was finally launched, they began to understand the network architecture. Alas, neural networks are still a weak point of our team, so we didn’t make any very complex changes. The percentage of discarded neurons was changed, the learning rate, the initial value, experimented using GRU instead of LSTM in the recurrent part.
Preparing for the final and finding a sponsor
So, cheers, we went to the final!
This time we immediately wrote to our last year's sponsor and began to prepare. Soon two events happened: last year's sponsor refused us, and the university allocated funds to cover part of the cost of the flight. Next was the generation of ideas where to find the balance. In the end, we volunteered to help the company Devexperts , engaged in the development of financial software for exchanges, brokers and investment companies. In the qualifying round, the team solves problems on the equipment that is available to it. Sometimes part of the problem is solved on the equipment provided by the organizers. In the final, everything is exactly the same except for one but ... this very cluster the team has yet to assemble!
Last year, none of the team members had any experience in setting up a cluster, which is why we have little time left to run competitive tasks, so this year we conducted a series of training sessions on the training cluster. At each training session, we created backup copies of the nodes, fully configured one node, then copied its image over the network to other nodes. As practice has shown, this is the fastest and painless way to configure a cluster from scratch, which does not require in-depth knowledge of low-level technologies. Several trainings were enough to completely debug and automate the process.
Final: Day One and Day Two
In the first two days of the competition, teams assemble and set up a cluster on which all applications will be subsequently launched. As a rule, the more video cards you have in your system, the more you will be able to get performance on most tasks, and the fewer nodes you can install into a rack due to a power limit (the power should not exceed 3 kW, otherwise the results of the task are not counted). However, there are applications in which video cards are not used in any way, and the presence of a large number of nodes is beneficial.
This year, the competition sponsors provided four NVIDIA V100 to each team. The first to find the right person and get the cherished boosters, we set about the installation. No one in the team (including the trainer) had ever had experience installing video cards into the server. After studying the instructions kindly provided by the organizers, we coped with the task, and we did not even have to completely pull out and disassemble the server, as most teams do (see video).
Next, it was necessary to install and configure the operating system. As a rule, students have the least confidence in this area, since it uses niche technologies, the knowledge of which is not useful in other areas, so we had previously conducted several training sessions on the complete configuration of the system of five nodes from scratch on the training cluster using Clonzilla.
Configuration scripts were debugged on a virtual cluster using Vagrant, since only in it one can easily raise several identical virtual machines. Due to the low level of software we use, Docker and other technologies based on Linux namespaces are not suitable for us. Configuration scripts simply do not work on them.
Armed with the experience gained in training, we deployed the operating system and the rest of the packages even faster than in training - the performance of the servers cannot be compared with our training cluster. One of the features of the competition is the lack of access to the Internet from the servers, so we downloaded the repository with the packages in advance and recorded them on two USB disks that we took with us.
After setting up the cluster, each team member began setting up and testing his application for the new system, and here we were in for an unpleasant surprise. The version of Lynpak, which remained from the last year of the competition, refused to work correctly on the new system. Installing different versions of CUDA, going through various options and kernel settings did not give the desired effect. As a result, we decided to launch the usual non-optimized version in order not to lose points for the task. (This is due to the new scoring system this year: even if your result is the best in speed or performance, but the output is incorrect, you will receive only half of the maximum possible points. The second half of the points are scored for correctness of the output.)
To understand the essence of the problem, it is worth telling what Lynpak is. Linpack is used to measure the performance of supercomputers and make a list of TOP500most powerful supercomputers in the world. The easiest way to take a high place in this list is to purchase a cluster with a large number of video cards (the number of processors is not so important, because 99% of the task is given to a video card). For each accelerator, there is an optimized version of Lynpack, whose code is usually closed. You can get a binary only if you have a supercomputer that can rank on the TOP500 list, or if you participate in a supercomputer competition. Despite this, the organizers of the competition did not provide the binary, the Russian branch of NVIDIA also refused to do so. There are no clusters with V100 in Russia that could be included in the TOP500 list, so searches for familiar colleagues were also unsuccessful. The fact that Linpack is not used anywhere else adds to the incomprehensibility of the situation, except for performance testing: neither in science nor in technology. If you want to help the team and know how to get the coveted program, you are welcome in PM. Well, we, with the inherent spontaneity, noted this story in the final presentation, which could please the jury members.
Final: third day
Lynpak, HPCG, Relion and the secret application were waiting for us on the third day of the competition, and this day became the hardest for the team. Having quickly dealt with Linpack (see the previous section) and HPCG, we received work orders (input data) for a secret application. It turned out to be a program for calculating the molecular dynamics of siesta . The first disappointment was that on the part of the tasks Siesta gave an error to the address (despite the fact that it was written in Fortran, in which such an error is not so easy to get), and it was not possible to debug it. Nevertheless, the remaining tasks earned and at the end of the day we successfully passed them.
In parallel with Siesta, we had to launch a previously prepared Relion. All nodes without video cards were given to Siesta, and nodes with video cards were given to Relion, so the programs did not interfere with each other.
Relion's code, we have changed a lot in the preliminary stage to work effectively on video cards. Among other things, we parallelized many functions, rewrote the memory allocator on the video card, transferred the most resource-intensive subroutines to the video card and added the ability to use nodes with and without video cards at the same time. This greatly accelerated the program, and it worked perfectly on university technology. However, at the competition we got video cards with a smaller memory size, which is why Relion crashed with an error. A deeper analysis of the error showed that the code will work only if it is rewritten for a new system. We had no time for this, and this was the second disappointment of the third day.
Final: fourth day
On the fourth day of the competition, CFL3D and MSMARCO remained, and this day passed much more calmly. Freed from the applications assigned to them, the team members began to help each other. For CFL3D, which has a very complex input file format, Ruslan wrote a script that generates it. Since we had a lot of nodes compared to teams with a large number of video cards, we launched several tasks in parallel and after several starts of each task we were able to select the optimal parameters.
Running a previously prepared MSMARCO also did not cause serious problems. The preliminary data processing took several hours, which is why there was no time left for a long study, but thanks to more powerful video cards it was possible to complete it, albeit with a smaller number of epochs. We still have a model trained in more epochs from the qualifying stage (the input data changed in the final, and there was no new file for verification), but according to the rules a model trained during the final was needed, and we decided to turn in an honestly trained model. Despite the coordinated work and the absence of surprises, we used all the allotted time and finished late in the evening.
Final: Day Five
The next day, we were waiting for the presentation. In the evening of the fourth day, we inserted the results into a pre-prepared template and wrote a speech. The presentation was easy, we were never asked any interesting questions, but for some reason we were allowed to shoot only the speaker and slides.
A few hours later the award ceremony began. The sensations were mixed: on the one hand, we performed much better than last year, on the other - we could have performed even better if it were not for the annoying errors with the applications. As a result, despite the fact that our cluster was not distinguished by a large number of video cards, due to the greater number of nodes and perseverance, we went around the other teams in CFL3D, for which we were awarded a separate prize for the competition. In the overall standings, we took the eleventh place out of twenty teams that reached the final (and out of three hundred teams that participated in the preliminary stage). The overall champion, like last year, was Xinhua University. For our team, this was a victory over us: we performed better than last time, gained invaluable experience, which we will use next year, and went around the others in one of our assignments.
Conclusions and general impressions
The cluster configuration, in which there are many more video cards than processors, is advantageous in most cases, but not universal. Nodes become smaller, and not every application can, in principle, work on a video card. Such applications include programs in Fortran, which, because of their venerable age, are not rewritten as a video card, and most often they do not even use all the processor cores. For such applications, the presence of a large number of nodes allows you to run more parallel tasks, and therefore more optimize applications.
The team may not know all the subtleties of installing operating systems and image casting, but this gap is easily replaced by training. Of course, participants will not know all the subtleties, but they will surely perform the installation point by point. Installation scripts are easily debugged on virtual machines.
During the competition you can meet the most wonderful open source software. Programs that are compiled by illegible scripts, programs that use library functions rewritten with errors, programs written in Fortran with C inserts, programs with hard-coded dependencies and compilation flags. I can not remember a single program that would be assembled from the first time or produced a clear error during assembly. (Fresh example: the old version of OpenMPI on new systems tries to connect the library with an empty name. The problem is reliably solved only by auto-replacement in the generated make-files.) The contest teaches not to be surprised at anything and to overcome the difficulties that arise. I want to believe that a person who has worked with such software will never create something similar in his life.
At the competition you can’t stop wondering about Chinese ingenuity. This year, under the place in which the competition is to be held, the Chinese redid the square-shaped conference room with cut corners. It brought racks with servers and a cooling system with a liquid outlet to the nearest bathroom (I am not an expert in this subject and do not know the exact name of the equipment). When they realized that the temperature in the hall did not fall below thirty degrees Celsius, they brought huge ice blocks in the basins. This situation, of course, did not change, but it provided the team with chilled drinks.
Participation in the competition would be impossible without our sponsor - the company Devexperts ( http://devexperts.com/ ). The company assumed the cost of airfare to China.
Impressions of China
Some team members were in China for the first time, which caused a slight cultural shock. Says Anton.
Acquaintance with the land of the rising sun began with the fact that some of us had the batteries taken off for charging, because they did not have a power mark. Apart from this plot, everything else was friendly. We were met by two volunteers with a sign of our university, and then taken by bus to the hotel. It is worth noting that after Peter, where the snow was about to melt, it was quite hot in China (although, of course, the real people of Volgograd did not even feel this). Upon arrival, we were settled in rooms. After daily flights, I decided to flop with fatigue on a huge mattress.
Hard was the mattress.
At that time it was about ten in the morning, so an hour later we went to inspect the local university. To say that he is big is to say nothing. If we compare the territory of the local campus and St. Petersburg State University, then Nanchang University is five times more. We were shown the local canteen, where for the next five days we ate noodles and rice. For the most part, the first communication with ordinary Chinese, whose knowledge of English is not so good, began right here.
The dining room is designed in such a way that each window is a mini outlet where you can buy something. Payment is made by a special card, which you, like the NFC card, simply attach to the reader. Everything happens quickly, and in the queue for a long time do not have to stand. Understanding exactly what you take can be problematic. We have to apply the old grandfather's methods and point the finger at the desired dish. On the third day in the stand with noodles, they began to recognize us, which made life much easier for us. Some even learned to count to ten, so as not to point fingers and show respect to the inhabitants. If we talk about the food itself, then there were quite good and healthy such dumplings. With the help of a volunteer, I managed to get a delicious soup, but since to do this you need to interact with the seller verbally, it was only once.
On the second day, we met a third volunteer named Ksenia. (The Chinese, as a rule, come up with easily pronounced names to communicate with foreigners.) She has been studying Russian for two years, so she was assigned to us, a kind of useful experience.
The competition itself has already been described in detail in other sections, but here I would like to note that there were only two chairs per team, so you had to sit on the floor, after which the legs only asked for mercy, because it took ten hours to sit all four days of the competition .