ilnuribat January 30, 2015 at 12:29

How not to run an olympiad, or setting up ejudge with distributed testing


In one ~~off-center~~ remote region of our vast country, the regional stage of the All-Russian Olympiad for schoolchildren in informatics and programming was held once again. Until 2014 everything was fine: the Olympiad ran on an old system written in Delphi back in 2004 by a very gifted programmer. Nobody had changed it since; it worked, and that was good enough. In 2014 we decided to try ejudge. We did not build everything from source and took a ready-made virtual machine image instead. Everything was fine, everything worked.

But then came 2015, in which some of the Olympiad rules changed just a little bit, and the "people" who needed to know learned about these changes only 1-2 days before the start...

Here the fun begins.

The thing is that almost all of these changes affected only the two of us (me and ripatti).
I was responsible for the server (Fedora 19, ejudge) and keeping it running; he was responsible for preparing the tests and configuring the rounds in general. He has a wealth of experience in this.

So, I’ll go in chronological order.

January 21, Wednesday

I am asked whether I can bring up a server for the Olympiad on the university's dedicated machines. I say no: there was not enough time, and the environment might be unfamiliar to me (I thought they ran VMware, and I could only use VirtualBox). In short, I could not guarantee that everything would be fine.

January 22, Thursday

I find out that there is such a thing as tokens. This meant only one thing: participants' solutions had to be checked during the round, not after it. Remembering last year's round, I decided that one server would handle everything. Last year nothing crashed, everything worked, everyone was happy. I started working on the server and brought a machine (the hardware) to the university.

explanation

Last year I had set up a server with ejudge at my lyceum well in advance, before the Olympiad. That is why, at last year's regional stage, it was decided to try a ready-made solution.

In the evening I learn from my partner that the previous version of ejudge (2.3) does not meet the requirements. Just then Alexander Chernov published a working version and even set up a new repository with all the settings of the trial round. It was very tempting, even though I had been planning to configure the old version. We decided to build the new version from source, since there was no ready-made image. This is where the first problems began.

Problem: how do you run ssh on a port other than 22?

background and (partial) solution

The point is the university. Like any organization, it blocks port 22 from the outside. We could work comfortably inside the university, but the problems would start as soon as we left. Thank God, my supervisor was the administrator of the cluster, which had an external IP, though access to it was restricted. I asked him for help, and in the end he set everything up for us. Actually, I had asked for ssh access to the cluster (from there I could easily reach my server on port 22), but he really did not want to hand out access left and right. We decided to "solve the problem radically": I gave him all the logins and passwords, and he promised to take a look. Yes, I am a gullible person.

In fact, I had tried to do this myself, but could not.

Excerpts from what he sent me afterwards:

... thirdly, the ssh server settings are stored in /etc/ssh/sshd_config, not ssh_config. I added PermitRootLogin no to the former, along with

Port 22
Port 5000

and everything came up as it should:

[root@localhost ssh]# service sshd status
Redirecting to /bin/systemctl status sshd.service
sshd.service - OpenSSH server daemon
   Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled)
   Active: active (running) since Thu 2015-01-22 21:01:38 YEKT; 4min 53s ago
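A change like this can be double-checked by reading the Port lines back out of the config before restarting sshd. What follows is only an illustrative Python sketch, not part of the original setup: the function name sshd_listen_ports and the sample text are made up, and the parser is deliberately simplified (it ignores the keyword=value form and any included files).

```python
def sshd_listen_ports(config_text):
    """Return the ports named by Port directives, in file order."""
    ports = []
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        # sshd_config keywords are case-insensitive.
        if parts[0].lower() == "port" and len(parts) >= 2:
            ports.append(int(parts[1]))
    # With no Port directive at all, sshd uses its default, port 22.
    return ports or [22]


# Hypothetical excerpt that mirrors the lines quoted above.
sample = """
PermitRootLogin no
Port 22
Port 5000
"""
print(sshd_listen_ports(sample))  # [22, 5000]
```

Keeping Port 22 next to Port 5000 means the server stays reachable from inside the university exactly as before, while the second port is the one that is reachable from outside.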

Hurray!
Port 5000 is open for ssh, and I can connect to it.
But neither GitHub nor yum update works, nothing...
More precisely, I could not get those working that night.

At 7 in the morning I called my partner (woke him up) and told him everything. The problem was that we simply could not compile the source code because some libraries were missing, and I could not download them (all I had was ssh on port 5000). I tried installing them one at a time, but, damn it, the dependency chains there are long.

We decided to build another server with a full ejudge (3.3) setup, so that afterwards I would not have to go to the server itself (it sat in the server room under lock and key, and getting physical access to the machine was a problem).

January 23, Friday, trial round starts at 16:00

At 9 in the morning I go to take a colloquium on functional analysis. The dean put down some mark; I did not look. It seems it was not a fail.
At 10 o'clock I start building the new ejudge in parallel with Artyom. He is a little faster, while I got stuck on a small step and stopped thinking any further.
The second problem.

Building ejudge version 3.3 on the Fedora 19 image with ejudge 2.3

We did not remove the old version; we simply installed the new one alongside it.
We pull the source code from GitHub and build it.

git clone https://github.com/blackav/ejudge.git
cd ejudge/
./fedora-configure
make
su
make install
# It seems all that is left is to run ejudge-control, but:
ejudge-control
Tue Jan 27 01:24:35 2015:info:ej-users 2.3.29, compiled Sat Dec 14 07:58:33 2013
mysql: SELECT config_val FROM config WHERE config_key = 'version' ;
Tue Jan 27 01:24:35 2015:info:ej-super-server 2.3.29, compiled Sat Dec 14 07:58:33 2013
Tue Jan 27 01:24:35 2015:info:configuration file parsed ok
Tue Jan 27 01:24:36 2015:info:ej-jobs 2.3.29, compiled Sat Dec 14 07:58:33 2013
Tue Jan 27 01:24:36 2015:info:ej-contests 2.3.29, compiled Sat Dec 14 07:58:33 2013
Tue Jan 27 01:24:36 2015:info:using files as the new-server database

Yes, exactly: ejudge-control picked up the old version.
Everything worked, but when we open the web interface we see the old version.
I renamed the folder that held the old binaries. This served two goals: making sure the old version disappeared from the search path, and keeping a backup of it.
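The same thing can be checked without renaming anything, by listing every ejudge-control on the search path in lookup order. Here is a small Python sketch, assuming a Unix-like system; find_on_path is a hypothetical helper written for this illustration, not something that ships with ejudge.

```python
import os


def find_on_path(name, path=None):
    """Return every executable called `name` on PATH, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    matches = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # empty entries are skipped in this sketch
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            matches.append(candidate)
    return matches


# The first line printed is the binary the shell would normally run.
for location in find_on_path("ejudge-control"):
    print(location)
```

If two locations are printed, the old and the new install are shadowing each other, which is exactly the situation described above.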

Now we run ejudge-control again, the one located at /usr/bin/ejudge-control:

[ejudge@localhost ~]$ ejudge-control start
2015-01-27T19:03:18Z:info:ej-users 3.3.1, compiled 2015-01-23 09:25:21
mysql: SELECT config_val FROM config WHERE config_key = 'version' ;
2015-01-27T19:03:18Z:info:ej-super-server 3.3.1, compiled 2015-01-23 09:25:21
2015-01-27T19:03:18Z:info:configuration file parsed ok
2015-01-27T19:03:19Z:info:ej-jobs 3.3.1, compiled 2015-01-23 09:25:21
2015-01-27T19:03:19Z:info:ej-contests 3.3.1, compiled 2015-01-23 09:25:21
2015-01-27T19:03:19Z:info:using files as the new-server database

A little more shamanism, and the trial round is ready!

We said this when the time was about 5 p.m.

I ran to the server room with the image. I arrive, and the screen has just gone dark. I thought the monitor had gone to sleep. It was worse: the system administrator had simply cut the power to my hardware for no reason. Now I wait for Windows Server 2008 to boot, then copy the image, import it into VirtualBox, start it, assign static addresses, and configure ssh. Because last time all of this had been set up for me by my research supervisor (Yuldashev Arthur Vladimirovich), this time it took me a lot of time. It did not help that I had no way to google anything in the server room.

It is 17:45, the trial round is almost over, and our server is still not up... Calls keep coming in; we answer that everything is wrapping up and that we will not manage to bring the server up in time.

18:00, the server is still not up. We gather with the other jury members and think about how to get out of this situation.

The decision was as follows: Artyom and I do not sleep, we finish setting up the trial round and the first round, and we have everything ready by 10:00. From 10:00 to 11:00 we run the trial round, and at 11:00 we start round 1. That is how we lost sleep for 2 nights.

We said goodbye and went home. At home we started setting everything up from scratch. By morning everything was ready.

January 24, Saturday, round 1 (official schedule)

The trial round begins, and here we finally realized what we were dealing with.

Tokens

What are they?
Last year the situation was as follows: a participant sends source code to the testing system, which checks it only on the sample tests shown in the problem statement. If the submission fails them, it does not enter the queue for full testing. That is why our one honorable server calmly coped with the entire load (there were 150 participants in total).

This year we had to test each solution on all tests immediately. To keep participants from abusing this, the concept of tokens was introduced. A token is, so to speak, the right to see the result of your submission. There were 10 of them. That is, I can submit a solution to a problem as many times as I like, but I can see the result only 10 times. Subsequent submissions are at your own risk.
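To make the rule concrete, here is a toy Python model of the token limit. It is only a sketch of the rule as described above, not how ejudge implements tokens: the TokenLedger class is invented for illustration, and it assumes a token is spent automatically on each submission while any remain.

```python
class TokenLedger:
    """Toy model: unlimited submissions, limited looks at the result."""

    def __init__(self, tokens=10):
        self.tokens_left = tokens
        self.submissions = 0

    def submit(self):
        """Record one submission; return True if its result is visible."""
        self.submissions += 1
        if self.tokens_left > 0:
            self.tokens_left -= 1  # assumption: spent automatically
            return True
        return False  # judged, but the participant sees nothing


ledger = TokenLedger()
visible = [ledger.submit() for _ in range(12)]
print(visible.count(True), ledger.tokens_left)  # 10 0
```

The important consequence for the server is that every one of those submissions, visible or not, still has to be run on the full test set during the round.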

The trial round has begun, and the server is running 15 minutes behind. That is, a participant sends a solution to the server, and it is checked only 15 minutes later. We were not alarmed by this. We should have been. We thought it would pass.

I do a Reload contest, which drops the entire queue of submissions. However, I did not tell anyone about it. As a result, 10 minutes before the end of the trial round we are bombarded with submissions again. We quietly close the contest and open the round 1 contest.

11:00, round 1

Literally within 15-20 minutes a number of submissions arrive and a nasty queue forms. Artyom understood why right away. The first problem, the easiest one, has 48 tests. There is a brute-force solution that scores 50 points out of 100, and there is a good solution that takes some thought. But most participants were supposed to find that out only after their solution got TLE. As you can guess, a single brute-force submission for problem A took the server 24 seconds. There were more and more such submissions, and questions about testing time began to reach the jury. Artyom explained everything correctly and sent a message to everyone. Even so, almost everyone sent at least one "free" solution to A. And then the queue naturally began to grow: first 15 minutes, then sharply 45. Everyone, especially the participants, was worried, tense, and unhappy, first of all with us. Artyom was at home at the time; I was on site and heard nearly everything there was to hear. We started thinking about how to get out of the situation. We found the right article in the documentation but could not put it to use. After that we simply turned a blind eye to 30 questions and waited for it to end.

Finally it is over! The testing delay was 1 hour. A participant had to submit a solution an hour before the end to have time to see the testing report.
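A back-of-the-envelope calculation shows where a delay of that size comes from. The Python sketch below uses the 24 seconds per brute-force submission quoted above; the figure of 150 submissions is an assumption (last year's participant count, one brute-force attempt each), and the model ignores everything else in the queue.

```python
def drain_minutes(submissions, seconds_each, workers=1):
    """Minutes needed to judge a backlog split evenly across workers."""
    return submissions * seconds_each / workers / 60


# One server, 150 brute-force submissions at 24 s each.
print(drain_minutes(150, 24))  # 60.0

# The same backlog spread over 14 judging nodes.
print(round(drain_minutes(150, 24, workers=14), 1))  # 4.3
```

Even this crude estimate lands on the one-hour delay for a single machine, and it shows why adding judging nodes was the obvious way out.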

16:00, I walk into the assembly hall and meet unhappy eyes. Of course: I had just cost children their chance of reaching the final. How else could they look at me? I ran into one very well-known teacher and told him what the problem was and what the solution was: to parallelize. He wished me good luck.

The problem was announced to everyone openly. They said that we had not expected such loads, and so on. We immediately began to think and look for a way out.

Option 1: put 1 server in each computer lab, and 2 in the large ones. After the Olympiad we collect all the results; no one has network problems, and the load drops by an order of magnitude, which makes it possible to meet all the requirements 100%. The drawbacks are obvious: it is Saturday, and almost all computer labs are already closed, including the server room. We have no servers at hand, nor images of round 2. The labs are too far apart, in 3 buildings. Access over ssh is out of the question. Round 2 starts on Monday at 9 a.m. We cannot pull this off on Monday morning, because there are only two of us.

Option 2: connect compute nodes to the main server. This option is ideal. Nothing needs to change in how the Olympiad is organized. The only problem is creating these compute nodes.

We had nothing at hand at that point. One call, and an hour later we have 13 laptops: Core i7, 8 GB of RAM. The only machine image I had was the trial round image.

20:00, I am sitting at the department, setting up the server on 1 laptop. We called Artyom so that he would come and help me set everything up (I did not know how to configure a round). Suddenly the head organizer has an idea: his house is empty (his wife and grandchildren arrive only on Sunday afternoon), so let's go to his place for the night.

Everyone is happy, or more precisely, Artyom and I are. Another teacher comes along to help us.

January 25, night is day

We took 7 laptops with us, arrived, and unpacked. We were served a delicious meal and, having regained our strength, we began.
We set up round 2, copied the image to the drive, and thought: why not try to parallelize after all?
There is plenty of time, and apparently energy too.

And now for the fun part: how ejudge works.
There is a service (daemon) responsible for compiling, running, and testing programs: ej-super-run. It takes its data from /home/judges/, where the configuration files, tests, checkers, and submitted solutions normally live.
I do not know which process is responsible for the web interface; we simply ran ejudge-control, which started the whole system. I did not go into the details.
For parallelization, the suggestion was to share the /home/judges/ directory. It does not matter how: SSHFS, Samba, or NFS.
For this, though, the worker nodes (slaves, as they are called in distributed systems) have to be rebuilt with a certain flag. Our OS lab work had included creating network shares with NFS and Samba. I confidently started with Samba and immediately ran into the first problem, which I was too lazy to solve. Having abandoned it, I turned to NFS, but it was natural to expect plenty of problems there as well. That left the last and more familiar option, SSHFS. Familiar because I was on good terms with SSH and had often worked with it.

I opened the first tutorial I found and set everything up.

First, make sure the directory /home/judges/ is empty; if it is not, clear it.

sshfs ejudge@192.168.1.11:/home/judges/ /home/judges/

After that, the /home/judges/ directory is shared with the server. For full convenience you can set it up to mount automatically, but we did not do this, because it was already morning.

If you need to specify a different port, add the -p option:

sshfs -p 5000 ejudge@192.168.1.11:/home/judges/ /home/judges/

In the case of our server, this was relevant.

And, thank God, it worked!
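The two steps above (check that the mount point is empty, then mount, optionally on another port) are easy to wrap in a helper. Here is a hedged Python sketch: sshfs_command is a hypothetical function that only builds the argument list and does not run anything, and the only sshfs option it uses is the -p shown above.

```python
import os


def sshfs_command(user, host, remote_dir, mount_point, port=None):
    """Build the sshfs argument list; refuse a non-empty mount point."""
    if os.listdir(mount_point):
        raise RuntimeError(f"{mount_point} is not empty; clear it first")
    command = ["sshfs"]
    if port is not None:
        command += ["-p", str(port)]
    command += [f"{user}@{host}:{remote_dir}", mount_point]
    return command


# To actually mount, pass the list to subprocess.run(command, check=True).
# Example call, with the address used above:
#   sshfs_command("ejudge", "192.168.1.11", "/home/judges/",
#                 "/home/judges/", port=5000)
```

Building the command as a list rather than a string avoids shell quoting issues and makes it simple to repeat the same call on every worker node.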

We chose one laptop as the server and another as the slave. More precisely, these are the virtual machines running on them.
Through the web interface I launched 2 submissions (with while (true), so that they would get TLE on every test), which the server judged by itself, and I noted the time. Then we started ej-super-run on the worker node and sent the same 2 submissions for rejudging. Happiness.
The worker node picked up a submission and began testing. The testing time dropped almost 2 times, 30 seconds against 50.

The next step was to connect the worker node to the real server, since port 5000 was no longer a problem for us.

We began imaging the rest of the laptops, tuning the settings along the way. We wanted to write a neat script that would apply all the settings in one go, but alas, with our clumsy hands such things do not work on the first try. I entered all the settings by hand. On the server, the ej-super-run process was stopped so that the server would deal only with the web interface.
Then we thought: each laptop has 4 cores, and one worker node can test only in single-threaded mode.

Give a man a mountain of gold, he will want one more

Either we bring up 1 virtual machine, give it plenty of resources, and parallelize across cores inside it, or we simply bring up 2 virtual machines with 2 cores each.
We did not care whether the system sped up 2 or 3 times, since we still had plenty of machines. We decided to stop there and run 2 virtual machines per laptop. When all 7 laptops were ready, we rewarded ourselves with sleep at 12 o'clock.

January 26, 08:30

Already fresh, I am at the university; Artyom has arrived too. We got all 13 laptops; the guys from the "network service" quickly crimped the cables and set up the network. As a result, 12 of them were on the network; the 13th laptop got no connection, apparently because its cable was old. I quickly brought up the first 7, after which the web interface began to slow down terribly; apparently sshfs was downloading the entire directory, which was pretty hefty.

Round 2 has begun, and we already have 14 worker nodes! I quietly connected the remaining nodes to the system one at a time, so as not to overload it.

The queue on the server never exceeded 10 simultaneous tests. That is, in principle, 5 laptops would be enough to run a full round.

A television crew came, and they were told that we had 24 worker nodes. I had to bring all of them up before the end of the Olympiad to keep our word.
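The node counts quoted above all follow from running 2 virtual machines per laptop. A tiny Python sketch of that arithmetic, assuming each worker node judges one queued submission at a time; both function names are made up for illustration.

```python
import math


def work_nodes(laptops, vms_per_laptop=2):
    """Judging nodes available from a given number of laptops."""
    return laptops * vms_per_laptop


def laptops_needed(peak_queue, vms_per_laptop=2):
    """Laptops required so that every queued test gets its own node."""
    return math.ceil(peak_queue / vms_per_laptop)


print(work_nodes(7))       # 14
print(work_nodes(12))      # 24
print(laptops_needed(10))  # 5
```

Seven laptops give the 14 nodes available at the start of round 2, the 12 networked laptops give the promised 24, and a peak queue of 10 is covered by 5 laptops.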

As a result, participants did much better in round 2 than in round 1, although in round 1 there was a participant who scored 400, while in round 2 the score was only 370.

Alternative

In fact, all of this had been known for a long time, and some regions even turned to Yandex for it. Yandex accepted applications from regions that could not run the regional stage on their own under the new requirements. Applications had to be submitted 10 days before the Olympiad, so we did not consider this option. 26 regions turned to Yandex.
We were also told that things did not go well in other regions either, and that scores were low overall.

That is how we, the technical jury, shamefully ran the regional stage of the All-Russian Olympiad.

Conclusion

UPD :
List of errors:
1) As a student who has task parallelization (OpenMP, MPI) in the curriculum, I should have understood that an olympiad cannot be run on 1 machine.
2) Apparently you need to start asking around even earlier: go and find out what is new, and not wait until they call. The thing is, this year I deliberately asked about ROI for a month or even 2, but nobody told me anything useful.
