Friday Night to Monday: How We Launched Skyforge
As many of you know, on March 26 Allods Team (a Mail.Ru Group studio) launched the open beta test (OBT) of its new MMORPG, Skyforge. My name is Sergey Zagursky, I work on the server team, and I want to tell you how the launch went, what incidents we ran into, and how we came out of those situations as winners.

Early OBT phase
On March 26, the OBT opened for owners of early access packs. For a week they had exclusive access to the game, which they could use to get a small head start over other players. This phase brought no big surprises, because the load was, on the whole, comparable to what we had seen during the closed beta test and the stress test. The interesting part started later...
Access is open to all.
On Thursday, April 2, during scheduled maintenance, we unchecked the "Entry is allowed only to owners of early access packs" checkbox. Apart from delays when logging into the game, the first day went smoothly enough. That night's maintenance was dedicated to fixing irregularities in the hardware and software configuration.
Friday!
The interesting things started on Friday, April 3. In the morning, after the night maintenance ended, players began to fill up server capacity. Twenty minutes after the servers started, we passed the load level reached during the stress test and entered uncharted territory. We had, of course, run much larger synthetic load tests, but a player is an unpredictable creature who presses buttons that bots have not yet learned to press. The first signs of trouble were not long in coming: just two hours after opening, the server stopped letting users into the game for the first time.
One of the main focuses of our server development is quality of service, so we have built several mechanisms to estimate it for players who are already in the game. If that estimate falls below a certain threshold, we suspend the entry of new players, guided by the principle of "better fewer, but better." Two hours after the start the threshold was crossed: we closed entry for half a minute, and players began to see a login queue that moved only as those ahead of them gave up and left it. Many of the developers were in the game at that moment and were somewhat puzzled, because subjectively the quality of service had not suffered. There were several reasons for this. Load analysis revealed several nodes under abnormal load. The most heavily loaded was the service behind the in-game market, although inside the game the market still felt like it was responding with acceptable delay. The next most loaded were several of the nodes storing game character data, and they deserve a short digression into their recent history.
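Purely as an illustration of this admission-control idea (not the actual Skyforge code), a login gate driven by a quality-of-service estimate might look roughly like the sketch below; the metric, the threshold value, and all names are assumptions.
```python
# Hypothetical sketch of threshold-based admission control: while the measured
# quality of service stays below a threshold, new logins are queued rather than admitted.

QOS_THRESHOLD = 0.8  # assumed value; below this, stop admitting new players

class LoginGate:
    def __init__(self, qos_source, threshold=QOS_THRESHOLD):
        self.qos_source = qos_source  # callable returning the current QoS estimate in [0, 1]
        self.threshold = threshold
        self.queue = []               # players waiting to enter the game

    def request_entry(self, player_id):
        if self.qos_source() >= self.threshold:
            return "enter"            # service quality is fine, let the player in
        self.queue.append(player_id)  # otherwise the player joins the login queue
        return "queued"

    def drain_queue(self):
        # Admit queued players one by one while the QoS estimate stays above the threshold.
        while self.queue and self.qos_source() >= self.threshold:
            yield self.queue.pop(0)
```
The limit adjustments described a little further on would, in these terms, amount to changing the threshold on the live server.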
Back at the closed beta stage we were somewhat concerned that the main database nodes might not cope with the load. Shortly before the OBT we therefore brought in several additional machines for the databases. Their configuration was suitable in every respect except the disk subsystem, so SSDs we had in stock were installed into these machines, and in this new form they were attached to the server cluster.
The way characters are balanced across the database nodes played a bit of a trick on us. The specific node on which a player's character will be stored is determined when the account is created, even if the account is created from the game portal. It is easy to guess that all the players who had shown interest in the game, logged into the game portal and registered an account there had been balanced onto the "old" nodes, and among them were all of the developers. Later this kept us from seeing the quality of service from the player's point of view, because on those nodes the load stayed within normal limits. Of course, we knew about this behaviour in advance and made sure that new users would start being registered on the "old" nodes only after the load on the "old" and "new" nodes evened out. And so our game acquired what we came to call its "old-timers" and "newcomers."
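A minimal sketch of that balancing idea follows; the hashing scheme, node names, and the way the node lists are switched are purely assumptions for illustration.
```python
import hashlib

# Hypothetical illustration of node assignment at account creation time.
# The node for a character is fixed once, when the account is registered,
# so accounts created before the new nodes appeared all live on the "old" nodes.

OLD_NODES = ["db-01", "db-02", "db-03"]  # assumed node names
NEW_NODES = ["db-04", "db-05"]           # assumed nodes added shortly before the OBT

def assign_node(account_id: str, nodes: list) -> str:
    """Pick a storage node for a newly created account; the choice never changes later."""
    digest = hashlib.sha256(account_id.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# Accounts registered through the game portal before launch only ever saw OLD_NODES,
# so every pre-launch account (including the developers') ended up on an "old" node.
print(assign_node("dev_account_42", OLD_NODES))

# After launch, new registrations go to the freshly added nodes until the load
# across "old" and "new" nodes evens out, and only then to the combined list.
print(assign_node("new_player_123", NEW_NODES))
```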
Back to the chronicle. As I said, not all of the database nodes were under load, only a few of them. Moreover, the quality of service on the "old-timer" nodes was higher, even though there were far more players on them at that moment. Meanwhile the server was closing its doors for the Nth time, and, since nothing inside the game was actually lagging, we decided to adjust the maximum load limit beyond which the server stops letting players into the game. There were several such adjustments, and each temporarily improved the login situation; we were trying to find the value at which the load would stabilize. After one of these adjustments the load on the database nodes did stabilize, but the market continued to degrade.

It happened. The dashboard reported that the server had started its emergency shutdown procedure. A brief investigation showed that the server had shut itself down in strict accordance with the algorithm built into it. By design, as they say.
The distributed game server consists of many nodes responsible for storing particular data about the characters and the game as a whole in databases. The strategy for quick recovery from failures includes periodically creating consistent restore points on all database nodes. At a fixed interval, a dedicated service coordinates the creation of these restore points and reports any problems with creating them. The system is configured so that, if data integrity is violated, character progress can never be rolled back by more than 10 minutes. In our case, this service reported that it could not create restore points for the market database and initiated a shutdown of the game server. During the forced maintenance we decided to deactivate the in-game market; the most reliable way to do that was to return empty lists of possible operations to the client.
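A rough sketch of such a coordinator is shown below; the interval, the node API, and the shutdown hook are assumptions, since the post does not describe the real interfaces.
```python
import time

# Hypothetical coordinator: at a fixed interval it asks every database node to create
# a consistent restore point; if a node keeps failing for longer than the guaranteed
# rollback window (10 minutes in the post), the coordinator shuts the game server down,
# because continuing to run would risk rolling characters back further than that.

INTERVAL = 60           # assumed snapshot period, seconds
MAX_ROLLBACK = 10 * 60  # the 10-minute rollback guarantee stated above

def coordinate(nodes, shutdown_server):
    last_success = {node.name: time.time() for node in nodes}
    while True:
        for node in nodes:
            try:
                node.create_restore_point()  # assumed node API
                last_success[node.name] = time.time()
            except Exception as err:
                print(f"restore point failed on {node.name}: {err}")
            if time.time() - last_success[node.name] > MAX_ROLLBACK:
                shutdown_server(reason=f"no restore point on {node.name}")
                return
        time.sleep(INTERVAL)
```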
In the midst of all this, the reason for the abnormal load on some of the character data nodes became clear. Through a fateful combination of circumstances, the new database nodes had been fitted with lower-performance disks. Replacement disks were found, and the swap was scheduled for the routine maintenance on Monday. Our colleagues also managed to prepare a patch fixing the abnormal load on the market, whose rollout was likewise planned for Monday.
After the restart
The situation seemed to stabilize. The load on some database nodes was high but stable; most players experienced it as delays when picking up loot, performing operations with adepts, and advancing through the development graph. The market was switched off entirely, though, and that blocked a tangible part of the user experience. A deeper analysis showed that, in terms of load, sell operations carry far more weight than purchases, so using the same hotswap mechanism we re-enabled buying on the market. The analysis did not let us down: the market began responding within acceptable time.
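Just to illustrate the kind of hotswap toggling described here (the flag names and storage API are assumptions), partially re-enabling the market could look something like this:
```python
# Hypothetical runtime feature flags ("hotswap"): flipping a flag changes server
# behaviour without a restart. A disabled feature simply returns an empty list,
# which is the reliable way of switching the market off for the client mentioned earlier.

MARKET_FLAGS = {
    "buy_enabled": True,    # purchases were re-enabled once analysis showed they are cheap
    "sell_enabled": False,  # sell operations carry much more load and stayed off
}

def list_buy_offers(item_id, storage):
    if not MARKET_FLAGS["buy_enabled"]:
        return []                        # feature off: the client just sees an empty list
    return storage.find_offers(item_id)  # assumed storage API

def list_sell_operations(character_id, storage):
    if not MARKET_FLAGS["sell_enabled"]:
        return []
    return storage.find_sell_operations(character_id)  # assumed storage API
```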
The server was running, and the team was planning a small party at 19:00 to celebrate the OBT launch. And then suspicious messages from the database began slipping into the logs, pointing at integrity violations in the databases on the nodes with the slower disks. A short exchange with the operations team, a check of the replica status, and at 19:00, during the ceremonial part of the party, to shouts of "Hooray!" and splashes of champagne, the suspicious nodes were sent off for maintenance.

Pg_dump on both the replicas and the master produced the same disheartening error message. Everything pointed to the loss of part of the game character data. Plans changed dramatically: the routine maintenance was moved from Monday to the night from Friday to Saturday, a courier urgently set off to deliver the SSDs to the right data center, and we quickly started building a new server version that included all the fixes and optimizations made up to that point. Meanwhile we, going gray, sat there picking apart the failed databases.
The analysis showed that not all of the databases had lost integrity. The operations team reconfigured the server so that it could start without the corrupted databases. A calmer and deeper analysis revealed that the Postgres crash had corrupted only indexes, which, to everyone's relief, were recreated without any problems.
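For illustration only (the post does not show the actual recovery commands), fixing corrupted indexes in PostgreSQL usually comes down to rebuilding them, for example with REINDEX; a minimal sketch using psycopg2, with database names and connection details assumed:
```python
import psycopg2

# Hypothetical sketch: once it is established that only indexes are corrupted,
# rebuild them database by database. REINDEX DATABASE cannot run inside a
# transaction block, hence autocommit.

DATABASES = ["characters_04", "characters_05"]  # assumed database names

for db in DATABASES:
    conn = psycopg2.connect(host="db-04.example", dbname=db, user="postgres")  # assumed DSN
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute(f"REINDEX DATABASE {db}")  # rebuilds every index in the database
    conn.close()
```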
New crashes
At the moment I brought this news to the operations team, they were remotely coordinating the replacement of the slow disks with normal ones. Telemetry showed a failure and a drop in the number of players (yes, the players whose characters lived on the nodes that remained enabled had been playing Skyforge all this time). Human error: at 3 a.m. the administrator on duty pulled the disks not from the 12th unit but from the 11th, and the 11th unit housed a node running the authentication service. Hysterical laughter.
Fortunately, the RAID array the disks were pulled from did not fall apart, and the incident only slightly prolonged the work. By the way, right up until the server was fully stopped, six people were still playing on it: after the authentication service stops, you can keep playing until you move to another map (which, incidentally, is a bug; in future versions the transition will not depend on the authentication service).
Launch at 7 a.m.
In the morning the server was launched with all databases enabled. The results of that night's work: a) the slow SSDs on some nodes were replaced with normal SSDs, and b) the synchronous replicas were temporarily moved to a RAM disk to reduce the load on the SSDs. The load on the nodes storing character data was no longer a concern.
What, again?
At 13:45 on Saturday the load on the nodes holding character data shot up to infinity, and the server was brought down by the restore-point service we already know. The reason: the RAM disks holding the synchronous replicas had filled up, and the nodes blocked on commit. Since by then we already knew there was no urgent need to keep the replicas on RAM disks, during an unscheduled maintenance we moved the synchronous replicas back to the SSDs and restarted the server. On Saturday it also turned out that, in the rush of building the new server version, the market optimization had not made it in, so selling on the market had to be switched off again.

To summarize
The main problem areas that cost us more time than they should have:
- The synthetic load in the market load tests was spread evenly across all market listings, whereas in reality there was serious contention on one of them. This is what led to the abnormally high load on the market service (see the sketch at the end of the post).
- Underestimating the possibility of failures on the hardware and RDBMS side. This came as something of a surprise to us; had we been prepared for it, the unscheduled maintenance could have been avoided entirely.
What, in light of the adventures above, we are proud of:
- The hotswap toolset allowed us to react quickly in a variety of situations, from turning game features on and off to extending diagnostics without stopping the server.
- The server confirmed its stability when individual nodes were disconnected or dropped out.
- Diagnostics and automatic server health reports helped us a great deal in detecting failures and determining their causes.
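As a closing illustration of the first problem in the list above: the difference between a synthetic load spread evenly over market listings and real traffic concentrated on a few hot listings comes down to the sampling distribution used in the test. A minimal sketch, with all numbers assumed:
```python
import random

# Hypothetical comparison of two ways to pick a market listing in a load test.
# The synthetic test sampled listings uniformly; real players hammered a few hot
# listings, creating contention that a uniform test would not reveal.

LISTINGS = list(range(10_000))  # assumed number of market listings
REQUESTS = 100_000              # assumed number of simulated requests

def uniform_pick():
    return random.choice(LISTINGS)  # roughly what an evenly spread synthetic test does

def skewed_pick(alpha=1.2):
    # Zipf-like skew: low ranks (popular listings) are drawn far more often.
    rank = min(int(random.paretovariate(alpha)), len(LISTINGS))
    return LISTINGS[rank - 1]

def hottest_share(pick):
    hits = {}
    for _ in range(REQUESTS):
        item = pick()
        hits[item] = hits.get(item, 0) + 1
    return max(hits.values()) / REQUESTS

print(f"uniform: hottest listing gets {hottest_share(uniform_pick):.2%} of requests")
print(f"skewed:  hottest listing gets {hottest_share(skewed_pick):.2%} of requests")
```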