
Load testing Skyforge. One year later
More than a year has passed since we published our articles on load testing Skyforge, the new MMORPG from the Allods Team studio. Much has changed since then: Habr was redesigned, Ubuntu was updated to 14.04.1 LTS, Java 8 was released and, most importantly, the project has moved to a new stage of development. The first closed tests with external users have taken place, and a stress test is coming soon: we will invite as many "live" users as possible to the server as part of the closed or open beta test (CBT or OBT). But I won't do our marketing team's job for them. Instead, I'll tell you what is new in our load testing, what we have rethought, and what in all this may be useful to a wider audience.


Summary of previous parts
- Skyforge is an MMORPG set in a sci-fantasy world. The game world will be the same for all territories: all players from Russia and the other countries of the former USSR will be able to complete quests together, save the world and become gods. There will be no division by server.
- The Skyforge server is written in Java; its architecture is described in great detail in the corresponding post by randll.
- Databases - PostgreSQL + distributed transactions.
- A bot is a program written in C++ that simulates the actions of a real player. Bots use the same protocol and the same set of commands as a regular game client, and from the server's point of view they differ little from one.
- Load testing - a set of measures aimed at obtaining information about whether the server is able to hold the load. We run various types of load tests several times a day. The average test lasts 40 minutes, while the net test time is in the range from 60 to 80 minutes.
More stress tests
For quite a long time, "client" load tests were the only load tests we ran. But time passed, ambitions grew, needs changed, and tasks appeared that required more load than we could generate with client bots. The limit came mainly from the fact that client bots did a great deal of "extra" work: they made decisions, honestly checked conditions and, in the end, actually played. So server bots began to appear: written in Java, devoid of any logic, and built purely to generate load. We now have three types of such "bots":
- database bots - blindly send database operations, using the operation profile of real players from the closed tests as the source, with random data;
- chat bots - do the same as the database bots, only for the chat services;
- statistics generators - exactly the same idea as in the two previous cases, but for the statistics subsystem.
These tests have proved themselves very well as load tests, and we did not expect more from them. They cannot find errors beyond a simple "does not work", but their results are highly repeatable and they are much cheaper to develop and, as a result, to maintain. As for savings in hardware, it works out roughly like this:
- to test 10k CCU with client bots we need 7 (systems under load) + 10 (bots) = 17 servers in total;
- to test a database server at 50k CCU: 4 + 2 = 6 servers;
- a chat at 100k CCU: 4 + 2 = 6 servers;
- a statistics system at 100k CCU: 2 + 1 = 3 servers.
This is mainly because the further we move from the production configuration, the more we can afford. For example, the test of the statistics system contains, in principle, not a single component of the game itself, only the applications that process the data. In the chat and database tests we deliberately do not load the game mechanics: the game realm runs in its minimal launch configuration, and only the system under load runs in full production mode. It is also worth noting that the fewer subsystems take part in a test, the more stable the test is.
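To make the comparison easier to read, the server counts listed above can be normalized to "servers per 10k CCU". The sketch below only re-does that arithmetic in Python; the dictionary layout and function names are illustrative and are not part of our tooling.

```python
# Server counts quoted in the list above:
# (target CCU, servers under load, servers running bots).
SETUPS = {
    "client bots": (10_000, 7, 10),
    "database": (50_000, 4, 2),
    "chat": (100_000, 4, 2),
    "statistics": (100_000, 2, 1),
}


def servers_per_10k_ccu(ccu, targets, bots):
    """Total test servers needed per 10k simulated concurrent users."""
    return (targets + bots) * 10_000 / ccu


def hardware_cost_table(setups=SETUPS):
    """Map each test type to its normalized hardware cost."""
    return {name: servers_per_10k_ccu(*row) for name, row in setups.items()}


if __name__ == "__main__":
    for name, cost in hardware_cost_table().items():
        print(f"{name:12s} {cost:5.2f} servers per 10k CCU")
```

Normalized this way, the client-bot test costs 17 servers per 10k CCU, while the statistics test costs 0.3.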
Client bots
But however good the server bots are, we are not going to give up client bots: they bring far more benefit, and their load profile is as close as possible to the real one. So over the past year they have also improved significantly. They can now get through a significant part of the game's content almost completely honestly, while needing minimal maintenance. It looks something like this: the bot appears on the map, looks at its quest tracker, sees an instruction to run to point A, and runs. Because the bot has been taught to interact with the world around it, at point A it will try to talk to someone, interact with something or kill all the aggressors. Almost like in that old joke: can it eat me? Can I eat it? Can I mate with it? Can it mate with me? :)
The optimization of the game client itself also benefited our bots: their memory consumption dropped significantly, and we can now run twice as many bots on one physical machine, 2k instead of 1k.
We currently run client tests in three scenarios: everyone goes through the start of the game (the most important moment for us in terms of load), everyone plays in some way (the distribution of players across activities is made up off the top of our heads), and everyone plays on one specific map. This lets us find maps that are bad in terms of load and intervene quickly while they are still being built, see what load profile we have in quiet times, and be sure that everything is fine with the start of the game.
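For the "everyone plays in some way" scenario, a fixed number of bots has to be divided across activities according to some assumed player profile. Below is a minimal Python sketch of one way to do that split. The activity names and weights are invented for illustration, and the largest-remainder rounding is our choice for the example, not a description of how the real bots are configured.

```python
def split_bots(total_bots, profile):
    """Split total_bots across activities in proportion to profile weights.

    Uses the largest-remainder method so the parts always sum to total_bots.
    """
    weight_sum = sum(profile.values())
    exact = {name: total_bots * w / weight_sum for name, w in profile.items()}
    counts = {name: int(share) for name, share in exact.items()}
    leftover = total_bots - sum(counts.values())
    # Hand the remaining bots to the activities with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical profile: percentage of players in each activity.
    profile = {"open_world": 50, "group_dungeon": 30, "pvp": 15, "idle_in_hub": 5}
    print(split_bots(2000, profile))
```

Keeping the profile as plain data makes it easy to replace the guessed weights with measured ones once real player statistics are available.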

Without these tools, load tests would be 10 times duller
This is perhaps the most useful part of the article. When running load tests, it is not enough to know whether the server holds the load. The most important thing is to be able to understand quickly what exactly is going wrong. Java Mission Control and its Flight Recorder feature are invaluable here. Unfortunately, this option is quite expensive ($) on production servers, so we use it only in tests. It looks something like this:
-XX:+UnlockCommercialFeatures # enable JMC support
-XX:+FlightRecorder # enable deferred profile recording
-XX:StartFlightRecording=name=skyforge,filename=skyforge.jfr,delay=40m,duration=10m,settings=jmc.jfc
You can read more on the Oracle website.
You can then open this dump in JMC. It provides all the necessary information: allocation statistics, what consumed processor time, the process's contribution to the server's overall CPU load, and much more. JMC is good, but since we cannot afford it on production servers, there we use the old-fashioned method, GC logs. From them we get the time spent in GC per minute, the total application stop time over the same period, and which objects were in the heap before and after a Full GC:
-XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintClassHistogramBeforeFullGC -XX:+PrintClassHistogramAfterFullGC -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintPromotionFailure -Xloggc:memory/gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=memory/heap.dump
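As an illustration of how a figure such as "application stop time per minute" can be pulled out of such a log, here is a minimal Python sketch. It is not our actual report code. It assumes that each line written by -XX:+PrintGCApplicationStoppedTime starts with the date stamp added by -XX:+PrintGCDateStamps, as in the sample shown in the comment; check this against your own gc.log, since the exact line format varies between JVM builds.

```python
import re
import sys
from collections import defaultdict

# Assumed line shape (Java 8, date stamps enabled):
# 2015-03-02T14:05:17.123+0300: 12.345: Total time for which application
#   threads were stopped: 0.0123450 seconds
STOPPED_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\.\d+[+-]\d{4}: .*"
    r"Total time for which application threads were stopped: ([\d.]+) seconds"
)


def stopped_seconds_per_minute(lines):
    """Sum application stop time per wall-clock minute."""
    per_minute = defaultdict(float)
    for line in lines:
        match = STOPPED_RE.match(line)
        if match:
            minute, seconds = match.groups()
            per_minute[minute] += float(seconds)
    return dict(per_minute)


if __name__ == "__main__":
    # Default path matches the -Xloggc value above.
    path = sys.argv[1] if len(sys.argv) > 1 else "memory/gc.log"
    with open(path, encoding="utf-8", errors="replace") as log:
        for minute, seconds in sorted(stopped_seconds_per_minute(log).items()):
            print(f"{minute}  stopped {seconds:.3f} s")
```

Lines that do not match the pattern, such as the GC details and class histograms, are simply skipped.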
[Figure: example graph]
[Figure: example statistics before and after]
Just in case, we start all servers with remote debugging enabled. This saves a lot of time when something goes wrong and the exact cause of the problem is not clear from the logs:


-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=51003
Our own statistics
In addition to ready-made profiling tools, we have actively developed our own. For example, we log every spell a player casts and measure how much processor time was spent on it. This lets us decide which abilities and mechanics to optimize first.
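The idea of per-spell accounting can be sketched in a few lines. The example below is in Python for brevity, although the real server is written in Java. The class name, its methods and the spell names are all hypothetical. It uses time.thread_time() to measure the CPU time of the current thread, on the assumption that a spell is processed on a single thread.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class CpuTimeStats:
    """Accumulates CPU seconds and call counts per named operation."""

    def __init__(self):
        self.total = defaultdict(float)
        self.calls = defaultdict(int)

    def record(self, name, cpu_seconds):
        self.total[name] += cpu_seconds
        self.calls[name] += 1

    @contextmanager
    def measure(self, name):
        # thread_time() counts CPU time of this thread only, not wall-clock time.
        started = time.thread_time()
        try:
            yield
        finally:
            self.record(name, time.thread_time() - started)

    def most_expensive(self, limit=10):
        """Operations sorted by total CPU time, most expensive first."""
        ranked = sorted(self.total.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:limit]


if __name__ == "__main__":
    stats = CpuTimeStats()
    with stats.measure("fireball"):  # hypothetical spell name
        sum(i * i for i in range(100_000))
    print(stats.most_expensive())
```

Sorting by total rather than average time surfaces cheap but very frequent abilities, which are often the best candidates for optimization.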
We keep similar statistics for database operations. We know not only which operations were performed:
[Figure: database operations performed]
but also how long they took:
[Figure: execution time of database operations]
To optimize traffic we also need something to base our decisions on, so we measure exactly which messages were sent, counting both their number and their volume.
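Counting messages by both number and volume matters because the most frequent message type is not necessarily the one that consumes the most bandwidth. A minimal Python sketch of such accounting follows. The class, the on_send hook and the message type names are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter


class TrafficStats:
    """Tracks how many messages of each type were sent and their total size."""

    def __init__(self):
        self.count = Counter()
        self.bytes = Counter()

    def on_send(self, message_type, payload):
        # payload is the serialized message as bytes.
        self.count[message_type] += 1
        self.bytes[message_type] += len(payload)

    def report(self):
        """Rows of (type, messages, total bytes), heaviest first."""
        return [(t, self.count[t], size) for t, size in self.bytes.most_common()]


if __name__ == "__main__":
    stats = TrafficStats()
    for _ in range(3):
        stats.on_send("Move", b"x" * 10)       # frequent but small
    stats.on_send("Inventory", b"y" * 500)     # rare but large
    for row in stats.report():
        print(row)
```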



Optimizing test report generation
As the number of tests and graphs grew, it became clear that preparing a test, running it and analyzing it in a single process was an unaffordable luxury. So the analysis of test results and the building of reports were moved to a separate service not tied to the CI system. This freed up time for running additional tests.
Moving report building into a separate service also gave us a single entry point for viewing data from load tests, production servers and other test benches.
Rakes we stepped on
During tests it is very important to control the infrastructure they run on. I mentioned in previous articles that we had problems with CPU frequency governors, when the processor clock frequency was artificially lowered to save energy. Well, we fell for it again. Now we are thinking about how to build a check of these settings into the server. In the database services, for example, we added a check that a synchronous replica is configured for the databases, because a replica that is suddenly "switched off" gives a noticeable performance boost. In general, I advise adding environment checks directly to the services themselves. This ensures that your servers are operated and tested in exactly the environment they were designed for.
Conclusions
First of all, I want to note that load testing, like any other means of improving software quality, brings the most benefit only when it is used constantly. Yes, maintaining the tests takes effort, but it is worth it. It is better to spend that effort in a calm environment than in firefighting mode.
Secondly, if you have a large and complex distributed system, then in addition to integration load tests it may also make sense to load-test individual components. This is generally cheaper, and such tests can be made more flexible.
And thirdly, load tests are also useful in that a significant part of the tooling built for them can work very well in production too.
That's all. As always, I will be happy to answer questions in the comments.