Highly loaded systems: solving the main problems

Hello, Habr!

Today I want to talk about solutions to some of the problems that arise when running highly loaded systems. Everything discussed in this article has been tested in practice: I am the Social Games Server Team Lead at Plarium, which develops social, mobile, and browser games.

First, some statistics. Plarium has been developing games since 2009. At the moment our projects run on all of the most popular social networks (Vkontakte, My World, Odnoklassniki, Facebook), and several games are integrated into major gaming portals such as games.mail.ru and Kabam. There are also separate browser and mobile (iOS) versions of the "Rules of War" strategy game. Our databases hold more than 80 million users (5 games, localized into 7 languages, 3 million unique players per day); as a result, our servers handle an average of about 6,500 requests per second, or 561 million requests per day.

As the hardware platform for our production servers, we mostly use machines with two server CPUs of 4 cores each (x2 with Hyper-Threading), 32-64 GB of RAM, and 1-2 TB of HDD. The servers run Windows Server 2008 R2. Content is distributed via a CDN with bandwidth of up to 5 Gbps.
Development is done on the .NET Framework 4.5, in C#.

Given the specifics of our work, we must not only keep the systems running stably, but also make sure they can withstand large spikes in load. Alas, many popular approaches and technologies do not survive the test of high load and quickly become a bottleneck in the system. So, having analyzed plenty of problems, we found the ways of solving them that were optimal for us. I will tell you which technologies we chose, which ones we abandoned, and explain why.

NoSQL vs. Relational

In this battle, pure NoSQL proved to be a weak fighter: the solutions that existed at the time did not support sane data consistency and were not resilient enough to crashes, which made itself felt in practice. So in the end the choice fell on relational DBMSs, which let us use transactions where necessary; still, in general we use NoSQL as the main approach. In particular, tables often have a very simple key-value structure, where the data is represented as JSON stored in packed form in a BLOB column. As a result, the schema stays simple and stable, while the structure of the data field can be easily extended and changed. Oddly enough, this gives a very good result: in our solution we have combined the advantages of both worlds.
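As a rough sketch of the key-value idea (the helper and its names are invented for illustration, not taken from the Plarium codebase): the table stays a simple (Id, Data BLOB) pair, and the JSON payload is gzip-packed before going into the BLOB column.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper: packs a JSON document for storage in a BLOB column
// and unpacks it on the way out. The relational schema never changes when
// the JSON structure evolves.
public static class PackedJson
{
    public static byte[] Pack(string json)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                byte[] raw = Encoding.UTF8.GetBytes(json);
                gzip.Write(raw, 0, raw.Length);
            }
            return output.ToArray();
        }
    }

    public static string Unpack(byte[] blob)
    {
        using (var input = new MemoryStream(blob))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```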


ORM vs. Pure ADO.NET

Given that pure ADO.NET has minimal overhead, and that all queries are written by hand and are therefore familiar and warm the soul, it sends any ORM into a deep knockout. Object-relational mapping in our case has a number of downsides, such as low performance and weak (or absent) control over queries. With many ORM solutions you end up fighting the library for a long time and lose the main thing: speed. And when it comes to some tricky flag for correctly handling client library timeouts, or something similar, trying to imagine setting such a flag through an ORM is frustrating altogether.

Distributed transactions vs. Own eventual consistency

The main purpose of transactions is to ensure data consistency once an operation completes: changes are either saved in full or rolled back entirely if something went wrong. With a single database we were glad to use this undoubtedly important mechanism, but under high load distributed transactions did not show themselves at their best. Using them means increased waiting times and more complicated logic (here we recall the need to update caches in the memory of application instances, the possibility of deadlocks, and low resilience to physical failures).
As a result, we developed our own mechanism for eventual consistency, built on message queues. It gave us scalability, resilience to failures, an acceptable time to reach consistency, and no deadlocks.
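The core idea can be sketched in a deliberately simplified, single-threaded form (the class and its names are invented for illustration): instead of a distributed transaction, the change is committed locally and a message describing it is put on a queue; a worker later applies the message to the other side, re-enqueueing failed deliveries until both sides converge.

```csharp
using System;
using System.Collections.Generic;

// Toy model of eventual consistency via an outbox queue: DbA is the locally
// committed store, DbB converges to it as queued messages are delivered.
public class OutboxReplicator
{
    public readonly Dictionary<string, int> DbA = new Dictionary<string, int>();
    public readonly Dictionary<string, int> DbB = new Dictionary<string, int>();
    private readonly Queue<KeyValuePair<string, int>> _queue =
        new Queue<KeyValuePair<string, int>>();

    public void Update(string key, int value)
    {
        DbA[key] = value;                                        // local commit
        _queue.Enqueue(new KeyValuePair<string, int>(key, value)); // outbox message
    }

    // One delivery pass: a failed delivery is re-enqueued instead of rolling
    // anything back, so consistency arrives eventually rather than instantly.
    public void ProcessQueue(Func<bool> deliveryOk)
    {
        int pending = _queue.Count;
        for (int i = 0; i < pending; i++)
        {
            var msg = _queue.Dequeue();
            if (deliveryOk())
                DbB[msg.Key] = msg.Value;
            else
                _queue.Enqueue(msg); // retry on the next pass
        }
    }
}
```

A real implementation would of course persist the queue and run the worker in the background; the point is that no participant ever waits on a cross-database lock.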

SOAP, WCF, etc. vs. JSON over HTTP

The off-the-shelf SOAP-style solutions (standard .NET web services, WCF, Web API, etc.) turned out to be insufficiently flexible: they were awkward to configure, hard to support from the various client technologies, and introduced an extra infrastructure intermediary. We chose to send JSON over HTTP, not only because of its maximum simplicity, but also because with such a protocol it is very easy to diagnose and fix problems. This simple combination also has the widest coverage across client technologies.
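To illustrate just how little machinery "JSON over HTTP" needs (the endpoint, port, and payload below are invented for the example), the whole protocol is an HTTP POST whose body is a JSON string, which any client stack and any traffic sniffer can speak and read:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Threading.Tasks;

class JsonOverHttpDemo
{
    static void Main()
    {
        var listener = new HttpListener();
        listener.Prefixes.Add("http://localhost:18080/api/");
        listener.Start();

        // Server side: handle one request and reply with a JSON body.
        Task serverTask = Task.Run(() =>
        {
            HttpListenerContext ctx = listener.GetContext();
            string request;
            using (var reader = new StreamReader(ctx.Request.InputStream, Encoding.UTF8))
                request = reader.ReadToEnd();

            byte[] reply = Encoding.UTF8.GetBytes("{\"ok\":true,\"echo\":" + request + "}");
            ctx.Response.ContentType = "application/json";
            ctx.Response.OutputStream.Write(reply, 0, reply.Length);
            ctx.Response.Close();
        });

        // Client side: a plain POST with a JSON body, nothing more.
        using (var client = new WebClient())
        {
            client.Headers[HttpRequestHeader.ContentType] = "application/json";
            string response = client.UploadString("http://localhost:18080/api/", "{\"cmd\":\"ping\"}");
            Console.WriteLine(response);
        }

        serverTask.Wait();
        listener.Stop();
    }
}
```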

MVC.NET, Spring.NET vs. Naked ASP.NET

Based on my experience, I can say that MVC.NET, Spring.NET, and similar frameworks introduce unnecessary intermediate layers that get in the way of squeezing out maximum performance. Our solution is built on the most basic features of ASP.NET: in essence, the entry point is a handful of generic handlers. We do not use a single standard module, and there is not a single active ASP.NET session in the application. Everything is clear and simple.

A little about reinventing the wheel

If none of the existing solutions fits, you have to become a bit of an inventor and look for answers all over again. And even if you sometimes end up reinventing the wheel, when that wheel is critical to the project, it is worth it.
JSON serialization

A little more than a third of our CPU time is spent on serializing and deserializing large amounts of data to and from JSON, so the efficiency of this task matters a great deal for the performance of the system as a whole.
Initially we used Newtonsoft Json.NET, but at a certain point we concluded that its speed was not enough, and that we could implement the subset of functionality we actually need in a faster way, without having to support the many deserialization options and "wonderful" extras such as JSON schema validation, deserialization into JObject, and so on.

So we wrote our own serializer, tailored to the specifics of our data. In our tests the resulting solution turned out to be 10 times faster than Json.NET and 3 times faster than fastJSON.
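A toy example of why a schema-specific serializer can beat a general-purpose one (the type and field names are invented; this is not the actual Plarium serializer): when the shape of the object is known in advance, serialization is straight string building and parsing is a linear scan, with no reflection and no intermediate token tree.

```csharp
using System;

// Hypothetical game object with a fixed, known schema.
public class PlayerState
{
    public int Id;
    public int Gold;

    // Field order is fixed by the schema, so we emit the JSON directly.
    public string ToJson()
    {
        return "{\"Id\":" + Id + ",\"Gold\":" + Gold + "}";
    }

    // Because the layout is known, parsing is just extracting the two
    // numbers between known markers; a real implementation would also
    // validate the input instead of trusting it.
    public static PlayerState FromJson(string json)
    {
        var state = new PlayerState();
        int i = json.IndexOf("\"Id\":") + 5;
        int j = json.IndexOf(',', i);
        state.Id = int.Parse(json.Substring(i, j - i));
        i = json.IndexOf("\"Gold\":") + 7;
        j = json.IndexOf('}', i);
        state.Gold = int.Parse(json.Substring(i, j - i));
        return state;
    }
}
```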

Compatibility with pre-existing data serialized with Newtonsoft was critical for us. Before putting our serializer into production, we tested it on several large databases: we read the data in JSON format, deserialized it with our library, serialized it again, and checked the source and resulting JSON for equality.


Our approach to organizing data had a negative side effect: the large object heap (LOH) grew far too big. For comparison, it averaged about 8 gigabytes, against 400-500 megabytes of generation 2 objects. We solved this by splitting large blocks of data into smaller ones, using a pool of pre-allocated blocks. That scheme significantly reduced the number of large objects, and garbage collection started to happen less often and finish faster. Users are happy, and that is what matters.
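The chunking idea can be sketched as follows (a minimal illustration, not the production pool): arrays of 85,000 bytes or more land on the large object heap, so a large payload is written into a list of small pooled chunks instead of one big byte array.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical pool of small, reusable buffers. No single allocation is
// large enough to reach the large object heap.
public class ChunkPool
{
    public const int ChunkSize = 64 * 1024; // well below the 85,000-byte LOH threshold
    private readonly Stack<byte[]> _free = new Stack<byte[]>();

    public byte[] Rent()
    {
        return _free.Count > 0 ? _free.Pop() : new byte[ChunkSize];
    }

    public void Return(byte[] chunk)
    {
        _free.Push(chunk); // reuse instead of re-allocating
    }

    // Split a large payload across pooled chunks.
    public List<byte[]> Write(byte[] payload)
    {
        var chunks = new List<byte[]>();
        for (int offset = 0; offset < payload.Length; offset += ChunkSize)
        {
            byte[] chunk = Rent();
            int count = Math.Min(ChunkSize, payload.Length - offset);
            Array.Copy(payload, offset, chunk, 0, count);
            chunks.Add(chunk);
        }
        return chunks;
    }
}
```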

When working with memory, we use several caches of different sizes with different aging and update policies; some of them are extremely simple in design, without any frills. As a result, the hit rate of every cache is no lower than 90-95%.
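A deliberately frill-free cache of this kind might look like the sketch below (the class is invented for illustration): entries age out after a fixed TTL, and hit/miss counters make the hit rate visible. The clock is passed in explicitly so the behavior is deterministic and easy to test.

```csharp
using System;
using System.Collections.Generic;

// Minimal TTL cache with hit-rate accounting.
public class SimpleCache<TKey, TValue>
{
    private readonly Dictionary<TKey, Tuple<TValue, DateTime>> _items =
        new Dictionary<TKey, Tuple<TValue, DateTime>>();
    private readonly TimeSpan _ttl;
    public long Hits, Misses;

    public SimpleCache(TimeSpan ttl) { _ttl = ttl; }

    public TValue GetOrAdd(TKey key, Func<TValue> factory, DateTime now)
    {
        Tuple<TValue, DateTime> entry;
        if (_items.TryGetValue(key, out entry) && now - entry.Item2 < _ttl)
        {
            Hits++;                 // fresh entry: serve from cache
            return entry.Item1;
        }
        Misses++;                   // absent or aged out: rebuild
        TValue value = factory();
        _items[key] = Tuple.Create(value, now);
        return value;
    }

    public double HitRate
    {
        get { return Hits + Misses == 0 ? 0 : (double)Hits / (Hits + Misses); }
    }
}
```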

Having tried Memcached, we decided to shelve it for the future, since there is no need for it yet. The results were not bad overall, but the total performance gain did not outweigh the cost of the extra serialization/deserialization when putting data into the cache.

Additional tools

• Profiler
Familiar profilers significantly slow an application down once attached to it, effectively making it impossible to profile an application under real load. So we use our own system of performance counters:
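As a rough sketch of such an always-on counter system (the class and method names are invented for illustration), an operation is wrapped in a named scope, and the call count and total elapsed time are accumulated in memory so a report can be pulled from the server at any moment:

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading;

// Minimal always-on counters: each named counter accumulates
// [0] call count and [1] total Stopwatch ticks.
public static class Counters
{
    private static readonly ConcurrentDictionary<string, long[]> Stats =
        new ConcurrentDictionary<string, long[]>();

    public static T Measure<T>(string name, Func<T> operation)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            return operation();
        }
        finally
        {
            sw.Stop();
            long[] s = Stats.GetOrAdd(name, _ => new long[2]);
            Interlocked.Increment(ref s[0]);          // calls
            Interlocked.Add(ref s[1], sw.ElapsedTicks); // total time
        }
    }

    public static long Calls(string name)
    {
        long[] s;
        return Stats.TryGetValue(name, out s) ? s[0] : 0;
    }
}
```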

We wrap the basic operations in named counters. Statistics are accumulated in memory and collected from the servers along with other useful information. Counter hierarchies are supported for analyzing call chains, so a summary report can be pulled from any server.

Among the positive aspects:

- the counters are always on;
- minimal overhead (less than 0.5% of CPU usage);
- a simple and flexible way of specifying the areas to be profiled;
- automatic generation of counters for entry points (network requests, methods);
- the ability to view and aggregate on a parent-child basis;
- the ability not only to look at real-time data, but also to store counter measurements over time for later viewing and analysis.

• Logging.
Often logs are the only way to diagnose errors. We use two formats, human-readable and JSON, and we write down everything we can as long as there is enough disk space. Logs are collected from the servers and used for analysis. Everything is built on top of log4net, so nothing extra is used and the solutions are as simple as possible.

• Administration
In addition to the rich graphical web interface of our admin panel, we developed a web console where commands can be added directly on the game server, without any changes to the admin panel. The console makes it very simple and quick to add a new command for diagnostics, for pulling technical data online, or for tuning the system without a restart.
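A hypothetical sketch of how such a console stays extensible (the class and command below are invented for the example): commands are plain named delegates registered on the game server itself, so adding a new diagnostic command is a single line of registration.

```csharp
using System;
using System.Collections.Generic;

// Minimal command registry: name -> handler taking the split input line.
public class WebConsole
{
    private readonly Dictionary<string, Func<string[], string>> _commands =
        new Dictionary<string, Func<string[], string>>(StringComparer.OrdinalIgnoreCase);

    public void Register(string name, Func<string[], string> handler)
    {
        _commands[name] = handler;
    }

    public string Execute(string input)
    {
        string[] parts = input.Split(' ');
        Func<string[], string> handler;
        if (!_commands.TryGetValue(parts[0], out handler))
            return "unknown command: " + parts[0];
        return handler(parts);
    }
}
```

A new diagnostic command then registers itself on the server, e.g. `console.Register("cachestats", args => ...)`, with no deployment of the admin panel.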

• Deployment
As the number of servers grew, deploying anything by hand became impossible. So, in just one week of one programmer's work, we built the simplest possible system for automated server updates. The deployment scripts are written in C#, which lets us modify and maintain the deployment logic flexibly enough. The result is a very reliable and simple tool that, in a critical situation, can update all production servers (about 50) in a few minutes.


Conclusions

To achieve speed under high server load, use a simpler and thinner technology stack, and make sure every tool is familiar and predictable. Designs should be simple yet sufficient for the current problems, with a margin of safety. It is optimal to rely on horizontal scaling, caching, and performance control. Logging and monitoring of system state are a must-have for keeping any serious project alive. And a deployment system will greatly simplify your life and save you time and nerves.
