Building a cache with efficient multi-threaded access



    Hello! My name is Pasha Matlashov. I am the Director of the Game Server Development Department at Plarium, a game development company.

    Today, using our projects as an example, I will talk about caching: its quirks, its pitfalls, and how to work around them.

    Background


    Our servers process 8,500 requests per second and store information about 200 million users. To scale linearly, we use user segmentation (also known as sharding). New users are distributed by a simple balancing rule: a new user goes to the least “populated” segment. This segmentation is static, which keeps development simple. In addition, to simplify the architecture, there is a rule that user data may only be edited on the segment where it is stored; this removes the need to worry about things like distributed locks and transactions. Within the server, a user is referred to as a game entity.

    Data lock


    To coordinate threads and edit data correctly, you can follow either the optimistic or the pessimistic locking approach.

    Optimistic locking does not guarantee that parallel entities / threads / requests will not modify the data before the transaction completes. To keep the number of reads to a minimum, before editing we simply remember the version of the game entity, and when the changed data is saved to the database we also pass the old version value, so the save succeeds only if the version has not changed in the meantime.

    This approach has a downside: under heavy concurrent activity, the same game entity may have to be re-read and reprocessed repeatedly.

    Imagine that a user buys a building and a battle finishes at the same moment. Both events modify the same game entity. It may happen that the battle result is being calculated while the building-purchase logic is applied: the purchase is saved to the database and the user's revision changes, so the battle transaction has to throw away a large amount of work. Keep in mind that saving a game entity to the database also involves serialization and data compression. The data reaches the database, but the revision no longer matches; the entity has to be read from the database again, the transaction logic replayed, the data serialized and compressed once more, and another attempt made to commit the new version in the hope that nothing changes in the meantime. This approach complicates the code and burns extra resources.
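    To make the cost concrete, here is a minimal sketch of such an optimistic retry loop. The names User, _database, SerializeAndCompress and TrySave are hypothetical helpers used only for illustration, not our actual data-access API.

    // Optimistic locking: remember the version, replay the whole transaction on a mismatch.
    public void ExecuteWithOptimisticLock(long userId, Action<User> applyTransaction)
    {
            while (true)
            {
                    User user = _database.Load(userId);            // read the entity and remember its version
                    long expectedVersion = user.Version;

                    applyTransaction(user);                        // run the game logic
                    user.Version = expectedVersion + 1;

                    byte[] payload = SerializeAndCompress(user);   // serialization + compression on every attempt

                    // The save succeeds only while the stored version still equals expectedVersion.
                    if (_database.TrySave(userId, payload, expectedVersion))
                            return;

                    // Someone modified the user in parallel: discard the work and retry.
            }
    }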

    We chose pessimistic locking instead: before any action the data is locked, which guarantees that only one request edits it at a time.

    Choosing a data lock location


    Constantly going to the database where the data is stored is expensive and slow. It is much easier to lock in the memory of the worker process than in the database, but certain guarantees must hold:

    • All requests for a user's data go only to the server (segment) that owns that user.
    • Developers make sure the data is edited only on the segment where the user is stored.
    • Only one copy of the user exists in memory.

    As a result, we arrive at a user cache, which makes working with the data more efficient. The cache guarantees that there is a single copy of each entity and provides a place to lock on. In this scenario, the cache works on a write-through basis.

    Working with a game entity through the cache proceeds in stages (a minimal sketch of the flow follows the list):

    • Read the user from the database into the cache.
    • Gain exclusive access to the data by taking its lock.
    • Edit the user in the cache.
    • Save the user to the database. The up-to-date user data remains in the cache.
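    Roughly, a request then looks like this. GetOrLoad, GetLock and _database.Save are hypothetical helper names used only to illustrate the flow:

    // Read the user through the cache, lock the single in-memory copy, edit, write through.
    public void ExecuteUserTransaction(long userId, Action<TUser> edit)
    {
            TUser user = _userCache.GetOrLoad(userId);    // read from the database into the cache if needed
            lock (_userCache.GetLock(userId))             // exclusive access to the only copy of the user
            {
                    edit(user);                           // apply the game logic to the cached copy
                    _database.Save(user);                 // write-through: the fresh data stays in the cache
            }
    }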

    “What if you put the cache in another process, for example in Memcached?” you may ask :) In that case you have to account for the additional cost of serialization and deserialization when talking to an external cache, plus the cost of the network round trip.

    With multithreading and a large number of processes, a situation can theoretically arise in which a copy of the user exists in two cache processes at the same time. It is caused by an interesting feature of ASP.NET: when the assemblies of a deployed site change, for some time there are two copies of the site. The first keeps serving the old requests with the old code, while the second is a new process with new memory that serves new requests. Technically, a game entity can be loaded into the new site before all the requests on the old one have finished. Each copy of the site believes it is the only one holding game entities, and that is how data gets lost.

    We ran into exactly this problem when restarting the server and rolling out a new version: typically two or three users “lost” troops. To solve it, we prohibited two instances of the application from running at the same time at the code level.

    To avoid deadlocks between threads, we follow a number of rules (a minimal example of the first one follows the list):

    • Take every lock with a timeout.
    • Do not hold more than one lock at a time.
    • When several entities need to change, do it through messages in a queue.
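    As an illustration of the first rule, a lock can be taken roughly like this; userLockObject and the timeout value are illustrative:

    // Never wait on a lock forever: fail the request instead of deadlocking.
    if (!Monitor.TryEnter(userLockObject, TimeSpan.FromSeconds(10)))
            throw new TimeoutException("Could not take the user lock in time.");
    try
    {
            // edit the game entity here
    }
    finally
    {
            Monitor.Exit(userLockObject);
    }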

    The first solution


    The first thing that comes to mind for organizing the cache is a plain Dictionary keyed by user id. But this solution has a drawback: Dictionary works well in single-threaded applications, while in a multi-threaded environment it is not thread-safe and needs additional protection.

    The Dictionary can be protected with a single exclusive lock, under which we check whether the required user is already in the collection. If not, the game entity is loaded and returned to the calling code; to get consistent access, the calling code then locks on the user. As a small optimization you can use ReaderWriterLock (or any similar primitive), as in the sketch below.
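    A minimal sketch of that idea with ReaderWriterLockSlim; LoadUserFromDatabase is a hypothetical loader used only for illustration:

    private readonly ReaderWriterLockSlim _rwLock = new ReaderWriterLockSlim();
    private readonly Dictionary<long, TUser> _users = new Dictionary<long, TUser>();

    public TUser GetOrLoad(long userId)
    {
            _rwLock.EnterReadLock();                   // many requests may look users up in parallel
            try
            {
                    TUser user;
                    if (_users.TryGetValue(userId, out user))
                            return user;
            }
            finally
            {
                    _rwLock.ExitReadLock();
            }

            _rwLock.EnterWriteLock();                  // cache miss: take the exclusive lock to add
            try
            {
                    TUser user;
                    if (!_users.TryGetValue(userId, out user))   // another thread may have loaded it already
                    {
                            user = LoadUserFromDatabase(userId);
                            _users.Add(userId, user);
                    }
                    return user;
            }
            finally
            {
                    _rwLock.ExitWriteLock();
            }
    }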

    But everything is complicated by the fact that sooner or later any cache reaches its size limit and has to evict data. What happens in that case?

    Let's simulate the situation with the solution described above. The cache contains a game entity. A request arrives, takes the entity's lock and starts editing it. At the same time, the game entity is evicted from the cache (for example, the eviction thread has already marked this data as least recently used). Then another request for the same game entity arrives, but it is no longer in the cache, so the request reads the data from the database, puts it into the cache, takes the lock and also starts editing. As a result we get two copies of the game entity, which violates our uniqueness guarantee (Figure 1). Data is lost, and whoever saves the game entity last wins.


    Figure 1

    Obviously, eviction has to respect a thread that is already editing the game entity. You could lock every entity that is about to be evicted, but that is laborious and inefficient: if the cache holds 20,000 game entities and 10% of them have to be evicted, that is 2,000 lock acquisitions, and if some of those locks are busy you also have to wait for them to be released.

    Pin and Unpin operations


    We decided to implement Pin and UnPin operations. Instead of storing the entity itself as the value in the Dictionary, we store a CacheItem that holds the game entity, its lock object and a usage counter. The counter is what controls access during eviction: when a context is created, the counter on the CacheItem is incremented with an Interlocked operation (Pin), and when the context is released it is decremented (UnPin). We added a Shrink operation that evicts objects with a zero counter, ordered by the time of last use. The whole Dictionary is still protected by one global lock, without taking the individual lock of each entity; Pin executes under that same lock so that fetching the entity and pinning it remains atomic.

    // User Cache
    private readonly object _lock = new object();
    private Dictionary<long, UserCacheItem<TUser>> _dict = new Dictionary<long, UserCacheItem<TUser>>();

    public bool TryGet(long userId, out UserCacheItem<TUser> userCacheItem, bool pinCacheItem)
    {
            lock (_lock)
            {
                    if (!_dict.TryGetValue(userId, out userCacheItem))
                            return false;
                    userCacheItem.UpdateLastAccessTime();
                    if (pinCacheItem)
                            userCacheItem.PinInCache();
                    return true;
            }
    }
    

    // Shrink
    private void ShrinkOldest(int shrinkNumber) // shrinkNumber - the number of CacheItems to evict
    {
            lock (_lock)
            {
                    var orderedByAccess = _dict
                            .OrderBy(pair => pair.Value.LastAccessTime)
                            .Where(pair => !pair.Value.IsPinnedInCache());
                    int i = 0;
                    foreach (var pair in orderedByAccess)
                    {
                            if (++i > shrinkNumber)
                                    break;
                            _dict.Remove(pair.Key); // safe: OrderBy has already buffered the sequence
                    }
            }
    }
    

    // UserCacheItem
    public class UserCacheItem<TUser>
            where TUser : class, IUser, new()
    {
            private int _pinnedInCache; // a non-zero counter protects the item from removal during Shrink
            public object Locker = new object(); // taken before editing the user
            public long UserId;
            public TUser User;
            public DateTime LastAccessTime;

            public UserCacheItem(TUser user)
            {
                    UserId = user.Id;
                    User = user;
                    UpdateLastAccessTime();
            }

            public void UpdateLastAccessTime()
            {
                    LastAccessTime = DateTime.UtcNow;
            }

            public void PinInCache()
            {
                    Interlocked.Increment(ref _pinnedInCache);
            }

            public void UnpinInCache()
            {
                    Interlocked.Decrement(ref _pinnedInCache);
            }

            public bool IsPinnedInCache()
            {
                    return _pinnedInCache != 0;
            }
    }
    

    Everything looks neat and quite efficient, but there are drawbacks. First, the global lock over the entire Dictionary is very coarse: it protects far too much data. It is like a huge library with a single elderly librarian who hands out books to everyone in turn; until Shrink finishes, nobody can obtain a context. Second, threads still compete heavily inside the lock mechanism itself.

    Classic synchronization primitives on Windows or Linux require entering kernel mode to acquire or release a lock. Taking a Mutex, for example, costs roughly 1000 processor cycles even when it is free. Optimizations of this were created long ago (for example, CRITICAL_SECTION in native applications and Monitor in .NET). First an attempt is made to take the lock with a single Interlocked.CompareExchange operation, in case it is free. If it is busy, the assumption is that it will be released soon, so the thread spin-waits for a while. Only after the spin limit is exhausted does the thread actually block on the lock. This pays off nicely: either there is no contention at all, or the lock is held so briefly that no extra cycles are wasted on switching into kernel mode.
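    A rough sketch of this spin-then-block idea is below. This is not Monitor's actual implementation; the names and the spin count are illustrative.

    public sealed class SpinThenWaitLock : IDisposable
    {
            private int _taken;                        // 0 = free, 1 = held
            private int _waiters;
            private readonly AutoResetEvent _kernelEvent = new AutoResetEvent(false);

            public void Enter()
            {
                    var spinner = new SpinWait();
                    for (int spins = 0; spins < 32; spins++)
                    {
                            // User-mode fast path: one Interlocked operation if the lock is free.
                            if (Interlocked.CompareExchange(ref _taken, 1, 0) == 0)
                                    return;
                            spinner.SpinOnce();        // burn a few cycles hoping the owner lets go soon
                    }

                    // Slow path: register as a waiter and sleep in the kernel until woken.
                    Interlocked.Increment(ref _waiters);
                    while (Interlocked.CompareExchange(ref _taken, 1, 0) != 0)
                            _kernelEvent.WaitOne();    // the expensive kernel-mode wait
                    Interlocked.Decrement(ref _waiters);
            }

            public void Exit()
            {
                    Volatile.Write(ref _taken, 0);
                    if (Volatile.Read(ref _waiters) > 0)
                            _kernelEvent.Set();        // wake one sleeping waiter
            }

            public void Dispose()
            {
                    _kernelEvent.Dispose();
            }
    }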

    This is an excellent optimization in terms of time and processor resources, but in theory you can reach a level of lock contention where the spin-waiting itself causes excessive, even explosive, CPU consumption. Here is how it can happen: a request sees that the lock is busy right now and decides to spend, say, 32 iterations spin-waiting; after another unsuccessful attempt it finally goes to sleep. Under such contention, every lock acquisition burns a fairly large number of extra cycles on spinning. And we have one exclusive lock on the entire segment that every request must pass through. In the worst case this produces a critical mass of contention that loads the processor to 100%.

    Six months ago Microsoft fixed exactly such a bug for us in the ASP.NET implementation, and recently we found a similar one again.

    While we were working on the new version of the cache, a problem appeared: at an average processor load of about 50-60%, consumption could spike sharply to 100%. The anomalous surge was not linked to an increase in network requests or a sudden influx of new users; the server was running at its normal rhythm.

    .NET has plenty of unique and useful performance counters. One of them shows how many contentions per second happen on the Monitor class, that is, how many collisions there were on classic locks. At normal CPU usage the collisions averaged about 70 per second; during the 100% spikes they reached 500-600. The most interesting part: at average load our own code accounted for 90-92% of the CPU time, while during the spikes that share dropped to 17%. Something else was literally squeezing our code out.

    Here is what we discovered: ASP.NET has an internal type BufferAllocator and several types derived from it. Inside it there is a pool that reuses already-allocated buffers to reduce the number of objects created. It is implemented very simply: a Stack protected by a single lock. The problem was that there was only one instance of each such type per application. In effect we had several global locks that every request passed through, even if only briefly.

    After a long investigation Microsoft fixed its bug: the number of such allocators was increased to one per HttpApplication instance. This significantly reduced lock contention and kept us free of the anomaly for several months.

    Recently the problem of sudden, prolonged 100% CPU consumption appeared again. Already knowing what to look for, we found a very similar bug nearby, in the HttpApplicationFactory type, which pools HttpApplication instances in a Stack and locks on it on every request.

    A bug report like this takes anywhere from a few months to half a year to be resolved, so we developed a mechanism we affectionately call the “hand brake”. Its essence is simple: the server monitors CPU load and watches for a sustained high-load state. If the state persists for three minutes, we slow every incoming request down for a short, random period of time, breaking the chain reaction.
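    A minimal sketch of the hand-brake idea, assuming that CPU load is sampled elsewhere and reported into this class; the threshold, window and delay values are illustrative, not our production code.

    public class HandBrake
    {
            private static readonly TimeSpan OverloadWindow = TimeSpan.FromMinutes(3);
            private readonly Random _random = new Random();
            private readonly object _randomLock = new object();
            private DateTime _overloadedSinceUtc = DateTime.MaxValue;

            // Called periodically by the monitoring loop with the current CPU load (0..100).
            public void ReportCpuLoad(double cpuPercent)
            {
                    if (cpuPercent < 90)
                            _overloadedSinceUtc = DateTime.MaxValue;     // back to normal
                    else if (_overloadedSinceUtc == DateTime.MaxValue)
                            _overloadedSinceUtc = DateTime.UtcNow;       // overload has just started
            }

            // Called at the start of every incoming request.
            public void ThrottleIfNeeded()
            {
                    if (DateTime.UtcNow - _overloadedSinceUtc < OverloadWindow)
                            return;                                      // not overloaded long enough yet

                    int delayMs;
                    lock (_randomLock)
                            delayMs = _random.Next(10, 100);             // short random delay breaks the chain reaction
                    Thread.Sleep(delayMs);
            }
    }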

    As a result, the peaks became short-lived and invisible to users.

    Search for new solutions


    The lock was the bottleneck of the entire server. It is as if the whole city had a single road with a single traffic light. We wanted to increase the number of locks to reduce contention: if you increase the number of roads and traffic lights to ten, there are no fewer cars, but traffic moves faster.

    In .NET we found the excellent ConcurrentDictionary class. It has locks and buckets, and by default their number equals the number of processors multiplied by 4. The class is implemented in a low-lock style and can be modified from several threads in parallel: if two objects fall into two different buckets, they can be locked at the same time; if they fall into one, they simply wait their turn. The class takes locks only on additions; reads happen without locks.

    The Evict operation


    When we used the Dictionary with a lock, Pin was executed inside that lock. With ConcurrentDictionary, fetching the item and pinning it became two separate, non-atomic operations. Suppose there is an item in the ConcurrentDictionary: we take the game entity's context without locking and are about to set the Pin, but at the same moment Shrink arrives, evicts the game entity and performs the Remove. A clear race.

    You could avoid this situation simply by wrapping Pin in yet another lock, but it would look very ugly. It is as if a supermarket had 20 checkouts, you paid for everything at one of them, and you were still forced to pass through one more to re-scan your goods against the receipt.

    We came up with a trivially simple idea: the Evict operation, which uses an ordinary Interlocked.CompareExchange to set the pin counter to the deliberately invalid value of -1, but only if its current value is 0.

    Let's return to the situation described above, now with Evict in place. There is an item in the ConcurrentDictionary, and at the same time one request wants to modify the game entity while another wants to evict it. If Evict managed to set the counter to -1 first, Shrink knows that nobody is using this entity right now and nobody will be able to take it later. Pin, in turn, behaves as if the object is not there and goes back to the ConcurrentDictionary for a fresh copy. If Evict did not manage to change the value, Shrink knows the game entity has been captured by Pin and leaves it alone.

    And if two threads want to Pin the same item whose counter is 0 at the same time, that is not a problem: an item may be pinned more than once, so the counter simply becomes 2. The entity's lock itself is still a single one, so the two requests will edit it in turn.

    public bool PinInCache()
    {
            int oldCounter;
            do
            {
                    oldCounter = _pinnedInCache;
                    if (oldCounter < 0)
                            return false; // the item is being evicted: pretend it is not in the cache
            }
            while (Interlocked.CompareExchange(ref _pinnedInCache, oldCounter + 1, oldCounter) != oldCounter);
            return true;
    }

    public void UnpinInCache()
    {
            Interlocked.Decrement(ref _pinnedInCache);
    }

    public bool EvictFromCache()
    {
            // Succeeds only if nobody holds a pin right now; -1 blocks all future pins.
            return Interlocked.CompareExchange(ref _pinnedInCache, -1, 0) == 0;
    }
    

    // UserCache
    public bool TryGet(long userId, out UserCacheItem<TUser> userCacheItem, bool pinCacheItem)
    {
            if (!_dict.TryGetValue(userId, out userCacheItem))
                    return false;
            userCacheItem.UpdateLastAccessTime();
            // The item was found: succeed if no pin was requested, or if the pin succeeded
            // (that is, the item was not evicted in the meantime).
            return !pinCacheItem || userCacheItem.PinInCache();
    }

    private void ShrinkOldest(int shrinkNumber)
    {
            // Walk the elements from oldest to newest and evict them where possible.
            // The extra Select call hides the fact that the source is an ICollection,
            // so LINQ's internal Buffer does not use Count/CopyTo on the ConcurrentDictionary.
            var orderedByAccess = _dict.Select(kvp => kvp).OrderBy(pair => pair.Value.LastAccessTime);
            UserCacheItem<TUser> dummy;
            int i = 0;
            foreach (var pair in orderedByAccess)
            {
                    if (pair.Value.EvictFromCache())
                    {
                            _dict.TryRemove(pair.Key, out dummy);
                            if (++i >= shrinkNumber)
                                    return;
                    }
            }
    }
    

    As a result, we get highly efficient synchronization built on Interlocked operations. Evict gave us a solid drop in contention, and the pin counter together with Evict became the synchronization point for evicting a game entity.

    The remaining caches, which do not need such operations for editing and exist only to speed up reads, were also rewritten on top of ConcurrentDictionary.

    This approach had a nice additional effect: it reduced the load on the server by 7%. When the ASP.NET bug is not pushing us into the critical state, we see from 0 to 2 lock contentions per second, and up to 10 for a short period after a full garbage collection.

    KeyLockManager and LockManager


    We developed an interesting and quite effective tactic of our own for editing objects in a multi-threaded environment. KeyLockManager is built on the same ideas as the ConcurrentDictionary with Pin and UnPin, but it stores no data of its own; it only lets threads synchronize on the value of a key.

    Suppose you need to write data to a file or send it over the network from two threads at once, but you do not want (or cannot) put the Stream instances into the cache itself. With KeyLockManager the threads can synchronize simply on the file name, for example, while the Stream is stored somewhere else (or is even created only after exclusive access to the key has been obtained).
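    A minimal sketch of the idea; this is an illustration, not our production class. The manager hands out a lock object per key, guarded by the same Pin/Evict-style counter as the cache above:

    public class KeyLockManager<TKey>
    {
            private class KeyLockItem
            {
                    public readonly object Locker = new object();
                    public int PinnedCounter;                  // same Pin/Unpin counter idea as in UserCacheItem
            }

            private readonly ConcurrentDictionary<TKey, KeyLockItem> _items =
                    new ConcurrentDictionary<TKey, KeyLockItem>();

            public void ExecuteUnderKeyLock(TKey key, Action action)
            {
                    KeyLockItem item;
                    while (true)
                    {
                            item = _items.GetOrAdd(key, _ => new KeyLockItem());
                            int oldCounter = Volatile.Read(ref item.PinnedCounter);
                            // Pin the item; retry with a fresh one if it is being evicted (counter < 0).
                            if (oldCounter >= 0 &&
                                Interlocked.CompareExchange(ref item.PinnedCounter, oldCounter + 1, oldCounter) == oldCounter)
                                    break;
                    }
                    try
                    {
                            lock (item.Locker)
                                    action();                  // exclusive section for this particular key
                    }
                    finally
                    {
                            Interlocked.Decrement(ref item.PinnedCounter);
                    }
            }
    }

    Two threads writing to the same file would then call something like keyLockManager.ExecuteUnderKeyLock(fileName, () => AppendToFile(fileName, data)) and simply take turns, while the Stream itself lives wherever is convenient.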

    LockManager is a fuller derivative of the cache. It is essentially the same ConcurrentDictionary with Pin, UnPin and Evict, but without saving to the database: a generalization of the lock-based synchronization idea into a container where you can put absolutely any data.

    We found a use for LockManager in visualizing the movement of troops on the global map.

    Is everything all right in the end?


    Thanks to the ASP.NET bug saga, we came to understand the synchronization machinery and parallel access to data much better and to use them more efficiently. We now use ConcurrentDictionary everywhere, but we clearly understand what we are doing and what the consequences are.

    The only drawback of ConcurrentDictionary is that there is no complete documentation of its behavior; the available descriptions are aimed at novice users. In our own practice we found the following points that are definitely worth paying attention to. You will not find them in MSDN or in articles on the subject.

    The Count property returns the number of elements in the ConcurrentDictionary. Nowhere is it written that evaluating it locks all the buckets and can hurt performance. Count does give you the exact value, but you rarely need it to be exact; if an approximate count is enough, it is better to use the LINQ .Select(..).Count() combination.
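    For example; the difference is only in whether LINQ is allowed to see the dictionary as a collection:

    int exact = _dict.Count;                               // locks every bucket to get the exact count
    int approximate = _dict.Select(pair => pair).Count();  // enumerates instead; no global locking, value may drift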

    The TryRemove method tries to delete an object by the specified key. To do so it locks the bucket first and only then searches. That is fine in itself, but often there is a high chance the object is not in the collection at all. It is better to check for its presence first with TryGetValue, which takes no locks, and only then attempt the removal. Two lookups are often cheaper than constantly locking buckets.
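    For example:

    UserCacheItem<TUser> existing;
    // Probe without taking any lock first; lock the bucket only if the key is probably there.
    if (_dict.TryGetValue(userId, out existing))
            _dict.TryRemove(userId, out existing);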

    The most interesting method is GetOrAdd. It either returns the value stored under the given key, or adds a value and returns it. GetOrAdd has two overloads: the first takes a ready value; the second takes a function that is called when no value is found, after which the result is added.

    Paradoxically, the two overloads work in completely different ways. The GetOrAdd variant with a value immediately locks the bucket, determines that nothing is there, and only then performs the Add. GetOrAdd with a function starts without any locking: if nothing is in the bucket, it calls the function, gets the value, and only then tries to lock and add. The trade-off is this: with the function we may end up executing the value-producing code more than once, but there is less synchronization; with the value we always take the lock and therefore increase contention. A seemingly innocent choice of overload radically changes what happens inside. We use the version of GetOrAdd that accepts a function.
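    For example; LoadUser here is a hypothetical loader, and the locking behavior is the one described above for the framework version we used:

    // Value overload: the bucket is locked up front even when the key is already present.
    var eager = _dict.GetOrAdd(userId, new UserCacheItem<TUser>(LoadUser(userId)));

    // Factory overload: lock-free lookup first; the factory (and the lock) run only on a miss,
    // though the factory may execute more than once under contention.
    var lazy = _dict.GetOrAdd(userId, id => new UserCacheItem<TUser>(LoadUser(id)));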

    The LINQ OrderBy operation. When LINQ is handed a sequence, it checks whether that sequence is an ICollection; if so, instead of enumerating it, it copies the whole collection at once. But our sequence is a ConcurrentDictionary :) Internally LINQ treats it as a collection, creates a buffer and copies into it, sizing the buffer through the Count property, which, as we have already established, locks all the buckets. And if the collection shrinks during the copy, it throws an exception saying the number of elements does not match. That is why we deliberately inserted the extra Select call: to hide the fact that the sequence is a collection.

    What's next?


    At this stage we have only theoretical ideas for improving the server. They may come into play when, in some future version, we decide to switch to full asynchrony. If contention on the ConcurrentDictionary grows too high, or processing becomes asynchronous, you can no longer afford to block a thread on an object and sit there until the object becomes available. Jeffrey Richter came up with elegant asynchronous locks: instead of waiting for the lock to be released, the thread hands over a continuation that is invoked once the lock is released. Awaiting the release is not the same as blocking and waiting. This approach could be adapted to the cache.
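    A minimal sketch of the idea on top of SemaphoreSlim; it is only an illustration of the contract (Richter's own implementation is different), and _userLock in the usage comment is a hypothetical field:

    public class AsyncLock
    {
            private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(1, 1);

            public async Task<IDisposable> EnterAsync()
            {
                    await _semaphore.WaitAsync().ConfigureAwait(false); // no thread is blocked while waiting
                    return new Releaser(_semaphore);
            }

            private sealed class Releaser : IDisposable
            {
                    private readonly SemaphoreSlim _semaphore;
                    public Releaser(SemaphoreSlim semaphore) { _semaphore = semaphore; }
                    public void Dispose() { _semaphore.Release(); }
            }
    }

    // Usage: the code after the await runs as a continuation once the lock is released.
    // using (await _userLock.EnterAsync())
    // {
    //         // edit the game entity
    // }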

    PS


    I will be glad to hear your ideas and thoughts and to discuss them in the comments. Tell us what tricks you know and what you would like to read about in future articles.
