The problem of simultaneously rebuilding caches

    The series of posts on “Web, Caching, and Memcached” continues; start with parts 1, 2, and 3.
    In those posts we talked about memcached, its architecture, possible applications, choosing cache keys, clustering, atomic operations, and implementing counters in memcached.

    Today we will look at the problem of simultaneous cache rebuilding, which arises when a cache entry that has just expired or been evicted receives a large number of simultaneous accesses; this can overload the database.

    The next post will be about tagging caches.

    Simultaneous cache rebuilding


    This problem is typical primarily of high-load projects. Consider the following situation: we have a database query whose result is used on many pages, or on especially popular ones (for example, the front page). The result is cached with an expiration time, i.e. the cache entry is dropped after a certain period, and the query itself is relatively heavy and noticeably loads the backend (the database). At some point the key in memcached expires and is deleted. At that moment several frontends (several, because the result is in active use) request the key, discover that it is missing, and each tries to rebuild the cache by querying the database. That is, several identical heavy queries hit the database at once; once a certain threshold is exceeded, the query no longer completes in reasonable time, even more frontends find the cache missing and send still more queries, and the database can no longer cope. As a result the database server receives a critical load and goes down. This situation is known as the dog-pile effect (see also, for example, korchasa.blogspot.com/2008/04/dog-pile.html). What can be done to avoid it?

    Simultaneous rebuilding becomes a real problem only when two factors combine: many cache accesses per unit of time and a heavy query. The factors can compensate for each other: even on a relatively unpopular page, a very complex and slow query (which arguably should not exist at all) can produce the same situation. So what do we do?

    The following scheme can be suggested: we no longer limit the lifetime of the cache key in memcached; it stays there until it is evicted by other keys. But together with the cached data we store its real expiration time, for example:

    {
        valid until: 2008-11-03 11:53,
        cache data:
        {
           ...
        }
    }
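
    As a rough sketch, such a wrapped value could be written and read like this in Python (assuming the pymemcache client and pickle serialization; the helper names and the 10-minute logical lifetime are illustrative, not from the original post):

        import time

        from pymemcache import serde
        from pymemcache.client.base import Client

        client = Client(('localhost', 11211), serde=serde.pickle_serde)

        LOGICAL_TTL = 600  # logical "valid until" horizon, in seconds

        def cache_set(key, data):
            # expire=0: no memcached-side expiration, the item lives until
            # evicted, so a stale copy stays available during rebuilds.
            wrapped = {'valid_until': time.time() + LOGICAL_TTL, 'data': data}
            client.set(key, wrapped, expire=0)

        def cache_get(key):
            # Returns (data, is_fresh); (None, False) on a complete miss.
            wrapped = client.get(key)
            if wrapped is None:
                return None, False
            return wrapped['data'], time.time() < wrapped['valid_until']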
    

    Now, when we fetch the key from memcached, we can check whether the cache has logically expired using the “valid until” field. If it has, the cache must be rebuilt, but we do it under a lock (locks are discussed in the next section). If we fail to acquire the lock, someone else is already rebuilding the cache, and we can either wait for the new value or return the old one. If we do acquire the lock, we rebuild the cache ourselves, and other frontends will not rebuild it at the same time because they will see our lock. The main advantage of storing in memcached without an expiration time is exactly this: the ability to return the old cache value while the cache is being rebuilt by someone else. Whether to wait until another process builds the cache and fetch the new value, or to return the old value right away, depends on the task: how acceptable stale data is, and how much time can be spent waiting. Most often we can afford to wait 2–3 seconds, polling for the lock to be released, and if the cache still has not been built (which is unlikely, as it would mean the query takes more than 2–3 seconds), return the old value and free the frontend for other work.

    An example of such an algorithm


    1. Fetch the cache entry by its key; its lifetime has expired.
    2. Try to take a lock using the cache_lock key.
      • Failed to take the lock:
        • wait for the lock to be released;
        • did not wait it out: return the old cache data;
        • waited it out: fetch the key again and return the new data (the cache built by another process).
      • Managed to take the lock:
        • build the cache ourselves.
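
    A sketch of this algorithm in Python, continuing the helpers from the snippet above (cache_get, cache_set, and client are assumed from there; it relies on the atomic add-based lock explained in the “Locks in memcached” section below):

        import time

        LOCK_TTL = 10    # seconds; should exceed the longest expected rebuild
        WAIT_LIMIT = 3   # how long we are willing to wait for another frontend

        def smart_get(key, rebuild):
            data, fresh = cache_get(key)
            if fresh:
                return data
            # Stale or missing: try to become the one process that rebuilds.
            # noreply=False makes add report whether the key was really stored.
            if client.add(key + '_lock', '1', expire=LOCK_TTL, noreply=False):
                try:
                    data = rebuild()       # the heavy DB query runs only here
                    cache_set(key, data)
                    return data
                finally:
                    client.delete(key + '_lock')
            # Someone else holds the lock: poll briefly for it to disappear.
            deadline = time.time() + WAIT_LIMIT
            while time.time() < deadline:
                time.sleep(0.1)
                if client.get(key + '_lock') is None:
                    new_data, _ = cache_get(key)
                    return new_data        # freshly built by another process
            # Gave up waiting: fall back to the stale value (None on a miss).
            return data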

    Such a scheme eliminates, or at least minimizes, situations where the backend is flooded with identical “heavy” queries when in reality it is enough to run the query just once. One question remains: how do we implement the lock correctly? Obviously, since the problem of simultaneous rebuilding arises across different frontends, the lock must live in a place visible to all of them, that is, in memcached.



    Locks in memcached


    Let us consider two ways of implementing a lock (a mutex, a binary semaphore) using memcached. The first is incorrect: it cannot guarantee mutual exclusion of parallel processes, but it is the obvious one. The second is fully correct, but not as obvious.

    Suppose we want to take a lock on the key ‘lock’. We try to fetch the key’s value with the get operation. If the key is not found, there is no lock, and with the set operation we set the key’s value to, say, one, with a small expiration time that exceeds the maximum lifetime of the lock, for example 10 seconds. Now, if a frontend crashes without releasing the lock, the key will be destroyed automatically after 10 seconds. So: we take the lock with set, perform the necessary actions, and then release the lock by simply deleting the key with the delete command. If the initial get returned a value, the lock has already been taken by another process and our locking attempt has failed.

    The described method suffers from a race condition. Two processes can execute get at the same time, both can receive the answer “no such key”, both will then set, and both will assume that they took the lock successfully. In a situation like simultaneous cache rebuilding this may be acceptable: the goal here is not to exclude all other processes, but to drastically reduce the number of simultaneous database queries, and even this simple, incorrect variant achieves that.
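
    For illustration, a sketch of this racy variant using the same pymemcache client as above (the helper names are hypothetical, not from the original post):

        def acquire_lock_naive(client, name='lock', ttl=10):
            # RACY: between this get and the set below, another process can
            # run the same get, also see no key, and also "take" the lock.
            if client.get(name) is None:
                client.set(name, '1', expire=ttl)
                return True
            return False

        def release_lock(client, name='lock'):
            client.delete(name)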

    The second variant is correct, and even simpler than the first. To take the lock, it is enough to execute a single command: add, specifying the key name and the lifetime (as small as in the first variant). The add command succeeds only if the key does not yet exist in memcached, i.e. our process is the only one that managed to take the lock. Then we perform the necessary actions and release the lock with the delete command. If add returns the error “such a key already exists”, the lock was taken earlier by some other process.
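
    The same lock, sketched with pymemcache (noreply=False is needed so that add reports whether the key was actually stored):

        def acquire_lock(client, name='lock', ttl=10):
            # add is atomic on the memcached server: it stores the key only
            # if it does not exist yet, so exactly one caller gets True.
            return client.add(name, '1', expire=ttl, noreply=False)

        def release_lock(client, name='lock'):
            client.delete(name)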
