How to save a million dollars with Tarantool

    What are databases for when there are good old files? Why is a file worse than a database? A database is more structured storage: it supports transactions, queries, and so on. The simplest case: a server with a database, and several applications that send queries to it. The database responds, changes something inside itself, and everything is fine, until the load grows so much that the database can no longer cope.

    If this is a read-only load, the problem is solved by replication. You can attach as many replicas to the database as needed, send all reads to the replicas, and send all writes to the master. If there is also a write load, replication does not help, because every write has to be applied on every replica. No matter how many replicas you add, you will not reduce the write load per machine. Sharding comes to the rescue.

    If the database cannot handle the write load, shards can be added almost indefinitely. A shard is more complex than a replica, because the data has to be distributed somehow, across tables or within a table, by hash, by range; there are many options. By adding replicas and shards, you can handle any load on the database. It would seem there is nothing left to wish for, so what is there to talk about?
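    The two distribution options mentioned above can be sketched as follows. This is a minimal illustration of the idea, not a description of any particular database's sharding scheme; the shard count, the hash function, and the range boundaries are assumptions for the example.

```python
def shard_by_hash(key: str, n_shards: int) -> int:
    """Pick a shard by hashing the key: an even spread,
    but resharding moves most keys around."""
    return hash(key) % n_shards

def shard_by_range(user_id: int, boundaries: list) -> int:
    """Pick a shard by range: range scans stay local,
    but a hot range skews the load onto one shard."""
    for shard, upper in enumerate(boundaries):
        if user_id < upper:
            return shard
    return len(boundaries)  # last shard takes everything above the top boundary
```

    With boundaries `[10, 100]`, users 0–9 land on shard 0, 10–99 on shard 1, and everyone else on shard 2.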

    But there is a problem




    ... and it is no longer a technological one. Your boss, watching an ever-growing fleet of servers, starts to grumble, because it costs a lot of money. The load is growing, the number of user requests is growing, and you keep adding servers. You are a techie, you don't think about money; let the financiers do that. So you tell your boss: "Everything is fine. We have an infinitely scalable system. We add servers and everything works great." And the boss replies: "Great, but we are losing money, and something must be done about it. If we don't solve this problem, we will have to close the whole business, because even though the business is growing, the cost of databases and servers is growing faster." And it is you, not the financiers, who must solve this problem, because it probably lies in the technological plane. What next? Amazon is far more expensive. Optimize? You have already optimized all the queries.

    The solution may be to cache the data that is selected most often: keep it in a cache and serve it from there, without resorting to numerous replicas and shards.

    Cache issues


    Well, the problem is solved: a single memcached replaces a whole rack of replica servers. But everything has a price.

    1. The application writes both to the cache and to the database, and the two are not replicated with each other, so the data becomes inconsistent. For example, you write to the cache first, then to the database. The database write fails for some reason: the application crashed, the network blinked. The application returns an error to the user, but the new data is already in the cache. So the cache holds one version of the data and the database holds another, nobody knows about it, and the application keeps working with the cache. When the cache restarts, that data is lost, because the database has the other copy.

      The funny thing is that writing in the reverse order leads to the same result. You wrote to the database, but the cache write failed. The application works with stale data from the cache, the database has the new data, and nobody knows about it. The cache restarts, and the update is lost again. In both cases you lose one of the key properties of a database: the guarantee that committed data is stored forever. A commit is no longer really a commit. You can deal with the inconsistency by writing a smart cache, so that the application works only with the cache. It can be a write-through cache, as long as the application never talks to the database directly. The cache first writes incoming data to the database, and only then to itself. If the database write fails for any reason, the data must not be written to the cache either. This way the data stays in sync.
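      The write-through ordering can be sketched like this. The `db` dict, the `WriteThroughCache` class, and `DatabaseError` are illustrative stand-ins invented for the example, not a real client API; the point is only the order of operations.

```python
class DatabaseError(Exception):
    pass

class WriteThroughCache:
    def __init__(self, db):
        self.db = db     # stand-in for the real database (any mapping)
        self.cache = {}  # in-memory copy of hot data

    def write(self, key, value):
        try:
            self.db[key] = value  # 1. commit to the database first
        except Exception as exc:
            # database write failed: the cache must NOT be updated,
            # otherwise the two copies diverge
            raise DatabaseError(key) from exc
        self.cache[key] = value   # 2. only then update the cache

    def read(self, key):
        if key in self.cache:     # serve hot data from memory
            return self.cache[key]
        value = self.db[key]      # cold miss: fall back to the database
        self.cache[key] = value
        return value
```

      If step 1 fails, step 2 is never reached, so the cache and the database cannot disagree (except for the rare lost-confirmation case discussed next).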

      Still, there remains one rare case where the data goes out of sync: the application writes to the cache, the cache writes to the database, the database commits internally and confirms the operation to the cache, but at that moment the network breaks and the confirmation never arrives. The cache believes the data was not written to the database and does not apply it on its side, while the database has already applied it. The application works with old data, then the cache restarts, and the data differs again. This is a very rare case, but it is possible.

      And most importantly, a smart cache does not solve the sharding problem. And your boss does not like sharding, because it is very expensive: you have to buy many, many servers.
    2. Besides, a cache does not save us from sharding, because it does not speed up writes. Every commit must be committed somewhere, and not to the cache.
    3. The next problem: the cache is not a database, but a plain key/value store. You lose queries and transactions. Indexes and tables are lost too; they can be rebuilt, after a fashion, on top of the key-value cache. So the application has to be simplified and radically reworked.
    4. The fourth problem is the cold start. A freshly started cache is empty, it has no data, so all selects go straight to the database, past the cache, until it warms up. So you have to keep a number of replicas around, if perhaps not a full set, just to survive the warm-up, because while the cache is warming up a lot of selects hit the database. Doesn't that look wasteful? Yet without those replicas you cannot start normally. Let us look at the solution to this problem in more detail.


    Cold start




    At some point the following idea arose: for the data to always be "warm", it simply must never be allowed to "cool down". To achieve that, the cache must be persistent, that is, the data must be stored somewhere on disk, and then everything will be fine: the cache starts and loads its data. But there was a doubt: a cache is RAM, it must be fast; if you pair it with a disk, won't it become as slow as the database? In fact, it will not.



    The easiest way is to "persist" the cache every N minutes by dumping it to disk in full. The dump can be done asynchronously, in the background; it does not slow down operations and does not load the processor. This speeds up warming many times over: when the cache starts, its own data dump is already at hand, and it reads it linearly and very quickly. This turns out to be faster than any number of database replicas. But it cannot be that easy, right? Suppose we dump every five minutes. If a failure happens in between, all changes accumulated since the previous dump are lost. For some applications, such as statistics, this does not matter; for others it does.

    The second problem is that dumping loads the disk heavily, and the disk may be needed for something else, such as logs. During the dump the disk will be slow, and this will happen over and over. This can be avoided by keeping a log instead of periodic dumps. The question immediately arises: "How so? It's a cache, it's supposed to be fast, and here we are logging everything." This is actually not a problem. If you write the log to a file sequentially on an ordinary hard drive, the write speed reaches 100 MB/s. With an average transaction size of 100 bytes, that is a million transactions per second. Logging a cache will obviously never run into disk throughput. The IOPS problem is solved along the way: we load the disk exactly as much as needed for all the data to be persistent. The data is always fresh, and we do not lose it.
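    The append-only log can be sketched as follows. The length-prefixed record framing here is an illustrative choice for the example, not Tarantool's actual write-ahead log format.

```python
import os
import struct

class AppendLog:
    """A cache journal: records are only ever appended, sequentially.
    Sequential appends are what let a plain hard drive sustain ~100 MB/s,
    i.e. about a million 100-byte records per second."""

    def __init__(self, path):
        self.f = open(path, "ab")

    def append(self, record: bytes):
        # length-prefix each record so replay knows where it ends
        self.f.write(struct.pack(">I", len(record)) + record)
        self.f.flush()
        os.fsync(self.f.fileno())  # make the write durable before acknowledging

    def replay(self):
        """Read the whole log back, in order, to rebuild state on startup."""
        records = []
        with open(self.f.name, "rb") as f:
            while True:
                header = f.read(4)
                if not header:
                    break
                (n,) = struct.unpack(">I", header)
                records.append(f.read(n))
        return records
```

    The `fsync` on every append is the strictest durability setting; in practice it can be batched across many records to trade a little durability for throughput.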

    But journaling has its drawbacks. In a log, updates to the same item are not collapsed into a single record. There are many of them, and on startup the cache has to replay all these records, which can take longer than starting from a dump. Besides, the log itself can take up a lot of space, perhaps not even fitting on the disk.

    To solve the problem, you can combine both approaches: snapshotting and logging. Why not? We can dump, that is, create a snapshot, once a day, and write to the log at all times. In the snapshot we save the ID of the last applied change. When the cache needs to start, we read the snapshot, apply it to memory, then read the log starting from the last change recorded in the snapshot, and apply it on top. That's it, the cache is warm. This is about as fast as reading from a dump. We have dealt with the cold start; now let's solve the remaining problems on our list.
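    The combined scheme can be sketched like this. The JSON snapshot format and the `last_id` bookkeeping are illustrative assumptions, not Tarantool's actual file formats.

```python
import json

def save_snapshot(path, data: dict, last_id: int):
    """Dump the whole in-memory state plus the ID of the last applied change."""
    with open(path, "w") as f:
        json.dump({"last_id": last_id, "data": data}, f)

def recover(snapshot_path, log):
    """Rebuild the in-memory cache on startup: load the snapshot, then
    replay only the log entries newer than the snapshot."""
    with open(snapshot_path) as f:
        snap = json.load(f)
    data, last_id = snap["data"], snap["last_id"]
    for entry_id, key, value in log:
        if entry_id > last_id:  # skip changes already inside the snapshot
            data[key] = value
    return data
```

    After recovery, the log up to the new snapshot can be discarded, which keeps the log from growing without bound.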

    The remaining three problems


    A database has the property of durability, provided through transactions. A database usually stores both hot and cold data. Once we introduce a cache, our data is explicitly divided into hot and cold. There is usually a lot of cold data and very little hot data; that is how life works. Yet we replicate and shard the database into many, many copies and shards only to serve the hot data. We might say to ourselves: "Why are we copying all of this? Let's shard only the hot data." But that does not help at all: we need exactly the same number of servers, because we shard and replicate not because the data does not fit into memory or on disk, but because we run out of CPU. The database simply cannot keep up. So sharding and replicating only the hot data does not solve the problem, and the boss is still angry because there are ever more servers to pay for.

    What can be done? We have a cache, but the hot data in the database does not let us live in peace; we keep adding replicas and shards for it. However, the cache stores data just as the database does, and if desired, replication can be added to it. So what prevents us from using the cache as the primary data source? The lack of features like transactions? We can add transactions. This solves the remaining three problems, since the hot data no longer needs to be stored in the database at all, only in the cache. Sharding also becomes unnecessary: we do not have to split the database across many servers, because the cache handles the load, including writes. And it handles writes because a cache with a journal works almost as fast as a cache without one.

    So, all the properties inherent to a database can be built into the cache. We did exactly that, and the resulting brainchild was called Tarantool. In read and write speed it is comparable to a cache, while it has all the database properties we need. Thus we can abandon the database for storing hot data. All the problems are solved.

    Features and capabilities of Tarantool




    So, we replicated and sharded all that cold data only in order to process a small amount of hot data. Now the cold data, rarely requested and rarely modified, lives in SQL, and the hot data goes to Tarantool. In other words, Tarantool is a database for hot data. For most tasks, two instances (a master and a replica) are more than enough, although you could get by with one, because its access pattern and RPS are the same as a regular cache's, even though it is a database. For some, this is a psychological problem: how can you abandon the database as the authoritative data store, with its cozy durability and transactions, and move to a cache? In fact, by starting to use memcached or any other cache, you have already given up the benefits of the database. Remember the inconsistency and the lost updates.

    A few words about the parallel execution of transactions. Tarantool treats a Lua procedure as a single transaction: it performs all reads from memory, collects all writes in a temporary buffer, and at the very end writes them to disk in one piece. While that data is being written, another transaction can already read the old, not yet committed data from memory, without any locks! Transactions queue up only if the throughput of sequential disk writes is exceeded.
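    The buffering scheme just described can be illustrated with a toy model. This is only a sketch of the idea in Python, not Tarantool's engine; the `Store` and `Txn` classes are invented for the example, and the disk-write step is elided.

```python
class Store:
    """Committed in-memory state, visible to all readers."""
    def __init__(self):
        self.data = {}

    def transaction(self):
        return Txn(self)

class Txn:
    """All writes go to a private buffer; readers of the store keep
    seeing the old committed state until commit applies the buffer."""
    def __init__(self, store):
        self.store = store
        self.buffer = {}  # staged writes, invisible to other transactions

    def get(self, key):
        if key in self.buffer:  # a transaction sees its own writes
            return self.buffer[key]
        return self.store.data.get(key)

    def put(self, key, value):
        self.buffer[key] = value

    def commit(self):
        # In the real system the buffer would first go to the journal;
        # here we only apply it to memory in one piece.
        self.store.data.update(self.buffer)
        self.buffer = {}
```

    Until `commit`, other readers see the old state without taking any locks, which is the property the paragraph above describes.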

    How we separate hot data from cold


    This process is not automated yet. We analyze the logs and decide that data with a certain access pattern can be considered hot. For example, user profiles are hot, so we move them to Tarantool. Along the way we inevitably pick up some cold data too, because some users, for example, no longer use Mail. But despite that overhead, it is still more efficient than using MySQL, if only because Tarantool has a highly optimized memory footprint. A curious fact: an SQL database stores everything on disk, but 10–20% of it needs to be cached in memory. At the same time, the memory footprint of traditional SQL databases is three to five times worse than Tarantool's, so that 20% turns into 100%. It turns out that at a comparable load the SQL server does not even win on memory, even though it cannot handle the load.

    Tarantool vs Redis


    From our point of view, there are two key differences between Tarantool and Redis.

    1. According to our tests, Tarantool is 30% faster than Redis. Test results are presented on the Tarantool website and in this article.
    2. Tarantool is a database. You can write server-side scripts in Lua there. Redis also has Lua, but it is single-threaded and blocking; you can write your own scripts, but their scope is very limited, and Lua in Redis is not transactional. In this sense Tarantool is perfect: it is faster and lets you use transactions wherever you need them. There is no need to fetch a key from the cache, update it, and put it back while someone else may be changing it in parallel.


    One million dollars


    This sum is not an invention for a catchy headline, but money actually saved in one of the Mail.Ru Group projects. We needed to store user profiles somewhere. They used to live in an old storage system, and we were deciding where to move them. Initially we looked at MySQL: we deployed 16 MySQL replicas and shards and began slowly mirroring the profile read and write load onto them. Profiles are small pieces of data (500 bytes to a kilobyte) storing the name, the number of emails sent, various flags and service data, the kind of data that is needed on almost every page. It is requested and updated often. At 1/8 of our total load, the farm of 16 MySQL servers broke down, and that was after all the optimizations we had applied. After that we decided to try Tarantool. It turned out that Tarantool comfortably handled, on four servers, the load that had previously been spread across 128 servers. In fact, it held it on a single server; we installed four for safety. The savings of 128 servers and the reduced hosting costs amounted to a million dollars.

    And this is just one case. Tarantool has found use in many of our projects. For example, 120 Tarantool instances work in Mail and Cloud. If MySQL were used there at the current load level, we would have to run tens of thousands of servers, whether MySQL or another SQL database: PostgreSQL, Oracle, whatever. It is hard even to estimate how many millions of dollars that would cost. The moral of this fable is that you need to pick the right tool for each task: that alone saves a lot of money. Cold data belongs in an SQL database designed for it, while hot data, which is requested and updated often, belongs in storage adapted for exactly that, which is Tarantool.

    In version 1.7, which is currently in development, we want to make a fully automatic clustering solution with sharding and Raft-like replication. Stay tuned!
