MemcacheDB and MemcacheQ: key components of a high-performance infrastructure

    Today we'll talk about two components for a high-performance, scalable architecture based on the memcached server: MemcacheDB, a distributed key-value data store, and MemcacheQ, a message queuing system.



    First, let's consider what we have at our disposal for building a distributed data storage infrastructure for a web application. The first thing that comes to mind is database clustering, now supported by all common systems, along with various replication technologies. For example, MySQL, the most popular database management system for web projects, supports both replication and clustering. You can also turn to distributed file systems, for example Apache Hadoop, and store the data there. But such solutions are often too heavyweight; the actual requirements are usually much simpler: you just need to store and operate on key-value pairs. Seriously, this functionality covers the needs of 90% of web applications. And if we add to this the ability to manipulate data very quickly, to spread it across a distributed multi-server system, and to persist it in a failure-resistant way, we get a very attractive platform.
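To make the key-value idea concrete, here is a minimal sketch (plain Python, no real servers involved; the addresses are made up) of how a memcached-style client typically maps a key to one of several servers by hashing, so every client computes the same mapping without any coordinator:

```python
import hashlib

def pick_server(key, servers):
    """Map a key to one of the configured servers by hashing.

    This is the simple modulo scheme used by early memcached clients;
    production clients usually prefer consistent hashing so that
    adding a server remaps only a fraction of the keys.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]

servers = ["10.0.0.1:21201", "10.0.0.2:21201", "10.0.0.3:21201"]
# Every client derives the same server for the same key.
target = pick_server("user:42:profile", servers)
```

The mapping is deterministic, so any number of web frontends agree on where each key lives with no shared state between them.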



    Memcached has long been known as a data caching server used on many highly loaded projects, including Wikipedia and LiveJournal. It lets you cache arbitrary data in memory and operate on it quickly, but only the simplest operations are supported: it is clearly not a complete database. Using memory as the storage medium is ideal for caching, but when reliability or fault tolerance is required, that burden falls entirely on the server and its hardware.

    To solve these issues, combining the high speed, simple interface, and principles of memcached with the reliability of conventional databases, MemcacheDB was developed. It is a distributed key-value storage system compatible with the memcached API, which means any memcached client can transparently work with both the cache and the data store without noticing the difference. But unlike a cache, MemcacheDB stores data on disk: the embedded BerkeleyDB database engine is used as the backend, and its features, in particular transactions and replication, ensure storage efficiency and reliability.



    In terms of data access speed, MemcacheDB is on the same level as memcached and comparable to specialized databases such as CouchDB; in numbers this amounts to tens of thousands of writes and reads per second, according to published benchmarks, including a comparison with CouchDB. The developers themselves warn that MemcacheDB does not have a cache of its own, so you should not use it everywhere as a replacement for memcached itself; it simply implements a different storage strategy behind a memcached-compatible interface.

    Despite supporting only the simplest data operations, writing, reading, updating, and deleting, this functionality is often enough for tasks where we are used to reaching for a regular database. Offloading part of these operations to such a specialized solution significantly relieves the main database for operations that really need a full-featured data engine. For example, an increment/decrement command is supported, which lets you implement various counters and statistics without touching the database at all, while the system serves thousands of clients in real time.
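A counter built on incr looks roughly like the sketch below. To keep the example self-contained and runnable, an in-memory stand-in replaces the real client; against a live server the same calls would go over the network, and incr would be atomic across all clients:

```python
class FakeStore:
    """In-memory stand-in for a memcached/MemcacheDB client,
    implementing just the set/incr calls used below."""
    def __init__(self):
        self.data = {}

    def set(self, key, value):
        # memcached stores values as byte strings
        self.data[key] = str(value)

    def incr(self, key, delta=1):
        # memcached's incr treats the stored value as an unsigned
        # integer and increments it atomically on the server side
        self.data[key] = str(int(self.data[key]) + delta)
        return int(self.data[key])

store = FakeStore()
store.set("hits:article:17", 0)
for _ in range(3):
    hits = store.incr("hits:article:17")
# the counter now reads 3 without a single query to the main database
```

The key naming scheme (`hits:article:17`) is just a convention invented for the example; any scheme that keeps keys unique works.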



    MemcacheDB is easy to deploy: compile and install it from source along with BerkeleyDB (which requires no administration), and that's it. You simply configure the access parameters for clients, the port, and a few other settings that affect performance, such as the size of the data buffer, the directory for the database files, and the size of the in-memory cache. Do not think that all reads go to disk: although the system keeps the database in files on disk, it of course caches as well, which lets it approach the speed of the original memcached while also providing reliable storage.
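A deployment might look like the following sketch. The flag names are from the MemcacheDB documentation as I recall it, and the paths are made up; verify against `memcachedb -h` for your version before relying on them:

```shell
# Build from source (the BerkeleyDB headers must be installed first)
./configure
make && sudo make install

# Start a daemon on port 21201, keeping the database files
# in /data/memcachedb; -N enables BerkeleyDB's DB_TXN_NOSYNC,
# trading a little durability for noticeably faster writes.
memcachedb -d -p 21201 -H /data/memcachedb -N -u nobody
```

After that, any memcached client pointed at port 21201 talks to the persistent store.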

    The most interesting feature of MemcacheDB is its ability to run on multiple servers, using replication for data exchange and database synchronization. MemcacheDB supports several replication strategies that, depending on your needs, favor either data durability or speed. The main distributed model supported by the system is one master server and several slave servers that are used only for reading data.

    In the case of multiple servers, the system can use the following replication strategies:
    • DB_REPMGR_ACKS_ALL - the master waits for confirmation of a successful write from all other servers;
    • DB_REPMGR_ACKS_ALL_PEERS - the master waits for a response from all slave servers that are in turn masters of their own groups (a multi-level system);
    • DB_REPMGR_ACKS_NONE - no confirmations are expected from other servers; the highest speed, but no guarantee that copies of the data exist anywhere but the master;
    • DB_REPMGR_ACKS_ONE - the master waits for confirmation from at least one server;
    • DB_REPMGR_ACKS_ONE_PEER - the same as the previous case, but the confirmation must come from a server that is in turn the master of its own group;
    • DB_REPMGR_ACKS_QUORUM - the master waits for confirmation from the minimum number of servers that guarantees data durability, which are also masters of their own groups; this strategy is used by default.
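The difference between these policies comes down to how many acknowledgements the master waits for before considering a write committed. A toy model of that decision (this is not MemcacheDB or BerkeleyDB code, just an illustration of the list above; the quorum formula is a simple-majority assumption):

```python
def acks_required(policy, n_peers):
    """How many peer acknowledgements the master waits for under each
    policy, given n_peers replicas. A simplified model of the
    BerkeleyDB replication manager ack policies."""
    if policy == "ALL":
        return n_peers            # wait for everyone: safest, slowest
    if policy == "NONE":
        return 0                  # fire and forget: fastest, no guarantee
    if policy == "ONE":
        return 1                  # one copy besides the master
    if policy == "QUORUM":
        # enough peers that the write survives the loss of a minority
        return n_peers // 2 + 1 if n_peers else 0
    raise ValueError(f"unknown policy: {policy}")

# With four slaves: ALL waits for 4 acks, QUORUM for 3,
# ONE for a single ack, NONE returns immediately.
```

This is why QUORUM is a sensible default: it keeps writes durable across failures without blocking on the slowest replica the way ALL does.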




    All servers in a group must use the same replication strategy, though I'm not sure this restriction applies to complex systems with many groups: in the end, you can configure a request to first be distributed across one server group with one strategy, and then have the data propagate further by a different algorithm, with the root server not even suspecting it.

    Logging and backup of the database files themselves, including hot backup, are also supported, but that is a separate conversation along with the specifics of installation and use.

    So we have the means to build something like our own Amazon S3 service, as distributed as we like, fast and reliable, with a simple and understandable universal API. There are many applications for such a system: in almost every highly loaded project you can move part of the logic from the database into such a storage system, gain high fault tolerance, and offload many simple queries from the main database.

    The second project, also based on the memcached code, is MemcacheQ. This is a message queuing system with an even simpler API: it supports only two commands, write and read. A message queue here is a named FIFO buffer: messages are written into it, and a client, specifying the queue name, can retrieve them later in order. The maximum message size is 64 KB, and the data itself is stored in the same BerkeleyDB database, so the same storage guarantees, replication, and other features apply. Such a system can be used to build communication between users within a project, mail systems, chats, and other similar things that need this functionality multiplied by high speed and reliability.
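The write/read semantics can be sketched with an in-memory stand-in; a real MemcacheQ server exposes the same behavior through the memcached protocol's set and get commands (the queue name below is invented for the example):

```python
from collections import deque

class FakeQueueServer:
    """In-memory stand-in for MemcacheQ: 'set' appends a message to
    the named queue, 'get' removes and returns the oldest one (FIFO)."""
    MAX_MESSAGE = 64 * 1024  # MemcacheQ caps messages at 64 KB

    def __init__(self):
        self.queues = {}

    def set(self, queue, message):
        if len(message) > self.MAX_MESSAGE:
            raise ValueError("message exceeds the 64 KB limit")
        self.queues.setdefault(queue, deque()).append(message)

    def get(self, queue):
        q = self.queues.get(queue)
        return q.popleft() if q else None  # None when the queue is empty

mq = FakeQueueServer()
mq.set("mail:user42", b"message one")
mq.set("mail:user42", b"message two")
first = mq.get("mail:user42")   # oldest message comes out first
```

Each get consumes a message, so the queue doubles as a simple work-distribution mechanism: multiple readers each receive different messages.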

    These two projects, MemcacheDB and MemcacheQ, are quite simple in their external interface and seemingly limited in capabilities, yet they let you build very powerful, highly loaded projects on top of them if you take their capabilities into account already at the design stage. For many projects this will eliminate or reduce the load on expensive resources such as the database, while providing high fault tolerance and flexibility.
