How are Yandex clouds arranged: Elliptics

    Over the past few years, a fashionable trend has appeared in the IT world - the use of all the "cloud" to develop new products. There are not so many public cloud providers, the most popular among them are Amazon. However, many companies are not ready to trust private data to anyone, at the same time they want to store it securely, and therefore they raise private small clouds.

    Any cloud consists of two main components: a Single Entry Point (ETV) and Cloud Magic (OM). Consider Amazon S3 cloud storage : the ETB uses a fairly convenient REST API, and Cloud Elves are powered by dollars. Companies wishing to place small video files or a database in S3 preliminarily calculate on the calculator the amount that they will pay per month for the planned load.

    This article is about another cloud storage in which elves feed on the Spirit of Freedom, electricity and they also need a little bit of “ cocaine ”. This is called the Elliptics repository . The history of its creation dates back to 2007 as part of POHMELFS . The following year, Elliptics was moved to a separate project, and many different approaches to distributed data storage were tried out in it, many of them did not fit because of the complexity, because of too little practicality in real life. In the end, its author, Eugene Polyakov, came to the creation of Elliptics in a modern form.





    fault tolerance


    This is the first of the Five Elements that Elliptics aims to preserve. Cars constantly break down, hard drives fail, and elves from time to time go on strike and throw bark. The data center can also fall out of our world at any time, due to the power of black magic and witchcraft. The most obvious of these are Cable Breaks and Loss of Electricity.

    One sad story
    Once, during the rain, one little kitten felt cold, and he decided to warm himself. There was not a single room nearby, except for a transformer box, which was made strictly according to GOST — with an opening below 10 centimeters. The transformer was short-circuited, and one kitten became smaller, and the data center turned out to be without electricity. This happened on a weekend, on Saturday, and therefore full-time electricians were not at the workplace and could not restore power supply. Of course, a diesel generator was available and worked, but the fuel was only enough for a few hours. The security service refused to let into the territory of the data center an operatively called fuel truck. In the end, everything was decided safely, but no one is safe from such situations.

    Everyone has long known what to do if a drive or server is disconnected. But no one wonders what will happen to their system if the data center, Amazon region fails or another major event occurs. Elliptics was originally planned to solve this class of problems.

    In Elliptics, all documents are stored using 512-bit keys, which are obtained as a result of the sha512 function of the document name. All keys can be represented as a Distributed Hash Table (DHT), a ring with a range of values ​​from 0 to 2 512. The ring is randomly divided between all the machines in one group, and each key can be simultaneously stored in several different groups. It can be roughly assumed that one group is one data center. As a rule, Yandex stores 3 copies of all documents, and if some machine fails, we lose only one copy of a part of the ring. Even with the failure of the entire data center, information is not lost, we have all the documents left in at least two more copies, and therefore elves can still bring joy to people.

    Extensibility


    Elf administrators have always wanted to be able to connect additional machines, and Elliptics can do it! When you connect a new computer to a group, it takes random intervals from those 2,512 keys, and all subsequent requests for these keys will come to this machine.

    If we have three groups consisting of hundreds of machines each, then adding new ones will entail rebalancing and, as a result, large volumes of data will have to be transferred over the network to restore them. For those who are not suitable for this, we have developed a load balancing system - Mastermind , which runs on the Cocaine cloud platform. Mastermind is a set of peer-to-peer peers that determine in which groups a particular file will be stored based on the load on each server. There is also difficult Mathematical Magic here - free space, disk load, CPU load, switch loading, drop rate and much more are taken into account. In the event of failure of any of the Mastermind nodes, everything continues to work, as before. If the group size is kept small, then when adding new machines it is enough for them to issue a new group and let Mastermind know about it. In this case, information about which groups should be searched for is added to the file identifier. This mechanism allows for truly endless and very simple storage expansion.

    Data security


    Let's look at what the system does with old data that turned out to be on the wrong machines after balancing or even got lost.

    In this case, Elliptics provides a recovery system. The scheme of her work is as follows: she goes through all the machines and, if the data is not on her own machine, moves them to the right one. If it turns out that some document is not in three copies, then during this procedure the document will be multiplied by the necessary machines. Moreover, if the system works with Mastermind, then the group number of the file location is taken from the meta-information. The recovery speed depends on many factors and the exact speed of key recovery cannot be called, but, as load testing shows, the most bottleneck is the network channel between the machines, which is used to transfer data.

    Speed


    Elliptics has low overhead and is capable of performing up to 240 thousand read operations per second on a single machine when working with a cache. When working with a disk, the speed naturally drops and rests on the speed of reading data from the disk.

    Based on Elliptics, the HistoryDB project has been implemented. With it, we store logs from various external events. During testing, a group of three machines calmly and smoothly cope with a load of 10 thousand requests per second, when logs from 30 million users were written. At the same time, 10% of them generated 80% of all data. A total of 30 million users accounted for about 500 million updates every day. However, simply storing this data would be too easy. Therefore, the system was tested in such a way that under load, it allowed to get a list of all users who were active for a certain period (day / month), look at the logs of any user for any day, and it was possible to add any other secondary indexes to user data.



    Also, based on Elliptics, Yandex built the work of services such as Yandex.Music, Photos, Maps, Market, vertical searches, people search, backup mail. Preparing to move and Yandex.Disk.

    Simplicity of architecture


    In Elliptics, clients connect directly to servers, and this allows you to:
    • have a simple architecture;
    • ensure that transactions are performed correctly;
    • always know which machine in the group is responsible for a particular key.

    For users of the data storage service, there are the following APIs with which you can easily write, delete, read and process data:
    • asynchronous library in C ++ 11
    • Python binding
    • HTTP frontends based on FastCGI and TheVoid (using boost :: asio)

    However, there is no silver bullet, and Elliptics has its limitations and problems:
    • Eventual consistency. Since Elliptics is fully distributed, in case of various problems the server may render the version of the file older than the current one. In some cases, this may not be applicable, then due to the sagging response time, you can use more reliable methods of requesting data.
    • Due to the fact that the client writes data to several servers in parallel, the network between the client and servers can become a bottleneck.
    • The API may not be convenient for high-level queries. At the moment, we do not provide convenient SQL-like data queries.
    • Elliptics also does not have high-level transaction support, so it’s impossible to guarantee that a group of commands will either execute all or not at all.

    In the following articles we will give an example of the use of Elliptics and tell us more technical details: the internal structure of the cache, the operation of our eblob backend, data streaming, secondary indexes and much more.

    Also popular now: