The Clouds on Zeppelin: Experience in Creating the Mail.Ru Group Cloud Service



    We started working on the Mail.Ru Cloud in June 2012. Over a year and a half we traveled a long and thorny path from the first prototype to a public service that withstands loads of over 60 Gbit/s. Today we want to share the story of how it happened.

    Do you remember how it all started?


    It all started with a single development virtual machine running the first prototype of the metadata storage daemon. Back then everything was crystal clear: we kept a log of each user's actions (create a directory, create a file, delete a file, and so on), and replaying that log produced the final state of the user's virtual file system. File bodies were stored in a single directory shared by all users. This quick-and-dirty system, written in two weeks, ran for several months.
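    The idea of that first prototype can be sketched roughly like this: the user's virtual file system is just the result of replaying an ordered action log. The action names and data structures below are illustrative, not the daemon's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    is_dir: bool
    children: dict = field(default_factory=dict)  # name -> Node (dirs only)

def replay(log):
    """Build the final VFS state by replaying an ordered action log."""
    root = Node(is_dir=True)

    def lookup_parent(path):
        node = root
        parts = [p for p in path.split("/") if p]
        for part in parts[:-1]:
            node = node.children[part]
        return node, parts[-1]

    for action, path in log:
        parent, name = lookup_parent(path)
        if action == "mkdir":
            parent.children[name] = Node(is_dir=True)
        elif action == "create":
            parent.children[name] = Node(is_dir=False)
        elif action == "delete":
            del parent.children[name]
    return root

log = [
    ("mkdir", "/photos"),
    ("create", "/photos/cat.jpg"),
    ("create", "/notes.txt"),
    ("delete", "/notes.txt"),
]
fs = replay(log)
```

    A nice side effect of this design is that replaying a prefix of the log yields the file system as it was at any earlier moment.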



    Metadata server


    The prototype moved toward a production version when we wrote our own server for storing metadata. It can be described as an advanced key-value store, where the key is a twelve-byte tree identifier and the value is the hierarchy of files and directories of an individual user. On top of this tree the file system is emulated: files and directories are added and deleted. The first client communicated with the service over a binary session-oriented protocol: a TCP connection was established, change notifications were pushed through it so that two clients could stay in sync, and the clients wrote commands into the same persistent connection. This prototype also ran for a couple of months.
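    The article does not describe the wire format, but the idea of a binary protocol keyed by a twelve-byte tree identifier can be sketched like this. The opcodes and packet layout below are invented for illustration; the real protocol's framing is not public.

```python
import struct

# Illustrative framing only: a 12-byte tree id, a 1-byte opcode and a
# length-prefixed payload. The real protocol's layout is not public.
OP_MKDIR = 1
OP_CREATE = 2
OP_DELETE = 3

def pack_command(tree_id: bytes, opcode: int, path: str) -> bytes:
    assert len(tree_id) == 12
    payload = path.encode("utf-8")
    return tree_id + struct.pack(">BI", opcode, len(payload)) + payload

def unpack_command(packet: bytes):
    tree_id = packet[:12]
    opcode, length = struct.unpack(">BI", packet[12:17])
    path = packet[17:17 + length].decode("utf-8")
    return tree_id, opcode, path

pkt = pack_command(b"\x00" * 11 + b"\x2a", OP_MKDIR, "/photos")
```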

    Of course, we understood perfectly well that clients could not afford to keep a permanent TCP connection to the server. Moreover, at that point we had no authentication at all: we simply trusted the client. That was fine for development, but completely unacceptable for production.

    HTTP wrapper


    As the next step, we wrapped the individual packets of our binary protocol in HTTP headers and began communicating with the server over HTTP, indicating in the query string which tree we want to work with. This wrapper is still implemented as an Nginx module that maintains a TCP connection to the metadata server; the server itself knows nothing about HTTP.
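    The wrapper's job can be sketched as follows: pull the tree identifier out of the query string and relay the raw binary body to the metadata server untouched. The real module is an Nginx extension in C; this Python sketch only shows the routing idea, and all names are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical sketch: the query string names the tree; the binary
# packet in the request body is forwarded to the metadata server as-is.
def handle_http_request(url: str, body: bytes, send_to_metad):
    query = parse_qs(urlparse(url).query)
    tree_id = query["tree"][0]           # which user tree we work with
    return send_to_metad(tree_id, body)  # forward the binary packet untouched

# Stand-in for the persistent TCP connection to the metadata server.
def fake_metad(tree_id, packet):
    return b"OK:" + tree_id.encode() + b":" + packet

reply = handle_http_request("/meta?tree=abc123", b"\x01\x02", fake_metad)
```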

    Such a solution, of course, still gave us no authorization check, but the client could already work under conditions close to real ones: for example, a short-lived disconnection no longer threw everything away.

    At this stage we also ran into the need for a separate HTTP notifier. We took the Nginx HTTP Push Module, a Comet server for notifications, and began sending notifications to it from our metadata server.

    When this not only worked, but also gracefully survived crashes and restarts of the Comet server, we moved on to the next stage.

    Authorization


    The next step was introducing authorization on both the server and the client. Armed with Nginx, the source code of a module performing similar actions, and a manual for our internal authorization and session-verification interface, we wrote an authorization module. The module takes the user's cookie as input and, as output, determines the user's identifier, retrieves their email, and performs other useful actions.

    This gave us two nice bonuses at once: the same authorization worked both for requests to the metadata server and for requests to the notification server.

    Loader


    File uploads were handled by the Nginx DAV module, which received files along with their hashes and put them locally into a single directory. At that point nothing was replicated, and everything would have worked only until the first drive failure. Since this was long before even closed internal testing, and the system was used only by the developers themselves, that was acceptable. Of course, we went into production with a fully fault-tolerant scheme: today each user file is stored in two copies on different servers in different data centers (DCs).

    By November 2012 the day came when we could no longer ignore the fact that not only the hierarchy of user directories mattered, but also the content stored in it.

    We wanted to make the uploader reliable and functional, yet reasonably simple, so we did not build elaborate links between the directory hierarchy and file bodies. We limited the uploader's scope to the following: accept the incoming file, compute its hash, and write it to two backends in parallel. We called this component Loader, and its concept has not changed much since. The first working prototype of Loader was written in two weeks, and a month later we switched to the new uploader completely.

    It could do everything we needed: receive files, compute the hash on the fly, and send them to storage without saving the file on the frontend's local hard drive.
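    A minimal sketch of that idea: stream the upload chunk by chunk, feeding the hash and both storage backends in a single pass, so the file never lands on the frontend's disk. This is not Loader's actual code, and where Loader writes to the two backends in parallel, a plain loop keeps the sketch short.

```python
import hashlib
import io

# Sketch only: one streaming pass updates the SHA-1 and writes both
# replicas; nothing is buffered on the frontend's local disk.
def stream_upload(chunks, backend_a, backend_b):
    sha1 = hashlib.sha1()
    for chunk in chunks:
        sha1.update(chunk)
        backend_a.write(chunk)   # copy 1, first DC
        backend_b.write(chunk)   # copy 2, second DC
    return sha1.hexdigest()

a, b = io.BytesIO(), io.BytesIO()
digest = stream_upload([b"hello ", b"world"], a, b)
```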

    Sometimes we get rave reviews that, for example, a gigabyte movie was uploaded to the Cloud instantly. This is thanks to the upload mechanism our client applications support: before uploading, the client computes the hash of the file and then asks Loader whether the Cloud already has this hash; if not, it uploads the file, and if yes, it just updates the metadata.
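    The client-side logic can be sketched like this. The "server" here is a plain dict standing in for the real storage and protocol; only the check-hash-then-maybe-upload flow matches the text.

```python
import hashlib

# Illustrative sketch of hash-based deduplication on the client side.
def upload(data: bytes, server_store: dict, metadata: list) -> bool:
    """Returns True if the file bytes were actually transferred."""
    digest = hashlib.sha1(data).hexdigest()
    if digest in server_store:     # "instant" upload: bytes already in the Cloud
        metadata.append(digest)    # only the metadata is updated
        return False
    server_store[digest] = data    # real transfer of the bytes
    metadata.append(digest)
    return True

store, meta = {}, []
first = upload(b"movie-bytes", store, meta)    # real transfer
second = upload(b"movie-bytes", store, meta)   # instant: metadata only
```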

    Zeppelin: Python vs C


    While the uploader was being written, we realized three things. First, one metadata server would not be enough for the whole audience we wanted to reach. Second, it would be good to keep the metadata in two copies as well. Third, none of us wanted to route a user to the right metadata server inside an Nginx module.

    The solution seemed obvious: write our own daemon in Python. Everyone does it, it is fashionable, popular and, most importantly, fast to develop. The daemon's task is to find out which metadata server a user lives on and redirect all requests there. To avoid doing user routing on the data servers, we wrote this layer on Twisted (an asynchronous framework for Python). However, it turned out that either we do not know how to cook Python, or Twisted is slow, or there were other unknown reasons, but this thing could not sustain more than a thousand requests per second.

    Then we decided that, as hardcore guys, we would rewrite it in C, using a framework that lets us write asynchronous code in a synchronous style (it resembles state threads). The first working version of the Zeppelin system (the chain of associations Cloud - airships - Zeppelin led us to this name) appeared on the test servers within a month. The results were quite as expected: the existing one and a half to two thousand requests per second were handled easily and naturally, at near-zero CPU load.

    Zeppelin's functionality grew over time, and today it is responsible for proxying requests to Metad, working with web links, reconciling data between Metad and filedb, and authorization.
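    Zeppelin itself is written in C, but the "asynchronous code in a synchronous style" idea can be illustrated with Python coroutines. The routing table and handlers below are invented stand-ins; only the shape of the code is the point.

```python
import asyncio

# Illustration only: a coroutine reads top to bottom like synchronous
# code, while the event loop interleaves many requests, which is the
# property the state-threads-like framework gives the real Zeppelin.
SHARDS = {"alice": "metad-1", "bob": "metad-2"}  # hypothetical routing table

async def lookup_shard(user):
    await asyncio.sleep(0)              # stands in for an async lookup
    return SHARDS[user]

async def forward(shard, request):
    await asyncio.sleep(0)              # stands in for the network round trip
    return f"{shard} handled {request}"

async def proxy(user, request):
    shard = await lookup_shard(user)    # looks sequential, never blocks the loop
    return await forward(shard, request)

async def main():
    return await asyncio.gather(proxy("alice", "ls /"),
                                proxy("bob", "mkdir /photos"))

replies = asyncio.run(main())
```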

    Thumbnail Generator


    In addition, we wanted to teach the Cloud to display thumbnails. We considered two options, storing thumbnails or generating them on the fly and caching them briefly in RAM, and settled on the second. The generator is written in Python and uses the GraphicsMagick library.

    To ensure maximum speed, thumbnails are generated on the machine where the file physically resides. Since there are many storage servers, the thumbnail-generation load is spread roughly evenly, and in our experience the generator is quite fast (except perhaps for those cases when a 16,000 x 16,000 pixel file is requested).
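    The "generate on the fly, cache in RAM" choice can be sketched as a small LRU cache keyed by file and size, with the actual rendering (GraphicsMagick in the real service) behind a pluggable function. All names here are illustrative.

```python
from collections import OrderedDict

# Sketch of the second option described above: nothing is stored on
# disk; thumbnails are produced on demand and kept briefly in RAM.
class ThumbnailCache:
    def __init__(self, render, capacity=1024):
        self.render = render            # e.g. a GraphicsMagick call in production
        self.capacity = capacity
        self.cache = OrderedDict()      # (file_id, size) -> thumbnail bytes

    def get(self, file_id, size):
        key = (file_id, size)
        if key in self.cache:
            self.cache.move_to_end(key)     # LRU bookkeeping
            return self.cache[key]
        thumb = self.render(file_id, size)
        self.cache[key] = thumb
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return thumb

calls = []
def fake_render(file_id, size):
    calls.append((file_id, size))
    return b"thumb:%s:%dx%d" % (file_id.encode(), size, size)

cache = ThumbnailCache(fake_render, capacity=2)
t1 = cache.get("cat.jpg", 128)
t2 = cache.get("cat.jpg", 128)   # served from RAM, no second render
```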

    The first stress phase: closed beta


    A serious test of our decisions came with the launch of closed beta testing. The beta launch went the way these things usually do: more people came than we expected, they created more traffic than we expected, and the disk load was higher than we expected. We fought these problems heroically and overcame many of them over the course of one and a half to two months. We quickly grew to one hundred thousand users, and in October we dropped the invites.

    Second stress phase: public release


    The second major test was the public release: the disk problems returned, and we solved them once again. Over time the excitement began to fade, but then the New Year came around, and we decided to give everyone a terabyte. The 20th of December turned out to be very lively as a result: the traffic exceeded all our expectations. At that point we had around 100 storage servers holding about 2400 drives, and the traffic to individual machines exceeded a gigabit. Total traffic exceeded 60 Gbit/s.

    Of course, we did not expect this. Based on the experience of colleagues in the market, we had assumed that about 10 TB would be uploaded per day. Now that the peak has subsided, about 100 TB is uploaded daily, and on the busiest days it reached 150 TB/day. In other words, every ten days we take in a petabyte of data.

    Our infrastructure



    So, at the moment we are using:

    ● A custom-written database. Why did we need to write our own database? Initially we wanted to run on a large number of cheap machines, so during development we fought for every byte of RAM and every CPU cycle. Doing that on existing implementations like MongoDB, which store JSON (even binary-packed), would have been naive. In addition, we needed specific operations, such as Unicode case folding and the implementation of file operations.

    This could have been done by extending an existing NoSQL database, but then the VFS logic would have to be pushed into an already sizeable codebase, or the server's functionality extended through built-in scripting languages, whose efficiency raised some doubts.

    In traditional SQL databases, storing tree structures is inconvenient and rather inefficient: nested sets would bring the database down under updates.

    We could store all paths as strings in Tarantool, but that is unattractive for two reasons. First, we obviously cannot fit all users into RAM, and Tarantool keeps its data there; second, storing the full path for every file creates a huge overhead.
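    A rough illustration of the second point, with invented paths: full-path keys repeat every ancestor directory name for every file, while a parent-pointer tree stores each name exactly once.

```python
# Toy comparison with invented paths: how many bytes of names each
# layout stores for the same three files in one deep directory.
paths = [
    "/photos/2013/summer/beach/img_0001.jpg",
    "/photos/2013/summer/beach/img_0002.jpg",
    "/photos/2013/summer/beach/img_0003.jpg",
]

# Full-path layout: every key carries all ancestor names.
full_path_bytes = sum(len(p.encode()) for p in paths)

# Tree layout: one entry per unique path component.
components = set()
for p in paths:
    parts = [x for x in p.split("/") if x]
    for i in range(len(parts)):
        components.add(tuple(parts[:i + 1]))
tree_bytes = sum(len(c[-1].encode()) for c in components)
```

    Even on this tiny example the full-path layout stores roughly twice as many name bytes, and the gap grows with directory depth and file count.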

    As a result, it turned out that all the existing solutions either required heavy modification, or were rather greedy with resources, or both. On top of that, we would have had to learn to administer them.

    That is why we decided to reinvent the wheel and write our own. It came out quite simple and does exactly what we need. Thanks to its architecture you can, for example, look at a user's metadata at a point in time in the past. File versioning can also be implemented on top of it.

    ● Tarantool key-value storage

    ● Nginx, used extensively (a wonderful Russian development, ideal until you have to write extensions for it)

    ● As a legacy component, the old notifier built on Mail.Ru's internal Imagine system is temporarily still in use

    ● More than three thousand pairs of disks. User files and metadata are stored in two copies on different machines, each of which has 24 disks of 2 or 4 TB (all new servers are provisioned with 4 TB drives).
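    The two-copies-in-different-DCs rule from the list above can be sketched as a simple placement check. The server inventory and names below are invented for illustration.

```python
import random

# Toy placement sketch for "two copies on different servers in
# different data centers". The inventory is invented.
SERVERS = [
    ("storage-01", "dc-a"), ("storage-02", "dc-a"),
    ("storage-03", "dc-b"), ("storage-04", "dc-b"),
]

def pick_replica_pair(servers):
    """Pick two servers guaranteed to sit in different DCs."""
    first = random.choice(servers)
    candidates = [s for s in servers if s[1] != first[1]]  # other DCs only
    second = random.choice(candidates)
    return first, second

a, b = pick_replica_pair(SERVERS)
```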

    Unexpected discoveries


    We write detailed logs for every request. A separate team writes an analyzer that builds various graphs for identifying and analyzing problems. The graphs are displayed in Graphite; in addition we use Radar and other systems.
    Exploring how people use our Cloud, we made some interesting discoveries. For example, the most popular file extension in the Cloud is .jpg. That is not surprising: people love their photos. More interestingly, the top ten popular extensions include .php and .html. We do not know what is stored there, codebases or backups from hosting, but the fact remains.

    We notice significant traffic spikes on the days when new mods appear for updates of popular online games: someone uploads the next add-on, and then an army of gamers downloads it from our Cloud all at once. It got to the point that on some days up to 30% of visits to us were made to download these updates.

    Questions and answers


    We have put together a list of questions you might want to ask.

    What about extending HTTP?
    We do not extend the HTTP protocol anywhere. When we started developing the Cloud, everything ran over open data channels without HTTPS (we went into production, of course, with full SSL encryption of all traffic between client and server). So whenever someone felt like putting something into a header, the threat of a wild proxy that strips all custom headers quickly discouraged them. As a result, everything we need is carried either in the URL or in the body.

    What is used as a file identifier?
    The file identifier for us is the file's SHA-1 hash.

    What will happen to user data if a whole server or a whole data center of the Cloud goes down?
    Everything is fully fault-tolerant: the failure of a server or even a whole DC has almost no effect on the user's work. All user data is duplicated on different servers in different DCs.

    What are your plans for the future?
    Our immediate development plans include launching WebDAV, further optimizing anything and everything, and many interesting features that you will learn about soon.

    If we missed something, we will try to answer in the comments.
