Avito Picture Storage History
But what if you are given the task of organizing the storage and distribution of static files? Surely many will think that everything is simple. And if there are a billion such files, several hundred terabytes and several billion requests per day? Also, many different systems will send files of different formats and sizes for storage. This quest no longer seems so simple. Under the cut is the story of how we solved such a problem, what difficulties arose in this, and how we overcame them.
Avito developed rapidly from the first days. For example, the download speed of new pictures for ads increased several times in the early years. This required us at the initial stage to resolve issues related to architecture as quickly and efficiently as possible, in conditions of limited resources. In addition, we always preferred simple solutions that require few support resources. The KISS principle (“Keep it short and simple”) is still one of the values of our company.
The question of how to store and how to give pictures of ads arose immediately, since the ability to add a photo to an ad is certainly a key one for users - customers want to see what they are buying.
At that time, Avito fit on less than 10 servers. The first and fastest solution was to store image files in a directory tree on one server and synchronize over the crown to the backup.
The path to the file was determined based on the unique digital identifier of the pictures. Initially, there were two nesting levels of 100 directories on each.
Over time, the server space began to come to an end, and something had to be done with it. This time we chose to use a network attached to multiple servers. Prior to this, the network storage in test mode was provided by the data center itself as a service, and according to our test measurements at that time it worked satisfactorily. But as the load grew, it began to slow down. Storage optimization by the data center did not help radically and was not always operational. Our ability to influence how we optimize storage was limited and we could not be sure that the possibilities of such optimizations would not be exhausted sooner or later. The server park by that time had noticeably increased and began to number in the tens. An attempt was made to quickly raise distributed glusterfs. However, it worked satisfactorily only under light load. As soon as they were brought to full capacity, the cluster “fell apart”, or all requests to it “hung up”. Attempts to tune him were unsuccessful
We realized that we need something else. The requirements for the new storage system were identified:
- as simple as possible
- built from simple and proven components
- low support resources
- maximum possible and simple control over what is happening
- good scalability
- very fast deadlines for implementation and migration from the current solution
- the inaccessibility of some images during server repair is acceptable (the servers were repaired by us more or less promptly)
- loss of a small part of pictures is allowed in the event of the complete destruction of one of the servers. I must say that for the entire existence of Avito, this has never happened. 3 * Ugh.
For simplicity, the image shows only four nodes on two servers.
During the discussion, we came to such a scheme. Files will be stored in the same directory tree, but top-level directories are distributed across a group of servers. I must say that by that time we had accumulated servers with the same disk configuration, the disks on which were “idle”. Files are given all the same nginx. On each server, IPs are raised that correspond to a specific top-level directory located on the server. At that time, we did not think about balancing traffic, since the data center did it for us. The balancing logic was that, depending on the domain to which the request came (there were 100 domains in total), send the request to the internal IP, which was already on the right server.
The question arose of how the site code will upload images to the repository, if it is also distributed across different servers. It was logical to use the same http protocol by which the file came to us from the user. We started looking for what we can use as a service that will accept files on the server side, which stores files. The eye fell on the open nginx module for the file upload. However, during his study, it turned out that the logic of his work does not fit into our scheme. But this is open source, and we have experience in programming in C. For a short time, in between other tasks, the module was finalized and now, working as part of nginx, it took files and saved them in the right directory. Looking ahead, I’ll say that during the production process a memory leak was detected,
Loads are growing
Over time, we began to use this storage not only for pictures, but also for other static files, flexibly adjusting with nginx parameters that affect performance and access to files.
As the number of requests and the volume of data increased, we encountered (and sometimes foresaw in advance) a number of problems that, under high loads, required quick and effective solutions.
One of these problems was the balancing that the data center provided us with on the F5 Viprion load balancer. We decided to remove it from the traffic processing path by allocating 100 external IPs (one for each node). So we removed the bottleneck, accelerated data delivery and increased reliability.
The number of files in one directory was increasing, which led to a slowdown of components due to the increased reading time of the contents of directories. In response, we added another level of 100 directories. We got 100 ^ 3 = 1M directories for each storage node and increased overall speed.
We experimented a lot with optimizing nginx and disk cache settings. In the picture that we observed, we had the impression that the disk cache does not give full return on caching, and caching with nginx in tmpfs works better, but clearing its cache noticeably loads the system at peak hours. First of all, we included the logging of nginx requests into a file and wrote our daemon, which read this file late at night, cleaned it, detected the most relevant files, and cleared the rest from the cache. Thus, we limited the cleaning of the cache to a night period when the load on the system is not large. This worked quite successfully for some period, up to a certain level of load. Building statistics and clearing the cache no longer fit into the night interval, in addition, it was clear that disk space would soon come to an end.
We decided to organize the second level of data storage, similar to the first, but with some differences:
- 50 times the size of the disk subsystem on each server; however, the disk speed may be less.
- fewer servers
- external access to files is possible only through proxying through the first level
This gave us the opportunity:
- configure caching options more flexibly
- use cheaper equipment to store the bulk of the data
- even easier to scale the system in the future
However, this required some complication of the system configuration and code for uploading new files to the storage system.
The diagram below shows how two storage nodes (00 and 01) can be located on two storage levels using one server on the first level and one on the second. It is clear that you can place a different number of nodes on the server, and the number of servers at each storage level can be from one to one hundred. All nodes and servers in the diagram are not shown for simplicity.
For simplicity, the image shows only two nodes on two servers
What did we get as a result? A system for storing static files, which a mid-level specialist can easily figure out, built on reliable, proven open elements, which can be replaced or modified at a small price if necessary. Moreover, the system can give users dozens of PB data per day and store hundreds of terabytes of data.
Cons too. For example, the lack of data replication, complete protection against equipment failure. And although initially, when designing the system, it was immediately determined and agreed in accordance with the risk assessment, we have a set of additional tools that allow you to level these minuses (backup system, scripts for testing, recovery, synchronization, moving, etc.)
In future plans, add the ability to flexibly configure replication for part of the files, most likely through some delayed queue.
I deliberately did not delve into the details of the implementation, the logic of the work (nothing was said about at all), the nuances of the settings, so as not to drag out the article and convey the main idea: if you want to build a good strong system, one of the right ways could be to use open, proven products connected in a relatively simple and reliable scheme. Do it normally, it will be normal!