How virus protection came to Mail.Ru Cloud



    Hello everyone, my name is Yuri Lazarev, and I am a system administrator of Mail.Ru Cloud. We recently implemented automatic anti-virus scanning of all files uploaded to the storage. All content is now scanned by Kaspersky Anti-Virus, whose products are already used to protect against viruses in Mail.Ru Mail. In addition, the files uploaded to the Cloud since its launch last year have been scanned as well. Implementing such scanning in a highly loaded service while keeping the same high speed is a rather difficult task.

    As an analogy, compare building a one-story house with building a skyscraper. A one-story house can be built even by a person without deep knowledge or much experience, and the structure will at least stand and serve its purpose. A skyscraper is far more complicated: its design must be carefully calculated for the bearing capacity of the soil, wind loads and many other factors. In the same way, anti-virus scanning in a cloud service is organized quite differently than on home computers or even on corporate networks.

    If you want to know more about the architecture of the Cloud, you can read the previous article on Habr. It explains how a file is uploaded to the Cloud and saved there. Here we describe how we manage to check petabytes of data for viruses in our highly loaded system without sacrificing either the quality of the service or the speed of uploading and checking files.

    New files

    The Cloud uses deduplication: a file with specific content is stored in only one instance (in fact, in two, since there is a backup copy). If 200 people upload the same file, the storage will not hold 200 identical files. All these users are simply given references to the same stored file. Why? First, it lets us use disk space more efficiently and, as a result, offer users more free storage. Second, we save computing resources on scanning, because each unique file has to be checked only once. Deduplication currently reduces the storage load by about 15%.
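
    The idea can be illustrated with a minimal sketch of a content-addressed store. It is not the Cloud's actual implementation: the `DedupStore` class, its method names and the choice of SHA-256 as the content key are all assumptions made for this example.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: one blob per unique content hash.

    Hypothetical illustration only; names and hash choice are assumptions.
    """

    def __init__(self):
        self.blobs = {}       # content hash -> file bytes (stored once)
        self.user_files = {}  # (user, path) -> content hash

    def upload(self, user, path, data):
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            # Only previously unseen content takes up space (and needs a scan).
            self.blobs[digest] = data
        self.user_files[(user, path)] = digest
        return digest, is_new

    def download(self, user, path):
        return self.blobs[self.user_files[(user, path)]]


store = DedupStore()
payload = b"same bytes" * 1000
results = [store.upload(f"user{i}", "/report.pdf", payload) for i in range(200)]
new_blobs = sum(1 for _, is_new in results if is_new)
```

    In this sketch, 200 uploads of identical content leave a single blob in `store.blobs`, and only the first upload is flagged as new, so only that one would need to be queued for scanning.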

    Each file is scanned several times: as soon as it enters the Cloud, and again later with updated anti-virus databases. After all, there is always a chance that a file is infected with a new virus that the antivirus did not yet know about at the time of upload. So checks are carried out on an ongoing basis. If a file is infected, the service will neither allow it to be downloaded nor create a link to it.

    We scan files on separate servers dedicated exclusively to this task. We also wrote a utility that scans files through the Kaspersky API. The point is that you cannot simply install the boxed version of the antivirus on some server and tell it to check all the files: in that case you can forget about high performance altogether. An antivirus product is not a tool designed specifically for cloud systems; it has to be integrated. And the scanning process itself must be tightly managed in a highly loaded system. Our utility took on this role. It not only determines the order in which files are checked, but also optimizes the load.

    Put simply, there is no need to download the entire file from the storage and submit it for scanning. The utility downloads only the beginning of the file from the storage. Kaspersky then analyzes the file type. As a rule, it makes no sense to check the entire body of the file, so the anti-virus SDK chooses a scan strategy depending on the file type. The SDK then sends requests to our utility, along the lines of "give me this piece", and only the requested data is downloaded from the storage. When the SDK decides that the file has been checked, the file receives a mark recording the fact of the scan, the time it was performed and the version of the anti-virus database. As a result, the management utility significantly reduces the scan time and the load on the network and on the drives themselves.
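
    The on-demand reading described above can be sketched roughly as follows. This is a toy model under stated assumptions, not the Kaspersky SDK and not our utility: `RangeReader`, `toy_scan`, the `b"PK"` and `b"EVIL"` markers and the 16-byte region sizes are all invented for illustration.

```python
import time


class RangeReader:
    """Serves byte ranges of a stored file and counts the bytes fetched."""

    def __init__(self, blob):
        self._blob = blob
        self.size = len(blob)
        self.bytes_fetched = 0

    def read(self, offset, length):
        chunk = self._blob[offset:offset + length]
        self.bytes_fetched += len(chunk)
        return chunk


def toy_scan(reader, db_version, header_size=16, tail_size=16):
    """Stand-in for a scanning engine (hypothetical logic, not a real SDK).

    Reads the header first, picks a strategy from the detected type,
    then requests only the byte ranges that strategy needs.
    """
    header = reader.read(0, header_size)
    regions = [header]
    if header.startswith(b"PK"):
        # Toy strategy for archive-like files: also inspect the tail.
        tail_offset = max(reader.size - tail_size, header_size)
        regions.append(reader.read(tail_offset, tail_size))
    infected = any(b"EVIL" in region for region in regions)
    # The "mark" kept with the file: verdict, scan time, database version.
    return {
        "infected": infected,
        "scanned_at": time.time(),
        "db_version": db_version,
        "bytes_fetched": reader.bytes_fetched,
    }


video = RangeReader(b"\x00\x00\x00\x18ftypmp42" + b"\x00" * 1_000_000)
video_mark = toy_scan(video, db_version="example-db-1")

archive = RangeReader(b"PK\x03\x04" + b"\x00" * 5000 + b"xxxxEVILxxxxxxxx")
archive_mark = toy_scan(archive, db_version="example-db-1")
```

    In this toy run, the megabyte-sized "video" is judged after fetching only its 16-byte header, while the "archive" costs two small reads. A real engine's strategies are of course far more involved, but the saving comes from the same place: the scanner pulls ranges on request instead of receiving the whole file.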

    At the moment we have more than 20,000 disks holding the files of Cloud users, and scanning never stops. All kinds of data end up in the storage, including huge video files. Pulling them out of storage in full and moving them over the network would be an extremely wasteful use of resources. Thanks to the mechanism described above, however, we were able to run anti-virus scanning on just a few dozen servers. About 8 million files, or about 50 terabytes, are now checked per day. This is far from the peak performance of the system, and we have also left room for further scaling.

    Check queue

    So, we reduce the cost of storage and the load on it through deduplication, and we significantly speed up anti-virus scanning with the management utility. But this alone would not be enough to process such a large amount of data quickly. Therefore, we use one more tool: the scan queue. It is not just a list of files with new entries appended at the end; it is a separate service. The queue lives on a separate server and runs on the Tarantool DBMS. Tarantool is an in-house development of Mail.Ru Group, and one of its features is very high performance. That, and not its origin, was the deciding factor in choosing a DBMS. New files uploaded to the Cloud enter the queue first; the upload service puts them there. Files with the longest time elapsed since their last scan are also added to the queue. The second service, which replenishes the queue with such files, has a limit on the maximum number of old files it may add, so that it does not slow down the checking of new ones. Each scanning server runs several processing services at once. We are now trying to distribute files of different types and sizes across different queues in order to further reduce the scan time for most uploaded files.
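
    The behavior of such a queue can be sketched in a few lines. This is an in-memory model built on Python's `heapq`, not the Tarantool-based service itself. The `ScanQueue` class, its method names and the strict "new files first" ordering are assumptions made for the example; the text above only states that rescans are capped.

```python
import heapq
import itertools

NEW, RESCAN = 0, 1  # lower value is served first


class ScanQueue:
    """Toy scan queue: new uploads plus a capped number of rescans."""

    def __init__(self, max_rescans):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker: FIFO within a class
        self._pending_rescans = 0
        self.max_rescans = max_rescans

    def push_new(self, file_id):
        heapq.heappush(self._heap, (NEW, next(self._seq), file_id))

    def refill_rescans(self, last_scanned):
        """Queue the files scanned longest ago, never exceeding the cap.

        last_scanned maps file id -> timestamp of its last scan.
        Returns the number of files added.
        """
        queued = {file_id for _, _, file_id in self._heap}
        added = 0
        for file_id, _ts in sorted(last_scanned.items(), key=lambda kv: kv[1]):
            if self._pending_rescans >= self.max_rescans:
                break
            if file_id in queued:
                continue
            heapq.heappush(self._heap, (RESCAN, next(self._seq), file_id))
            self._pending_rescans += 1
            added += 1
        return added

    def pop(self):
        kind, _, file_id = heapq.heappop(self._heap)
        if kind == RESCAN:
            self._pending_rescans -= 1
        return file_id


queue = ScanQueue(max_rescans=2)
added = queue.refill_rescans({"old-a": 100, "old-b": 50, "old-c": 200})
queue.push_new("new-1")
queue.push_new("new-2")
order = [queue.pop() for _ in range(4)]
```

    Here only the two stalest files are admitted for rescanning, and the new uploads are served ahead of them even though they were queued later. The cap is what keeps a large backlog of old files from crowding out fresh uploads.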



    Old files

    Because automatic scanning was introduced some time after the launch of Mail.Ru Cloud, about 14 petabytes of data had accumulated that still needed to be checked. Naturally, they were not sitting on a single machine but were spread across several data centers. The situation was complicated by the fact that all these servers are in active use, which means they could not be loaded with file-checking tasks. If a server that stores files is busy with analytical tasks that consume the storage hardware's resources, all of its operations can slow down significantly, including the transfer of files over the network. In that case, the quality of service would degrade.

    Checking such a volume of data gradually would not have been practical either, as it would have taken too long. Therefore, we decided to bring in additional resources in order to complete the check as soon as possible. For this purpose, a temporary cluster of 60 servers was assembled. It checked all previously uploaded data in about three weeks.
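
    A quick back-of-the-envelope calculation puts these figures next to the steady-state numbers quoted earlier (about 8 million files and 50 terabytes per day). It assumes decimal units and exactly 21 days, and it measures the volume of data covered by the check rather than bytes actually moved, since the scanner reads only parts of each file.

```python
PB = 10**15
TB = 10**12
MB = 10**6

# One-off backlog check: 14 PB, 60 servers, about three weeks.
backlog_bytes = 14 * PB
servers = 60
days = 21  # assumption: "about three weeks" taken as exactly 21 days

backlog_per_day = backlog_bytes / days
per_server_per_second = backlog_bytes / (servers * days * 86_400)

# Steady state: about 50 TB and 8 million files per day.
steady_bytes_per_day = 50 * TB
steady_files_per_day = 8_000_000
avg_file_size = steady_bytes_per_day / steady_files_per_day

print(f"backlog coverage: {backlog_per_day / TB:.0f} TB/day")
print(f"per server:       {per_server_per_second / MB:.0f} MB/s")
print(f"average file:     {avg_file_size / MB:.2f} MB")
print(f"backlog vs daily: {backlog_per_day / steady_bytes_per_day:.1f}x")
```

    Under these assumptions the temporary cluster covered roughly 667 TB per day, or about 129 MB/s per server, which is around 13 times the current daily volume.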

    We also calculated which blocked malware is the most common:



    Conclusions and Future Plans

    So, by combining the management utility, the scan queue and the Tarantool DBMS, we achieved high-performance anti-virus scanning, almost in real time, with relatively modest resources.
    The trend is that more and more user information will be stored in the cloud over time. Anti-virus scanning is therefore becoming an integral part not only of user devices, but also of the online services where user data is stored.

    We still plan to upgrade the scanning mechanism in our Cloud significantly. For example, we plan to introduce a weight for each file. Thanks to this parameter, the files that users request most will be checked most often. Large files will go into a separate queue, because scanning them takes a lot of resources and time. Organizing such queues for different kinds of files is an interesting task, which we will discuss in one of our next posts.
