How we sped up search in Yandex.Mail and freed 25 servers along the way
We have already written about how letter search in Yandex.Mail is organized. A lot has changed and improved since then, so we decided to share our experience and describe these changes.
About 100M letters arrive in Yandex.Mail per day, 10M of them with attachments. Although only 10% of letters contain an attachment, a significant share of those carry more than one file. As a result, the total number of attachments is, on average, equal to the total number of letters.
The average size of a letter with an attachment is 400 KB, and of a letter without one, 4 KB. The total size of attachments in one letter can reach 30 MB. The top 10 attachment types are .jpg, .pdf, .xls, .rar, .doc, .zip, .eml, .mp3, .tif, and .docx. Almost all file formats other than plain text carry a significant amount of redundant service data. For example, a .docx file contains on average only 10% text, and from a .jpg we get only 0.25% of its size as meta-information for indexing.
This adds up to incoming traffic on the order of 25 TB per day, which is multiplied several times over inside the system to keep the large and complex Mail product running. To handle this load, Yandex.Mail has built a large network, server, and service infrastructure that includes several clusters distributed across different data centers.
Every incoming letter goes to the delivery system, a cluster of hundreds of servers. The delivery system tries to save the letter in the mail repository and in the meta-information repository, and to send it to search: three different places at once, each with its own job.
The mail repository is responsible for storing the full contents of each letter and returning them on request; for historical reasons it is called mulca. Yandex.Mail runs 700 mulca servers to store letters. The repository holds letter bodies, headers, and attachments: in short, everything related to the letter.
The meta-information repository is used to display the inbox quickly and contains only the descriptive part of the letter: for example, the “From”, “To”, and “Subject” fields, the name of the folder the letter is currently in, its current label, the date it was written, and so on. The meta-information cluster occupies 60 servers.
The search repository is a search index containing all the information from the letter that is needed for fast full-text search in a mailbox, with morphology taken into account. The search service, which handles both indexing and search, runs on 140 servers.
A letter is considered delivered once it is in both the mail repository and the meta-information repository. Delivery to search happens after the letter has been stored. A separate Services cluster of 25 servers is dedicated to delivering letters to search. It hosts the queues of letters awaiting indexing and the programs that prepare data for indexing.
So an incoming letter travels a long way: first it is placed in the repository, then it reaches the Services cluster, where it is prepared and sent to search.
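The flow described above can be illustrated with a small toy model. This is only a sketch: the class, its method names, and the in-memory dictionaries are illustrative assumptions, not the actual Yandex.Mail components, and the "indexing" here is just lowercasing and splitting the body.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MailSystem:
    """Toy model of the delivery flow; every name here is hypothetical."""

    storage: dict = field(default_factory=dict)         # full letters (the mail repository's role)
    metadata: dict = field(default_factory=dict)        # From/To/Subject and similar fields
    index_queue: deque = field(default_factory=deque)   # queue held on the Services cluster
    search_index: dict = field(default_factory=dict)    # stand-in for the search repository

    def deliver(self, letter_id: str, headers: dict, body: str) -> bool:
        # A letter counts as delivered once both repositories hold it.
        self.storage[letter_id] = {"headers": headers, "body": body}
        self.metadata[letter_id] = dict(headers)
        # Indexing happens later, so the letter only joins a queue here.
        self.index_queue.append(letter_id)
        return letter_id in self.storage and letter_id in self.metadata

    def index_pending(self) -> int:
        """Drain the queue, prepare each letter, and hand it to search."""
        indexed = 0
        while self.index_queue:
            letter_id = self.index_queue.popleft()
            letter = self.storage[letter_id]
            self.search_index[letter_id] = letter["body"].lower().split()
            indexed += 1
        return indexed
```

The point of the sketch is the ordering: deliver() returns as soon as the two repositories are written, while the letter becomes searchable only after a later index_pending() pass.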
But letter search covers not only the body of a letter but also the contents of its attachments. To support this, attached files must be pre-processed: the text is extracted and sent to search in text form.
Several years ago we launched search over attachment contents without changing the architecture described above. This caused a number of problems. First, some file types (especially .pdf) took a long time to process, up to several minutes, which slowed down the delivery of new letters to search. Second, the indexing program and the conversion program constantly competed for resources on the Services cluster, which further delayed the appearance of new letters in search. Third, once we started sending each letter in full, with attachments, to the Services cluster, we effectively doubled mail traffic inside Yandex networks. Intranet traffic grew by 25 TB per day, a load on resources that personal services and Yandex as a whole rely on: servers, network infrastructure, and overall network capacity.
So we had to fight for quality of service. We could not let letters reach search minutes late, especially since, before attachment search was launched, 95% of all incoming letters reached search in under a second.
The idea was to deliver letters in full only to the repository and to send search only structured text, taken directly from the repository. Preliminary studies showed that the text needed for search is on average 10 times smaller than the letter itself. So if only text is sent to search, search consumes 10 times less network traffic, which saves about 22 TB per day. For that kind of traffic saving, the effort looked well worth it.
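The saving quoted above follows from simple arithmetic, sketched here with the article's own figures (25 TB per day and a tenfold size reduction). The function and constant names are illustrative.

```python
TOTAL_TRAFFIC_TB_PER_DAY = 25   # daily mail traffic quoted in the article
TEXT_SHRINK_FACTOR = 10         # searchable text is about 10x smaller than the letter


def search_traffic_saving(total_tb: float, shrink_factor: float) -> float:
    """Return the daily traffic saved when only extracted text goes to search."""
    text_only_tb = total_tb / shrink_factor
    return total_tb - text_only_tb


saving = search_traffic_saving(TOTAL_TRAFFIC_TB_PER_DAY, TEXT_SHRINK_FACTOR)
print(f"{saving:.1f} TB/day saved")  # prints "22.5 TB/day saved"
```

Sending 2.5 TB of text instead of 25 TB of full letters leaves 22.5 TB per day off the network, which matches the "about 22 TB" estimate.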
It was also tempting to use the storage servers for preprocessing files for search, since their combined performance is two orders of magnitude greater than that of the small Services cluster. That would let us speed things up. So that is what we did.
To make this work and fetch the contents of letters and attachments directly from the repository, we had to place a program that extracts attachment contents on the repository's servers. The program is based on the Apache Tika library, so, for simplicity and because it sounds natural in Russian, its developer named it Tikaite. When deploying Tikaite on the mulca servers, it was important not to harm letter storage, so we studied the load in detail and saw that the storage servers were loaded in terms of disk space but had enough spare CPU capacity to use.
We placed the text-extraction program in the repository under strict limits: on each server it was given 50% of CPU capacity and 1 GB of RAM. These limits let us run conversion in the repository without interfering with storage.
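The article does not say how these limits were enforced. As a rough illustration only, here is one way such a budget could be approximated on a Unix host: cap the number of single-threaded extraction workers at half the cores and split the 1 GB allowance between them as an address-space limit. Every name below is hypothetical, the command passed to run_capped is a placeholder, and the single-threaded-worker assumption is mine.

```python
import resource
import subprocess
from typing import Sequence, Tuple

RAM_LIMIT_BYTES = 1 * 1024 ** 3  # the 1 GB per-server cap from the article
CPU_SHARE = 0.5                  # the 50% CPU cap from the article


def plan_limits(cpu_cores: int) -> Tuple[int, int]:
    """Return (worker count, memory bytes per worker) for one server.

    Assumes each worker is single-threaded, so N workers use at most N cores.
    """
    workers = max(1, int(cpu_cores * CPU_SHARE))
    return workers, RAM_LIMIT_BYTES // workers


def run_capped(command: Sequence[str], memory_bytes: int,
               timeout_s: float) -> subprocess.CompletedProcess:
    """Run one extraction command under an address-space cap and a time limit."""

    def cap_memory() -> None:
        # Runs in the child process just before exec (Unix only).
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))

    return subprocess.run(
        list(command),
        preexec_fn=cap_memory,
        capture_output=True,
        timeout=timeout_s,
        check=False,
    )
```

On an 8-core server plan_limits(8) yields 4 workers with 256 MiB each. A production setup would more likely rely on an operating-system mechanism such as control groups, which can cap CPU share directly; that is not shown here.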
As a result, we cut the unnecessary intranet traffic and increased the throughput of the system that delivers letters to search by two orders of magnitude: once again, 95% of letters reach search within a second. As a bonus, we freed the 25 servers of the cluster that used to deliver letters to search. We had originally planned to double the Services cluster to 50 servers to cope with the ever-growing flow; now that all the data search needs is prepared directly in the mail repository, an entire 25-server cluster has become free. In the near future we will be able to use it for other tasks, which we will certainly write about.
P.S. Mail search also periodically requires reindexing all the letters accumulated over the lifetime of Yandex.Mail. This happens when the search algorithm has to change and the search index does not hold enough data for that. In that case we effectively reprocess the entire body of letters stored in Mail, which is now about 10 petabytes. Here the savings in network traffic and performance are simply huge.
P.P.S. Despite these results, we do not plan to stop here. We will aim to deliver letters to search at the same time as they are placed in storage, and we will use the freed cluster for that. Look out for news about this in future posts on search in Yandex.Mail.