How Yandex taught artificial intelligence to understand the meaning of documents

    Today we will talk about the new Korolev search technology, which includes not only a deeper use of neural networks to search by meaning rather than by words, but also significant changes in the architecture of the index itself.



    But why do we need technologies from the field of artificial intelligence at all, if twenty years ago we found what we were looking for just fine? How does Korolev differ from last year's Palekh algorithm, which also used neural networks? And how does the architecture of the index affect the quality of ranking? Especially for readers of Habr, we will answer all of these questions. And we will start from the beginning.

    From word frequency to neural networks

    At the dawn of its existence, the Internet was very different from its current state, and not only in the number of users and webmasters. Above all, there were so few sites on any given topic that it was enough for the first search services to simply list all the pages containing the query words. And even when there were many sites, counting how often a word appeared in the text was sufficient; there was no need for complex ranking. There was almost no business on the Internet, so no one was trying to manipulate search results.

    Over time, the number of sites, as well as of those wishing to manipulate search results, grew noticeably. Search companies were faced with the need not only to find pages, but also to choose among them the ones most relevant to the user's query. At the turn of the century, technology did not yet allow machines to “understand” the texts of pages and compare them with the interests of users, so a simpler solution was found first. Search began to take into account the links between sites: the more links, the more authoritative the resource. When that ceased to be enough, search began to take into account the behavior of people. And it is precisely the users of Search who now largely determine its quality.

    At some point, so many of these factors had accumulated that a human could no longer cope with writing the ranking formula. Of course, we could still have taken our best developers and they would have written a more or less working search algorithm, but the machine did it better. So in 2009 Yandex introduced its own machine learning method, Matrixnet, which to this day builds the ranking formula taking into account all available factors. For a long time we dreamed of adding a factor that would reflect the relevance of a page not through indirect features (links, behavior, ...), but through an “understanding” of its content. With the help of neural networks, we succeeded.

    At the very beginning we mentioned a factor that takes into account the frequency of words in the text of a document. This is an extremely primitive way of determining whether a page matches a query. Modern computing power makes it possible to use neural networks, which cope with the analysis of natural information (text, sound, images) better than any other machine learning method. Simply put, it is neural networks that allow a machine to move from searching by words to searching by meaning. And this is exactly what we started doing with the Palekh algorithm last year.

    Query + title

    More details about Palekh are written here, but in this post we will briefly recall the approach, because it is Palekh that underlies Korolev.

    We have a person's query and a page title that is a candidate for the top of the results. We need to understand how well they correspond to each other in meaning. To do this, we represent the query text and the title text as vectors whose scalar product is larger the more relevant the document with this title is to the query. In other words, we use accumulated search statistics to train the neural network in such a way that it generates similar vectors for texts that are close in meaning, while for semantically unrelated queries and titles the vectors should differ.


    As soon as a person enters a query into Yandex, our servers convert the texts into vectors in real time and compare them. The result of this comparison is used by the search engine as one of the ranking factors. By representing the text of the query and the text of the page title as semantic vectors, the Palekh model can catch fairly complex semantic links that are otherwise difficult to identify, which in turn improves the quality of the search.
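
    To make this concrete, here is a minimal sketch (illustrative only, not Yandex's actual model or code): two projection matrices turn bag-of-words representations of the query and of the title into dense vectors, and their scalar product serves as the relevance signal. The vocabulary size, vector size and hashing trick are assumptions made for the example.

```python
import numpy as np

DIM_IN, DIM_OUT = 1000, 64          # assumed sizes, for illustration only
rng = np.random.default_rng(0)

# Toy "trained" projection matrices for the query tower and the title tower.
W_query = rng.normal(scale=0.01, size=(DIM_IN, DIM_OUT))
W_title = rng.normal(scale=0.01, size=(DIM_IN, DIM_OUT))

def bag_of_words(text: str) -> np.ndarray:
    """Hash words into a bag-of-words vector (hashing trick)."""
    v = np.zeros(DIM_IN)
    for word in text.lower().split():
        v[hash(word) % DIM_IN] += 1.0
    return v

def embed(text: str, W: np.ndarray) -> np.ndarray:
    """Project a text into the semantic vector space and L2-normalise it."""
    v = bag_of_words(text) @ W
    return v / (np.linalg.norm(v) + 1e-9)

def relevance(query: str, title: str) -> float:
    """Scalar product of the two semantic vectors: one ranking factor."""
    return float(embed(query, W_query) @ embed(title, W_title))

print(relevance("lazy cat from mongolia", "Manul, the wild cat"))
```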

    Palekh was good, but it had great unrealized potential. To understand it, we first need to recall how the ranking process works.

    Stages of ranking

    Search is an incredibly complicated thing: in a split second it has to find, among millions of pages, the ones most relevant to the query. That is why ranking in modern search engines is usually carried out as a cascade of rankers. In other words, the search engine works in several stages: at each stage documents are sorted, the lowest-ranked documents are discarded, and the top documents are passed on to the next stage. At each subsequent stage, increasingly heavy ranking algorithms are applied. This is done primarily to save resources in the search cluster: computationally heavy factors and formulas are calculated only for a relatively small number of the best documents.
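
    Schematically, a cascade might look like the sketch below (the scoring functions and cut-off sizes are invented for illustration and are not Yandex's real configuration):

```python
from typing import Callable, List, Tuple

def cascade(docs: List[str], stages: List[Tuple[Callable[[str], float], int]]) -> List[str]:
    """Score the surviving documents with each stage and keep only the top_k."""
    for score, top_k in stages:
        docs = sorted(docs, key=score, reverse=True)[:top_k]
    return docs

# Illustrative stages: cheap filtering first, heavier scoring for fewer documents later.
stages = [
    (lambda d: len(set(d.split())), 10_000),  # L0: crude filter over many documents
    (lambda d: d.count("cat"),       1_000),  # L1: lightweight text features
    (lambda d: d.count("manul"),       150),  # L3: the heaviest (e.g. neural) scoring
]

docs = ["manul is a wild cat", "cat video", "weather forecast"] * 100
print(cascade(docs, stages)[:3])
```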



    Palekh is a relatively heavy algorithm. To obtain the query and document vectors we need to multiply several matrices, and then multiply the resulting vectors as well. Matrix multiplication consumes valuable processor time, and we cannot afford to perform this operation for too many documents. Therefore, in Palekh we applied our neural models only at the very last stage of ranking (L3), to roughly the 150 best documents. On the one hand, this is not bad: in most cases, all the documents that need to be shown in the top ten are somewhere among these 150, and they just need to be sorted correctly. On the other hand, good documents are sometimes lost in the early stages of ranking and never reach the top, especially for complex and low-frequency queries. So it was very tempting to learn how to apply the power of neural network models to as many documents as possible. But how?

    Korolev: trading computation for memory

    If you cannot make a complex algorithm simpler, you can at least redistribute its resource consumption. In this case, we can profitably trade processor time for memory. Instead of taking the title of a document and computing its semantic vector at query time, we can precompute this vector and store it in the search database. In other words, we can do a significant part of the work in advance: multiply the matrices for the document and save the result. Then, at query time, we only need to fetch the document vector from the search index and compute its scalar product with the query vector. This is significantly faster than computing the vector on the fly. Of course, we then need somewhere to store the precomputed vectors.
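
    In code, the exchange looks roughly like this (a toy sketch, not Yandex's implementation; the hashing trick, sizes and the single projection matrix are assumptions): the matrix work for each document happens once, offline, and at query time we only embed the query and take scalar products with the stored vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(1000, 64))   # toy projection, stands in for the trained model

def embed(text: str) -> np.ndarray:
    v = np.zeros(1000)
    for word in text.lower().split():
        v[hash(word) % 1000] += 1.0
    v = v @ W                                  # the expensive matrix multiplication
    return v / (np.linalg.norm(v) + 1e-9)

# Offline, at indexing time: the heavy work is done once per document.
index = {doc_id: embed(title) for doc_id, title in {
    "doc1": "Manul, the wild cat",
    "doc2": "VKontakte social network: login",
}.items()}

# Online, at query time: one projection for the query plus cheap scalar products.
def score_all(query: str) -> dict:
    q = embed(query)
    return {doc_id: float(q @ vec) for doc_id, vec in index.items()}

print(score_all("lazy cat from mongolia"))
```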



    The approach based on precomputed vectors allowed us to radically increase the depth of the top (L3, L2, L1) to which neural models are applied. The new Korolev models are computed at a fantastic depth of 200 thousand documents per query. This made it possible to obtain an extremely useful signal at the early stages of ranking.

    But that is not all. The successful experience of precomputing vectors and storing them in memory cleared the path to a new model, one we could previously only dream of.

    Korolev: query + document

    In Palekh, only the page title was fed into the model. A title is usually an important part of a document that briefly describes its contents. Nevertheless, the body of the page also contains information that is extremely useful for determining the semantic correspondence between a document and a query. So why did we initially limit ourselves to the title? The fact is that, in practice, implementing full-text models involves a number of technical difficulties.

    First, it is expensive in terms of memory. To apply a neural model to a text at query time, you must have that text “at hand”, that is, in RAM. Keeping short texts like titles in RAM was quite realistic with the capacity at our disposal, but doing the same with the full texts of documents was not.

    Second, it is expensive in terms of CPU. The first stage of evaluating the model is projecting the document onto the first hidden layer of the neural network, which requires one pass through the text. At this stage we must perform roughly n * m multiplications, where n is the number of words in the document and m is the size of the first layer of the model. Thus, the amount of processor time required to apply the model depends linearly on the length of the text. This is not a problem for short titles, but the average length of a document body is much larger.
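
    As a back-of-the-envelope illustration of that linear dependence (the word counts and layer size below are assumptions, not measurements):

```python
def first_layer_cost(n_words: int, m_neurons: int) -> int:
    """Multiply-accumulate operations to project a text onto the first hidden layer."""
    return n_words * m_neurons

print(first_layer_cost(10, 500))      # a short title: ~5,000 operations
print(first_layer_cost(2_000, 500))   # a full document body: ~1,000,000 operations
```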

    All this sounds as if it were impossible to implement a full-text model without a radical increase in the size of the search cluster. But we managed without one.

    The key to solving the problem was the same precomputed vectors that we had already tried out for the title model. In fact, we do not need the full text of the document at query time; it is enough to store a relatively small array of floating-point numbers. At indexing time, we can take the full text of the document, apply a series of operations to it, sequentially multiplying several matrices, and obtain the weights of the last hidden layer of our neural model. The size of that layer is fixed and does not depend on the size of the document. Moreover, this redistribution of load from processors to memory allowed us to take a fresh look at the architecture of the neural network.
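
    A sketch of what such an indexing-time pipeline could look like (layer sizes borrowed from the next section, everything else invented for illustration): the full text is pushed through the hidden layers once, and only the small fixed-size last-layer vector is stored.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, L1, L2, L3 = 10_000, 500, 500, 40        # assumed sizes, for illustration
W1 = rng.normal(scale=0.01, size=(VOCAB, L1))
W2 = rng.normal(scale=0.01, size=(L1, L2))
W3 = rng.normal(scale=0.01, size=(L2, L3))

def document_vector(full_text: str) -> np.ndarray:
    """One pass over the full text, then a fixed-size last-layer vector."""
    bow = np.zeros(VOCAB)
    for word in full_text.lower().split():       # cost grows with document length...
        bow[hash(word) % VOCAB] += 1.0
    h = np.maximum(bow @ W1, 0)
    h = np.maximum(h @ W2, 0)
    return h @ W3                                # ...but only these 40 floats are stored

vec = document_vector("long page body " * 1000)
print(vec.shape)                                 # (40,) regardless of text length
```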

    Korolev: layer architecture

    The old Palekh models had three hidden layers of 150, 300 and 300 neurons. This architecture was dictated by the need to save computing resources: multiplying large matrices at query time is expensive. In addition, RAM is needed to store the model itself. The size of the model depends most strongly on the size of the first hidden layer, which is why in Palekh it was relatively small, at 150 neurons. Shrinking the first hidden layer significantly reduces the size of the model, but it also reduces its expressive power.

    In the new Korolev models, the only bottleneck is the size of the last hidden layer. With precomputed vectors, resources are spent only on storing the last layer in the index and on its scalar multiplication with the query vector. It is therefore reasonable to give the new models a more “wedge-shaped” form, with larger first hidden layers and a smaller last layer. Experiments showed that a good gain in quality can be obtained with hidden layers of 500, 500 and 40 neurons. Thanks to the larger first layers, the expressive power of the model increased noticeably, while the last layer could be reduced to a couple of dozen neurons with almost no loss of quality.
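
    A rough way to see the trade-off (a sketch; real models also have biases and activations, and the input dimension here is an assumption): the wedge-shaped tower has far more parameters, and therefore more expressive power, while only the final 40 numbers per document end up in the index.

```python
from typing import List

def tower_params(input_dim: int, layers: List[int]) -> int:
    """Weights in a plain feed-forward tower (biases ignored)."""
    total, prev = 0, input_dim
    for size in layers:
        total += prev * size
        prev = size
    return total

INPUT = 10_000  # assumed input dimension, for illustration
print("Palekh-like  (150, 300, 300):", tower_params(INPUT, [150, 300, 300]))
print("Korolev-like (500, 500,  40):", tower_params(INPUT, [500, 500, 40]))
print("Stored per document: 40 floats (last layer only)")
```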

    Nevertheless, despite all our optimizations, such a deep application of neural networks in search requires considerable computing power. And who knows how much longer the implementation would have taken if not for another project, which freed up the resources for it even though we were solving a completely different problem with it.

    Korolev: an additional index

    When we receive a user query, we begin to gradually select the best pages from among the millions in the index. It all starts with stage L0, which is essentially a filtering stage: it discards most of the irrelevant documents, and the subsequent stages handle the main ranking.



    In the classic search model, this problem is solved with inverted indices. For each word, we store all the documents in which it occurs, and when a query arrives, we intersect these lists. The main problem is high-frequency words. The word “Russia”, for example, may appear on every tenth page, so we would have to walk through every tenth document in order not to lose anything we need. On the other hand, a user who has just entered a query expects to see the answer that very instant, so the filtering stage is strictly limited in time. We could not afford to walk through all the documents for high-frequency words, so we used various heuristics: we sorted documents by some query-independent value, or stopped the search once it seemed that enough good documents had been found. In general, this approach worked well, but useful documents were sometimes lost.
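
    A minimal sketch of the classic scheme (ignoring posting-list compression, skip lists and the time-limit heuristics described above):

```python
from collections import defaultdict

# Build an inverted index: word -> set of document ids containing it.
docs = {
    1: "lazy cat from mongolia",
    2: "manul is a wild cat from mongolia",
    3: "social network login page",
}
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        inverted[word].add(doc_id)

def candidates(query: str) -> set:
    """Intersect the posting lists of all query words (the classic L0 filtering idea)."""
    postings = [inverted[w] for w in query.split()]
    return set.intersection(*postings) if postings else set()

print(candidates("cat from mongolia"))   # {1, 2}
```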

    With the new approach, everything is different. It is based on a hypothesis: if, for a query of several words, we take a not very long list of the most relevant documents for each word or phrase, then among them there will be documents that are relevant to all of the words at once. In practice, this means the following: for all words and popular pairs of words, an additional index is built with a list of pages and their preliminary relevance. That is, we move part of the work from stage L0 to the indexing stage. What does this give us?
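
    A sketch of the idea (the scores, list lengths and documents are invented for illustration): for each word or popular word pair we store, offline, a short list of the best documents with a preliminary relevance score, and at query time we only merge these short lists.

```python
import heapq
from collections import defaultdict

# Offline: for each word (or popular word pair), the top documents with a
# precomputed preliminary relevance score.
TOP_PER_TERM = 3
extra_index = {
    "cat":      [(0.9, 2), (0.7, 1)],
    "mongolia": [(0.8, 2), (0.6, 1)],
    "login":    [(0.9, 3)],
}

def l0_candidates(query: str, top_k: int = 5):
    """Merge the short precomputed lists and return the best candidate documents."""
    scores = defaultdict(float)
    for term in query.split():
        for rel, doc_id in extra_index.get(term, [])[:TOP_PER_TERM]:
            scores[doc_id] += rel
    return heapq.nlargest(top_k, scores.items(), key=lambda kv: kv[1])

print(l0_candidates("cat mongolia"))     # documents 2 and 1 surface without scanning everything
```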

    The tight time constraints on computation come from a simple fact: you cannot make the user wait. But if these computations are done in advance, offline (that is, not at the moment the query is entered), there are no such restrictions. We can let the machine walk through every document in the index, and not a single page will be lost.

    The completeness of the search is important. But no less important is the fact that, at the cost of extra RAM, we significantly lightened the load at the moment the results are constructed, freeing up computing resources for the heavy query + title and query + document neural models. And not only for them.

    Korolev: query + query

    When we started working on the new search, we were not yet sure which direction would prove the most promising, so we assigned two teams to study neural models. For some time they worked independently, developing their own ideas, and to some extent even competed with each other. One of them worked on the query + document approach described above. The second team approached the problem from a completely different angle.

    For any page on the Internet, you can come up with more than one query. The same VKontakte can be found with the queries [VKontakte], [VKontakte login] or [VKontakte social network]. The queries are different, but the meaning behind them is the same, and this can be exploited. Colleagues from the second team came up with the idea of comparing the semantic vector of the query the user has just entered with the vector of another query for which we already know the best answer for sure. If the vectors (and therefore the meanings of the queries) are close enough, the search results should also be similar.
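
    Schematically, this can be sketched as a nearest-neighbour lookup over query vectors (the vectors, threshold and answers below are invented for illustration):

```python
import numpy as np

# Offline: semantic vectors of queries for which good answers are already known.
known_queries = {
    "vkontakte":         np.array([0.90, 0.10, 0.00]),
    "vkontakte login":   np.array([0.88, 0.15, 0.05]),
    "weather in moscow": np.array([0.00, 0.20, 0.95]),
}
best_answers = {"vkontakte": "vk.com", "vkontakte login": "vk.com/login",
                "weather in moscow": "yandex.ru/pogoda"}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def similar_known_query(query_vec: np.ndarray, threshold: float = 0.95):
    """Return the best answer of the closest known query if it is close enough in meaning."""
    best = max(known_queries, key=lambda q: cosine(query_vec, known_queries[q]))
    return best_answers[best] if cosine(query_vec, known_queries[best]) >= threshold else None

print(similar_known_query(np.array([0.89, 0.12, 0.02])))   # reuses the answer for "vkontakte"
```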

    In the end, both approaches turned out to give good results, and our teams joined forces. This allowed us to quickly complete the research and introduce the new models into Yandex Search. For example, if you now type the query [lazy cat from Mongolia], it is neural networks that help pull information about the manul to the top.

    What's next?

    "Korolev" is not one specifically taken model, but a whole set of technologies for deeper use of neural networks in the search for Yandex. This is another important step towards the future, in which the Search will focus on the semantic matching of queries and pages no worse than a person. Or even better.

    All of the above is already in production, and other ideas are waiting in the wings. For example, we would like to try using neural networks at the L0 stage, so that semantic vectors help us find documents that are close in meaning to the query but do not contain the query words at all. We would also like to add personalization (imagine another vector corresponding to the interests of a particular person). All this requires not only time and knowledge, but also memory and computing power, and that is impossible without a new data center. Yandex already has one, but that is another story, which we will certainly tell in the near future. Stay tuned.
