Yandex ranking: how we put machine learning on stream (post #2)

    We continue our series of publications about the FML framework, which has automated work with machine learning and lets Yandex developers apply it to their tasks more easily and more often. The previous post discussed what a ranking function is and how we learned to build one, given only a sufficiently large number of assessor judgments and a sufficiently diverse set of document attributes (factors) for a large number of queries.

    From this post you will learn:
    1. Why the ranking formula has to be re-selected very often, and how exactly FML helps us do it;
    2. How we develop new factors and evaluate their effectiveness.



    Selection of a ranking formula


    It is one thing to select a formula once, and quite another to do it again and again. Let us talk about why the latter is so necessary in our circumstances.

    As already mentioned, the Internet changes rapidly, and we need to improve search quality constantly. Our developers are always looking for new factors that could help with this. Our assessors evaluate thousands of documents every day so that the algorithms quickly learn the new kinds of pages appearing on the Internet and account for changes in the usefulness of documents evaluated earlier. The crawler collects masses of fresh documents, which constantly shifts the average values of the factors. Factor values can change even for unchanged documents, since the algorithms that compute the factors, and their implementations, are continually being improved.

    To absorb this stream of changes into the ranking formula quickly, a whole technological pipeline is needed. Ideally, it should require no human involvement, or be as simple as possible for the humans who do take part. And it is very important that some changes do not interfere with evaluating the usefulness of others. FML became exactly such a pipeline. While MatrixNet acts as the “brain” of our machine learning, FML is a convenient service built on top of it that requires far less specialized knowledge and experience to use. Here is how that is achieved.

    Firstly, for each specific task a developer brings to us, FML recommends the MatrixNet launch parameters that best suit the conditions and constraints of that task. The service itself picks the settings that are optimal for a given volume of judgments; for example, it helps choose the objective function (pointwise or pairwise) depending on the size of the training sample.
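
    To make this concrete, here is a minimal sketch of what such a recommendation rule might look like. The threshold, parameter names and values below are our illustrative assumptions, not FML's actual logic:

        def recommend_params(num_judgements: int) -> dict:
            """Pick a MatrixNet-style objective and basic settings by sample size."""
            if num_judgements < 50_000:
                # Small samples: a pointwise objective tends to be more stable.
                return {"objective": "pointwise",
                        "iterations": 2000, "learning_rate": 0.03}
            # Larger samples can afford a pairwise objective, which optimizes the
            # relative order of document pairs rather than absolute scores.
            return {"objective": "pairwise",
                    "iterations": 5000, "learning_rate": 0.02}

        print(recommend_params(10_000))   # -> pointwise settings
        print(recommend_params(500_000))  # -> pairwise settings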

    Secondly, FML provides transparent multitasking. Each iteration of formula selection is a many-hour computation that fully loads several dozen servers. As a rule, a dozen different formulas are being selected at any one time, and FML manages the load and isolates each developer's computations from those of his colleagues so that they do not interfere with one another.

    Thirdly, unlike MatrixNet, which has to be launched by hand, FML provides distributed execution of resource-intensive tasks on the cluster. This includes having everyone use a single, latest version of the machine learning libraries, deploying the program to all machines, handling failures as they arise, preserving computations that have already completed, and verifying results when a computation is restarted.
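
    Two of these ideas, preserving finished computations and retrying failed ones, can be sketched in a few lines. Everything here (function names, the cache layout, the retry policy) is invented for illustration; the real scheduler is of course far richer:

        import hashlib, json, os, time

        CACHE_DIR = "fml_cache"

        def cached_run(task: dict, compute, retries: int = 3):
            """Return a cached result for `task`, or compute it with retries."""
            key = hashlib.sha256(json.dumps(task, sort_keys=True).encode()).hexdigest()
            path = os.path.join(CACHE_DIR, key + ".json")
            if os.path.exists(path):                  # computation already performed
                with open(path) as f:
                    return json.load(f)
            for attempt in range(retries):
                try:
                    result = compute(task)            # the expensive part
                    os.makedirs(CACHE_DIR, exist_ok=True)
                    with open(path, "w") as f:        # preserve it for restarts
                        json.dump(result, f)
                    return result
                except Exception:
                    time.sleep(2 ** attempt)          # back off, then retry
            raise RuntimeError(f"task failed after {retries} attempts")

        # Usage: cached_run({"formula": "v2", "factors": 419}, train_formula)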

    Finally, we took advantage of the fact that computationally heavy tasks can gain a great deal of performance if run on graphics processors (GPUs) instead of general-purpose processors (CPUs). To do this we adapted MatrixNet to the GPU, which gave us more than a 20-fold gain in computation speed per unit cost of hardware. The specifics of our decision tree construction algorithm let us exploit the high degree of parallelism available on the GPU. And because we kept the programming interfaces FML uses, we were able to give colleagues working on factors the new computing power without changing their usual development processes.
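
    To show why tree construction parallelizes so well, here is a simplified numpy sketch of histogram-based split search, the core step of gradient-boosted tree learning: the gain of every (feature, threshold) pair can be scored independently, which is exactly what thousands of GPU threads are good at. MatrixNet's actual algorithm is not public, so this only illustrates the general technique:

        import numpy as np

        def best_split(X_binned: np.ndarray, grad: np.ndarray, n_bins: int = 32):
            """X_binned: (n_samples, n_features) bin indices; grad: per-sample gradients."""
            n_samples, n_features = X_binned.shape
            # Accumulate gradient sums and counts per (feature, bin).
            hist = np.zeros((n_features, n_bins))
            counts = np.zeros((n_features, n_bins))
            for f in range(n_features):
                np.add.at(hist[f], X_binned[:, f], grad)
                np.add.at(counts[f], X_binned[:, f], 1)
            # Score every (feature, threshold) split at once -- data-parallel.
            left_sum = np.cumsum(hist, axis=1)
            left_cnt = np.cumsum(counts, axis=1)
            right_sum = hist.sum(axis=1, keepdims=True) - left_sum
            right_cnt = counts.sum(axis=1, keepdims=True) - left_cnt
            eps = 1e-9
            gain = left_sum**2 / (left_cnt + eps) + right_sum**2 / (right_cnt + eps)
            best_f, best_bin = np.unravel_index(np.argmax(gain), gain.shape)
            return best_f, best_bin

        rng = np.random.default_rng(0)
        X = rng.integers(0, 32, size=(1000, 8))  # pre-binned factor values
        g = rng.normal(size=1000)                # gradients of the loss
        print(best_split(X, g))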

    A few words about the GPU
    In general, the advantage of GPUs over CPUs shows up in tasks with a large share of floating-point calculations, and machine learning is no exception. Computing performance is measured in IOPS for integer operations and FLOPS for floating-point ones. And if all the I/O costs, including communication with memory, are left aside, it is precisely in FLOPS that GPUs have pulled far ahead of conventional CPUs. On some classes of tasks the performance gain over general-purpose processors is hundreds of times.

    But precisely because far from all common algorithms suit the GPU computing architecture, and not all programs need large volumes of floating-point computation, the industry as a whole continues to use CPUs rather than switching to GPUs.



    About our GPU cluster and supercomputers
    Right now the performance of Yandex's GPU cluster is 80 TFLOPS, and we soon plan to expand it to 300 TFLOPS. We do not call our cluster a supercomputer, although in fact it is one. In terms of its component base, for example, it is very close to Lomonosov, the most powerful supercomputer in Russia and Eastern Europe; a number of components in our case are even more modern. And although we trail Lomonosov in the number of computing nodes (and hence in performance), after the expansion our cluster will likely enter the top hundred of the TOP500 list of the world's most powerful supercomputers and the top five in Russia.


    Development of new factors and assessment of their effectiveness


    Factors play an even more important role in ranking than the ability to select a formula. After all, the more diverse the signals that distinguish one document from another, the more effective the ranking function can be. In our drive to improve search quality, we are constantly looking for new factors that could help.

    Creating them is a very complex process, and not every idea survives the test of practice. Developing and tuning a good factor can take several months, and the percentage of hypotheses confirmed in practice is extremely small. As Mayakovsky put it: “a gram of output, years of labor.” In FML's first year, out of tens of thousands of checks of various factors with different parameter combinations, only a few hundred were approved for implementation.

    For a long time at Yandex, working on factors required, firstly, a deep understanding of how search engines work in general and of ours in particular, and, secondly, solid knowledge of machine learning and information retrieval. The arrival of FML removed the first requirement, significantly lowering the entry barrier to factor development. The number of specialists who can now take it up has grown by an order of magnitude.

    But a large team requires a transparent development process. Previously, each developer limited himself to whatever checks he personally considered sufficient and measured quality “by eye”. As a result, producing a good factor was more of an art. And if a factor hypothesis was rejected, it was later impossible to revisit the tests on which that decision had been based.

    With the advent of FML, factor development became a standard, measurable and managed process in a large team. Mutual transparency appeared as well: everyone can see what colleagues are doing and check the quality of previous experiments. In addition, we gained a quality control system for new factors that lets a poor result through with much lower probability than at the leading international conferences on information retrieval.


    To assess the quality of a factor we do the following. We split the set of judgments we have into two parts, training and test, each time with a new random split. On the training judgments we select two formulas, an old one (without the factor under test) and a new one (with it), and on the test judgments we see which of the formulas is better. The procedure is repeated many times over a large number of different splits of our judgments. In statistics this process is called cross-validation. It lets us make sure the new formula's quality really is better than the old one's. In machine learning this technique is known as the wrapper approach to feature selection. If, on average, the new formula gives a noticeable quality improvement over the old one, the new factor may become a candidate for implementation.
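
    Here is a hedged sketch of that evaluation loop, with scikit-learn's gradient boosting standing in for MatrixNet and plain R² standing in for pFound; the data and all names are synthetic, purely for illustration:

        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor
        from sklearn.model_selection import train_test_split

        def evaluate_factor(X_old, x_new, y, n_splits=10, seed=0):
            """Return mean and spread of the test-quality gain from adding x_new."""
            X_new = np.column_stack([X_old, x_new])
            deltas = []
            for i in range(n_splits):
                # a new random train/test split of the judgments every time
                idx_tr, idx_te = train_test_split(
                    np.arange(len(y)), test_size=0.3, random_state=seed + i)
                def score(X):
                    model = GradientBoostingRegressor(random_state=0)
                    model.fit(X[idx_tr], y[idx_tr])
                    return model.score(X[idx_te], y[idx_te])
                # new formula (with the factor) minus old formula (without it)
                deltas.append(score(X_new) - score(X_old))
            return np.mean(deltas), np.std(deltas)

        # Synthetic data: the candidate factor genuinely carries signal here.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(2000, 10))
        factor = rng.normal(size=2000)
        y = X[:, 0] + 0.5 * factor + rng.normal(scale=0.1, size=2000)
        mean_gain, spread = evaluate_factor(X, factor, y)
        print(f"mean quality gain: {mean_gain:.4f} +/- {spread:.4f}")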

    But even if a factor has proved its usefulness, we need to understand the cost of implementing and using it. That cost includes more than the time the developer spent conceiving, implementing and tuning the idea. Many factors must be computed right at query time, for each of the thousands of documents matching the query. Every new factor is therefore a potential slowdown of the search engine's response, and we make sure response time stays within a very tight budget. This means that introducing each new factor must be backed by an increase in the capacity of the cluster that serves user queries. There are other hardware resources that cannot be spent without limit either. For example, keeping each additional byte per document in RAM on the search cluster costs about $10,000 a year.

    Thus it is important for us to pick, out of many potential factors, only those with the best ratio of quality gain to hardware cost, and to reject the rest. Measuring the quality gain and estimating the additional costs is FML's next task after formula selection.
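
    A back-of-the-envelope illustration of this selection, using the $10,000 per byte per year estimate from above; the candidate factors and their numbers are made up:

        # the figure from the text: each extra byte of RAM per document on the
        # search cluster costs about $10,000 a year
        COST_PER_BYTE_YEAR = 10_000

        candidates = [
            # (name, quality gain in % of pFound, extra bytes of RAM per document)
            ("factor_a", 0.05, 4),   # a 4-byte float
            ("factor_b", 0.02, 1),   # a single byte
            ("factor_c", 0.08, 16),  # an expensive 16-byte factor
        ]

        # rank candidates by quality gain per dollar of yearly RAM cost
        for name, gain, nbytes in sorted(
                candidates, key=lambda c: c[1] / (c[2] * COST_PER_BYTE_YEAR),
                reverse=True):
            cost = nbytes * COST_PER_BYTE_YEAR
            print(f"{name}: +{gain}% pFound for ${cost:,}/year "
                  f"-> {gain / cost * 1e6:.2f}% per $1M")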

    The price and accuracy of measurement
    By our statistics, evaluating the quality of factors before implementation takes considerably more computation time than selecting the formulas themselves, partly because the ranking formula has to be re-selected repeatedly for every factor. Over the past year, for example, about 10 million machine-hours went into roughly 50,000 factor checks, and only about 2 million into selecting ranking formulas. In other words, most of the cluster's time is spent on research, not on routinely recomputing formulas.

    As in any mature market, each new improvement comes much harder than the previous one, and each successive “nine” of quality costs several times more than the one before. We now count gains in tenths and hundredths of a percent of the target quality metric (in our case, pFound). Under such conditions, the instruments that measure quality must be accurate enough to reliably register even such small changes.
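
    pFound models a user scanning results from top to bottom. A sketch of the commonly published formulation follows (per Yandex's ROMIP papers; the production version may differ in details):

        def pfound(p_rel, p_break=0.15):
            """p_rel: probabilities that each result, top to bottom, satisfies the user."""
            p_look = 1.0      # probability the user looks at the current position
            score = 0.0
            for p in p_rel:
                score += p_look * p
                # the user reads on only if unsatisfied and not yet tired of searching
                p_look *= (1 - p) * (1 - p_break)
            return score

        # Moving a more relevant result to the top raises pFound:
        print(round(pfound([0.1, 0.4, 0.3]), 4))  # weaker ordering
        print(round(pfound([0.4, 0.3, 0.1]), 4))  # stronger ordering of the same results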

    As for hardware resources, we evaluate three components: computational cost, disk size and RAM size. Over time we even developed “exchange rates”: how much performance degradation, or how many bytes of disk or RAM, we are willing to pay for a 1% gain in quality. Memory consumption is estimated experimentally, the quality gain comes from FML, and the performance hit is measured in separate load testing. Some aspects, however, cannot be evaluated automatically, for example whether a factor introduces a strong feedback loop. For this reason there is an expert council with the right to veto the introduction of a factor.

    When the time comes to deploy a formula built with new factors, we run an A/B test: an experiment on a small percentage of users. It is needed to make sure users like the new ranking better than the current one. The final deployment decision is made on the basis of user quality metrics. Dozens of experiments are running at Yandex at any given moment, and we try to keep this process invisible to users of the search engine. In this way we achieve not only the mathematical validity of our decisions but also the practical usefulness of the innovations.
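
    A minimal sketch of the standard technique for routing a stable small share of users into such an experiment (this is the generic approach, not a description of Yandex's actual experimentation system):

        import hashlib

        def bucket(user_id: str, experiment: str, n_buckets: int = 100) -> int:
            """Deterministically map a user to one of n_buckets for an experiment."""
            h = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
            return int(h, 16) % n_buckets

        def in_experiment(user_id: str, experiment: str, percent: int = 1) -> bool:
            # percent=1 routes about 1% of users to the new ranking formula;
            # hashing keeps each user's assignment stable across sessions
            return bucket(user_id, experiment) < percent

        print(in_experiment("user42", "new-factor-rollout"))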

    So, FML has allowed us to put factor development at Yandex on stream, and it gives factor developers a clear, regulated way to find out, with relatively little effort, whether a new factor is good enough to be considered for implementation. In the next, and last, post we will talk about how we make sure a factor's quality does not degrade over time. From it you will also learn where else our machine learning technology is applicable.
