How hh.ru tests job search

    I have already shared the story of our experience using artificial intelligence in search on hh.ru, and today I would like to focus in more detail on how we measure the quality of that search.

    For the search to work properly, a system of metrics is essential: local metrics, A/B tests, production monitoring, and so on, and this system requires its own attention and resources. It is a mistake to think that it is enough to hack together a cool ML model and duct-tape all these metrics onto it; simply measuring the quality of an already running system is not enough, and here it does not matter much whether that system uses ML or is Lucene out of the box.

    We abandoned the old search solutions not because they seemed obsolete to us, or because ML is stylish, fashionable, and trendy. The old search lacked local quality metrics that would let us measure the benefit of a change before launching it into lengthy experiments; moreover, it was unclear what to change and what to measure in order to set up a process of continuous improvement.

    When we started building the ML-based search system, we designed a system of local metrics into it from the start. During development we compared the quality of the new ML search, in which models predict the probability of a response, with the quality of the old keyword search, which relied only on textual matching between the query and the vacancy. For this we used the standard local metrics: MAP, NDCG, and ROC-AUC. Along the way we also expanded the set of metrics and cohorts in our A/B tests and covered the new search with autotests. In this article I will describe how we monitor the quality of our recommendation models; the HeadHunter experience may well be useful to you too, because, I repeat, it does not matter much whether your search is built on ML or not.

    Statistical tests


    First of all, we measured model quality with the local metrics MAP, NDCG, and ROC-AUC, and we saw a significant improvement from the transition from keyword search to ML-based search. The reason is that a traditional search engine built on Lucene or Sphinx cannot predict the probabilities of target actions and rank by them. For example, it cannot account for the salary specified in the vacancy and in the applicant's resume; it does not correlate the key skills in a resume with the job requirements, and it does not take semantic relationships into account when matching words. This is visible in the search quality metrics if we compare the text-matching score from Lucene with the scores from ML models that rank and filter by the probability of a response and an invitation:
    Metric                      Keyword search    ML search
    Area under the ROC curve    0.608             0.717
    Mean Average Precision      0.327             0.454
    NDCG                        0.525             0.577
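
    For reference, here is a minimal sketch of how such offline metrics can be computed with scikit-learn. The labels and scores below are invented for illustration and are not hh.ru data; for MAP you would average the per-query AP values over all queries in the evaluation set.

```python
# A minimal sketch of computing the offline metrics from the table above
# with scikit-learn; the labels and scores are invented for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, ndcg_score

# Binary labels for one query: did the user respond to the shown vacancy?
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
# Scores from a keyword matcher vs. a model predicting response probability.
keyword_scores = np.array([0.2, 0.8, 0.5, 0.4, 0.7, 0.1, 0.3, 0.6])
ml_scores = np.array([0.9, 0.2, 0.7, 0.8, 0.3, 0.1, 0.6, 0.4])

for name, scores in [("keyword", keyword_scores), ("ml", ml_scores)]:
    print(
        name,
        "ROC-AUC:", roc_auc_score(y_true, scores),
        # average_precision_score gives AP for a single query;
        # MAP is the mean of AP over all queries.
        "AP:", average_precision_score(y_true, scores),
        "NDCG:", ndcg_score([y_true], [scores]),
    )
```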

    The values of local metrics predict product metrics only as well as those local metrics are measured. For example, when we switched to splitting the data by time and by user during cross-validation, the metric values dropped, but they began to predict future changes in A/B tests better.
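
    Here is a sketch of what such splits might look like, assuming an interaction log in a pandas DataFrame with hypothetical user_id and timestamp columns; a plain random split would leak future events and a user's other sessions into the training set and inflate the metrics.

```python
# A sketch of time- and user-aware validation splits; the file name and
# column names (user_id, timestamp) are hypothetical.
import pandas as pd
from sklearn.model_selection import GroupKFold

log = pd.read_csv("interactions.csv", parse_dates=["timestamp"])

# Split by time: train on the past, validate on the future.
cutoff = log["timestamp"].quantile(0.8)
train = log[log["timestamp"] <= cutoff]
valid = log[log["timestamp"] > cutoff]

# Split by user: all events of a given user fall into exactly one fold,
# so the model is always evaluated on unseen users.
gkf = GroupKFold(n_splits=5)
for train_idx, valid_idx in gkf.split(log, groups=log["user_id"]):
    pass  # fit on log.iloc[train_idx], evaluate on log.iloc[valid_idx]
```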

    Over the past year, by improving the quality of search and recommendations, we increased the success rate of search sessions in the app, on the mobile site, and on the desktop by an average of 22% (the dip on the chart is the New Year holidays).


    Autotests


    After that, we expanded the coverage of unit and smoke tests. For example, we run smoke tests on high-frequency queries ([accountant], [driver], [administrator], [manager]) and check how the model handles reference user resumes from a reference database, so that with every release we can see that we did not break the search: the query "sales manager" returns relevant vacancies, and the first pages do not contain, say, vacancies for project managers.
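
    A minimal sketch of what such a smoke test might look like in pytest; the endpoint URL, response format, and assertions are invented for illustration and are not hh.ru's actual test suite.

```python
# A hypothetical smoke test: the endpoint, parameters, and response
# format are invented for illustration.
import pytest
import requests

SEARCH_URL = "https://example.test/search/vacancies"  # hypothetical endpoint
HIGH_FREQUENCY_QUERIES = ["accountant", "driver", "administrator", "manager"]


@pytest.mark.parametrize("query", HIGH_FREQUENCY_QUERIES)
def test_high_frequency_query_returns_results(query):
    # Every release must still return a non-empty result page.
    resp = requests.get(SEARCH_URL, params={"text": query}, timeout=5)
    assert resp.status_code == 200
    assert len(resp.json()["items"]) > 0


def test_sales_manager_first_page_is_relevant():
    resp = requests.get(SEARCH_URL, params={"text": "sales manager"}, timeout=5)
    titles = [item["title"].lower() for item in resp.json()["items"][:20]]
    # The first page should not be dominated by unrelated roles.
    assert not any("project manager" in title for title in titles)
```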

    A/B tests


    The main purpose of our A/B testing system is control and decision-making (whether to roll out a new model, interface, and so on). For control, that is, checking the quality of an already running model, we run reverse tests, where the old model is enabled as the experiment. This way we can be sure that the current model is still better than the old one.
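
    Deciding whether a variant wins usually comes down to a significance test on a success metric. Below is a minimal sketch with a two-proportion z-test from statsmodels; the counts are made up for illustration.

```python
# A minimal sketch of comparing the success rates of two A/B groups with
# a two-proportion z-test; the counts below are invented.
from statsmodels.stats.proportion import proportions_ztest

successes = [46_000, 44_000]   # successful sessions: current model vs. reverse test
sessions = [100_000, 100_000]  # total sessions in each group

stat, p_value = proportions_ztest(successes, sessions)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# A small p-value means the observed difference between the groups is
# unlikely to be due to chance alone.
```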

    We have been using our own A/B testing system for quite a while. For example, right after the first launch of the alpha version of ML-based recommendations, it showed us that the success rate of recommendations had grown by 30%. We covered the A/B testing system itself and the metrics we use in a separate article.

    Performance


    But a "victory" of the new model on local metrics or in an A/B test does not mean that the model can go to production: it may be too resource-hungry, which would be completely unacceptable for hh.ru, a high-load site. To measure resource consumption, we monitor every stage of computing a document's score.

    The graph shows the time spent in each stage of the search. It is clear that the new model was too heavy: we had to roll it back, optimize the features, and roll out a computationally lighter version.
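
    A sketch of what per-stage timing instrumentation might look like; the stage names are illustrative, and time.sleep stands in for the real pipeline steps. In production the timings would go to a monitoring system rather than stdout.

```python
# A sketch of per-stage timing for a search request; stage names are
# illustrative, and time.sleep stands in for the real pipeline steps.
import time
from contextlib import contextmanager


@contextmanager
def timed(stage, timings):
    """Record the wall-clock duration of a pipeline stage in milliseconds."""
    start = time.perf_counter()
    yield
    timings[stage] = (time.perf_counter() - start) * 1000.0


timings = {}
with timed("candidate_retrieval", timings):
    time.sleep(0.01)  # fetch candidate documents from the index
with timed("feature_computation", timings):
    time.sleep(0.02)  # build model features for each candidate
with timed("model_scoring", timings):
    time.sleep(0.05)  # predict response probability for each candidate

print(timings)  # e.g. {'candidate_retrieval': 10.1, 'feature_computation': ...}
```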


    Other indicators


    The most important task of a search and recommendation system is to select the vacancies the user is most likely to respond to. We want the number of responses to vacancies to grow and people to find jobs faster. Therefore, in addition to CTR and the number of successful search sessions, the key search indicator became the absolute number of responses to vacancies. After the new model was enabled, the number of responses began to grow sharply: on hh.ru, users now make more than 600,000 responses to vacancies per day on average. This figure fluctuates; there are days when we record more than a million responses. We can also count as a success an applicant being added to a vacancy as a candidate or, for example, a view of the contacts in a suggested vacancy.

    At the end of this story, I would like to step back a little and share one more conclusion we came to while building the new search: it is not enough to measure quality, it has to be built into the product from the start. Besides clear metrics, this is helped by properly set tasks, so that they do not have to be redone; proper planning, which allows calm work without firefighting; and a careful attitude toward the team, its ideas, and its time. Only under these conditions will there be something to measure.
