Performance metrics for vertical search results based on a click model
In 2014, Yandex published a report presenting the details and conclusions of an experiment on how user behavior influences metrics for evaluating the effectiveness of search results. The translation was supported by the Research Department of ALTWeb Group, which studies the influence of behavioral factors on ranking order and ranking methods. ALTWeb Group uses the results of its own research to develop and implement modern solutions in the field of digital commerce. Publications from open sources are used for scientific purposes.
The cited Yandex report covers one aspect of how behavioral factors shape the approach to building search result pages. The text of the study is provided in its entirety and for educational purposes.
Modern search engines show users heterogeneous information from sources of various types, also called “verticals”. Evaluating systems of this kind is an important and difficult task that has not yet been solved. In this report, we consider the hypothesis that models capturing user behavior on heterogeneous result pages can improve the quality of offline metrics. We propose two metrics for evaluating vertical sources of information, both based on user click models for aggregated search, and evaluate them on query logs collected by the Yandex search engine. We show that, depending on the type of vertical, the proposed metrics correlate better with online user behavior than offline metrics designed for standard web search.
Category and subject area
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval.
Keywords: click model, evaluation, aggregated search.
Web search evaluation usually assumes that users receive a result page with ten snippets, also known as “ten blue links”, and that they scan these snippets from top to bottom. However, modern search engines go beyond the ten blue links paradigm and show the user heterogeneous information from various search subsystems, also known as verticals (for example, images, news, and maps). In this case, user behavior differs significantly from behavior on a standard result page [3, 10]. Although these changes in user behavior should be taken into account when composing heterogeneous result pages, little research has been done in this direction so far.
The quality of a result page can be evaluated in two ways: online or offline. Online evaluation, such as split testing, collects feedback directly from users. This feedback typically includes clicks, time on page, mouse movements, and other signals, and the quality of the system is assessed from them. Search quality can also be evaluated offline, using manual judgments of the whole SERP and/or its parts. Such an assessment can be made either directly or by means of offline performance metrics. More recently, a mixed approach was proposed in which offline metrics are built on user behavior models whose parameters are estimated from search query logs. The evaluation is thus carried out offline and gives results immediately. Nonetheless,
In this article, we consider the problem of evaluating heterogeneous search systems in light of the above. In particular, we develop performance metrics for heterogeneous result pages based on click models.
The main question of our research is the following: can the quality of offline performance metrics for web search be improved by using data on user behavior in the presence of different vertical results?
The contributions of this study are as follows. First, we develop two performance metrics for vertical results based on click models for aggregated search [3, 10]. Second, we evaluate the proposed metrics on a large sample of search logs comprising sessions with various types of vertical results, namely images, videos, maps, and news.
2. Metrics based on user click model
Web search performance metrics should reflect how users perceive the quality of the result lists they are shown. Accordingly, these metrics are increasingly based on user behavior data. Traditional metrics, such as the various precision-based measures, assume that users are interested in relevant documents and therefore focus on relevance. In addition to relevance, more advanced metrics such as nDCG [7] and RBP [8] assume that users view results from top to bottom, and accordingly discount the contribution of a document by its rank on the SERP.
Recently, however, a number of evaluation metrics based on user click models have been proposed. A click model estimates the probability of a click on each document shown to the user on the SERP. Metrics based on such models, in turn, use these probabilities to measure the quality of search results. The Expected Reciprocal Rank (ERR) metric [2] (a ranking metric used by Yahoo; translator's note) uses a simplified version of the DBN click model [1], in which the user scans the result page from top to bottom until finding a relevant document and then leaves the search. The Expected Browsing Utility (EBU) metric [11] (a method proposed by Microsoft; translator's note) is also based on a simplified DBN model but, unlike ERR,
Chuklin et al. [4] proposed a general method for converting click models into metrics for evaluating result pages. They applied this idea to existing click models such as DBN [1], DCM [6], and UBM [5]. As a result, a number of utility-based and effort-based metrics were proposed. All of them performed better than standard metrics that do not take click models into account.
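To make the cascade assumption behind ERR concrete, here is a minimal Python sketch of the standard ERR computation. It assumes integer relevance grades from 0 to `max_grade` and maps a grade to a satisfaction probability as (2^grade - 1) / 2^max_grade. The function name and the default `max_grade=4` are illustrative choices, not taken from the report.

```python
def expected_reciprocal_rank(grades, max_grade=4):
    """ERR for a ranked list of graded relevance labels (top result first)."""
    p_reach = 1.0  # probability that the user gets as far as the current rank
    score = 0.0
    for rank, grade in enumerate(grades, start=1):
        # Probability that the document at this rank satisfies the user.
        p_satisfied = (2 ** grade - 1) / 2 ** max_grade
        # The user stops here with probability p_reach * p_satisfied,
        # and a stop at this rank is worth 1 / rank.
        score += p_reach * p_satisfied / rank
        # Otherwise the user moves on to the next result.
        p_reach *= 1.0 - p_satisfied
    return score
```

A highly relevant document at rank 1 dominates the score, because it leaves little probability mass for the user to reach lower ranks.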
3. Metrics for vertical data sources
The metrics mentioned above have been shown to be effective in a standard user scenario. However, existing offline evaluation methods for web search do not take into account the presence of vertical results on the SERP. Recent studies have shown that user behavior deviates significantly from the standard scenario in this case [3, 10].
The following click models were proposed to capture these deviations: the Federated Click Model (FCM) [3] and the Vertical-aware Click Model (VCM) [10]. These models fit real click data better and showed lower error than click models for standard search. However, corresponding metrics for evaluating result pages have not been developed. We try to fill this gap by converting the FCM and VCM models into offline evaluation metrics based on click models.
We expect these metrics to correlate better with the outcomes of online experiments than existing offline metrics when results from vertical sources are present on the search result page.
Both FCM and VCM extend the User Browsing Model (UBM) for web search [5] (although DCM and DBN could also be used as the base model).
UBM can therefore be used to build utility-based metrics, and we focus on metrics of this kind in this paper.
A utility-based metric can be defined as follows:
$$\text{uMetric} = \sum_{k=1}^{N} P(C_k = 1) \cdot r_k \qquad (1)$$
where $N$ denotes the number of documents on the result page, $P(C_k = 1)$ is the probability that the document at rank $k$ is clicked, and $r_k$ is the relevance of the document at rank $k$.
In expression (1), the relevance $r_k$ comes from offline judgments, while the click probability $P(C_k = 1)$ is computed from the user click model. Following ERR [2], we map a relevance grade $R$ to the relevance $r$ as $r = (2^R - 1) / 2^{R_{\max}}$.
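Expression (1) is a weighted sum and can be sketched in a few lines of Python. The sketch assumes integer relevance grades from 0 to `max_grade` and the ERR-style mapping (2^R - 1) / 2^R_max. The function names and the default `max_grade=4` are illustrative, and the click probabilities are taken as given, since they come from whichever click model is plugged in.

```python
def grade_to_relevance(grade, max_grade=4):
    """Map a relevance grade R to r = (2^R - 1) / 2^R_max."""
    return (2 ** grade - 1) / 2 ** max_grade


def utility_metric(click_probs, grades, max_grade=4):
    """Sum over ranks of P(C_k = 1) * r_k, as in expression (1)."""
    if len(click_probs) != len(grades):
        raise ValueError("need one click probability per judged document")
    return sum(
        p_click * grade_to_relevance(grade, max_grade)
        for p_click, grade in zip(click_probs, grades)
    )
```

Different metrics of this family (uUBM, uFCM, uVCM) would differ only in how `click_probs` is produced.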
According to the UBM click model, a document is clicked if and only if it is examined and attractive to the user:
$$P(C = 1) = P(E = 1) \cdot P(A = 1)$$
where $E$ and $A$ are random variables indicating that the document is examined and that it is attractive, respectively. In UBM, attractiveness depends on the document and the query $q$, while examination depends on the rank of the document and its distance from the last click.
During offline evaluation of web search, clicks are not available, so the distance $d$ from the last clicked document is unknown. This distance therefore has to be eliminated (marginalized out) in order to compute the final click probabilities. According to [4], $P_{UBM}(C = 1)$ can be defined by the following formula:
where, for simplicity, it will be accepted that .
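One way to eliminate the unknown distance is to sum over all possible positions of the previous click. The Python sketch below illustrates this marginalization under stated assumptions and is not necessarily the exact formula used in the report. It assumes that `alpha[k - 1]` is the attractiveness of the document at rank k, that `gamma(rank, distance)` is a caller-supplied examination probability, and that a virtual click at rank 0 with probability 1 stands for "no click yet". All names are hypothetical.

```python
def ubm_click_probabilities(alpha, gamma):
    """Marginal click probabilities P(C_k = 1) for ranks 1..N under UBM.

    alpha: list of attractiveness probabilities, one per rank.
    gamma: callable (rank, distance) -> examination probability, where
           distance is measured from the previous click (rank 0 = none).
    """
    n = len(alpha)
    p_click = [1.0] + [0.0] * n  # p_click[0] is the virtual click at rank 0
    for k in range(1, n + 1):
        total = 0.0
        for j in range(k):  # j = rank of the previous click
            # Probability of no click on ranks j+1 .. k-1, given last click at j.
            p_no_click_between = 1.0
            for i in range(j + 1, k):
                p_no_click_between *= 1.0 - alpha[i - 1] * gamma(i, i - j)
            # Examination of rank k, given that the last click was at rank j.
            total += p_click[j] * p_no_click_between * gamma(k, k - j)
        p_click[k] = alpha[k - 1] * total
    return p_click[1:]
```

When `gamma` always returns 1, every document is examined and the click probabilities reduce to the attractiveness values, which is a quick sanity check for the recursion.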
FCM based metric
A study of user behavior in aggregated search shows that the presence of vertical results affects the probability that other documents on the result page are examined [3, 10]. To capture this difference from standard search, FCM introduces an additional hidden variable $F$, which indicates whether user behavior changes when vertical results are present on the page. In this paper, we call it “vertical appeal”. According to FCM, the examination probability is given by the following equation:
where $t$ is the type of the vertical result, $v$ is its position, and $l$ is the distance between the vertical result and the other results on the page, which can be either positive or negative. The examination probability in FCM can thus be calculated as follows:
To obtain the click probability $P_{FCM}(C = 1)$, we substitute the examination probability $P_{FCM}(E = 1)$ into equation (2) in place of the UBM examination probability $P(E = 1) = \gamma_{kd}$. The uFCM metric is then obtained by plugging $P_{FCM}(C = 1)$ into equation (1).
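The role of the hidden variable F can be illustrated with the law of total probability. The Python sketch below is only an illustration of that step: it assumes that `p_vertical_appeal` stands for P(F = 1), that `gamma_kd` is the UBM examination probability used when F = 0, and that `gamma_with_vertical` is the examination probability when F = 1. The way FCM parameterizes the last two quantities by t, v, and l is not reproduced here, and all names are hypothetical.

```python
def fcm_examination_probability(gamma_kd, p_vertical_appeal, gamma_with_vertical):
    """P(E = 1) after marginalizing over the hidden variable F.

    gamma_kd:            P(E = 1 | F = 0), the plain UBM examination probability.
    p_vertical_appeal:   P(F = 1), the chance that the vertical changes behavior.
    gamma_with_vertical: P(E = 1 | F = 1), examination under the vertical's influence.
    """
    for p in (gamma_kd, p_vertical_appeal, gamma_with_vertical):
        if not 0.0 <= p <= 1.0:
            raise ValueError("all arguments must be probabilities in [0, 1]")
    return (1.0 - p_vertical_appeal) * gamma_kd + p_vertical_appeal * gamma_with_vertical
```

With `p_vertical_appeal` equal to 0 the function returns the UBM value unchanged, which matches the idea that FCM extends UBM.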
VCM based metric
Like FCM, VCM assumes that the probability of examining a document changes when an attractive vertical result is present on the SERP (F = 1). In addition, VCM assumes that in this case the user examines the vertical result first and only then continues to examine the other results. How the user continues after the vertical is controlled by the hidden variable B. VCM thus models the examination probability as follows:
These equations describe three possible examination paths over the result page:
(i) from the top of the page downward (F = 0),
(ii) starting from the vertical, then returning to the top of the SERP (F = 1; B = 1), and
(iii) from the vertical down to the end of the SERP (F = 1; B = 0).
The overall examination probability in VCM is calculated as the average of the examination probabilities over these three paths:
where d, d' and d'' denote the distances to the last clicked document along each of the paths.
The overall examination probability of VCM cannot be substituted directly into expression (2), because it uses different distances for different browsing paths. Instead, the click probability has to be computed for each path separately, which removes each of the distances from the equation. The total click probability for VCM can then be represented as follows:
where $P_i$ denotes the examination probability for the i-th path. The uVCM metric is calculated by substituting $P_{VCM}(C = 1)$ into expression (1).
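The combination of the three paths can be sketched as a mixture. The Python sketch below assumes that each path is weighted by the probability of the corresponding values of F and B, that `p_vertical_appeal` stands for P(F = 1) and `p_back_to_top` for P(B = 1 | F = 1), and that the per-path click probabilities have already been computed with the distance eliminated along each path. These names and the exact weighting are assumptions made for illustration, since the formula itself is missing from the text.

```python
def vcm_click_probability(p_vertical_appeal, p_back_to_top, path_click_probs):
    """Mix per-path click probabilities for one document.

    path_click_probs: (top_down, vertical_then_top, vertical_then_down),
    i.e. the click probability of the document along paths (i), (ii), (iii).
    """
    p_top_down, p_vertical_then_top, p_vertical_then_down = path_click_probs
    weight_i = 1.0 - p_vertical_appeal                       # F = 0
    weight_ii = p_vertical_appeal * p_back_to_top            # F = 1, B = 1
    weight_iii = p_vertical_appeal * (1.0 - p_back_to_top)   # F = 1, B = 0
    return (
        weight_i * p_top_down
        + weight_ii * p_vertical_then_top
        + weight_iii * p_vertical_then_down
    )
```

The three weights sum to 1, so the result is a proper probability whenever the three inputs are.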
4.1 Experimental conditions
To evaluate the proposed metrics on result pages that include vertical results, we collected user search sessions from the click logs of Yandex, a large commercial search engine. As in [3, 10], we used vertical results of three types: images and video as multimedia verticals, news as a text vertical, and maps as a mixed vertical containing both textual and visual data. We sampled sessions from November 2013 that contained one of these vertical results. The first 10 documents on the result page in each session were rated on a standard five-point scale (ideal, excellent, good, good, bad, bad). The collected sessions were split by user ID into training and test sets (see Table 1).
Following [2, 4], we assess the quality of the proposed metrics by how well they agree with online metrics such as UCTR and Max/Mean/MinRR. UCTR is a binary variable indicating whether there was a click in the session (the opposite of abandonment). MeanRR is the mean reciprocal rank of the clicks in a session. MaxRR and MinRR are the maximum and minimum reciprocal ranks of the clicks in a session, respectively. For these online metrics, only clicks on search results are considered.
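The online metrics for a single session can be sketched in Python as follows. The sketch assumes the usual reading in which MaxRR and MinRR are the largest and smallest reciprocal ranks among the clicks, takes 1-based click ranks as input, and assigns 0 to all reciprocal-rank metrics when a session has no clicks. The zero convention and the function name are assumptions, not something stated in the report.

```python
def session_online_metrics(click_ranks):
    """Compute UCTR and Max/Mean/MinRR for one session.

    click_ranks: 1-based ranks of the clicked results (may be empty).
    """
    if not click_ranks:
        # Abandoned session: no click, so every click-based metric is 0 here.
        return {"UCTR": 0, "MaxRR": 0.0, "MeanRR": 0.0, "MinRR": 0.0}
    reciprocal_ranks = [1.0 / rank for rank in click_ranks]
    return {
        "UCTR": 1,
        "MaxRR": max(reciprocal_ranks),   # driven by the highest-placed click
        "MeanRR": sum(reciprocal_ranks) / len(reciprocal_ranks),
        "MinRR": min(reciprocal_ranks),   # driven by the lowest-placed click
    }
```

Averaging these per-session values over all sessions that share the same query and result page gives one online value per page to compare against an offline metric.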
Since the result page for the same query may vary depending on the user, the user's location, and other similar factors, we work with configurations, each of which is a query together with a fixed result page (see statistics in Table 1). Offline metrics give the same value for all sessions with the same configuration, while online metrics are averaged over all sessions with the same configuration. The correlation between offline and online metrics is computed over all configurations as follows:
where $N$ is the total number of configurations, $n_c$ is the number of occurrences of configuration $c$, $m_{ic}$ is the value of metric $m_i$ for configuration $c$, and a is the numerical value of the variable $m_i$.
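Because the correlation formula itself is missing from the text, the Python sketch below shows one plausible reading only: a Pearson correlation in which each configuration is weighted by the number of sessions that share it. The weighting scheme, the function name, and the argument names are assumptions and may differ from the formula used in the report.

```python
from math import sqrt


def weighted_correlation(offline_values, online_values, session_counts):
    """Pearson correlation over configurations, weighted by session counts."""
    if not (len(offline_values) == len(online_values) == len(session_counts)):
        raise ValueError("all inputs must have one entry per configuration")
    total = sum(session_counts)
    if total <= 0:
        raise ValueError("session counts must sum to a positive number")
    # Weighted means of the two metrics.
    mean_off = sum(n * x for n, x in zip(session_counts, offline_values)) / total
    mean_on = sum(n * y for n, y in zip(session_counts, online_values)) / total
    # Weighted covariance and variances (the common 1/total factor cancels).
    cov = sum(
        n * (x - mean_off) * (y - mean_on)
        for n, x, y in zip(session_counts, offline_values, online_values)
    )
    var_off = sum(n * (x - mean_off) ** 2 for n, x in zip(session_counts, offline_values))
    var_on = sum(n * (y - mean_on) ** 2 for n, y in zip(session_counts, online_values))
    if var_off == 0 or var_on == 0:
        raise ValueError("correlation is undefined when a metric is constant")
    return cov / sqrt(var_off * var_on)
```

Weighting by session count keeps frequent configurations from being drowned out by rare ones, which is one reason such a weighting is a natural choice here.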
We compare our vertical-aware metrics against two types of baselines:
(i) static offline metrics, whose parameters are fixed (DCG and ERR), and
(ii) metrics based on click models for web search, whose parameters are estimated from click logs (EBU, uDCM, uDBN and uUBM). When estimating these model parameters, the attractiveness probability P(A = 1) (and the satisfaction probability P(S = 1) for DBN) is assumed to depend only on the relevance grade of the document for the given query.
4.2 Results and discussion
The correlation between offline and online metrics for the various types of vertical results is shown in Tables 2-5, where the best values are shown in bold.
Tables 2 and 3
Table 2 presents the results for the news vertical. News snippets contain mostly textual data and are therefore similar to standard web snippets. As a result, most offline metrics (with the exception of DCG) show comparable correlation with online metrics. At the same time, the proposed vertical-aware metrics, uFCM and uVCM, slightly outperform the other metrics.
Tables 3 and 4 present the results for the multimedia verticals, namely images and videos. In both cases, uFCM shows higher correlation with all online metrics than the baselines. This result is intuitive, given that, according to the logs, user behavior changes significantly when a vertical result contains visual stimuli such as images [3, 10]. The FCM model captures these changes, which in turn results in a higher correlation between uFCM and online metrics.
The uVCM metric ranks second in terms of correlation with online metrics. However, it does not reach the level of uFCM. This can be explained as follows. The FCM and VCM click models use a vertical attractiveness parameter, which shows how much user behavior differs from the standard web search scenario when a vertical result of type t is present at rank v. The lower its value, the closer the vertical-aware model is to the underlying UBM model. After training FCM and VCM on the image and video verticals, we observed that the estimated value for FCM is relatively high, which means that FCM departs substantially from UBM. In contrast, the value for VCM turned out to be very low, which keeps it much closer to UBM. Indeed, Tables 3 and 4 show that the correlation of uVCM with online metrics is close to that of uUBM.
Tables 4 and 5
Table 5 presents the results for the maps vertical, which presents data in both textual and visual form. DCG has the highest correlation with the RR-based online metrics, followed by uDCM (which has the highest correlation with UCTR) and EBU. We used A/B testing to investigate this result. The test was run on real users of the production search engine, with the maps vertical disabled for one week.
This experiment showed that the abandonment rate was much higher when the vertical result was shown than when it was not shown. We see two reasons for this phenomenon: (i) users are satisfied with the information presented directly on the result page (address, phone number, opening hours, etc.) and leave without clicking, which counts as positive abandonment; (ii) some users mistake the maps vertical for a banner (especially when it occupies the top position) and skip it, which is a form of banner blindness.
For route queries, this results in no clicks on the result page. In both cases, online metrics such as MeanRR and UCTR do not give the full picture of user behavior. Thus, the low correlation of offline metrics observed in Table 5 cannot be interpreted as a negative result. Other means of assessing the quality of offline metrics should be used in this case (for example, classifying abandonments as “positive” or “negative” as in [9] and computing correlations only for the latter type), which will be the topic of our future work.
Our results reveal several important trends. First, they confirm the findings of previous work on user behavior in aggregated search: user behavior depends on the type of vertical result included in the result page, and visually attractive verticals, such as video, affect user behavior more than text verticals, such as news. Mixed-content verticals, such as maps, provoke more complex user behavior that requires further investigation.
Second, in answer to the research question posed in Section 1, we showed that, depending on the type of vertical, the proposed click model-based metrics for aggregated search correlate better with online user behavior than offline metrics for web search. In particular, uFCM has the highest correlation when visually attractive verticals such as images and videos are included in the result page. The uVCM metric, by contrast, is more conservative and closer to the corresponding UBM-based metric.
5. Conclusions and further research
In this paper, we examined the problem of offline evaluation in a heterogeneous search environment, where standard search results compete with vertical results. We investigated how data on user behavior on such mixed result pages can help improve the quality of offline metrics. To this end, we took the existing click models for aggregated search, namely FCM and VCM, and converted them into performance metrics based on click models. The experimental results showed that, depending on the type of vertical, the proposed metrics correlate better with online metrics, especially when visually attractive vertical results such as images and videos are shown on the page.
In future work, we plan to extend the proposed metrics to evaluate not only web results but also result pages as a whole, including vertical results, sponsored search, and other components. We also plan to examine in more detail how users behave when results from the maps vertical are shown on the result page. First of all, we would like to understand the reason for the high abandonment rates that we observed in this case, after which we plan to develop methods for separating positive and negative abandonment for a more accurate assessment of the quality of offline metrics.
Acknowledgments. The authors would like to thank Evgeny Krokhlev and Sergey Protasov for inspiring discussions and technical support. This study was partially funded by grant P2T1P2_152269 of the Swiss Science Foundation (the original text goes on to list the other organizations, grants, and programs involved in the work; translator's note).
List of references
[1] O. Chapelle and Y. Zhang. A dynamic bayesian network click model for web search ranking. In WWW '09, pages 1–10, 2009.
[2] O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In CIKM '09, pages 621–630, 2009.
[3] D. Chen, W. Chen, H. Wang, Z. Chen, and Q. Yang. Beyond ten blue links: enabling user click modeling in federated web search. In WSDM '12, pages 463–472, 2012.
[4] A. Chuklin, P. Serdyukov, and M. de Rijke. Click model-based information retrieval metrics. In SIGIR '13, pages 493–502, 2013.
[5] G. E. Dupret and B. Piwowarski. A user browsing model to predict search engine click data from past observations. In SIGIR '08, pages 331–338, 2008.
[6] F. Guo, C. Liu, and Y. M. Wang. Efficient multiple-click models in web search. In WSDM '09, pages 124–131, 2009.
[7] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Information Systems, 20(4):422–446, 2002.
[8] A. Moffat and J. Zobel. Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Information Systems, 27(1):2:1–2:27, 2008.
[9] Y. Song, X. Shi, R. W. White, and A. Hassan. Context-aware web search abandonment prediction. In SIGIR '14, 2014.
[10] C. Wang, Y. Liu, M. Zhang, S. Ma, M. Zheng, J. Qian, and K. Zhang. Incorporating vertical results into search click models. In SIGIR '13, pages 503–512, 2013.
[11] E. Yilmaz, M. Shokouhi, N. Craswell, and S. Robertson. Expected browsing utility for web search evaluation. In CIKM '10, pages 1561–1564, 2010.
[12] K. Zhou, T. Sakai, M. Lalmas, Z. Dou, and J. M. Jose. Evaluating heterogeneous information access. In Proc. MUBE workshop, 2013.