How to set the order of visits to new pages by a search robot based on a prediction of the popularity of a web page (Part II)

Original author: Ludmila Ostroumova, Ivan Bogaty, Arseny Chelnokov, Alexey Tikhonov, Gleb Gusev
  • Transfer
What is the likelihood of a new webpage becoming popular?

We return to the vast topic of behavioral factors and continue to publish the translation of the Yandex report on the construction of a mathematical model for choosing the optimal indexing policy for new pages. The algorithm based on machine learning and probability theory allows us to predict the popularity of a new page in the format of visits (user clicks) recorded by the browser toolbar and operates with the additional concept of a decline in this popularity, which is also predicted for this algorithm. The translation is published with the support of the analytical department of ALTWeb Group and the team of the special project SERPClick, which allows you to increase the ranking of web pages by directly influencing behavioral factors. The second part of the article is abridged.

3. Algorithm

3.1 System
First, let's describe the general system of the proposed algorithm. According to previous studies in this area [7, 16, 20], we set a single source for calculating the cost of indexing across all pages. In other words, we take into account the fixed time that the robot will need to load one page. So everyseconds, the robot selects the page to load from its page queue for indexing. The indexing page queue is constantly updated as new page links are discovered. The indexing page queue can be updated for a number of reasons. New URLs can be detected by the robot or can come from the browser toolbar logs, for example. In this paper, we do not consider the problem of detecting new addresses, and in the course of our research, we took new addresses coming from the browser toolbar.

3.2 Metric
The following metric was proposed [12] for measuring the efficiency of an algorithm. The usefulness of the page that we get from indexing page i over timeafter its appearance, it is expressed in the total number of clicks that this page will receive from the issuing page after it (page i) is indexed. It is important to note that we only consider pages that are of interest in the short term. In this case, the quality of the algorithm at time t (arithmetic mean of the intervals T) will be expressed as follows:

It is also important to note that the number of clicks on the SERP page depends on a number of reasons, including the features of user behavior and the ranking system, and not only on the indexing algorithm. Therefore, it is not always obvious how to interpret the change in the number of clicks when changing the indexing policy. So, for example, does the number of clicks only change due to a change in the page indexing policy, or does it also depend on the ranking method, and then this number can be different when different equally good indexing policies are applied? In fact, we believe that in this case it is more reasonable to refer to user logs, which we can get from the toolbar data and select the number of visits as the main guideline that will replace us with the click rate - this will be enough to measure indexing efficiency. Thus,which we will receive if we index page i with a delay will in the future indicate the number of visits to this page after it has been indexed.
As will be shown in Section 4.1, we evaluate the success of the algorithm on the basis of data collected in one month. Thus, let T = 1 month, and will mean the average benefit (or utility) of applying this indexation policy per second for this month.

3.3 Indexing Strategy

As we agreed in section 3.1, every second our algorithm should select a page from the page queue for indexing. Our task is to select the appropriate URL each time in order to index the page, provided that the indexing queue changes dynamically.
To do this, we take into account the dynamics of changes in popularity as follows. It was shown in [12] that the utility , which is the result of indexing the page u with some delay , fades exponentially with time.
Thus, this utility can be approximated by using with some parameters and .

We can see that with , we get that will express the overall popularity of the page u, that is, in our case, the total number of visits to the page by users since its creation, and will indicate a decline in its popularity.

For each page u, we predict the parameters andas described in section 3.4. Finally, the following strategy will be chosen. Every second, we select a page with the maximum expected utility. That is, for each page u we every second calculate its rating, which is then used to rank all new pages in the indexing queue: where is the time that has passed since page u was found. After that, we index the page with the highest ranking in the queue.
The time when the page was discovered for simplicity is approximated by the time when it was created. Therefore, we assume that the first visit to this page recorded in the user toolbar logs is close to the time the page was created.

3.4 Expected page popularity

In this section, we discuss the parameters of a method of forecasting and for a specific URL u. We will look at a gradient boosting decision tree model with a list of features described later in this section. Recall that this is the total number of visits to page u recorded through the toolbar. Since the distribution of all for all pages u in our data block is large numbers, we are dealing with large values . In order to draw up an indexing policy, we need to know the order of magnitude, while determining the exact number is not so important. As it happens with the probability distribution expressed by large values ​​[19], we predict the value by reducing the standard deviation in the training sample.

We also determine the decline in popularity . Let it mean the number of visits over time t (measured in hours in our experiments) after the URL u was discovered. We train our decision tree model which predicts the ratio to the number of all visits to the page u, that is, the value (in this case, we reduce the standard deviation). Then we can determine the level of decline as follows. Based on the equation it follows that . We calculate the logarithm and get and then we can determine through
Thus, the predicted benefit of indexing the page u with a delay after its appearance will have the following meaning:


1. Abiteboul, S., Preda, M., Cobena, C .: Adaptive on-line page importance
computation. In: Proc. WWW Conference (2003)
2. Abramson, M., Aha, D .: What's in a URL? Genre classification from URLs. In:
Conference on Artificial Intelligence, pp. 262-263 (2012)
3. Bai, X., Cambazoglu, BB, Junqueira, FP: Discovering urls through user feed-
back. In: Proc. CIKM Conference, pp. 77-86 (2011)
4. Baykan, E., Henzinger, M., Marian, L., Weber, I .: A comprehensive study of
findings and algorithms for url-based topic classification. ACM Trans. Web (2011)
5. Baykan, E., Henzinger, M., Weber, I .: Eficient discovery of authoritative resources.
ACM Trans. Web (2013)
6. Cho, J., Schonfeld, U .: Rankmass crawler: a crawler with high personalized pager-
ank coverage guarantee. In: Proc. VLDB (2007)
7. Edwards, J., McCurley, KS, Tomlin, J. A.: Adaptive model for optimizing
performance of an incremental web crawler. In: Proc. WWW Conference (2001)
8. Fetterly, D., Craswell, N., Vinay, V .: The impact of crawl policy on web search
effectiveness. In: Proc. SICIR Conference, pp. 580-587 (2009)
9. Hastie, T., Tibshirani, R., Friedman, J .H .: The elements of statistical learning:
data mining, inference, and prediction: with 200 full-color illustrations. Springer,
New York (2001)
10. Kan, MY: Web page classification without the web page. In: Proc. WWW Con-
ference, pp. 262-263 (2004)
11. Kumar, R., Lang, K., Marlow, C., Tomkins, A .: Eficient discovery of authoritative
resources. Data Engineering (2008)
12. Lefortier, D., Ostroumova, L., Samosvat, E., Serdyukov, P .: Timely crawling of high-
quality ephemeral new content. In: Proc. CIKM Conference, pp. 745-750 (2011)
13. Lei, T., Cai, R., Yang, JM, Ke, Y., Fan, X., Zhang, L .: A pattern tree-
based approach to learning url normalization rules. In: Proc. WWW Conference,
pp. 611-620 (2010)
14. Liu, M., Cai, R., Zhang, M., Zhang, L .: User browsing behavior-driven web crawl-
ing. In: Proc. CIKM Conference, pp. 87-92 (2011)
15. Olston, C., Najork, M .: Web crawling. Foundations and Trends in Information
Retrieval 4 (3), 175-246 (2010)
16. Pandey, S., Olston, C .: User-centric web crawling. In: Proc. WWW Conference
17. Pandiy, S., Olston, C .: Crawl ordering by search impact. In: Proc. WSDM Conference
18. Radinsky, K., Svore, K., Dumais, S., Teevan, J., Bocharov, A., Horvitz, E .: Modeling
and predicting behavioral dynamics on the web. In: Proc. WWW Conference,
pp. 599-608 (2012)
19. Tsur, O., Rappoport, A .: What's in a hashtag ?: content based prediction of
the spread of ideas in microblogging communities. In: Proc. WSDM Conference,
pp. 643-652 (2012)
20. Wolf, JL, Squillante, MS, Yu, PS, Sethuraman, J., Ozsen, L .: Optimal crawling
strategies for web search engines. In: Proc. WWW Conference (2002)

You can find this and other translations in the ALTWEb Group's blog on Habré. The analytical department of the company conducts research and reviews of current search problems and uses the acquired knowledge to develop its own products, such as, for example, a product that does not have domestic (almost) or foreign (completely) analogues to increase the ranking of a site based on behavioral factors: SERPClick . Thank you for the translation provided. Subscribe to our blog if you want to always be in the know!

Only registered users can participate in the survey. Please come in.

Would you like to read further translations of reports on search engine algorithms?

  • 78.5% Yes, please, I'm interested. eleven
  • 21.4% No, thank you: there are too many “Theorwer”; brief conclusions and reviews are enough for me. 3
  • 0% Neither yes nor no, I have a different opinion. 0

Also popular now: