
How to set the order of visits to new pages by a search robot based on a prediction of the popularity of a web page (Part II)
- Transfer

What is the likelihood of a new webpage becoming popular?
We return to the vast topic of behavioral factors and continue to publish the translation of the Yandex report on the construction of a mathematical model for choosing the optimal indexing policy for new pages. The algorithm based on machine learning and probability theory allows us to predict the popularity of a new page in the format of visits (user clicks) recorded by the browser toolbar and operates with the additional concept of a decline in this popularity, which is also predicted for this algorithm. The translation is published with the support of the analytical department of ALTWeb Group and the team of the special project SERPClick, which allows you to increase the ranking of web pages by directly influencing behavioral factors. The second part of the article is abridged.
3. Algorithm
3.1 System
First, let's describe the general system of the proposed algorithm. According to previous studies in this area [7, 16, 20], we set a single source for calculating the cost of indexing across all pages. In other words, we take into account the fixed time


3.2 Metric
The following metric was proposed [12] for measuring the efficiency of an algorithm. The usefulness of the page



It is also important to note that the number of clicks on the SERP page depends on a number of reasons, including the features of user behavior and the ranking system, and not only on the indexing algorithm. Therefore, it is not always obvious how to interpret the change in the number of clicks when changing the indexing policy. So, for example, does the number of clicks only change due to a change in the page indexing policy, or does it also depend on the ranking method, and then this number can be different when different equally good indexing policies are applied? In fact, we believe that in this case it is more reasonable to refer to user logs, which we can get from the toolbar data and select the number of visits as the main guideline that will replace us with the click rate - this will be enough to measure indexing efficiency. Thus,


As will be shown in Section 4.1, we evaluate the success of the algorithm on the basis of data collected in one month. Thus, let T = 1 month, and

3.3 Indexing Strategy
As we agreed in section 3.1, every

To do this, we take into account the dynamics of changes in popularity as follows. It was shown in [12] that the utility


Thus, this utility can be approximated by using



We can see that with



For each page u, we predict the parameters






The time when the page was discovered for simplicity is approximated by the time when it was created. Therefore, we assume that the first visit to this page recorded in the user toolbar logs is close to the time the page was created.
3.4 Expected page popularity
In this section, we discuss the parameters of a method of forecasting






We also determine the decline in popularity












Thus, the predicted benefit of indexing the page u with a delay


References
1. Abiteboul, S., Preda, M., Cobena, C .: Adaptive on-line page importance
computation. In: Proc. WWW Conference (2003)
2. Abramson, M., Aha, D .: What's in a URL? Genre classification from URLs. In:
Conference on Artificial Intelligence, pp. 262-263 (2012)
3. Bai, X., Cambazoglu, BB, Junqueira, FP: Discovering urls through user feed-
back. In: Proc. CIKM Conference, pp. 77-86 (2011)
4. Baykan, E., Henzinger, M., Marian, L., Weber, I .: A comprehensive study of
findings and algorithms for url-based topic classification. ACM Trans. Web (2011)
5. Baykan, E., Henzinger, M., Weber, I .: Eficient discovery of authoritative resources.
ACM Trans. Web (2013)
6. Cho, J., Schonfeld, U .: Rankmass crawler: a crawler with high personalized pager-
ank coverage guarantee. In: Proc. VLDB (2007)
7. Edwards, J., McCurley, KS, Tomlin, J. A.: Adaptive model for optimizing
performance of an incremental web crawler. In: Proc. WWW Conference (2001)
8. Fetterly, D., Craswell, N., Vinay, V .: The impact of crawl policy on web search
effectiveness. In: Proc. SICIR Conference, pp. 580-587 (2009)
9. Hastie, T., Tibshirani, R., Friedman, J .H .: The elements of statistical learning:
data mining, inference, and prediction: with 200 full-color illustrations. Springer,
New York (2001)
10. Kan, MY: Web page classification without the web page. In: Proc. WWW Con-
ference, pp. 262-263 (2004)
11. Kumar, R., Lang, K., Marlow, C., Tomkins, A .: Eficient discovery of authoritative
resources. Data Engineering (2008)
12. Lefortier, D., Ostroumova, L., Samosvat, E., Serdyukov, P .: Timely crawling of high-
quality ephemeral new content. In: Proc. CIKM Conference, pp. 745-750 (2011)
13. Lei, T., Cai, R., Yang, JM, Ke, Y., Fan, X., Zhang, L .: A pattern tree-
based approach to learning url normalization rules. In: Proc. WWW Conference,
pp. 611-620 (2010)
14. Liu, M., Cai, R., Zhang, M., Zhang, L .: User browsing behavior-driven web crawl-
ing. In: Proc. CIKM Conference, pp. 87-92 (2011)
15. Olston, C., Najork, M .: Web crawling. Foundations and Trends in Information
Retrieval 4 (3), 175-246 (2010)
16. Pandey, S., Olston, C .: User-centric web crawling. In: Proc. WWW Conference
2005
17. Pandiy, S., Olston, C .: Crawl ordering by search impact. In: Proc. WSDM Conference
(2008)
18. Radinsky, K., Svore, K., Dumais, S., Teevan, J., Bocharov, A., Horvitz, E .: Modeling
and predicting behavioral dynamics on the web. In: Proc. WWW Conference,
pp. 599-608 (2012)
19. Tsur, O., Rappoport, A .: What's in a hashtag ?: content based prediction of
the spread of ideas in microblogging communities. In: Proc. WSDM Conference,
pp. 643-652 (2012)
20. Wolf, JL, Squillante, MS, Yu, PS, Sethuraman, J., Ozsen, L .: Optimal crawling
strategies for web search engines. In: Proc. WWW Conference (2002)
You can find this and other translations in the ALTWEb Group's blog on Habré. The analytical department of the company conducts research and reviews of current search problems and uses the acquired knowledge to develop its own products, such as, for example, a product that does not have domestic (almost) or foreign (completely) analogues to increase the ranking of a site based on behavioral factors: SERPClick . Thank you for the translation provided. Subscribe to our blog if you want to always be in the know!
Only registered users can participate in the survey. Please come in.
Would you like to read further translations of reports on search engine algorithms?
- 78.5% Yes, please, I'm interested. eleven
- 21.4% No, thank you: there are too many “Theorwer”; brief conclusions and reviews are enough for me. 3
- 0% Neither yes nor no, I have a different opinion. 0