How a search robot can order visits to new pages based on predicted page popularity (Part I)
When a search robot will crawl your web page
This report was published in the Yandex Technologies hub in March of this year. The research conducted by the Yandex group of companies aims to determine the indexing order for new pages. The first part of the paper reviews previous studies on this topic. The method proposed by the research group takes into account a prediction of user behavior for each page, which again brings us back to the relationship between behavioral factors, ranking, and indexing speed. The translation is published with the support of the working group of the SERPClick project, which aims to improve site ranking by directly influencing behavioral factors.
In this document, we focus on how search engines handle new sites. Since it is impossible to index all new pages immediately after they appear, the most important (or popular) pages should be indexed first. The most natural indicator of a page's importance is the number of visitors to it. However, the popularity of new pages cannot be known immediately, and therefore it must be predicted from the characteristics of the new page or site. We consider several methods for predicting the popularity of new pages using previously studied measures of search engine effectiveness, and also propose new settings for measuring this effectiveness that are closer to the real situation. In particular, we compare the short-term and long-term popularity of new pages using data on the decline of popularity over time. In the course of our experiments, we established that data on the decline in popularity can be successfully used to adjust the priority with which a search robot visits pages. Further research should focus on finer tuning of this mechanism.
Keywords: indexing order, new web pages, popularity prediction.
The crawl scheduler of a search robot determines which address from the waiting list will be visited next. Although a single scheduling strategy can serve several goals, it is primarily aimed at the following two tasks:
downloading newly discovered web pages that are not yet in the index, and
updating local copies of pages in which important changes have appeared.
In our work, we focus on the first task: indexing new web pages. It is impossible to index all new pages immediately after they appear, due to the rapid growth in the number of pages on the web and limited resources, even for established search engines. Consequently, the most important pages should be indexed first.
There are several ways to measure the importance of a page, which make it possible both to set the order of page visits for a search robot and to measure the success of indexing. Among the many indicators of page importance are:
- the link graph (with PageRank as the most prominent example), and
- user search activity recorded in the search engine logs.
The goal of any approach [to calculating page importance] is to determine the overall usefulness of indexed pages for a search engine. From this point of view, it is justified to use the number of user clicks on (or visits to) a particular page, i.e. its popularity, as the measure of page importance. This is the approach based on user search behavior proposed in . It has already been shown that the popularity of almost any page is short-lived: pages are popular for some time after their creation, and then user interest declines over time. In this document, we focus only on pages with such short-term user interest, and predict the peak of this indicator after the page is indexed.
The popularity of a new page cannot be known in advance, and therefore it must be predicted from the parameters of the page that are known at the time of its discovery. We analyzed the problem of predicting popularity for new pages; in particular, we took the dynamics of popularity into account, predicting both the popularity of a new URL and its decline. The indexing order proposed earlier in  is based on predicting the overall popularity of a page, and therefore does not take into account how this indicator changes over time. In fact, with this approach, if we take two new pages, one of which is popular today while the other will be even more popular but only in a few days, the second page would be indexed first, even though the first page is losing search traffic right now.
We believe that data on the dynamics of popularity can be used effectively to optimize the behavior of a search robot, although these dynamics are difficult to predict.
We predict the total number of visits that a new page will accumulate over time. Unlike , our prediction is based on a model that combines features from several sources, including the page URL itself and its domain. We model the dynamics of page popularity over time with an exponential function, as proposed in .
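To illustrate what such an exponential model of fading interest looks like, here is a minimal sketch. The function names and the specific decay constant are our own assumptions for illustration, not the paper's actual parameterization:

```python
import math

def remaining_visits(total_visits, decay_rate, t):
    """Expected visits still to come after time t (in days), assuming
    user interest decays exponentially: v(t) = total * rate * exp(-rate * t).
    `decay_rate` is a hypothetical per-day decay constant."""
    return total_visits * math.exp(-decay_rate * t)

# A page predicted to receive 1000 visits in total, with decay rate 0.5/day:
# after 2 days, roughly 1000 * e^-1 ≈ 368 visits remain uncollected.
print(round(remaining_visits(1000, 0.5, 2.0)))  # → 368
```

Under such a model, a page with a fast decay rate loses most of its potential traffic within days, which is exactly why indexing order matters.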
We evaluate various ways of setting the indexing order of pages based on predicted page popularity. The algorithm we propose in this paper takes into account the predicted decline in popularity of web pages and dynamically reorders the indexing queue in accordance with the popularity dynamics. It is worth noting that an indexing-order method based on user behavior data must be evaluated experimentally in real conditions, where the changing nature of the task itself has to be taken into account: indexing delays, the appearance of new pages, and previously popular pages that no longer receive visits. As far as we know, such experiments have not yet been conducted.
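The queue reordering described above can be sketched as follows. This is our own illustrative simplification, not the paper's algorithm: it ranks uncrawled URLs by the visits each page is still expected to receive if indexed now, assuming the exponential decay of interest discussed earlier:

```python
import math
import heapq

def crawl_order(pages, now):
    """Rank uncrawled pages by expected remaining visits.
    `pages` is a list of (url, predicted_total_visits, decay_rate, discovered_at).
    Returns URLs, most urgent first."""
    heap = []
    for url, total, rate, t0 in pages:
        remaining = total * math.exp(-rate * (now - t0))
        heapq.heappush(heap, (-remaining, url))  # negate for a max-heap
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

pages = [
    ("a.example/older-story", 500, 1.0, 0.0),  # discovered 2 days ago
    ("b.example/fresh-story", 800, 1.0, 1.5),  # discovered half a day ago
]
print(crawl_order(pages, now=2.0))
```

Because predictions are re-evaluated against the current time, the same frontier can yield a different order tomorrow: a page whose interest has mostly faded drops down the queue even if its total predicted popularity was high.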
We came to the conclusion that an indexing priority strategy that takes into account the decline in page popularity is more effective than methods that rely on popularity alone. This conclusion confirms our assumption that it is more important to index the pages that are popular right now, so as not to lose the portion of traffic that could flow through the search engine.
Summarizing the above, this study makes the following two contributions:
- We solve the problem of forecasting both the overall popularity and the rate of decline in popularity for new web pages, and propose a method for predicting overall popularity that is more effective than the method currently in use.
- In real conditions, we test various indexing strategies based on user behavior data and show that a strategy that takes the change in popularity into account is more effective than one based on total popularity alone; we also propose an effective method for forecasting the decline in popularity of a new page.
The rest of the paper is organized as follows:
In the next section, we review previous work on indexing new pages and forecasting page popularity. In Section 3, we describe the principles and method of the indexing algorithm that we propose. In Section 4, we present the results of testing the new algorithm and compare it with the currently used strategy. Section 5 concludes the paper.
2. Previous studies
There are already a number of works devoted to forecasting popularity for various elements of the Internet: texts, news, social network users, tweets, Twitter hashtags, videos, etc. However, only a few works address page popularity calculated from user visits. One of them proposes a model that predicts, for a particular query, the number of clicks from search results to a given page, considering the query-page pair. This model is based on log data about the previously observed dynamics of the query and the clicks on the corresponding document. This approach therefore cannot be applied to predicting popularity for new pages, because the search engine does not yet have enough log data for them.
Another study focuses on recently discovered pages and predicts the traffic they will receive; however, the forecast is based only on the page URL. This is a genuinely important constraint for planning the indexing order, because we need to predict a page's popularity before we even start downloading it.
Our work continues this line of research: we predict the popularity of new pages over time by combining a forecast of a page's overall popularity with a forecast of the decline in that popularity.
Also, our machine-learning-based algorithm substantially improves on the current approach to predicting overall page popularity. Since predicting popularity from the page URL alone is a relatively new problem, there are several studies devoted to forecasting various page parameters from the URL before the content is downloaded, such as:
- webpage category
Some of these works suggest approaches that can be successfully reused to build our popularity forecasting model.
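As a concrete illustration of what "features available before download" means, here is a toy feature extractor. The specific features below are our own assumptions for illustration; the paper's actual feature set is not described in this excerpt:

```python
from urllib.parse import urlparse

def url_features(url):
    """Extract simple features from a URL without fetching the page.
    These could feed a machine-learned popularity predictor."""
    parts = urlparse(url)
    path_tokens = [t for t in parts.path.split("/") if t]
    return {
        "domain": parts.netloc,                              # site-level signal
        "path_depth": len(path_tokens),                      # how deep the page sits
        "has_digits": any(c.isdigit() for c in parts.path),  # dates/IDs often mean news
        "url_length": len(url),
    }

print(url_features("https://news.example.com/2014/03/article-title"))
```

Features like these, possibly combined with domain-level statistics and anchor text, are the kind of input such a model can use when only the URL is known.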
The pioneering work  proposes evaluating crawl effectiveness by the usefulness of indexed pages to search users, relying on a specific ranking method and search query logs. The authors define the quality of a page as its average contribution across all user queries and compare how this indicator changes under different crawl scheduling strategies. They propose an algorithm for efficiently re-crawling pages in order to keep their local copies fresh; the benefit of re-crawling a particular page is estimated from logs reflecting the benefit [to the search engine] of its previous crawls. Because of this limitation, the work does not consider the indexing of new pages.
Our work, in contrast, focuses on predicting the usefulness of a new page, which must be based on parameters of its URL that we can determine without downloading the page. The question of the order in which to send new URLs for indexing was considered in . In our work, as in , the effectiveness of the whole algorithm is measured by the usefulness of indexed pages under the existing ranking method and the search query logs. When this is applied to new pages, their expected utility must be computed from the page URL alone, together with incoming links, domain indicators, and the corresponding anchor texts.
The method of evaluating an indexing strategy proposed in  and  can be interpreted as the expected number of clicks an indexed page will receive under the existing ranking method, based on the search query logs collected over a certain time period. Indeed, if the query set Q consists of queries and their frequencies, the authors define the overall usefulness of page p as:

U(p) = Σ_{q ∈ Q} f(q) · I(p, q),

where f(q) is the frequency of query q, and I(p, q) is the probability that document p will be clicked on the SERP generated by the current ranking method in response to query q. The query set Q is assumed to be drawn from real user query logs over a period close to the present moment. Thus, the usefulness of page p is the expected frequency of user clicks on this page from the search results. Unlike  and , we measure not only the current popularity of pages but their overall usefulness for the search engine's performance indicators, for example the number of future visits. Our quality measure is therefore based on the total gain the search engine obtains by indexing a given page, not just the gain at the current moment. In particular, our approach takes into account that each page loses its popularity at its own rate.
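The utility measure above is straightforward to compute given a query log and per-query click probabilities. A minimal sketch, with names and data of our own choosing:

```python
def page_utility(query_log, click_prob):
    """U(p) = sum over queries q of f(q) * I(p, q):
    query frequency times the probability that page p is clicked
    on the SERP for that query."""
    return sum(freq * click_prob.get(q, 0.0) for q, freq in query_log.items())

# Hypothetical log: query -> frequency, and p's click probability per query.
query_log = {"cat videos": 100, "news today": 50}
click_prob = {"cat videos": 0.1, "news today": 0.4}
print(page_utility(query_log, click_prob))  # 100*0.1 + 50*0.4 = 30.0
```

Queries for which the page never appears (or is never clicked) simply contribute zero, which is why a page's utility is dominated by the few queries where it ranks well.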
In , strategies were proposed for crawling pages in which user interest has recently appeared. That work also considers how to distribute crawler capacity between indexing new pages and re-indexing old ones (in order to discover new links). Nevertheless, in , the popularity of new pages was predicted only from data about the domains linking to them (more precisely, the pages on which the links were found). Our work offers a forecasting model that makes it possible to decide which page to index first even when the links were found on the same page or on similar clusters of pages.
From the translators: the rest of the text covers the algorithm for solving this problem, with all the relevant mathematical details. Is the part of the article above enough for you, or would you like to learn all the details of the study? Your opinion is important to us!