About information retrieval, finding optimal ways to view search results and much more
Finding optimal ways to view search results is the main topic of my candidate's dissertation. Today I want to share interim research results, as well as the applications and SDKs that were used in this work.
I decided to write this article after watching a seminar from the series "Information Retrieval and Data Analysis" on the topic "Semantic analysis of texts using Wikipedia". The speaker was Maxim Grinev, Associate Professor, Senior Lecturer at the Department of System Programming, and head of a department at ISP RAS.
You can watch the report, download it, or see the schedule of other talks.
Brief scientific findings from the workshop
The contents of the workshop and the main results obtained will be briefly described below.
The report examined new approaches and methods for semantic search and principles for assessing semantic relatedness based on data from the English Wikipedia.
The main premises used in the report: a term (described by a corresponding article) can have several meanings, so the most relevant article has to be selected; a term's article links to other terms (articles) both within the body text and in blocks such as "See also", "Links", and so on.
The semantic distance between articles A and B can be calculated by counting the number of articles that both A and B refer to:
Figure 1
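To make this concrete, here is a minimal sketch of the relatedness measure described above: two articles are considered close when they reference many of the same articles. The Dice-style normalization is my own assumption; the talk only describes counting common referenced articles.

```python
def semantic_relatedness(links_a: set, links_b: set) -> float:
    """links_a, links_b: sets of article titles referenced by articles A and B."""
    if not links_a or not links_b:
        return 0.0
    common = links_a & links_b
    # Dice-like normalization of the common-reference count (an assumption;
    # the raw count of common articles could be used just as well)
    return 2 * len(common) / (len(links_a) + len(links_b))

# toy usage
print(semantic_relatedness({"Graph", "Cluster", "Wikipedia"}, {"Graph", "Wikipedia", "Search"}))
```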
Semantic relatedness is a key element of the methods described below. I will note that the SimRank method [1], presented last year, was not considered successful.
From the author: in addition to semantic relatedness, the distance between two web documents can be determined using the shingle method or Pearson's chi-square test. On this topic I have also published the article "Method for assessing the similarity of web pages" [2], which describes a generalized method for assessing the similarity of web pages based on several semantic features.
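For illustration, here is a small sketch of the shingle method just mentioned, assuming word-level shingles and Jaccard similarity over the shingle sets; the shingle length and the similarity measure are illustrative choices.

```python
def shingles(text: str, w: int = 4) -> set:
    """Split a document into overlapping word w-shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))}

def shingle_similarity(text_a: str, text_b: str, w: int = 4) -> float:
    """Jaccard similarity of the two documents' shingle sets."""
    a, b = shingles(text_a, w), shingles(text_b, w)
    return len(a & b) / len(a | b) if a | b else 0.0
```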
Next came the extraction of key terms for a given document and the construction of so-called communities, or semantic graphs [3]:
Figure 2
The essence of these graphs is that the terms (articles) belonging to a single community fall under some common category; in effect, the classical text classification problem is being solved. This category can be computed by finding a common "parent" category that covers the selected terms. To detect communities, a clustering method is used (one that does not require specifying the number or size of clusters), and communities with a low rank are discarded.
An example of a real semantic graph:
Figure 3
In the course of the research it turned out that "good" communities have a much higher rank than the remaining, less relevant ones.
Figure 4
This approach makes it possible to filter out non-core content (header, footer, "see also" blocks), since the communities built from the terms of these blocks have a low rank and are therefore eliminated during the calculations.
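As a rough sketch of this rank-based filtering: if each community is assigned a rank (below I simply combine community size with the average pairwise relatedness of its terms, which is my own assumption rather than the formula from the talk), then only communities whose rank stands well above the average are kept.

```python
from itertools import combinations

def community_rank(terms, relatedness):
    """terms: list of terms in the community; relatedness(a, b) -> float."""
    if len(terms) < 2:
        return 0.0
    pairs = list(combinations(terms, 2))
    # size times average pairwise relatedness (an illustrative ranking)
    return len(terms) * sum(relatedness(a, b) for a, b in pairs) / len(pairs)

def keep_good_communities(communities, relatedness, factor=2.0):
    ranks = [community_rank(c, relatedness) for c in communities]
    mean_rank = sum(ranks) / len(ranks)
    # keep only communities whose rank stands out sharply (cf. Figure 4)
    return [c for c, r in zip(communities, ranks) if r > factor * mean_rank]
```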
Remarks and comments
After watching the report I had a sense of déjà vu, since I do many similar things in my own work, and some of the results overlap with it heavily.
First of all, I want to focus on some of the weaknesses of the described techniques and methods:
- all the methods are tied to Wikipedia. This means we assume a priori that the scope of knowledge is limited to Wikipedia and that this knowledge is authoritative;
- Wikipedia is a strictly structured resource, i.e. erroneous classification of knowledge within it is practically absent. This greatly simplifies the preprocessing of texts in classical classification and clustering problems (in effect this step is skipped, since all the work was done before the analysis began, and it is known that the quality of preprocessing is crucial when analyzing large data sets);
- Wikipedia has a uniform structure (layout), which makes it possible to strip an article of unimportant information with very high accuracy;
Below I will describe my own developments in the context of the above methods.
Advanced Clustering Method
The method is a refinement of the classical k-means algorithm, which is simple to implement but not very accurate. There are two reasons for the inaccuracy: the algorithm is sensitive to the choice of the starting points and of the number of clusters, so if these inputs are chosen poorly, the quality of the result suffers.
In my paper "The clustering method based on clusters distributed according to the normal law" [4], it was proposed to check the distribution law of objects within clusters. If a cluster follows the expected distribution, we keep it; if not, we split it into two child clusters, and the check continues until all clusters follow the distribution or the limit on the number of clusters is exceeded. This solves the problem of choosing the number of clusters. The problem of choosing the starting points is solved by taking the maximally separated points within the larger cluster as the initial centers. On test data the method showed 95% accuracy.
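A minimal sketch of this idea in Python is shown below, assuming scikit-learn for k-means and SciPy's normality test as the distribution check; the sample-size threshold and significance level are illustrative, and the real method in [4] may differ in details.

```python
import numpy as np
from scipy.stats import normaltest
from sklearn.cluster import KMeans

def split_cluster(points):
    """Split one cluster in two, seeding k-means with the two most distant points.
    points: array of shape (n_samples, n_features)."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(dists), dists.shape)
    init = np.vstack([points[i], points[j]])
    labels = KMeans(n_clusters=2, init=init, n_init=1).fit_predict(points)
    return points[labels == 0], points[labels == 1]

def adaptive_kmeans(points, max_clusters=20, alpha=0.05):
    """Keep splitting clusters that fail the normality test, up to max_clusters."""
    clusters = [points]
    while len(clusters) < max_clusters:
        for idx, cluster in enumerate(clusters):
            # normaltest needs a reasonable sample size; each coordinate is tested
            if len(cluster) >= 20 and any(p < alpha for p in np.atleast_1d(normaltest(cluster).pvalue)):
                clusters[idx:idx + 1] = list(split_cluster(cluster))
                break
        else:
            break  # every cluster looks normally distributed; stop
    return clusters
```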
The importance of site information blocks
When we talk about a modern web page, we mean not only the main content we actually came for, but also a lot of auxiliary information on the sides, at the bottom, and so on. These information blocks serve different purposes: lists of links, statistics, related articles, advertising. Clearly, this content is much less important than the main content.
I developed a method for clearing web pages of information noise [5], which I have written about in general terms before [6]. Let me dwell on an important point: the procedure for identifying "important" blocks is based on a fuzzy clustering method; in simple terms, it looks for blocks with the highest numerical ratings, ones that differ sharply from the others. Essentially the same approach was used to identify "good" communities, which Maxim Grinev talked about (see Figure 4).
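In simplified form, the selection of "important" blocks can be sketched as follows. Here each block already has a single numerical rating, and I keep the blocks whose ratings deviate strongly upward from the rest; the real method in [5] uses fuzzy clustering over several features, so treat this as an illustration of the principle only.

```python
import statistics

def important_blocks(block_scores: dict, k: float = 1.5) -> list:
    """block_scores: mapping of block id -> numeric rating (e.g. text length, link density)."""
    scores = list(block_scores.values())
    mean, stdev = statistics.mean(scores), statistics.pstdev(scores)
    # keep blocks whose ratings stand far above the typical block
    return [block for block, s in block_scores.items() if s > mean + k * stdev]

# toy usage: the long "content" block stands out against navigation and footer blocks
print(important_blocks({"nav": 40, "content": 5200, "sidebar": 300, "footer": 80}))
```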
The prototype is available for download on the CodePlex website:
Let's take a closer look at Figure 3, the graph of relationships between terms. In fact, this is nothing more than a graph of relationships between the web pages that correspond to particular terms.
We developed a similar system for assessing relationships between sites, but in our case we use search engines as the data source (it makes no fundamental difference to us which search engine the results come from).
A real example of the system running on the query "Microsoft" can be seen below:
Figure 5 (in this case, a shallow analysis of links is selected)
If you look closely, you can also see some "communities" that indicate membership in different categories. Accordingly, by extracting keywords (or other properties of each web page), you can cluster the search results at runtime. Note that this works for the entire web, not just Wikipedia.
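As a sketch of how such a relationship graph over search results might be assembled (the real system takes results from a search engine API and can perform deeper link analysis; the `results` dictionary and the shared-link criterion below are my own illustrative assumptions):

```python
import itertools
import networkx as nx

# url -> set of outgoing links extracted from that result page (assumed input)
results = {
    "https://a.example": {"https://x.example", "https://y.example"},
    "https://b.example": {"https://x.example", "https://z.example"},
    "https://c.example": {"https://q.example"},
}

graph = nx.Graph()
graph.add_nodes_from(results)
# connect two results when they reference common pages; overlap size becomes the edge weight
for (u, links_u), (v, links_v) in itertools.combinations(results.items(), 2):
    common = len(links_u & links_v)
    if common:
        graph.add_edge(u, v, weight=common)

# rough "communities" of related results
for community in nx.connected_components(graph):
    print(community)
```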
Finding the Best Way to View Search Results
Now I have come to the most interesting part, since everything described above does not mean much by itself.
Once we have the relationship graph and information about each web page (Google PageRank, distances between web pages, page "weight"), we can find the best way to view the search results on the graph. In other words, instead of a linear list of sites ranked by some algorithm (weight), we get a set of recommendations about the order in which the web search results should be viewed [7].
To achieve this, we use a modified ant colony algorithm that simulates user behavior, namely random transitions while surfing. Each path is evaluated with a scoring formula, and in the end we obtain the optimal path (the amount of information, the amount of duplicated information, and several other parameters are taken into account).
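Below is a rough Python sketch of this ant-colony-style search over the result graph. The path scoring (information gained minus duplicated information), the pheromone parameters, and the `info`/`overlap` inputs are illustrative assumptions, not the exact formula from [7].

```python
import random

def find_viewing_order(pages, info, overlap, n_ants=50, n_iters=30, evaporation=0.3):
    """pages: list of page ids; info[p]: information value of page p;
    overlap[(p, q)]: duplicated information between consecutive pages p and q."""
    pheromone = {(p, q): 1.0 for p in pages for q in pages if p != q}
    best_path, best_score = None, float("-inf")

    def score(path):
        gained = sum(info[p] for p in path)
        duplicated = sum(overlap.get((path[i], path[i + 1]), 0.0) for i in range(len(path) - 1))
        return gained - duplicated  # reward new information, penalize duplication

    for _ in range(n_iters):
        for _ in range(n_ants):
            # each "ant" simulates a user randomly walking through the results,
            # preferring transitions with more pheromone
            path = [random.choice(pages)]
            while len(path) < len(pages):
                current = path[-1]
                candidates = [q for q in pages if q not in path]
                weights = [pheromone[(current, q)] for q in candidates]
                path.append(random.choices(candidates, weights=weights)[0])
            s = score(path)
            if s > best_score:
                best_path, best_score = path, s
        # evaporate pheromone everywhere, then reinforce the best path found so far
        for edge in pheromone:
            pheromone[edge] *= (1 - evaporation)
        for i in range(len(best_path) - 1):
            pheromone[(best_path[i], best_path[i + 1])] += max(best_score, 1e-3)
    return best_path
```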
In addition, the user can select:
- depth of analysis
- maximum (optimal) number of sites
- choose what matters more: speed of execution (i.e. queries like "what is Wikipedia") or an in-depth analysis of a particular topic (for example, "African economies")
- and, of course, save all of this in personal settings, thereby tuning the algorithm to themselves
Conclusions
Thus, the methods and algorithms considered here make it possible to extract knowledge not only from Wikipedia but from the entire web. The ideas and methods presented at the seminar and those we obtained largely coincide, and in some respects ours go even further. Among the shortcomings I would name the inability to test the methods on the large volumes of data that Yandex has at its disposal for its research, and the need to spend effort on formalities rather than on the essence of the problems.
I hope this article will help assess the state of affairs in the field of information retrieval.
P.S. I cannot help mentioning the Data Extracting SDK I developed, with which almost all of the applications used in this research were written.
P.P.S. If something is unclear, or you would like to get acquainted with some of the methods (ideas) in more detail, write in the comments and I will try to answer.
[1] Dmitry Lizorkin, Pavel Velikhov, Maxim Grinev, Denis Turdakov. Accuracy Estimate and Optimization Techniques for SimRank Computation. VLDB 2008.
[2] Method for assessing the similarity of web pages.
[3] Maria Grineva, Maxim Grinev, Dmitry Lizorkin. Extracting Key Terms From Noisy and Multitheme Documents. WWW 2009: 18th International World Wide Web Conference.
[4] Krakovetsky O.Yu. The clustering method based on clusters distributed according to the normal law // International Scientific and Technical Journal "Information Technologies and Computer Engineering". - 2008. - No. 1 (11). - pp. 56-60.
[5] Method for clearing web pages of information noise.
[6] V.M. Dubovoi, O.Yu. Krakovetsky, O.V. Glon. Factor analysis of evaluations of important information blocks of sites // News of the Vinnitsa Polytechnic Institute. - 2008. - No. 6. - pp. 103-107.
[7] Volodymyr Dubovoi, Oleksandr Krakovetsky, Olga Glon. Method of finding optimal ways to view web search results based on heuristic algorithms.