Improving text classification quality by plugging in Wikipedia
We use a large structured source of multilingual texts, Wikipedia, to improve text classification. The appeal of the approach is its high degree of automation and its independence from the particular classification problem being solved; the greatest effect, however, is expected on topic-detection tasks.
The main idea is to extract from Wikipedia only those texts that help solve our classification problem and to ignore the rest. If we classify texts about cats, we are unlikely to need articles on quantum physics, although articles on other kinds of animals may well be useful. Automatically separating the useful texts from the rest is the essence of the approach described here.
Wikipedia, as you know, is a collection of articles covering many areas of knowledge and interest. A significant share of the articles carry interlanguage links to articles on the same subject in other languages; these are not translations, just articles about the same topic. Most articles also belong to one or more categories, and the categories themselves are largely organized into a hierarchical tree. So the task of grouping Wikipedia articles by the topics we care about is solvable.
We use DBpedia, a pre-parsed and structured version of Wikipedia. DBpedia gives us everything we need: article titles, their abstracts, article categories, and the parent categories of those categories. We start with the most widely represented language on Wikipedia, English. If your task has no, or few, English-language texts, start with the language for which you have many documents.
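For illustration, here is one possible way to pull these fields from the public DBpedia SPARQL endpoint with the SPARQLWrapper library; in practice the bulk dump files (abstracts, article categories, category hierarchy) are more convenient for the full volume. The LIMIT and the variable names are placeholders.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Pull article URIs, English abstracts and categories from the public endpoint.
endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?article ?abstract ?category WHERE {
        ?article dbo:abstract ?abstract ;
                 dct:subject  ?category .
        FILTER (lang(?abstract) = "en")
    } LIMIT 10000
""")
rows = endpoint.query().convert()["results"]["bindings"]
articles = [(r["article"]["value"], r["abstract"]["value"], r["category"]["value"])
            for r in rows]
```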
Step 1. Clustering Wikipedia
We focus on article categories and, for now, ignore article content. The categories form a graph, mostly tree-like, though with some cycles. Articles are the end points of the graph (leaves), each connected to one or more category nodes. We use Node2Vec to obtain a vector representation of every category and every article; articles on similar subjects end up close together in the vector space.
We then cluster the articles by any convenient method into a fairly large number of clusters (hundreds).
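A minimal sketch of this step, assuming the DBpedia data has already been turned into two hypothetical edge lists: category_edges (child category, parent category, from skos:broader) and article_edges (article, category, from dct:subject). The embedding and cluster parameters are placeholders, not tuned values.

```python
import networkx as nx
from node2vec import Node2Vec          # pip install node2vec
from sklearn.cluster import KMeans

# Build the category graph: categories linked to their parent categories,
# articles attached as leaves to their categories.
graph = nx.Graph()
graph.add_edges_from(category_edges)
graph.add_edges_from(article_edges)

# Embed every node of the graph, articles and categories alike.
embedder = Node2Vec(graph, dimensions=64, walk_length=30, num_walks=10, workers=4)
model = embedder.fit(window=10, min_count=1)

# Cluster only the article nodes into a few hundred clusters.
article_nodes = sorted({article for article, _ in article_edges})
vectors = [model.wv[node] for node in article_nodes]
labels = KMeans(n_clusters=300, random_state=0).fit_predict(vectors)
cluster_of = dict(zip(article_nodes, labels))
```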
Step 2. Training a classifier on Wikipedia
In the resulting clusters we replace article titles with their abstracts (Long Abstract and Short Abstract, roughly one paragraph of text per article). Now we have hundreds of clusters defined as sets of texts. Using any convenient model, we build a multiclass classifier: one cluster, one class. We used FastText.
The output is a model that takes a text as input and returns a vector of scores estimating how strongly the text belongs to each of our hundreds of cluster classes.
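A minimal sketch of this step with the fastText Python bindings, assuming cluster_texts is a hypothetical dict mapping a cluster id (0..N-1) to the list of abstracts of its articles.

```python
import fasttext

# Write the training file in fastText supervised format: one labeled line per abstract.
with open("wiki_clusters.train", "w", encoding="utf-8") as f:
    for cluster_id, texts in cluster_texts.items():
        for text in texts:
            f.write(f"__label__{cluster_id} {' '.join(text.split()).lower()}\n")

model = fasttext.train_supervised("wiki_clusters.train", epoch=10, wordNgrams=2)

# For any input text, k=-1 returns the score of every cluster class.
labels, scores = model.predict("text of an arbitrary document", k=-1)
```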
If, in the first step, we clustered Wikipedia articles by their content rather than by their categories, then, first, we would lose the category information, which matters, and second, we would get a degenerate system that both clusters on the texts and builds the classifier on the same texts. The final quality would probably be worse than with the separate approach, although I did not check this.
Step 3. Building a model on your own production data
We take a sample of our own production data and feed each document to the model from step 2. The model returns a vector of scores, which we use as the feature vector for that document. After processing the entire training sample of production documents, we get a table in the standard machine-learning form: a class label plus a set of numerical features. We call this table the training set.
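A sketch of how such a table can be assembled, assuming model is the fastText classifier from step 2, cluster ids run from 0 to N-1 as in the sketch above, and production_docs / production_labels are hypothetical lists of our own documents and their class labels.

```python
import numpy as np

n_clusters = len(model.get_labels())

def cluster_scores(text):
    # Score vector of the document over all Wikipedia clusters, in a fixed order.
    labels, probs = model.predict(" ".join(text.split()), k=-1)
    by_label = dict(zip(labels, probs))
    return np.array([by_label.get(f"__label__{i}", 0.0) for i in range(n_clusters)])

X = np.vstack([cluster_scores(doc) for doc in production_docs])  # numerical features
y = np.array(production_labels)                                  # class labels
```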
On this training set we build a classifier that can estimate the informativeness of individual features; decision trees and any random-forest variation of them work well. The most informative features turn out to be those clusters of Wikipedia articles whose topics are not only close to the topics of our production documents but, more importantly, allow our production classes to be separated well. In the first iterations the feature-importance histogram is usually quite flat: a few informative clusters, then a long tail of the remaining hundreds of features with almost equal importance.
After studying the feature-importance histogram, we picked an inflection point empirically each time, and roughly 10 to 30% of the clusters went on to the next iteration. The essence of an iteration: the articles from the selected informative clusters are pooled and fed back through steps 1-3, where they are clustered again, the two classifiers are rebuilt, and everything ends with another analysis of the importance histogram. It usually takes 3-4 iterations.
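A sketch of the selection step, assuming X and y are the training set from step 3. The 20% cut-off here stands in for the inflection point that was actually picked by eye from the histogram.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, y)

# Rank Wikipedia clusters by feature importance and keep the top fraction
# for the next iteration of steps 1-3.
importances = forest.feature_importances_
order = np.argsort(importances)[::-1]
selected_clusters = order[: int(0.2 * len(order))]
```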
It turned out that on our data numeric tokens, especially year numbers, carry a very strong weight and drag the informativeness of an entire cluster onto themselves. As a natural result, the clusters devoted to annual sporting events became the most informative: lots of numbers and dates, narrow vocabulary. I had to strip all numbers from the article abstracts (in step 2). Things got noticeably better: clusters of articles with genuinely relevant subjects began to stand out, as we had imagined it. At the same time, unexpected clusters appeared that logically matched our production task and had the right vocabulary, but whose usefulness would have been very hard to guess a priori.
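The fix itself is trivial; something like this one-liner, applied to the abstracts before step 2, is enough:

```python
import re

def strip_numbers(text):
    # Drop digit sequences (years, dates, scores) so they cannot dominate a cluster's vocabulary.
    return re.sub(r"\d+", " ", text)
```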
Step 4. Finalizing the model
After several iterations of steps 1-3, we are left with a reasonable number of Wikipedia articles whose topics help separate our production documents. We expand this selection with the corresponding articles in the other languages of interest and build the final clusters, this time only tens of them. These clusters can be used in two ways: either build a classifier as in step 2 and use it to extend the numerical feature vector in your production task, or use these text sets as a source of additional vocabulary and plug them directly into your production classifier. We took the second route.
Our production classifier is an ensemble of two models: a truncated naive Bayes and XGBoost. The naive Bayes works on long n-grams, from 1 to 16 elements; each matched n-gram tilts the total toward one of the classes, but Bayes does not make the final decision, it only outputs the sum of the n-gram weights for each class. XGBoost takes the Bayes output, the outputs of other classifiers, and some numerical attributes constructed independently of the text, and produces the final model and the final score. This scheme makes it easy to plug any set of texts into the n-gram Bayes model, including the selected sets of Wikipedia articles, while XGBoost looks for patterns in the form of typical reactions of the Wikipedia clusters to production texts.
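The production ensemble itself is not reproduced here, but the stacking layer looks roughly like this sketch, where bayes_scores (per-class sums of n-gram weights from the gram Bayes), wiki_scores (reactions of the Wikipedia clusters) and extra_features (text-independent numeric attributes) are hypothetical arrays of shape (n_docs, ...) and y are the class labels.

```python
import numpy as np
import xgboost as xgb

# Concatenate the outputs of the first-level models into one feature matrix.
stacked = np.hstack([bayes_scores, wiki_scores, extra_features])

booster = xgb.XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
booster.fit(stacked, y)

# The final decision for a document is XGBoost's prediction over the same
# concatenated vector of first-level scores.
final_scores = booster.predict_proba(stacked)
```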
Results and Conclusions
The first result gave a rise from a nominal 60% accuracy to 62%. Replacing the Wikipedia abstracts in step 4 with the downloaded full articles raised accuracy to 66%. The result is natural: an abstract is two or three sentences, while a full article is orders of magnitude larger. More linguistic material, higher effect.
One would expect that running the entire procedure on full article texts rather than abstracts would give an even larger quality gain, but here a purely technical problem arises: it is hard to download and process all of Wikipedia, or even a noticeable part of it (if you start from a later iteration rather than the first). Likewise, using not only English but all the languages of interest from the start should win something more; in that case the growth in processed volume is only severalfold, not by orders of magnitude as in the first case.
Semantic document vector
For each document we build a vector describing how the document relates to the given topics derived from Wikipedia categories. The vector is computed either by the method described in step 3 or by our n-gram Bayes. Production documents can then be clustered by these vectors, which groups them by subject. It only remains to attach hashtags, and every new document enters the database already tagged, so users can then search by tags. That is, if you expose the tags explicitly and visibly to the user; it looks fashionable, though I am not a fan.
Adaptive search
A more interesting use of semantic document vectors is adaptive search. By observing user activity, which documents the user lingers on and which they do not even open, you can outline the user's area of interest both in the long term (users have a division of responsibilities too, and everyone mostly searches for their own things) and within the current search session.
Documents on similar topics have similar semantic vectors with a high cosine similarity, which lets you score documents in the search results on the fly by how well they are expected to match the user's interests and thus promote the most relevant documents in the ranking.
As a result, even with identical queries, the results can be personalized for each user: depending on which documents interested the user at the previous step, the next search step adjusts to their needs even if the query itself has not changed.
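A sketch of how such on-the-fly re-ranking might look, where results is a hypothetical list of (doc_id, base_score, semantic_vector) tuples from the search engine, user_profile is the averaged semantic vector of documents the user has recently engaged with, and weight controls how strongly personalization bends the ranking.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def rerank(results, user_profile, weight=0.3):
    # Boost each document's base relevance by its similarity to the user's interests.
    rescored = [(doc_id, base + weight * cosine(vec, user_profile))
                for doc_id, base, vec in results]
    return sorted(rescored, key=lambda item: item[1], reverse=True)
```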
We are now working on the problem of adaptive search.
Business Hypothesis Testing
The business side periodically comes up with bright ideas that are very hard to implement: learn to find documents by their description alone, with neither a labeled sample for training nor the option of handing assessors some set of documents to label. This usually happens when the target documents are rare relative to the overall document flow, so if you hand assessors a pool of 10 thousand unfiltered documents, you may get back 1-2 relevant ones, or even fewer.
Our approach is an iterative learning process built on semantic vectors. In the first step we find several texts that define the target topic; these can be Wikipedia articles or texts from other sources. For each text we compute its semantic vector. If the target topic is compound, set algebra applies: union, intersection, exclusion of some topics from others. For example, given Wikipedia articles about "Research and Development" and about "Cosmetics", the intersection of the sets gives "R&D about cosmetics".
All documents in the database can be ranked by how well they match the given topics, and the set algebra then works on the documents themselves: a document is considered relevant to a topic if its semantic vector is closer to the vector of that topic's Wikipedia articles than the database average. For an intersection, the document's semantic vector must be closer to both topics than the database average at the same time; the other operations are analogous.
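A sketch of this set algebra on documents, assuming doc_vectors is a hypothetical matrix of semantic vectors for the whole database and topic_a, topic_b are the semantic vectors of the Wikipedia articles defining two topics.

```python
import numpy as np

def similarity_to_topic(doc_vectors, topic):
    # Cosine similarity of every document in the database to the topic vector.
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(topic) + 1e-9
    return doc_vectors @ topic / norms

sim_a = similarity_to_topic(doc_vectors, topic_a)
sim_b = similarity_to_topic(doc_vectors, topic_b)

in_a = sim_a > sim_a.mean()    # relevant to topic A: closer than the database average
in_b = sim_b > sim_b.mean()
intersection = in_a & in_b     # "A and B", e.g. R&D about cosmetics
difference = in_a & ~in_b      # "A but not B", i.e. exclusion
```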
We then select a hundred or two documents that are closest to all the positive topics and, at the same time, furthest from all the negative ones (if we are not interested in the financial side of the research we are looking for, we take an article from the "Finance" category as a negative example). We give these documents to assessors, they find several positive examples among them, we use those examples to find other documents with similar semantic vectors, label them as well, and in the end we have enough positive-class documents to build any convenient classifier. This may take several iterations.
Summary
The described approach makes it possible to select, automatically and without manual analysis, sets of texts from Wikipedia or another source that help solve a classification problem. Simply plugging the Wikipedia clusters into a working classifier can be expected to give a noticeable quality gain, without requiring any adaptation of the classifier itself.
Well, adaptive search is interesting.