A sentiment analysis system based on pointwise mutual information
Hello.
If you work in data mining or opinion mining, or are simply interested in statistical models for estimating the sentiment of sentences, this article may be of interest.
To avoid wasting the potential reader's time on a pile of theory and reasoning, here are the results up front.
The implemented approach achieves roughly 55% accuracy over three classes: negative, neutral, positive. According to Wikipedia, around 70% accuracy is comparable to the average accuracy of human judgments (owing to the subjectivity of every interpretation).
It should be noted that many tools achieve higher accuracy than I obtained, but the described approach can be improved fairly easily (as described below) to reach 65-70%. If after all of the above you still want to keep reading, welcome.
Summary of Principle
To determine the sentiment orientation (SO) of a sentence, you need to understand what it is about. That much is obvious. But how do you explain to a machine what is good and what is bad?
The first option that comes to mind is to sum the counts of bad/good words, each multiplied by its weight: the so-called "bag of words" approach. It is a surprisingly simple and fast algorithm which, combined with rule-based preprocessing, gives good results (up to 60-80% accuracy depending on the corpus). In essence, this is a unigram model, which means that in the most naive case the sentences "This product rather good than bad" and "This product rather bad than good" receive the same SO. This problem can be solved by moving from a unigram to a multinomial (n-gram) model. Note also that the approach requires a solid, constantly updated dictionary of bad and good terms together with their weights, which may be specific to the data.
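As an illustration, a minimal bag-of-words scorer might look like this (a sketch only; the lexicon and weights are invented for the example, and Java 9+ is assumed for Map.of):

```java
import java.util.Map;

// Minimal "bag of words" scorer: sum the weights of known sentiment words.
// The lexicon is a toy example; a real dictionary would be large and domain-specific.
public class BagOfWords {
    private static final Map<String, Double> WEIGHTS = Map.of(
            "good", 1.0, "excellent", 2.0,
            "bad", -1.0, "terrible", -2.0
    );

    public static double score(String sentence) {
        double so = 0.0;
        for (String token : sentence.toLowerCase().split("\\s+")) {
            so += WEIGHTS.getOrDefault(token, 0.0);
        }
        return so;
    }

    public static void main(String[] args) {
        // Both print 0.0: a unigram model is blind to word order.
        System.out.println(score("this product rather good than bad"));
        System.out.println(score("this product rather bad than good"));
    }
}
```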
An example of the simplest multinomial model is the naive Bayes method. Several articles on Habr are dedicated to it, in particular this one.
The advantage of the multinomial model over the unigram model is that it can take the context of a statement into account. This solves the problem with the sentences above, but introduces a new limitation: if a selected n-gram does not appear in the training set, its SO on the test data will be 0. This problem has always existed and always will. It can be addressed in two ways: by increasing the size of the training set (keeping in mind that overfitting can creep in along the way), or by using smoothing (for example, Laplace or Good-Turing).
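For reference, add-one (Laplace) smoothing is essentially a one-liner; a sketch over raw training-set counts:

```java
public class Smoothing {
    // Add-one (Laplace) smoothing: an n-gram unseen in training gets a small
    // non-zero probability instead of zero.
    // ngramCount     - occurrences of the n-gram in the training set
    // totalNgrams    - total number of n-grams of this order in the training set
    // vocabularySize - number of distinct n-grams (vocabulary size)
    public static double laplace(long ngramCount, long totalNgrams, long vocabularySize) {
        return (ngramCount + 1.0) / (totalNgrams + vocabularySize);
    }
}
```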
And so we finally arrive at the idea of PMI.
Alongside the Bayes formula, we introduce the concept of PMI, pointwise mutual information:

$$\mathrm{PMI}(A, B) = \log_2 \frac{P(A\ \text{near}\ B)}{P(A)\,P(B)}$$
In the formula above, A and B are words/bigrams/n-grams; P(A) and P(B) are the prior probabilities of terms A and B in the training set (the number of occurrences divided by the total number of words in the corpus); P(A near B) is the probability of term A occurring together with or next to term B. "Near" can be configured manually; by default the distance is 10 terms to the left and right. The base of the logarithm does not matter; for simplicity we take it equal to 2.
A positive logarithm means A is positively colored relative to B; a negative one means the opposite.
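Translated directly into code, the formula is simple; a sketch, assuming the counts come from the training corpus (for instance, from the inverted index described below):

```java
public class Pmi {
    // PMI(A, B) = log2( P(A near B) / (P(A) * P(B)) )
    // countA, countB - occurrences of terms A and B in the corpus
    // countNear      - co-occurrences of A and B within the chosen window
    // totalTerms     - total number of terms in the corpus
    public static double pmi(long countA, long countB, long countNear, long totalTerms) {
        double pA = (double) countA / totalTerms;
        double pB = (double) countB / totalTerms;
        double pNear = (double) countNear / totalTerms;
        // Returns -Infinity when countNear is 0; in practice this is where smoothing steps in.
        return Math.log(pNear / (pA * pB)) / Math.log(2.0);
    }
}
```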
To pick out neutral reviews, you can use a window around zero (in this work the segment [-0.154, 0.154] plays this role). The window can be either constant or floating, depending on the data (shown below).
From the above, we arrive at the following:

$$\mathrm{SO}(A) = \mathrm{PMI}(A,\ \text{good}) - \mathrm{PMI}(A,\ \text{bad})$$
Indeed, to determine which class the statements "good weather" and "fast ride" belong to, it is enough to check how often they occur in the training set next to known good and bad words (chosen manually depending on the data, the model and the test set) and take the difference.
Let's go a little further: instead of comparing against a single support word on each of the negative and positive sides, we will use sets of unambiguously good and bad words. Here, for example, the following words were used:
Positive: good, nice, excellent, perfect, correct, super
Negative: bad, nasty, poor, terrible, wrong, awful
Accordingly, the final formula is:

$$\mathrm{SO}(A) = \sum_{p\,\in\,\mathit{pos}} \mathrm{PMI}(A, p) \;-\; \sum_{n\,\in\,\mathit{neg}} \mathrm{PMI}(A, n)$$
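Put together, the SO computation and the neutral window look roughly like this (a sketch; CorpusStats is a hypothetical interface over the indexed training corpus, and pmi is the function from the previous sketch):

```java
import java.util.List;

public class SentimentOrientation {

    // Hypothetical lookup into the indexed training corpus.
    interface CorpusStats {
        long count(String term);            // occurrences of a term
        long countNear(String a, String b); // co-occurrences within the window
        long totalTerms();                  // corpus size in terms
    }

    static final List<String> POSITIVE = List.of("good", "nice", "excellent", "perfect", "correct", "super");
    static final List<String> NEGATIVE = List.of("bad", "nasty", "poor", "terrible", "wrong", "awful");

    // SO(A): summed PMI against the positive set minus the negative set.
    static double so(String a, CorpusStats c) {
        double so = 0.0;
        for (String p : POSITIVE)
            so += pmi(c.count(a), c.count(p), c.countNear(a, p), c.totalTerms());
        for (String n : NEGATIVE)
            so -= pmi(c.count(a), c.count(n), c.countNear(a, n), c.totalTerms());
        return so;
    }

    // The neutral window [-0.154, 0.154] from the text.
    static String classify(double so) {
        if (so > 0.154) return "positive";
        if (so < -0.154) return "negative";
        return "neutral";
    }

    static double pmi(long countA, long countB, long countNear, long totalTerms) {
        double pA = (double) countA / totalTerms;
        double pB = (double) countB / totalTerms;
        double pNear = (double) countNear / totalTerms;
        return Math.log(pNear / (pA * pB)) / Math.log(2.0);
    }
}
```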
So, the SO calculation is sorted out; but how do we choose the right candidates?
Suppose, for example, we have the sentence "Today is a wonderful morning, it would be nice to go to the lake".
It is reasonable to assume that the sentiment of a sentence is carried mostly by adjectives and adverbs. To exploit this, we build a finite state machine that, following given part-of-speech patterns, selects candidates for SO evaluation from the sentence (a sketch is shown below). It is easy to guess that a review is considered positive if the total SO of all its candidates is > 0.154.
The following part-of-speech templates were used in this work, in the spirit of the two-word patterns from Turney's paper (reference 3): adjective + noun, adverb + adjective, and the like.
In this case, the candidates will be:
1. a wonderful morning
2. a good trip
It remains only to put everything together and test it.
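A sketch of such a candidate selector over an already-tagged sentence (Penn Treebank tags are assumed, and only two of the patterns are shown, not the article's full table):

```java
import java.util.ArrayList;
import java.util.List;

public class CandidateSelector {
    // Scan part-of-speech tags and emit two-word candidates matching simple
    // patterns: adjective + noun (JJ* NN*) and adverb + adjective (RB* JJ*).
    public static List<String> candidates(String[] tokens, String[] tags) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            boolean adjNoun = tags[i].startsWith("JJ") && tags[i + 1].startsWith("NN");
            boolean advAdj  = tags[i].startsWith("RB") && tags[i + 1].startsWith("JJ");
            if (adjNoun || advAdj) {
                result.add(tokens[i] + " " + tokens[i + 1]);
            }
        }
        return result;
    }
}
```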
Implementation
You will find the Java sources here. Don't expect much beauty; the code was written simply to try the idea out and decide whether it would be used further.
Corpus: Amazon Product Review Data (>5.8 million reviews), liu2.cs.uic.edu/data
An inverted index was built over this corpus using Lucene, and all searches were run against it.
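The two counts that PMI needs map naturally onto Lucene queries; a sketch, assuming Lucene 8.x (package names vary between versions) and a text field named "review" (the field name is an assumption):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.FSDirectory;

public class LuceneCounts {
    private final IndexSearcher searcher;

    public LuceneCounts(String indexPath) throws Exception {
        searcher = new IndexSearcher(DirectoryReader.open(FSDirectory.open(Paths.get(indexPath))));
    }

    // Documents containing the term (a common approximation of occurrence counts).
    public long count(String term) throws Exception {
        return searcher.count(new TermQuery(new Term("review", term)));
    }

    // Documents where a and b occur within 10 positions of each other,
    // in either order: the "near" window from the PMI definition.
    public long countNear(String a, String b) throws Exception {
        SpanQuery[] clauses = {
                new SpanTermQuery(new Term("review", a)),
                new SpanTermQuery(new Term("review", b))
        };
        return searcher.count(new SpanNearQuery(clauses, 10, false));
    }
}
```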
When data was missing from the index, the Google (API) and Yahoo! search engines were used as a fallback (with their around and near operators, respectively). Unfortunately, because of slow response times and inaccurate results (for high-frequency queries search engines return only an approximate hit count), this solution is far from perfect.
The OpenNLP library was used for tokenization and part-of-speech tagging.
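A sketch of that pipeline using OpenNLP's standard pre-trained English models (en-token.bin and en-pos-maxent.bin, downloaded separately), feeding into the candidate selector above:

```java
import java.io.FileInputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class PosPipeline {
    public static void main(String[] args) throws Exception {
        // Pre-trained OpenNLP models; the file paths are assumptions for the example.
        TokenizerME tokenizer =
                new TokenizerME(new TokenizerModel(new FileInputStream("en-token.bin")));
        POSTaggerME tagger =
                new POSTaggerME(new POSModel(new FileInputStream("en-pos-maxent.bin")));

        String[] tokens = tokenizer.tokenize(
                "Today is a wonderful morning, it would be nice to go to the lake");
        String[] tags = tagger.tag(tokens); // Penn Treebank tags: JJ, NN, RB, ...

        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + "/" + tags[i]);
        }
    }
}
```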
What can be improved?
Based on the above, the most promising directions for improvement are:
1. Building a more complete part-of-speech analysis tree for filtering candidates
2. Using a larger corpus as the training set
3. If possible, using a training corpus from the same social media as the test set
4. Forming the keyword sets (good/bad) according to the data source and subject matter
5. Embedding negation handling in the pattern parsing tree
6. Detecting sarcasm
Conclusions
Overall, a PMI-based system can compete with systems built on the "bag of words" principle, but ideally the two should complement each other: when data is missing from the training set, the word-counting system should step in.
References:
1. Introduction to Information Retrieval. C. Manning, P. Raghavan, H. Schütze
2. Foundations of Statistical Natural Language Processing. C. Manning, H. Schütze
3. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. Peter D. Turney