A Weka project for the task of recognizing sentiment (tonality)
- Tutorial
This is a translation of my original publication in English.
The Internet is full of articles, notes, blogs, and success stories about using machine learning (ML) to solve practical problems. Some use it for good, or just to cheer people up, like this picture:

Still, for someone who is not an expert in these areas, it is sometimes not so easy to get started with the existing tools. There are, of course, good and relatively fast paths to practical machine learning, such as the scikit-learn Python library. Incidentally, this project contains code written by the SkyNet team (the author happened to be its lead participant) that illustrates how easy it is to interact with the library. If you are a Java developer, there are a couple of good tools: Weka and Apache Mahout. Both libraries are general-purpose in terms of applicability: from recommender systems to text classification. There are also toolkits sharpened specifically for text-oriented machine learning: Mallet and a set of Stanford libraries. There are lesser-known libraries such as Java-ML as well.
In this post, we will focus on the Weka library and build a draft, or template, project for text machine learning using a concrete example: the task of sentiment analysis, or sentiment detection. At the same time, the project is fully operational and is under a commercially friendly license (Weka itself is under GPL 3.0), i.e. if you really want to, you can even use the code in your own projects.
From the entire set of Weka algorithms that are generally suitable for the chosen problem, we will use the Multinomial Naive Bayes algorithm. In this post I cite almost the same links as in the English version. But since translation is a creative task, I will allow myself one on-topic link to a Russian-language resource on machine learning.
In my experience with machine learning tools, a programmer using one or another ML library is usually looking for solutions to three problems: configuring an algorithm, training an algorithm, and I/O, i.e. saving a trained model to disk and loading it back into memory. Besides these purely practical aspects, perhaps the most important one is assessing the quality of the model. We will touch on that as well.
So, in order.
Classification algorithm setup
Let's start with the task of recognizing sentiment in three classes.
public class ThreeWayMNBTrainer {
    private NaiveBayesMultinomialText classifier;
    private String modelFile;
    private Instances dataRaw;

    public ThreeWayMNBTrainer(String outputModel) {
        // create the classifier
        classifier = new NaiveBayesMultinomialText();
        // filename for outputting the trained model
        modelFile = outputModel;
        // listing class labels
        ArrayList<Attribute> atts = new ArrayList<Attribute>(2);
        ArrayList<String> classVal = new ArrayList<String>();
        classVal.add(SentimentClass.ThreeWayClazz.NEGATIVE.name());
        classVal.add(SentimentClass.ThreeWayClazz.POSITIVE.name());
        atts.add(new Attribute("content", (ArrayList<String>) null));
        atts.add(new Attribute("@@class@@", classVal));
        // create the instances data structure
        dataRaw = new Instances("TrainingInstances", atts, 10);
    }
}
In the above code, the following happens:
- An object of the classification algorithm class is created (we love puns)
- A list of target class labels is provided: NEGATIVE and POSITIVE
- A data structure is created for storing pairs (object, class label)
Similarly, but with more output labels, a classifier for 5 classes is created:
public class FiveWayMNBTrainer {
    private NaiveBayesMultinomialText classifier;
    private String modelFile;
    private Instances dataRaw;

    public FiveWayMNBTrainer(String outputModel) {
        classifier = new NaiveBayesMultinomialText();
        classifier.setLowercaseTokens(true);
        classifier.setUseWordFrequencies(true);
        modelFile = outputModel;
        ArrayList<Attribute> atts = new ArrayList<Attribute>(2);
        ArrayList<String> classVal = new ArrayList<String>();
        classVal.add(SentimentClass.FiveWayClazz.NEGATIVE.name());
        classVal.add(SentimentClass.FiveWayClazz.SOMEWHAT_NEGATIVE.name());
        classVal.add(SentimentClass.FiveWayClazz.NEUTRAL.name());
        classVal.add(SentimentClass.FiveWayClazz.SOMEWHAT_POSITIVE.name());
        classVal.add(SentimentClass.FiveWayClazz.POSITIVE.name());
        atts.add(new Attribute("content", (ArrayList<String>) null));
        atts.add(new Attribute("@@class@@", classVal));
        dataRaw = new Instances("TrainingInstances", atts, 10);
    }
}
Classifier Training
Training a classification algorithm, or classifier, consists of feeding the algorithm examples (object, label) packed into pairs (x, y). An object is described by certain features, whose set (or vector) allows one to reliably distinguish an object of one class from an object of another. Say, in the task of classifying fruit into two classes, oranges and apples, such features could be: size, color, the presence of bumps, the presence of a stem. In the context of the sentiment recognition problem, a feature vector can consist of words (unigrams) or pairs of words (bigrams). The labels will be the names (or ordinal numbers) of the target classes: NEGATIVE, NEUTRAL, or POSITIVE. Based on the examples, we expect the algorithm to learn and generalize up to predicting an unknown label y' from a feature vector x'.
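The idea of unigram and bigram features can be sketched in plain Java before wiring anything into Weka. This helper is purely illustrative and not part of the project's code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: turning a sentence into unigram and bigram features.
public class FeatureExtractor {
    // Split a sentence into lowercase word tokens (unigrams).
    public static List<String> unigrams(String sentence) {
        return Arrays.asList(sentence.toLowerCase().split("\\s+"));
    }

    // Build bigram features by joining each pair of adjacent tokens.
    public static List<String> bigrams(List<String> tokens) {
        List<String> result = new ArrayList<String>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            result.add(tokens.get(i) + "_" + tokens.get(i + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> uni = unigrams("I like this weather");
        System.out.println(uni);          // [i, like, this, weather]
        System.out.println(bigrams(uni)); // [i_like, like_this, this_weather]
    }
}
```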
Let's implement a method for adding a pair (x, y) to the three-class classifier. We assume that the feature vector is a list of words.
public void addTrainingInstance(SentimentClass.ThreeWayClazz threeWayClazz, String[] words) {
    double[] instanceValue = new double[dataRaw.numAttributes()];
    instanceValue[0] = dataRaw.attribute(0).addStringValue(Join.join(" ", words));
    instanceValue[1] = threeWayClazz.ordinal();
    dataRaw.add(new DenseInstance(1.0, instanceValue));
    dataRaw.setClassIndex(1);
}
In fact, we could have had the method take a single string as its second parameter instead of an array of strings. But we deliberately work with an array of tokens so that the code can apply whatever filters we want. For sentiment analysis, one very relevant filter is gluing negation words (particles, etc.) to the word that follows: do not like => do not_like. This way, the features like and not_like become bipolar. Without gluing, the word like could occur in both positive and negative contexts, which means it would not carry the desired signal (contrary to reality). At the next step, when the classifier is built, the string of elements will be tokenized and turned into a vector.
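Such a gluing filter is easy to write over the token array. A minimal sketch follows; the list of negation words here is an illustrative assumption, not the one used in the actual project:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the negation-gluing filter described above: a negation word
// is merged with the token that follows it before the pair (x, y) is added.
public class NegationGluer {
    // Illustrative negation list (assumption, not the project's actual list).
    private static final List<String> NEGATIONS = Arrays.asList("not", "no", "never");

    public static String[] glueNegations(String[] words) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < words.length; i++) {
            if (i + 1 < words.length && NEGATIONS.contains(words[i].toLowerCase())) {
                // merge the negation with the next word: "not like" -> "not_like"
                out.add(words[i] + "_" + words[i + 1]);
                i++; // skip the word we just consumed
            } else {
                out.add(words[i]);
            }
        }
        return out.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] glued = glueNegations("i do not like rain".split("\\s+"));
        System.out.println(String.join(" ", glued)); // i do not_like rain
    }
}
```

The filtered array can then be passed to addTrainingInstance as usual.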
Actually, the training of the classifier is implemented in one line:
public void trainModel() throws Exception {
    classifier.buildClassifier(dataRaw);
}
Simple!
I / O (save and load model)
A fairly common scenario in machine learning is to train a classifier model in memory and then recognize/classify new objects right away. However, to work as part of a product, the model must be saved to disk and loaded back into memory. Saving a trained model to disk and loading it back is very easy in Weka, because the classification algorithm classes implement, among many others, the Serializable interface.
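The mechanism behind this is ordinary Java object serialization; Weka's SerializationHelper is essentially a thin wrapper around the standard object streams. A self-contained sketch (with a hypothetical stand-in Model class, since the real classifier lives in Weka) looks like this:

```java
import java.io.*;

// Sketch of the save/load mechanism: any Serializable object
// (Weka classifiers included) can be written and read this way.
public class SerializationDemo {
    // A hypothetical stand-in for a trained model, for illustration only.
    static class Model implements Serializable {
        private static final long serialVersionUID = 1L;
        double learnedWeight = 0.42;
    }

    public static void save(Object obj, String file) throws IOException {
        try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(file))) {
            oos.writeObject(obj);
        }
    }

    public static Object load(String file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(file))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String file = File.createTempFile("model", ".bin").getAbsolutePath();
        save(new Model(), file);
        Model restored = (Model) load(file);
        System.out.println(restored.learnedWeight); // 0.42
    }
}
```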
Saving a trained model:
public void saveModel() throws Exception {
    weka.core.SerializationHelper.write(modelFile, classifier);
}
Loading a trained model:
public void loadModel(String _modelFile) throws Exception {
    NaiveBayesMultinomialText classifier = (NaiveBayesMultinomialText) weka.core.SerializationHelper.read(_modelFile);
    this.classifier = classifier;
}
With the model loaded from disk, let's get to classifying texts. For three-class prediction, we implement the following method:
public SentimentClass.ThreeWayClazz classify(String sentence) throws Exception {
    double[] instanceValue = new double[dataRaw.numAttributes()];
    instanceValue[0] = dataRaw.attribute(0).addStringValue(sentence);
    Instance toClassify = new DenseInstance(1.0, instanceValue);
    dataRaw.setClassIndex(1);
    toClassify.setDataset(dataRaw);
    double prediction = this.classifier.classifyInstance(toClassify);
    double[] distribution = this.classifier.distributionForInstance(toClassify);
    if (distribution[0] != distribution[1])
        return SentimentClass.ThreeWayClazz.values()[(int) prediction];
    else
        return SentimentClass.ThreeWayClazz.NEUTRAL;
}
Pay attention to the line if (distribution[0] != distribution[1]). As you remember, we defined the list of class labels for this case as {NEGATIVE, POSITIVE}, so in principle our classifier should be merely binary. But! If the probability distribution over these two labels is identical (50% each), we can fairly confidently assume that we are dealing with the neutral class. Thus, we get a three-class classifier.
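The decision rule itself can be isolated into a tiny sketch. Note the epsilon tolerance is my own addition for illustration; the article's code compares the two probabilities exactly, as returned by Weka:

```java
// Sketch of the decision rule: two output labels plus a NEUTRAL fallback
// when the binary distribution is a 50/50 tie.
public class ThreeWayDecision {
    enum Label { NEGATIVE, POSITIVE, NEUTRAL }

    public static Label decide(double pNegative, double pPositive) {
        // an exact != comparison on doubles is fragile; a small tolerance is safer
        if (Math.abs(pNegative - pPositive) < 1e-9) {
            return Label.NEUTRAL;
        }
        return pNegative > pPositive ? Label.NEGATIVE : Label.POSITIVE;
    }

    public static void main(String[] args) {
        System.out.println(decide(0.9, 0.1)); // NEGATIVE
        System.out.println(decide(0.5, 0.5)); // NEUTRAL
    }
}
```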
If the classifier is built correctly, then the following unit test should work out correctly:
@org.junit.Test
public void testArbitraryTextPositive() throws Exception {
    threeWayMnbTrainer.loadModel(modelFile);
    Assert.assertEquals(SentimentClass.ThreeWayClazz.POSITIVE, threeWayMnbTrainer.classify("I like this weather"));
}
For completeness, we implement a wrapper class that builds and trains the classifier, saves the model to disk, and tests the model for quality:
public class ThreeWayMNBTrainerRunner {
    public static void main(String[] args) throws Exception {
        KaggleCSVReader kaggleCSVReader = new KaggleCSVReader();
        kaggleCSVReader.readKaggleCSV("kaggle/train.tsv");
        KaggleCSVReader.CSVInstanceThreeWay csvInstanceThreeWay;
        String outputModel = "models/three-way-sentiment-mnb.model";
        ThreeWayMNBTrainer threeWayMNBTrainer = new ThreeWayMNBTrainer(outputModel);
        System.out.println("Adding training instances");
        int addedNum = 0;
        while ((csvInstanceThreeWay = kaggleCSVReader.next()) != null) {
            if (csvInstanceThreeWay.isValidInstance) {
                threeWayMNBTrainer.addTrainingInstance(csvInstanceThreeWay.sentiment, csvInstanceThreeWay.phrase.split("\\s+"));
                addedNum++;
            }
        }
        kaggleCSVReader.close();
        System.out.println("Added " + addedNum + " instances");
        System.out.println("Training and saving Model");
        threeWayMNBTrainer.trainModel();
        threeWayMNBTrainer.saveModel();
        System.out.println("Testing model");
        threeWayMNBTrainer.testModel();
    }
}
Model quality
As you may have already guessed, testing the quality of a model is also quite easy to do with Weka. Computing the model's quality characteristics is necessary, for example, to check whether our model is overfitted or underfitted. Underfitting is intuitively clear: we did not find the optimal set of features for the classified objects, and the model turned out too simple. Overfitting means the model is too tailored to the training examples, i.e. it does not generalize to the real world, being overly complex.
There are different ways to test a model. One of them is cross-validation: hold out a portion of the training set (say, one third) as a test sample, and at each new iteration take a different third of the training set as the test sample, computing the quality metrics relevant to the problem at hand, e.g. precision/recall/accuracy, etc. At the end of such a run, we average over all iterations. This is the amortized quality of the model: in practice it can be lower than the quality on the full training set, but it is closer to the quality in real life.
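Weka has its own built-in cross-validation, but the procedure itself can be sketched schematically, library-free, under the assumption of a hypothetical per-fold scorer:

```java
// Schematic sketch of k-fold cross-validation: each fold serves once as the
// test portion, and per-fold accuracies are averaged into the amortized
// quality estimate. The FoldScorer callback is hypothetical.
public class CrossValidationSketch {
    interface FoldScorer {
        // trains on everything outside [testFrom, testTo) and returns accuracy
        double score(int testFrom, int testTo);
    }

    public static double kFoldAccuracy(int numInstances, int k, FoldScorer scorer) {
        double sum = 0.0;
        int foldSize = numInstances / k;
        for (int fold = 0; fold < k; fold++) {
            int from = fold * foldSize;
            // the last fold absorbs any remainder
            int to = (fold == k - 1) ? numInstances : from + foldSize;
            sum += scorer.score(from, to);
        }
        return sum / k; // average over all folds
    }

    public static void main(String[] args) {
        // toy scorer: pretend every fold scores 0.8
        double avg = kFoldAccuracy(90, 3, (from, to) -> 0.8);
        System.out.printf("%.2f%n", avg); // 0.80
    }
}
```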
However, for a quick look at the model's quality, it is enough to compute the accuracy, i.e. the proportion of correct answers:
public void testModel() throws Exception {
    Evaluation eTest = new Evaluation(dataRaw);
    eTest.evaluateModel(classifier, dataRaw);
    String strSummary = eTest.toSummaryString();
    System.out.println(strSummary);
}
This method displays the following statistics:
Correctly Classified Instances 28625 83.3455 %
Incorrectly Classified Instances 5720 16.6545 %
Kappa statistic 0.4643
Mean absolute error 0.2354
Root mean squared error 0.3555
Relative absolute error 71.991 %
Root relative squared error 87.9228 %
Coverage of cases (0.95 level) 97.7697 %
Mean rel. region size (0.95 level) 83.3426 %
Total Number of Instances 34345
Thus, the accuracy of the model on the full training set is 83.35%. The complete project with the code can be found on my GitHub. The code uses data from Kaggle, so if you decide to use it (or even compete in the competition), you will need to accept the terms of participation and download the data. Implementing the complete code for classifying sentiment into 5 classes is left as a task for the reader. Good luck!