VotingClassifier in scikit-learn: building and optimizing an ensemble of classification models

As part of a larger Sentiment Analysis task (classifying reviews), I decided to spend some time studying one of its building blocks separately: using VotingClassifier from sklearn.ensemble as a tool for combining classification models into an ensemble and improving the final quality of predictions. Why is this important and what are the nuances?



It often happens that, in the course of solving an applied data analysis problem, it is not immediately obvious (or not obvious at all) which model is best suited for training. One option is to pick the most popular and/or intuitively suitable model based on the nature of the available data; the parameters of that model are then optimized (for example, via GridSearchCV) and it is used on its own. Another approach is to use an ensemble of models, where the results of several models contribute to the final prediction. Let me note right away that the purpose of this article is not to describe the advantages of ensembles or the principles of building them (you can read about that here), but rather to walk through one applied approach on a specific example and look at the nuances that arise along the way.

The global problem is set up as follows: there are only 100 mobile phone reviews as a test sample, and we need a pre-trained model that will show the best result on those 100 reviews, i.e. determine whether a review is positive or negative. An additional complication, as follows from the problem statement, is the absence of a training sample. To overcome this, 10,000 mobile phone reviews together with their grades were scraped from one of the Russian sites with the help of the Beautiful Soup library.
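The parsing step itself is outside the scope of this article, but for orientation here is a rough sketch of the kind of scraping loop that could be used; the URL, number of pages and CSS selectors below are hypothetical placeholders, not the actual site:

import requests
from bs4 import BeautifulSoup

reviews, grades = [], []

# Hypothetical listing pages; the real site, URL pattern and selectors differ.
for page in range(1, 501):
    html = requests.get('https://example.com/phone-reviews?page=%d' % page).text
    soup = BeautifulSoup(html, 'html.parser')
    for item in soup.select('.review'):  # hypothetical CSS class
        reviews.append(item.select_one('.review-text').get_text(strip=True))
        grades.append(int(item.select_one('.review-grade').get_text(strip=True)))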

Skipping the steps of parsing, preprocessing the data and studying its original structure, we move on to the point where we have:

  • a training sample of 10,000 phone reviews, each review labeled binary (positive or negative). The labels are derived from the grades: reviews with grades 1-3 are treated as negative and reviews with grades 4-5 as positive.
  • using CountVectorizer, the data is converted into a form suitable for training the classifier models (a minimal sketch of this step is shown below)
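A minimal sketch of the vectorization step; the DataFrame here is a toy stand-in for the scraped data, and the 'Text' column name is an assumption (only df_texts['Binary_Rate'] and data_messages_vectorized appear later in the article):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the scraped reviews; in reality df_texts comes from the parsing step.
df_texts = pd.DataFrame({
    'Text': ['Отличный телефон, всем доволен', 'Ужасная батарея, не советую'],
    'Binary_Rate': [1, 0],
})

# Build the bag-of-words document-term matrix the classifiers will be trained on.
vectorizer = CountVectorizer()
data_messages_vectorized = vectorizer.fit_transform(df_texts['Text'])
print(data_messages_vectorized.shape)  # (number of reviews, vocabulary size)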

How do we decide which model will work best? We cannot simply try the models by hand: a test sample of only 100 reviews creates a huge risk that some model will just happen to fit this particular test sample better, while on another sample hidden from us, or in production, its result would be below average.

To solve this problem, the scikit-learn library provides the VotingClassifier module, an excellent tool for taking several dissimilar machine learning models and combining them into a single classifier. This reduces the risk of overfitting, as well as of misinterpreting the results of any single model. The VotingClassifier module is imported with the following command:
from sklearn.ensemble import VotingClassifier

Practical details when working with this module:

1) The first and most important question is how the combined classifier arrives at a single prediction after receiving the predictions of each of its member models. Among the VotingClassifier parameters there is a voting parameter with two possible values: 'hard' and 'soft'.

1.1) In the first case, the final answer of the combined classifier corresponds to the “opinion” of the majority of its members. For example, suppose your combined classifier uses three different models. On a specific observation, two of them predict “positive review” and the third predicts “negative review”. The final prediction for this observation will therefore be “positive review”, since we have 2 votes “for” and 1 “against”.

1.2) In the second case, i.e. with voting='soft', the final answer is obtained by genuinely “voting” over and weighting the predicted probabilities of each class: the combined classifier returns the argmax of the sum of the predicted probabilities. IMPORTANT! To use this voting method, every classifier inside your ensemble must support the predict_proba() method, so that a quantitative estimate of the probability of belonging to each class can be obtained. Note that not all classifier models support this method and, accordingly, not all of them can be used inside VotingClassifier with weighted-probability voting (soft voting).
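A quick way to check whether a particular estimator exposes predict_proba (and can therefore take part in soft voting) is hasattr; a small illustrative sketch:

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB

# With its default hinge loss, SGDClassifier provides no probability estimates,
# so as configured here it could not participate in soft voting.
candidates = {'LogisticRegression': LogisticRegression(),
              'MultinomialNB': MultinomialNB(),
              'SGDClassifier (default hinge loss)': SGDClassifier()}

for name, clf in candidates.items():
    print(name, '-> predict_proba available:', hasattr(clf, 'predict_proba'))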

Let's work through an example: there are three classifiers and two classes of reviews, positive and negative. Via the predict_proba method, each classifier returns some probability p with which it assigns a specific observation to class 1 and, accordingly, a probability (1 - p) for class 2. Having received a response from each model, the combined classifier weighs the obtained estimates and returns the final result, obtained as

$$\max(w_1 p_{1,1} + w_2 p_{2,1} + w_3 p_{3,1},\ w_1 p_{1,2} + w_2 p_{2,2} + w_3 p_{3,2})$$

where w1, w2, w3 are the weights of the classifiers included in the ensemble (equal by default), and p_{i,1}, p_{i,2} are the probabilities with which the i-th classifier assigns the observation to class 1 and class 2. Note also that with soft voting the weights of the classifiers can be changed via the weights parameter, so the module call should look like this:
... = VotingClassifier(estimators=[('..', clf1), ('..', clf2), ('...', clf3)], voting='soft', weights=[*,*,*]), where the asterisks stand for the desired weight of each model.
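To make the two voting modes concrete, here is a minimal self-contained sketch on a toy dataset; the models, data and weights are purely illustrative and are not the ones used later in the article:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Toy binary classification data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

log_reg = LogisticRegression()
gauss_nb = GaussianNB()
tree = DecisionTreeClassifier(random_state=0)

# Hard voting: the majority of the three predicted labels wins.
hard_vote = VotingClassifier(estimators=[('lr', log_reg), ('nb', gauss_nb), ('dt', tree)],
                             voting='hard')

# Soft voting: argmax of the weighted sum of predict_proba outputs;
# the weights are arbitrary and only demonstrate the syntax.
soft_vote = VotingClassifier(estimators=[('lr', log_reg), ('nb', gauss_nb), ('dt', tree)],
                             voting='soft', weights=[2, 1, 1])

for label, clf in [('hard', hard_vote), ('soft', soft_vote)]:
    clf.fit(X, y)
    print(label, clf.predict(X[:5]))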

2) The ability to use VotingClassifier together with GridSearchCV to optimize the hyperparameters of each classifier in the ensemble.

When you plan to use an ensemble and want the models included in it to be optimized as well, you can run GridSearchCV directly on the combined classifier. The code below shows how to work with the member models (logistic regression, naive Bayes, stochastic gradient descent) while staying within the combined classifier (VotingClassifier):

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

clf1 = LogisticRegression()
clf2 = MultinomialNB()
clf3 = SGDClassifier(max_iter=1000, loss='log')  # logistic loss (named 'log_loss' in newer scikit-learn versions)
eclf = VotingClassifier(estimators=[('lr', clf1), ('nb', clf2), ('sgd', clf3)], voting='hard') # majority voting (hard voting), see item 1.1
params = {'lr__C' : [0.5,1,1.5], 'lr__class_weight': [None,'balanced'],
          'nb__alpha' : [0.1,1,2],
          'sgd__penalty' : ['l2', 'l1'], 'sgd__alpha': [0.0001,0.001,0.01]} # parameter grid to search over; the key syntax matters so that the right model gets optimized
grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5, scoring='accuracy', n_jobs=-1)
grid = grid.fit(data_messages_vectorized, df_texts['Binary_Rate']) # with everything set up, train and optimize with 5-fold cross-validation on the collected training sample

Thus, the keys of the params dictionary must be written as '<estimator name>__<parameter>' so that GridSearchCV can tell which model in the ensemble a given parameter belongs to and optimize exactly that model.
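If you are unsure how a particular hyperparameter should be spelled in the grid, the combined classifier itself can list every name it accepts; a short sketch, building on the eclf object defined above:

# Every key that GridSearchCV will accept for this ensemble; the nested models
# appear with the '<estimator name>__' prefix used in the params dictionary above.
for name in sorted(eclf.get_params().keys()):
    print(name)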

That is essentially all you need to know to use VotingClassifier as a tool for building and optimizing an ensemble of models. Let's look at the results:

print(grid.best_params_)
{'lr__class_weight': 'balanced', 'sgd__penalty': 'l1', 'nb__alpha': 1, 'lr__C': 1, 'sgd__alpha': 0.001}
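The article does not show how the found values are pushed back into the models before the comparison below; one possible way, using the same double-underscore keys, is set_params:

# Apply the parameters found by the grid search to the ensemble in place.
eclf.set_params(**grid.best_params_)
# Since the ensemble holds references to clf1, clf2 and clf3, the individual
# models are updated as well, so the comparison below runs with the tuned settings.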

The optimal parameter values have been found. It remains to cross-validate on the training set and compare the ensemble of classifiers (VotingClassifier) with the optimal parameters against the individual models with their optimal parameters:

from sklearn.model_selection import cross_val_score

for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Naive Bayes', 'SGD', 'Ensemble_HardVoting']):
    scores = cross_val_score(clf, data_messages_vectorized, df_texts['Binary_Rate'], cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Final result:

Accuracy: 0.75 (± 0.02) [Logistic Regression]
Accuracy: 0.79 (± 0.02) [Naive Bayes]
Accuracy: 0.79 (± 0.02) [SGD]
Accuracy: 0.79 (± 0.02) [Ensemble_HardVoting]

As you can see, the models performed somewhat differently on the training set (with the default parameters the difference was more noticeable). Note that the overall value of the ensemble (for the accuracy metric) does not have to exceed the best value among its member models: the ensemble is rather a more stable model, likely to show roughly the same results on the test sample and in production, and thus reduces the risk of overfitting, of fitting to the training sample, and of other related problems with classifiers. Good luck with your applied problems and thank you for your attention!

P.S. Given the specifics and rules of publishing in the sandbox, I cannot provide a link to GitHub and to the source code of the analysis given in this article, nor a link to the Kaggle InClass competition that provided the test set and the tools for evaluating models on it. I can only say that this ensemble beat the baseline by a significant margin and took a worthy place on the leaderboard after evaluation on the test set. I hope I will be able to share it in future publications.
