
Classification of Contact Center Call Topics
Hello, colleagues! In this article, I will briefly describe the challenges we ran into while building a solution that classifies the topics of customer calls to a contact center.
Call topic detection is used to track trends and to find recordings of interest. Traditionally, this problem is solved by having the operator assign an appropriate tag, but with this approach the human factor plays a large role, and many operator man-hours are spent.

To solve this problem, our team at Data4 developed a system that determines the call topic via text classification.
The input was a two-channel WAV file sampled at 8 kHz. The file was transcribed by a speech recognition system. Experience showed that the recognition quality of spontaneous Russian speech on our data was 60–70% according to the WER metric. This quality makes syntactic parsing of sentences and visual analysis difficult, but is sufficient for statistical analysis.
We also tested the hypothesis that, in addition to the text, prediction quality could be affected by speech parameters such as pauses, interruptions, and the ratio of the operator's speaking time to the subscriber's. To extract these features, we used a voice activity detector, which works as follows:
- the signal is converted to the frequency domain with the fast Fourier transform;
- the signal is split into 25-millisecond frames;
- for each frame, the first 13 mel-frequency cepstral coefficients (MFCCs) and their first and second deltas are computed;
- the resulting feature vector is fed to an XGBoost-based classifier.
Verification showed that the features obtained from signal processing did not make a positive contribution to our model. Training was conducted on a small sample (1,000 records per class); with a larger training sample, a different result is possible.
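The frame-and-MFCC feature extraction described above can be sketched in plain NumPy. This is a minimal illustration, not the production detector: the filter bank size, FFT length, and the use of non-overlapping frames are assumptions, and the XGBoost classifier that consumes these vectors is omitted.

```python
import numpy as np

def frame_signal(signal, sr=8000, frame_ms=25):
    """Split a mono signal into non-overlapping 25 ms frames (200 samples at 8 kHz)."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    return signal[:n_frames * frame_len].reshape(n_frames, frame_len)

def mel_filterbank(n_filters=26, n_fft=256, sr=8000):
    """Triangular mel-spaced filters over the rFFT bins."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_with_deltas(signal, sr=8000, n_ceps=13, n_fft=256):
    """First 13 MFCCs plus first and second deltas -> 39 features per frame."""
    frames = frame_signal(signal, sr)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2        # power spectrum
    mel_energy = np.log(power @ mel_filterbank(n_fft=n_fft, sr=sr).T + 1e-10)
    # DCT-II of the log mel energies gives the cepstral coefficients
    n_mel = mel_energy.shape[1]
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_mel)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mel))
    ceps = mel_energy @ dct_basis.T
    d1 = np.gradient(ceps, axis=0)                           # first delta
    d2 = np.gradient(d1, axis=0)                             # second delta
    return np.hstack([ceps, d1, d2])
```

One second of 8 kHz audio yields 40 frames of 39 features each, which is the shape a frame-level classifier would train on.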
To build a text-based classifier, the texts had to be converted into feature vectors. For this we used the TF-IDF method. TF-IDF is a statistical measure of a word's importance within a document that belongs to a collection: the weight of a word is proportional to how often the word occurs in the document and inversely proportional to how often it occurs in the other documents of the collection. To reduce dimensionality, word forms were lemmatized.

To exclude both rarely used and overly frequent words, we applied a Russian stop word list, experimentally limited the feature vector length to 3,000, and set the minimum token frequency to 2. In addition, profanity, interjections, conjunctions, and particles were added to the stop word list, since in the vast majority of cases they were either artifacts of speech recognition errors or carried no significant information. The remaining words carry enough information to use their vector representation to train the topic classifier.
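The constraints above map directly onto scikit-learn's TfidfVectorizer. A minimal sketch on a toy English corpus (standing in for lemmatized Russian transcripts; the corpus and topic words are invented for illustration, and a real stop word list would go in the stop_words argument):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for lemmatized call transcripts
corpus = [
    "internet connection is down",
    "internet connection very slow",
    "change tariff plan please",
    "tariff plan cost question",
]

# Same constraints as in the article: at most 3,000 features,
# and a token must occur in at least 2 documents to be kept.
vectorizer = TfidfVectorizer(max_features=3000, min_df=2)
X = vectorizer.fit_transform(corpus)

# Only "connection", "internet", "plan", "tariff" survive min_df=2;
# every other word appears in a single document and is dropped.
print(sorted(vectorizer.vocabulary_))
```

The resulting sparse matrix X (one row per call, one column per surviving token) is what the topic classifier trains on.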

The quality metric was the F-measure. The F-measure combines precision and recall and is calculated as F = 2·P·R / (P + R), where P is precision and R is recall.
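As a quick illustration of the formula, the F-measure is the harmonic mean of precision and recall, so it penalizes an imbalance between the two:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: F = 2*P*R / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# High precision cannot fully compensate for low recall:
print(f_measure(0.9, 0.6))    # 0.72
print(f_measure(0.75, 0.75))  # 0.75
```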
To reduce overfitting, L2 regularization and 10-fold cross-validation were used.
We used binary classifiers, based on the assumption that each topic can be distinguished by contrasting it with all remaining topics (one-vs-rest), and that subtopics can be arranged in a tree.

Algorithm testing showed that, for classifying call texts, logistic regression and random forest give the best results. Logistic regression showed stable results across several data sets, while random forest achieved the highest quality but required additional manual tuning whenever the data set changed.
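Putting the pieces together, a one-vs-rest topic classifier with TF-IDF features, L2-regularized logistic regression, and 10-fold cross-validation on the F1 metric might look like the sketch below. The two word pools and the synthetic documents are invented purely to make the example self-contained; they are not the article's data.

```python
import random
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

random.seed(0)

# Synthetic stand-in for transcripts of two call topics
internet_words = ["internet", "connection", "slow", "router", "down", "wifi"]
billing_words = ["invoice", "payment", "tariff", "charge", "balance", "bill"]
docs = ([" ".join(random.choices(internet_words, k=8)) for _ in range(30)]
        + [" ".join(random.choices(billing_words, k=8)) for _ in range(30)])
labels = [1] * 30 + [0] * 30   # one topic vs. the rest

# TF-IDF features feeding an L2-regularized logistic regression,
# evaluated with 10-fold cross-validation on the F1 metric
clf = make_pipeline(
    TfidfVectorizer(max_features=3000, min_df=2),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
)
scores = cross_val_score(clf, docs, labels, cv=10, scoring="f1")
print(round(scores.mean(), 2))
```

On real data, one such binary classifier would be trained per topic node of the tree, and the regularization strength C would be tuned per data set.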
A weighted F1 score of 0.98 was achieved for classes containing at least 1,000 examples. It should be noted that this quality was reached only on part of the test data: for some classes containing 250–300 examples, the maximum value was 0.7. This is explained by how cleanly the topics are formally separated and by how frequently each topic occurs in the training texts. Thus, classifying calls as targeted vs. non-targeted yields higher quality than classifying customer requests for specific services, and more common topics are classified better.
Summary:
To classify the topics of contact center calls, it is rational to use logistic regression for stable quality, or random forest, which requires pre-tuning. The algorithm's input is a feature vector derived from the text. To achieve a high F1 score, use a training sample containing at least 1,000 examples of each class.
Useful links for working with texts:
Big-ARTM - State-of-the-art Topic Modeling
Gensim - Topic Modelling for Humans
Overview of approaches to text classification
Classification by neural network
Classification by SVM
P.S. I thank Anna Larionova for her contribution to the preparation of this article and the development of the solution.