Classification of Russian text using the Natural library on NodeJS

    Preamble


    It will surprise no one that a modern person, and a programmer especially, takes in a lot of information every day. My RSS client alone delivers about 500 articles a week, and it is far from my only source of information.

    I decided to build myself an RSS client on NodeJS with a trainable article filter. Node already has ready-made RSS readers and ready-made neural networks and classifiers, so writing a prototype did not look particularly hard.

    I started by testing the neural networks that came to hand on a small amount of input. I copied the positive data from articles about nodejs in a geek magazine and found the negative data on "tape.ru". The classifier's task was to separate articles on programming and nodejs from ordinary news that does nothing for my development.

    I won't show the results of working with Brain and Fann: I don't think I have enough expertise to judge them. I'll only say that they did not suit me out of the box; on my input they did not give an adequate number of correct answers. The Natural library, however, really impressed me.

    Below I show how I trained the classifier, checked its work, and made it understand Russian.



    Input data


    The data I used to train and test the classifier can be viewed here. There is too much of it for the article, so I moved it out.

    The code


    'use strict';
    var data = require('./data');
    var natural = require('natural'),
      porterStemmer = natural.PorterStemmerRu,
      classifier = new natural.BayesClassifier(porterStemmer);
    // Give the classifier examples of good and bad data.
    for (var i = 0; i < data.good.length; i++) {
      classifier.addDocument(data.good[i], 'good');
    }
    for (var i = 0; i < data.bad.length; i++) {
      classifier.addDocument(data.bad[i], 'bads');
    }
    // Run training on the supplied texts.
    classifier.train();
    // Now classify the test texts.
    console.log('START CLASSIFICATION');
    console.log('Test on good');
    for (var i = 0; i < data.test_good.length; i++) {
      console.log("> ", classifier.classify(data.test_good[i]));
    }
    console.log('Test on bad');
    for (var i = 0; i < data.test_bad.length; i++) {
      console.log("> ", classifier.classify(data.test_bad[i]));
    }
    


    Result


    START CLASSIFICATION
    Test on good
    > good
    > good
    > good
    > good
    Test on bad
    > bads
    > bads
    > bads
    > bads
    > good
    > bads
    > bads
    > good
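
    To put a number on this output, here is a minimal sketch that counts how many answers match the expected label. The label arrays are copied by hand from the output above; the helper name countHits is my own and is not part of Natural.

```javascript
'use strict';

// Predicted labels, copied by hand from the "Result" output.
const predictedForGood = ['good', 'good', 'good', 'good'];
const predictedForBad = ['bads', 'bads', 'bads', 'bads', 'good', 'bads', 'bads', 'good'];

// Hypothetical helper: how many predictions equal the expected label.
function countHits(predicted, expected) {
  return predicted.filter(function (label) {
    return label === expected;
  }).length;
}

const goodHits = countHits(predictedForGood, 'good');
const badHits = countHits(predictedForBad, 'bads');
const total = predictedForGood.length + predictedForBad.length;
const overallPercent = ((goodHits + badHits) / total * 100).toFixed(1);

console.log('good: ' + goodHits + '/' + predictedForGood.length); // good: 4/4
console.log('bad: ' + badHits + '/' + predictedForBad.length);    // bad: 6/8
console.log('overall: ' + overallPercent + '%');                  // overall: 83.3%
```

    With only twelve test texts this is a rough indication, not a measurement: two of the eight "bad" texts were taken for "good", while none of the "good" ones were lost.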


    Russian language support


    For high-quality classification, Natural uses a "stemmer" component, which splits the text into an array of words, removes useless words (the so-called stopwords), and truncates word endings.
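
    The three steps described above can be illustrated with a toy sketch in plain JavaScript. This is not Natural's algorithm: the stopword list, the list of endings, and the names toyStem and toyTokenizeAndStem are all made up for the illustration, and a real Porter stemmer for Russian applies far more careful rules.

```javascript
'use strict';

// Hypothetical, deliberately tiny stopword list (Natural ships its own).
const STOPWORDS = new Set(['и', 'в', 'на', 'не', 'что', 'для']);

// Hypothetical list of endings, longest first. A real stemmer is much smarter.
const ENDINGS = ['ами', 'ями', 'ов', 'ев', 'ах', 'ях', 'ый', 'ая', 'ое', 'ые',
                 'а', 'я', 'ы', 'и', 'о', 'е', 'у', 'ю'];

// Step 3: cut off the first matching ending, keeping at least three letters.
function toyStem(word) {
  for (const ending of ENDINGS) {
    if (word.endsWith(ending) && word.length - ending.length >= 3) {
      return word.slice(0, -ending.length);
    }
  }
  return word;
}

function toyTokenizeAndStem(text) {
  // Step 1: lowercase the text and split it into an array of words.
  const tokens = text.toLowerCase().match(/[а-яёa-z0-9]+/g) || [];
  // Step 2: drop stopwords, then stem what is left.
  return tokens
    .filter(function (token) { return !STOPWORDS.has(token); })
    .map(toyStem);
}

console.log(toyTokenizeAndStem('Статьи и новости для программистов'));
// [ 'стать', 'новост', 'программист' ]
```

    In Natural itself, as far as I can tell, the same job is done by the stemmer's tokenizeAndStem method, for example natural.PorterStemmerRu.tokenizeAndStem(text).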

    By default, the classifier ignores Russian words, although the project does support the Russian language. To make the classifier understand Russian, initialize it with the Russian stemmer, which replaces the default English stemmer. It is very easy to do:

    var classifier = new natural.BayesClassifier(natural.PorterStemmerRu);
    


    Now the text inside the classifier will be processed correctly, taking into account the peculiarities of the Russian language.

    For those who like to experiment


    I have set up a repository with a working classifier for this purpose. Installation is trivial:

    git clone git@github.com:shuvalov-anton/classifier.git
    cd classifier
    npm i
    node app.js
    


    Next, change the data in data.js to your own and see the result.
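
    For orientation, here is a sketch of the shape data.js needs, judging only by the fields the code above reads (good, bad, test_good, test_bad). The sample texts are placeholders I made up; the real file in the repository may be organized differently.

```javascript
'use strict';

// Assumed shape of data.js: four arrays of plain-text strings.
// Every text below is a made-up placeholder, not the original data.
const data = {
  // Training examples of texts you want to keep.
  good: [
    'Пишем модуль для nodejs и публикуем его в npm'
  ],
  // Training examples of texts you want to filter out.
  bad: [
    'В городе открылась новая выставка'
  ],
  // Texts the classifier has not seen, expected to come out as "good".
  test_good: [
    'Асинхронные функции в nodejs на практике'
  ],
  // Texts the classifier has not seen, expected to come out as "bads".
  test_bad: [
    'Сборная выиграла товарищеский матч'
  ]
};

module.exports = data;
```

    The more examples you put into good and bad, the more the classifier has to learn from; one text per class, as in this sketch, only shows the structure.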

    PS


    Honestly, I have no experience in classifying information that would let me evaluate the result properly, but as an ordinary user I was really impressed by how Natural works. Unfortunately, I did not find any more or less serious project documentation other than the readme on github. To understand how to enable the Russian language I had to rummage through the source, but there was nothing super complex in that, and I think the result was worth it!
