Implementation of text classification with a convolutional network in Keras

    Strange as it may seem, this will be a text classifier built on a convolutional network (vectorizing the individual words is a separate question). The code, test data and usage examples are on bitbucket (I ran into github's size limits and its suggestion to use Git Large File Storage (LFS), which I haven't mastered yet).

    Datasets


    Converted datasets were used: Reuters with 22000 records, a Watson one with 530 records, and one more Watson one with 50 records. By the way, I wouldn't say no to a set of texts in Russian sent to the comments or by private message (but the comments are still better).

    Network structure


    It is based on one implementation of the network described here. The code of that implementation is on github.

    In my case, the input of the network is word vectors (the gensim implementation of word2vec is used). The network structure is shown below:


    In short:

    • The text is represented as a matrix of the form word_count x word_vector_size. The vectors of the individual words come from word2vec, which you can read about, for example, in this post. Since I don't know in advance what text the user will feed in, I take a length of 2 * N, where N is the number of vectors in the longest text of the training set. Yes, that's a finger-in-the-air guess.
    • The matrix is processed by the convolutional part of the network (at the output we get transformed word features).
    • The extracted features are processed by the fully connected part of the network.
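
    Below is a minimal sketch of such a convolution-over-word-vectors network in Keras. It is not the author's exact architecture (that lives in the linked pynlc code): the filter count, kernel size and layer sizes are hypothetical placeholders, and the layer names follow the newer Conv1D spelling (older Keras releases called it Convolution1D). A sigmoid output with binary crossentropy would match the independent per-class probabilities shown in the prediction output later in the post.

    # Sketch only: hypothetical layer sizes, not the architecture used in pynlc.
    from keras.models import Sequential
    from keras.layers import Conv1D, GlobalMaxPooling1D, Dense

    def build_model(max_words, vector_size, class_count):
        # Input: one text as a (max_words x vector_size) matrix of word2vec vectors;
        # shorter texts are padded with zero rows up to max_words (the 2 * N above).
        model = Sequential()
        model.add(Conv1D(128, 3, activation="relu",
                         input_shape=(max_words, vector_size)))
        model.add(GlobalMaxPooling1D())
        model.add(Dense(64, activation="relu"))
        # One sigmoid unit per class, so each class probability is predicted
        # independently (multi-label classification).
        model.add(Dense(class_count, activation="sigmoid"))
        model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=["accuracy"])
        return model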

    I filter out stopwords beforehand (it did not affect the Reuters dataset, but on the smaller sets it did). More on that below.

    Installing the necessary software (keras/theano, cuda) on Windows


    Installation on Linux was significantly easier. Required:

    • python3.5
    • python header files (python-dev in debian)
    • gcc
    • cuda
    • python libraries are the same as in the list below

    In my case with win10 x64, the approximate sequence was as follows:

    • Anaconda with python3.5.
    • Cuda 8.0. You can run on the CPU instead (gcc is enough then, and the next 4 steps are not needed), but on relatively large datasets the slowdown should be significant (I did not check).
    • The path to nvcc added to PATH (otherwise theano will not find it).
    • Visual Studio 2015 with C++, including the Windows 10 Kit (corecrt.h is required).
    • The path to cl.exe added to PATH.
    • The path to corecrt.h added to INCLUDE (in my case C:\Program Files (x86)\Windows Kits\10\Include\10.0.10240.0\ucrt).
    • conda install mingw libpython - gcc and libpython will be needed when the network is compiled.
    • And pip install keras theano python-levenshtein gensim nltk (it might also work with the Keras backend switched from theano to tensorflow, but I have not tested that).
    • in .theanorc the following flag is specified for gcc (a fuller sample .theanorc is sketched after this list):

      		[gcc]
      		cxxflags = -D_hypot=hypot
        

    • Run python and execute

      		import nltk
      		nltk.download()
        
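
    For reference, a fuller .theanorc that enables the GPU might look roughly like the one below. This is a sketch rather than the exact file from the post; device and floatX are standard Theano options, and the values shown (the old-style gpu device, float32) are assumptions to adjust for your own card and setup.

      		[global]
      		device = gpu
      		floatX = float32

      		[gcc]
      		cxxflags = -D_hypot=hypot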


    Word processing


    At this stage, stopwords that are not part of any “whitelist” combination (more on that below) are removed, and the remaining words are vectorized. The input data for the algorithm:

    • the language - nltk needs it to tokenize the text and to return the list of stopwords
    • a “whitelist” of word combinations that contain stopwords. For example, “on” counts as a stopword, but [“turn”, “on”] is another matter
    • the word2vec vectors

    And the algorithm itself (I see at least two possible improvements, but have not gotten to them):

    • I split the input text into tokens with nltk.tokenize (roughly, “Hello, world!” is converted to [“hello”, “,”, “world”, “!”]).
    • I drop tokens that are not in the word2vec dictionary.

      More precisely, tokens that are not there and for which no similar token could be found by distance. So far I only use Levenshtein distance; there is an idea to additionally filter the tokens with the smallest Levenshtein distance by the distance from their vectors to the vectors present in the training set.

    • Select tokens:

      • those that are not in the stopword list (this reduced the error on the weather dataset, but without the next step it really ruined the result on “car_intents”).

      • if the token is in the stopword list, I check whether the text contains a whitelist sequence that includes it (roughly: on finding “on”, check for the presence of sequences from the list [[“turn”, “on”]]). If there is one, the token is kept anyway. There is room for improvement here: right now I only check (in our example) that “turn” is present somewhere, but it may have nothing to do with this particular “on”.

    • Replace selected tokens with their vectors.
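
    A compact sketch of this preprocessing might look as follows. It is only an illustration, not the pynlc implementation: text_to_vectors and closest_known_token are made-up names, word_vectors is assumed to behave like a dict from token to vector (supporting the in operator, indexing and iteration over its tokens, e.g. a loaded word2vec model), and the vector-distance refinement mentioned above is omitted.

    # Sketch only: names and thresholds here are illustrative, not from pynlc.
    import Levenshtein  # python-levenshtein, from the pip install list above
    from nltk import word_tokenize
    from nltk.corpus import stopwords

    def closest_known_token(token, vocabulary, max_distance=2):
        # Fallback for out-of-vocabulary tokens: pick the closest known token
        # by Levenshtein distance, or give up if nothing is close enough.
        best = min(vocabulary, key=lambda known: Levenshtein.distance(token, known))
        return best if Levenshtein.distance(token, best) <= max_distance else None

    def text_to_vectors(text, language, whitelist, word_vectors):
        stop_words = set(stopwords.words(language))
        tokens = [token.lower() for token in word_tokenize(text)]
        # Keep tokens known to word2vec, trying a Levenshtein substitute otherwise.
        known = []
        for token in tokens:
            if token in word_vectors:
                known.append(token)
            else:
                substitute = closest_known_token(token, word_vectors)
                if substitute is not None:
                    known.append(substitute)
        kept = []
        for token in known:
            if token not in stop_words:
                kept.append(token)
                continue
            # Stopword: keep it only if some whitelist combination containing it
            # occurs (loosely) among the known tokens, e.g. "on" in ["turn", "on"].
            if any(token in combination and all(word in known for word in combination)
                   for combination in whitelist):
                kept.append(token)
        return [word_vectors[token] for token in kept]

    The brute-force Levenshtein search over the whole vocabulary is of course slow; it is only there to show the idea.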

    Show me the code


    Here is the code I used to evaluate the impact of the changes:
    import itertools
    import json
    import numpy
    from gensim.models import Word2Vec
    from pynlc.test_data import reuters_classes, word2vec, car_classes, weather_classes
    from pynlc.text_classifier import TextClassifier
    from pynlc.text_processor import TextProcessor
    from sklearn.metrics import mean_squared_error
    def classification_demo(data_path, train_before, test_before, train_epochs, test_labels_path, instantiated_test_labels_path, trained_path):
        with open(data_path, 'r', encoding='utf-8') as data_source:
            data = json.load(data_source)
        texts = [item["text"] for item in data]
        class_names = [item["classes"] for item in data]
        train_texts = texts[:train_before]
        train_classes = class_names[:train_before]
        test_texts = texts[train_before:test_before]
        test_classes = class_names[train_before:test_before]
        text_processor = TextProcessor("english", [["turn", "on"], ["turn", "off"]], Word2Vec.load_word2vec_format(word2vec))
        classifier = TextClassifier(text_processor)
        classifier.train(train_texts, train_classes, train_epochs, True)
        prediction = classifier.predict(test_texts)
        with open(test_labels_path, "w", encoding="utf-8") as test_labels_output:
            test_labels_output_lst = []
            for i in range(0, len(prediction)):
                test_labels_output_lst.append({
                    "real": test_classes[i],
                    "classified": prediction[i]
                })
            json.dump(test_labels_output_lst, test_labels_output)
        instantiated_classifier = TextClassifier(text_processor, **classifier.config)
        instantiated_prediction = instantiated_classifier.predict(test_texts)
        with open(instantiated_test_labels_path, "w", encoding="utf-8") as instantiated_test_labels_output:
            instantiated_test_labels_output_lst = []
            for i in range(0, len(instantiated_prediction)):
                instantiated_test_labels_output_lst.append({
                    "real": test_classes[i],
                    "classified": instantiated_prediction[i]
                })
            json.dump(instantiated_test_labels_output_lst, instantiated_test_labels_output)
        with open(trained_path, "w", encoding="utf-8") as trained_output:
            json.dump(classifier.config, trained_output, ensure_ascii=True)
    def classification_error(files):
        for name in files:
            with open(name, "r", encoding="utf-8") as src:
                data = json.load(src)
            classes = []
            real = []
            for row in data:
                classes.append(row["real"])
                classified = row["classified"]
                row_classes = list(classified.keys())
                row_classes.sort()
                real.append([classified[class_name] for class_name in row_classes])
            labels = []
            class_names = list(set(itertools.chain(*classes)))
            class_names.sort()
            for item_classes in classes:
                labels.append([int(class_name in item_classes) for class_name in class_names])
            real_np = numpy.array(real)
            mse = mean_squared_error(numpy.array(labels), real_np)
            print(name, mse)
    if __name__ == '__main__':
        print("Reuters:\n")
        classification_demo(reuters_classes, 10000, 15000, 10,
                            "reuters_test_labels.json", "reuters_car_test_labels.json",
                            "reuters_trained.json")
        classification_error(["reuters_test_labels.json", "reuters_car_test_labels.json"])
        print("Car intents:\n")
        classification_demo(car_classes, 400, 500, 20,
                            "cars_test_labels.json", "instantiated_cars_test_labels.json",
                            "car_trained.json")
        classification_error(["cars_test_labels.json", "instantiated_cars_test_labels.json"])
        print("Weather:\n")
        classification_demo(weather_classes, 40, 50, 30,
                            "weather_test_labels.json", "instantiated_weather_test_labels.json",
                            "weather_trained.json")
        classification_error(["weather_test_labels.json", "instantiated_weather_test_labels.json"])
    


    Here you see:

    • Data preparation:

      with open(data_path, 'r', encoding='utf-8') as data_source:
         data = json.load(data_source)
      texts = [item["text"] for item in data]
      class_names = [item["classes"] for item in data]
      train_texts = texts[:train_before]
      train_classes = class_names[:train_before]
      test_texts = texts[train_before:test_before]
      test_classes = class_names[train_before:test_before]
      

    • Creating a new classifier:

      text_processor = TextProcessor("english", [["turn", "on"], ["turn", "off"]], Word2Vec.load_word2vec_format(word2vec))
      classifier = TextClassifier(text_processor)
      

    • Its training:

      classifier.train(train_texts, train_classes, train_epochs, True)
      

    • Predicting classes for a test sample and saving pairs of “real classes” - “predicted class probabilities”:

      prediction = classifier.predict(test_texts)
      with open(test_labels_path, "w", encoding="utf-8") as test_labels_output:
              test_labels_output_lst = []
              for i in range(0, len(prediction)):
                  test_labels_output_lst.append({
                      "real": test_classes[i],
                      "classified": prediction[i]
                  })
              json.dump(test_labels_output_lst, test_labels_output)
      

    • Creating a new classifier instance from a configuration (a dict that can be serialized to / deserialized from, for example, json):

      instantiated_classifier = TextClassifier(text_processor, **classifier.config)
      			

    The output looks approximately like this:

    C:\Users\user\pynlc-env\lib\site-packages\gensim\utils.py:840: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
      warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
    C:\Users\user\pynlc-env\lib\site-packages\gensim\utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
      warnings.warn("Pattern library is not installed, lemmatization won't be available.")
    Using Theano backend.
    Using gpu device 0: GeForce GT 730 (CNMeM is disabled, cuDNN not available)
    Reuters:
    Train on 3000 samples, validate on 7000 samples
    Epoch 1/10
    20/3000 [..............................] - ETA: 307s - loss: 0.6968 - acc: 0.5376
    ....
    3000/3000 [==============================] - 640s - loss: 0.0018 - acc: 0.9996 - val_loss: 0.0019 - val_acc: 0.9996
    Epoch 8/10
    20/3000 [..............................] - ETA: 323s - loss: 0.0012 - acc: 0.9994
    ...
    3000/3000 [==============================] - 635s - loss: 0.0012 - acc: 0.9997 - val_loss: 9.2200e-04 - val_acc: 0.9998
    Epoch 9/10
    20/3000 [..............................] - ETA: 315s - loss: 3.4387e-05 - acc: 1.0000
    ...
    3000/3000 [==============================] - 879s - loss: 0.0012 - acc: 0.9997 - val_loss: 0.0016 - val_acc: 0.9995
    Epoch 10/10
    20/3000 [..............................] - ETA: 327s - loss: 8.0144e-04 - acc: 0.9997
    ...
    3000/3000 [==============================] - 655s - loss: 0.0012 - acc: 0.9997 - val_loss: 7.4761e-04 - val_acc: 0.9998
    reuters_test_labels.json 0.000151774189194
    reuters_car_test_labels.json 0.000151774189194
    Car intents:
    Train on 280 samples, validate on 120 samples
    Epoch 1/20
    20/280 [=>............................] - ETA: 0s - loss: 0.6729 - acc: 0.5250
    ...
    280/280 [==============================] - 0s - loss: 0.2914 - acc: 0.8980 - val_loss: 0.2282 - val_acc: 0.9375
    ...
    Epoch 19/20
    20/280 [=>............................] - ETA: 0s - loss: 0.0552 - acc: 0.9857
    ...
    280/280 [==============================] - 0s - loss: 0.0464 - acc: 0.9842 - val_loss: 0.1647 - val_acc: 0.9494
    Epoch 20/20
    20/280 [=>............................] - ETA: 0s - loss: 0.0636 - acc: 0.9714
    ...
    280/280 [==============================] - 0s - loss: 0.0447 - acc: 0.9849 - val_loss: 0.1583 - val_acc: 0.9530
    cars_test_labels.json 0.0520754688092
    instantiated_cars_test_labels.json 0.0520754688092
    Weather:
    Train on 28 samples, validate on 12 samples
    Epoch 1/30
    20/28 [====================>.........] - ETA: 0s - loss: 0.6457 - acc: 0.6000
    ...
    Epoch 29/30
    20/28 [====================>.........] - ETA: 0s - loss: 0.0021 - acc: 1.0000
    ...
    28/28 [==============================] - 0s - loss: 0.0019 - acc: 1.0000 - val_loss: 0.1487 - val_acc: 0.9167
    Epoch 30/30
    ...
    28/28 [==============================] - 0s - loss: 0.0018 - acc: 1.0000 - val_loss: 0.1517 - val_acc: 0.9167
    weather_test_labels.json 0.0136964029149
    instantiated_weather_test_labels.json 0.0136964029149
    

    In the course of experiments with stopwords:

    • the error on the Reuters set stayed comparable regardless of whether stopwords were removed or kept.

    • the error on weather dropped from 8% when stopwords were removed. The more complex algorithm made no difference (because there are no combinations there for which a stopword still needs to be kept).

    • the error on car_intents rose to roughly 15% when stopwords were removed (for example, the conditional “turn on” was reduced to “turn”). After adding the whitelist handling it returned to the previous level.

    An example of running a pre-trained classifier


    The TextClassifier.config property is a dictionary that can be dumped, for example, to json; after restoring it from json, its elements can be passed to the TextClassifier constructor. For instance:

    import json
    from gensim.models import Word2Vec
    from pynlc.test_data import word2vec
    from pynlc import TextProcessor, TextClassifier
    if __name__ == '__main__':
        text_processor = TextProcessor("english", [["turn", "on"], ["turn", "off"]],
                                       Word2Vec.load_word2vec_format(word2vec))
        with open("weather_trained.json", "r", encoding="utf-8") as classifier_data_source:
            classifier_data = json.load(classifier_data_source)
        classifier = TextClassifier(text_processor, **classifier_data)
        texts = [
            "Will it be windy or rainy at evening?",
            "How cold it'll be today?"
        ]
        predictions = classifier.predict(texts)
        for i in range(0, len(texts)):
            print(texts[i])
            print(predictions[i])
    

    And its output:

    C:\Users\user\pynlc-env\lib\site-packages\gensim\utils.py:840: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
     warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
    C:\Users\user\pynlc-env\lib\site-packages\gensim\utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
      warnings.warn("Pattern library is not installed, lemmatization won't be available.")
    Using Theano backend.
    Will it be windy or rainy at evening?
    {'temperature': 0.039208538830280304, 'conditions': 0.9617446660995483}
    How cold it'll be today?
    {'temperature': 0.9986168146133423, 'conditions': 0.0016815820708870888}
      

    And yes, the network config trained on the Reuters dataset is here. A gigabyte of network for a 19 MB dataset, yes :-)
