Implementation of text classification with a convolutional network in Keras

    Strange as it may seem, this will be a text classifier built on a convolutional network (vectorizing the individual words is a separate question). The code, test data and usage examples are on bitbucket (I ran into github's size limits and its suggestion to use Git Large File Storage (LFS), which I haven't mastered yet).

    Datasets


    Converted datasets were used: Reuters with 22000 records, a Watson one with 530 records, and one more Watson one with 50 records. By the way, I wouldn't say no to a set of texts in Russian sent to the comments or by private message (but the comments are still better).

    Network structure


    It is based on one implementation of the network described here. The code of that implementation is on github.

    In my case, the input of the network is word vectors (the gensim implementation of word2vec is used). The network structure is shown below:


    In short:

    • The text is represented as a matrix of the form word_count x word_vector_size. The vectors of the individual words come from word2vec, which you can read about, for example, in this post. Since I don't know in advance what text the user will feed in, I take a length of 2 * N, where N is the number of vectors in the longest text of the training set. Yes, that's a finger-in-the-air guess.
    • The matrix is processed by the convolutional part of the network (at the output we get transformed word features).
    • The extracted features are processed by the fully connected part of the network.
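
    Below is a minimal sketch of such a convolution-over-word-vectors network in Keras. It is not the author's exact architecture (that lives in the linked pynlc code): the filter count, kernel size and layer sizes are hypothetical placeholders, and the layer names follow the newer Conv1D spelling (older Keras releases called it Convolution1D). A sigmoid output with binary crossentropy would match the independent per-class probabilities shown in the prediction output later in the post.

    # Sketch only: hypothetical layer sizes, not the architecture used in pynlc.
    from keras.models import Sequential
    from keras.layers import Conv1D, GlobalMaxPooling1D, Dense

    def build_model(max_words, vector_size, class_count):
        # Input: one text as a (max_words x vector_size) matrix of word2vec vectors;
        # shorter texts are padded with zero rows up to max_words (the 2 * N above).
        model = Sequential()
        model.add(Conv1D(128, 3, activation="relu",
                         input_shape=(max_words, vector_size)))
        model.add(GlobalMaxPooling1D())
        model.add(Dense(64, activation="relu"))
        # One sigmoid unit per class, so each class probability is predicted
        # independently (multi-label classification).
        model.add(Dense(class_count, activation="sigmoid"))
        model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=["accuracy"])
        return model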

    I filter out stopwords beforehand (it did not affect the Reuters dataset, but on the smaller sets it did). More on that below.

    Installing the necessary software (keras/theano, cuda) on Windows


    Installation on Linux was significantly easier. Required:

    • python3.5
    • python header files (python-dev in debian)
    • gcc
    • cuda
    • python libraries are the same as in the list below

    In my case with win10 x64, the approximate sequence was as follows:

    • Anaconda with python3.5.
    • Cuda 8.0. You can run on the CPU instead (gcc is enough then, and the next 4 steps are not needed), but on relatively large datasets the slowdown should be significant (I did not check).
    • The path to nvcc added to PATH (otherwise theano will not find it).
    • Visual Studio 2015 with C++, including the Windows 10 Kit (corecrt.h is required).
    • The path to cl.exe added to PATH.
    • The path to corecrt.h added to INCLUDE (in my case C:\Program Files (x86)\Windows Kits\10\Include\10.0.10240.0\ucrt).
    • conda install mingw libpython - gcc and libpython will be needed when the network is compiled.
    • And pip install keras theano python-levenshtein gensim nltk (it might also work with the Keras backend switched from theano to tensorflow, but I have not tested that).
    • in .theanorc the following flag is specified for gcc (a fuller sample .theanorc is sketched after this list):

      		[gcc]
      		cxxflags = -D_hypot=hypot
        

    • Run python and execute

      		import nltk
      		nltk.download()
        
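
    For reference, a fuller .theanorc that enables the GPU might look roughly like the one below. This is a sketch rather than the exact file from the post; device and floatX are standard Theano options, and the values shown (the old-style gpu device, float32) are assumptions to adjust for your own card and setup.

      		[global]
      		device = gpu
      		floatX = float32

      		[gcc]
      		cxxflags = -D_hypot=hypot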


    Word processing


    At this stage, stopwords that are not part of any “whitelist” combination (more on that below) are removed, and the remaining words are vectorized. The input data for the algorithm:

    • the language - nltk needs it to tokenize the text and to return the list of stopwords
    • a “whitelist” of word combinations that contain stopwords. For example, “on” counts as a stopword, but [“turn”, “on”] is another matter
    • the word2vec vectors

    And the algorithm itself (I see at least two possible improvements, but have not gotten to them):

    • I split the input text into tokens with nltk.tokenize (roughly, “Hello, world!” is converted to [“hello”, “,”, “world”, “!”]).
    • I drop tokens that are not in the word2vec dictionary.

      More precisely, tokens that are not there and for which no similar token could be found by distance. So far I only use Levenshtein distance; there is an idea to additionally filter the tokens with the smallest Levenshtein distance by the distance from their vectors to the vectors present in the training set.

    • Select tokens:

      • those that are not in the stopword list (this reduced the error on the weather dataset, but without the next step it really ruined the result on “car_intents”).

      • if the token is in the stopword list, I check whether the text contains a whitelist sequence that includes it (roughly: on finding “on”, check for the presence of sequences from the list [[“turn”, “on”]]). If there is one, the token is kept anyway. There is room for improvement here: right now I only check (in our example) that “turn” is present somewhere, but it may have nothing to do with this particular “on”.

    • Replace selected tokens with their vectors.
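
    A compact sketch of this preprocessing might look as follows. It is only an illustration, not the pynlc implementation: text_to_vectors and closest_known_token are made-up names, word_vectors is assumed to behave like a dict from token to vector (supporting the in operator, indexing and iteration over its tokens, e.g. a loaded word2vec model), and the vector-distance refinement mentioned above is omitted.

    # Sketch only: names and thresholds here are illustrative, not from pynlc.
    import Levenshtein  # python-levenshtein, from the pip install list above
    from nltk import word_tokenize
    from nltk.corpus import stopwords

    def closest_known_token(token, vocabulary, max_distance=2):
        # Fallback for out-of-vocabulary tokens: pick the closest known token
        # by Levenshtein distance, or give up if nothing is close enough.
        best = min(vocabulary, key=lambda known: Levenshtein.distance(token, known))
        return best if Levenshtein.distance(token, best) <= max_distance else None

    def text_to_vectors(text, language, whitelist, word_vectors):
        stop_words = set(stopwords.words(language))
        tokens = [token.lower() for token in word_tokenize(text)]
        # Keep tokens known to word2vec, trying a Levenshtein substitute otherwise.
        known = []
        for token in tokens:
            if token in word_vectors:
                known.append(token)
            else:
                substitute = closest_known_token(token, word_vectors)
                if substitute is not None:
                    known.append(substitute)
        kept = []
        for token in known:
            if token not in stop_words:
                kept.append(token)
                continue
            # Stopword: keep it only if some whitelist combination containing it
            # occurs (loosely) among the known tokens, e.g. "on" in ["turn", "on"].
            if any(token in combination and all(word in known for word in combination)
                   for combination in whitelist):
                kept.append(token)
        return [word_vectors[token] for token in kept]

    The brute-force Levenshtein search over the whole vocabulary is of course slow; it is only there to show the idea.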

    Show me the code


    Here is the code I used to evaluate the impact of the changes:
    import itertools
    import json
    import numpy
    from gensim.models import Word2Vec
    from pynlc.test_data import reuters_classes, word2vec, car_classes, weather_classes
    from pynlc.text_classifier import TextClassifier
    from pynlc.text_processor import TextProcessor
    from sklearn.metrics import mean_squared_error
    def classification_demo(data_path, train_before, test_before, train_epochs, test_labels_path, instantiated_test_labels_path, trained_path):
        with open(data_path, 'r', encoding='utf-8') as data_source:
            data = json.load(data_source)
        texts = [item["text"] for item in data]
        class_names = [item["classes"] for item in data]
        train_texts = texts[:train_before]
        train_classes = class_names[:train_before]
        test_texts = texts[train_before:test_before]
        test_classes = class_names[train_before:test_before]
        text_processor = TextProcessor("english", [["turn", "on"], ["turn", "off"]], Word2Vec.load_word2vec_format(word2vec))
        classifier = TextClassifier(text_processor)
        classifier.train(train_texts, train_classes, train_epochs, True)
        prediction = classifier.predict(test_texts)
        with open(test_labels_path, "w", encoding="utf-8") as test_labels_output:
            test_labels_output_lst = []
            for i in range(0, len(prediction)):
                test_labels_output_lst.append({
                    "real": test_classes[i],
                    "classified": prediction[i]
                })
            json.dump(test_labels_output_lst, test_labels_output)
        instantiated_classifier = TextClassifier(text_processor, **classifier.config)
        instantiated_prediction = instantiated_classifier.predict(test_texts)
        with open(instantiated_test_labels_path, "w", encoding="utf-8") as instantiated_test_labels_output:
            instantiated_test_labels_output_lst = []
            for i in range(0, len(instantiated_prediction)):
                instantiated_test_labels_output_lst.append({
                    "real": test_classes[i],
                    "classified": instantiated_prediction[i]
                })
            json.dump(instantiated_test_labels_output_lst, instantiated_test_labels_output)
        with open(trained_path, "w", encoding="utf-8") as trained_output:
            json.dump(classifier.config, trained_output, ensure_ascii=True)
    def classification_error(files):
        for name in files:
            with open(name, "r", encoding="utf-8") as src:
                data = json.load(src)
            classes = []
            real = []
            for row in data:
                classes.append(row["real"])
                classified = row["classified"]
                row_classes = list(classified.keys())
                row_classes.sort()
                real.append([classified[class_name] for class_name in row_classes])
            labels = []
            class_names = list(set(itertools.chain(*classes)))
            class_names.sort()
            for item_classes in classes:
                labels.append([int(class_name in item_classes) for class_name in class_names])
            real_np = numpy.array(real)
            mse = mean_squared_error(numpy.array(labels), real_np)
            print(name, mse)
    if __name__ == '__main__':
        print("Reuters:\n")
        classification_demo(reuters_classes, 10000, 15000, 10,
                            "reuters_test_labels.json", "reuters_car_test_labels.json",
                            "reuters_trained.json")
        classification_error(["reuters_test_labels.json", "reuters_car_test_labels.json"])
        print("Car intents:\n")
        classification_demo(car_classes, 400, 500, 20,
                            "cars_test_labels.json", "instantiated_cars_test_labels.json",
                            "car_trained.json")
        classification_error(["cars_test_labels.json", "instantiated_cars_test_labels.json"])
        print("Weather:\n")
        classification_demo(weather_classes, 40, 50, 30,
                            "weather_test_labels.json", "instantiated_weather_test_labels.json",
                            "weather_trained.json")
        classification_error(["weather_test_labels.json", "instantiated_weather_test_labels.json"])
    


    Here you see:

    • Data preparation:

      with open(data_path, 'r', encoding='utf-8') as data_source:
         data = json.load(data_source)
      texts = [item["text"] for item in data]
      class_names = [item["classes"] for item in data]
      train_texts = texts[:train_before]
      train_classes = class_names[:train_before]
      test_texts = texts[train_before:test_before]
      test_classes = class_names[train_before:test_before]
      

    • Creating a new classifier:

      text_processor = TextProcessor("english", [["turn", "on"], ["turn", "off"]], Word2Vec.load_word2vec_format(word2vec))
      classifier = TextClassifier(text_processor)
      

    • Its training:

      classifier.train(train_texts, train_classes, train_epochs, True)
      

    • Predicting classes for a test sample and saving pairs of “real classes” - “predicted class probabilities”:

      prediction = classifier.predict(test_texts)
      with open(test_labels_path, "w", encoding="utf-8") as test_labels_output:
              test_labels_output_lst = []
              for i in range(0, len(prediction)):
                  test_labels_output_lst.append({
                      "real": test_classes[i],
                      "classified": prediction[i]
                  })
              json.dump(test_labels_output_lst, test_labels_output)
      

    • Creating a new classifier instance from a configuration (a dict that can be serialized to / deserialized from, for example, json):

      instantiated_classifier = TextClassifier(text_processor, **classifier.config)
      			

    The output looks approximately like this:

    C:\Users\user\pynlc-env\lib\site-packages\gensim\utils.py:840: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
      warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
    C:\Users\user\pynlc-env\lib\site-packages\gensim\utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
      warnings.warn("Pattern library is not installed, lemmatization won't be available.")
    Using Theano backend.
    Using gpu device 0: GeForce GT 730 (CNMeM is disabled, cuDNN not available)
    Reuters:
    Train on 3000 samples, validate on 7000 samples
    Epoch 1/10
    20/3000 [..............................] - ETA: 307s - loss: 0.6968 - acc: 0.5376
    ....
    3000/3000 [==============================] - 640s - loss: 0.0018 - acc: 0.9996 - val_loss: 0.0019 - val_acc: 0.9996
    Epoch 8/10
    20/3000 [..............................] - ETA: 323s - loss: 0.0012 - acc: 0.9994
    ...
    3000/3000 [==============================] - 635s - loss: 0.0012 - acc: 0.9997 - val_loss: 9.2200e-04 - val_acc: 0.9998
    Epoch 9/10
    20/3000 [..............................] - ETA: 315s - loss: 3.4387e-05 - acc: 1.0000
    ...
    3000/3000 [==============================] - 879s - loss: 0.0012 - acc: 0.9997 - val_loss: 0.0016 - val_acc: 0.9995
    Epoch 10/10
    20/3000 [..............................] - ETA: 327s - loss: 8.0144e-04 - acc: 0.9997
    ...
    3000/3000 [==============================] - 655s - loss: 0.0012 - acc: 0.9997 - val_loss: 7.4761e-04 - val_acc: 0.9998
    reuters_test_labels.json 0.000151774189194
    reuters_car_test_labels.json 0.000151774189194
    Car intents:
    Train on 280 samples, validate on 120 samples
    Epoch 1/20
    20/280 [=>............................] - ETA: 0s - loss: 0.6729 - acc: 0.5250
    ...
    280/280 [==============================] - 0s - loss: 0.2914 - acc: 0.8980 - val_loss: 0.2282 - val_acc: 0.9375
    ...
    Epoch 19/20
    20/280 [=>............................] - ETA: 0s - loss: 0.0552 - acc: 0.9857
    ...
    280/280 [==============================] - 0s - loss: 0.0464 - acc: 0.9842 - val_loss: 0.1647 - val_acc: 0.9494
    Epoch 20/20
    20/280 [=>............................] - ETA: 0s - loss: 0.0636 - acc: 0.9714
    ...
    280/280 [==============================] - 0s - loss: 0.0447 - acc: 0.9849 - val_loss: 0.1583 - val_acc: 0.9530
    cars_test_labels.json 0.0520754688092
    instantiated_cars_test_labels.json 0.0520754688092
    Weather:
    Train on 28 samples, validate on 12 samples
    Epoch 1/30
    20/28 [====================>.........] - ETA: 0s - loss: 0.6457 - acc: 0.6000
    ...
    Epoch 29/30
    20/28 [====================>.........] - ETA: 0s - loss: 0.0021 - acc: 1.0000
    ...
    28/28 [==============================] - 0s - loss: 0.0019 - acc: 1.0000 - val_loss: 0.1487 - val_acc: 0.9167
    Epoch 30/30
    ...
    28/28 [==============================] - 0s - loss: 0.0018 - acc: 1.0000 - val_loss: 0.1517 - val_acc: 0.9167
    weather_test_labels.json 0.0136964029149
    instantiated_weather_test_labels.json 0.0136964029149
    

    In the course of experiments with stopwords:

    • the error on the Reuters set stayed comparable regardless of whether stopwords were removed or kept.

    • the error on weather dropped from 8% when stopwords were removed. The more complex algorithm made no difference (because there are no combinations there for which a stopword still needs to be kept).

    • the error on car_intents rose to roughly 15% when stopwords were removed (for example, the conditional “turn on” was reduced to “turn”). After adding the whitelist handling it returned to the previous level.

    An example of running a pre-trained classifier


    The TextClassifier.config property is a dictionary that can be dumped, for example, to json; after restoring it from json, its elements can be passed to the TextClassifier constructor. For instance:

    import json
    from gensim.models import Word2Vec
    from pynlc.test_data import word2vec
    from pynlc import TextProcessor, TextClassifier
    if __name__ == '__main__':
        text_processor = TextProcessor("english", [["turn", "on"], ["turn", "off"]],
                                       Word2Vec.load_word2vec_format(word2vec))
        with open("weather_trained.json", "r", encoding="utf-8") as classifier_data_source:
            classifier_data = json.load(classifier_data_source)
        classifier = TextClassifier(text_processor, **classifier_data)
        texts = [
            "Will it be windy or rainy at evening?",
            "How cold it'll be today?"
        ]
        predictions = classifier.predict(texts)
        for i in range(0, len(texts)):
            print(texts[i])
            print(predictions[i])
    

    And its output:

    C:\Users\user\pynlc-env\lib\site-packages\gensim\utils.py:840: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
     warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
    C:\Users\user\pynlc-env\lib\site-packages\gensim\utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
      warnings.warn("Pattern library is not installed, lemmatization won't be available.")
    Using Theano backend.
    Will it be windy or rainy at evening?
    {'temperature': 0.039208538830280304, 'conditions': 0.9617446660995483}
    How cold it'll be today?
    {'temperature': 0.9986168146133423, 'conditions': 0.0016815820708870888}
      

    And yes, the network config trained on the Reuters dataset is here. A gigabyte of network for a 19 MB dataset, yes :-)
