Google's TensorFlow Machine Learning Library: First Impressions and a Comparison with My Own Implementation

    Recently, Google released its machine learning library, TensorFlow, to the general public. For us this was interesting not least because the framework includes state-of-the-art neural network models for text processing, in particular for sequence-to-sequence learning. Since we have several projects related to this technology, we decided this was a great opportunity to stop reinventing the wheel (probably for good) and quickly improve our results. Picturing the satisfied faces of our customers, we set to work. Here is what came of it...

    First, I should explain that we have our own neural network library. It is, of course, nowhere near as vast and ambitious; it is intended for solving specific text-processing tasks. At some point I decided to write my own solution rather than use a ready-made one, for various reasons: cross-platform support, compactness, ease of integration with the rest of our code, and an aesthetic reluctance to carry dozens of dependencies on mostly unneeded tools from assorted third-party authors. The result was a convenient tool written in F#, taking up about 2 MB and doing what was required of it. On the other hand, it is rather slow, does not support GPU computation, is limited in functionality, and leaves doubts about how well it matches modern realities.

    The question of whether to reinvent the wheel or take a ready-made one is, in general, eternal. And a worm of doubt kept gnawing at me: maintaining your own tools is an unjustifiable affair, costly in resources and limiting in possibilities. During periodic flare-ups, the idea would come up of switching to, say, Theano or Torch, like all normal people. But I never got around to it. With the release of TensorFlow, one more bit of motivation appeared.

    With all that in mind, I began looking into the system.

    Brief notes on the installation process
    TensorFlow is perhaps easier to install than a number of other modern neural network libraries. If you are lucky, the whole thing may come down to a single line typed into a Linux terminal. Yes, Windows is traditionally unsupported, but such a trifle will certainly not stop a real developer.

    We were not lucky: on Ubuntu 11 TensorFlow refused to install. But after an upgrade to 14.04 and some dancing with a tambourine, things finally worked; at the very least, we managed to run the code fragment from the getting started section. So one can safely write that installing TensorFlow is simple and painless, especially if you have a fresh distribution and your Python 2 is at least version 2.7.9. For Linux it is normal for a complex software package not to install right away (well, perhaps not normal, but it is a common enough situation).
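
    For reference, the sanity-check fragment from the getting started docs of that era looked roughly like this (this is the old graph-and-session API of TensorFlow 0.x/1.x; on TensorFlow 2 it would need the tf.compat.v1 compatibility module):

    ```python
    import tensorflow as tf

    # Build two constant nodes in the default graph and evaluate them in a session.
    hello = tf.constant('Hello, TensorFlow!')
    a = tf.constant(10)
    b = tf.constant(32)

    with tf.Session() as sess:
        print(sess.run(hello))   # Hello, TensorFlow!
        print(sess.run(a + b))   # 42
    ```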

    Testing it out
    Here I must say the following. Everything below should be treated as a single example from personal experience, carried out to satisfy personal curiosity in a fairly narrow area. The author does not claim that the findings are of global importance, or indeed of any value at all. Complaints about flaws in the results should be directed not at TensorFlow itself (which basically works perfectly, and, most importantly, quickly), but at the specific models and training examples.

    Chatbot
    I left the introductory lessons on recognizing the MNIST set of handwritten digits for later and went straight to the section on sequence-to-sequence learning. To make clear what is at stake, let me describe the essence of the matter in a little more detail.

    The task of sequence-to-sequence learning, in general form, is to generate a new sequence of symbols from an input sequence of symbols; in our particular case, the symbols are words. The best-known application of this task is probably machine translation: a sentence in one language is fed to the model's input, and at the output we get its translation into another language. As the figures from the introductory lessons show, there are two main classes of such models. The first is the encoder-decoder variant: the input sequence is encoded by one neural network into a fixed-length representation, and then a second neural network decodes this representation into a new sequence (Fig. 1) [1]. It is as if a person were given a text, asked to memorize it, then had it taken away and was asked to write down a translation into another language. The second class uses an attention mechanism, where the decoder can "peek" at the input sequence while it works (Fig. 2) [2].
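
    To make the encoder-decoder idea concrete, here is a minimal sketch in plain Python/NumPy. All names, shapes, and the simple Elman-style recurrences are illustrative assumptions, not the API of TensorFlow or of my library: the encoder folds the input word ids into one fixed-length vector, and the decoder unrolls that vector into output word ids until a stop symbol.

    ```python
    import numpy as np

    def encode(input_ids, E, W, U):
        """Fold a word-id sequence into one fixed-length state vector h."""
        h = np.zeros(W.shape[0])
        for i in input_ids:                  # simple recurrent update per word
            h = np.tanh(W @ h + U @ E[i])    # E[i] is the embedding of word i
        return h

    def decode(h, E, W, U, V, go_id, eos_id, max_len=20):
        """Unroll the fixed-length state into an output word-id sequence."""
        out, prev = [], go_id
        for _ in range(max_len):
            h = np.tanh(W @ h + U @ E[prev])
            logits = V @ h                   # scores over the output vocabulary
            prev = int(np.argmax(logits))    # greedy choice of the next word
            if prev == eos_id:               # stop at the end-of-sequence symbol
                break
            out.append(prev)
        return out
    ```

    The attention variant differs in that the decoder would recompute a weighted summary of all encoder states at every step instead of relying on the single vector h.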

    Figure 1. The basic encoder-decoder model for sequence-to-sequence learning. To the left of the special GO symbol is the input sequence; to the right is the generated sequence. Image from tensorflow.org

    Figure 2. The model with an attention mechanism. Image from tensorflow.org

    A chatbot is a special case of the sequence-to-sequence learning task (a question at the input, an answer at the output). Earlier we wrote about our chatbot implementation using this method. A little later, an article by Google employees [3] appeared on the same topic, but using a more advanced (as we then thought) model similar to the one shown in Fig. 1.

    The model from Google uses LSTM cells, whereas when designing my chatbot I used a plain recurrent network, bolting on just a single modification to make it work better. The dialogs shown in [3] look impressive, and more interesting than what I managed to get (moreover, the article suggests that a chatbot trained on dialogs from a user-support service could provide meaningful help). But the Google chatbot was trained on a much larger collection of data than my old sample.

    Having modified the standard machine-translation example from the TensorFlow suite, I loaded the collection of dialogs used to train my chatbot (only 3,000 examples, in contrast to the hundreds of thousands the Google chatbot was trained on). The Google tutorial states that the example implements the model from [2] plus sampled softmax [4]; that is, practically all the most modern results from this field are applied.
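
    For orientation, wiring up the model in that example looked roughly as follows in the TensorFlow of that era. Treat this as an approximate sketch rather than a definitive recipe: the module path and some parameter names changed between releases (tensorflow.python.ops.seq2seq early on, tf.contrib.legacy_seq2seq later), and the real tutorial additionally plugs in tf.nn.sampled_softmax_loss through an output projection.

    ```python
    import tensorflow as tf
    from tensorflow.python.ops import seq2seq  # path differed across releases

    vocab_size, emb_size, hidden = 40000, 128, 256
    cell = tf.nn.rnn_cell.GRUCell(hidden)

    # The old "unrolled" style: one int32 tensor of word ids per time step.
    encoder_inputs = [tf.placeholder(tf.int32, [None]) for _ in range(20)]
    decoder_inputs = [tf.placeholder(tf.int32, [None]) for _ in range(20)]

    outputs, state = seq2seq.embedding_attention_seq2seq(
        encoder_inputs, decoder_inputs, cell,
        num_encoder_symbols=vocab_size,
        num_decoder_symbols=vocab_size,
        embedding_size=emb_size,   # absent in the earliest versions
        feed_previous=False)       # True at inference: feed back generated words
    ```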

    My chatbot, as described earlier, uses a convolutional network as the encoder (one layer with 16 filters of width 2 and 3 words, plus a max-pooling layer), while the decoder is a simple Elman recurrent network. Neither, according to other researchers, has shown itself well in sequence-to-sequence learning tasks. So one modification was applied, which I had come up with more than a year earlier for another task (generating reviews). Instead of a single convolutional encoder, two convolutional networks are used: one encodes the source text, the other encodes the new, just-generated text. The outputs of their last pooling layers are connected pairwise in the next layer (i.e., each neuron there receives one input from a neuron of the first network and one from the corresponding neuron of the second). The idea was that once something corresponding to a part of the input has been generated, that part of the signal is suppressed by the second network, and the system moves on to generating the rest of the text. As far as I know, such a solution has not been described anywhere, and it certainly had not been at the time of implementation. Back then it seemed to work poorly (although the chatbot turned out better than with a single encoder), and I abandoned it when I saw the paper on the attention mechanism, deciding that the solution there was "cooler" and there was no point in inventing nonsense of my own.
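
    A minimal sketch of that pairing idea (NumPy, with made-up shapes and a crude convolution plus max-pooling; this illustrates the scheme described above, not the actual F# implementation):

    ```python
    import numpy as np

    def conv_maxpool(emb, filters):
        """1-D convolution over word embeddings, max-pooled over positions.
        emb: (seq_len, emb_dim); filters: list of (width, emb_dim) arrays."""
        feats = []
        for F in filters:
            w = F.shape[0]
            scores = [np.sum(F * emb[i:i + w]) for i in range(len(emb) - w + 1)]
            feats.append(max(scores))       # max-pool each filter's responses
        return np.array(feats)

    def paired_encoding(src_emb, gen_emb, filters, a, b):
        """Two encoders combined pairwise: neuron j sees feature j of the
        source encoder and feature j of the generated-so-far encoder, so the
        second term can suppress parts of the input already 'used up'."""
        s = conv_maxpool(src_emb, filters)  # encoding of the source text
        g = conv_maxpool(gen_emb, filters)  # encoding of the text generated so far
        return np.tanh(a * s - b * g)       # illustrative pairwise combination
    ```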

    Here is what the results look like now:

    Figure 3. Dialog with a chatbot using the seq2seq.embedding_attention_seq2seq model from the TensorFlow kit

    Dialogs with the chatbots (H: human, C: chatbot):

    Model from the TensorFlow kit:
    H: Hello
    C: hey there
    H: how are you?
    C: i'm fine
    H: where are you?
    C: i'm at fine
    H: Tell me something
    C: yes
    H: ok, see you soon
    C: see you nice clean are nice
    H: bye
    C: bye

    My model:
    H: Hello
    C: hi
    H: how are you?
    C: i feel strange
    H: where are you?
    C: I'm at home now
    H: Tell me something
    C: world
    H: ok, see you soon
    C: I'm not sure...
    H: bye
    C: ok


    Both models work, but, alas, no great improvement was achieved. Moreover, the second dialog (the one from my model) looks, in my subjective opinion, even more interesting and more correct. Most likely the matter is the amount of data. The model from Fig. 1 has significant representational power, but it needs a lot of data before it starts producing meaningful results (a situation we discussed in the previous article). My chatbot model may not be as good, but it can produce meaningful results under a data shortage, which makes it possible, for example, to build models of the communication of particular people from a limited selection of dialogs. For instance, if I take the reply phrases of only one of the interlocutors from my set of 3,000 phrase pairs, the following is obtained:


    Figure 4. Dialog with my model trained on the replies of a single interlocutor

    I (again, subjectively) get the feeling of a more positive and friendly conversation. Do you? I could not coax a better dialog than the one shown above out of the model from the TensorFlow kit, although I tried only about five different configurations; perhaps someone with more experience with it could do better.

    Reconstructing phrases from a set of words
    Reconstructing phrases from a set of words is a synthetic task that I use here in place of a practical task posed for us by one of our customers. The customer objected to the publication of both the task itself and examples with its data, so for this article I came up with a different task, similar in form.

    As the phrases, I used user queries to search engines: they are a good source of short meaningful phrases and, besides, were at hand in sufficient quantity. The essence of the task is as follows. Suppose we have the query "make a diploma to order" and its spoiled version "a diploma to order". From the corrupted version we must produce a meaningful query again, preserving the meaning of the original; that is, "make windows to order" would count as an incorrect result. The spoiled versions were generated automatically by rearranging words, changing gender, number, and case, and deleting all words shorter than four letters. In total, 120,000 training examples were produced this way, of which 1,000 were set aside for testing. The task seems simpler than machine translation, but at the same time has something in common with it.
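
    A sketch of such a corruption procedure (plain Python; the word-shuffling and short-word filter follow the description above, while the gender/number/case changes for real Russian queries would need a morphological analyzer, which is stubbed out here as an assumption):

    ```python
    import random

    def corrupt(query, inflect=None):
        """Spoil a query: drop words shorter than four letters, optionally
        re-inflect the remaining words, and shuffle the word order."""
        words = [w for w in query.split() if len(w) >= 4]
        if inflect is not None:
            words = [inflect(w) for w in words]  # e.g. random case/number change
        random.shuffle(words)
        return ' '.join(words)

    # Training pairs: (corrupted input, original target).
    queries = ['make a diploma to order']
    data = [(corrupt(q), q) for q in queries]
    ```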

    To solve the original problem, we had to develop a special model based on the idea described above in the chatbot section. Since the quality was insufficient for the customer's needs, I also added a mechanism for working with very large dictionaries, somewhat reminiscent of the idea of sampled softmax. By the way, I first learned about sampled softmax only four days ago, from the TensorFlow tutorial; until then I had not looked at the paper about it, which turned out to be fortunate, because of the results... but more about the results a little later. The model is also equipped with a means of controlling the degree of "fantasy" of the neural network, and with not so much a means of handling morphology as a way around the problem of handling it: each word form is still represented by a separate vector, as in systems that ignore morphology, but without the problems this usually entails. On the other hand, this solution lacks a proper attention mechanism, has a primitive representation of the input sequence, and uses neither LSTM nor GRU modules, thanks to which its speed in my implementation is quite adequate.
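
    For readers unfamiliar with sampled softmax [4]: the idea is to avoid normalizing over the whole (huge) output vocabulary during training by scoring only the true word plus a small random sample of negative words. A minimal illustrative sketch in NumPy, ignoring the sampling-bias correction that the full method applies:

    ```python
    import numpy as np

    def sampled_softmax_loss(h, W, target, num_sampled=64):
        """h: decoder state (dim,); W: output embeddings (vocab_size, dim).
        Score only the target plus num_sampled random negatives instead of
        the entire vocabulary -- cheap when vocab_size is huge."""
        vocab_size = W.shape[0]
        negatives = np.random.choice(vocab_size, num_sampled, replace=False)
        negatives = negatives[negatives != target]   # keep the target unique
        candidates = np.concatenate(([target], negatives))
        logits = W[candidates] @ h                   # scores for the candidates
        log_probs = logits - np.log(np.sum(np.exp(logits)))
        return -log_probs[0]                         # NLL of the target word
    ```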

    So, the results:

    Figure 5. Queries generated by the model from the TensorFlow kit

    Here is what the model from the TensorFlow kit produced. The generated search queries have some relation to the original keywords and, on the whole, follow the rules of the Russian language. But the meaning is in trouble; the "nanny for cutting the roof" alone is worth something. On the other hand, what delights me in these constructions is the model's grasp of the principles of the language, and its creative approach. For example, "apparatus for hematite monitor" sounds plausible and even somehow medical, if you do not know that hematite is an iron-ore mineral (Fe2O3) and that no apparatus for a hematite monitor exists. But for practical purposes this will not do: of the 100 test cases checked, not one was correct.

    My model produced the following variants:

    Figure 6. Queries generated by my model

    Here, original is the corrupted version, generated is the generated query, and human is the initial search query. Out of 100 checked examples, 72% were correct. No improvement could be obtained from the model in the TensorFlow kit.

    Speed and breadth of functionality

    In these respects, the package from Google certainly surpasses my library by an order of magnitude. It also has convolutional networks for image analysis with all the modern methods, LSTM and GRU, automatic differentiation, and in general the ability to easily build all kinds of models, not only neural networks. Everything is done quite intuitively and is well documented. It can safely be recommended, especially to those starting out in machine learning, provided, of course, that you have Linux or MacOS (perhaps the source code can be compiled on Windows using cygwin or mingw or in some other way, but this is not officially supported).

    As for speed, I have not yet made precise measurements, but the feeling is that TensorFlow models on the CPU run two to three times faster than my implementations with approximately the same number of parameters (one might have expected a bigger difference in performance), and consume noticeably less memory. The GPU version is about ten times faster than the CPU one (again, a general impression, so far without precise measurements). All this is natural: Google has plenty of resources and programmers working on code optimization (the TensorFlow page lists 40 names among the developers), while I have no such option, and it works fine as it is.

    On the other hand, my library takes up little space and runs on Windows as well as, via Mono, under Linux. In certain situations this can be a plus.

    As for the results, they must of course be interpreted with great care. They concern particular special cases, and, in addition, these are the results of specific models within a whole library whose functionality is much broader. So if my models were ported to TensorFlow, the results should come out the same, only everything would run much faster. In this sense, knowing the right neural network architecture is more important than knowing a specific technology stack.

    There remains, it is true, one philosophical question. If I had had access to TensorFlow from the start, or had worked with a similar ready-made tool, would I have been able to build the same models and obtain the same results? Does programming a system from scratch help one understand the foundations of neural networks more deeply, or is it a waste of time? Are performance limitations an incentive to develop new models, or an annoying hindrance?

    Conclusion
    Everyone is invited to draw their own conclusions. I have not yet obtained clear answers to my questions from these experiments; the problem of reinventing (or not reinventing) the wheel remains, but the information seems quite instructive.

    References

    1. Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in Neural Information Processing Systems. 2014.
    2. Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
    3. Vinyals, Oriol, and Quoc Le. "A neural conversational model." arXiv preprint arXiv:1506.05869 (2015).
    4. Jean, Sébastien, et al. "On using very large target vocabulary for neural machine translation." arXiv preprint arXiv:1412.2007 (2014).

    PS
    All trademarks mentioned in the article are the property of their respective owners.
