AI learns language: why you need a machine translation hackathon

    image

    On December 18, the qualifying round opened for the DeepHack.Babel hackathon, organized by the Laboratory of Neural Systems and Deep Learning at the Moscow Institute of Physics and Technology. The focus will be on neural machine translation, which is gaining popularity in the research community and is already used in commercial products. Moreover, contrary to common practice, participants will have to train their machine translation systems on non-parallel data, that is, in machine learning terms, without supervision. If you are still deciding whether to register, here is why it is worth doing.

    What came before


    Until recently (before neural networks became popular), machine translation systems were essentially tables of translation options: for each word or phrase in the source language, they listed a number of possible translations into the target language. These translations were extracted from a large amount of parallel text (texts in two languages that are exact translations of each other) by analyzing how often words and expressions occur together. To translate a sentence, the system had to combine the translations of individual words and phrases and choose the most plausible option:

    image

    Translation options for the individual words and phrases of the sentence "er geht ja nicht nach hause" ("he does not go home") into English. The quality of each option is determined by a weighted sum of feature values, such as the probabilities p(e|f) and p(f|e), where f and e are the source and target phrases. Besides choosing suitable translations, the system also has to choose the order of the phrases. Illustration taken from a presentation by Philipp Koehn.
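    To make the scoring idea concrete, here is a minimal Python sketch of how a candidate translation could be scored as a weighted sum of log-probabilities taken from a phrase table. The table entries and feature weights are made up for illustration, not values from a real system.

```python
import math

# A toy phrase table: source phrase -> {candidate translation: (p(e|f), p(f|e))}.
# The entries are illustrative, not taken from a real system.
phrase_table = {
    "er": {"he": (0.6, 0.5), "it": (0.2, 0.3)},
    "geht": {"goes": (0.5, 0.6), "is going": (0.3, 0.2)},
    "nach hause": {"home": (0.7, 0.8)},
}

def score_segmentation(segmentation, weights=(1.0, 1.0)):
    """Score a sequence of (source phrase, target phrase) pairs as a
    weighted sum of log-probabilities from the phrase table."""
    total = 0.0
    for src, tgt in segmentation:
        p_e_given_f, p_f_given_e = phrase_table[src][tgt]
        total += weights[0] * math.log(p_e_given_f)
        total += weights[1] * math.log(p_f_given_e)
    return total

candidate = [("er", "he"), ("geht", "goes"), ("nach hause", "home")]
print(score_segmentation(candidate))
```

    A real decoder would also add distortion and language model features and search over many segmentations and phrase orders, rather than scoring a single hand-picked one.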

    This is where the second component of a machine translation system comes into play: the probabilistic language model. Its classic version, the n-gram language model, is, like the translation table, based on the co-occurrence of words, but this time we are talking about the probability of seeing a word after a certain prefix (the n previous words). The higher the probability of each word in the generated sentence (that is, the less we "surprise" the language model with our choice of words), the more natural it sounds and the more likely it is to be the correct translation. Despite its apparent simplicity, this technique made it possible to achieve very high translation quality, not least because the probabilistic language model is trained only on a monolingual (rather than parallel) corpus; it can therefore be trained on a very large amount of data and learns well how the language is, and is not, spoken.
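    A toy illustration of the idea, assuming a bigram model with add-alpha smoothing (the corpus, smoothing constant and vocabulary size are arbitrary): a fluent word order gets a higher log-probability than a shuffled one.

```python
import math
from collections import defaultdict

# Count bigrams in a tiny monolingual corpus (illustrative data only).
corpus = [
    "he does not go home".split(),
    "he goes home".split(),
    "she does not go there".split(),
]

bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[(prev, cur)] += 1
        unigram_counts[prev] += 1

def sentence_logprob(sentence, alpha=0.1, vocab_size=1000):
    """Bigram log-probability with add-alpha smoothing: the less the model
    is 'surprised' by each word, the higher the score."""
    tokens = ["<s>"] + sentence + ["</s>"]
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        num = bigram_counts[(prev, cur)] + alpha
        den = unigram_counts[prev] + alpha * vocab_size
        logp += math.log(num / den)
    return logp

print(sentence_logprob("he does not go home".split()))
print(sentence_logprob("home go not does he".split()))  # word salad scores lower
```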

    What has changed with the advent of neural networks


    Neural networks have changed the approach to machine translation. Now translation is performed by "encoding" the entire sentence into a vector representation (which captures the overall meaning of the sentence in a language-independent form) and then "decoding" this representation into words of the target language. These transformations are usually performed by recurrent neural networks, which are designed specifically for processing sequences of objects (in our case, sequences of words).

    image

    Scheme of a three-layer encoder-decoder model. The encoder (the red part) builds the sentence representation: at each step it combines a new input word with the representation of the words read so far. The blue part is the representation of the whole sentence. The decoder (the green part) produces a word in the target language based on the representation of the source sentence and the previously generated word. The illustration is taken from the neural machine translation tutorial at ACL-2016.

    At each step, such a neural network combines a new input word (more precisely, its vector representation) with information about the previous words. The parameters of the network determine how much to "forget" and how much to "remember" at each step, so the representation of the whole sentence retains the most important information in it. The encoder-decoder architecture has already become a classic; you can read a description of it, for example, in [1].
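    For concreteness, here is a minimal PyTorch sketch of the encoder-decoder idea. The GRU cells, dimensions and toy usage below are assumptions made for illustration, not the exact architecture of [1].

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads source word embeddings one by one; the final hidden state
    serves as the vector representation of the whole sentence."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                  # src: (batch, src_len)
        _, hidden = self.rnn(self.embed(src))
        return hidden                         # (1, batch, hid_dim)

class Decoder(nn.Module):
    """Generates target words one at a time, conditioned on the sentence
    representation and the previously generated word."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_word, hidden):     # prev_word: (batch, 1)
        output, hidden = self.rnn(self.embed(prev_word), hidden)
        return self.out(output.squeeze(1)), hidden

# Toy usage: encode a "sentence" of random word ids and decode one step.
enc, dec = Encoder(1000), Decoder(1000)
hidden = enc(torch.randint(0, 1000, (2, 5)))          # batch of 2 source sentences
logits, hidden = dec(torch.randint(0, 1000, (2, 1)), hidden)
print(logits.shape)  # (2, 1000): a distribution over the target vocabulary
```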

    In fact, the basic version of this system does not work quite as well as one would expect, so additional tricks are needed for good translation quality. For example, a recurrent network with ordinary cells suffers from exploding or vanishing gradients (the gradients go to zero or to very large values and stop changing, which makes training the network impossible); to deal with this, cells of a different structure were proposed, LSTM [2] and GRU [3], which at each step decide which information to "forget" and which to pass on. When reading long sentences, the system forgets how they started; here bidirectional networks help, reading the sentence both from the beginning and from the end, as was done in [4]. In addition, a very important improvement was the attention mechanism [5] (it is also well described in this post). The idea is that during decoding (generating the translation) the system receives information about which word of the source sentence it should be translating at the current step.

    image

    The attention mechanism in the encoder-decoder architecture. The current state of the decoder (B) is multiplied by each of the encoder states (A); this determines which input word is most relevant for the decoder at the moment (instead of multiplication, another similarity operation can be used). The result of this multiplication is then converted into a probability distribution by the softmax function, which returns the weight of each input word for the decoder. The weighted combination of encoder states is fed to the decoder. Illustration taken from a post by Chris Olah.
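    A minimal sketch of this dot-product attention in PyTorch; the tensor names and dimensions are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def attention(decoder_state, encoder_states):
    """Dot-product attention: score each encoder state against the current
    decoder state, turn the scores into weights with softmax, and return
    the weighted combination of encoder states."""
    # decoder_state: (batch, hid_dim); encoder_states: (batch, src_len, hid_dim)
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze(2)  # (batch, src_len)
    weights = F.softmax(scores, dim=1)                                         # (batch, src_len)
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)       # (batch, hid_dim)
    return context, weights

# Toy usage with random states.
context, weights = attention(torch.randn(2, 128), torch.randn(2, 5, 128))
print(weights.sum(dim=1))  # each row of attention weights sums to 1
```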

    With all these additional techniques, neural machine translation confidently beats statistical systems: for example, at the latest machine translation evaluation campaign, neural models came first for almost all language pairs. However, the statistical models that are now fading into history have one feature that has not yet been transferred to neural networks: the ability to use a large amount of non-parallel data (that is, data for which there is no translation into the other language).

    Why use non-parallel data


    One may ask: why use non-parallel data if neural systems are already quite good without it? The point is that good quality requires a very large amount of data, which is not always available. Neural networks are known to be demanding with respect to the amount of training data; it is easy to check that on a very small dataset almost any classical method (for example, Support Vector Machines) will beat a neural network. In machine translation there is enough data for some of the most popular language pairs (English ⇔ the major European languages, English ⇔ Chinese, English ⇔ Russian), and neural architectures show very good results for these pairs. But where fewer than a few million parallel sentences are available, neural networks are of little use. There are very few such resource-rich language pairs; monolingual texts, however, are available for a great many languages, and in large quantities: news, blogs, social networks, documents from government organizations, with new content being generated all the time. All of these texts could be used to improve the quality of neural machine translation, just as they helped improve statistical systems, but unfortunately such techniques have not really been developed yet.

    More precisely, there are a few examples of training a neural translation system on monolingual texts: [6] describes a combination of the encoder-decoder architecture with a probabilistic language model, and in [7] the missing translations for a monolingual corpus are generated by the model itself. All these methods improve translation quality, but using monolingual corpora in neural machine translation has not yet become common practice: it is not clear how best to use non-parallel texts during training, which approach works better, or whether there are differences across language pairs, architectures, and so on. These are exactly the questions we will try to answer at our DeepHack.Babel hackathon.
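    As a rough illustration of the second idea, back-translation in the spirit of [7], the data flow might look like the sketch below. Here `reverse_model.translate` is a hypothetical stand-in for the decoding routine of a trained target-to-source model, not an API from any particular library.

```python
def back_translate(monolingual_target, reverse_model, parallel_data):
    """Augment a parallel corpus with synthetic pairs in the spirit of [7]:
    a reverse (target-to-source) model produces an approximate source side
    for each monolingual target sentence."""
    synthetic_pairs = []
    for tgt_sentence in monolingual_target:
        # `reverse_model.translate` is a hypothetical interface standing in
        # for whatever decoding routine the trained reverse model exposes.
        src_sentence = reverse_model.translate(tgt_sentence)
        synthetic_pairs.append((src_sentence, tgt_sentence))
    # The forward model is then trained on real and synthetic pairs together.
    return parallel_data + synthetic_pairs
```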

    Non-Parallel Data and DeepHack.Babel


    We will try to conduct controlled experiments: participants will be given a very small parallel dataset, asked to train a neural machine translation system on it, and then to improve its quality using monolingual data. All participants will be on an equal footing: the same data, the same restrictions on model size and training time. This way we can find out which of the methods implemented by the participants work better and get closer to understanding how to improve translation quality for low-resource language pairs. We will run experiments on several language pairs of varying difficulty in order to test how universal the different solutions are.

    In addition, we are approaching an even more ambitious task, one that seemed impossible with statistical translation: translation without parallel data at all. The technology for training translation systems on parallel texts is already well known and well developed, although many related questions remain open. Comparable corpora (pairs of texts with a common topic and similar content) are also actively used in machine translation [8]; this makes it possible to use resources such as Wikipedia (the corresponding articles in different languages do not match word for word, but they describe the same objects). But what if there is no information at all about whether the texts correspond to each other? For example, when looking at two corpora of news from the same year in different languages, we can be sure that they discuss largely the same events, which means their content overlaps even though the texts are not translations of each other.

    Is it possible to use such data? Although this sounds like science fiction, there are already several examples in the scientific literature showing that it is: for example, a recent publication [9] describes such a system based on denoising autoencoders. Hackathon participants will be able to reproduce these methods and try to beat systems trained on parallel texts.

    image

    How a machine translation system without parallel data works. Autoencoder (left): the model learns to reconstruct a sentence from its distorted version; x is a sentence, C(x) is its distorted version, x̂ is the reconstruction. Translator (right): the model learns to translate a sentence into the other language; its input is a distorted translation produced by the version of the model from the previous iteration. The first version of the model is a lexicon (dictionary), also learned without parallel data. The combination of the two models achieves translation quality comparable to systems trained on parallel data. The illustration is taken from [9].
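    As an illustration of the corruption step, a C(x)-style noise function might combine word dropout with local shuffling, roughly as in the sketch below; the dropout probability and shuffling window are assumptions, not the exact settings of [9].

```python
import random

def corrupt(sentence, drop_prob=0.1, shuffle_window=3):
    """A C(x)-style noise function: drop some words at random and locally
    shuffle the rest, so the autoencoder must reconstruct the original
    sentence x from its distorted version C(x)."""
    # Randomly drop words, keeping at least one.
    kept = [w for w in sentence if random.random() > drop_prob] or sentence[:1]
    # Local shuffle: each word may move at most (shuffle_window - 1) positions.
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(kept))]
    return [w for _, w in sorted(zip(keys, kept))]

random.seed(0)
print(corrupt("he does not go home".split()))
```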

    How to take part


    Applications for the hackathon qualifying round will be accepted until January 8. The task of the qualifying round is to train a machine translation system from English into German. There are no restrictions on data or methods at this stage: participants may use any corpora and pre-trained models and choose the system architecture to their taste. Keep in mind, however, that the system will be tested on a set of sentences on IT topics, which suggests using data from relevant sources. And although there are no restrictions on the architecture, it is expected that while completing the qualifying task participants will become familiar with neural translation models, so that they can cope better with the main task of the hackathon.

    The 50 people whose systems show the best translation quality (measured by the BLEU metric [10]) will be able to take part in the hackathon. Those who do not make the cut should not be upset: they can attend the hackathon as listeners. Every day there will be lectures by specialists in machine translation, machine learning and natural language processing, open to everyone.
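    For reference, BLEU can be computed with standard libraries; here is a minimal sketch using NLTK (assuming it is installed), with made-up reference and hypothesis sentences.

```python
# A minimal BLEU computation with NLTK; corpus_bleu expects a list of
# reference lists and a list of tokenized hypotheses.
from nltk.translate.bleu_score import corpus_bleu

references = [[["the", "server", "returns", "an", "error"]]]
hypotheses = [["the", "server", "returns", "an", "error", "message"]]
print(corpus_bleu(references, hypotheses))
```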

    Bibliography
    1. Sequence to Sequence Learning with Neural Networks. I. Sutskever, O. Vinyals, Q. V. Le.
    2. LSTM: A Search Space Odyssey. K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, J. Schmidhuber.
    3. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio.
    4. Neural Machine Translation by Jointly Learning to Align and Translate. D. Bahdanau, K. Cho, Y. Bengio.
    5. Effective Approaches to Attention-based Neural Machine Translation. M.-T. Luong, H. Pham, C. D. Manning.
    6. On Using Monolingual Corpora in Neural Machine Translation. C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, Y. Bengio.
    7. Improving Neural Machine Translation Models with Monolingual Data. R. Sennrich, B. Haddow, A. Birch.
    8. Inducing Bilingual Lexica From Non-Parallel Data With Earth Mover's Distance Regularization. M. Zhang, Y. Liu, H. Luan, Y. Liu, M. Sun.
    9. Unsupervised Machine Translation Using Monolingual Corpora Only. G. Lample, L. Denoyer, M. Ranzato.
    10. BLEU: a Method for Automatic Evaluation of Machine Translation. K. Papineni, S. Roukos, T. Ward, W.-J. Zhu.
