
Applying AutoML to Transformer Architectures
- Translation
From the Google AI blog
Since their introduction in 2017, neural networks with the Transformer architecture have been applied to tasks of many kinds, from generating fantasy-style texts to writing musical harmonies. Importantly, the strong performance of Transformers has shown that, when applied to sequential tasks such as language modeling and translation, feedforward neural networks can be as effective as recurrent ones. Although the popularity of the Transformer and other feedforward models used for sequential tasks is growing, their architectures are almost always designed by hand, in contrast to computer vision, where automatic machine learning (AutoML) approaches have already discovered state-of-the-art models that surpass hand-tuned ones. Naturally, we wondered whether applying AutoML to sequential tasks could achieve the same success.
By running an evolution-based neural architecture search (NAS), and using translation as a representative sequential task, we discovered the Evolved Transformer (ET), a new Transformer architecture that demonstrates improvements on various natural language processing (NLP) tasks. ET not only achieves state-of-the-art results in translation, but also demonstrates better efficiency at language modeling compared to the original Transformer. We are releasing the new model in the Tensor2Tensor library, where it can be used for any sequential task.
Developing the Techniques
To begin the evolutionary neural architecture search, we had to develop new techniques, because the task used to evaluate the "fitness" of each architecture, WMT'14 English-German translation, is computationally demanding. As a result, these searches are more expensive than comparable searches in computer vision, which can work with smaller datasets such as CIFAR-10. The first of these techniques is a warm start: seeding the initial evolutionary population with Transformer-like architectures instead of random models. This concentrates the search in a region of the search space that is known to be strong, allowing it to find better models faster.
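As a rough illustration, a warm start simply replaces the random individuals that would normally seed the population with copies (or light mutations) of a known-good architecture. The sketch below is our own illustration, not the released search code: the encoding of an architecture as a list of layer "genes" and the names random_architecture, mutate, and seed_population are hypothetical.

```python
import random

# Hypothetical architecture encoding: a list of layer "genes" per block.
TRANSFORMER_GENES = ["self_attention", "feed_forward"]  # stand-in for the real encoding

def random_architecture(vocab, length=2):
    """A randomly initialized individual (the usual NAS starting point)."""
    return [random.choice(vocab) for _ in range(length)]

def mutate(arch, vocab, rate=0.3):
    """Randomly swap some genes to explore the neighborhood of an architecture."""
    return [random.choice(vocab) if random.random() < rate else g for g in arch]

def seed_population(size, vocab, warm_start=True):
    if warm_start:
        # Warm start: every initial individual is a (lightly mutated) Transformer.
        return [mutate(list(TRANSFORMER_GENES), vocab) for _ in range(size)]
    # Cold start: purely random individuals.
    return [random_architecture(vocab) for _ in range(size)]

vocab = ["self_attention", "feed_forward", "conv_3x1", "gated_linear_unit", "identity"]
population = seed_population(size=8, vocab=vocab)
```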
The second technique is a new method we developed called Progressive Dynamic Hurdles (PDH). This algorithm augments the evolutionary search by allocating more resources to the strongest candidates, in contrast to previous NAS work, where every candidate model was given the same amount of resources. PDH lets us stop evaluating a model early if it is performing poorly, while rewarding promising architectures with larger training budgets.
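The idea can be sketched as staged training with fitness thresholds ("hurdles") that a candidate must clear to earn its next budget of training steps. This is a simplified illustration under our own assumptions, not the released implementation: train_steps and evaluate_fitness are hypothetical stand-ins for the real training and evaluation loop, and the budgets and median hurdle used here are arbitrary choices.

```python
def progressive_dynamic_hurdles(candidates, train_steps, evaluate_fitness,
                                step_budgets=(10_000, 30_000, 60_000)):
    """Train candidates in stages; only those at or above the current hurdle
    (here, the median fitness of the stage) receive the next, larger budget.
    Candidates are assumed to be hashable model identifiers."""
    survivors = list(candidates)
    results = {}
    for budget in step_budgets:
        scored = []
        for cand in survivors:
            train_steps(cand, budget)          # continue training up to this budget
            fitness = evaluate_fitness(cand)   # e.g. negative validation perplexity
            results[cand] = fitness
            scored.append((fitness, cand))
        scored.sort(key=lambda fc: fc[0], reverse=True)
        hurdle = scored[len(scored) // 2][0]   # dynamic hurdle for this stage
        survivors = [c for f, c in scored if f >= hurdle]  # weak models stop here
    return results
```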
Evolved Transformer
Using these techniques, we ran a large-scale NAS on our translation task and discovered the ET. Like most sequence-to-sequence (seq2seq) neural network architectures, it has an encoder that encodes the input sequence into embeddings, and a decoder that uses those embeddings to produce the output sequence. In the case of translation, the input sequence is the sentence to be translated, and the output sequence is its translation.
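For readers unfamiliar with seq2seq models, the interface can be summarized as: encode once, then decode autoregressively while attending to the encoder output. The sketch below is schematic; encode and decode_step are hypothetical placeholders for any seq2seq model (Transformer, ET, or otherwise).

```python
def translate(source_tokens, encode, decode_step, start_token, end_token, max_len=128):
    """Generic seq2seq inference loop: encode the source once, then generate
    the output sequence one token at a time."""
    embeddings = encode(source_tokens)                 # encoder output ("memory")
    output = [start_token]
    for _ in range(max_len):
        next_token = decode_step(embeddings, output)   # attends to memory + prefix
        output.append(next_token)
        if next_token == end_token:
            break
    return output[1:]
```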
The most interesting feature of the ET is the convolutional layers at the bottom of both the encoder and decoder blocks, added in a similar branching pattern in both places (that is, the inputs pass through two different convolutional layers before being added together).

Comparison of the conventional Transformer encoder architecture and the ET encoder. Note the branched convolutional structure at the bottom of the block, which formed independently in both the encoder and the decoder. The decoder is described in detail in our paper.
This is especially interesting because the encoder and decoder architectures are not shared during NAS, so the usefulness of this structure was discovered independently in the encoder and in the decoder, which speaks in its favor. Whereas the original Transformer relied entirely on self-attention (attending over the representations it generates itself), the ET is a hybrid that takes advantage of both self-attention and wide convolution.
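To make the branching concrete, the sketch below (our illustration, not the released ET code) builds a block in which the same input is sent through two different convolutional paths whose outputs are summed, as described above. The kernel sizes, widths, activations, and normalization are assumptions and do not reproduce the exact searched cell.

```python
import tensorflow as tf

def branched_conv_block(x, filters=256):
    """Branched convolution in the spirit of the ET block: the input goes through
    two different convolutional paths, and the results are added together.
    Kernel sizes and widths here are illustrative, not the searched values."""
    left = tf.keras.layers.Conv1D(filters, kernel_size=3, padding="same",
                                  activation="relu")(x)
    right = tf.keras.layers.SeparableConv1D(filters, kernel_size=9, padding="same",
                                            activation="relu")(x)
    merged = tf.keras.layers.Add()([left, right])   # "added together" after branching
    return tf.keras.layers.LayerNormalization()(merged)

inputs = tf.keras.Input(shape=(None, 256))   # (time, hidden) sequence of embeddings
outputs = branched_conv_block(inputs)
model = tf.keras.Model(inputs, outputs)
```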
Evaluating the ET
To test the effectiveness of this new architecture, we first compared it with the original Transformer on the English-German translation task we used during the search. We found that ET achieves better BLEU scores and perplexity at all parameter sizes, with the largest gain at a size comparable to mobile models (~7 million parameters), which indicates efficient use of parameters. At larger sizes, ET achieves state-of-the-art results on WMT'14 En-De with a BLEU score of 29.8 and a SacreBLEU score of 29.2.

Comparison of ET and the original Transformer on WMT'14 En-De at different model sizes. The greatest advantage is at small sizes, while ET also performs strongly at larger sizes, outperforming the largest Transformer with 37.6% fewer parameters (comparable models are circled).
To check generalizability, we also compared ET with the Transformer on additional NLP problems. First, we looked at translation for other language pairs and found that ET performs better, with margins roughly similar to those seen on English-German translation; again, thanks to efficient parameter use, the largest gaps were observed for medium-sized models. We also compared the decoders of both models on language modeling on LM1B, and saw a significant improvement in perplexity.
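For reference, the BLEU and SacreBLEU scores reported above are corpus-level metrics computed from system outputs and reference translations. The snippet below shows one common way to compute them with the sacrebleu Python package; the package is an assumed dependency, not something shipped with this post, and the hypothesis and reference strings are made up.

```python
import sacrebleu  # pip install sacrebleu

# Toy system outputs and references; real evaluation uses the WMT'14 En-De test set.
hypotheses = ["The cat sits on the mat .", "He reads a book ."]
references = [["The cat is sitting on the mat .", "He is reading a book ."]]

score = sacrebleu.corpus_bleu(hypotheses, references)
print(f"SacreBLEU: {score.score:.1f}")
```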

Future plans
These results are a first step in exploring the application of architecture search to feedforward sequence models. The ET is being released as open source as part of the Tensor2Tensor project, where it can be used for any sequential problem. To improve reproducibility, we are also releasing the code for the search space we used, along with a Colab containing the PDH implementation. We look forward to seeing what the research community does with the new model, and we hope that others will build on these new search techniques!