Sequence-to-Sequence Models, Part 2
- Translation
Hello!
This is the second part of the translation; we posted the first part a couple of weeks ago in preparation for the launch of the second stream of the Data Scientist course. More interesting material and an open lesson are still ahead.
In the meantime, let's dive deeper into the models.
Neural Translation Model
While the core of the sequence-to-sequence model is built from the functions in tensorflow/tensorflow/python/ops/seq2seq.py, there are a couple of tricks used in our translation model in models/tutorials/rnn/translate/seq2seq_model.py that are worth mentioning.
Sampled softmax and output projection
As mentioned above, we want to use sampled softmax to handle a large output vocabulary. To decode from it, we also need to keep track of the output projection. Both the sampled softmax loss and the output projection are constructed by the following code in seq2seq_model.py.
if num_samples > 0 and num_samples < self.target_vocab_size:
  w_t = tf.get_variable("proj_w", [self.target_vocab_size, size], dtype=dtype)
  w = tf.transpose(w_t)
  b = tf.get_variable("proj_b", [self.target_vocab_size], dtype=dtype)
  output_projection = (w, b)

  def sampled_loss(labels, inputs):
    labels = tf.reshape(labels, [-1, 1])
    # We need to compute the sampled_softmax_loss using 32bit floats to
    # avoid numerical instabilities.
    local_w_t = tf.cast(w_t, tf.float32)
    local_b = tf.cast(b, tf.float32)
    local_inputs = tf.cast(inputs, tf.float32)
    return tf.cast(
        tf.nn.sampled_softmax_loss(
            weights=local_w_t,
            biases=local_b,
            labels=labels,
            inputs=local_inputs,
            num_sampled=num_samples,
            num_classes=self.target_vocab_size),
        dtype)
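In the model this loss is then used as the softmax loss function when building the seq2seq graph; a minimal sketch of the hookup (simplified from seq2seq_model.py):
softmax_loss_function = sampled_loss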
First, note that we only construct a sampled softmax if the number of samples (512 by default) is smaller than the target vocabulary size. For vocabularies smaller than 512, it is better to use the standard softmax loss.
Then we construct the output projection. It is a pair consisting of a weight matrix and a bias vector. When it is used, the rnn cell returns vectors of shape batch size by size, rather than batch size by target_vocab_size. To recover the logits, we need to multiply by the weight matrix and add the bias, which is what happens in lines 124-126 of seq2seq_model.py.
if output_projection is not None:
  for b in xrange(len(buckets)):
    self.outputs[b] = [tf.matmul(output, output_projection[0]) +
                       output_projection[1] for ...]
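For intuition, here is a minimal sketch (not part of the tutorial code) of how a single projected decoder output could be turned into a token id; output is assumed to be a [batch_size, size] tensor returned by the rnn cell, and w, b are the projection pair created above.
logits = tf.matmul(output, w) + b  # shape [batch_size, target_vocab_size]
token_ids = tf.argmax(logits, 1)   # greedy choice of the next output token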
Bucketing and padding
In addition to sampled softmax, our translation model also uses bucketing, a method that lets us handle sentences of different lengths efficiently. Let us first explain the problem. When translating from English to French, we have English sentences of varying length L1 at the input and French sentences of varying length L2 at the output. Since the English sentence is passed as encoder_inputs and the French sentence is produced as decoder_inputs (prefixed with the GO symbol), we would in principle need a separate seq2seq model for every pair (L1, L2 + 1) of English and French sentence lengths. That would result in an enormous graph consisting of many very similar subgraphs. On the other hand, we could pad every sentence with special PAD symbols; then we would need only one seq2seq model, for the maximal lengths. But such a model would be inefficient on short sentences: we would have to encode and decode many useless PAD symbols. As a compromise between building a graph for every pair of lengths and padding everything to a single length, we use a number of buckets and pad each sentence to the length of the smallest bucket that fits it. In translate.py we use the following buckets by default.
buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
Thus, if an English sentence with 3 tokens arrives at the input and the corresponding French sentence at the output contains 6 tokens, they go into the first bucket and are padded to length 5 at the encoder input and length 10 at the decoder input. If the English sentence has 8 tokens and the corresponding French one 18, they do not fit into the (10, 15) bucket and are moved to the (20, 25) bucket, i.e. the English sentence is padded to 20 tokens and the French one to 25.
Remember that when constructing the decoder inputs we prepend the special GO symbol. This happens in the get_batch() function in seq2seq_model.py, which also reverses the English sentence. Reversing the inputs was shown to improve the results of the neural translation model in Sutskever et al., 2014 (pdf). To put it all together, imagine the input sentence is "I go.", tokenized as ["I", "go", "."], and the output sentence is "Je vais.", tokenized as ["Je", "vais", "."]. The pair will be put into the (5, 10) bucket, with the encoder input represented as [PAD PAD "." "go" "I"] and the decoder input as [GO "Je" "vais" "." EOS PAD PAD PAD PAD PAD].
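As a rough illustration (this is not the tutorial's actual get_batch() code; the helper name and the literal PAD/GO/EOS placeholders are assumptions made for readability), bucketing, reversing and padding one sentence pair could be sketched like this.
PAD, GO, EOS = "PAD", "GO", "EOS"
buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]

def make_bucket_pair(source_tokens, target_tokens):
    # Pick the smallest bucket that fits the source and the target plus GO and EOS.
    for encoder_size, decoder_size in buckets:
        if len(source_tokens) <= encoder_size and len(target_tokens) + 2 <= decoder_size:
            break
    else:
        raise ValueError("sentence pair is too long for all buckets")
    # Encoder input: reversed source, left-padded with PAD up to encoder_size.
    encoder_input = [PAD] * (encoder_size - len(source_tokens)) + list(reversed(source_tokens))
    # Decoder input: GO + target + EOS, right-padded with PAD up to decoder_size.
    decoder_input = [GO] + list(target_tokens) + [EOS]
    decoder_input += [PAD] * (decoder_size - len(decoder_input))
    return encoder_input, decoder_input

print(make_bucket_pair(["I", "go", "."], ["Je", "vais", "."]))
# (['PAD', 'PAD', '.', 'go', 'I'],
#  ['GO', 'Je', 'vais', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD'])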
Let's run it
To train the model described above, you will need a large English-French corpus. We will use the 10^9 French-English corpus from the WMT'15 site for training, and the news test set from the same site as the development sample. Both datasets will be downloaded to data_dir when the following command is run; training will start and checkpoints will be saved in train_dir.
python translate.py
--data_dir [your_data_directory] --train_dir [checkpoints_directory]
--en_vocab_size=40000 --fr_vocab_size=40000
You will need 18GB of disk space and a few hours to prepare the training corpus. The corpus is extracted, vocabulary files are created in data_dir, and then the corpus is tokenized and converted to integer token ids. Pay attention to the parameters that control the vocabulary size. In the example above, all words outside the 40,000 most frequent ones will be converted to a UNK token representing unknown words. So if you change the vocabulary size, the binary will re-map the corpus to token ids again. After the data is prepared, training begins. The default values in translate are very large. Large models trained for a long time give good results, but this may take too long or use too much GPU memory. You can train a smaller model, as in the example below.
python translate.py
--data_dir [your_data_directory] --train_dir [checkpoints_directory]
--size=256 --num_layers=2 --steps_per_checkpoint=50
The command above will train a model with 2 layers (the default is 3), each with 256 units (the default is 1024), saving a checkpoint every 50 steps (the default is 200). Experiment with these parameters to find out how large a model fits into the memory of your GPU.
During training, every steps_per_checkpoint steps the binary prints statistics about the recent steps. With the default parameters (3 layers of size 1024), the first messages look like this:
global step 200 learning rate 0.5000 step-time 1.39 perplexity 1720.62
eval: bucket 0 perplexity 184.97
eval: bucket 1 perplexity 248.81
eval: bucket 2 perplexity 341.64
eval: bucket 3 perplexity 469.04
global step 400 learning rate 0.5000 step-time 1.38 perplexity 379.89
eval: bucket 0 perplexity 151.32
eval: bucket 1 perplexity 190.36
eval: bucket 2 perplexity 227.46
eval: bucket 3 perplexity 238.66
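For reference, perplexity here is simply the exponential of the average per-token cross-entropy loss; a minimal sketch of the relationship (not code from the tutorial):
import math

# A loss of roughly 7.45 nats per token corresponds to the
# perplexity of ~1720 shown in the first log line above.
def perplexity(avg_cross_entropy_loss):
    return math.exp(avg_cross_entropy_loss)

print(round(perplexity(7.45)))  # ~1720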
You can see that each step takes a little under 1.4 seconds, and the output shows the perplexity on the training set and the perplexity on the development sample for each bucket. After about 30 thousand steps, the perplexities on short sentences (buckets 0 and 1) go into single digits. The training corpus contains about 22 million sentences, so one epoch (one pass over the training data) takes about 340 thousand steps (22,000,000 / 64) with a batch size of 64. At this point the model can be used to translate English sentences into French using the --decode option.
python translate.py --decode
--data_dir [your_data_directory] --train_dir [checkpoints_directory]
Reading model parameters from /tmp/translate.ckpt-340000
> Who is the president of the United States?
Qui est le président des États-Unis ?
What's next?
The example above shows how to build your own end-to-end English-French translator. Run it and see how the model performs. The quality is acceptable, but the default parameters will not give you an ideal translation model. Here are a few things that can be improved.
First, we use a very primitive tokenizer, the basic_tokenizer function in data_utils. A better tokenizer can be found on the WMT'15 website. Using that tokenizer, together with a larger vocabulary, should improve your translations.
In addition, the default parameters of the translation model are not perfectly tuned. You can try changing the learning rate, its decay, or the initialization of the model weights. You can also replace the default GradientDescentOptimizer in seq2seq_model.py with something more advanced, for example AdagradOptimizer. Try it and watch your results improve!
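As a rough sketch (the exact construction in seq2seq_model.py may differ), swapping the optimizer could look like this:
# Hypothetical one-line change: use Adagrad instead of plain gradient descent.
# opt = tf.train.GradientDescentOptimizer(self.learning_rate)
opt = tf.train.AdagradOptimizer(self.learning_rate)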
Finally, the model presented above can be used not only for translation but for any other sequence-to-sequence task. Even if you want to turn a sequence into a tree, for example to generate a parse tree, the same model can produce state-of-the-art results, as shown in Vinyals & Kaiser et al., 2014 (pdf). So you can build not only your own translator, but also a parser, a chatbot, or any other program you like. Experiment!
That's all!
We look forward to your comments and questions here, and we also invite you to ask them to the teacher at the open lesson.