Generate images from text using AttnGAN

Hi, Habr! I present to you a translation of the article "AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks" by Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He.

In this post, I want to share my experiments with the AttnGAN architecture for generating images from a text description. This architecture was already mentioned on Habr after the original paper came out in early 2018, and I was curious how difficult it would be to train such a model on my own.

Architecture Description

For those not familiar with AttnGAN or classic GANs, a brief summary. A classic GAN consists of at least two neural networks: a generator and a discriminator. The generator's task is to produce data (images, text, audio, video, etc.) that is "similar" to the real data from the dataset. The discriminator's task is to evaluate the generated data, comparing it with the real samples and rejecting fakes. Being rejected pushes the generator to produce better output in order to "fool" the discriminator, which in turn learns to recognize fakes better and better.
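This adversarial loop can be sketched in a few lines of PyTorch. The networks below are toy fully-connected models (a real image GAN uses convolutional nets), shown only to illustrate the alternating generator/discriminator updates:

```python
import torch
import torch.nn as nn

# Toy generator and discriminator for 8-dimensional "data".
G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
D = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(4, 8)   # a batch standing in for real samples
z = torch.randn(4, 16)     # noise input for the generator
fake = G(z)

# Discriminator step: push real samples toward label 1, fakes toward 0.
loss_d = bce(D(real), torch.ones(4, 1)) + bce(D(fake.detach()), torch.zeros(4, 1))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator step: try to make the discriminator output 1 for fakes.
loss_g = bce(D(fake), torch.ones(4, 1))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```

Note the `detach()` in the discriminator step: without it, the discriminator's loss would also backpropagate into the generator.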

There are a great many GAN modifications, and the AttnGAN authors approached the architecture very ingeniously. The model consists of nine neural networks finely tuned to interact with each other. It looks like this:

Text and image encoders convert the source text description and the real images into some internal representation. Notably, the text is treated as a sequence of individual words, whose representations are processed together with the representation of the image; this makes it possible to match individual words to separate parts of the image. This is how the attention mechanism, which the authors call DAMSM (Deep Attentional Multimodal Similarity Model), is realized.
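The core of this word-to-region matching can be sketched as follows. This is a simplified version of the attention DAMSM computes, with random tensors standing in for the encoder outputs; the dimensions are illustrative, not the paper's exact values:

```python
import torch
import torch.nn.functional as F

D, T, N = 256, 12, 289       # feature dim, number of words, image regions (17x17)
words = torch.randn(D, T)    # word features from the text encoder
regions = torch.randn(D, N)  # region features from the image encoder

# Similarity of every word to every region, softmax-normalized over regions:
# row t of `attn` is the attention map for word t.
attn = torch.softmax(words.t() @ regions, dim=1)   # (T, N)

# Region-context vector for each word: attention-weighted sum of regions.
context = regions @ attn.t()                       # (D, T)

# Word-level relevance: how well each word is "explained" by the image.
rel = F.cosine_similarity(words, context, dim=0)   # (T,)
```

In the real model these relevance scores are aggregated into an image-sentence matching score that drives the DAMSM loss; here only the attention map itself is shown.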

Fca creates a compact representation of the overall scene in the image, based on the entire text description. Its output vector c is concatenated with a vector z drawn from a normal distribution, which defines the variability of the scene. This combined vector is what the generator works from.
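A minimal sketch of this conditioning step, assuming the reparameterized sampling used in StackGAN-style conditioning augmentation (dimensions are illustrative):

```python
import torch
import torch.nn as nn

emb_dim, c_dim, z_dim = 256, 100, 100
sent_emb = torch.randn(4, emb_dim)   # sentence embedding from the text encoder

# F_ca: project the sentence embedding to a mean and log-variance,
# then sample c with the reparameterization trick.
fc = nn.Linear(emb_dim, c_dim * 2)
mu, logvar = fc(sent_emb).chunk(2, dim=1)
c = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

z = torch.randn(4, z_dim)             # noise controlling scene variability
gen_input = torch.cat([c, z], dim=1)  # (4, 200): input to the first generator stage
```

Sampling c from a distribution (rather than using the embedding directly) smooths the conditioning space, so small changes in the text produce gradual changes in the scene.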

The generator is the largest network, consisting of three stages. Each stage generates an image of increasing resolution, from 64×64 to 256×256 pixels, and the output of each stage is refined using Fattn attention networks, which carry information about the correct placement of individual objects in the scene. In addition, the output of each stage is checked by its own discriminator (three in total, working separately), which assesses the realism of the image and its consistency with the overall scene.
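One such refinement stage can be sketched like this: upsample the previous stage's features and fuse them with a word-attention context before emitting the next-resolution image. The module below is a simplified stand-in, not the authors' exact layer stack, and the channel counts are illustrative:

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """One generator stage: double the resolution, fuse attention context."""
    def __init__(self, ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='nearest')
        self.fuse = nn.Conv2d(ch * 2, ch, 3, padding=1)
        self.to_img = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, h, attn_ctx):
        h = self.up(h)                                # e.g. 64x64 -> 128x128
        h = self.fuse(torch.cat([h, attn_ctx], dim=1))
        # tanh maps pixels to [-1, 1]; features h feed the next stage.
        return h, torch.tanh(self.to_img(h))

stage = Stage(32)
h64 = torch.randn(1, 32, 64, 64)     # features behind the 64x64 image
ctx = torch.randn(1, 32, 128, 128)   # word-context from an F_attn network
h128, img128 = stage(h64, ctx)       # next features + a 128x128 RGB image
```

Stacking three such stages gives the 64 → 128 → 256 pipeline, with each intermediate image fed to its own discriminator.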


To test the architecture, I used the standard CUB dataset, with photos and textual descriptions of birds.

Training of the full model takes place in two stages. The first stage is pre-training the DAMSM networks, consisting of the text and image encoders. During this stage, as described above, an "attention map" is built, which looks like this:

As you can see from the figure, DAMSM manages to capture the relationship between individual words of the text description and image elements very accurately; colors are especially easy for the model to recognize. Note that the model has no prior knowledge of what "red", "yellow", "wings", or "beak" mean: there is only a set of texts and images.

DAMSM training proceeds without any problems; the training time on this dataset is 150-200 epochs, which corresponds to several hours on a powerful GPU.

The second and main stage is training the generator using the DAMSM model.
At each stage, the generator produces an image of higher resolution; it looks like this:

Generator training takes much longer and is not always as stable; the recommended training time for this dataset is 300-600 epochs, which corresponds to roughly 4-8 days on a powerful GPU.

The main problem with training the generator, in my opinion, is the absence of sufficiently good metrics that would allow evaluating the quality of training more formally. I studied several implementations of the Inception score, which in theory is positioned as a universal metric for such tasks, but they did not seem convincing enough to me. If you decide to train such a generator, you will need to constantly monitor the progress of training visually, using intermediate results. However, this rule applies to any task of this kind: visual inspection is always necessary.
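For reference, the Inception score itself is simple to state: take the class distribution a pretrained classifier assigns to each generated image and compute exp of the average KL divergence to the marginal distribution. The sketch below uses random logits in place of a real Inception network's outputs:

```python
import torch
import torch.nn.functional as F

def inception_score(logits):
    """IS = exp(mean_x KL(p(y|x) || p(y))), from classifier logits."""
    p_yx = F.softmax(logits, dim=1)       # per-image class distributions
    p_y = p_yx.mean(dim=0, keepdim=True)  # marginal class distribution
    kl = (p_yx * (p_yx.log() - p_y.log())).sum(dim=1)
    return kl.mean().exp().item()

# Random logits stand in for a real classifier's outputs over generated images;
# the score is then close to its minimum of 1 (low confidence, low diversity).
score = inception_score(torch.randn(64, 10))
```

The score rewards images the classifier labels confidently (sharp p(y|x)) while the batch as a whole covers many classes (broad p(y)). Neither property directly measures whether a generated bird matches its text description, which is part of why the metric feels unconvincing here.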


Now the fun part. Using the trained model, let's try to generate images, starting with simple sentences:

Let's try more complex descriptions:

All the text descriptions are my own; I intentionally did not use phrases from the dataset for the tests. Of course, not all of these images came out on the first try. The model makes mistakes, as the authors themselves acknowledge. As the description grows longer and the number of elements to depict increases, it becomes harder and harder to maintain the realism of the whole scene. However, if you want to use something like this in production, say, generating pictures of certain objects for a designer, you can train and customize the system to your requirements, which can be quite strict.

For each text description, you can generate many image variants (including unrealistic ones), so there will always be something to choose from.

Technical details

For this work, I used a low-power GPU for prototyping and a Google Cloud server with a Tesla K80 for the training phase.

The source code was taken from the repository of the article's authors and underwent serious refactoring. The system was tested with Python 3.6 and PyTorch 0.4.1.

Thank you for your attention!

Original article: "AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks", 2018 - Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, Xiaodong He.
