Pictures from rough sketches: exactly how the NVIDIA GauGAN neural network works

Original author: Adam D King
Last month at NVIDIA GTC 2019, NVIDIA introduced a new application that turns simple colored blobs drawn by the user into stunning, photorealistic images.

The application is built on generative adversarial networks (GANs), a deep learning technology. NVIDIA itself calls it GauGAN, a pun on the name of the artist Paul Gauguin. GauGAN's functionality is based on the new SPADE algorithm.

In this article, I will explain how this engineering masterpiece works. To bring along as many interested readers as possible, I will try to give a detailed description of how convolutional neural networks work. Since SPADE is a generative adversarial network, I will also explain those in more detail. If you are already familiar with the term, you can skip straight to the “Image-to-image translation” section.

Image generation

Let's start with the basics: most modern deep learning applications use neural networks of the discriminative type (discriminators), whereas SPADE is a generative neural network (a generator).


A discriminator classifies its input data. For example, an image classifier is a discriminator that takes an image and selects one suitable class label, such as “dog”, “car” or “traffic light”; that is, it selects a label that describes the whole image. The classifier's output is usually a vector of numbers $v$, where $v_i$ is a number from 0 to 1 expressing the network's confidence that the image belongs to class $i$.

A discriminator can also produce a whole set of classifications: it can classify each pixel of an image as belonging to the class “people” or “cars” (so-called “semantic segmentation”).

The classifier takes an image with 3 channels (red, green and blue) and maps it to a vector of confidences, one for each possible class the image could represent.

Since the connection between the image and its class is very complex, neural networks pass it through a stack of many layers, each of which “slightly” processes it and transfers its output to the next level of interpretation.


A generative network like SPADE receives a data set and tries to create new, original data that looks as if it belongs to that data set. The data can be anything (sounds, language or something else), but we will focus on images. In general, the input to such a network is simply a vector of random numbers, and each possible input vector produces its own image.

A generator driven by a random input vector works as virtually the reverse of an image classifier. In “class-conditional” generators, the input vector is, in effect, a class vector for an entire class of data.

As we have already seen, SPADE uses much more than just a “random vector”. The system is guided by a kind of drawing called a “segmentation map”, which indicates what to place and where. SPADE performs the reverse of the semantic segmentation mentioned above. In general, a discriminative task that converts one data type into another has a generative counterpart that goes the other way.

Modern generators and discriminators usually use convolutional networks to process their data. For a more complete introduction to convolutional neural networks (CNNs), see the post by Ujjwal Karn or the work of Andrej Karpathy.

There is one important difference between a classifier and an image generator: how they change the size of the image as they process it. An image classifier must shrink the image until it loses all spatial information and only the classes remain. This can be done with pooling layers, or with strided convolutions that skip over some pixels. A generator, on the other hand, builds up an image using the reverse of convolution, called transposed convolution. It is often confused with “deconvolution” or “inverse convolution”.

A regular 2x2 convolution with a stride of 2 turns each 2x2 block into a single point, halving each output dimension.

A transposed 2x2 convolution with a stride of 2 generates a 2x2 block from each point, doubling each output dimension.
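To make the two captions above concrete, here is a minimal pure-Python sketch of both operations on a single-channel image. The function names and kernel weights are hypothetical and chosen only for illustration; real networks learn the weights and use many channels.

```python
def conv2x2_stride2(image, kernel):
    """Regular 2x2 convolution, stride 2: each 2x2 block becomes one value."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h - 1, 2):
        row = []
        for j in range(0, w - 1, 2):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(2) for dj in range(2)))
        out.append(row)
    return out


def transposed_conv2x2_stride2(image, kernel):
    """Transposed 2x2 convolution, stride 2: each value becomes a 2x2 block."""
    h, w = len(image), len(image[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            for di in range(2):
                for dj in range(2):
                    out[2 * i + di][2 * j + dj] = image[i][j] * kernel[di][dj]
    return out


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
averaging = [[0.25, 0.25], [0.25, 0.25]]  # hypothetical weights
ones = [[1.0, 1.0], [1.0, 1.0]]           # hypothetical weights

small = conv2x2_stride2(image, averaging)        # 4x4 -> 2x2
large = transposed_conv2x2_stride2(small, ones)  # 2x2 -> 4x4
```

With a stride equal to the kernel size the blocks never overlap, which is why the transposed version can simply write each 2x2 block instead of accumulating overlapping contributions.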

Generator Training

Theoretically, a convolutional neural network can generate images as described above. But how do we train it? That is, given a set of input images, how do we adjust the parameters of the generator (in our case, SPADE) so that it creates new images that look as if they belong to that data set?

To answer this, it helps to compare with image classifiers, where every training image has a correct class label. Knowing the network's prediction vector and the correct class, we can use the backpropagation algorithm to determine how to update the network's parameters, so that its confidence in the correct class increases and its confidence in the other classes decreases.

The accuracy of the image classifier can be estimated by comparing its output element by element with the correct class vector. But for generators, there is no “right” output image.

The problem is that when the generator creates an image, there are no “correct” values for each pixel (unlike with a classifier, we cannot compare the result against a previously prepared set of answers; translator's note). Theoretically, any image that looks believable and similar to the target data is valid, even if its pixel values are very different from those of the real images.

So how can we tell the generator which pixels of its output it should change, and how, in order to create more realistic images (i.e. how do we provide an “error signal”)? Researchers have thought about this question a great deal, and it is in fact quite difficult. Most ideas, such as computing some average “distance” to real images, produce blurry, poor-quality pictures.

Ideally, we could “measure” how realistic the generated images look through the use of a “high-level” concept, such as “How difficult is it to distinguish this image from the real one?” ...

Generative adversarial networks

This is exactly what was implemented in Goodfellow et al., 2014. The idea is to generate images using two neural networks instead of one: one network is a generator, the other is an image classifier (a discriminator). The discriminator's task is to distinguish the generator's output images from real images in the original data set (the classes of these images are labeled “fake” and “real”). The generator's job is to fool the discriminator by creating images that are as similar as possible to the images in the data set. We can say that the generator and the discriminator are adversaries in this process. Hence the name: generative adversarial network.

A generative adversarial network with random vector input. In this example, the generator is trying to trick the discriminator into classifying one of its outputs as “real”.

How does this help us? Now we can use an error signal based solely on the discriminator's prediction: a value from 0 (“fake”) to 1 (“real”). Since the discriminator is a neural network, we can propagate its error signal back to the image generator. That is, the discriminator can tell the generator where and how it should adjust its images in order to better “deceive” the discriminator (in other words, how to make its images more realistic).

As it learns to spot fake images, the discriminator gives the generator better and better feedback on how to improve. In this way, the discriminator acts as a learned loss function for the generator.

Glorious Small GAN

The original GAN follows the logic described above. Its discriminator $D$ analyzes an image $x$ and outputs a value $D(x)$ from 0 to 1, which reflects its confidence that the image is real rather than faked by the generator. Its generator $G$ takes a random vector of normally distributed numbers $z$ and outputs an image $G(z)$ meant to fool the discriminator (the discriminator's verdict on this image is $D(G(z))$).

One issue we have not yet discussed is how a GAN is trained and what loss function its developers use to measure the network's performance. In general, the loss function should increase as the discriminator improves and decrease as the generator improves. The loss function of the original GAN has the following two terms. The first expresses how well the discriminator classifies real images as real. The second expresses how well the discriminator detects fake images:

$$ \mathcal{L}_\text{GAN}(D, G) = \underbrace{E_{\vec{x} \sim p_\text{data}} [\log D(\vec{x})]}_{\text{accuracy on real images}} + \underbrace{E_{\vec{z} \sim \mathcal{N}} [\log (1 - D(G(\vec{z})))]}_{\text{accuracy on fakes}} $$

The discriminator $D$ outputs its estimate of how likely the image is to be real. This makes sense, since $\log D(x)$ increases when the discriminator considers $x$ real. When the discriminator gets better at detecting fake images, the value of $\log(1 - D(G(z)))$ also increases: $D(G(z))$ tends to 0, so $1 - D(G(z))$ tends to 1 and its logarithm rises toward 0.

In practice, we estimate accuracy on entire batches of images. We take many (but by no means all) real images $x$ and many random vectors $z$, and average over them according to the formula above. The result is the overall error for that batch.
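As a small illustration of this batch estimate, the sketch below averages the two terms of $\mathcal{L}_\text{GAN}$ over lists of discriminator outputs. The function name and the sample values are hypothetical; in a real system the numbers would come from running $D$ on real images and on $G(z)$.

```python
import math


def gan_objective(d_real, d_fake):
    """Batch estimate of L_GAN from discriminator outputs in (0, 1).

    d_real: values D(x) for a batch of real images
    d_fake: values D(G(z)) for a batch of generated images
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical outputs, not taken from any real model.
confused = gan_objective(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
sharp = gan_objective(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
```

A discriminator that cannot tell real from fake (every output 0.5) scores $2 \log 0.5$, while a sharper one pushes the value up toward its maximum of 0.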

Over time, this leads to interesting results:

Goodfellow's GAN imitating the MNIST, TFD and CIFAR-10 data sets. The outlined images are the data set samples closest to the adjacent fakes.

All this was fantastic just 4.5 years ago. Fortunately, as SPADE and other networks show, machine learning continues to progress rapidly.

Training problems

Generative adversarial networks are notoriously difficult to train and unstable in operation. One problem is that if the generator gets too far ahead of the discriminator during training, its range of images narrows to just those that help it fool the discriminator. In the extreme, training the generator comes down to producing a single, universal image for fooling the discriminator. This problem is called “mode collapse”.

Mode collapse in a GAN similar to Goodfellow's. Note that many of these bedroom images look very similar to each other. Source

Another problem is that when the generator effectively fools the discriminator $D(G(z))$, it is working with a very small gradient, so $\mathcal{L}_\text{GAN}$ cannot supply $G(\vec{z})$ with enough information to find the true answer, in which the image would look more realistic.

Researchers' efforts to solve these problems have mainly focused on changing the structure of the loss function. One simple change, proposed by Xudong Mao et al., 2016, replaces the loss function $\mathcal{L}_\text{GAN}$ with a pair of simple functions $V_\text{LSGAN}$ based on least squares. This stabilizes training, yields better images and, thanks to non-vanishing gradients, reduces the chance of collapse.
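The pair of least-squares functions can be sketched as follows. This assumes the common 0/1 target coding (fake images should score 0, real images 1), which is one of the variants in the LSGAN paper; the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def lsgan_discriminator_loss(d_real, d_fake):
    """The discriminator wants D(x) near 1 and D(G(z)) near 0."""
    return (0.5 * mean([(d - 1.0) ** 2 for d in d_real])
            + 0.5 * mean([d ** 2 for d in d_fake]))


def lsgan_generator_loss(d_fake):
    """The generator wants D(G(z)) near 1."""
    return 0.5 * mean([(d - 1.0) ** 2 for d in d_fake])


perfect_d = lsgan_discriminator_loss(d_real=[1.0, 1.0], d_fake=[0.0, 0.0])
fooled_d = lsgan_discriminator_loss(d_real=[1.0], d_fake=[1.0])
weak_g = lsgan_generator_loss(d_fake=[0.0])
strong_g = lsgan_generator_loss(d_fake=[1.0])
```

The derivative of the generator term with respect to each score is proportional to $D(G(z)) - 1$, so the gradient shrinks only as the score actually approaches its target.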

Another problem researchers have encountered is the difficulty of obtaining high-resolution images, in part because a more detailed image gives the discriminator more information to detect fake images. Modern GANs begin to train the network with low-resolution images and gradually add more and more layers until the desired image size is reached.

The gradual addition of layers with higher resolution during GAN training significantly increases the stability of the entire process, as well as the speed and quality of the resulting image.

Image-to-image translation

So far, we have talked about generating images from random input. But SPADE does not use random data alone. It takes an image called a segmentation map, which assigns a material class to each pixel (for example, grass, wood, water, stone, sky). From this map, SPADE generates what looks like a photo. This is called “image-to-image translation”.
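One common way to hand such a map to a network is to turn it into one binary channel per class (a one-hot encoding). The sketch below assumes that representation; the class list reuses the examples from the text, and the tiny 3x3 map is made up.

```python
CLASSES = ["grass", "wood", "water", "stone", "sky"]


def one_hot(segmentation_map, num_classes):
    """Return channels[c][row][col] = 1.0 where the pixel has class c."""
    return [[[1.0 if label == c else 0.0 for label in row]
             for row in segmentation_map]
            for c in range(num_classes)]


# Hypothetical 3x3 map: sky on top, one wood pixel, grass and water below.
seg = [[4, 4, 4],
       [4, 1, 4],
       [0, 0, 2]]
channels = one_hot(seg, len(CLASSES))
```

Each pixel is set in exactly one channel, so the encoding carries the same information as the class indices but in a form that convolutional layers can consume directly.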

Six different kinds of image-to-image translation demonstrated by pix2pix. Pix2pix is the predecessor of the two networks we will discuss next: pix2pixHD and SPADE.

For the generator to learn this approach, it needs a set of segmentation maps and the corresponding photos. We modify the GAN architecture so that both the generator and the discriminator receive the segmentation map. The generator, of course, needs the map in order to know what to draw where. The discriminator also needs it, to make sure that the generator puts the right things in the right places.

During the training, the generator learns not to put grass where “sky” is indicated on the segmentation map, because otherwise the discriminator can easily detect a fake image, and so on.

For image-to-image translation, the input image is fed to both the generator and the discriminator. The discriminator additionally receives either the generator's output or the true output from the training data set.

Image-to-image translator development

Let's look at a real image-to-image translator: pix2pixHD. SPADE, by the way, is largely modeled on pix2pixHD.

In an image-to-image translator, the generator both takes an image as input and produces an image as output. We could simply use a stack of convolutional layers, but since convolutional layers only combine values within small regions, we would need too many layers to carry information across a high-resolution image.

pix2pixHD solves this problem more efficiently with an “encoder”, which downsamples the input image, followed by a “decoder”, which upsamples it to produce the output image. As we will soon see, SPADE has a more elegant solution that does not require an encoder.

High-level diagram of the pix2pixHD network. The “residual” blocks and the “+” operation refer to the “skip connections” technique from residual neural networks. The network also has skip connections that link blocks in the encoder and the decoder.

Batch normalization is a problem

Almost all modern convolutional neural networks use batch normalization or one of its variants to speed up and stabilize training. The activations of each channel are shifted to a mean of 0 and scaled to a standard deviation of 1, after which a pair of per-channel parameters $\beta$ and $\gamma$ lets them be denormalized again.

$ y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta $

Unfortunately, batch normalization hurts generators, making it hard for the network to implement certain kinds of image processing. Instead of normalizing over a batch of images, pix2pixHD uses instance normalization, which normalizes each image individually.
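The difference between the two schemes lies only in which values the mean and variance are computed over. The following sketch applies the normalization formula to one channel of a tiny batch; the function names, the value of $\epsilon$ and the flattened three-pixel “images” are assumptions made for illustration.

```python
import math

EPS = 1e-5  # assumed value of epsilon


def normalize(values, gamma=1.0, beta=0.0):
    """y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta"""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + EPS) * gamma + beta for v in values]


def batch_norm(batch, gamma=1.0, beta=0.0):
    """Statistics are shared by all images in the batch (one channel)."""
    flat = [v for image in batch for v in image]
    normed = normalize(flat, gamma, beta)
    n = len(batch[0])  # assumes all images have the same size
    return [normed[i * n:(i + 1) * n] for i in range(len(batch))]


def instance_norm(batch, gamma=1.0, beta=0.0):
    """Statistics are computed separately for every image."""
    return [normalize(image, gamma, beta) for image in batch]


# Two flattened one-channel images: one dark, one bright.
batch = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
bn = batch_norm(batch)
inorm = instance_norm(batch)
```

Under batch normalization the dark image stays below zero and the bright one above it; under instance normalization both images end up with a mean of zero and nearly identical values.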

Pix2pixHD Training

Modern GANs such as pix2pixHD and SPADE measure the realism of their output images a little differently from the original generative adversarial network design described above.

To solve the problem of generating high-resolution images, pix2pixHD uses three discriminators of the same structure, each of which receives the output image at a different scale (normal size, reduced by 2 times and reduced by 4 times).

Pix2pixHD uses $V_\text{LSGAN}$ and also includes another term designed to make the generator's outputs more realistic (regardless of whether this helps to fool the discriminator). This term, $\mathcal{L}_\text{FM}$, is called “feature matching”: it encourages the generator to make the distribution of layer activations inside the discriminator the same for real data and for the generator's outputs, by minimizing the $L_1$ distance between them.
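A rough sketch of the feature matching term: given the discriminator's intermediate activations for a real image and for a generated one, take the mean absolute ($L_1$) difference per layer and add the layers up. The function name and the activation values are hypothetical, and each layer is flattened to a plain list for simplicity.

```python
def feature_matching_loss(real_features, fake_features):
    """Sum over layers of the mean L1 distance between activations.

    real_features, fake_features: one flattened list of activations per
    discriminator layer, for a real image and a generated image.
    """
    total = 0.0
    for real_layer, fake_layer in zip(real_features, fake_features):
        diff = sum(abs(r - f) for r, f in zip(real_layer, fake_layer))
        total += diff / len(real_layer)
    return total


# Hypothetical activations from two discriminator layers.
real = [[1.0, 2.0], [0.0, 0.0, 3.0]]
fake = [[1.0, 4.0], [0.0, 3.0, 3.0]]

loss = feature_matching_loss(real, fake)
zero = feature_matching_loss(real, real)
```

The loss is zero only when the generated image produces the same activations as the real one at every layer, which is a stricter requirement than just matching the discriminator's final verdict.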

So, optimization comes down to the following:

$$ \min_G \bigg( \big( \max_{D_1, D_2, D_3} \sum_{k=1,2,3} V_\text{LSGAN}(G, D_k) \big) + \lambda \sum_{k=1,2,3} \mathcal{L}_\text{FM}(G, D_k) \bigg), $$

where the losses are summed over the three discriminators, and the coefficient $\lambda = 10$ controls the relative priority of the two terms.

pix2pixHD uses a segmentation map composed of a real bedroom (on the left in each example) to create a fake bedroom (on the right).

The discriminators downsample the image, but rather than reducing it all the way to a single verdict for the entire image, they stop at “patches” of size 70 × 70 (at their respective scales). They then simply sum up the values of these patches over the entire image.

This approach works well, because $\mathcal{L}_\text{FM}$ takes care that the image looks realistic at high resolution, while $V_\text{LSGAN}$ only has to check small patches. It also has additional benefits: it speeds up the network, reduces the number of parameters, and can be used to generate images of any size.

pix2pixHD generates photorealistic faces with matching expressions from simple face outlines. Each example shows a real image from the CelebA data set on the left, a sketch of that celebrity's facial expression, and on the right an image generated from that sketch.

What is wrong with pix2pixHD?

These results are incredible, but we can do more. It turns out that pix2pixHD loses a lot in one important aspect.

Consider what pix2pixHD does with a single-class input, say, a map that is grass everywhere. Since the input is spatially uniform, the outputs of the first convolutional layer are spatially uniform too. Instance normalization then “normalizes” all the (identical) values of each channel in the image and returns $0$ as the output for all of them. The β parameter can shift this value away from zero, but the fact remains: the output no longer depends on whether the input was “grass”, “sky”, “water” or something else.

In pix2pixHD, instance normalization tends to ignore information from the segmentation map. For images consisting of one class, the network generates the same image regardless of this class itself.
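This effect is easy to reproduce. In the sketch below, the first-layer responses for three uniform maps are made-up numbers; whatever they are, instance normalization reduces every channel to β (up to floating-point rounding).

```python
import math

EPS = 1e-5  # assumed value of epsilon


def instance_norm(channel, gamma=1.0, beta=0.0):
    """Normalize one channel of one image, then apply gamma and beta."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return [(v - mean) / math.sqrt(var + EPS) * gamma + beta for v in channel]


# Hypothetical first-layer responses for maps filled with a single class.
FIRST_LAYER_RESPONSE = {"grass": 0.7, "sky": -1.3, "water": 2.1}

outputs = {
    name: instance_norm([value] * 16, gamma=2.0, beta=0.5)
    for name, value in FIRST_LAYER_RESPONSE.items()
}
```

All three classes produce the same output, a channel filled with the value of β, so nothing downstream can tell them apart.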

And the solution to this problem is the main design feature of SPADE.

Solution: SPADE

Finally, we arrive at a fundamentally new approach to creating images from segmentation maps: spatially-adaptive (de)normalization (SPADE).

The idea of SPADE is to prevent the loss of semantic information in the network by letting the segmentation map control the normalization parameters γ and β locally, at every individual layer. Instead of using just one pair of parameters per channel, they are computed for each spatial location by passing a downsampled segmentation map through 2 convolutional layers.

Instead of feeding the segmentation map into the first layer, SPADE uses downsampled versions of it to modulate the normalized output of each layer.
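A heavily simplified sketch of this modulation for a single channel follows. As an assumption for brevity, a per-class lookup table stands in for the two convolutional layers (it behaves like a 1x1 convolution over a one-hot map); the table values and function names are hypothetical.

```python
import math

EPS = 1e-5  # assumed value of epsilon

# Hypothetical per-class modulation values. In SPADE these come from
# convolutional layers applied to the segmentation map.
GAMMA = {"grass": 1.5, "sky": 0.5}
BETA = {"grass": -0.2, "sky": 0.8}


def normalize(values):
    """Parameter-free normalization of one channel."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + EPS) for v in values]


def spade(activations, labels):
    """Denormalize with a gamma and beta chosen separately for each pixel."""
    normed = normalize(activations)
    return [GAMMA[label] * n + BETA[label]
            for n, label in zip(normed, labels)]


uniform = [0.3] * 4  # spatially uniform activations
grass_out = spade(uniform, ["grass"] * 4)
sky_out = spade(uniform, ["sky"] * 4)
mixed_out = spade([1.0, 2.0, 3.0, 4.0], ["sky", "sky", "grass", "grass"])
```

Even for spatially uniform activations, the grass and sky outputs now differ, because the class information enters after the normalization step instead of being erased by it.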

The SPADE generator packs this whole design into small “residual blocks” that are placed between the upsampling layers (transposed convolutions):

High-level diagram of the SPADE generator compared with the pix2pixHD generator

Now that the segmentation map is fed into the network “from the inside”, there is no need to use it as input to the first layer. Instead, we can return to the original GAN scheme, in which a random vector serves as input. This also lets us generate a variety of images from a single segmentation map (“multimodal synthesis”). It also makes the entire pix2pixHD “encoder” unnecessary, which is a significant simplification.

SPADE uses the same loss function as pix2pixHD, with one change: instead of the least-squares term $V_\text{LSGAN}$, it uses a hinge loss.
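For reference, here is a sketch of the hinge loss in the form commonly used for GANs; whether SPADE uses exactly this variant is an assumption. The discriminator outputs are unbounded scores here rather than values between 0 and 1, and the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def hinge_discriminator_loss(d_real, d_fake):
    """Penalize real scores below +1 and fake scores above -1."""
    real_term = mean([max(0.0, 1.0 - d) for d in d_real])
    fake_term = mean([max(0.0, 1.0 + d) for d in d_fake])
    return real_term + fake_term


def hinge_generator_loss(d_fake):
    """The generator simply tries to raise the scores of its images."""
    return -mean(d_fake)


confident = hinge_discriminator_loss(d_real=[2.0, 1.0], d_fake=[-1.0, -3.0])
undecided = hinge_discriminator_loss(d_real=[0.0], d_fake=[0.0])
g_loss = hinge_generator_loss(d_fake=[0.5, 1.5])
```

Once a sample is classified correctly with a margin of at least 1, it stops contributing to the discriminator's loss, so the discriminator spends its capacity on the samples it still gets wrong.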

With these changes, we get great results:

Here the SPADE results are compared with the results of pix2pixHD


Let's think about how SPADE manages to produce such results. In the example below, we have a tree. GauGAN uses a single “tree” class to represent both the trunk and the leaves. Yet somehow SPADE works out that the narrow part at the bottom of the “tree” is the trunk and should be brown, while the big blob on top must be foliage.

The downsampled segmentation maps that SPADE uses to modulate each layer are what provide this kind of “intuitive” recognition.

You may notice that the tree trunk continues into the part of the crown that belongs to the “foliage”. So how does SPADE know where to put trunk there and where to put foliage? After all, judging by the 5x5 map, all of it should simply be “tree”.

The answer is that the region shown can receive information from lower-resolution layers, where a 5x5 block covers the entire tree. Each subsequent convolutional layer also moves information around the image a little, which gives a more complete picture.

SPADE lets the segmentation map modulate each layer directly, but this does not prevent information from spreading coherently between layers, as it does in pix2pixHD, for example. This prevents the loss of semantic information, because that information is renewed at each successive layer.

Style transfer

SPADE has one more trick up its sleeve: the ability to generate an image in a given style (for example, lighting, weather conditions, season).

SPADE can generate several different images from a single segmentation map, imitating a given style.

This works as follows: we pass images through an encoder and train it to produce generator vectors $z$ that, in turn, generate similar images. Once the encoder is trained, we replace the corresponding segmentation maps with arbitrary ones, and the SPADE generator creates images that match the new maps but are in the style of the supplied images.

Since the generator normally expects its input to be a sample from a multivariate normal distribution, we must train the encoder to output values with a similar distribution in order to get realistic images. This is, in essence, the idea behind variational autoencoders, which Joel Zeldes explains.
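The usual variational autoencoder recipe can be sketched in a few lines, assuming the encoder outputs a mean and a log-variance for each component of $z$. The function names are hypothetical; the KL term is the standard penalty that pulls the encoder's output distribution toward a standard normal one.

```python
import math
import random


def sample_z(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, sigma^2) and N(0, 1), summed over dims."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


rng = random.Random(0)
z = sample_z(mu=[0.0, 1.0], log_var=[0.0, -2.0], rng=rng)

no_penalty = kl_to_standard_normal(mu=[0.0, 0.0], log_var=[0.0, 0.0])
shifted = kl_to_standard_normal(mu=[1.0], log_var=[0.0])
```

Adding this penalty to the training objective keeps the encoded vectors in the region of the input space the generator already knows how to handle.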

This is how SPADE/GauGAN works. I hope this article has satisfied your curiosity about how NVIDIA's new system works. You can contact me via Twitter @AdamDanielKin or email
