Teaching TensorFlow to draw the Cyrillic alphabet
Hi Habr! In recent years, new approaches to training neural networks have significantly expanded the range of practical applications of machine learning, and the appearance of many good high-level libraries has made it possible for specialists of very different backgrounds to try their hand at them.
Although I have some experience in machine learning, until now I had not worked with neural networks specifically. Riding the wave of their popularity, I decided to fill this gap and, at the same time, to try to write an article about it.
I set myself two goals. The first was to come up with a problem complex enough that solving it would expose the difficulties that arise in real-life work. The second was to solve this problem with one of the modern libraries and get to grips with its particular features.
TensorFlow was chosen as the library; the task and its solution are described below.
Task selection
Trying to come up with a problem, I was guided by the following considerations:
- It should be something more complex and interesting than the standard MNIST classification, since I was looking for a chance to deal with the specific difficulties of training neural networks;
- The architecture was planned to be a feedforward neural network, since it is always better to start an acquaintance with simple things;
- Training had to run fast enough on an ordinary gaming video card, since everything was done in my spare time.
While surfing the Internet and studying material on the subject, I came across, among others, the following articles:
- [1] A neural algorithm of artistic style [arXiv], which presents an approach to rendering images in the style of famous artists; the development of this area later led to the popular Prisma service;
- [2] Learning to Generate Chairs, Tables and Cars with Convolutional Networks [arXiv], which proposes a method for generating images, in particular of chairs, from given parameters such as color or viewing angle.
It was these articles, together with the requirement that training be fast, that suggested the idea of the problem to me.
Task
For most fonts available on the web, only a Latin version exists. What if we tried, using only the images of the Latin letters, to recover the Cyrillic version of a font while preserving its original style?
Let me say right away that I did not set out to build a finished product ready for publishing use. My goal was to learn how to work with neural networks, so I limited myself to raster images of size 64 × 64 pixels.
Library selection
Among the existing zoo of packages and libraries, my choice fell on TensorFlow for the following reasons:
- The library is fairly low-level compared to, for example, Keras, which on the one hand promised a certain amount of boilerplate code, and on the other the opportunity to dig into the details and try non-standard architectures;
- TensorFlow supports training on multiple machines and graphics cards, so the skills acquired with the library can be useful on real practical problems;
- The TensorBoard tool makes it easy to follow the learning process online; among other things, this allowed me to catch errors in the code at the early stages of training and not waste time on obviously bad architectures and parameter sets.
Network architecture
Every font has individual features. An average user will most likely mention italics or weight, while a typography specialist will not forget about antiqua, grotesques and other beautiful words.
Such individual features are difficult to formalize. When training neural networks, a typical approach to this problem is to use an embedding layer. In short, each modeled object is mapped to a numerical vector. The semantics of the individual components of the vector remain unknown, yet, strangely enough, the vector as a whole allows the object to be modeled in the context of the task. The best-known example of this approach is the Word2Vec model.
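As an illustration, here is a minimal sketch of how such embedding layers can be declared in TensorFlow 1.x. This is not the author's code; the variable names and sizes are assumptions made only for this example:

```python
import tensorflow as tf

NUM_FONTS, NUM_LETTERS, EMB_DIM = 6000, 60, 64  # assumed sizes for illustration

# Trainable embedding matrices: one vector per font and one per letter.
font_emb = tf.get_variable("font_embeddings", [NUM_FONTS, EMB_DIM])
letter_emb = tf.get_variable("letter_embeddings", [NUM_LETTERS, EMB_DIM])

# Integer ids of the font and the letter for each example in a batch.
font_ids = tf.placeholder(tf.int32, [None])
letter_ids = tf.placeholder(tf.int32, [None])

# Look up the corresponding vectors; these are fed to the generator network.
u_f = tf.nn.embedding_lookup(font_emb, font_ids)      # [batch, EMB_DIM]
v_c = tf.nn.embedding_lookup(letter_emb, letter_ids)  # [batch, EMB_DIM]
```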
I decided to map each font to a vector u_f and each letter to a vector v_c. The image of a specific letter in a given font is produced by a neural network N, to whose input the two vectors corresponding to the font and the letter are fed. The same letter vector v_c is used when constructing images of a single letter in the styles of different fonts (and, similarly, the same font vector u_f for all the letters of a single font).

The network N itself is a deconvolutional neural network; the image above, taken from the openai.com blog, is shown for demonstration. The deconvolution operation used and the number of layers in the network are discussed in more detail below.
Model training
The training sample consists of images of Russian and Latin letters; for each image, the letter and the font used are known.

At the training stage we look for the network weights and the embedding vectors of all letters and fonts from the training set using one of the variations of stochastic gradient descent. As the loss function, the total cross-entropy between the predicted and the real pixels is used (pixel values are scaled to the range from zero to one).
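As a rough sketch, such a pixel-wise cross-entropy loss could be written in TensorFlow 1.x as follows; the tensor names are placeholders, and the choice of Adam as the optimizer is my assumption (the article only says "one of the variations of the stochastic gradient method"):

```python
import tensorflow as tf

# logits:  raw [batch, 64, 64, 1] output of the generator before the final sigmoid
# targets: real glyph images with pixel values scaled to [0, 1]
pixel_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=targets, logits=logits)
loss = tf.reduce_sum(pixel_loss)  # total cross-entropy over all pixels
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```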
Image recovery
The trained neural network N and the embedding vectors of the Cyrillic and Latin characters are then used for the task of recovering Cyrillic fonts stated above.

Suppose we are given a Latin font f* that is missing from the training set. We can try to recover its embedding vector u_{f*} using only the images of its Latin letters, of which there are about two to three dozen. To do this, we solve the following optimization problem:

u_{f*} = argmin_u Σ_c L(N(u, v_c), I_c),

where the sum runs over the available Latin letters c, I_c is the image of letter c in the new font, v_c are the letter embeddings found at the training stage, and L is the same cross-entropy loss as in training.

Then, using the embedding vectors of the Cyrillic characters obtained at the training stage together with the recovered vector u_{f*}, we can easily restore the corresponding images with the network N.
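One possible way to set this up in TensorFlow 1.x is to freeze all trained variables and optimize only a fresh font vector. This is my own sketch, not the author's code; generator_logits, letter_emb, latin_ids, latin_images and EMB_DIM are hypothetical names reusing the earlier sketches:

```python
import tensorflow as tf

# A fresh, trainable embedding for the unseen font; all trained weights stay fixed.
new_font_vec = tf.get_variable("new_font_vec", [1, EMB_DIM])

# Generator output (before the final sigmoid) for the available Latin letters.
batch = tf.shape(latin_ids)[0]
logits = generator_logits(
    tf.tile(new_font_vec, tf.stack([batch, 1])),
    tf.nn.embedding_lookup(letter_emb, latin_ids))
loss = tf.reduce_sum(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=latin_images, logits=logits))

# Optimize only the new font vector, keeping the trained network weights untouched.
recover_op = tf.train.AdamOptimizer(1e-2).minimize(loss, var_list=[new_font_vec])
```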
Training sample preparation
As often happens, a significant part of the time was spent preparing the training sample. It all started with me going to sites known in narrow circles and downloading the font collections available there. The unpacked size of all the files came to ~14 GB.
It immediately turned out that, at the same point size, letters of different fonts can occupy a different number of pixels. To deal with this, I wrote a program that selects a point size for each font so that all of its characters fit into a 64 × 64 pixel image.
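A rough sketch of such fitting with Pillow; this is my own illustration, not the author's program, and the search range and the black-on-white convention are assumptions:

```python
from PIL import Image, ImageDraw, ImageFont

BOX = 64  # target raster size in pixels

def pick_size(font_path, chars):
    """Largest point size at which every character of the font fits into BOX x BOX."""
    for size in range(BOX, 3, -1):
        font = ImageFont.truetype(font_path, size)
        if all(max(font.getsize(ch)) <= BOX for ch in chars):
            return size
    return 4

def render_glyph(font_path, ch, size):
    """Render a single character as a black-on-white 64x64 grayscale image."""
    font = ImageFont.truetype(font_path, size)
    img = Image.new("L", (BOX, BOX), color=255)
    ImageDraw.Draw(img).text((0, 0), ch, font=font, fill=0)
    return img
```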

Next, all the fonts were loosely divided into three types: classic, exotic and ornamental. Ornamental fonts may depict anything but Cyrillic and Latin characters, for example emoticons or pictures. In exotic fonts all the letters are in place, but they are drawn in a special way: the colors may be inverted, frames added, calligraphic strokes used, and so on.
Judging by eye, ornamental fonts made up a large share of the collection, yet they can only be considered noise, so the problem of filtering them out arose. Unfortunately, the font metadata was of no help here. Instead, I started counting the number of connected monochrome regions in black-and-white images of individual letters and comparing it with a reference value. For example, the letter A has a reference value of three regions, and the letter T has two. If a font has many letters whose counts differ from the reference values, it is considered noise.
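A minimal sketch of such a check; the exact counting procedure and the binarization threshold are my assumptions:

```python
import numpy as np
from scipy import ndimage

def count_regions(img):
    """Number of connected monochrome regions in a binarized glyph image."""
    binary = np.asarray(img) > 128             # split pixels into black and white
    _, white_regions = ndimage.label(binary)   # connected white areas
    _, black_regions = ndimage.label(~binary)  # connected black areas
    return white_regions + black_regions

# A clean letter 'A' should give 3 (background, letter, enclosed hole),
# a clean letter 'T' should give 2 (background, letter).
```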
As a result, a significant part of the exotic fonts was eliminated as well. This did not upset me much, since I had no particular illusions about the generalization ability of the network being developed. The final collection contains ~6000 fonts.
About convolutional neural networks
If you have read this far, I assume you already have an idea of what convolutional neural networks are; if not, there is plenty of material on the subject online. Here I only want to note two points about convolutional networks:
- They are able to produce, at the output of their last layers, a set of characteristics of the original image whose combination can be used in its analysis ([3] Deep Convolutional Neural Networks as Generic Feature Extractors [PDF]);
- These global characteristics are obtained by sequentially computing local characteristics in the inner layers of the network.
In the context of this article, the first point suggests an analogy between those characteristics and the embedding layer described above, while the second gives an intuitive justification for the attempt to gradually restore the local characteristics of the original image from global ones.
Deconvolution
The convolution + ReLU + max-pooling transformation used in convolutional networks is not invertible. Several methods for its "restoration" have been proposed in the literature [2][4]. I decided to use the simplest one, transposed convolution + ReLU, which is essentially a linear map followed by the simplest nonlinear activation function. The main thing for me was to preserve the locality property.
TensorFlow has a conv2d_transpose function that implements this linear transformation. The main difficulty I ran into was picturing exactly how the computation is performed so as to choose its parameters sensibly. The illustration of the one-dimensional case from [5] MatConvNet - Convolutional Neural Networks for MATLAB (p. 30) [arXiv] came to my rescue:

As a result, I settled on the parameters stride = [2, 2] and kernel = [4, 4]. This allows each deconvolution layer to double the height and width of the image and to use groups of four adjacent pixels when computing the neurons of the next layer.
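A quick shape check of this choice of parameters with tf.nn.conv2d_transpose (an illustrative snippet; the batch size is arbitrary and the channel counts follow the network summary below):

```python
import tensorflow as tf

batch = 16
x = tf.placeholder(tf.float32, [batch, 8, 8, 128])   # input feature map
# Filter layout for conv2d_transpose: [height, width, out_channels, in_channels].
w = tf.get_variable("deconv_w", [4, 4, 64, 128])
y = tf.nn.conv2d_transpose(x, w,
                           output_shape=[batch, 16, 16, 64],  # spatial size doubled
                           strides=[1, 2, 2, 1],
                           padding="SAME")
y = tf.nn.relu(y)   # result: [16, 16, 16, 64]
```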
Learning process
While training the network I came across two points that I think are worth sharing.
Firstly, it was very important to balance the number of Cyrillic and Latin images in the training set. At first there were far more Latin characters in my sample, and as a result the network drew them better, to the detriment of the quality of the recovered Cyrillic characters.
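One simple way to keep that balance, a sketch of my own rather than necessarily what was done here, is to draw the two alphabets in equal proportion when forming each batch:

```python
import numpy as np

def balanced_batch(latin_imgs, cyr_imgs, batch_size=64):
    """Take half of each batch from Latin and half from Cyrillic image arrays."""
    half = batch_size // 2
    latin_idx = np.random.randint(len(latin_imgs), size=half)
    cyr_idx = np.random.randint(len(cyr_imgs), size=half)  # the smaller set is simply reused more often
    return np.concatenate([latin_imgs[latin_idx], cyr_imgs[cyr_idx]])
```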
Secondly, gradually decreasing the learning rate of the gradient descent helped the training. I did this manually while watching the error in TensorBoard, but TensorFlow also provides ways to automate this, for example the tf.train.exponential_decay function.
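For reference, a small sketch of how such a schedule can be wired up; the initial rate, decay step and decay rate are arbitrary example values, and the optimizer is again my assumption:

```python
import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.exponential_decay(
    learning_rate=1e-3,     # initial rate
    global_step=global_step,
    decay_steps=10000,      # decay every 10k steps
    decay_rate=0.5)         # halve the rate each time
# `loss` is the total cross-entropy defined earlier.
train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)
```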
I did not have to use other useful tricks such as regularization, dropout or batch normalization, since I managed to achieve results acceptable for my purposes without them.
Network summary
- The dimension of the embedding layer is 64 for both characters and fonts;
- 4 deconvolution layers with inner shapes 8 × 8 × 128 → 16 × 16 × 64 → 32 × 32 × 32 → 64 × 64 × 16, stride = [2, 2], kernel = [4, 4], ReLU activation;
- 1 convolution layer 64 × 64 × 16 → 64 × 64 × 1, stride = [1, 1], kernel = [4, 4], sigmoid activation.
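Putting the pieces together, here is how such a generator might look in TensorFlow 1.x. This is my reconstruction from the summary above, not the original code; in particular, how the two 64-dimensional embeddings are turned into the first spatial tensor is an assumption:

```python
import tensorflow as tf

def generator(u_f, v_c):
    """Map a font vector and a letter vector to a 64x64 glyph image (a sketch)."""
    x = tf.concat([u_f, v_c], axis=1)                    # [batch, 128]
    # Assumption: project the embeddings and reshape them into a small spatial tensor.
    x = tf.layers.dense(x, 4 * 4 * 256, activation=tf.nn.relu)
    x = tf.reshape(x, [-1, 4, 4, 256])
    # Four deconvolution layers: 8x8x128 -> 16x16x64 -> 32x32x32 -> 64x64x16.
    for filters in (128, 64, 32, 16):
        x = tf.layers.conv2d_transpose(x, filters, kernel_size=4, strides=2,
                                       padding="same", activation=tf.nn.relu)
    # Final 4x4 convolution down to a single channel with sigmoid activation.
    return tf.layers.conv2d(x, 1, kernel_size=4, strides=1, padding="same",
                            activation=tf.nn.sigmoid)    # [batch, 64, 64, 1]
```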
Results
Recently there was a good occasion to demonstrate the results: two Russian projects, MyOffice and Astra Linux, released free font collections at once. These fonts were not used in the training set. On the left is the original font, on the right is the output of the neural network. For compactness, only the Cyrillic letters that have no Latin counterparts are shown. The fonts below are XOCourserBold, PTAstraSansItalic, XOThamesBoldItalic and PTAstraSerifRegular.
The rest
The remaining results are posted in a separate archive.
Summing up
Having obtained these results, I consider my goals achieved, although it is worth noting that the results are far from ideal (this is especially noticeable for some italic fonts) and there are still many things that could be added, improved and verified.
How the article itself turned out is for you to judge.
Related links
[1] Gatys, Ecker, Bethge, A neural algorithm of artistic style [arXiv]
[2] Dosovitskiy, Springenberg, Tatarchenko, Brox, Learning to Generate Chairs, Tables and Cars with Convolutional Networks [arXiv]
[3] Hertel, Barth, Käster, Martinetz, Deep Convolutional Neural Networks as Generic Feature Extractors [PDF]
[4] Zeiler, Krishnan, Taylor, Fergus, Deconvolutional Networks [PDF]
[5] Vedaldi, Lenc, MatConvNet - Convolutional Neural Networks for MATLAB [arXiv]