
Data augmentation for training a neural network, using printed characters as an example

There are already many articles on Habr devoted to image recognition with machine learning methods such as neural networks, support vector machines, and random forests. All of them require a significant number of examples for training and parameter tuning. Building training and test image databases of adequate size for them is a very non-trivial task. And this is not about the technical difficulty of collecting and storing a million images, but about the perennial situation where, at the first stage of system development, you have one and a half pictures. It should also be understood that the composition of the training set can affect the quality of the resulting recognition system more than all other factors. Despite this, most articles omit this important stage of development entirely.
If you are interested in learning about all this, welcome under the cut.
Before building a database of sample images and training a neural network, the task must be stated precisely. Clearly, recognizing handwritten text, the emotions of a human face, or the location where a photograph was taken are completely different tasks. The choice of platform also matters: in the cloud, on a PC, or on a mobile device, the available computing resources differ by orders of magnitude and will affect the architecture of the neural network used.
From here it gets more interesting. Recognizing images taken with high-resolution cameras and recognizing blurry frames from a webcam without autofocus require completely different data for training, testing, and validation. This is hinted at by the "no free lunch" theorems. That is why freely distributed training image databases (for example, [ 1 , 2 , 3 ]) are excellent for academic research, but they are almost always inapplicable to real-world problems because of their "generality".
The more accurately the training sample approximates the general population of images that will be fed to your system, the higher the maximum achievable quality of the result. In effect, a correctly assembled training sample is the most precise statement of the task! For example, if we want to recognize printed characters in photographs taken with a mobile device, the example database should contain photographs of documents from different sources, under different lighting, taken with different models of phones and cameras. All of this complicates collecting the number of examples required to train the recognizer.
Let us now consider several possible ways to prepare a sample of images to create a recognition system.
Creating training examples from natural images
Training examples from natural images are created from real data. The process consists of the following steps (a sketch of a possible markup record follows the list):
- Collecting graphic data (photographing the objects of interest, capturing frames from a camera's video stream, cropping parts of images from web pages).
- Filtering: checking that the images meet a number of requirements, such as sufficient illumination of the objects in them, presence of the required object, etc.
- Preparing markup tools (writing your own or adapting an existing one).
- Markup (selecting quadrangles, character cells, and regions of interest in the image).
- Assigning a label to each image (the letter or the name of the object in the image).
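As an illustration, here is a minimal sketch of what a single markup record might look like; the field names, file names, and JSON layout are hypothetical, not a format from the article.

```python
# A minimal sketch of one markup record: the source image, the quadrangle of a
# character cell (four corner points), and its label. All names are hypothetical.
import json

annotation = {
    "image": "passport_0001.jpg",          # source photograph
    "quad": [[112, 530], [134, 530],       # four corners of the character cell,
             [134, 562], [112, 562]],      # in image pixel coordinates
    "label": "K",                          # the character shown in this cell
}

with open("passport_0001.json", "w") as f:
    json.dump(annotation, f, indent=2)
```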
These operations require a significant investment of working time, so building a training base this way is very expensive. In addition, the data has to be collected under a variety of conditions: different lighting, different models of phones and cameras used to take the photos, different document sources (printing houses), and so on.
All this complicates collecting the required number of examples for the recognizer. On the other hand, the results of training the system on such data let one judge how effective it will be in real conditions.
Creating training examples from artificial images
Another approach to creating training data is to generate it artificially. You can take several templates or "ideal" examples (for example, font sets) and apply various distortions to them to create the required number of training examples. The following distortions can be used (a minimal sketch of a few of them follows the list):
- Geometric (affine, projective, ...).
- Brightness / color.
- Background replacement.
- Distortions characteristic of the problem being solved: glare, noise, blurring, etc.
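Below is a minimal sketch of a few of these distortions for a grayscale character crop, using Pillow and NumPy; the parameter ranges are illustrative assumptions, not values from the article.

```python
# A minimal sketch of a few distortions from the list above.
# Assumes a grayscale ("L" mode) character image on a white background.
import random

import numpy as np
from PIL import Image, ImageEnhance, ImageFilter


def distort(img: Image.Image) -> Image.Image:
    # Geometric: small random shift and rotation (filled with white at the edges).
    dx, dy = random.randint(-2, 2), random.randint(-2, 2)
    img = img.rotate(random.uniform(-5, 5), translate=(dx, dy), fillcolor=255)

    # Brightness.
    img = ImageEnhance.Brightness(img).enhance(random.uniform(0.7, 1.3))

    # Defocus (blur).
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.0, 1.0)))

    # Additive noise.
    arr = np.asarray(img, dtype=np.float32)
    arr += np.random.normal(0.0, 8.0, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```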
Examples of image distortion for the character recognition problem:
Shifts:

Rotations:

Additional lines on images:

Glare:

Defocus:

Compression and stretching along the axes:
Distortions can be generated with image-processing libraries [ 1 , 2 , 3 ] or with special programs that can create entire artificial documents or objects.
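As an illustration, here is a minimal sketch of how a fully artificial example can be assembled: a font glyph is rendered onto a background patch, after which distortions such as those sketched above can be applied. The font file, background file, and sizes are assumptions for the example.

```python
# A minimal sketch of assembling a fully artificial training example:
# render a font glyph onto a background patch. File names and sizes are
# illustrative assumptions.
from PIL import Image, ImageDraw, ImageFont


def make_artificial_sample(char: str,
                           font_path: str = "OCRB.ttf",
                           background_path: str = "background.png",
                           size: tuple = (32, 32)) -> Image.Image:
    background = Image.open(background_path).convert("L").resize(size)
    font = ImageFont.truetype(font_path, size=24)
    draw = ImageDraw.Draw(background)
    # Draw the character roughly centered on the background patch.
    draw.text((size[0] // 2, size[1] // 2), char, fill=0, font=font, anchor="mm")
    return background  # distortions from the earlier sketch can then be applied
```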
This approach does not require much human effort and is relatively cheap, since no markup and no data collection are needed: the entire process of building the image database is determined by the choice of algorithm and parameters.
The main disadvantage of this method is the weak relationship between the quality of the system on generated data and its quality in real conditions. In addition, the method requires considerable computing power to create the required number of examples. Choosing which distortions to use when building the database for a specific task is also a certain difficulty.
Below is an example of building a fully artificial database.
The initial set of images of font characters:

Examples of backgrounds:

Examples of images without distortion:

Adding small distortions:
Creating artificial training examples generated from natural images
A logical continuation of the previous method is to generate artificial examples from real data instead of templates and initial "ideal" examples. By adding distortions, you can achieve a significant improvement in the recognition system. To understand exactly which distortions should be applied, part of the real data should be set aside for validation. It can be used to identify the most common types of errors and to add images with the corresponding distortions to the training base.
This method of creating training examples combines the advantages of both approaches described above: it does not require large material costs and allows you to create the large number of examples needed to train the recognizer.
Selecting the parameters for "inflating" the training set from the initial examples can cause difficulties. On the one hand, the number of examples must be sufficient for the neural network to learn to recognize even noisy examples; on the other hand, the quality on other kinds of difficult images must not drop.
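A minimal sketch of such "inflation" is given below: each real crop is duplicated a configurable number of times with random distortions, and the per-distortion multipliers are exactly the parameters one would tune against the validation sample. The function names and multipliers are assumptions.

```python
# A minimal sketch of "inflating" a set of real character crops.
# The per-distortion multipliers are the parameters tuned on validation;
# the values in the usage example are illustrative assumptions.
from typing import Callable, List, Tuple

from PIL import Image


def inflate(samples: List[Tuple[Image.Image, str]],
            distortions: List[Tuple[Callable[[Image.Image], Image.Image], int]]
            ) -> List[Tuple[Image.Image, str]]:
    inflated = list(samples)  # keep the original, undistorted examples
    for image, label in samples:
        for apply_distortion, copies in distortions:
            for _ in range(copies):
                inflated.append((apply_distortion(image), label))
    return inflated

# Usage (hypothetical settings): 3 shifted and 2 rotated copies per real crop.
# augmented = inflate(real_samples, [(random_shift, 3), (random_rotation, 2)])
```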
Comparing the quality of a neural network trained on natural images, on completely artificial images, and on images generated from natural ones
Let us try training a neural network on images of MRZ characters. A machine-readable zone (MRZ) is the part of an identity document produced in accordance with the international recommendations set out in Doc 9303, Machine Readable Travel Documents, of the International Civil Aviation Organization. You can read more about the problems of MRZ recognition in our other article.
MRZ example:

An MRZ contains 88 characters. We will use two quality metrics for the system (a minimal sketch of their computation follows the list):
- the percentage of incorrectly recognized characters;
- the percentage of fully correctly recognized zones (an MRZ is considered fully correctly recognized if every character in it is recognized correctly).
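The sketch below shows how these two metrics might be computed; it is illustrative code, not the authors' implementation.

```python
# A minimal sketch of the two quality metrics described above.
from typing import List


def character_error_rate(predicted: List[str], reference: List[str]) -> float:
    """Percentage of per-character mismatches over zones of equal length."""
    total = sum(len(ref) for ref in reference)
    errors = sum(p != r
                 for pred, ref in zip(predicted, reference)
                 for p, r in zip(pred, ref))
    return 100.0 * errors / total


def fully_correct_zone_rate(predicted: List[str], reference: List[str]) -> float:
    """Percentage of zones in which every character was recognized correctly."""
    correct = sum(pred == ref for pred, ref in zip(predicted, reference))
    return 100.0 * correct / len(reference)
```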
In the future, the neural network is intended to run on mobile devices, where computing power is limited, so the networks used will have a relatively small number of layers and weights.
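For a sense of scale, here is a minimal sketch of a small character classifier of the kind that fits such constraints; the layer sizes and the 37-class output (digits, letters, and '<') are illustrative assumptions, not the architecture used in the article.

```python
# A minimal sketch of a small character-classification network (illustrative only).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 1)),        # grayscale character crop
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(37, activation="softmax"),  # 0-9, A-Z and '<'
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```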
For the experiments, 800,000 character examples were collected and divided into three groups: 200,000 examples for training, 300,000 for validation, and 300,000 for testing. Such a split is unnatural, since most of the examples are "wasted" on validation and testing, but it best shows the advantages and disadvantages of the various methods.
For the test sample, the distribution of examples across classes is close to the real one and is as follows:
Class name (symbol): number of examples
0: 22416, 1: 17602, 2: 13746, 3: 8115, 4: 8587, 5: 9383, 6: 8697, 7: 8082, 8: 9734, 9: 8847,
<: 110438, A: 12022, B: 1834, C: 3891, D: 2952, E: 7349, F: 3282, G: 2169, H: 3309, I: 6737,
J: 934, K: 2702, L: 4989, M: 6244, N: 7897, O: 4515, P: 4944, Q: 109, R: 7717, S: 5499, T: 3730,
U: 4224, V: 3117, W: 744, X: 331, Y: 1834, Z: 1246
When training only on natural examples, the average character error over 25 experiments was 0.25%, i.e., 750 incorrectly recognized characters out of 300,000 images. For practical use this quality is unacceptable, since the share of fully correctly recognized zones in this case is only about 80%.
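The two numbers are consistent: with a 0.25% per-character error and 88 characters per zone, the expected share of fully correct zones (assuming independent errors) is roughly 80%, as the small check below shows.

```python
# Sanity check: 0.25% per-character error and 88 characters per zone
# give roughly an 80% share of fully correct zones under independence.
per_char_accuracy = 1.0 - 0.0025
zone_accuracy = per_char_accuracy ** 88
print(f"{zone_accuracy:.1%}")  # ~80.2%
```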
Let us consider the most common types of errors the neural network makes.
Examples of incorrectly recognized images:
The following types of errors can be distinguished:
- Errors on off-center images.
- Errors on rotated images.
- Errors on images with lines.
- Errors on images with glare.
- Errors in difficult cases.
Table of the most common errors (a sketch of collecting such statistics follows the table):

Original symbol | Number of errors | Symbols it is most often confused with (count)
---|---|---
'0' | 437 | 'O': 419, 'U': 5, 'J': 4, '2': 2, '1': 1
'<' | 71 | '2': 29, 'K': 6, 'P': 6, '4': 4, '6': 4
'8' | 35 | 'B': 10, '6': 10, 'D': 4, 'E': 2, 'M': 2
'O' | 20 | '0': 19, 'Q': 1
'4' | 19 | '6': 5, 'N': 3, '¡': 2, 'A': 1, 'D': 1
'6' | 18 | 'G': 4, 'S': 4, 'D': 3, 'O': 2, '4': 2
'1' | 17 | 'T': 6, 'Y': 5, '7': 2, '3': 1, '6': 1
'L' | 14 | 'I': 9, '4': 4, 'C': 1
'M' | 14 | 'H': 7, 'P': 5, '3': 1, 'N': 1
'E' | 14 | 'C': 5, 'I': 3, 'B': 2, 'F': 2, 'A': 1
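The sketch below illustrates how such confusion statistics can be gathered from validation results; it is illustrative code, not the authors'.

```python
# A minimal sketch of building per-character confusion statistics.
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def confusion_table(pairs: List[Tuple[str, str]],
                    top: int = 5) -> Dict[str, List[Tuple[str, int]]]:
    """pairs: (true character, predicted character) for every validation example."""
    confusions: Dict[str, Counter] = defaultdict(Counter)
    for true_char, predicted_char in pairs:
        if true_char != predicted_char:
            confusions[true_char][predicted_char] += 1
    # For each true character, keep the `top` most frequent wrong predictions.
    return {char: counter.most_common(top) for char, counter in confusions.items()}
```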
We will gradually add various kinds of distortions, corresponding to the most common types of errors, to the training set. The number of added "distorted" images must be varied and selected based on feedback from the validation sample.
We act according to the following scheme:

For example, for this task, the following was done:
- Adding “shift” distortion corresponding to an error on “off-center” images.
- Conducting a series of experiments: training several neural networks.
- Quality assessment on a test sample. MRZ recognition quality increased by 9%.
- Analysis of the most common recognition errors in the validation sample.
- Adding images with additional lines to the training base.
- Another series of experiments.
- Testing. The recognition quality of MRZ in the test set increased by 3.5%.
Such "iterations" can be repeated until the required quality is reached or until the quality stops improving; a minimal sketch of this loop is given below.
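In the sketch, the training, evaluation, and error-analysis steps are passed in as callables, since the article does not fix any particular implementation; the target quality value is an illustrative assumption.

```python
# A minimal sketch of the error-driven augmentation loop described above.
from typing import Callable, Dict, List, Tuple

from PIL import Image

Sample = Tuple[Image.Image, str]


def improve_until_converged(
    train_set: List[Sample],
    validation_set: List[Sample],
    train: Callable[[List[Sample]], object],
    evaluate: Callable[[object, List[Sample]], float],
    most_common_error_type: Callable[[object, List[Sample]], str],
    distortions_by_error_type: Dict[str, Callable[[List[Sample]], List[Sample]]],
    target_quality: float = 0.95,
):
    """Repeat train -> validate -> analyze errors -> augment, as in the scheme above."""
    previous_quality = -1.0
    while True:
        model = train(train_set)
        quality = evaluate(model, validation_set)
        # Stop when the target is reached or the quality stops growing.
        if quality >= target_quality or quality <= previous_quality:
            return model
        error_type = most_common_error_type(model, validation_set)
        train_set = train_set + distortions_by_error_type[error_type](train_set)
        previous_quality = quality
```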
In this way, a recognition quality of 94.5% fully correctly recognized zones was obtained. Using post-processing (Markov models, finite state machines, N-gram and dictionary methods, etc.), one can obtain a further increase in quality.
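As an illustration of the dictionary-based kind of post-processing mentioned above, here is a minimal sketch that snaps a recognized field to the closest dictionary entry; the dictionary content and distance threshold are simplified assumptions.

```python
# A minimal sketch of dictionary-based post-processing: replace a recognized field
# with the closest dictionary entry when the edit distance is small enough.
from typing import List, Optional


def edit_distance(a: str, b: str) -> int:
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]


def correct_with_dictionary(recognized: str, dictionary: List[str],
                            max_distance: int = 1) -> Optional[str]:
    best = min(dictionary, key=lambda word: edit_distance(recognized, word))
    return best if edit_distance(recognized, best) <= max_distance else None
```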
When training only on artificial data for the problem under consideration, a quality of only 81.72% fully correctly recognized zones was achieved; the main problem here is the difficulty of choosing the distortion parameters.
Training data type | Percentage of correctly recognized MRZ | Character error
---|---|---
Natural images | 80.78% | 0.253%
Natural images + images with shifts | 89.68% | 0.13%
+ images with additional lines | 93.19% | 0.1%
+ rotated images | 95.50% | 0.055%
Artificial images | 78.53% | 0.29%
Conclusion
In conclusion, I would like to note that in each case you need to choose your own approach to obtaining training data. If source data is completely absent, you will have to generate the sample artificially. If real data is easy to obtain, you can use a training set built only from it. And if real data is scarce, or rarely occurring errors need to be covered, the best way is to inflate a set of natural images. In our experience, this last case is the most common.