Recognition of images of documents using the roulette algorithm
V.A. Malykh, D.L. Sholomov, V.V. Arlazarov
To achieve good quality recognition of critical fields on forms, additional information must be used. Often, for this purpose, a check bit or other redundant information is specially introduced into the format of the recognized field.
This article proposes a universal roulette algorithm for recognizing fields with a test function.
The article also presents the results of practical testing of the proposed algorithm and, in addition, gives a general classification of verification algorithms.
Introduction
In a modern setting, the recognition task is, above all, for the so-called business forms. That is, commercial property documents, primarily financial ones. An example of a business form is a waybill, which is one of the main types of documents used in trade.
Business forms are characterized by the ambiguity of information located in various fields of the form. First of all, the important fields are the fields of amounts, account numbers, etc. An example of a critical field is the passport number in the form where passport data is used.
Various methods are used to improve the quality of recognition of critical form fields. In particular, methods are used with the introduction of additional redundant information into the data. A well-known example of such a method from the field of information theory are Hamming codes [ 3 ]. A number of methods in the field of text recognition were proposed in [ 1 ].
In relation to the recognition problem, there is a class of fields that contains additional information in its structure that can serve to verify the recognition accuracy. And also to correct errors if such a task is set.
It is possible to divide the use of additional information conditionally into two types - corrective and reject tests. A reject check is characterized by the use of predefined values for compliance (for example, a widespread dictionary check). In this case, in the absence of the value obtained during recognition in the dictionary, we make a decision on the incorrectness of recognition.
Corrective verification differs from rejection in that we can try to restore an incorrectly recognized value.
Recognition alternatives exist for each character. You can check the value by replacing one (or several) characters with its alternative. Such a method, applied to a value without control data, is much less effective - since we are actually trying to guess what was incorrectly recognized. Due to the fact that the probability of an error, first of all, depends on the character itself, it is impossible to draw an unambiguous conclusion about which of the characters was recognized incorrectly based on general considerations. On the other hand, having control information, we can verify the correctness of replacing a symbol with its alternative.
Due to the fact that the control value algorithm is chosen so that close values of the main data correspond to significantly different control data, and taking into account the low probability of error, we come to the conclusion that we can restore the original data with a high degree of certainty.
Such additional information can be expressed in any form, but the so-called checksums are most widely used.
Mathematical statement of the problem
The formulation of the recognition problem in the most general form is given, for example, in [ 4 ]. The article uses a narrow statement of the problem from [ 5 ]. The recognition task with correction is reduced to enumerating elements of the alternative vector for each character from the word . For each set , where is the ith element of the vector corresponding to the ith recognizable symbol, which we will call interpretation, it is converted into a linear sequence, which is subjected to corresponding verification
The total number of possible interpretations is given by the formula , where is the number of characters in the word .
Already for 2 variants for each character of a word with a length of 15 characters, this formula gives 32768 interpretations, which, with a rather complicated verification function , can lead to long delays in recognition. But, as practical experience shows, most of the words are recognized when checking one option for each character, i.e. for a word with a character length you need to consider only about 15 recognition options.
Adjustment algorithm
An algorithm proposed for rejecting and / or recovering data with control values.
Due to the fact that the probability of error in any symbol is the same, the algorithm does not distinguish between control and ordinary digits. The algorithm successively replaces the alternatives, combining them for all characters until a combination of alternatives satisfies the check used. Due to the complexity of the check discharge verification algorithm, it is possible to significantly reduce the probability of incorrect recognition.
The principle of the algorithm is reduced to a sequential search of the options for interpreting the word and applying verification to them . When describing the algorithm using pseudo-code, the word is designated as RecognitionResult, and the verification function designated as Test.
Roulette(Test, RecognitionResult)
CharCounter[RecognitionResult.Length]
for i = 0 to RecognitionResult.Length
charCounter[i] = 0
while Test(RecognitionResult) is not true
if UpdateCounter(CharCounter, RecognitionResult, 0) is not true
break
else
for i = 0 to RecognitionResult.Length
//Присваиваем значение тестируемого альтернативного
//значения распознанного символа его значению по умолчанию
RecognitionResult.Char[i].Alt[0] =
RecognitionResult.Char[i].Alt[CharCounter[i]]
The algorithm uses the auxiliary function UpdateCounter, with which the interpretation is directly enumerated:
UpdateCounter(CharCounter, RecognitionResult, i)
if CharCounter[i] < RecognitionResult.Char[i].Length)
CharCounter[i] = CharCounter[i] + 1
return true
else
if i < RecognitionResult.Length - 1
CharCounter[i] = 0
return UpdateCounter(CharCounter, RecognitionResult, i+1)
else
return false
The algorithm, called Roulette, takes two parameters as input:
- recognition result presented as a vector of vectors of character recognition alternatives, i.e. each symbol has its own vector of alternatives; It is designated as RecognitionResult;
- a verification function that takes one value at the input and produces a binary result of the passage; designated as Test.
RecognitionResult contains a Char array and a Char array length specified as Length. The Char array contains in each of its elements an array of Alt character recognition alternatives. Each element of the Char array contains the length of the array of Alt alternatives specified as Length.
It is assumed that the test function Test accepts only the RecognitionResult recognition result and checks the default values of the elements of the Char array in it.
Practical example
Consider an example of a checksum for a TIN and its corrective check.
The checksum calculation algorithm is as follows (for a 12-digit code):
Step 1. Check number n2 is the remainder of dividing by 11 sums from the digits of the number multiplied by the corresponding coefficients from table 1 (from the line “computing the check number n2”). If the remainder is 10, then n2 = 0.
Step 2. The control number n1 is the remainder of dividing by 11 the sum of the digits of the number multiplied by the corresponding coefficients from table 1 (from the line “calculating the control number n1”). If the remainder is 10, then n1 = 0.
Table 1.
k12 | k11 | k10 | k9 | k8 | k7 | k6 | k5 | k4 | k3 | k2 | k1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
calculation of check number n2 for 12- digit TIN | 7 | 2 | 4 | 10 | 3 | 5 | 9 | 4 | 6 | 8 | ||
calculation of check number n1 for 12- digit TIN | 3 | 7 | 2 | 4 | 10 | 3 | 5 | 9 | 4 | 6 | 8 | |
calculation of check number n1 for 10- digit TIN |
In the case of a 10-digit TIN, there is one control category, in the case of a 12-digit TIN, two control categories. The coefficients for calculating the control discharge are presented in table 1.
An example of the image received at the input of the recognition algorithm:
Image 1.
In image 1, there are two potential recognition problems (non-exhaustive list):
- the third position “3” is interpreted as “5”, in which the upper line has disappeared;
- the eighth position “7” is interpreted as “2” with an incomplete connecting loop.
Let the first of the described problems be realized - an error occurred in the recognition of the third position. Additionally, we assume that all positions except the third were unambiguously recognized. Accordingly, we have an alternative to recognition in the third position.
Let's make a check for the recognized value: "5253000796"
From where it can be seen that the checksum did not converge.
Now we are trying to replace the characters with their alternatives, that is, according to our assumption, now we are checking the value "5233000796"
Thus, we were able to correct the incorrectly recognized value using the check digit in it.
Practical application of the algorithm
The results of the work have found application in the industrial system Cognitive Forms, where the described algorithm is used for additional verification on the types of fields TIN, PSRN and SNILS in various real-life business forms. The Cognitive Forms system is described in the article [ 2 ].
In addition, testing was performed on a stand with a volume of 480 images (sick leave), where the recognition of fields of the above types was checked. The test results are presented in table 2.
Table 2.
Total number of fields | 3840 |
The number of correctly recognized fields without an algorithm (b / a) | 3510 (91.41%) |
Number of incorrectly recognized fields b / a | 330 (8.59%) |
Of them wrong, no doubt | 171 (4.45%) |
Number of correctly recognized fields with the connected algorithm (s / a) | 3576 (93.12%) |
Number of incorrectly recognized fields with / | 264 (6.88%) |
Of them wrong, no doubt | 55 (1.43%) |
According to the results of the study, it is possible to use the described algorithm to increase the recognition quality in these fields by 20% (from 91.41% to 93.12%) and, in addition, more than triple (from 4.45% to 1.43%) reduce the number of incorrectly recognized fields.
Conclusion
In the future, it is planned to improve the recognition of fields using this algorithm, as well as expand its application to other types of fields.
Literature:
1. Sholomov D.L. Syntactic methods of contextual processing in text recognition problems. - M., 2007.
2. Arlazarov VV Postnikov VV, Sholomov Landau . Cognitive Forms - a system of mass input of structured documents. // "Management of information flows" Proceedings of the Institute for System Analysis of the Russian Academy of Sciences. / M., URSS, 2002
3. Peterson W., Weldon E. Codes for correcting errors: Per. from English M.: Mir, 1976.
4. Khaikin S. Neural networks. - M .: Williams, 2006.
5. Arlazarov VV Structuring visual representations of the information environment and methods for determining the reliability of recognition. - M., 2004.
One of the authors of the article V. Malykh is present at the hub - madrugado