Recognition of images of documents using the roulette algorithm

    V.A. Malykh, D.L. Sholomov, V.V. Arlazarov


    To achieve good quality recognition of critical fields on forms, additional information must be used. Often, for this purpose, a check bit or other redundant information is specially introduced into the format of the recognized field.

    This article proposes a universal roulette algorithm for recognizing fields with a test function.
    The article also presents the results of practical testing of the proposed algorithm and, in addition, gives a general classification of verification algorithms.


    Introduction


    In a modern setting, the recognition task is, above all, for the so-called business forms. That is, commercial property documents, primarily financial ones. An example of a business form is a waybill, which is one of the main types of documents used in trade.

    Business forms are characterized by the ambiguity of information located in various fields of the form. First of all, the important fields are the fields of amounts, account numbers, etc. An example of a critical field is the passport number in the form where passport data is used.

    Various methods are used to improve the quality of recognition of critical form fields. In particular, methods are used with the introduction of additional redundant information into the data. A well-known example of such a method from the field of information theory are Hamming codes [ 3 ]. A number of methods in the field of text recognition were proposed in [ 1 ].

    In relation to the recognition problem, there is a class of fields that contains additional information in its structure that can serve to verify the recognition accuracy. And also to correct errors if such a task is set.

    It is possible to divide the use of additional information conditionally into two types - corrective and reject tests. A reject check is characterized by the use of predefined values ​​for compliance (for example, a widespread dictionary check). In this case, in the absence of the value obtained during recognition in the dictionary, we make a decision on the incorrectness of recognition.

    Corrective verification differs from rejection in that we can try to restore an incorrectly recognized value.

    Recognition alternatives exist for each character. You can check the value by replacing one (or several) characters with its alternative. Such a method, applied to a value without control data, is much less effective - since we are actually trying to guess what was incorrectly recognized. Due to the fact that the probability of an error, first of all, depends on the character itself, it is impossible to draw an unambiguous conclusion about which of the characters was recognized incorrectly based on general considerations. On the other hand, having control information, we can verify the correctness of replacing a symbol with its alternative.

    Due to the fact that the control value algorithm is chosen so that close values ​​of the main data correspond to significantly different control data, and taking into account the low probability of error, we come to the conclusion that we can restore the original data with a high degree of certainty.

    Such additional information can be expressed in any form, but the so-called checksums are most widely used.

    Mathematical statement of the problem


    The formulation of the recognition problem in the most general form is given, for example, in [ 4 ]. The article uses a narrow statement of the problem from [ 5 ]. The recognition task with correction is reduced to enumerating elements of the alternative vector for each character from the word . For each set , where is the ith element of the vector corresponding to the ith recognizable symbol, which we will call interpretation, it is converted into a linear sequence, which is subjected to corresponding verification



    The total number of possible interpretations is given by the formula , where is the number of characters in the word .

    Already for 2 variants for each character of a word with a length of 15 characters, this formula gives 32768 interpretations, which, with a rather complicated verification function , can lead to long delays in recognition. But, as practical experience shows, most of the words are recognized when checking one option for each character, i.e. for a word with a character length you need to consider only about 15 recognition options.

    Adjustment algorithm


    An algorithm proposed for rejecting and / or recovering data with control values.

    Due to the fact that the probability of error in any symbol is the same, the algorithm does not distinguish between control and ordinary digits. The algorithm successively replaces the alternatives, combining them for all characters until a combination of alternatives satisfies the check used. Due to the complexity of the check discharge verification algorithm, it is possible to significantly reduce the probability of incorrect recognition.

    The principle of the algorithm is reduced to a sequential search of the options for interpreting the word and applying verification to them . When describing the algorithm using pseudo-code, the word is designated as RecognitionResult, and the verification function designated as Test.

    Roulette(Test, RecognitionResult)
    	CharCounter[RecognitionResult.Length]
    	for i = 0 to RecognitionResult.Length
    		charCounter[i] = 0
    	while Test(RecognitionResult) is not true
    		if UpdateCounter(CharCounter, RecognitionResult, 0) is not true
    			break
    		else
    			for i = 0 to RecognitionResult.Length
    				//Присваиваем значение тестируемого альтернативного 
    				//значения распознанного символа его значению по умолчанию
        RecognitionResult.Char[i].Alt[0] = 
    	     RecognitionResult.Char[i].Alt[CharCounter[i]]
    

    The algorithm uses the auxiliary function UpdateCounter, with which the interpretation is directly enumerated:

    UpdateCounter(CharCounter, RecognitionResult, i)
    	if CharCounter[i] < RecognitionResult.Char[i].Length)
    		CharCounter[i] = CharCounter[i] + 1
    		return true
    	else
    		if i < RecognitionResult.Length - 1
    			CharCounter[i] = 0
    			return UpdateCounter(CharCounter, RecognitionResult, i+1)
    		else
    			return false
    

    The algorithm, called Roulette, takes two parameters as input:

    1. recognition result presented as a vector of vectors of character recognition alternatives, i.e. each symbol has its own vector of alternatives; It is designated as RecognitionResult;
    2. a verification function that takes one value at the input and produces a binary result of the passage; designated as Test.

    RecognitionResult contains a Char array and a Char array length specified as Length. The Char array contains in each of its elements an array of Alt character recognition alternatives. Each element of the Char array contains the length of the array of Alt alternatives specified as Length.

    It is assumed that the test function Test accepts only the RecognitionResult recognition result and checks the default values ​​of the elements of the Char array in it.

    Practical example


    Consider an example of a checksum for a TIN and its corrective check.

    The checksum calculation algorithm is as follows (for a 12-digit code):

    Step 1. Check number n2 is the remainder of dividing by 11 sums from the digits of the number multiplied by the corresponding coefficients from table 1 (from the line “computing the check number n2”). If the remainder is 10, then n2 = 0.

    Step 2. The control number n1 is the remainder of dividing by 11 the sum of the digits of the number multiplied by the corresponding coefficients from table 1 (from the line “calculating the control number n1”). If the remainder is 10, then n1 = 0.

    Table 1.

    k12k11k10k9 k8 k7 k6 k5 k4 k3 k2 k1
    calculation of check number n2
    for 12- digit TIN
    72410359468
    calculation of check number n1
    for 12- digit TIN
    372410359468
    calculation of check number n1
    for 10- digit TIN

    In the case of a 10-digit TIN, there is one control category, in the case of a 12-digit TIN, two control categories. The coefficients for calculating the control discharge are presented in table 1.

    An example of the image received at the input of the recognition algorithm:

    Image 1.



    In image 1, there are two potential recognition problems (non-exhaustive list):

    • the third position “3” is interpreted as “5”, in which the upper line has disappeared;
    • the eighth position “7” is interpreted as “2” with an incomplete connecting loop.

    Let the first of the described problems be realized - an error occurred in the recognition of the third position. Additionally, we assume that all positions except the third were unambiguously recognized. Accordingly, we have an alternative to recognition in the third position.

    Let's make a check for the recognized value: "5253000796"




    From where it can be seen that the checksum did not converge.

    Now we are trying to replace the characters with their alternatives, that is, according to our assumption, now we are checking the value "5233000796"




    Thus, we were able to correct the incorrectly recognized value using the check digit in it.

    Practical application of the algorithm


    The results of the work have found application in the industrial system Cognitive Forms, where the described algorithm is used for additional verification on the types of fields TIN, PSRN and SNILS in various real-life business forms. The Cognitive Forms system is described in the article [ 2 ].

    In addition, testing was performed on a stand with a volume of 480 images (sick leave), where the recognition of fields of the above types was checked. The test results are presented in table 2.

    Table 2.

    Total number of fields3840
    The number of correctly recognized fields
    without an algorithm (b / a)
    3510 (91.41%)
    Number of incorrectly recognized fields b / a330 (8.59%)
    Of them wrong, no doubt171 (4.45%)
    Number of correctly recognized fields
    with the connected algorithm (s / a)
    3576 (93.12%)
    Number of incorrectly recognized fields with /264 (6.88%)
    Of them wrong, no doubt55 (1.43%)

    According to the results of the study, it is possible to use the described algorithm to increase the recognition quality in these fields by 20% (from 91.41% to 93.12%) and, in addition, more than triple (from 4.45% to 1.43%) reduce the number of incorrectly recognized fields.

    Conclusion


    In the future, it is planned to improve the recognition of fields using this algorithm, as well as expand its application to other types of fields.



    Literature:



    1. Sholomov D.L. Syntactic methods of contextual processing in text recognition problems. - M., 2007.

    2. Arlazarov VV Postnikov VV, Sholomov Landau . Cognitive Forms - a system of mass input of structured documents. // "Management of information flows" Proceedings of the Institute for System Analysis of the Russian Academy of Sciences. / M., URSS, 2002

    3. Peterson W., Weldon E. Codes for correcting errors: Per. from English M.: Mir, 1976.

    4. Khaikin S. Neural networks. - M .: Williams, 2006.

    5. Arlazarov VV Structuring visual representations of the information environment and methods for determining the reliability of recognition. - M., 2004.

    One of the authors of the article V. Malykh is present at the hub - madrugado


    Also popular now: