OCR-based accounting system
Prologue
In the course of my work, I was given the task of designing and implementing a system for tracking advertising content. The tracking had to verify that the right material is displayed on the right billboard. Both the billboard and the printed poster are numbered.
It was proposed to use a photo as the initial information for the system. After the
That is the whole problem statement; from here on, this is the story of the implementation.
The problem is solved in three actions:
- Finding the desired rectangle in the image.
- Recognizing the text.
- Verifying the recognition result.
Action One - Search
The easiest way to find the desired rectangle in the picture is to find every shape that can be called a rectangle and then filter the candidates by certain parameters. To search for rectangles in the image, we used a slightly modified version of the standard OpenCV sample squares.cpp, from which the rectangle search function was taken.
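To illustrate how such a search can decide whether a four-point contour counts as a rectangle, here is a minimal sketch of a corner-angle test in the spirit of the stock squares.cpp sample. It is written without OpenCV types so that it stands alone. The names Pt, cornerCosine and looksLikeRectangle are hypothetical, and the 0.3 cosine limit follows the stock sample, so the modified version used here may well differ.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// Hypothetical stand-in for an OpenCV point.
struct Pt { double x, y; };

// Cosine of the angle at corner p0 between the edges p0->p1 and p0->p2.
double cornerCosine(const Pt& p1, const Pt& p2, const Pt& p0) {
    const double dx1 = p1.x - p0.x, dy1 = p1.y - p0.y;
    const double dx2 = p2.x - p0.x, dy2 = p2.y - p0.y;
    // The small epsilon guards against division by zero for degenerate edges.
    return (dx1 * dx2 + dy1 * dy2) /
           std::sqrt((dx1 * dx1 + dy1 * dy1) * (dx2 * dx2 + dy2 * dy2) + 1e-10);
}

// Accepts a quadrilateral when every corner is close to 90 degrees,
// i.e. when the largest |cosine| over the four corners stays below maxCosine.
bool looksLikeRectangle(const std::array<Pt, 4>& quad, double maxCosine = 0.3) {
    double worst = 0.0;
    for (int i = 0; i < 4; ++i) {
        const Pt& prev = quad[(i + 3) % 4];
        const Pt& next = quad[(i + 1) % 4];
        worst = std::max(worst, std::fabs(cornerCosine(prev, next, quad[i])));
    }
    return worst < maxCosine;
}
```

A test this loose accepts anything roughly right-angled, so a busy photo yields many candidates.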
The search procedure is quite primitive: given a complex picture with many color borders and transitions, it returns a whole pile of rectangles, and the unwanted ones have to be thrown out before recognition even starts.
The unwanted rectangles are filtered out by several criteria:
1. The ratio of width and height.
The program uses the cutoff criterion (r.width < 5 * r.height); it could be made more precise by comparing the ratio against an expected value with a delta.
The main thing here is that the photographer does not get creative and shoot the object with the camera turned 90° (the "make sure my feet are in the frame" kind of shot).
2. Removing near-duplicate shapes.
One more point: before filtering, the rectangles are straightened, since the photographer's hand may shake and the edges of the target rectangle may not be exactly horizontal and vertical in the photo.
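The two filtering criteria can be sketched roughly as follows. This is an illustration, not the code of the actual utility: the Rect struct, the function names, the expected ratio, the delta and the pixel tolerance are all hypothetical, and the aspect check shown is the "ratio with delta" variant suggested as an improvement rather than the cutoff the program uses.

```cpp
#include <cmath>
#include <cstdlib>
#include <vector>

// Hypothetical axis-aligned rectangle (already straightened).
struct Rect { int x, y, width, height; };

// Criterion 1: width/height must lie within delta of the expected ratio.
bool hasExpectedAspect(const Rect& r, double expectedRatio, double delta) {
    if (r.width <= 0 || r.height <= 0) return false;
    const double ratio = static_cast<double>(r.width) / r.height;
    return std::fabs(ratio - expectedRatio) <= delta;
}

// Criterion 2: two rectangles are "approximately the same" when position
// and size each differ by at most tol pixels.
bool nearlySame(const Rect& a, const Rect& b, int tol) {
    return std::abs(a.x - b.x) <= tol && std::abs(a.y - b.y) <= tol &&
           std::abs(a.width - b.width) <= tol &&
           std::abs(a.height - b.height) <= tol;
}

// Keeps the first rectangle of every near-duplicate group that passes
// the aspect-ratio check.
std::vector<Rect> filterCandidates(const std::vector<Rect>& in,
                                   double expectedRatio, double delta, int tol) {
    std::vector<Rect> kept;
    for (const Rect& r : in) {
        if (!hasExpectedAspect(r, expectedRatio, delta)) continue;
        bool duplicate = false;
        for (const Rect& k : kept) {
            if (nearlySame(r, k, tol)) { duplicate = true; break; }
        }
        if (!duplicate) kept.push_back(r);
    }
    return kept;
}
```

The quadratic duplicate check is fine for the few dozen candidates a single photo produces.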
Next, every collected rectangle is cropped out and written to a file.
Experiments showed that the recognition utility handles black-and-white images better, so cvAdaptiveThreshold is called before the fragment is written to a file. The block size for the conversion was also chosen experimentally.
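For readers unfamiliar with adaptive thresholding, the idea can be sketched in plain C++: each pixel is compared with the mean of its own neighborhood instead of one global threshold, which copes with uneven lighting. This is a simplified illustration of the mean-based variant, not OpenCV's implementation. The Gray struct and the function name are hypothetical, blockSize is assumed to be odd, and the actual block size used by the author is not given in the text.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical grayscale image, stored row by row.
struct Gray {
    int width, height;
    std::vector<std::uint8_t> px;
};

// A pixel becomes white (255) when it is brighter than the mean of its
// blockSize x blockSize neighborhood minus the constant c, otherwise black.
// Pixels outside the image are replaced by the nearest border pixel.
Gray adaptiveMeanThreshold(const Gray& src, int blockSize, int c) {
    Gray dst{src.width, src.height,
             std::vector<std::uint8_t>(src.px.size(), 0)};
    const int half = blockSize / 2;
    const int count = (2 * half + 1) * (2 * half + 1);
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            int sum = 0;
            for (int dy = -half; dy <= half; ++dy) {
                for (int dx = -half; dx <= half; ++dx) {
                    const int yy = std::min(std::max(y + dy, 0), src.height - 1);
                    const int xx = std::min(std::max(x + dx, 0), src.width - 1);
                    sum += src.px[yy * src.width + xx];
                }
            }
            // pixel > mean - c, rewritten without division.
            const int pixel = src.px[y * src.width + x];
            if (pixel * count > sum - c * count) {
                dst.px[y * src.width + x] = 255;
            }
        }
    }
    return dst;
}
```

A larger block smooths over more of the local lighting, a smaller one follows fine detail, which is why the value has to be tuned on real photos.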
Action Two - Recognition
The recognition utility receives both valid fragments and garbage as input.
For recognition we use tesseract, the utility from Google.
Other recognition tools could have been used; CuneiForm was also tested.
Tesseract was chosen because a lot of information is available for it, including clear instructions for training it on a custom character set.
Training on a custom alphabet was done with several goals in mind:
- The dictionary for recognizing numbers should consist of just 10 characters; no letters or other symbols are needed. A smaller character set lowers the probability of error.
- In principle, one could stop at the first goal: tesseract has a mode that recognizes only digits, so it could be used without bothering to create a custom dictionary.
But the test results prompted one more idea, for the following reason: in regular fonts (those in the standard set), the digits look similar to one another from the OCR point of view. Under certain conditions "7" resembles "1", "3" resembles "8", and so on.
Therefore, it was decided to use a font in which the digits do not resemble one another. The font's own name served as the hint for finding it: "OCR A Std". This is the font used in the clippings above.
This gives one more factor that reduces the likelihood of error.
As a result, a dictionary of the 10 digit characters of this font was created for tesseract; the characters can be seen in the clippings above.
I will not give instructions for training the utility: the process is mechanical rather than creative, and there are plenty of instructions online.
Action Three - Assembly
The system was tested under Ubuntu. The cropping and recognition utilities are launched from PHP.
This is also where the recognized data gets its final verification, by checksum.
The CRC-8 algorithm is used.
$imagesout = '/home/toor/www/out';
$findrect = '/home/toor/OCR/OpenCV-2.2.0/samples/cpp/findrect';
$uploaddir = '/home/toor/www/uploads/';
$rectdir = '/home/toor/www/out/';
$tesseract = '/home/toor/OCR/tesseract-3.00/api/tesseract';
...
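if (isset($_FILES['userfile']['tmp_name']))
{
    // basename() keeps a crafted file name from escaping the upload directory.
    $uploadfile = $uploaddir . basename($_FILES['userfile']['name']);
    if (!move_uploaded_file($_FILES['userfile']['tmp_name'], $uploadfile))
    {
        echo "Upload failed!";
        exit(1);
    }
    echo "File {$_FILES['userfile']['name']} uploaded successfully!";

    // Step 1: findrect crops the candidate rectangles; each line of its
    // output is treated as the path of one fragment file.
    $cmd = escapeshellarg($findrect) . ' ' . escapeshellarg($uploadfile)
         . ' tif ' . escapeshellarg($imagesout);
    exec($cmd, $output);
    echo count($output) . " fragments";

    // Step 2: run tesseract on every fragment and count how often each
    // recognized digit string occurs.
    $datas = array();
    foreach ($output as $k => $f)
    {
        $recognized = "$rectdir$k.txt";
        $cmd = escapeshellarg($tesseract) . ' ' . escapeshellarg($f) . ' '
             . escapeshellarg("$rectdir$k") . ' -l nums.ocr';
        exec($cmd);
        if (!file_exists($recognized)) continue;
        echo "file: $recognized";
        $data = file_get_contents($recognized);
        // Strip everything except digits (whitespace included).
        $data = preg_replace('/\D/', '', $data);
        if (!strlen($data)) continue;
        if (!array_key_exists($data, $datas)) $datas[$data] = 1; else $datas[$data]++;
    }

    // Step 3: keep only the strings whose checksum matches one of the
    // two supported number formats.
    foreach ($datas as $d => $v)
    {
        if ($r = crc_check($d, NUMBER_LEN_1, NUMBER_LEN_CRC_1)) {
            echo 'Number found: ' . $r;
        }
        if ($r = crc_check($d, NUMBER_LEN_2, NUMBER_LEN_CRC_2)) {
            echo 'Number found: ' . $r;
        }
    }
}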
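The crc_check helper called in the listing is not shown, so the following is only a sketch of what such a check might look like, written in C++ for illustration. Several things here are assumptions: the CRC-8 variant (polynomial 0x07, initial value 0x00, no reflection), the layout of the digit string (number first, checksum digits last), and the way the checksum is turned into decimal digits (CRC value modulo 10^crcLen, zero-padded). The names crc8 and crcCheck are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// CRC-8 over the bytes of data.
// Assumed variant: polynomial 0x07, initial value 0x00, no reflection.
std::uint8_t crc8(const std::string& data) {
    std::uint8_t crc = 0x00;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 0x80) ? static_cast<std::uint8_t>((crc << 1) ^ 0x07)
                               : static_cast<std::uint8_t>(crc << 1);
        }
    }
    return crc;
}

// Splits digits into a number of numberLen characters followed by crcLen
// checksum digits. Returns the number when the checksum matches and an
// empty string otherwise (mirroring the truthy/falsy use in the PHP code).
std::string crcCheck(const std::string& digits,
                     std::size_t numberLen, std::size_t crcLen) {
    if (digits.size() != numberLen + crcLen) return "";
    const std::string number = digits.substr(0, numberLen);

    // Assumed encoding: CRC value modulo 10^crcLen, zero-padded to crcLen.
    unsigned modulus = 1;
    for (std::size_t i = 0; i < crcLen; ++i) modulus *= 10;
    std::string expected = std::to_string(crc8(number) % modulus);
    if (expected.size() < crcLen) {
        expected = std::string(crcLen - expected.size(), '0') + expected;
    }
    return digits.compare(numberLen, crcLen, expected) == 0 ? number : "";
}
```

Whatever the exact encoding, the point of the step is the same: a misread digit changes the checksum, so garbage fragments and misrecognized numbers are rejected instead of being reported as found.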
Overall, the system performed quite well in test mode.
It handles everything from images taken with the simplest phones to photos of several megabytes from digital cameras.
References
Tesseract
OpenCV
OCR A Std Font