Damaskus July 4, 2011 at 20:41

OCR-based accounting system

Prologue

In the course of my work, I was given the task of designing and implementing a system for tracking advertising placements. The tracking had to verify that the right content was displayed on the right billboard. Both the billboard and the printed poster are numbered.
It was proposed to use a photo as the input for the system. After ~~haggling~~ negotiating with the designers, we agreed that both numbers would be placed inside a single frame. The only catch was that the frame could be anywhere on the billboard.
That is the whole problem statement; the rest is the story of the implementation.
The problem is solved in three actions:

Finding the desired rectangle in the image.
Recognizing the text.
Verifying the recognition result.

Action One - Search

The easiest way to find the desired rectangle in the picture is to find every region that can be called a rectangle and then filter the candidates by certain parameters. To search for rectangles, we used a slightly modified standard OpenCV sample, squares.cpp, from which the rectangle search function was taken.
The search procedure is quite primitive. If the input is a complex picture with many color borders and transitions, it returns a whole pile of rectangles, and the unnecessary ones have to be thrown out before the recognition step.

The unnecessary rectangles are filtered out by several criteria:
1. The ratio of width to height.
The program uses a simple cutoff criterion (r.width < 5 * r.height). It could be made more precise by comparing the ratio against the expected value with a delta (tolerance).
The main thing here is that the photographer does not get creative and shoot the billboard with the camera turned 90° (the "get me in full length" portrait shot).
2. Removing shapes that are approximately the same.
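
The delta-based ratio check mentioned in the first criterion could look roughly like the sketch below. It is not taken from the original program: `Rect` is a plain stand-in for `CvRect` so that the example compiles without OpenCV, and `hasExpectedRatio`, `filterByRatio`, the expected ratio and the tolerance are all hypothetical names and values chosen for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Plain stand-in for CvRect so the sketch compiles without OpenCV.
struct Rect { int x, y, width, height; };

// Hypothetical helper: accept a rectangle only if its width/height ratio
// lies within a relative tolerance `delta` of the expected frame ratio.
bool hasExpectedRatio(const Rect& r, double expectedRatio, double delta)
{
    if (r.width <= 0 || r.height <= 0) return false;
    double ratio = static_cast<double>(r.width) / r.height;
    return std::fabs(ratio - expectedRatio) <= delta * expectedRatio;
}

// Keep only the candidates whose proportions match the expected frame.
std::vector<Rect> filterByRatio(const std::vector<Rect>& candidates,
                                double expectedRatio, double delta)
{
    std::vector<Rect> kept;
    for (std::size_t i = 0; i < candidates.size(); i++)
        if (hasExpectedRatio(candidates[i], expectedRatio, delta))
            kept.push_back(candidates[i]);
    return kept;
}
```

With an assumed frame ratio of 6:1 and a 20% tolerance, `filterByRatio(candidates, 6.0, 0.2)` would keep a 600x100 rectangle and drop a 100x100 one. Unlike the one-sided cutoff, this also rejects rectangles that are far too wide.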

One more point: before filtering, we straighten the rectangles, because the photographer's hand may shake and the borders of the desired rectangle may not be exactly horizontal and vertical in the photo.

Next, every rectangle that passed the filter is cut out and saved to its own file.
Experiments showed that the recognition utility handles black-and-white images better, so cvAdaptiveThreshold is called before writing each file. The block size for the conversion was also chosen experimentally.


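#include "cv.h"
#include "highgui.h"
// standard headers required by the code below
#include <iostream>
#include <vector>
#include <cstdio>
#include <cstdlib>
using namespace cv;
using namespace std;
typedef vector<Point> polygon;
typedef vector<polygon> polygonList;
...
// Comparison used to filter out similar shapes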
bool compareRect(const CvRect &r1, const CvRect &r2)
{    
    if (!r1.width || !r1.height) return false;
    if ((float)abs(r1.width- r2.width)/(float)r1.width > 0.05) return false;
    if ((float)abs(r1.height - r2.height)/(float)r1.height > 0.05) return false;
    if ((float)abs(r1.x - r2.x)/(float)r1.width > 0.02) return false;        
    if ((float)abs(r1.y - r2.y)/(float)r1.height > 0.02) return false;
    return true;
}
// Straighten the rectangle: take the axis-aligned bounding box of the polygon
CvRect getRect(const polygon& poly)
{
    CvPoint p1 = cvPoint(10000,10000);
    CvPoint p2 = cvPoint(-10000,-10000);
    for (size_t i=0; i < poly.size(); i++) 
    {
        const Point p = poly[i];
        if (p1.x > p.x) p1.x = p.x;
        if (p1.y > p.y) p1.y = p.y;
        if (p2.x < p.x) p2.x = p.x;
        if (p2.y < p.y) p2.y = p.y;
    }
    return cvRect(p1.x,p1.y,p2.x-p1.x,p2.y-p1.y);    
}
int main(int argc, char** argv)
{
    if(argc <= 3)
    {
        cout << "Wrong Param Count: " << argc << endl;
        cout << "Usage: findrect infile extension outfolder" << endl;
        return 1;
    }
    char *fileIn = argv[1];
    char *fileExt = argv[2];
    char *dirOut = argv[3];    
    char fileOut[128];    
    polygonList squares;    
    IplImage *Img = cvLoadImage(fileIn,1);
    // cvLoadImage returns NULL on failure, so check before wrapping it in a Mat
    if(!Img)
    {
        cout << "Couldn't load " << fileIn << endl;
        return 1;
    }
    Mat image(Img);
    findSquares(image, squares);      
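    vector<CvRect> rectList;
    int p = 0;
    int adaptive_method = CV_ADAPTIVE_THRESH_GAUSSIAN_C;
    int threshold_type = CV_THRESH_BINARY;
    int block_size = 65;
    double offset = 10;
    // NOTE: the head of this loop was garbled in the source. The lines up to
    // and including cvCreateImage(dst) are reconstructed from the filtering
    // steps described in the text and may differ from the original code.
    for (size_t j = 0; j < squares.size(); j++)
    {
        CvRect r = getRect(squares[j]);
        // aspect-ratio cutoff: discard rectangles that are not wide enough
        if (r.width < 5 * r.height) continue;
        // discard rectangles nearly identical to one already accepted
        bool duplicate = false;
        for (size_t k = 0; k < rectList.size(); k++)
            if (compareRect(rectList[k], r)) { duplicate = true; break; }
        if (duplicate) continue;
        rectList.push_back(r);
        cvSetImageROI(Img, r);
        IplImage *dst = cvCreateImage(cvSize(r.width, r.height), Img->depth, Img->nChannels);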
        IplImage *gray = cvCreateImage(cvSize(r.width, r.height), 8, 1);            
        IplImage *bw = cvCreateImage(cvSize(r.width, r.height), 8, 1);            
        cvCopy(Img, dst, NULL);        
        cvResetImageROI(Img);        
        // print the output file name; the PHP script needs it for further processing
        snprintf(fileOut, sizeof(fileOut), "%s/%d.%s", dirOut, p, fileExt);
        cout << fileOut << endl;
        p++;        
        // convert to black and white (cvLoadImage returns pixels in BGR order)
        cvCvtColor(dst,gray,CV_BGR2GRAY);
        cvAdaptiveThreshold(gray, bw, 255, adaptive_method,threshold_type,block_size,offset);    
        cvSaveImage(fileOut, bw);        
        cvReleaseImage(&dst);        
        cvReleaseImage(&gray);        
        cvReleaseImage(&bw);        
    }          
    return 0;
}
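
To illustrate what the cvAdaptiveThreshold call does to each fragment, here is a minimal sketch of adaptive thresholding written without OpenCV. It is a simplified stand-in, not the library's implementation: it uses a plain mean over the neighborhood instead of the Gaussian-weighted mean selected in the listing, and it clips the window at the image border instead of replicating edge pixels. `GrayImage` and `adaptiveThresholdMean` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Grayscale image stored row by row, one byte per pixel.
struct GrayImage {
    int width, height;
    std::vector<unsigned char> pixels;
    unsigned char at(int x, int y) const {
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
};

// A pixel becomes white (255) if it is brighter than the mean of its
// blockSize x blockSize neighborhood minus `offset`, otherwise black (0).
GrayImage adaptiveThresholdMean(const GrayImage& src, int blockSize, double offset)
{
    GrayImage dst = src;
    const int half = blockSize / 2;
    for (int y = 0; y < src.height; y++) {
        for (int x = 0; x < src.width; x++) {
            long sum = 0;
            int count = 0;
            // Average over the part of the window that lies inside the image.
            for (int dy = -half; dy <= half; dy++) {
                for (int dx = -half; dx <= half; dx++) {
                    int xx = x + dx, yy = y + dy;
                    if (xx < 0 || yy < 0 || xx >= src.width || yy >= src.height)
                        continue;
                    sum += src.at(xx, yy);
                    count++;
                }
            }
            double mean = static_cast<double>(sum) / count;
            dst.pixels[static_cast<std::size_t>(y) * src.width + x] =
                (src.at(x, y) > mean - offset) ? 255 : 0;
        }
    }
    return dst;
}
```

Because the threshold is computed per pixel from the local neighborhood, uneven lighting across a billboard photo matters much less than with a single global threshold. This is one plausible reason the block size had to be tuned by experiment: it needs to be large relative to the stroke width of the digits.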

Action Two - Recognition

The recognition utility receives both useful fragments and garbage as input.

For recognition we use tesseract, the OCR utility backed by Google.
Other recognition tools could have been used; CuneiForm was also tested.
Tesseract was chosen because plenty of information is available about it and because there were clear instructions for training it on a custom character set.

Training on a custom alphabet had several goals:

The dictionary for recognizing numbers should consist of just 10 characters; no letters or other symbols are needed. A small character set lowers the probability of error.
In principle, we could have stopped at this first goal: tesseract has a mode that recognizes only digits. One could use it and not bother creating a custom dictionary.
But the test results prompted one more idea. In regular fonts (those included in the standard set), some digits look alike from the OCR point of view: under certain conditions "7" resembles "1", "3" resembles "8", and so on.
Therefore, we decided to use a font whose digits do not resemble each other. The hint for finding such a font was its name: "OCR A Std". This is the font used in the clippings above.
This gives us one more factor that reduces the likelihood of error.

As a result, a dictionary of the 10 digit characters of this font was created for tesseract; the characters can be seen in the clippings above.
I will not give instructions for training the utility: the process is mechanical rather than creative, and plenty of instructions are available online.

Action Three - Putting It Together

The system was tested under Ubuntu. The slicing and recognition utilities are launched from PHP.
The same script performs the final verification of the recognized data using a checksum.
The CRC-8 algorithm is used.
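
The checksum verification itself (the crc_check function called by the script) is not shown in the article. The sketch below illustrates the general idea in C++ under stated assumptions: the CRC-8 polynomial 0x07 with a zero initial value, and a number layout in which the payload digits are followed by the checksum written as three decimal digits. Neither the polynomial nor the layout is confirmed by the source, and `crc8`, `appendCrc` and `crcCheck` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// CRC-8 with polynomial x^8 + x^2 + x + 1 (0x07), initial value 0,
// no bit reflection. The polynomial choice is an assumption.
unsigned char crc8(const std::string& data)
{
    unsigned char crc = 0;
    for (std::size_t i = 0; i < data.size(); i++) {
        crc ^= static_cast<unsigned char>(data[i]);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x80)
                ? static_cast<unsigned char>((crc << 1) ^ 0x07)
                : static_cast<unsigned char>(crc << 1);
    }
    return crc;
}

// Assumed layout: payload digits followed by the CRC as 3 decimal digits.
std::string appendCrc(const std::string& payload)
{
    char buf[4];
    std::snprintf(buf, sizeof buf, "%03u", static_cast<unsigned>(crc8(payload)));
    return payload + buf;
}

// Returns the payload if the trailing checksum matches, otherwise an
// empty string, so OCR misreads and garbage fragments are rejected.
std::string crcCheck(const std::string& digits, std::size_t payloadLen)
{
    const std::size_t crcLen = 3;
    if (digits.size() != payloadLen + crcLen) return "";
    std::string payload = digits.substr(0, payloadLen);
    return appendCrc(payload) == digits ? payload : "";
}
```

Under these assumptions a printed number would be produced with `appendCrc`, and every recognized digit string would be passed through `crcCheck`. A single misread digit changes the checksum, so the string is discarded rather than accepted as a wrong number.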


$imagesout = '/home/toor/www/out';
$findrect = '/home/toor/OCR/OpenCV-2.2.0/samples/cpp/findrect';
$uploaddir = '/home/toor/www/uploads/';
$rectdir = '/home/toor/www/out/';
$tesseract = '/home/toor/OCR/tesseract-3.00/api/tesseract';
...
if (isset($_FILES['userfile']['tmp_name'])) 
{
    // basename() keeps a crafted file name from escaping the upload directory
    $uploadfile = $uploaddir . basename($_FILES['userfile']['name']);
    if (!move_uploaded_file($_FILES['userfile']['tmp_name'], $uploadfile)) 
    {
        echo "There were errors!";
        exit(1);
    } 
    echo "File {$_FILES['userfile']['name']} uploaded successfully!";
    // escapeshellarg() guards against shell injection through the file name
    $cmd = "$findrect " . escapeshellarg($uploadfile) . " tif " . escapeshellarg($imagesout);
    exec($cmd, $output);
    echo count($output)." fragments";
    $datas = array();
    foreach($output as $k => $f)
    {         
         $recognized = "$rectdir$k.txt";
         $cmd = "$tesseract $f $rectdir$k -l nums.ocr";             
         exec($cmd);         
         if (!file_exists($recognized)) continue;
         echo "file: $recognized";
         $data = file_get_contents($recognized);         
         $data = preg_replace('/\D/','',$data);
         $data = trim($data);
         if (!strlen($data)) continue;
         if (!array_key_exists($data,$datas))  $datas[$data] = 1; else $datas[$data]++;
    }
    foreach ($datas as $d => $v)
    {
        if ($r = crc_check($d, NUMBER_LEN_1, NUMBER_LEN_CRC_1)) {
            echo 'Number found: '.$r;
        }
        if ($r = crc_check($d, NUMBER_LEN_2, NUMBER_LEN_CRC_2)) {
            echo 'Number found: '.$r;
        }
    }
}

Overall, in test mode the system performed quite well.
It handles everything from images taken with the simplest phones up to multi-megabyte photos from digital cameras.

References

Tesseract
OpenCV
OCR A Std Font
