The work of the Haar cascade in OpenCV in pictures: theory and practice

In the last article, we described in detail the number recognition algorithm ( link ), which consists in obtaining a textual representation on a pre-prepared image containing a frame with a number + small indents for easy recognition. We only mentioned in passing that the Viola-Jones method was used to highlight areas where numbers are contained. This method has already been described on the hub ( link , link , link , link ). Today we will illustrate how it works and touch on previously unreviewed aspects + as a bonus it will be shown how to prepare cut pictures with numbers on the iOS platform for the subsequent receipt of a textual representation of the number.
Viola Jones Method
Usually, each method has a basis, something without which this method could not exist in principle, and the rest of it is already built on this basis. In the Viola-Jones method, this basis is made up of Haar primitives, which are a breakdown of a given rectangular region into sets of heterogeneous rectangular subdomains:

In the original version of the Viola-Jones algorithm, only primitives without rotations were used, and to calculate the value of the attribute, the sum of the brightnesses of pixels of one subregion was subtracted from the sum of brightnesses another subdomain [1]. In the development of the method, primitives with a slope of 45 degrees and asymmetric configurations were proposed. Also, instead of calculating the usual difference, it was proposed to assign a certain weight to each subdomain and calculate the attribute values as a weighted sum of pixels of different types of regions [2]:

Why was Haar's primitives based on the method? The main reason was an attempt to get away from the pixel representation while maintaining the speed of computing the attribute. It is difficult to derive any meaningful information for classification from the values of a pair of pixels, while, for example, the first cascade of a face recognition system that has a completely meaningful interpretation is constructed from two Haar signs [1]:

The complexity of calculating the sign as well as obtaining the pixel value remains O (1): the value of each subdomain can be calculated by combining 4 values of the integral representation (Summed Area Table - SAT), which in turn can be constructed in advance once for the entire image in O (n), where n is the number Ixels in the image using the formula [2]:


This allowed us to create a quick algorithm for finding objects, which has been successful for more than a decade. But back to our signs. To determine class membership in each cascade, there is a sum of the values of the weak classifiers of this cascade. Each weak classifier gives two values, depending on whether the value of the attribute belonging to this classifier is greater or less than the specified threshold. At the end, the sum of the values of weak classifiers is compared with the threshold of the cascade and decisions are made whether the object is found or not by this cascade. Enough theory, let's get to practice!
We already gave a link to XML of our license plate classifier, which can be found in the opencv project wizard ( link ). Let's look at its first cascade:
6 -1.3110191822052002e+000
<_>
0 -1 193 1.0079263709485531e-002
-8.1339186429977417e-001 5.0277775526046753e-001
<_>
0 -1 94 -2.2060684859752655e-002
7.9418992996215820e-001 -5.0896102190017700e-001
<_>
0 -1 18 -4.8777908086776733e-002
7.1656656265258789e-001 -4.1640335321426392e-001
<_>
0 -1 35 1.0387318208813667e-002
3.7618312239646912e-001 -8.5504144430160522e-001
<_>
0 -1 191 -9.4083719886839390e-004
4.2658549547195435e-001 -5.7729166746139526e-001
<_>
0 -1 48 -8.2391249015927315e-003
8.2346975803375244e-001 -3.7503159046173096e-001
At first glance, it seems that there are a lot of strange numbers and strange information, but in fact it’s simple: weakClassifiers - a set of weak classifiers based on which a decision is made as to whether an object is in the image or not, internalNodes and leafValues are the parameters of a specific weak classifier. Decryption of internalNodes from left to right: the first two values are not used in our case, the third is the attribute number in the general attribute table (it is located further in the XML file under the features tag), and the fourth is the threshold value of the weak classifier. Since we use a classifier based on single-level decision trees ( decision stump), then if the value of the Haar attribute is less than the threshold of the weak classifier (the fourth value in internalNodes), the first value of leafValues is selected, if more is the second. And now we will draw the reaction of some classifiers of the first cascade:





In fact, all these signs are to some extent the most ordinary border detectors. Based on this basis, a decision is made on whether the cascade recognized an object in the image or not.
The second most important point in the Viola-Jones method is the use of a cascade model or a degenerate decision tree: a decision is made in each node of the tree whether the object can be contained in the image or not. If the object is not contained, then the algorithm finishes its work, if it can be contained, then we proceed to the next node. The training is designed in such a way that at the initial levels with the least cost, discard most of the windows in which the object cannot be contained. In the case of face recognition - the first level contains only 2 weak classifiers, in the case of recognition of car numbers - 6 (taking into account that the latter contain up to 15). Well, for clarity, how the number recognition by levels occurs:

A richer tone indicates the weight of the window relative to the level. The rendering was done based on the modified opencv project code from the 2.4 branch (level-by-level statistics added).
Recognition Implementation on iOS



There is usually no problem with adding opencv to a project, especially since there is a ready-made iOS framework that supports all existing architectures (including the simulator). The function for finding objects is used the same as in the project for Android ( link ):
detectMultiScale
class cv::CascadeClassifier
, it remains only to prepare the data for input. Let's say we haveUIImage
where you need to find all the numbers. For the cascade, we need to do several things: first, shrink the image to 800px on the larger side (the larger the image, the more scales you need to consider, the number of windows that you need to look at depends on the image size), and secondly, do there is a black and white analogue from it (the method operates only with brightness, in principle this step can be skipped, opencv can do this for us, but we will do it at the same time, once and for all we manipulate the image), thirdly, get binary data for transmission opencv. All these three things can be done in one fell swoop by drawing our picture with the correct parameters into the context, like this:+ (unsigned char *)planar8RawDataFromImage:(UIImage *)image
size:(CGSize)size
{
const NSUInteger kBitsPerPixel = 8;
CGColorSpaceRef colorSpace = CGColorSpaceCreateDeviceGray();
NSUInteger elementsCount = (NSUInteger)size.width * (NSUInteger)size.height;
unsigned char *rawData = (unsigned char *)calloc(elementsCount, 1);
NSUInteger bytesPerRow = (NSUInteger)size.width;
CGContextRef context = CGBitmapContextCreate(rawData,
size.width,
size.height,
kBitsPerPixel,
bytesPerRow,
colorSpace,
kCGImageAlphaNone);
CGColorSpaceRelease(colorSpace);
UIGraphicsPushContext(context);
CGContextTranslateCTM(context, 0.0f, size.height);
CGContextScaleCTM(context, 1.0f, -1.0f);
[image drawInRect:CGRectMake(0.0f, 0.0f, size.width, size.height)];
UIGraphicsPopContext();
CGContextRelease(context);
return rawData;
}
Now you can safely create from this buffer
cv::Mat
and pass it to the recognition function. Next, recalculate the position of the found objects in relation to the original image and cut out:CGSize imageSize = image.size;
@autoreleasepool {
for (std::vector::iterator it = plates.begin(); it != plates.end(); it++) {
CGRect rectToCropFrom = CGRectMake(it->x * imageSize.width / imageSizeForCascade.width,
it->y * imageSize.height / imageSizeForCascade.height,
it->width * imageSize.width / imageSizeForCascade.width,
it->height * imageSize.height / imageSizeForCascade.height);
CGRect enlargedRect = [self enlargeRect:rectToCropFrom
ratio:{.width = 1.2f, .height = 1.3f}
constraints:{.left = 0.0f, .top = 0.0f, .right = imageSize.width, .bottom = imageSize.height}];
UIImage *croppedImage = [self cropImageFromImage:image withRect:enlargedRect];
[plateImages addObject:croppedImage];
}
}
If desired, the class
RVPlateNumberExtractor
can be redone and used in any other project where recognition of any other objects, not just numbers, is required. Just in case, I wanted to note that if you want to open the immediately recorded image from the disc through
imread
, then on iOS there may be problems, because when taking photos, iOS records the picture always in the same orientation and adds rotation information to EXIF, and opencv EXIF does not process reading. Again, you can get rid of this by rendering to the context.Afterword
All the source code of our latest iOS application can be found on GitHub: link
There you can find a lot of useful things, for example, the already mentioned class
RVPlateNumberExtractor
for cutting numbers from a full-fledged image of pictures with numbers, as well as RVPlateNumber
with a very simple interface that you can safely take into your own projects, if you need a service for recognition of numbers and it may well be that you will find there something else interesting for yourself. We also do not mind if someone wants to add new functionality to the application or makes a beautiful design! Application in the AppStore: link
At the request of workers, we also updated the android application : added a selection of saved numbers for sending.
List of references
- P. Viola and M. Jones. Robust real-time face detection. IJCV 57 (2), 2004
- Lienhart R., Kuranov E., Pisarevsky V .: Empirical analysis of detection cascades of boosted classifiers for rapid object detection. In: PRS 2003, pp. 297-304 (2003)