Using convolutional networks to search, highlight, and classify

    Recently, ZlodeiBaal published an article, “Neurorevolution in the Heads and Villages,” which gave an overview of the capabilities of modern neural networks. In my opinion, the most interesting of these is the approach that uses convolutional networks for image segmentation, and that approach is the subject of this article.

    segnet.png


    For a long time I had wanted to study convolutional networks and learn something new. Besides, there were several recent Tesla K40s with 12 GB of memory, a Tesla C2050, ordinary video cards, a Jetson TK1, and a laptop with a mobile GT525M at hand. The most interesting of these to try is the TK1, since it can be used almost anywhere, even hung on a lamppost. The very first thing I started with was digit recognition. There is nothing surprising here, of course: digits have long been recognized well by networks, but there is a constant demand for new applications that must recognize something: house numbers, license plates, wagon numbers, and so on. Everything would be fine, but digit recognition is only part of more general tasks.


    Convolutional networks come in different kinds. Some can only classify objects in an image. Others can select a rectangle around an object (R-CNN, for example). And some can filter the image and turn it into a kind of logical picture. I liked the last kind most of all: they are the fastest and work most beautifully. One of the most recent networks on this front, SegNet, was chosen for testing; more details can be found in the article. The main idea of this method is that instead of a single number, an image is supplied as the label, and a new “Upsample” layer is added to increase the dimensions of the layer.

    layer {
      name: "data"
      type: "DenseImageData"
      top: "data"
      top: "label"
      dense_image_data_param {
        source: "/path/train.txt"   # training list: image1.png  label1.png
        batch_size: 4
        shuffle: true
      }
    }
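The DenseImageData layer reads a plain text file where each line pairs a training image with its label mask. As a sketch of how such a list might be prepared (the directory layout and the assumption that image and mask share a filename are mine, not from the article), one could pair files like this:

```python
import os

def build_train_list(image_dir, label_dir, out_path):
    """Write 'image label' pairs, one per line, in the format DenseImageData expects.

    Assumes each mask in label_dir has the same filename as its image.
    """
    lines = []
    for name in sorted(os.listdir(image_dir)):
        label = os.path.join(label_dir, name)
        if os.path.exists(label):  # keep only images that have a mask
            lines.append(f"{os.path.join(image_dir, name)} {label}")
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines
```

Images without a corresponding mask are simply skipped, so a partially labeled directory does not break training.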
    


    At the end, the upsampled image and the mask from the label are fed to the loss layer, where each class is assigned its own weight in the loss function.

    layer {
      name: "loss"
      type: "SoftmaxWithLoss"
      bottom: "conv_1D"
      bottom: "label"
      top: "loss"
      softmax_param { engine: CAFFE }
      loss_param {
        weight_by_label_freqs: true
        class_weighting: 1
        class_weighting: 80
      }
    }
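The class weights compensate for the fact that background pixels vastly outnumber object pixels. The article does not say how the values 1 and 80 were chosen; one common scheme (a sketch, not necessarily the authors' method) is inverse-frequency weighting over the training masks, normalized so the background class gets weight 1:

```python
import numpy as np

def class_weights(masks, num_classes):
    """Inverse-frequency class weights, normalized so class 0 (background) has weight 1."""
    counts = np.zeros(num_classes, dtype=np.float64)
    for m in masks:
        # accumulate per-class pixel counts over all masks
        counts += np.bincount(m.ravel(), minlength=num_classes)
    freqs = counts / counts.sum()
    w = 1.0 / freqs        # rare classes get large weights
    return w / w[0]        # pin background weight to 1
```

For instance, if object pixels are 80 times rarer than background pixels, this yields weights of roughly 1 and 80, matching the magnitudes in the config above.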
    


    Correctly recognizing digits is only part of the number-recognition task, and far from the hardest part: first you have to find the number, then find roughly where the digits are, and only then recognize them. Large errors quite often appear at the first stages, and as a result it is difficult to achieve high overall recognition reliability. Dirty and damaged numbers are detected poorly and with large errors, the number template fits badly, and in the end there are many inaccuracies and difficulties. A number can also be completely non-standard, with arbitrary spacing, and so on.
    Wagon numbers, for example, have many spelling variations. If you select the boundaries of the number correctly, you can get at least 99.9% accuracy on each digit. But what if the digits are intertwined? What if segmentation gives different digits in different parts of the wagon?

    title.jpg


    Or take, for example, the task of detecting a license plate. Of course it can be solved with Haar cascades or with HOG. But why not try another method and compare? Especially when a labeled dataset is already available for training?
    An image with a license plate is fed to the input of the convolutional network together with a mask in which the rectangle with the plate is filled with ones and everything else with zeros. After training, we check the network on a test sample: for each input image it produces a mask of the same size, painting the pixels where, in its opinion, the plate is. The results are in the figures below.
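Preparing such a training mask is straightforward. A minimal sketch with NumPy, assuming the plate's bounding box is given as a hypothetical `(x, y, width, height)` tuple:

```python
import numpy as np

def plate_mask(h, w, box):
    """Binary training mask for an h-by-w image: ones inside the plate
    rectangle, zeros everywhere else. box = (x, y, bw, bh) is assumed
    to come from the dataset's annotations."""
    x, y, bw, bh = box
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[y:y + bh, x:x + bw] = 1
    return mask
```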

    8284.jpg

    8300.jpg

    8338.jpg

    8413.jpg

    8417.jpg


    After going through the test sample, you can see that this method works quite well and almost never fails; everything depends on the quality of the training and the settings. Since Vasyutka and ZlodeiBaal had a labeled dataset of plates, we trained on it and checked how well everything works. The result was no worse than a Haar cascade, and in many situations even better. Some disadvantages can be noted:
    • it does not detect slanted plates (they were not in the training set);
    • it does not detect plates shot at point-blank range (they were not in the sample either);
    • sometimes it does not detect white plates on pure-white cars (most likely also due to the incompleteness of the training sample; interestingly, the Haar cascade had the same glitch).

    On the whole, these shortcomings are logical: the network does not find what was not in the training set. If you approach the preparation of the dataset carefully, the result will be of high quality.
    The resulting solution can be applied to a large class of object-search tasks, not just license plates. Well, the plate has been found; now the digits in it need to be found and recognized. This is not as easy a task as it seems at first glance: many hypotheses about their location have to be checked, and if the number is non-standard and does not fit the template, you are out of luck. While car license plates are made according to GOST and have a fixed format, wagon numbers can be written however you like: by hand, with arbitrary spacing. For example, wagon numbers are written with gaps, and the digit one takes up much less space than the other digits.
    Convolutional networks rush to our aid again. What if we use the same network for both search and recognition? We will search for and recognize wagon numbers. An image containing a number is fed to the network input along with a mask in which the squares with digits are filled with values from 1 to 10 and the background with zeros.
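The mask construction generalizes naturally from the binary case: each digit's box is filled with its class index. A sketch, assuming hypothetical per-digit boxes `(digit, x, y, width, height)` from the annotations, with digit d encoded as class d + 1 so that 0 remains the background:

```python
import numpy as np

def digit_mask(h, w, digit_boxes):
    """Multi-class training mask: each (digit, x, y, bw, bh) box is filled
    with class digit + 1 (so digits 0..9 map to classes 1..10),
    background stays 0."""
    mask = np.zeros((h, w), dtype=np.uint8)
    for d, x, y, bw, bh in digit_boxes:
        mask[y:y + bh, x:x + bw] = d + 1
    return mask
```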

    number mask.png


    After a not very long training run on the Tesla K40, the result was obtained. To make it more readable, different digits are painted in different colors. Reading the number off the colors takes no great effort.
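Such a visualization amounts to mapping each class index in the network's output to a fixed color. A sketch with a hypothetical palette (the article does not specify which colors were used):

```python
import numpy as np

# Hypothetical fixed palette: one RGB color per class,
# index 0 = background (black), indices 1..10 = the ten digit classes.
PALETTE = np.array([
    [0, 0, 0], [255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 0],
    [255, 0, 255], [0, 255, 255], [128, 0, 0], [0, 128, 0], [0, 0, 128],
    [128, 128, 0],
], dtype=np.uint8)

def colorize(class_map):
    """Map an (H, W) array of class indices to an (H, W, 3) RGB image
    by indexing the palette with the whole array at once."""
    return PALETTE[class_map]
```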

    vagon in.png

    vagon_res.png


    In fact, the result turned out very well: even the worst numbers, which previously were recognized poorly, could be found, split into digits, and recognized in full. We ended up with a universal method that allows not only recognizing digits, but also finding an object in an image in general, selecting and classifying it, even when there may be several such objects.

    3.png

    3_res.png


    But what about trying something more unusual, interesting, and complex, such as highlighting and segmentation in medical images? For the test, X-ray images were taken from an open database of CT and X-ray images, the network was trained on them for lung segmentation, and as a result it was possible to accurately identify the area of interest. As before, the source image and a mask of zeros and ones were fed to the network input. On the right is the result the convolutional network produces, and on the left is the same area marked on the image.
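The article judges segmentation quality visually; a common quantitative check (my addition, not from the article) is the intersection-over-union between the predicted mask and the ground-truth mask:

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-union of two binary masks: area where both are
    set, divided by area where at least one is set. 1.0 for a perfect
    match; two empty masks also count as a perfect match."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0
```

Averaging this score over the test sample gives a single number to compare, say, the convolutional network against a template-based lung model.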

    f1.jpg


    For comparison, the article on lung segmentation uses a dedicated lung model. The result obtained with convolutional networks is in no way inferior, and in some cases even better. At the same time, training a network takes much less time than creating and debugging such an algorithm.

    In general, this approach has shown high efficiency and flexibility across a wide range of tasks; with it, it is possible to solve all kinds of problems of object search, segmentation, and recognition, not just classification:
    • locating license plates;
    • recognizing license plate numbers;
    • recognizing numbers on wagons, platforms, containers, etc.;
    • segmenting and selecting objects: lungs, seals, pedestrians, etc.

    The method works quickly enough: on a Tesla video card, processing one image takes 10-15 ms, and on the Jetson TK1 about 1.4 seconds. How to run Caffe on the Jetson TK1 and what processing speed can be achieved on it in these tasks probably deserves a separate article.

    P.S.
    Training took no more than 12 hours.
    The dataset of numbers: 1200 images.
    The dataset of cars: 6000 images.
    The lung dataset: 480 images.

    1. SegNet
    2. A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation (pdf)
    3. Haar
    4. Hog
    5. Segmentation of lungs in images (pdf)
