Absurdly accurate, fast and lightweight barcode detection through semantic segmentation

    Searching for objects in images? Given a training set and a minimal knowledge of neural networks, any student today can obtain a solution of reasonable accuracy. However, most neural networks used for this task are quite deep: they require a lot of training data, run relatively slowly at inference time (especially on devices without a GPU), take up a lot of space, and are quite energy-hungry. All of this can be critical in certain cases, above all for mobile applications.

    Barcodes are objects with a fairly simple structure. In the course of our research, using a relatively original approach, we managed to detect such simple objects very accurately (beating state-of-the-art) and fairly quickly (real time on an average CPU). On top of that, our detector is very light, with only 30k weights. We describe the results of this research in this article.

    But before discussing the solution, it is worth diving into the subject area a bit, starting with what barcodes actually are.

    What is a barcode and why look for it?

    Barcode is a generic name for a group of machine-readable data formats. You have probably heard of barcodes and QR codes - these are just special cases. In the modern world they can be found on tickets, on goods in stores, in official documents, and in many other places.

    Barcodes are usually divided into one-dimensional ones, represented by a set of black and white stripes, and two-dimensional ones, consisting of small square or rectangular blocks. As can be seen in the figure, the various types are, in general, quite similar to each other.

    An additional complication comes from barcode types whose length is not fixed: such barcodes can be arbitrarily long and at the same time very thin.

    The example above is artificial, but barcodes with similar proportions are found in real documents, in particular in invoices.

    And since these are machine-readable formats, they should be relatively easy for hardware to recognize. However, the first step on the path to recognition is detection, and that is what this article is about.

    It is worth noting that laser scanners and many applications assume the active participation of a person: the camera or scanner must be pointed directly at the barcode for it to be recognized, in which case the need for detection naturally disappears. However, in this scenario users are forced to spend extra effort, which is not good. This approach also cannot recognize several objects in one image; it is limited to a single one.

    Statement of the problem and quality metrics

    What would we like from the created barcode detector?

    1. Find all barcodes in the image accurately enough *
    2. Find long and very narrow barcodes
    3. Run in real time on a CPU

    * As the main accuracy metric, we propose to use the f-measure over objects at an IoU threshold of 0.5. We will also look at precision, recall, and detection rate.

    Formal Definition of Metrics

    First, we introduce the notion of IoU - Intersection over Union (also known as the Jaccard index). Given the true mask G and the predicted mask F of an object, IoU is computed as the area of the intersection divided by the area of the union:

    J(G, F) = area(G ∩ F) / area(G ∪ F)


    It is easy to see that IoU takes values from 0 to 1. Now, based on this measure, we introduce the notions of precision, recall, f-measure, and detection rate.

    Suppose the sample contains N objects. Fix the IoU threshold at some value t (for example, 0.5). For an object G and a detection F, we say the object is found (1) if J(G, F) ≥ t, and not found (0) otherwise. Then recall(t) is defined as the fraction of objects in the sample that were found, and precision(t) as the fraction of detections that match at least one object in the sample. The f-measure is defined in the standard way as the harmonic mean: 2 * precision(t) * recall(t) / (precision(t) + recall(t)).

    It remains to define the detection rate. Suppose the sample contains M images with objects. For each image, we compute the IoU between the union of all true masks and the union of all predicted masks, and threshold this value per image. The detection rate is the fraction of images for which this image-level IoU exceeds the given threshold t.
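    These definitions can be sketched in a few lines of NumPy. The function names and arguments here are illustrative, not code from the article:

```python
import numpy as np

def iou(mask_true, mask_pred):
    """Jaccard index between two boolean masks."""
    inter = np.logical_and(mask_true, mask_pred).sum()
    union = np.logical_or(mask_true, mask_pred).sum()
    return inter / union if union > 0 else 0.0

def detection_metrics(true_masks, pred_masks, t=0.5):
    """Object-level precision, recall and f-measure at IoU threshold t."""
    # An object is "found" if some detection overlaps it with IoU >= t.
    found = sum(any(iou(g, f) >= t for f in pred_masks) for g in true_masks)
    # A detection is "matched" if it overlaps some object with IoU >= t.
    matched = sum(any(iou(g, f) >= t for g in true_masks) for f in pred_masks)
    recall = found / len(true_masks) if true_masks else 0.0
    precision = matched / len(pred_masks) if pred_masks else 0.0
    fm = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, fm
```

    Running it on the example from the next paragraph (one large and one small object, only the large one detected) yields precision = 1.0 but recall = 0.5 at any threshold.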

    In previous studies on barcode detection, it has been customary to use the detection rate to assess quality. However, consider the following situation: suppose the image contains one very large object and one very small one, and the detector found the large object but missed the small one. In this case, the detection rate will only signal an error if the threshold t is very close to 1. Meanwhile, recall will be no more than 0.5 for any threshold t, which also makes the f-measure less than 1. That is, in such an example the f-measure signals the error earlier (in the sense of a lower threshold t) than the detection rate.

    Therefore, in our experiments, we relied on the f-measure, and used the detection rate to compare with the work of past years.

    Inspired by text detectors

    Perhaps the most famous and popular object detector when speed matters is YOLO (You Only Look Once), along with its later versions YOLOv2 and YOLOv3. This approach works well for more or less square objects, but it is not as strong at finding narrow and very elongated ones.

    So we decided to go another way. As you can easily see, lines of text in an image are also very elongated and narrow, and the task of finding text in images is quite popular in the scientific community (certainly more popular than barcode detection), with various interesting architectures invented for it. Our solution is motivated by one of these text detection works, which proposes the PixelLink architecture.

    The essence of the approach is as follows - let's solve the instance segmentation task like this:

    1. For each pixel, solve the binary classification problem (text / not text).
    2. For each pixel, predict 8 links to its neighbors. A link means that the pair of pixels belongs to the same object (instance).
    3. Having obtained a text/non-text segmentation map and the link maps from the image, assemble a graph whose vertices are the pixels predicted as text and whose edges are the predicted links.
    4. Find the connected components of this graph.
    5. Around each connected component, take the minimum enclosing rectangle (for example, using OpenCV), which will be the final detection.

    This process is illustrated graphically in the figure, taken from the original article.

    This approach for searching for text gives fairly decent results comparable to state-of-the-art.

    Now back to barcodes. We asked ourselves: do we really need these links? Indeed, barcodes, unlike text, are in the vast majority of cases located far from each other. Moreover, by specification, many barcode types must have a certain gap to the nearest neighboring objects.
    In short, we decided that we did not really need links and modified the approach as follows:

    1. Solve the semantic segmentation problem: for each pixel, determine the barcode / non-barcode class.
    2. Build a graph whose vertices are the pixels classified as barcode. Instead of predicting links, we simply assume there is an edge between any pair of adjacent barcode pixels.
    3. Find the connected components of this graph.
    4. Around each connected component, take the minimum enclosing rectangle, which will be the final detection.

    So, we have settled on the general scheme of the solution. Steps 2-4 can be considered trivial, so let's discuss how exactly to tackle the semantic segmentation.

    Network Architecture for Semantic Segmentation

    A bit of history of the commonly used approaches

    Semantic segmentation with neural networks took off around 2015 with the advent of U-Net. In U-Net, the spatial resolution is first gradually reduced in the encoder and then gradually increased in the decoder, with skip connections from intermediate encoder features to intermediate decoder features. Later came dilated convolutions (see the figure below), which yielded better quality but required more memory and computation. Finally, DeepLabv3+ combined both of these ideas and was state-of-the-art among human-designed architectures at the time of writing this article (in fact, thanks to Neural Architecture Search, even more effective solutions have since been found).

    Let us dwell a little longer on dilated convolution, since the final solution relies on its properties.

    A normal convolution works like this:

    while a dilated convolution is shown below (in the image, the dilation factor is 2):

    You may notice that a conventional convolution is actually a special case of dilated convolution (with dilation factor 1).

    In general, a discussion of solutions for semantic segmentation is a topic for a separate, rather long article; here I only wanted to give a short reminder, the most important part of which is understanding how dilated convolution works. If you are new to semantic segmentation, here is the best review known to the author; however, it is not required reading for what follows.

    If everything were so simple ...

    The main requirement for our neural network, besides quality, is speed. Our data contain barcodes so small relative to the image size that the minimum resolution at which it makes sense to run a neural network is at least 512x512, and preferably 1024x1024. At the same time, the network must run on the CPU in no more than 100 ms per image (and on a single core!). Given the combined requirements on input resolution and total running time, the task is far from trivial. With such severe restrictions, strong, deep architectures are out of the question. In fairness, we have not yet met the 100 ms single-core requirement, but the final solution runs in 40 ms on 4 cores (Intel Core i5, 3.2 GHz).

    Unfortunately, none of the architectures we knew for semantic segmentation fit these restrictions. So we had a choice: either come up with something completely new, with no guarantee of success, or simplify one of the popular architectures as much as possible. We did not have any brilliant ideas, so we chose the latter.

    Inspired by Context Aggregation Network

    Dilated convolutions have a useful property: if you apply them sequentially with an exponentially increasing dilation factor, the receptive field grows exponentially (whereas with ordinary convolutions it grows only linearly).

    The figure (source) shows the exponential growth of the receptive field when dilated convolutions are applied sequentially: (a) F1 is obtained from F0 by a convolution with dilation = 1, giving a 3x3 receptive field for each element of F1; (b) F2 is obtained from F1 by a convolution with dilation = 2, giving a 7x7 receptive field for each element of F2; (c) F3 is obtained from F2 by a convolution with dilation = 4, giving a 15x15 receptive field for each element of F3.
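    The growth in the figure is easy to verify with a few lines of arithmetic. For a stack of k x k convolutions, each layer with dilation d extends the reach by (k - 1) * d pixels:

```python
def receptive_field(dilations, kernel=3):
    """Receptive field after a stack of dilated (kernel x kernel) convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d  # each layer adds (k - 1) * d to the reach
    return rf

# Exponentially increasing dilation: receptive field grows exponentially.
print(receptive_field([1]))        # 3   (F1: 3x3)
print(receptive_field([1, 2]))     # 7   (F2: 7x7)
print(receptive_field([1, 2, 4]))  # 15  (F3: 15x15)
# Ordinary convolutions (dilation 1 everywhere) grow only linearly:
print(receptive_field([1, 1, 1]))  # 7
```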

    In the article "Multi-Scale Context Aggregation by Dilated Convolutions", a special block based on this property (called the Context Module by its authors) was proposed to collect information from the context of the current pixel. In that article, the block is applied on top of a network that already solves the segmentation task, in order to refine its predictions using context information.

    We decided to take this Context Module but use it not to improve the predictions of an existing good model, but as the main part of a convolutional network that solves the problem on its own. Overall, our architecture is arranged as follows (see the table):

    1. Downscale Module: initial convolutions that extract the simplest features and reduce the resolution.
    2. Context Module: essentially a set of dilated convolutions that rapidly enlarge the receptive field, collecting features from a wider context.
    3. The final layer (a 1x1 convolution) producing the probability map for barcode/non-barcode semantic segmentation. If we also want to distinguish the types of detected barcodes, N additional channels are predicted (see below).

    That is, the resulting architecture uses a constant small number of channels, C = 24, in every layer; the first 3 convolutions reduce the resolution, and the subsequent convolutions (the Context Module) enlarge the receptive field of each element of the feature map.
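    Since the hyperparameter table itself is not reproduced here, the following PyTorch sketch makes assumptions: grayscale input, three stride-2 separable convolutions in the Downscale Module, and dilations 1, 2, 4, 8, 16 in the Context Module. It is an illustration of the scheme, not the authors' exact network, though with C = 24 it lands near the ~30k-weight budget mentioned in the article:

```python
import torch
import torch.nn as nn

class SeparableConv(nn.Module):
    """Depthwise 3x3 + pointwise 1x1 convolution."""
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(cin, cin, 3, stride=stride, padding=1,
                                   groups=cin, bias=False)
        self.pointwise = nn.Conv2d(cin, cout, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class BarcodeSegNet(nn.Module):
    def __init__(self, channels=24, n_classes=0, dilations=(1, 2, 4, 8, 16)):
        super().__init__()
        # Downscale Module: 3 separable stride-2 convolutions (resolution / 8).
        self.down = nn.Sequential(
            SeparableConv(1, channels, stride=2), nn.ReLU(inplace=True),
            SeparableConv(channels, channels, stride=2), nn.ReLU(inplace=True),
            SeparableConv(channels, channels, stride=2), nn.ReLU(inplace=True),
        )
        # Context Module: dilated convolutions with growing dilation factor.
        ctx = []
        for d in dilations:
            ctx += [nn.Conv2d(channels, channels, 3, padding=d, dilation=d,
                              bias=False),
                    nn.ReLU(inplace=True)]
        self.context = nn.Sequential(*ctx)
        # Final 1x1 convolution: 1 barcode/non-barcode channel + N class channels.
        self.head = nn.Conv2d(channels, 1 + n_classes, 1)

    def forward(self, x):
        return self.head(self.context(self.down(x)))
```

    Counting the parameters of this sketch gives roughly 28 thousand weights, in the same ballpark as the detector described above.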

    The table of architecture hyperparameters also indicates, among other things, whether the convolution in a given layer is separable. Separable convolutions theoretically give a speedup on the order of k^2 compared to an ordinary convolution, where k is the filter size. In practice, the speedup can be much smaller (it depends on the image resolution and on the number of input and output filters).
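    The gap between the theoretical k^2 and the practical speedup is easy to see by counting weights (compute cost scales the same way for a fixed spatial resolution). The helper names below are mine:

```python
def conv_params(cin, cout, k):
    """Weights in a regular k x k convolution (no bias)."""
    return cin * cout * k * k

def separable_params(cin, cout, k):
    """Weights in a depthwise k x k + pointwise 1x1 pair (no bias)."""
    return cin * k * k + cin * cout

cin = cout = 24  # the C = 24 channels used in our layers
k = 3
print(conv_params(cin, cout, k))       # 5184
print(separable_params(cin, cout, k))  # 792
# The ratio approaches k^2 = 9 only when cout is large relative to k^2;
# at 24 channels it is about 6.5.
print(conv_params(cin, cout, k) / separable_params(cin, cout, k))
```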

    Initially, we tried to make all convolutions separable, but this made the quality drop a lot. Then we decided to make only the first 3 convolutions separable, since they work at higher image resolutions and therefore require more computation than the subsequent convolutions. This time the quality did not drop at all. In total, separability sped up the neural network by another 20%.

    One last unexplained point remains: what the number N in the last layer means. N is the number of different object types (in our case, barcode types) that we want to distinguish. In the simplest case, when we only need to find objects without determining their type (it's a barcode, no matter which one), we can take N = 0. If we do want to distinguish barcode types (it's a barcode of type such-and-such), then N channels are added to the last layer, predicting the probability of each type. Now, having obtained a detection as a rectangle containing a barcode, we can average the class probabilities inside this rectangle to determine the type of the found barcode.
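    This averaging step is a one-liner with NumPy boolean indexing; `classify_detection` is a hypothetical helper, not code from the article:

```python
import numpy as np

def classify_detection(class_probs, box_mask):
    """Average per-pixel class probabilities inside a detected rectangle.

    class_probs: (N, H, W) array, one probability map per barcode type.
    box_mask: (H, W) boolean mask of the detected rectangle.
    Returns the index of the most probable barcode type.
    """
    # Select the N probability vectors of the pixels inside the rectangle
    # and average them over those pixels.
    mean_probs = class_probs[:, box_mask].mean(axis=1)
    return int(np.argmax(mean_probs))
```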


    Having obtained a solution, one should always look around and compare its quality with earlier studies. In general, the task of barcode detection is not very popular: over the last 10 years, only 1-2 articles per year have been published (as you know, the easiest way to beat state-of-the-art is to find an unpopular task, which is what we ended up doing...). The table below compares our approach with others on two datasets, on one of which we obtained the best results.

    The improvement, admittedly, is not super impressive, but it is there. However, the main feature of the solution is not accuracy but speed: compared to the same YOLO on the same GPU (GTX 1080), and at a higher image resolution, our method runs ~3.5 times faster.

    In addition to the speed advantage, there is also the size of the final model. The detector has ~30 thousand weights, while the vast majority of modern convolutional networks (even those tailored for mobile devices) have millions of parameters. The final solution weighs even less than LeNet, which had ~60 thousand parameters.


    In fact, this work is based on two simple ideas:

    1. Detection through semantic segmentation.
    2. Using a sequence of dilated convolutions with a small number of channels for the fastest possible growth of the receptive field.

    The result is a very light and fast architecture that nevertheless shows quite decent results (and even beats SOTA on the unpopular task of barcode detection). Moreover, the approach is not limited to barcodes: we used the same architecture for finding text, business cards, and passports, with good results on those tasks as well.

    However, it is worth noting that the approach in its current form has a limitation: if two objects of the same type are located very close to each other, they will merge into one. Perhaps in the next installment we will tell you how to deal with this problem.

    Based on the results of experiments with barcodes, we wrote an article that will be presented at ICDAR2019 in Sydney.


    1. The article with the results of our experiments and many more details: Universal Barcode Detector via Semantic Segmentation
    2. An excellent review of the development of architectures for semantic segmentation: link
    3. PixelLink: Detecting Scene Text via Instance Segmentation
    4. The article on using dilated convolutions to improve predictions by gathering context information: Multi-Scale Context Aggregation by Dilated Convolutions

    Computer Vision Research Group
