Machine learning and chocolates
Merchandisers are said to have an unwritten rule: never place Nesquik and Snickers bars next to each other. Whether or not that is a myth, there are technologies that let you check storage conditions and the layout of chocolates on display shelves. In this article we look at them in detail and describe a machine learning model built for exactly this purpose.
Situation
The company we worked with runs a large distribution network through supermarket chains in more than fourteen countries. Each distributor must arrange chocolates on its display shelves according to standard policies. These policies specify which shelf each kind of candy belongs on and set the rules for storage.
Verifying compliance with these policies is invariably very expensive. SMART Business wanted a system that would let a reviewer or store manager look at an image and immediately see whether the goods on the shelf were laid out correctly, as in the image below.
Compliant layout (left), non-compliant layout (right)
Study
While scoping the problem, we investigated several image classification techniques, including the Microsoft Custom Vision Service, transfer learning with CNTK ResNet, and object detection with CNTK Fast R-CNN. Although object detection with Fast R-CNN gave the best result, the study also showed that the approaches differ in complexity and that each has its own pros and cons.
Custom Vision Service
Training and consuming a REST-based service is much easier than training, deploying, and updating a custom computer vision model, so the first thing we tried was the Microsoft Custom Vision Service. Custom Vision Service is a tool for building custom image classifiers and improving them continuously. To train the model, we used a sample of 882 images, each showing a single shelf of chocolates: 505 images showed a compliant layout and 377 a non-compliant one.
Training gave us a reasonably effective baseline model with Custom Vision Service.
In addition, we tested the model on a set of 500 held-out images to supplement this data and obtain consistent baseline figures; the results are shown below.
For more information about benchmarking and using a Custom Vision Service model in a production environment, see the earlier code example, Classifying Foods Using Custom Vision Service. For detailed explanations of the standard classification metrics, see the Metrics section of the article on evaluating machine learning algorithms in Python.
Label | Precision | Recall | F1 score | Support |
---|---|---|---|---|
Does not meet the requirements | 0.71 | 0.74 | 0.72 | 170 |
Meets the requirements | 0.87 | 0.85 | 0.86 | 353 |
Average / total | 0.82 | 0.81 | 0.82 | 523 |
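Confusion matrix (rows: actual class, columns: predicted class)
Actual \ Predicted | Does not meet the requirements | Meets the requirements |
---|---|---|
Does not meet the requirements | 125 | 45 |
Meets the requirements | 52 | 301 |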
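The per-class scores reported above can be recomputed from the confusion matrix. The sketch below is only an illustration, not the authors' evaluation code; it assumes that rows are actual classes and columns are predicted classes, which is consistent with the row sums matching the Support column (170 and 353). The function name is hypothetical.

```python
def per_class_metrics(matrix):
    """Return (precision, recall, f1, support) for each class.

    Assumes matrix[i][j] is the number of samples whose actual class is i
    and whose predicted class is j.
    """
    results = []
    for k in range(len(matrix)):
        true_pos = matrix[k][k]
        predicted_k = sum(row[k] for row in matrix)  # column sum
        actual_k = sum(matrix[k])                    # row sum, i.e. support
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1, actual_k))
    return results


# Confusion matrix of the Custom Vision Service model from the text above.
cvs_matrix = [[125, 45], [52, 301]]
labels = ["Does not meet the requirements", "Meets the requirements"]
for name, (p, r, f1, support) in zip(labels, per_class_metrics(cvs_matrix)):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
# Does not meet the requirements: precision=0.71 recall=0.74 f1=0.72 support=170
# Meets the requirements: precision=0.87 recall=0.85 f1=0.86 support=353
```

The printed values match the two class rows of the table, which supports reading its first two numeric columns as precision and recall.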
Although Custom Vision Service performed very well in this scenario and proved to be a powerful image classification tool, it has a number of limitations that hinder its use in a production environment.
These limitations are best summed up in the following passage from the Custom Vision Service documentation:
The methods that Custom Vision Service uses are robust to differences, which allows you to start prototyping even with a small amount of data. In theory, very few images are needed to create a classifier: 30 images per class are enough for a prototype. However, this also means that Custom Vision Service is usually poorly suited to scenarios that require detecting very subtle differences.
Custom Vision Service worked very well once we narrowed the scope to a single policy and to individual shelves of chocolate. However, the limit of 1000 training images prevented us from fine-tuning the model for some borderline cases within that policy.
For example, Custom Vision Service was good at detecting gross policy violations, which made up most of our data set, such as the following cases:
However, the service consistently failed to recognize less obvious but systematic violations in which the difference came down to a single candy, as on the first shelf in this image:
To overcome the limitations of Custom Vision Service, we considered creating several models and combining their results with a majority-voting classifier. Although this would undoubtedly improve the results, and might allow the model to be used in other scenarios, it would also increase API cost and runtime. In addition, the approach still could not scale beyond one or two policies, because Custom Vision Service limits each account to no more than nineteen models.
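To make the trade-off concrete, here is a minimal sketch of majority voting over several classifiers. It is an illustration under stated assumptions, not the project's code: the three lambdas are hypothetical stand-ins for separately trained models, and no real Custom Vision Service API is called.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label; ties go to the label that was seen first."""
    return Counter(votes).most_common(1)[0][0]


def classify_with_ensemble(image, classifiers):
    """Ask every classifier for a label and combine the answers by majority vote.

    Also returns the number of prediction calls made, to show how the
    per-image cost grows with the number of models.
    """
    votes = [classify(image) for classify in classifiers]
    return majority_vote(votes), len(votes)


# Hypothetical stand-ins for three independently trained models.
classifiers = [
    lambda image: "meets",
    lambda image: "does not meet",
    lambda image: "meets",
]

label, calls = classify_with_ensemble("shelf.jpg", classifiers)
print(label, calls)  # meets 3
```

Each image now costs one prediction call per model, which is where the extra API cost and runtime come from. Using an odd number of models avoids ties in a two-class problem.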
Transfer Learning Using CNTK and ResNet
To get around the data set limitations of Custom Vision Service, we decided to build an image recognition model using CNTK and ResNet-based transfer learning, following the instructions in this tutorial. ResNet is a deep convolutional neural network (CNN) architecture developed by Microsoft for the 2015 ImageNet competition.
In this case, our training data set contained two sets of 795 images, representing compliant and non-compliant layouts.
Fig. 3. ResNet CNN with an image from ImageNet. The input is an RGB image of a cat; the output is a probability vector whose maximum value corresponds to the label “tabby cat”.
Since we had neither enough data (tens of thousands of samples) nor enough computing power to train a large-scale CNN from scratch, we decided to take a pretrained ResNet and retrain only its output layer on our training data set.
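The idea of retraining only the output layer can be shown with a deliberately tiny toy example. This is not CNTK and not ResNet: the fixed `FROZEN_BACKBONE` weights are a hypothetical stand-in for the pretrained layers, the data are made-up 2-D points, and the only parameters that are ever updated belong to a logistic output layer.

```python
import math

# Hypothetical stand-in for the pretrained network body. These weights are
# fixed and never updated, playing the role of the frozen ResNet layers.
FROZEN_BACKBONE = ((1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0))


def frozen_features(x):
    """Map a raw input to features using the fixed backbone (linear + ReLU)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in FROZEN_BACKBONE]


def train_output_layer(samples, labels, epochs=200, lr=0.1):
    """Fit only a logistic output layer on top of the frozen features."""
    feats = [frozen_features(x) for x in samples]  # backbone is evaluated, not trained
    weights, bias = [0.0] * len(FROZEN_BACKBONE), 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            z = bias + sum(w * v for w, v in zip(weights, f))
            error = 1.0 / (1.0 + math.exp(-z)) - y  # gradient of the log loss w.r.t. z
            weights = [w - lr * error * v for w, v in zip(weights, f)]
            bias -= lr * error
    return weights, bias


def predict(x, weights, bias):
    """Return 1 if the output layer scores the input above zero, else 0."""
    f = frozen_features(x)
    return 1 if bias + sum(w * v for w, v in zip(weights, f)) > 0 else 0


# Made-up toy data: class 1 lies to the right of the origin, class 0 to the left.
samples = [(1.0, 0.5), (2.0, -1.0), (1.5, 1.0), (-1.0, 0.5), (-2.0, -1.0), (-1.5, 1.0)]
labels = [1, 1, 1, 0, 0, 0]
weights, bias = train_output_layer(samples, labels)
print([predict(x, weights, bias) for x in samples])
```

The sketch only works because the frozen features happen to separate the two toy classes; with a real pretrained network the same approach depends on how well the pretrained features describe the new images.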
Results
We ran the ResNet transfer learning model three times: for 20, 200, and 2000 epochs. The best result on the test data set came from the 2000-epoch run.
Label | Precision | Recall | F1 score | Support |
---|---|---|---|---|
Does not meet the requirements | 0.38 | 0.96 | 0.54 | 171 |
Meets the requirements | 0.93 | 0.23 | 0.37 | 353 |
Average / total | 0.75 | 0.47 | 0.43 | 524 |
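Confusion matrix (rows: actual class, columns: predicted class)
Actual \ Predicted | Does not meet the requirements | Meets the requirements |
---|---|---|
Does not meet the requirements | 165 | 6 |
Meets the requirements | 272 | 81 |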
As you can see, transfer learning performed significantly worse than Custom Vision Service.
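The "Average / total" row of the ResNet table can be reproduced as a support-weighted average of the per-class scores. The sketch below assumes the same matrix orientation as before (rows are actual classes, columns are predicted classes); the function name is hypothetical and the code is an illustration rather than the original evaluation script.

```python
def weighted_average_metrics(matrix):
    """Support-weighted precision, recall and F1 for a confusion matrix.

    Assumes matrix[i][j] counts samples of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    averages = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for k, row in enumerate(matrix):
        true_pos = row[k]
        support = sum(row)
        predicted_k = sum(r[k] for r in matrix)
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = support / total  # each class counts in proportion to its support
        averages["precision"] += weight * precision
        averages["recall"] += weight * recall
        averages["f1"] += weight * f1
    return averages


# Confusion matrix of the ResNet transfer learning model (2000 epochs).
resnet_matrix = [[165, 6], [272, 81]]
avg = weighted_average_metrics(resnet_matrix)
print({name: round(value, 2) for name, value in avg.items()})
# {'precision': 0.75, 'recall': 0.47, 'f1': 0.43}
```

Support-weighted recall is the same quantity as overall accuracy, so the model labels only about 47% of the test images correctly: 246 of 524. The matrix also shows why: 437 of the 524 images are predicted as not meeting the requirements, although only 171 actually belong to that class.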
Transfer learning with ResNet is nonetheless a powerful way to train an image recognition system on a limited data set. However, when the model is applied to new images that differ too much from the original 1000 ImageNet classes, it struggles to derive new representative features from the abstract features it “learned” from the ImageNet training set.
Conclusions
Object recognition techniques such as Custom Vision Service gave promising results for classifying individual policies. However, given the large number of brands and the many possible ways of arranging them on the shelves, the available data did not allow a standard object recognition process to determine from images alone whether a policy was being followed.
Given the complexity of the problem, and the wish of SMART Business to create new models for each policy as quickly and easily as possible using standard object recognition methods, we decided to approach the problem creatively.
Solution
Object Detection and Fast R-CNN
To make the policy problem more tractable without compromising classification accuracy, we decided to use object detection with Fast R-CNN and AlexNet to detect compliant shelves in the images. If shelves with a compliant layout are found in an image, the whole display is considered compliant. This allowed us not only to classify images but also to reuse previously classified shelves when creating new custom policies. We chose Fast R-CNN over alternatives such as Faster R-CNN because its implementation and evaluation with CNTK had already proved effective (see the section Object Detection Using CNTK).
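The step from shelf detections to an image-level verdict can be sketched as a small heuristic. Everything specific here is an assumption made for illustration: the `Detection` structure, the `valid_shelf` label, the score threshold, and the `expected_shelves` parameter are hypothetical and are not taken from the project's code or from CNTK.

```python
from dataclasses import dataclass

VALID_LABEL = "valid_shelf"  # hypothetical class name for a compliant shelf


@dataclass
class Detection:
    label: str    # class assigned by the detector
    score: float  # detector confidence between 0 and 1
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels


def count_valid_shelves(detections, min_score=0.5):
    """Count compliant-shelf detections that are confident enough."""
    return sum(1 for d in detections if d.label == VALID_LABEL and d.score >= min_score)


def image_meets_policy(detections, expected_shelves, min_score=0.5):
    """Treat the display as compliant if enough compliant shelves were detected."""
    return count_valid_shelves(detections, min_score) >= expected_shelves


detections = [
    Detection("valid_shelf", 0.91, (10, 20, 300, 110)),
    Detection("valid_shelf", 0.84, (10, 120, 300, 210)),
    Detection("valid_shelf", 0.32, (10, 220, 300, 310)),  # below threshold, ignored
]
print(image_meets_policy(detections, expected_shelves=3))  # False
print(image_meets_policy(detections, expected_shelves=2))  # True
```

Because the decision rule is separate from the detector, a new policy can reuse the same shelf detections with a different rule, which is the kind of reuse described above.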
First, we used the new image support in the Visual Object Tagging Tool (VoTT) to label compliant layouts in a larger data set of 2600 images. For instructions on tagging image directories with VoTT, see Marking an Image Catalog.
Note that the layout on all three shelves meets the requirements, so the image shows a compliant display.
By adjusting the filtering ratios and the number and minimum size of the regions of interest (ROIs), we were able to get a high-quality result on the existing data set.
Results
Although at first glance the results for this model look much worse than those of the Custom Vision Service-based solution, its modular structure and its ability to generalize across individual, systematically recurring problems encouraged SMART Business to continue investigating advanced object detection methods.
Use cases
The advantages and disadvantages of the image classification methods we studied are summarized below in order of increasing complexity.
Methodology | Advantages | Disadvantages | Application area |
---|---|---|---|
Custom Vision Service | • Can get started even with small data sets. No GPU is required. • Verified images can be re-tagged to improve the model. • The service can be deployed to production with a single click. | • Unable to detect the most subtle differences. • The model cannot be run locally. • Limited training set: 1000 images in total. | • Cloud services such as Custom Vision Service are well suited to object classification problems with a limited training set. This is the simplest method available. • In our study, the service gave the best results on the available data set, but it could not scale across several policies or detect systematic, recurring problems. |
CNN / Transfer Learning | • Reuses the existing layers of the model, so the model does not have to be trained from scratch. • Simple training: sort the images into directories and point the training script at them. • No limit on training set size; the model can be run offline. | • Struggles to classify data whose abstract features differ from those learned from the ImageNet data set. • A GPU is required for training. • Deployment to production is considerably more complicated than with a computer vision service. | • Transfer learning with pretrained CNN models (such as ResNet or Inception) works best on medium-sized data sets whose properties are similar to the ImageNet categories. Note that if you have a large data set (at least several tens of thousands of samples), it is recommended to retrain all layers of the model. • Of all the methods we studied, transfer learning gave the worst result for our complex classification scenario. |
Object Detection Using VoTT | • Better at detecting subtle differences between image classes. • Detection regions are modular and can be reused when the criteria for a complex classification change. • No limit on training set size; the model can be run offline. | • Requires bounding box annotations for all images (although VoTT makes this much easier). • A GPU is required for training. • Deployment to production is considerably more complicated than with a computer vision service. • Fast R-CNN algorithms are not capable of detecting small regions. | • Combining object detection with heuristic image classification suits scenarios with medium-sized data sets where subtle differences are needed to tell image classes apart. • Of all the methods considered, this was the most complex to implement, but it gave the most accurate result on the available test data set. This is the method SMART Business chose. |
The deep learning ecosystem is developing rapidly, and fundamentally new algorithms are developed and improved every day. After looking at results on standard benchmarks, you might be tempted to reach for the latest DNN algorithm to solve your classification problem. However, it is just as important, and perhaps more so, to evaluate such new technologies in the context of their application. Too often, the novelty of a machine learning algorithm overshadows the importance of a carefully considered, balanced methodology.
The methods we explored in our collaboration with SMART Business cover a wide range of classification approaches of varying complexity and highlight the trade-offs to consider when building image classification systems.
Our study shows how important it is to weigh these trade-offs (implementation complexity, scalability, and room for optimization) against data set size, the variability of class instances, the similarity between classes, and the performance requirements.
P.S. Thanks to Kostya Kichinsky (Quantum Quintum) for illustrating this article.