How Amazon Go Probably Implements “Just Walk Out” Shopping


These days, tech company press releases rarely surprise us. The details of an innovation either leak months in advance or turn out to be unimpressive. But recently we have had a few genuine surprises. A few months before the release of the Switch, Nintendo decided that the future of consoles lay in their past and announced the NES Classic. And the victory of Google's AlphaGo over a human champion stunned experts, who had believed such a result was at least ten years away.

The December announcement of the Amazon Go retail store, where you can simply pick products off the shelves and walk out, was a shock comparable to the AlphaGo news. “Grab and go” shopping has long been known as “the future of retail” and has perpetually been “just a few years away”. I spent more than ten years in robotics research at Caltech, Stanford and Berkeley, and I now run a startup that makes outdoor security cameras. Computer vision was a big part of my work. Yet just a few months before the announcement, I confidently told someone that a grab-and-go system was still several years off. And I was not the only one who thought so: just two months earlier, Planet Money ran an episode on this very topic.

So when Amazon suddenly surprised us all by building such a thing, the first question was obvious: how does it work? The promotional video throws around buzzwords such as computer vision, deep learning and sensor fusion. But what does all this mean, and how do you actually combine these things?

I'll start with a spoiler: I don't actually know. I took no part in developing the project, and the company has not said how it works. But given my experience in computer vision, I can make a few educated guesses. At its core, Amazon Go looks like a product of the same advances in AI, computer vision and automated decision-making that gave us AlphaGo and the sudden breakthroughs in self-driving cars. Breakthroughs in statistics and parallel computing over the past five years have marked a new milestone in machine intelligence.

That is why cutting-edge developments arrive in waves, and why you will be able to let a self-driving car take you to the store for a carton of milk, with no human interaction at all, much sooner than anyone could have imagined.

Shopping basket

To understand how the Amazon Go ecosystem works, you first need to outline the problem. For a grocery store, Amazon has to answer one question: what is a shopper carrying when they leave the store? In other words, what is in their shopping cart?

There are really only two ways to answer that question. Amazon either has to look into the cart when the shopper leaves, or keep track of exactly what goes into it. We call the first approach the checkout line, and this is how most stores work today (they check everything the shopper is taking with them). I call the other approach the bar tab. Just as a bartender keeps track of everything a customer orders, a business can know what is in a shopping cart by tracking what goes into the cart and what comes out of it. Ideally, you know exactly what is there and never have to make shoppers show their purchases.
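The bar-tab idea can be sketched as a running ledger of take and put-back events for one shopper. This is only an illustration: the class name `Tab`, its methods and the item names are hypothetical, and nothing is known here about Amazon's actual data model.

```python
from collections import Counter


class Tab:
    """Running 'bar tab' for one shopper: item name -> quantity held."""

    def __init__(self):
        self.items = Counter()

    def take(self, item):
        # The shopper picked an item off the shelf.
        self.items[item] += 1

    def put_back(self, item):
        # The shopper returned an item; ignore returns of items not on the tab.
        if self.items[item] > 0:
            self.items[item] -= 1
            if self.items[item] == 0:
                del self.items[item]

    def contents(self):
        return dict(self.items)


tab = Tab()
tab.take("soda")
tab.take("chips")
tab.take("soda")
tab.put_back("chips")
print(tab.contents())  # {'soda': 2}
```

With a ledger like this, nothing needs to be inspected at the exit; the hard part, as the rest of the article discusses, is observing the events reliably.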

Of course, Amazon Go is no ordinary grocery store. It has to figure out not only what is in each particular cart, but also whom to charge for it. To take payment in a world without cashiers, you need to identify the shopper.

How does Amazon deal with this? How will the company keep track of people in the store, and of what they take from the shelves or put back, without making mistakes? It all starts with cameras. They are unobtrusive and cheap, and they can be installed everywhere. Amazon hinted at this by mentioning computer vision in its video. But how do you process what the cameras see and use it to track shoppers and their actions? This is where the second buzzword comes in: deep learning.

Neurons

The idea of using cameras for checkout was born long ago, but until recently it remained just an idea.

Until recently, vision algorithms worked by finding salient features in an image and assembling them into objects. Lines, corners and edges could be extracted from an image. Four lines and four corners in a certain arrangement give you a square (or a rectangle). The same principles can be used to identify and track more complex objects using more complex features and combinations of them. How sophisticated a vision algorithm is depends on the complexity of its features and of the techniques used to recognize particular sets of features as objects.
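As a rough illustration of a hand-crafted feature, here is a minimal sketch of a vertical edge detector: it marks places where brightness jumps sharply between horizontal neighbors. The function name `vertical_edges`, the threshold value and the tiny test image are assumptions made for this example, not anything from a real vision library.

```python
def vertical_edges(image, threshold=50):
    """Mark positions where brightness jumps sharply between horizontal neighbors.

    image: list of rows, each a list of grayscale values (0..255).
    Returns a grid one column narrower than the image, with 1 at each edge.
    """
    edges = []
    for row in image:
        edges.append([
            1 if abs(row[x + 1] - row[x]) > threshold else 0
            for x in range(len(row) - 1)
        ])
    return edges


# A dark region on the left and a bright region on the right.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
for row in vertical_edges(image):
    print(row)  # each row prints [0, 1, 0]
```

Every rule here (the direction, the threshold) was chosen by hand, which is exactly what made this style of algorithm laborious to extend.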

For a long time, the most interesting progress in computer vision and machine learning depended on researchers inventing ever more complex features. Lines and corners gave way to wavelets and Gaussian blurs, and to features with esoteric names like SIFT and SURF. For a while, the best feature for identifying a person in an image was called HOG. But it soon became clear that painstakingly hand-crafting features quickly hits a ceiling.

Algorithms based on recognizing specific features worked surprisingly well at recognizing what they had already seen. Show such an algorithm a picture of a six-pack of cola, and it becomes a world expert at recognizing six-packs of cola. But these algorithms did not generalize well; it was much harder for them to recognize soda in general, or the wider world of drinks.

To make matters worse, these systems were unreliable and very hard to improve. Fixing errors required painstaking manual tuning of the logic, and only PhDs who understood how the algorithm worked could do it. In a store, you might not care if the algorithm mixed up a bottle of Coke with a bottle of Pepsi, but you would care if it mistook a $20 bottle of wine for a $2 bottle of soda.

Modern deep learning is deliberately designed to do away with manually finding and tuning image features. Instead of trying to find characteristic features by hand, you use huge amounts of data to train a neural network. From examples of what it needs to recognize, the neural network finds its own features. Low-level neurons learn to recognize simple things such as lines, and their output is passed upward to neurons that combine these primitives into more complex things, such as shapes, in a hierarchical architecture.

You do not need to specify which features the neurons should recognize; they simply emerge on their own during training. The neurons work out which patterns are worth being sensitive to. If you are trying to build a soda recognition system, you show it tens of thousands of images of soda, and it will progress from lines and curves to shapes, and then to boxes and bottles.

Our brains work in much the same way, so errors are corrected the way humans correct them: with examples. If your neural network confuses wine and soda, you fix it by finding a few thousand more examples and training it on them. It will figure out for itself how to tell the objects apart.
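Learning from examples rather than from hand-written rules can be illustrated with a single artificial neuron (a perceptron), which is far simpler than the deep networks discussed here. In this sketch the two input features, their values and the function names are all made up for illustration; a real system would learn from pixels, not from two tidy numbers.

```python
def train_perceptron(examples, epochs=10):
    """examples: list of ((f1, f2), label), label +1 for wine, -1 for soda."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), label in examples:
            score = w[0] * x1 + w[1] * x2 + b
            if label * score <= 0:
                # Wrong (or undecided) answer: nudge the weights toward the example.
                w[0] += label * x1
                w[1] += label * x2
                b += label
    return w, b


def classify(model, features):
    w, b = model
    score = w[0] * features[0] + w[1] * features[1] + b
    return "wine" if score > 0 else "soda"


# Hypothetical normalized features, e.g. (bottle height, neck length).
examples = [
    ((1.0, 0.9), +1),
    ((0.9, 0.8), +1),
    ((0.2, 0.1), -1),
    ((0.3, 0.2), -1),
]
model = train_perceptron(examples)
print(classify(model, (0.95, 0.85)))  # wine
print(classify(model, (0.25, 0.15)))  # soda
```

Note that fixing a misclassification means adding examples and retraining, not editing the decision logic by hand.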

Software for simulating neurons has existed for several decades, but its use in computer vision long remained theoretical. Simulating animal vision requires tens to hundreds of layers of neurons, each containing tens of thousands of neurons. And with each new layer, the number of connections between layers grows rapidly. Such networks require enormous computing power to run and large data sets to train.

To build a neural network that runs in a reasonable amount of time, you have to fine-tune its structure to minimize the number of internal connections. But even then, it takes far too much horsepower.

Computational Cooperation

The next breakthrough came from using GPUs as desktop supercomputers. Simulating a neural network means gathering the inputs and computing the outputs for a great many neurons, and this process is easy to parallelize. Computations that took hours on the most powerful CPUs started to take minutes on a mid-range GPU.

Parallel GPU computing finally allowed researchers to take advantage of an old discovery: structuring a neural network to imitate vision. Recall that even a simple network of several hundred thousand neurons can have billions of connections. All of them have to be simulated, unless there is some shortcut for how those connections work.
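The "billions of connections" figure is easy to check with a little arithmetic. The sketch below assumes a fully connected feed-forward network in which every neuron in one layer connects to every neuron in the next; the layer sizes are illustrative, chosen to match the scale mentioned in the text.

```python
def dense_connections(layer_sizes):
    """Count connections in a fully connected feed-forward network.

    Each pair of adjacent layers contributes (size of one) * (size of the next).
    """
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


layers = [30_000] * 10          # 10 layers, 300,000 neurons in total
print(sum(layers))              # 300000
print(dense_connections(layers))  # 8100000000
```

Ten modest layers already give about eight billion connections, which is why a structural shortcut matters so much.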

Fortunately, when building networks that see, you can cheat a little: we have amazing examples of neural networks optimized for vision right inside our heads. Neuroscientists have been mapping the mammalian visual cortex for decades, and it served as the inspiration. Thus the convolutional neural network (CNN) was born. Over the past few years, it has become one of the most popular and powerful tools in computer vision.

Convolution is an amazing mathematical concept, and a simple explanation of it is beyond my abilities. One of the most vivid, though technically quite wrong, ways to picture it is to take one mathematical function and slide it along another, watching the result.

In a CNN, as in the visual cortex, there are neurons sensitive to particular features (say, noses), and they are distributed across the whole field of view. The outputs of these neurons are wired together as if we had taken a single nose-sensitive neuron and swept it across the field of view. The result is an output that tells you where noses are in the image. This is not limited to noses, of course: the effect is used to build spatial maps of where particular features appear in an image. These spatial relationships are fed to the higher layers of the network, where they are combined to recognize patterns and objects.
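Sweeping one feature detector across the field of view can be sketched in a few lines. The code below slides a small weight grid (the "neuron") over an image and records its response at every position; strictly speaking this is cross-correlation, which is what convolutional layers typically compute. The function name `feature_map`, the 2x2 detector and the toy image are assumptions for illustration only.

```python
def feature_map(image, kernel):
    """Slide `kernel` over `image` and return the response at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(len(image) - kh + 1):
        row = []
        for x in range(len(image[0]) - kw + 1):
            row.append(sum(
                image[y + i][x + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            ))
        out.append(row)
    return out


# A toy "feature": an L-shaped corner, and an image containing it once.
kernel = [
    [1, 1],
    [1, 0],
]
image = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
for row in feature_map(image, kernel):
    print(row)
# [0, 0, 1]
# [0, 1, 3]
# [0, 1, 1]
```

The strongest response (3) sits exactly where the corner is, so the output is a spatial map of where the feature occurs. Because the same weights are reused at every position, the network needs far fewer distinct connections than a fully connected one.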

CNNs were a revelation in computer vision. They are extremely useful for generalized object recognition: you train a CNN to recognize not one particular car or person, but cars or people in general. They even made one famous XKCD comic obsolete.

And because of their spatial structure, they lend themselves very well to parallelization on GPUs. Different neurons watching different parts of the image can be simulated completely independently. Suddenly, it became possible to recognize people, places and objects quickly, cheaply and with impressive accuracy.

The simultaneous explosion in the popularity of mobile phones and networks meant that hundreds of millions of people went online and uploaded billions of images to services such as Facebook and Google, unwittingly creating huge data sets for training algorithms.

Recent cutting-edge work goes even further. Researchers have created the recurrent neural network (RNN), which has built-in memory. Instead of simply passing its outputs on to the next layer, it uses internal connections to create persistent memory. If you are familiar with digital logic, you can think of flip-flops as an analogy. So you can train a network with a single visual layer that "looks" at an image and passes everything it has seen into memory, so that the network can recognize actions in video.
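The memory of a recurrent network comes from feeding its own previous state back in alongside each new input. Here is a minimal sketch with a single recurrent neuron; the weights `w_in` and `w_rec` are arbitrary illustrative values rather than trained ones, and real networks use many neurons and learned weight matrices.

```python
import math


def run_recurrent(inputs, w_in=1.0, w_rec=0.5):
    """Return the hidden state after each time step of a one-neuron recurrent cell."""
    state = 0.0
    history = []
    for x in inputs:
        # The new state mixes the current input with the previous state (the memory).
        state = math.tanh(w_in * x + w_rec * state)
        history.append(state)
    return history


seen = run_recurrent([1.0, 0.0, 0.0])    # something happened in the first frame
blank = run_recurrent([0.0, 0.0, 0.0])   # nothing ever happened
print(seen[-1] > 0.0, blank[-1] == 0.0)  # True True
```

Two frames after the event, the state of the first run is still nonzero: the cell "remembers" that something happened, which is the property needed to recognize an action that unfolds over several video frames.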

With these developments, you suddenly have algorithms that can recognize people, objects and actions with extremely high accuracy. In other words, you can train algorithms to recognize a person, to know where a store product is as it moves, and to recognize when a person puts it on a shelf or takes it off. All you need is some GPU power. And how convenient that one of the largest on-demand collections of GPUs belongs to Amazon: its extremely powerful and profitable AWS cloud service.

Have we cracked Amazon Go's secret by combining cheap cameras, brain-like algorithms and an army of computers? Not quite, because there is one more problem to solve. A camera's field of view is limited, so how can a business cover the entire store? And what if a shopper is standing between the camera and the shelf?

To handle this, you need to make sure every area is watched by multiple cameras. But that raises another question: how do you combine the input from several cameras into a coherent picture of what is happening?

Food synthesis

For this, we go back to the 1960s. Back then, NASA engineers faced a big problem: they had many different navigation instruments, from gyroscopes to star trackers, and they needed to reduce all their measurements to one best estimate of the spacecraft's position.

Amazon Go has a similar problem. For the whole undertaking to work, observations from several different cameras at different moments in time must be combined into one coherent account of the shopping cart. The catch is that the world is an inherently uncertain place, so the solution was to embrace that uncertainty. Instead of trying to determine everything with maximum precision, successful models take a probabilistic approach.

At NASA, the answer was an algorithm called the Kalman filter, which they used to account for the errors of each instrument and combine the measurements into the best possible estimate. The Kalman filter is based on Bayes' formula.
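To make the idea concrete, here is a minimal sketch of the measurement-update step of a one-dimensional Kalman filter. It assumes a quantity that does not change between measurements, so the prediction step of a full filter is omitted; the function name and all numbers are illustrative, not taken from any NASA system.

```python
def kalman_update(estimate, variance, measurement, measurement_variance):
    """Blend the current estimate with one new measurement.

    The gain weights the measurement by how uncertain the current estimate
    is relative to the instrument: a noisy instrument gets a small gain.
    """
    gain = variance / (variance + measurement_variance)
    new_estimate = estimate + gain * (measurement - estimate)
    new_variance = (1.0 - gain) * variance
    return new_estimate, new_variance


# Initial guess: position 10.0 with variance 4.0.
estimate, variance = 10.0, 4.0

# Instrument A reads 12.0 with variance 4.0.
estimate, variance = kalman_update(estimate, variance, 12.0, 4.0)
print(estimate, variance)  # 11.0 2.0

# Instrument B reads 11.0 with variance 2.0.
estimate, variance = kalman_update(estimate, variance, 11.0, 2.0)
print(estimate, variance)  # 11.0 1.0
```

Each measurement shrinks the variance, so the combined estimate is more certain than any single instrument on its own.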

In essence, Bayes' formula is a mathematical relationship that connects the observation of an event with the probability of its occurrence, giving you the likelihood that the event actually happened. The result is this: our belief that one of the possible states is true (the posterior probability) is proportional to the strength of our belief in that state before the observation (the prior probability), multiplied by how strongly the sensor data supports that state.
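Written out, with s standing for a possible state and z for the sensor observation (letters chosen here purely for illustration), the relationship is:

```latex
P(s \mid z) \;=\; \frac{P(z \mid s)\, P(s)}{\sum_{s'} P(z \mid s')\, P(s')}
```

Here P(s) is the prior, P(z | s) is how well the observation fits the state, and the denominator simply rescales the results so that the posteriors over all candidate states add up to one.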

Returning to the wine and soda example: suppose the neural network reports that the shopper took wine. Bayes' formula tells us that the probability that they really took it is proportional to the probability that they would take wine in the first place, multiplied by the probability that the camera reports wine when wine was actually taken.
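A small worked example may help. All the numbers below are invented for illustration: a shopper who takes wine 20% of the time, and a camera that reports "wine" 90% of the time when wine is taken and 10% of the time when soda is taken.

```python
def posterior(prior, likelihood):
    """Apply Bayes' formula over a finite set of states.

    prior:      state -> P(state)
    likelihood: state -> P(observation | state)
    """
    unnormalized = {s: prior[s] * likelihood[s] for s in prior}
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}


prior = {"wine": 0.2, "soda": 0.8}
camera_says_wine = {"wine": 0.9, "soda": 0.1}  # hypothetical camera error model

belief = posterior(prior, camera_says_wine)
print(round(belief["wine"], 3))  # 0.692
print(round(belief["soda"], 3))  # 0.308
```

Even with a fairly reliable camera, the low prior for wine keeps the posterior at about 69%, not 90%. That gap is exactly what the prior contributes.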

Amazon has two big advantages in using a probabilistic scheme based on Bayes' formula. The first is that the company can estimate prior probabilities, since it knows the purchase history of many of its customers. This means that if an Amazon Go customer buys coffee and a cupcake every Tuesday, the store can raise the probability of those purchases before the customer even reaches the relevant shelves. It is a natural way to use the huge amount of user data the company already has.

The second big advantage is that translating everything into the language of probability lets you combine multiple measurements from multiple sensors over multiple periods of time. Assuming the observations are independent, you can simply multiply the probabilities. And the posterior probability from one observation can be used as the prior for the next.

For example, suppose several cameras watch one shelf. Some are closer, some farther away. Several cameras think the shopper took an inexpensive soda from the shelf, one thinks they took an expensive product, one saw nothing, and the last thinks they were just picking their nose. Now what?

Amazon could devise complex logic for this case to decide which camera to believe. Was the camera that thinks the shopper took the expensive product closer, and did it have a better view? Was the shopper blocking the camera that saw nose-picking? But all you need is probability. Given each camera's error rate, which depends on its position and view, Bayes' formula tells us how to combine all the inputs to work out the probability that the shopper took the cheap soda, the expensive product, or nothing at all.
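The multi-camera case can be sketched by applying Bayes' formula once per camera, using each posterior as the prior for the next update. This assumes the cameras' errors are independent, and every number below (the prior and each camera's error model) is invented for illustration.

```python
def fuse(prior, camera_likelihoods):
    """Fold a sequence of independent camera reports into one belief."""
    belief = dict(prior)
    for likelihood in camera_likelihoods:
        belief = {s: belief[s] * likelihood[s] for s in belief}
        total = sum(belief.values())
        belief = {s: p / total for s, p in belief.items()}
    return belief


prior = {"cheap_soda": 0.5, "expensive_item": 0.2, "nothing": 0.3}

# Each dict is P(this camera's report | true state), hypothetical values.
reports = [
    {"cheap_soda": 0.7, "expensive_item": 0.2, "nothing": 0.1},  # close camera: "cheap soda"
    {"cheap_soda": 0.7, "expensive_item": 0.2, "nothing": 0.1},  # close camera: "cheap soda"
    {"cheap_soda": 0.3, "expensive_item": 0.5, "nothing": 0.2},  # distant camera: "expensive"
    {"cheap_soda": 0.5, "expensive_item": 0.5, "nothing": 0.5},  # blocked camera: uninformative
]

belief = fuse(prior, reports)
best = max(belief, key=belief.get)
print(best, round(belief[best], 2))  # cheap_soda 0.94
```

The dissenting distant camera lowers the confidence a little, and the blocked camera changes nothing because its report is equally likely under every state. No camera-specific rules were needed, only error rates.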

In fact, once you have moved into the wonderful world of probabilities, Bayes' formula lets you combine input from entirely different types of sensors.

This is why Amazon has filed patent applications for methods of using RFID sensors to pay for purchases automatically. Passive RFID tags are placed on products and then read by scanners located in the store. This technology is an excellent candidate for building an automated store, since it is cheap and already widespread. And because it allows scanning at a distance, it can be used in place of a cashier. Put a scanner where shoppers walk past, and you will see what they have in their basket without them having to take the goods out and present them to a cashier. Watching the promotional video, I noticed that all the goods were prepackaged: canned food, bags of chips and plastic containers of food. Such products are not only more profitable, they also let you put a tag on every item.

But using RFID alone has drawbacks. It cannot tell one shopper from another. You see a set of soda, chips and sandwiches leaving the store, and you know it is a purchase, but who bought it? RFID can also produce errors. If two shoppers walk past the scanner together, you may scan both sets of purchases and have no idea who bought what.

Probability estimates based on Bayes' formula help to cope with such problems. Amazon can maintain probabilities over locations and possible purchase combinations for hundreds of shoppers. The situation resembles the many-worlds interpretation of quantum mechanics: each time a shopper does something, the store creates a new “world” containing that action and tracks it (updating the probability of that world according to Bayes).

Let's go back to the cameras and the soda example: Amazon can use RFID scans to confirm or refute what the cameras report, without having to develop any special logic.

And the cherry on top: as with training neural networks, probabilistic estimates improve as more data comes in. As in statistics generally, the more measurements you take, the better your result. Each new data set improves the accuracy of the system and the experience of its users.

And Amazon proudly presents ... your dinner

This description may not be accurate, and we will not know for sure until Amazon shows its cards, but Bayes' formula helps to complete a fairly realistic picture of how this newfangled system might work.

On entering the store, you swipe your smartphone over a scanner. Cameras running image recognition and deep learning algorithms track you as you shop. Each time you take an item or put it back, the cameras recognize the action. Observations from several cameras are combined using Bayes' formula to produce information about what you have taken. The system keeps track of all possible combinations of goods you might have. Each time you pass through a door or gate, the RFID tags you are carrying are scanned, which lets the system narrow down the list of combinations. When you leave the store, the system looks at its list of what it thinks you have, picks the guess with the highest probability, and charges the corresponding amount to your account.
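The flow described above can be sketched as a small hypothesis tracker: each camera-observed action splits every candidate cart into weighted branches, an RFID scan reweights the branches, and the most probable cart wins at the exit. This is a speculative illustration, not Amazon's design. The function names, the `hit` and `false_alarm` rates of the RFID model, and all probabilities are assumptions, and item quantities are ignored for brevity.

```python
def branch(hypotheses, action_probs):
    """Split every cart hypothesis by what the shopper may have just taken.

    hypotheses:   {cart (sorted tuple of item names): probability}
    action_probs: {item name or None: probability}; None means "took nothing".
    """
    new = {}
    for cart, p in hypotheses.items():
        for item, q in action_probs.items():
            next_cart = cart if item is None else tuple(sorted(cart + (item,)))
            new[next_cart] = new.get(next_cart, 0.0) + p * q
    return new


def rfid_update(hypotheses, scanned, hit=0.95, false_alarm=0.02):
    """Reweight hypotheses by how well they explain the set of scanned tags."""
    universe = set(scanned).union(*hypotheses)
    updated = {}
    for cart, p in hypotheses.items():
        likelihood = 1.0
        for item in universe:
            if item in cart:
                likelihood *= hit if item in scanned else 1.0 - hit
            else:
                likelihood *= false_alarm if item in scanned else 1.0 - false_alarm
        updated[cart] = p * likelihood
    total = sum(updated.values())
    return {cart: p / total for cart, p in updated.items()}


# The shopper enters with an empty cart.
hypotheses = {(): 1.0}

# Fused camera belief about one shelf interaction (hypothetical numbers).
hypotheses = branch(hypotheses, {"soda": 0.7, "wine": 0.2, None: 0.1})

# At the exit gate, the RFID reader sees a soda tag only.
hypotheses = rfid_update(hypotheses, scanned={"soda"})

best_cart = max(hypotheses, key=hypotheses.get)
print(best_cart)  # ('soda',)
```

The RFID scan all but eliminates the wine and empty-cart branches, so the system can charge for one soda with high confidence.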

All of this became possible thanks to advances in deep learning, cloud computing and probabilistic estimation. Amazon Go could not have been built even five years ago, but today all the components are available. And the same combination is now at the heart of self-driving cars, AI, machine translation systems and much more. It is a very exciting time to work in machine learning. And while I am very curious to see what else lies ahead, I hope to soon enjoy shopping where you can just take your goods and leave.
