Neural network predicts 1 second of the future from a photo


    A generative adversarial neural network, optimized for video processing, can show what happens next.

    The ability to predict the near future is an essential skill for any person. Human reaction speed is not enough to respond to surrounding events in real time, so we predict them constantly, with probability close to 100%. Athletes know where the ball will fly. Businesspeople know when an interviewee will reach out for a handshake. We predict the trajectories of cars on the road and the immediate actions of people from their facial expressions and the objects in their hands.

    Artificial intelligence also needs to know the future. It must understand which events lead to which results, in order to avoid obvious missteps and plan its actions. A group of researchers from the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology is teaching a neural network to predict the future by training it on millions of videos.

    From a single static frame (a photo), the trained neural network tries to predict future events. The program is limited to a frame size of 64 × 64 pixels and a prediction length of 32 frames, that is, about one second of the future.

    Knowledge of the future makes it possible to better understand the present. This is a basic ability that any robot operating in the real world must possess. Observing a person in front of a plate of food with a fork and knife in hand, it should confidently predict that this person is about to eat. Without such an understanding, a robot cannot function effectively: you don't want it to pull the chair away just as you sit down. It must understand what will happen a second later and not touch anything, or, conversely, quickly move the chair exactly to the spot where the person is about to sit.

    At the moment, even the most advanced AI systems lack the basic capability to predict the near future, which is why this study matters. Research groups at New York University and Facebook are doing similar work, but their neural networks produce only a few frames of the future, or render them too blurrily.

    The program developed at CSAIL quite accurately predicts the most mundane and obvious events. For example, from a photo of a train at a platform, it predicts the train's movement.

    Examples of predicting events from a photograph: movement patterns of people, animals, natural phenomena, and vehicles

    In the paper, the developers tackle the fundamental problem of learning the scenario by which events in a frame unfold over time. Obviously, such a task is very difficult to annotate formally. Therefore, the neural network was trained directly on raw material: millions of videos without semantic annotations. This approach has certain advantages, because the AI can be trained unsupervised, simply by watching what happens around it and processing the huge amount of video on the Internet.

    The trained neural network was then given the task of generating short videos from a single static frame. To achieve realistic results, the authors of the study applied a generative adversarial network (GAN). One neural network generates video, while a second, discriminator network learns to distinguish fake video from real video and rejects the fakes. As the discriminator learns, the generator network has to produce ever more realistic videos to pass the test.
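    The adversarial game described above can be sketched in miniature. The following is a toy illustration, not the authors' actual model: the "generator" is a single scalar parameter trying to imitate real data located at the value 1.0, the "discriminator" is a tiny logistic classifier, and the two are updated in alternation with hand-written gradients (a small weight decay on the discriminator keeps the game from oscillating).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

REAL = 1.0           # the "real data": a single point at 1.0
theta = 0.0          # generator parameter; its output IS this scalar
w, b = 0.0, 0.0      # discriminator: D(x) = sigmoid(w*x + b)
lr, decay = 0.1, 0.05

for step in range(3000):
    # Discriminator step: ascend log D(real) + log(1 - D(fake)),
    # with a small L2 decay that damps the adversarial oscillation.
    s_real = sigmoid(w * REAL + b)
    s_fake = sigmoid(w * theta + b)
    w += lr * ((1 - s_real) * REAL - s_fake * theta - decay * w)
    b += lr * ((1 - s_real) - s_fake - decay * b)

    # Generator step: ascend log D(fake), i.e. try to fool the discriminator.
    s_fake = sigmoid(w * theta + b)
    theta += lr * (1 - s_fake) * w

# As training proceeds, theta should drift toward the real data.
print(theta)
```

The same alternation drives the video GAN: the generator only improves because the discriminator keeps raising the bar for what counts as "real."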


    The generative model uses two streams that model the foreground and background separately, so that moving objects are clearly distinguished from the static scene.
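    The two-stream idea can be illustrated with a minimal sketch (a deliberate simplification of the paper's architecture, with hypothetical names): the generator produces a moving foreground stream, a static background image, and a per-pixel mask that blends the two into the final video.

```python
def compose_video(mask, foreground, background):
    """Blend a moving foreground with a static background.

    mask:       per-pixel weights in [0, 1] (1 = foreground), one flattened frame
    foreground: list of frames, each a flattened list of pixel values
    background: one static flattened frame
    Returns composed frames: m*f + (1 - m)*b at every pixel.
    """
    return [
        [m * f + (1.0 - m) * b for m, f, b in zip(mask, frame, background)]
        for frame in foreground
    ]

# Tiny 2-pixel, 2-frame example: pixel 0 is pure foreground, pixel 1 pure background.
mask = [1.0, 0.0]
fg = [[5.0, 9.0], [7.0, 9.0]]   # the foreground changes over time
bg = [2.0, 2.0]                 # the background stays fixed
video = compose_video(mask, fg, bg)
print(video)   # [[5.0, 2.0], [7.0, 2.0]]
```

Separating the streams this way means the network cannot "cheat" by smearing motion across the whole frame: only the masked foreground is allowed to move.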



    Over time, such a program will be able to help people more effectively in various situations. For example, a robot could predict when a person is about to fall and catch them. A digital assistant in a car could learn to predict the driver's actions from hand and eye movements in order to avoid an accident.

    All the videos the neural network was trained on, as well as the program's source code, have been published in open access. The code for the generative adversarial network is on GitHub. Using the training data (approximately 10.5 terabytes of video), you can repeat the experiment yourself. Alternatively, already trained models are available for download (1 GB archived).

    The training videos were taken from the Flickr photo and video hosting service, where they are available under a free license. These are themed scenes: beach events, golf games, train stations, and babies in hospitals.



    Two million videos amount to just two years of footage. "That is very little compared to the amount of video information that has passed through the brain of a 10-year-old child, or compared to the amount of information processed over the course of the evolution of life on Earth," admits Carl Vondrick, one of the authors of the work.

    But this is only the beginning; the AI is taking its first steps, and you have to start somewhere. In the future, the neural network will be trained on longer video fragments. The authors hope that the AI will gradually learn to narrow down the range of possible futures, given the constraints imposed by the laws of physics and the properties of objects. Experiments show that the neural network is able to absorb them. Over time, the program will learn to predict a more distant future, not just one second. Additional modules will likely be attached to it, such as face recognition, lip reading, predicting crimes from a person's face, and so on.

    The scientific article is published on the Massachusetts Institute of Technology website. The research continues thanks to funding from the US National Science Foundation and Google grants to two of the three members of the research team. The paper was prepared for the 29th Conference on Neural Information Processing Systems (NIPS 2016), to be held December 5–10 in Barcelona.
