LipNet neural network reads lips with 93.4% accuracy
Commander Dave Bowman and co-pilot Frank Poole, no longer trusting the computer, decide to disconnect it from control of the ship. To discuss the plan, they confer in a soundproof room, but HAL 9000 reads their lips. Frame from the film "2001: A Space Odyssey"
Lip reading plays an important role in communication. A famous 1976 experiment showed that people "hear" completely different phonemes when the wrong audio is dubbed over the lip movements, a phenomenon now known as the McGurk effect (see "Hearing lips and seeing voices", Nature 264, 746–748, 23 December 1976, doi:10.1038/264746a0).
From a practical point of view, lip reading is an important and useful skill. With it you can understand a conversation partner without turning off the music in your headphones, follow the conversations of everyone in sight (for example, all the passengers in a waiting room), or eavesdrop on people through binoculars or a telescope. The range of applications is very wide, and a professional who masters the skill can easily find well-paid work, for example in security or competitive intelligence.
Automatic lip-reading systems also have rich practical potential: next-generation medical hearing aids with speech recognition, silent dictation systems for public places, biometric identification, covert communication systems for espionage, speech recognition from surveillance-camera video, and so on. And eventually the computers of the future will read lips too, just like HAL 9000.
This is why scientists have been trying to develop automatic lip reading for many years, without much success. Even for relatively simple English, which has far fewer phonemes than Russian, recognition accuracy has remained low.
Understanding speech from a person's facial movements is an extremely difficult task. People who master this skill must recognize dozens of consonant phonemes, many of which look very similar. It is especially hard for an untrained person to distinguish between the five categories of visual phonemes (visemes) in English; in other words, the pronunciation of certain consonants is almost impossible to tell apart from the lips alone. It is not surprising that people perform poorly at accurate lip reading: even the best hearing-impaired lip readers achieve an accuracy of only 17 ± 12% on a limited set of 30 monosyllabic words, or 21 ± 11% on polysyllabic words (here and below, results are for English).
Automatic lip reading is a machine-vision task that comes down to frame-by-frame processing of a video sequence. The task is greatly complicated by the low quality of most real-world video material, which does not allow the spatiotemporal characteristics of a speaker to be extracted accurately: faces move and turn in different directions. Recent computer-vision work tries to solve this problem by tracking the movement of faces in the frame. Despite some successes, until recently such systems could recognize only single words, not whole sentences.
A significant breakthrough in this area has been made by developers at Oxford University. The LipNet neural network they trained became the first in the world to successfully recognize lip-read speech at the level of entire sentences by processing the video stream.
Saliency maps for the English words "please" (top) and "lay" (bottom), highlighting the features the lip-reading neural network finds most salient
LipNet is a recurrent neural network of the LSTM (long short-term memory) type; the architecture is shown in the illustration below. The network was trained with the connectionist temporal classification (CTC) loss, which is widely used in modern speech recognition systems because it eliminates the need for training data in which the input is aligned frame-by-frame with the correct output.
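To illustrate, here is a minimal sketch of applying a CTC loss in PyTorch; the shapes, batch size, and vocabulary size are hypothetical, not taken from the paper. The network emits a character distribution for every video frame, and CTC sums over all possible alignments between those per-frame outputs and the target transcription:

```python
# Minimal CTC sketch (hypothetical shapes, not the authors' code).
import torch
import torch.nn as nn

T, N, C = 75, 8, 28          # frames per clip, batch size, characters + blank
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)
targets = torch.randint(1, C, (N, 20), dtype=torch.long)  # character indices
input_lengths = torch.full((N,), T, dtype=torch.long)     # frames per sample
target_lengths = torch.full((N,), 20, dtype=torch.long)   # chars per sample

ctc = nn.CTCLoss(blank=0)    # index 0 is reserved for the CTC "blank" symbol
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()              # gradients flow without frame-level alignment
```

Because the loss marginalizes over all alignments, no one has to annotate which frames correspond to which letters, which is exactly what makes CTC attractive for lip reading.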
LipNet neural network architecture. The input is a sequence of T frames, which is processed by three layers of a spatiotemporal convolutional neural network (STCNN), each followed by a spatial pooling layer. The extracted features are upsampled along the time axis and then processed by a bidirectional LSTM. Each time step of the LSTM output is passed through a two-layer feedforward network and a final softmax layer.
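A rough PyTorch sketch of that pipeline might look as follows. All layer sizes, kernel sizes, and the 50×100 mouth-crop resolution are illustrative assumptions rather than the published LipNet hyperparameters, and the temporal upsampling step is omitted for brevity:

```python
# Illustrative sketch of an STCNN + Bi-LSTM + feedforward lip-reading pipeline.
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, vocab_size=28):
        super().__init__()
        # Three spatiotemporal convolution blocks, each followed by
        # spatial-only pooling (the time dimension T is preserved).
        self.stcnn = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(64, 96, kernel_size=(3, 3, 3), padding=(1, 1, 1)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # Bidirectional LSTM over time; 96*6*12 assumes a 50x100 mouth crop.
        self.rnn = nn.LSTM(96 * 6 * 12, 256, num_layers=1,
                           bidirectional=True, batch_first=True)
        # Two-layer feedforward head applied at every time step.
        self.head = nn.Sequential(
            nn.Linear(2 * 256, 128), nn.ReLU(),
            nn.Linear(128, vocab_size),
        )

    def forward(self, video):                  # video: (N, 3, T, 50, 100)
        x = self.stcnn(video)                  # (N, 96, T, 6, 12)
        n, c, t, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(n, t, c * h * w)
        x, _ = self.rnn(x)                     # (N, T, 512)
        return self.head(x).log_softmax(-1)    # per-frame log-probs for CTC
```

The per-frame log-probabilities produced here are exactly what the CTC loss from the previous sketch consumes.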
On the special GRID sentence corpus, the neural network achieves a recognition accuracy of 93.4%. This not only exceeds the accuracy of other software systems (listed in the table below) but also surpasses the lip-reading performance of specially trained people.
Method | Dataset | Size | Output | Accuracy
---|---|---|---|---
Fu et al. (2008) | AVICAR | 851 | Digits | 37.9%
Zhao et al. (2009) | AVLetter | 78 | Alphabet | 43.5%
Papandreou et al. (2009) | CUAVE | 1800 | Digits | 83.0%
Chung & Zisserman (2016a) | OuluVS1 | 200 | Phrases | 91.4%
Chung & Zisserman (2016b) | OuluVS2 | 520 | Phrases | 94.1%
Chung & Zisserman (2016a) | BBC TV | > 400,000 | Words | 65.4%
Wand et al. (2016) | GRID | 9000 | Words | 79.6%
LipNet | GRID | 28,853 | Sentences | 93.4%
Sentences in the GRID corpus are built according to the following pattern:
command (4) + color (4) + preposition (4) + letter (25) + digit (10) + adverb (4),
where each number in parentheses is the number of word choices in that category.
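For illustration, the template is easy to reproduce in a few lines of Python. The word lists below follow the commonly cited GRID categories (the letter "w" is excluded, which is why there are 25 letters); treat them as illustrative:

```python
# Generate GRID-style sentences from the six-category template.
import random

commands     = ["bin", "lay", "place", "set"]
colors       = ["blue", "green", "red", "white"]
prepositions = ["at", "by", "in", "with"]
letters      = [c for c in "abcdefghijklmnopqrstuvwxyz" if c != "w"]  # 25
digits       = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]              # 10
adverbs      = ["again", "now", "please", "soon"]

categories = [commands, colors, prepositions, letters, digits, adverbs]

# Total number of distinct sentences the template can produce.
total = 1
for cat in categories:
    total *= len(cat)
print(total)  # 4 * 4 * 4 * 25 * 10 * 4 = 64000

# Draw one random GRID-style sentence, e.g. "place red at g nine now".
print(" ".join(random.choice(cat) for cat in categories))
```

The tiny fixed vocabulary and rigid grammar are what make results on GRID an upper bound rather than a measure of real-world performance.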
In other words, the 93.4% accuracy is still a result obtained under ideal laboratory conditions. When recognizing arbitrary human speech, the result will certainly be much worse, to say nothing of real surveillance footage, where a person's face is not filmed close-up in excellent lighting and at high resolution.
The operation of the LipNet neural network is shown in a demo video.
The paper was prepared for the ICLR 2017 conference and published in open access on November 4, 2016.