Microsoft's speech recognition system has reached the human level

Published on October 19, 2016

    Microsoft's trained neural networks now recognize human speech as accurately as people do. Researchers on the company's Speech & Dialog team report that the speech recognition system now makes errors about as often as professional stenographers do, and in some cases it makes fewer.

    During the tests, the word error rate (WER) was 5.9%, down from the 6.3% Microsoft reported a month earlier, and the lowest result ever recorded. The team credits this not to a breakthrough in algorithms or data, but to careful tuning of existing AI architectures. The main difficulty is that even when a recording is of good quality and free of background noise, the algorithm must cope with different voices, interruptions, hesitations and the other nuances of live human speech.
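    Word error rate, the metric quoted throughout the article, is conventionally computed as the word-level edit distance between the recognizer's output and a reference transcript, divided by the number of reference words. A minimal sketch (a standard Levenshtein-distance implementation, not Microsoft's actual scoring code):

    ```python
    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate: word-level edit distance / reference length."""
        ref = reference.split()
        hyp = hypothesis.split()
        # Dynamic-programming table of edit distances between prefixes
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i  # deleting i reference words
        for j in range(len(hyp) + 1):
            d[0][j] = j  # inserting j hypothesis words
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                deletion = d[i - 1][j] + 1
                insertion = d[i][j - 1] + 1
                d[i][j] = min(substitution, deletion, insertion)
        return d[len(ref)][len(hyp)] / len(ref)
    ```

    On this definition, a 5.9% WER means roughly 6 substituted, deleted or inserted words per 100 words of reference transcript.
    
    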

    To test how well the algorithm matches human ability, Microsoft hired external stenographers to keep the experiment fair. The company already had a correct transcript of the audio files offered to the specialists. The stenographers worked in two stages: first one person transcribed an audio fragment, then a second listened and corrected errors in the transcript. Measured against the reference transcripts, the specialists made 5.9% errors transcribing a conversation on a set topic and 11.3% errors transcribing free dialogue. After training on 2,000 hours of human speech, the Microsoft system scored 5.9% and 11.1% errors on the same audio files, respectively. In other words, the computer can now recognize words in a conversation as well as a person can.

    Now Microsoft intends to repeat the result in noisy environments — for example, while driving on a highway or at a party. The company also plans to focus on better ways to help the technology recognize individual speakers when they talk at the same time, and to make sure the AI works well across a wide range of voices regardless of age or accent. Realizing these capabilities matters far beyond simple transcription.

    To achieve these results, the researchers used the company's own Computational Network Toolkit (CNTK). The toolkit's ability to run training algorithms quickly across multiple GPU-equipped computers significantly improved the pace of the research and ultimately made human-level performance possible.



    This level of accuracy was made possible by combining three convolutional neural network variants. The first was the VGG architecture, which features a large number of hidden layers. Compared with the networks previously used for image recognition, it uses smaller, deeper 3x3 filters and applies up to five convolutional layers before pooling. The second network is modeled on the ResNet architecture, which adds residual (skip) connections; the only difference is that the developers applied batch normalization before computing the ReLU. The last convolutional network is LACE, a variant of a time-delay neural network in which each higher layer is a nonlinear transformation of weighted sums of windows of lower-layer frames. In other words, each higher layer uses a wider context than the layers below it: lower layers focus on extracting simple local structures, while higher layers extract more complex structures covering wider contexts.
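    The ResNet ordering described above — batch normalization applied before the ReLU, with the input added back through a skip connection — can be sketched in a few lines of NumPy. This is an illustrative toy (a single linear layer stands in for the convolution, and the learned scale/shift parameters of batch norm are omitted), not the architecture Microsoft actually trained:

    ```python
    import numpy as np

    def batch_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
        """Normalize each feature over the batch (scale/shift omitted for brevity)."""
        return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

    def relu(x: np.ndarray) -> np.ndarray:
        return np.maximum(x, 0.0)

    def residual_block(x: np.ndarray, w: np.ndarray) -> np.ndarray:
        """One residual block: linear transform -> batch norm -> ReLU, plus skip.

        The batch norm is applied *before* the ReLU, as in the article, and
        the identity skip connection (+ x) is what makes the block 'residual'.
        """
        return relu(batch_norm(x @ w)) + x
    ```

    The skip connection means the block learns a correction on top of the identity mapping, which is what lets very deep networks of such blocks train stably.
    
    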



    This achievement is another step for the company toward easy, natural communication with computers. But until a computer can understand the meaning of what is said, it will not be able to correctly execute a command or answer a question. That task is much harder, and it underlies what Microsoft plans to do in the coming years. Earlier this year, Satya Nadella said that artificial intelligence is the "future of the company," and its ability to communicate with humans has become the cornerstone of that vision. "The next frontier is the transition from recognition to understanding," said Geoffrey Zweig, head of research for the Speech & Dialog team.

    Despite the obvious success, there is one big difference between the automatic system and the stenographers' work: the system cannot understand subtle conversational cues such as the sound "uh." We make this sound involuntarily to fill a pause while thinking of the next thing to say; by contrast, sounds like "uh-huh" or "yeah" can signal that the other person should keep talking. Professional stenographers can tell these apart, but such small signals are lost on an artificial intelligence that cannot grasp the context in which a particular sound was uttered.

    “Five years ago, I would not even have thought that we could achieve such a result. I just wouldn't have thought it possible,” said Harry Shum, executive vice president of Microsoft's AI group.

    The first research in speech recognition dates back to the 1970s, when the United States Defense Advanced Research Projects Agency (DARPA) set the goal of creating breakthrough technology for national security. In the decades since, most of the largest IT companies and many research organizations have joined the race. “This achievement is the culmination of more than twenty years of effort,” notes Geoffrey Zweig.
    Microsoft believes its speech recognition work will strongly influence the development of the company's consumer and business products, whose number will grow significantly. At a minimum, Xbox and Cortana will gain new capabilities from the existing work. In addition, every user will be able to use tools for instant speech-to-text transcription.