The history of speech recognition systems: how we arrived at Siri

Original author: Melanie Pinola
Looking back, the development of speech recognition technology resembles watching a child grow up: first recognizing individual words, then ever larger vocabularies, and finally answering questions on the fly, as Siri does.

Listening to Siri, with her subtly dry sense of humor, we marvel at how far the speech recognition industry has come over the years. Let's take a look back at the decades that made it possible to control devices with voice alone.


1950s and 1960s: Baby talk


The first speech recognition systems could understand only numbers (given the complexity of human language, it made sense for engineers to focus on digits first). Bell Laboratories developed the Audrey system, which recognized digits spoken by a single voice. Ten years later, in 1962, IBM demonstrated its brainchild, the Shoebox system, which understood 16 English words.

Laboratories in the USA, Japan, England, and the USSR developed several devices that recognized individually pronounced sounds, extending the technology to support four vowels and nine consonants. The results were not impressive, but these first attempts were a promising start, especially considering how primitive the computers of the time were.

1970s: Systems gradually gaining popularity


Speech recognition systems made great strides in the seventies thanks to interest and funding from the US Department of Defense. From 1971 to 1976, DARPA's Speech Understanding Research (SUR) program was one of the largest efforts in the history of speech recognition, and among other things it produced Carnegie Mellon University's Harpy system. Harpy understood 1,011 words, roughly the vocabulary of an average three-year-old.

Harpy was a significant milestone because it introduced a more efficient search approach called beam search, which explored "a finite-state network of possible sentences" (Readings in Speech Recognition).
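As a rough illustration (not Harpy's actual implementation), beam search keeps only a fixed number of the most promising partial sentences at each step while walking a finite-state network. The network, words, and probabilities below are invented for the example:

```python
import math

# Hypothetical finite-state network: each state maps to a list of
# (next_state, word, log_probability) transitions. Toy values only.
NETWORK = {
    "start": [("s1", "turn", math.log(0.6)), ("s1", "burn", math.log(0.4))],
    "s1":    [("end", "off", math.log(0.7)), ("end", "of", math.log(0.3))],
}

def beam_search(network, start="start", goal="end", beam_width=2):
    """Return the highest-scoring word sequence from start to goal,
    keeping only `beam_width` partial hypotheses at each step."""
    beam = [(0.0, start, [])]  # (cumulative log-prob, state, words so far)
    while not all(state == goal for _, state, _ in beam):
        candidates = []
        for score, state, words in beam:
            if state == goal:
                candidates.append((score, state, words))  # already finished
                continue
            for nxt, word, logp in network.get(state, []):
                candidates.append((score + logp, nxt, words + [word]))
        # Prune: keep only the best few hypotheses instead of all of them.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    best = max(beam, key=lambda c: c[0])
    return best[2]

print(beam_search(NETWORK))  # the most probable sentence in the network
```

The pruning step is what made this practical on 1970s hardware: unlikely partial sentences are discarded early rather than exhaustively explored, at the cost of occasionally missing the true best path.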

The '70s also brought several other milestones, such as the founding of Threshold Technology, the first commercial speech recognition company, which introduced a system that could interpret multiple voices.

1980s: Speech recognition justifies predictions


In the next decade, thanks to new approaches and technologies, the vocabulary of such systems grew from several hundred to several thousand words, with the potential to recognize an unlimited number. One of the reasons was a new statistical method, better known as the hidden Markov model.

Rather than relying on fixed templates for words and sounds, the method estimated the probability that unknown sounds were actually words. This foundation was used by other systems for the next twenty years (Automatic Speech Recognition — A Brief History of the Technology Development).
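To make the idea concrete, here is a minimal sketch of the forward algorithm, the standard way an HMM scores how probable a sequence of acoustic observations is under the model. The states, symbols, and probabilities are toy values invented for the example, not taken from any real recognizer:

```python
# Toy HMM: hidden states stand in for phoneme-like units, and the
# observations "a"/"b" stand in for discretized acoustic symbols.
states = ["S1", "S2"]
start_p = {"S1": 0.8, "S2": 0.2}
trans_p = {"S1": {"S1": 0.6, "S2": 0.4},
           "S2": {"S1": 0.3, "S2": 0.7}}
emit_p = {"S1": {"a": 0.7, "b": 0.3},
          "S2": {"a": 0.1, "b": 0.9}}

def forward(observations):
    """Probability that the model generated the observation sequence,
    summed over every possible hidden-state path (forward algorithm)."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[prev] * trans_p[prev][s] for prev in states)
               * emit_p[s][obs]
            for s in states
        }
    return sum(alpha.values())

print(forward(["a", "b"]))  # likelihood of hearing "a" then "b"
```

A recognizer built on this idea would train one such model per word (or phoneme) and pick whichever model assigns the incoming audio the highest probability, which is exactly the "guessing among similar words" behavior described later in this article.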

With an expanded vocabulary, speech recognition began to make its way into commercial applications for business and specialized industries such as medicine. It even entered ordinary people's homes in 1987 in the form of Worlds of Wonder's Julie doll, which children could train to recognize their voice ("Finally, a doll that understands you").



Although recognition software, such as the Kurzweil text-to-speech program, could handle up to 5,000 words, it had one huge drawback: these programs supported only discrete dictation, meaning you had to pause after each word so the program could process it.

1990s: Automatic speech recognition goes to the masses


In the nineties, computers finally got fast processors, and speech recognition programs became viable.

In 1990, the first publicly available dictation program, Dragon Dictate, appeared at a staggering price of $9,000. Seven years later an improved version, Dragon NaturallySpeaking, was released. It recognized continuous speech, so you could speak at a natural pace of about 100 words per minute. You still had to train the program for 45 minutes before use, however, and it still carried a hefty price of $695.

BellSouth's VAL, the first voice portal, appeared in 1996. It was the first interactive speech recognition system to provide information based on what you said over the phone. VAL paved the way for all the inaccurate voice menus that would annoy callers for the next 15 years.

2000s: Stagnation in Speech Recognition - Until Google Appears


By 2001, speech recognition accuracy had climbed to about 80 percent, and then progress stalled. Recognition systems worked well when the vocabulary was limited, but they still "guessed" among similar-sounding words using statistical models, and the language universe kept growing along with the Internet.

Did you know that voice recognition and voice commands were built into Windows Vista and Mac OS X? Most users never even realized the functionality existed. Windows Speech Recognition and OS X voice commands were interesting, but not as accurate or convenient as a keyboard and mouse.

Speech recognition technology got a second wind after one important event: the arrival of the Google Voice Search application for the iPhone. Its impact was significant for two reasons. First, phones and other mobile devices are ideal candidates for speech recognition, and the desire to replace tiny on-screen keyboards with alternative input methods was strong. Second, Google could offload the processing to its cloud data centers, applying their full power to large-scale data analysis and matching users' words against the enormous number of voice samples it received.

In short, the bottleneck in speech recognition has always been the availability of data and the ability to process it efficiently. Google's application fed data from billions of search queries into its analysis to better predict what you had said.

In 2010, Google added personalized recognition to Voice Search on Android phones: the software could record users' voice queries to build a more accurate voice model. The company also added speech recognition to its Chrome browser in mid-2011. Remember how we started with 10 words and worked up to a few thousand? Google's system now incorporates 230 billion words.

Then came Siri. Like Google Voice Search, it relies on cloud computing. It uses what it knows about you to generate a contextual response, and it answers your requests the way a person would. Speech recognition has evolved from a tool into entertainment.

Future: Accurate and ubiquitous speech


The boom in speech recognition applications shows that the technology's time has come, and we can expect many more of them in the future. These applications will not only let you control your computer by voice or convert speech to text; they will also distinguish between different languages and let you choose an assistant's voice from several options.

It is likely that speech recognition technology will spread to other kinds of devices. It is easy to imagine a future where we operate coffee makers by voice, dictate to printers, and tell the lights to turn off.
