Speech technologies. Continuous speech recognition for dummies, using IVR systems as an example

Hello.
In my line of work I implement projects based on speech technologies: speech synthesis and recognition, voice biometrics, and speech analysis.
Few people realize how present these technologies already are in our lives, even if not always in obvious ways.
I will try to explain in plain terms how all this works and why it is needed at all.
I'll cover speech recognition in the most detail, since it is the technology closest to everyday life: many of us have encountered it, and some already use it constantly.

But first, let's figure out what speech technologies are in general and what kinds there are.

- Speech synthesis (converting text into speech).
We rarely encounter this technology in real life, or simply don't notice it.
There are special "reader" apps for iOS and Android that read aloud the books you download to your device. They read quite well; after a day or two you stop noticing that the text is being read by a robot.
In many call centers, a synthesized voice reads dynamic information to subscribers, since it is very hard to record every sound clip with a human voice in advance, especially if the information changes every 3 seconds.
For example, in the St. Petersburg Metro, many informational announcements at stations are read by a synthesizer, but almost no one notices, because the text sounds quite natural.

- Voice biometrics (identifying and verifying a person by voice).
Yes, a person's voice is as unique as a fingerprint or a retina. The reliability of verification (comparing two voice "prints") reaches 98%; 74 voice parameters are analyzed to achieve this.
In everyday life the technology is still very rare, but trends suggest it will soon become widespread, especially in the call centers of financial companies, whose interest in it is very strong.
Voice biometrics has two unique features:
- it is the only technology that lets you confirm your identity remotely, for example over the phone, with no special scanning devices required;
- it is the only technology that confirms liveness, i.e. that a living person is speaking on the phone. I should say right away that a voice recorded on a high-quality recorder will not pass; this has been proven. If such a recording does "pass" somewhere, it means the system was initially set to a low "trust" threshold.

- Speech analysis.
Few people know that a person's mood, emotional state, gender, approximate weight, ethnicity, and so on can be determined from their voice.
Of course, no machine can immediately tell whether a person is sad or happy (perhaps that is simply how they always sound: the average Italian and the average Finn, for example, differ greatly in temperament), but it is quite feasible to determine this from how the voice changes over the course of a conversation.

- Speech recognition (converting speech into text).
This is the most common speech technology in our lives, above all thanks to mobile devices: many manufacturers and developers believe it is far more convenient to say something to a smartphone than to type the same text on a small on-screen keyboard.
First, let's talk about where we encounter speech recognition technology in everyday life and how we even know about it.

Most of us will immediately recall Siri (iPhone), Google Voice Search, and sometimes voice-controlled IVR systems in the call centers of companies such as Russian Railways or Aeroflot.
These are the things that lie on the surface and that you can easily try for yourself.

Speech recognition is also built into car systems (dialing a phone number, controlling the radio), televisions, and information kiosks (similar to the payment terminals for mobile operators). But this is not widespread and is offered more as a "feature" by certain manufacturers. The issue is not even technical limitations or quality of operation, but ease of use and people's habits. I cannot imagine someone browsing a TV guide by voice when the remote control is at hand.

So, speech recognition technologies. What kinds are there?
I should say right away that almost all of my work is related to telephony, so many of the examples below come from the practice of call centers.

Recognition by closed grammars.
Recognition of a single word (a voice command) from a list of words (a base).
The term "closed grammar" means that the system has a finite base of words within which it will search for the word or expression uttered by the subscriber.
In this case, the system should phrase its question so that the subscriber gives a clear, one-word answer.

Example:
System question: “What day of the week are you interested in?”
Subscriber's answer: “Tuesday”

In this example, the question is posed so that the system can expect a definite answer from the subscriber.
The word base in this example might consist of the following answer options: "Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday." Answers such as "I don't know," "it doesn't matter," or "any" should also be anticipated and added to the base; the system must handle them separately, according to a predefined dialogue scenario.
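To make this concrete, here is a minimal sketch of matching against a closed grammar. It is written in Python and not tied to any real IVR platform; the word base is the one from the example above, and the replies are invented for illustration.

```python
# A toy closed grammar: the system accepts only words from a finite
# base, and "escape" answers are routed to a separate branch of the
# dialogue scenario, exactly as described above.

DAYS = {"monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"}
ESCAPES = {"i don't know", "it doesn't matter", "any"}

def handle_answer(recognized: str) -> str:
    answer = recognized.strip().lower()
    if answer in DAYS:
        return f"Looking up information for {answer}."
    if answer in ESCAPES:
        # Handled separately, per the predefined dialogue scenario.
        return "Then let me read the schedule for the whole week."
    # Out-of-grammar input: re-prompt the subscriber.
    return "Sorry, please name a day of the week."

print(handle_answer("Tuesday"))      # Looking up information for tuesday.
print(handle_answer("I don't know")) # Then let me read the schedule...
```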

Built-in grammars.
This is a kind of closed grammar.
They recognize frequently requested standard expressions and concepts.
The term "built-in grammars" means that the system already ships with grammars (they do not need to be "taught" separately) that can recognize specific thematic phrases from the subscriber. When writing a dialogue script, you simply reference the appropriate built-in grammar.

Example:
System question: "What showtime are you interested in?"
Subscriber's answer: "At 15:30"

In this example, time values are recognized. All the grammar needed for time recognition is already built into the system.
Built-in grammars simplify the development of voice menus by letting you use standard, universal building blocks.
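As a rough illustration of what a built-in time grammar does for the menu developer, here is a small Python sketch; on a real platform this logic ships ready-made and you never write it yourself.

```python
import re

# A toy stand-in for a built-in "time" grammar: it maps time
# expressions that the recognizer has already turned into text
# onto a normalized (hour, minute) pair.
TIME_RE = re.compile(r"\b(?:at\s+)?([01]?\d|2[0-3])[.:]([0-5]\d)\b")

def parse_time(utterance):
    m = TIME_RE.search(utterance.lower())
    if m is None:
        return None  # the dialogue manager would re-prompt here
    return int(m.group(1)), int(m.group(2))

print(parse_time("At 15:30"))  # (15, 30)
print(parse_time("at 9.05"))   # (9, 5)
```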

Recognition by open grammars.
Recognition of the entire phrase uttered by the subscriber.
This allows the system to ask the subscriber an open question and receive an answer in free form.
The term "open grammar" means that the system expects to hear from the subscriber not a specific word or command, but an entire meaningful sentence, in which every word matters to the system.

Example:
System question: "What are you interested in?"
Subscriber answer: “What documents are needed for a loan?”

In this example, every word in the subscriber's answer is recognized and the overall meaning of the utterance is extracted. Based on the keywords and concepts recognized in the sentence, a query is sent to the database and an answer, the requested reference information, is "assembled" for the subscriber.
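Below is a deliberately naive sketch of this kind of processing: every word of the free-form answer is examined, and keyword combinations drive a lookup. The knowledge base here is a tiny hard-coded dictionary invented for illustration; real systems use a robust parser and a full database, as described later in this article.

```python
# A naive sketch of open-grammar processing: inspect every word of
# the free-form answer and let keyword combinations select an entry
# in a small, invented knowledge base.

KNOWLEDGE_BASE = {
    ("documents", "loan"): "To apply for a loan you need a passport "
                           "and a proof-of-income statement.",
    ("rate", "loan"): "The current loan rate depends on the term.",
}

def answer(utterance: str) -> str:
    words = set(utterance.lower().strip("?!.").split())
    for keywords, reply in KNOWLEDGE_BASE.items():
        if all(k in words for k in keywords):
            return reply
    return "Let me transfer you to an operator."

print(answer("What documents are needed for a loan?"))
```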

Continuous speech recognition gives the system far more opportunities to automate the dialogue with the subscriber, and it makes the system faster and easier for the subscriber to use. But such systems are harder to implement. If the task can be solved with monosyllabic answers from the subscriber, it is better to use closed grammars.
They work more reliably, are simpler to implement, and are more familiar to subscribers accustomed to DTMF navigation (touch-tone dialing).
But the future, of course, belongs to continuous speech recognition. Gradually, users will get used to it and will no longer freeze for 5-10 seconds when the system invites them into an open dialogue.

How does an IVR system with continuous speech recognition work?
Let's look at the example of VoiceNavigator, a software package for Russian speech synthesis and recognition in IVR systems.

Caution: from here on, the text gets harder to follow.

[image: diagram of the IVR system's processing pipeline]

1. Let's assume the subscriber's call has already reached the call center and been transferred to the voice platform. The voice platform is the software that executes the entire logic of the subscriber's dialogue with the system; in other words, the IVR menu runs on the voice platform. The most popular voice platforms are Avaya, Genesys, Cisco, and Asterisk.
So, the voice platform passes the audio from the subscriber's phone microphone to VoiceNavigator.

2. The audio enters the speech recognition module (ASR), which converts human speech into text: a sequence of separate, unconnected words. At this point the resulting words carry no meaning for the system.

An example of a subscriber’s voice phrase on the subject of train schedules:

[image: the phrase as recognized, a plain sequence of words]

3. Next, the text goes to the SSM recognition-grammar module (I won't dwell on what SSM is; anyone who wants to dig deeper into the subject can look it up, and the same goes for other terms below), where the received words are analyzed to determine the topic of the utterance, i.e. which subject the recognized phrase relates to. In our example, the recognized phrase belongs to the topic "Long-distance trains" and has its own unique identifier.
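For intuition, here is a toy topic detector. A real SSM grammar is a statistical model trained on transcribed calls, not a hand-written word list, but a simple word-overlap score shows the idea of mapping a recognized phrase onto a topic identifier.

```python
# A toy topic classifier: each topic identifier has a set of
# characteristic words; the topic with the largest overlap with the
# recognized phrase wins. Word lists are invented for illustration.

TOPICS = {
    "long_distance_trains": {"train", "trains", "schedule", "ticket"},
    "loans": {"loan", "credit", "rate", "documents"},
}

def classify(recognized_words):
    words = {w.lower() for w in recognized_words}
    scores = {topic: len(words & vocab) for topic, vocab in TOPICS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None: ask a clarifying question

print(classify(["schedule", "of", "trains", "to", "moscow"]))
# long_distance_trains
```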

4. The text, together with the topic identifier, is then passed to the slot information extraction module (the robust parser), which extracts the concepts and expressions that matter for the given subject area (the meaning of the user's utterance). In this module the system "understands" what the subscriber said and checks whether this information is sufficient to form a query to the knowledge base, or whether clarifying questions are needed.

[image: the slot information extracted from the phrase]

The slot information extraction module produces a set of parameters, which are passed on to the dialogue manager together with the recognized words.

5. The dialogue manager handles the entire logic (algorithm) of the IVR system's dialogue with the subscriber. Based on the parameters it receives, the dialogue manager can either send a query to the knowledge base to generate an answer (when the subscriber's phrase contains all the necessary information) or ask the subscriber for additional information and refine the query (when it does not).
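A minimal sketch of that decision logic might look like this; the slot names, required-slot lists, and prompts are invented for illustration, and the knowledge base is stubbed out.

```python
# A minimal dialogue-manager sketch: if all slots required by the
# topic are filled, query the knowledge base; otherwise ask a
# clarifying question for the first missing slot.

REQUIRED_SLOTS = {"long_distance_trains": ["destination", "date"]}
CLARIFY_PROMPTS = {"destination": "Which city are you travelling to?",
                   "date": "For what date?"}

def query_knowledge_base(topic, slots):
    return f"Looking up {topic}: {slots}"  # stub for the real knowledge base

def dialogue_step(topic, slots):
    missing = [s for s in REQUIRED_SLOTS[topic] if s not in slots]
    if missing:
        return CLARIFY_PROMPTS[missing[0]]     # refine the query
    return query_knowledge_base(topic, slots)  # enough information

print(dialogue_step("long_distance_trains", {"destination": "Moscow"}))
# For what date?
print(dialogue_step("long_distance_trains",
                    {"destination": "Moscow", "date": "tomorrow"}))
```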

6. To generate a response for the subscriber, the dialogue manager queries the knowledge base, which contains all the information on the subject area.

7. The system then generates a response to the subscriber based on the information from the knowledge base and the dialogue scenario supplied by the dialogue manager.

8. After providing a response to the subscriber, the IVR system is ready to continue working.

Where should I start when creating an IVR system based on speech recognition?

Training a continuous speech recognition system.
1. Continuous speech recognition is based on statistics. Therefore, any project to implement an IVR system with continuous speech recognition begins with collecting statistics. You need to know what subscribers are interested in, how they phrase their questions, what information they already have, what they expect to hear in response, and so on.

Information collection begins with listening to and analyzing real conversations between subscribers and call center (CC) operators. From this information, statistical tables are built that show which subscriber voice requests should be automated and how this can be done.

2. This information is enough to create a simplified prototype of the future IVR menu. Such a prototype is needed to collect and analyze more representative subscriber answers to the IVR menu's questions, because the way subscribers talk to CC operators differs greatly from the way they talk to IVR speech menus.

The prototype IVR menu is deployed on the company's incoming phone number. It can be maximally simplified, or even not implement the planned functionality at all, because its main task is to accumulate statistical material (all kinds of subscriber answers), which will form the basis of a statistical language model (SLM) targeted at a specific topic. Tailoring the language model to a specific subject area improves recognition quality.

Implementation example of a prototype IVR menu:
System question: "What service are you interested in?"
Subscriber's answer: "I would like to take out a loan"
System response: "Your call will be transferred to a contact center operator"
System action: transfer to a CC operator, whatever the subscriber answers.

Using the prototype IVR menu, the developers emulate the real system and collect actual recordings of subscribers' answers (phonograms), on which the system will later be trained.
It does not matter what the subscriber says in reply to the system's question; the call is transferred to a contact center operator in any case. Thus, the subscriber is served without inconvenience, and the IVR menu developers obtain a base of all possible answers on the topic, drawn from real dialogues.
The number of phonograms needed for a project can reach several thousand, and collecting them can take several months; it all depends on the complexity of the project.
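The logic of such a prototype is trivial. Here is a sketch with the telephony layer omitted and the storage location invented: it saves whatever the subscriber said for later transcription and always transfers the call.

```python
import datetime
from pathlib import Path

CORPUS_DIR = Path("corpus")  # invented location for collected phonograms

def prototype_menu(subscriber_audio: bytes) -> str:
    """Store the subscriber's answer and transfer the call."""
    CORPUS_DIR.mkdir(exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    (CORPUS_DIR / f"{stamp}.wav").write_bytes(subscriber_audio)
    return "TRANSFER_TO_OPERATOR"  # same outcome for any answer
```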

3. The collected recordings are then transcribed. Each recording is listened to by a specialist and converted to text. Transcribed phrases must not contain punctuation marks or special characters, and all words and abbreviations must be written out in full, exactly as the client pronounces them. This work takes a lot of time but requires no special expertise, so many employees are usually involved in it at the same time.
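The formal requirements on transcripts are easy to state as code. Here is a sketch of such normalization; the expansion table is a tiny invented fragment (real transcripts would of course be in the language of the calls).

```python
import re

# A sketch of transcript normalization: lowercase, no punctuation or
# special characters, digits and abbreviations written out in full,
# as pronounced. The expansion table is an invented fragment.
EXPANSIONS = {"st.": "saint", "15": "fifteen", "30": "thirty"}

def normalize(raw: str) -> str:
    words = [EXPANSIONS.get(w, w) for w in raw.lower().split()]
    return re.sub(r"[^a-z ]", "", " ".join(words))  # letters and spaces only

print(normalize("St. Petersburg, 15 30"))  # saint petersburg fifteen thirty
```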

4. The transcribed files are used to build a training file (a dictionary of words and expressions plus configuration parameters), which is an XML document. The larger the training file for the subject area, the better the recognition.
The training file is used to create the statistical language model (SLM), which is the basis of continuous speech recognition.
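For intuition about what an SLM captures, here is a toy bigram model built from a few make-believe transcribed phrases. Real models are built by vendor tooling from thousands of phonograms, but the underlying idea, word-sequence statistics, is the same.

```python
from collections import Counter, defaultdict

# A toy bigram language model: count which word follows which in the
# transcribed corpus and turn the counts into probabilities.

corpus = [
    "schedule of trains to moscow",
    "schedule of trains to kazan",
    "trains to moscow tomorrow",
]

bigrams = defaultdict(Counter)
for phrase in corpus:
    words = ["<s>"] + phrase.split()  # <s> marks the start of a phrase
    for prev, cur in zip(words, words[1:]):
        bigrams[prev][cur] += 1

def prob(prev: str, cur: str) -> float:
    total = sum(bigrams[prev].values())
    return bigrams[prev][cur] / total if total else 0.0

print(prob("trains", "to"))  # 1.0: "to" always follows "trains" here
print(prob("to", "moscow"))  # ~0.67: "moscow" follows "to" in 2 cases of 3
```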

To build the model, the training file is loaded into ASR Constructor, a special utility developed by MDG (the company behind VoiceNavigator). The resulting language model is then loaded into VoiceNavigator.

At this stage of building the IVR speech menu, the system can recognize the subscriber's speech only as a sequence of separate words with no connection between them.

[image: recognition output as a plain, unconnected word sequence]

5. Next, within the recognized list of words, the system must identify the topic of the subscriber's request (the SSM recognition grammar) and extract the slot information (the robust parser). This requires additional training of the system with the appropriate training files.
These training files can be created from the previously transcribed files, but unlike the language-model task, the transcribed files must first be modified appropriately to make them suitable for the SSM grammar and the robust parser.
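In practice this means turning plain transcripts into annotated examples. Here is a sketch of what such markup could look like; the field names and labels are invented, and real vendor formats differ.

```python
# A sketch of transcripts annotated for topic (SSM grammar) and slot
# (robust parser) training. Field names and labels are invented.

training_examples = [
    {
        "text": "schedule of trains to moscow tomorrow",
        "topic": "long_distance_trains",    # for the SSM grammar
        "slots": {"destination": "moscow",  # for the robust parser
                  "date": "tomorrow"},
    },
    {
        "text": "what documents are needed for a loan",
        "topic": "loans",
        "slots": {"subject": "documents"},
    },
]
```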

Well, the beginning of this article turned out to be quite easy to follow for readers completely unfamiliar with speech technologies, and then I plunged headlong into the subtleties of building real voice self-service systems. I apologize for the change of gears.

If you are interested in this topic and want to learn more about building voice-controlled IVR systems, I recommend visiting a dedicated wiki site: www.vxml.ru
It is devoted to developing IVR systems in VoiceXML, the main language for this kind of work.

Thanks.
