Tactful robot: can listen and does not interrupt
Speech recognition (ASR, Automatic Speech Recognition) is used in bots and IVRs, as well as in automated surveys. Voximplant uses ASR provided by the “Corporation of Good”: Google's recognition is fast and accurate, but, as always, there is one nuance. People pause even in short sentences, and we need a guarantee that the ASR will not mistake a pause for the end of the answer. If the ASR decides that the person has finished speaking, the scenario may start synthesizing the next question right after the “answer” while the person is still talking. The result is a bad user experience: the bot or IVR interrupts the person. Today we will show how to deal with this, so that your users are not upset by talking to iron assistants.

Concept
The goal is to ask a question and listen to the person without interrupting, waiting until the answer is finished. ASR is provided as a separate module with an ASREvents.Result event, which fires when the person has finished speaking. The peculiarity of Google's ASR is that ASREvents.Result returns the recognized text as soon as the person makes even a short pause and Google decides that the phrase is recognized and complete.
To give the person room to pause, you can use the ASREvents.InterimResult event. While recognition is in progress, the ASR returns the “raw” text in this event, correcting and changing it depending on the context, and so on until ASREvents.Result fires. The ASREvents.InterimResult event is therefore a sign that the person is saying something right now. We will rely on it alone and watch how long it has been since it last arrived. The texts received from ASREvents.Result are concatenated.
In general, it will look like this:
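let timer = null
let answer = ""
// every interim result means the person is still speaking: restart the pause timer
asr.addEventListener(ASREvents.InterimResult, e => {
  clearTimeout(timer)
  timer = setTimeout(stop, 3000)
})
// final results are appended to the accumulated answer
asr.addEventListener(ASREvents.Result, e => {
  answer += " " + e.text
})
function stop() {
  //...
}
Getting to the point. Timers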
To handle pauses properly, you can create a dedicated object:
const timeouts = {
  silence: null,
  pause: null,
  duration: null
}
After a question is asked, a person often thinks for a few seconds. The silence timer at the very beginning is best set to 6-8 seconds; its timer ID is stored in timeouts.silence.
Pauses in the middle of the answer work best at 3-4 seconds: long enough for the person to think, short enough that they do not suffer waiting once they have finished. This is the timeouts.pause parameter.
A general timer for the whole answer, timeouts.duration, is useful if we do not want the person to talk for too long. It also protects us when the person is in a noisy room and background voices are mistaken for the client's speech, and when we have reached another robot that talks with ours in circles.
So at the beginning of the script we load the ASR module, declare the variables and create the timeouts object:
require(Modules.ASR)
let call,
    asr,
    speech = ""
const timeouts = {
  silence: null,
  pause: null,
  duration: null
}
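Since all three timer IDs live in one object, it can be handy to cancel them in a single place. The following is a minimal sketch in plain JavaScript; clearAllTimeouts is a hypothetical helper name, not part of the original scenario or of VoxEngine.

```javascript
// Same shape as the timeouts object from the scenario.
const timeouts = {
  silence: null,
  pause: null,
  duration: null
}

// Hypothetical helper (an assumption, not from the original script):
// cancels every pending timer and resets the stored IDs to null.
function clearAllTimeouts(t) {
  for (const name of Object.keys(t)) {
    if (t[name] !== null) {
      clearTimeout(t[name])
      t[name] = null
    }
  }
}
```

Calling clearAllTimeouts(timeouts) before the answer is analyzed would ensure that no leftover timer triggers the analysis a second time.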
Incoming call
When an incoming call arrives in the script, the AppEvents.CallAlerting event fires. Create a handler for this event: answer the call, greet the client, and start recognition after the greeting. Let's also allow the person to interrupt the robot in the middle of the question (details a little further on).
AppEvents.CallAlerting handler
VoxEngine.addEventListener(AppEvents.CallAlerting, e => {
  call = e.call
  // answer the incoming call; the Connected event fires once the call is established
  call.answer()
  call.addEventListener(CallEvents.Connected, e => {
    // "Hello, you placed an order on our website. Please tell us how you rate the convenience of our service."
    call.say("Здравствуйте, вы оформляли заказ на нашем сайте. Расскажите, пожалуйста, как вы оцениваете удобство работы с нашим сервисом?", Language.RU_RUSSIAN_FEMALE)
    // start listening after 4 seconds; from that moment the person can interrupt the robot
    setTimeout(startASR, 4000)
    // start the remaining timers when the question has finished playing
    call.addEventListener(CallEvents.PlaybackFinished, startSilenceAndDurationTimeouts)
  })
  call.addEventListener(CallEvents.Disconnected, e => {
    VoxEngine.terminate()
  })
})
You can see that the startASR and startSilenceAndDurationTimeouts functions are called here. Let's look at what they do and why.
Recognition and timeouts
Recognition is implemented in the startASR function. It creates an ASR instance, sends the caller's voice to it, and registers handlers for the ASREvents.InterimResult and ASREvents.Result events. As we said above, we treat ASREvents.InterimResult as a sign that the person is speaking. Its handler clears the previously created timeouts, restarts timeouts.pause, and finally stops the synthesized voice (this is how the person can interrupt the bot). The ASREvents.Result handler simply concatenates all final results into the speech variable. In this particular scenario speech is only checked for being empty, but if desired it can be sent to your backend, for example.
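The "restart the pause timer on every interim result" idea can be tried out separately from VoxEngine. Below is a minimal sketch in plain JavaScript; createPauseDetector and its options are hypothetical names introduced for illustration, and the timer functions are injectable only so that the logic can be exercised without waiting in real time.

```javascript
// Hypothetical helper, not part of the Voximplant API: it models only the
// pause-detection logic described above.
function createPauseDetector({
  pauseMs,
  onPause,
  setTimer = setTimeout,
  clearTimer = clearTimeout
}) {
  let timerId = null
  return {
    // Call on every interim result: the person is still talking,
    // so the countdown to "answer finished" starts over.
    onInterim() {
      if (timerId !== null) clearTimer(timerId)
      timerId = setTimer(() => {
        timerId = null
        onPause()
      }, pauseMs)
    },
    // Call when recognition is stopped for any other reason.
    cancel() {
      if (timerId !== null) {
        clearTimer(timerId)
        timerId = null
      }
    }
  }
}
```

In the scenario, onInterim would correspond to the body of the ASREvents.InterimResult handler and onPause to speechAnalysis.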
startASR
function startASR() {
  asr = VoxEngine.createASR({
    lang: ASRLanguage.RUSSIAN_RU,
    interimResults: true
  })
  asr.addEventListener(ASREvents.InterimResult, e => {
    // the person is speaking: restart the pause timer and drop the silence timer
    clearTimeout(timeouts.pause)
    clearTimeout(timeouts.silence)
    timeouts.pause = setTimeout(speechAnalysis, 3000)
    // stop the synthesized voice so that the person can interrupt the bot
    call.stopPlayback()
  })
  asr.addEventListener(ASREvents.Result, e => {
    // accumulate the recognized answers
    speech += " " + e.text
  })
  // send the call audio to the ASR
  call.sendMediaTo(asr)
}
The startSilenceAndDurationTimeouts function stores the IDs of the corresponding timers:
function startSilenceAndDurationTimeouts() {
  timeouts.silence = setTimeout(speechAnalysis, 8000)
  timeouts.duration = setTimeout(speechAnalysis, 30000)
}
A few more functions
The speechAnalysis function stops recognition and analyzes the text in speech (which is collected from ASREvents.Result). If there is no text, we repeat the question; if there is text, we politely say goodbye and hang up.
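The decision described above can also be written as a small pure function, which is easy to check outside VoxEngine. This is only a sketch under assumptions: nextStep, the attempts counter and the maxAttempts limit are hypothetical and do not appear in the original scenario, which repeats the question without any limit.

```javascript
// Hypothetical pure helper: maps the accumulated transcript to the next step.
// attempts is how many times the question has already been repeated (an assumption).
function nextStep(speech, attempts, maxAttempts = 2) {
  const cleanText = speech.trim().toLowerCase()
  if (cleanText.length > 0) {
    // something was recognized: thank the person and hang up
    return "goodbye"
  }
  // nothing was recognized: repeat the question, but not forever
  return attempts < maxAttempts ? "repeat" : "giveUp"
}
```

A cap like this is one possible way to avoid asking the same question endlessly when the line stays silent.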
speechAnalysis
function speechAnalysis() {
  // stop the ASR module
  stopASR()
  const cleanText = speech.trim().toLowerCase()
  if (!cleanText.length) {
    // an empty string means the silence timer fired, i.e. the person said nothing at all,
    // so we can, for example, repeat the question
    handleSilence()
  } else {
    // "Thank you very much for your feedback! Goodbye!"
    call.say(
      "Большое спасибо за отзыв! До свидания!",
      Language.RU_RUSSIAN_FEMALE
    )
    call.addEventListener(CallEvents.PlaybackFinished, () => {
      call.removeEventListener(CallEvents.PlaybackFinished)
      call.hangup()
    })
  }
}
The handleSilence function is responsible for repeating the question:
function handleSilence() {
  // "Sorry, we cannot hear you. Please tell us how you rate the convenience of our service."
  call.say("Извините, вас не слышно. Расскажите, пожалуйста, как вы оцениваете удобство работы с нашим сервисом?", Language.RU_RUSSIAN_FEMALE)
  // start listening after 3 seconds; from that moment the person can interrupt the robot
  setTimeout(startASR, 3000)
  call.addEventListener(CallEvents.PlaybackFinished, startSilenceAndDurationTimeouts)
}
Finally, the helper function that stops the ASR:
function stopASR() {
  asr.stop()
  call.removeEventListener(CallEvents.PlaybackFinished)
  // clear every pending timer so that speechAnalysis cannot fire a second time
  clearTimeout(timeouts.silence)
  clearTimeout(timeouts.pause)
  clearTimeout(timeouts.duration)
}
Together
The final scenario shows how to “refine” a straightforward robot by adding a little tact and attention. This method is surely not the only possible one, so if you have ideas on how to polish the usual interaction between a bot and a person, share them in the comments. If you want something more advanced and have not yet read our DialogFlow tutorial, we recommend it.