Using Google Cloud Speech API v2 in Asterisk to recognize Russian speech

Good evening, colleagues. We recently needed to add voice tickets to our ticket system. Listening to an audio file every time is not always convenient, so the idea came up to add automatic speech recognition, which would also be useful in other projects later. In the course of this work I tried the APIs of two of the most popular speech recognition services, from Google and Yandex, and in the end chose Google. I could not find detailed information about this on the Internet, so I decided to share my experience. If you are interested in what came of it, read on.

Choosing a Speech Recognition API


I considered only hosted APIs. Boxed (self-hosted) solutions were not needed: they require hardware resources, the recognized data was not business-critical, and they are much more complicated to use and cost more man-hours.

The first candidate was Yandex SpeechKit Cloud. I immediately liked how easy it is to use:

curl -X POST -H "Content-Type: audio/x-wav" --data-binary "@speech.wav" "https://asr.yandex.net/asr_xml?uuid=<user ID>&key=&topic=queries"
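If the same call has to be made from a script rather than from the shell, the curl command maps onto Python's standard library roughly as follows. This is only a sketch for Python 3: the function name build_asr_request and the sample values for uuid and key are illustrative placeholders, and the request is built but deliberately not sent.

```python
import urllib.parse
import urllib.request

ASR_URL = "https://asr.yandex.net/asr_xml"  # endpoint taken from the curl command


def build_asr_request(audio_bytes, user_id, api_key, topic="queries"):
    """Build (but do not send) the POST request that the curl command issues."""
    query = urllib.parse.urlencode({"uuid": user_id, "key": api_key, "topic": topic})
    return urllib.request.Request(
        ASR_URL + "?" + query,
        data=audio_bytes,  # raw WAV bytes, like curl's --data-binary
        headers={"Content-Type": "audio/x-wav"},
        method="POST",
    )


# Hypothetical placeholder values; substitute your own identifier and key.
request = build_asr_request(b"RIFF...", user_id="0123456789abcdef", api_key="YOUR_KEY")
print(request.get_method(), request.full_url)
```

To actually send it, the request object can be passed to urllib.request.urlopen and the XML response read from the returned object.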

Pricing: 400 rubles per 1000 requests, with the first month free. But then the disappointments began:

- When a long sentence was sent, only 2-3 words came back
- Those words were recognized in a strange order
- Changing the topic parameter did not improve the results

Perhaps this was due to the mediocre recording quality: we tested everything through voice gateways and ancient Panasonic phones. I still plan to use it in the future to build an IVR.

Next was the service from Google. The Internet is full of articles suggesting the API intended for Chromium developers, but keys for that API are no longer easy to obtain. Therefore, we will use the commercial platform.

Pricing: the first 60 minutes per month are free, then $0.006 per 15 seconds of speech. Each request is rounded up to a multiple of 15 seconds. The first two months are free, and you need a credit card to create a project. The documentation describes several ways to use the API. We will use a Python script:

Script from the documentation
"""Google Cloud Speech API sample application using the REST API for batch
processing."""
import argparse
import base64
import json
from googleapiclient import discovery
import httplib2
from oauth2client.client import GoogleCredentials
DISCOVERY_URL = ('https://{api}.googleapis.com/$discovery/rest?'
                 'version={apiVersion}')
def get_speech_service():
    credentials = GoogleCredentials.get_application_default().create_scoped(
        ['https://www.googleapis.com/auth/cloud-platform'])
    http = httplib2.Http()
    credentials.authorize(http)
    return discovery.build(
        'speech', 'v1beta1', http=http, discoveryServiceUrl=DISCOVERY_URL)
def main(speech_file):
    """Transcribe the given audio file.
    Args:
        speech_file: the name of the audio file.
    """
    with open(speech_file, 'rb') as speech:
        speech_content = base64.b64encode(speech.read())
    service = get_speech_service()
    service_request = service.speech().syncrecognize(
        body={
            'config': {
                'encoding': 'LINEAR16',  # raw 16-bit signed LE samples
                'sampleRate': 16000,  # 16 khz
                'languageCode': 'en-US',  # a BCP-47 language tag
            },
            'audio': {
                'content': speech_content.decode('UTF-8')
                }
            })
    response = service_request.execute()
    print(json.dumps(response))
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        'speech_file', help='Full path of audio file to be recognized')
    args = parser.parse_args()
    main(args.speech_file)
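To put the Google pricing quoted earlier into concrete numbers, here is a rough estimate in Python 3. It is a sketch under two assumptions that the text does not spell out: each request is rounded up (not to the nearest multiple), and the free 60 minutes are counted against the rounded-up seconds. The function names are illustrative.

```python
import math

FREE_SECONDS_PER_MONTH = 60 * 60  # the first 60 minutes per month are free
PRICE_PER_INCREMENT = 0.006       # USD per 15-second increment
INCREMENT_SECONDS = 15


def billed_seconds(duration_seconds):
    """Round one request up to the next multiple of 15 seconds (assumed: upward)."""
    return math.ceil(duration_seconds / INCREMENT_SECONDS) * INCREMENT_SECONDS


def estimate_monthly_cost(durations_seconds):
    """Estimate the monthly bill in USD for a list of request durations."""
    total = sum(billed_seconds(d) for d in durations_seconds)
    # Assumption: the free tier is subtracted from the rounded-up total.
    chargeable = max(0, total - FREE_SECONDS_PER_MONTH)
    return chargeable // INCREMENT_SECONDS * PRICE_PER_INCREMENT


# Example: 1000 recordings of 28 seconds each in one month.
print(round(estimate_monthly_cost([28] * 1000), 2))  # prints 10.56
```

A 28-second recording is billed as 30 seconds, so it pays to keep recordings just under a multiple of 15 seconds where possible.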

Preparing to use the Google Cloud Speech API


We need to register a project and create a service account key for authorization. A free trial is available; you need a Google account. After registering, enable the API and create a key for authorization, then copy the key to the server.

Now let's set up the server itself. We will need:

- python
- python-pip
- python google api client

sudo apt-get install -y python python-pip
pip install --upgrade google-api-python-client

Now we need to export two environment variables for the API to work. The first is the path to the service account key, the second is the ID of your project.

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_file.json
export GCLOUD_PROJECT=your-project-id
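Forgetting one of these variables is an easy mistake, especially when the script is later started by Asterisk rather than from an interactive shell. A small preflight check along the following lines can catch it early. This is a sketch for Python 3; the helper name check_environment is hypothetical, and the check only verifies that the variables are set and that the key file exists, not that the key is valid.

```python
import os

REQUIRED_VARS = ("GOOGLE_APPLICATION_CREDENTIALS", "GCLOUD_PROJECT")


def check_environment(environ=None):
    """Return a list of human-readable problems; an empty list means all is well."""
    environ = os.environ if environ is None else environ
    problems = []
    for name in REQUIRED_VARS:
        if not environ.get(name):
            problems.append(name + " is not set")
    key_path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_path and not os.path.isfile(key_path):
        problems.append("key file not found: " + key_path)
    return problems


# Report anything that is missing in the current environment.
for problem in check_environment():
    print("WARNING:", problem)
```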

Download the test audio file and try to run the script:

wget https://cloud.google.com/speech/docs/samples/audio.raw
python voice.py audio.raw
{"results": [{"alternatives": [{"confidence": 0.98267895, "transcript": "how old is the Brooklyn Bridge"}]}]}

Excellent! The first test succeeded. Now let's change the recognition language in the script and try Russian speech:

nano voice.py
    service_request = service.speech().syncrecognize(
        body={
            'config': {
                'encoding': 'LINEAR16',  # raw 16-bit signed LE samples
                'sampleRate': 16000,  # 16 khz
                'languageCode': 'ru-RU',  # a BCP-47 language tag

The API expects a raw audio file, so we use sox to convert the recording:

apt-get install -y sox
sox test.wav -r 16000 -b 16 -c 1 test.raw
python voice.py test.raw
{"results": [{"alternatives": [{"confidence": 0.96161985, "transcript": "\u0417\u0434\u0440\u0430\u0432\u0441\u0442\u0432\u0443\u0439\u0442\u0435 \u0412\u0430\u0441 \u043f\u0440\u0438\u0432\u0435\u0442\u0441\u0442\u0432\u0443\u0435\u0442 \u043a\u043e\u043c\u043f\u0430\u043d\u0438\u044f"}]}]}

Google returns the answer with the Cyrillic characters escaped as \uXXXX sequences, but we want to see readable letters. Let's change voice.py a bit:

Instead of

    print(json.dumps(response))

we will use

    s = simplejson.dumps({'var': response}, ensure_ascii=False)
    print(s)

and add import simplejson at the top (if the package is missing, it can be installed with pip install simplejson). The final script:

Voice.py
"""Google Cloud Speech API sample application using the REST API for batch
processing."""
import argparse
import base64
import json
import simplejson
from googleapiclient import discovery
import httplib2
from oauth2client.client import GoogleCredentials
DISCOVERY_URL = ('https://{api}.googleapis.com/$discovery/rest?'
                 'version={apiVersion}')
def get_speech_service():
    credentials = GoogleCredentials.get_application_default().create_scoped(
        ['https://www.googleapis.com/auth/cloud-platform'])
    http = httplib2.Http()
    credentials.authorize(http)
    return discovery.build(
        'speech', 'v1beta1', http=http, discoveryServiceUrl=DISCOVERY_URL)
def main(speech_file):
    """Transcribe the given audio file.
    Args:
        speech_file: the name of the audio file.
    """
    with open(speech_file, 'rb') as speech:
        speech_content = base64.b64encode(speech.read())
    service = get_speech_service()
    service_request = service.speech().syncrecognize(
        body={
            'config': {
                'encoding': 'LINEAR16',  # raw 16-bit signed LE samples
                'sampleRate': 16000,  # 16 khz
                'languageCode': 'ru-RU',  # a BCP-47 language tag
            },
            'audio': {
                'content': speech_content.decode('UTF-8')
                }
            })
    response = service_request.execute()
    s = simplejson.dumps({'var': response}, ensure_ascii=False)
    print(s)
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        'speech_file', help='Full path of audio file to be recognized')
    args = parser.parse_args()
    main(args.speech_file)

Before running it, you need to export one more environment variable, PYTHONIOENCODING=UTF-8. Without it, I had problems with stdout when the script was called from other scripts.

export PYTHONIOENCODING=UTF-8
python voice.py test.raw
{"var": {"results": [{"alternatives": [{"confidence": 0.96161985, "transcript": "Здравствуйте Вас приветствует компания"}]}]}}
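The ensure_ascii flag is not specific to simplejson: the standard json module accepts it as well, so the extra dependency may be avoidable. The sketch below assumes Python 3 and a UTF-8 terminal, and uses a hard-coded response shaped like the output above instead of a live API call.

```python
import json

# Sample response shaped like the API output shown in this article.
response = {"results": [{"alternatives": [
    {"confidence": 0.96161985,
     "transcript": "Здравствуйте Вас приветствует компания"}]}]}

escaped = json.dumps(response)                       # default: non-ASCII becomes \uXXXX
readable = json.dumps(response, ensure_ascii=False)  # keeps the Cyrillic text as-is

print(escaped)
print(readable)
```

Both strings decode to the same data; only the textual representation differs.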

Now we can call this script from the dialplan.

Asterisk dialplan example


To call the script, I will use a simple dialplan:

exten => 1234,1,Answer
exten => 1234,n,wait(1)
exten => 1234,n,Playback(howtomaketicket)
exten => 1234,n,Playback(beep)
exten => 1234,n,Set(FILE=${CALLERID(num)}--${EXTEN}--${STRFTIME(${EPOCH},,%d-%m-%Y--%H-%M-%S)}.wav)
exten => 1234,n,MixMonitor(${FILE},,/opt/test/send.sh support@test.net "${CDR(src)}" "${CALLERID(name)}" "${FILE}")
exten => 1234,n,wait(28)
exten => 1234,n,Playback(beep)
exten => 1234,n,Playback(Thankyou!)
exten => 1234,n,Hangup()

I use MixMonitor to record the call and to run the script once the recording finishes. The Record application could be used instead and would probably be a better fit. Below is an example send.sh; it assumes that mutt is already configured:

#!/bin/bash
# Script for sending notifications
# Export the required environment variables
# Google service account key file
export GOOGLE_APPLICATION_CREDENTIALS=/opt/test/project.json
# Project ID
export GCLOUD_PROJECT=project-id
# Encoding for Python
export PYTHONIOENCODING=UTF-8
# Input arguments
EMAIL=$1
CALLERIDNUM=$2
CALLERIDNAME=$3
FILE=$4
# Convert the audio file to raw so it can be passed to the Google API
sox "/var/spool/asterisk/monitor/$FILE" -r 16000 -b 16 -c 1 "/var/spool/asterisk/monitor/$FILE.raw"
# Store the output of the speech-to-text script in a variable, trimming everything except the transcript
TEXT=$(python /opt/test/voice.py "/var/spool/asterisk/monitor/$FILE.raw" | sed -e 's/.*transcript"://' -e 's/}]}]}}//')
# Send the email, including the recognized text
echo "New notification from number: $CALLERIDNUM  $CALLERIDNAME
 $TEXT " | mutt -s "This is the email subject" -e 'set from=test@test.net realname="I send notifications"' -a "/var/spool/asterisk/monitor/$FILE" -- "$EMAIL"
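The sed expressions in send.sh depend on the exact key order and on the trailing braces of the JSON, so they may break if the response contains several results or if nothing is recognized. Parsing the JSON is a more forgiving alternative. The following is a sketch for Python 3, not part of the original setup: the function name extract_transcript is hypothetical, and it assumes that the first alternative of each result is the best guess and that a response without results simply lacks the "results" key.

```python
import json


def extract_transcript(script_output):
    """Return the recognized text from voice.py output, or "" if there is none."""
    payload = json.loads(script_output)
    # voice.py wraps the API response in {"var": ...}; accept the bare response too.
    response = payload.get("var", payload)
    pieces = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            # Assumption: the first alternative is the most likely one.
            pieces.append(alternatives[0].get("transcript", ""))
    return " ".join(piece.strip() for piece in pieces if piece.strip())


sample = ('{"var": {"results": [{"alternatives": [{"confidence": 0.96161985, '
          '"transcript": "Здравствуйте Вас приветствует компания"}]}]}}')
print(extract_transcript(sample))
```

Such a helper could live inside voice.py itself, so that the script prints only the transcript and the shell wrapper no longer needs sed.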

Conclusion


That solves the task. I hope my experience will be useful to someone. I would be glad to see your comments (perhaps they alone make Habr worth reading!). In the future, I plan to build an IVR with voice control elements on top of this.
