Google speech synthesis and recognition for Asterisk

Good morning!

Last night I was browsing Habr, came across the article "Google Translate + Asterisk IVR", and it made even my armpit hair stand on end.

Speech synthesis, and it is this easy!
No need to build Festival and hunt for voices for it. Everything is ready-made, simple, and comes from Google.


I immediately ported the proposed solution to my favorite PHP and packaged it as an AGI script that Asterisk can call. I wanted synthesis to take a single line in the dialplan, just like the standard SayDigits() command:

Example of use in extensions.ael:
s => {
        Answer();
        Wait(1);
        AGI(say.php,"Здравствуйте");
        AGI(say.php,"Вас приветствует компания");
        AGI(say.php,"Habrahabr!",en);
        AGI(say.php,"Ваш звонок важен для нас!");
        AGI(say.php,"Пожалуйста!");
        AGI(say.php,"оставайтесь на линии");
        AGI(say.php,"Вам обязательно ответят!");
};


And the PHP code itself (it should be saved as /var/lib/asterisk/agi-bin/say.php):
#!/usr/bin/php -q
<?php
$agivars = array();
while (!feof(STDIN)) {
    $agivar = trim(fgets(STDIN));
    if ($agivar === '')
        break;
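    // Split on the first ':' only, so values that contain ':' stay intact.
    $agivar = explode(':', $agivar, 2);
    $agivars[$agivar[0]] = isset($agivar[1]) ? trim($agivar[1]) : '';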
}
extract($agivars);
$text = $_SERVER["argv"][1];
if (isset($_SERVER["argv"][2])) $lang = $_SERVER["argv"][2];
else $lang = 'ru';
$md5 = md5($text);
$prefix = '/var/lib/asterisk/festivalcache/';
$filename = $prefix.$md5;
if (!file_exists($filename.'.alaw')) {
    $wget = 'wget -U "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5" ';
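    // urlencode() keeps spaces, quotes and '&' in the phrase from breaking the URL or the shell command.
    $wget.= '"http://translate.google.com/translate_tts?q='.urlencode($text).'&tl='.urlencode($lang).'" -O '.$filename.'.mp3';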
    $ffmpeg = 'ffmpeg -i '.$filename.'.mp3 -ar 8000 -ac 1 -ab 64 '.$filename.'.wav -ar 8000 -ac 1 -ab 64 -f alaw '.$filename.'.alaw -map 0:0 -map 0:0';
    $exec = $wget.' && '.$ffmpeg.' && rm '.$filename.'.mp3 '.$filename.'.wav';
    exec($exec);
}
echo'STREAM FILE "'.$filename.'" ""'."\n";
fgets(STDIN);
exit(0);
?>

In my Asterisk, the main codec is alaw, so I convert mp3 to alaw right away.
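
The caching and the request URL are the two parts of say.php that are easiest to get wrong, so here is a minimal sketch of them in isolation. The function names tts_cache_path and tts_url are hypothetical and do not appear in the script; the cache directory and the translate_tts URL are taken from the code above. A phrase containing spaces, quotes, or & needs percent-encoding before it goes into the query string, which is what urlencode() does here.

```php
<?php
// Hypothetical helpers (not part of the original script) that isolate
// the cache lookup and the request URL used by say.php.

// Cache file path without extension: prefix + md5 of the phrase.
function tts_cache_path($text, $prefix = '/var/lib/asterisk/festivalcache/')
{
    return $prefix . md5($text);
}

// Request URL with the phrase and the language code percent-encoded.
function tts_url($text, $lang = 'ru')
{
    return 'http://translate.google.com/translate_tts?q=' . urlencode($text)
         . '&tl=' . urlencode($lang);
}
```

Because the cache key is the md5 of the phrase, every distinct phrase is downloaded and converted only once; later calls play the cached .alaw file straight away.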

After 10 minutes of delight, I remembered that Google can also recognize speech (as in voice search on a mobile phone). I searched the Internet and found the article "Voice control. Recognition of Russian speech", which contains a PHP example of speech recognition using Google.

I rewrote that code as an AGI script and got this (/var/lib/asterisk/agi-bin/voice.php):
#!/usr/bin/php -q
<?php
$agivars = array();
while (!feof(STDIN)) {
    $agivar = trim(fgets(STDIN));
    if ($agivar === '')
        break;
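    // Split on the first ':' only, so values that contain ':' stay intact.
    $agivar = explode(':', $agivar, 2);
    $agivars[$agivar[0]] = isset($agivar[1]) ? trim($agivar[1]) : '';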
}
extract($agivars);
$filename = $_SERVER["argv"][1];
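// Quote the paths so that an unusual file name cannot break the shell command.
exec('flac -f -s '.escapeshellarg($filename.'.wav').' -o '.escapeshellarg($filename.'.flac'));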
$file_to_upload = array('myfile'=>'@'.$filename.'.flac');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,"https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=ru-RU");
curl_setopt($ch, CURLOPT_POST,1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: audio/x-flac; rate=8000"));
curl_setopt($ch, CURLOPT_POSTFIELDS, $file_to_upload);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
$result=curl_exec ($ch);
curl_close ($ch);
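$json_array = json_decode($result, true);
// Google may return no hypotheses at all; fall back to an empty string in that case.
$voice_cmd = isset($json_array["hypotheses"][0]["utterance"]) ? $json_array["hypotheses"][0]["utterance"] : '';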
unlink($filename.'.flac');
unlink($filename.'.wav');
echo'SET VARIABLE VOICE "'.$voice_cmd.'"'."\n";
fgets(STDIN);
echo'VERBOSE "'.$voice_cmd.'" 1'."\n";
fgets(STDIN);
exit(0);
?>

The Google Speech API accepts audio files in flac and speex format; I kept flac, as in the original example.
The recognized text is stored in the ${VOICE} variable.
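
Extracting the transcript from the reply is worth a small sketch, because the reply can come back without any hypotheses. The helper name extract_utterance is hypothetical, and the reply shape {"hypotheses":[{"utterance":"..."}]} is only inferred from the keys that voice.php reads; treat it as an assumption rather than a documented schema.

```php
<?php
// Hypothetical helper: pull the first transcript out of the JSON reply.
// Assumed reply shape: {"hypotheses":[{"utterance":"..."}]}
// Returns an empty string for invalid JSON or a reply without hypotheses.
function extract_utterance($json)
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['hypotheses'][0]['utterance'])) {
        return '';
    }
    return (string) $data['hypotheses'][0]['utterance'];
}
```

With such a helper, the dialplan can test ${VOICE} for an empty value and ask the caller to repeat instead of synthesizing nothing.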

A combined usage example in extensions.ael:
s => {
        Answer();
        Wait(1);
        AGI(say.php,"Здравствуйте");
        AGI(say.php,"Пожалуйста");
        AGI(say.php,"Скажите имя сотрудника");
        Record(/tmp/${UNIQUEID}.wav,3,20);
        AGI(say.php,"Вы сказали");
        Playback(/tmp/${UNIQUEID});
        AGI(voice.php,/tmp/${UNIQUEID});
        AGI(say.php,"Система услышала");
        AGI(say.php,"${VOICE}");
        Hangup();
};

Record writes a wav file at most 20 seconds long and stops recording after 3 seconds of silence.
Since this is a test, we first play back what the caller said and then synthesize the recognized text.

What can I say: well done, Google!
Now it is clear how to teach a stock Asterisk speech synthesis and recognition without Festival or Sphinx.

And if the boss asks for a voice IVR menu in a hurry, we will have a surprise ready!

Update

I read the comment by user int80h, read about migrating from the Google Translate API to the Bing Translate API, and decided that everything should have an alternative.

Version 2.0
say.php with the ability to synthesize speech through Microsoft Translator:
#!/usr/bin/php -q
<?php
$agivars = array();
while (!feof(STDIN)) {
    $agivar = trim(fgets(STDIN));
    if ($agivar === '')
        break;
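    // Split on the first ':' only, so values that contain ':' stay intact.
    $agivar = explode(':', $agivar, 2);
    $agivars[$agivar[0]] = isset($agivar[1]) ? trim($agivar[1]) : '';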
}
extract($agivars);
$text = $_SERVER["argv"][1];
if (isset($_SERVER["argv"][2]) && in_array($_SERVER["argv"][2], array('g','m'))) $voice = $_SERVER["argv"][2];
else $voice = 'g';
if (isset($_SERVER["argv"][3])) $lang = $_SERVER["argv"][3];
else $lang = 'ru';
$md5 = md5($text.$voice.$lang);
$prefix = '/var/lib/asterisk/festivalcache/';
$appid = 'T0CQJrrwQ1NcJFlJshEfWTzaI18B4TzVvBKx9CDoLvf8*';
$filename = $prefix.$md5;
if (!file_exists($filename.'.alaw')) {
    if ($voice == 'm') {
        $ext = '.wav';
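        // urlencode() keeps spaces, quotes and '&' from breaking the URL or the shell command.
        exec('wget "http://api.microsofttranslator.com/V2/Http.svc/Speak?language='.urlencode($lang).'&format=audio/wav&options=MaxQuality&appid='.urlencode($appid).'&text='.urlencode($text).'" -O '.$filename.$ext);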
    } else {
        $ext = '.mp3';
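        // urlencode() keeps spaces, quotes and '&' from breaking the URL or the shell command.
        exec('wget -U "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5" "http://translate.google.com/translate_tts?q='.urlencode($text).'&tl='.urlencode($lang).'" -O '.$filename.$ext);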
    }
    if (@filesize($filename.$ext) > 0) {
        exec('ffmpeg -i '.$filename.$ext.' -ar 8000 -ac 1 -ab 64 -f alaw '.$filename.'.alaw -map 0:0');
    }
    unlink($filename.$ext);
}
if (file_exists($filename.'.alaw')) {
    echo'STREAM FILE "'.$filename.'" ""'."\n";
    fgets(STDIN);
} else {
    echo'VERBOSE "Speech Error!" 1'."\n";
    fgets(STDIN);
}
exit(0);
?>

Microsoft returns audio in wav format (the quality of its mp3 is terrible) and requires a Bing AppId (I took one from microsofttranslator.com; let's see how long it lasts).
The synthesis quality seemed worse to me than Google's, but it places the stress in names more accurately.

AGI(say.php,"Здравствуйте",m);
AGI(say.php,"Здравствуйте",${ENGINE},${LANGUAGE});
${ENGINE} - either m (Microsoft) or g (Google); defaults to g if omitted
${LANGUAGE} - ru, en, and so on; defaults to ru if omitted

Russian text works only with ru. English text is always spoken, but with ru it comes out as "broken" English.
Stress marks work in the text (an apostrophe ' placed before the vowel), and punctuation marks (for example, !) change the intonation.
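
The argument defaults described above can be sketched as a small function. The name parse_say_args is hypothetical; the logic mirrors say.php 2.0, with one labeled difference: this sketch also falls back to ru when the language argument is an empty string, whereas the script only checks that the argument is present.

```php
<?php
// Hypothetical helper mirroring the defaults in say.php 2.0:
//   argv[1] = text
//   argv[2] = engine, 'g' (Google) or 'm' (Microsoft); anything else becomes 'g'
//   argv[3] = language; defaults to 'ru'
//             (assumption: an empty string is also treated as missing)
function parse_say_args(array $argv)
{
    $text  = isset($argv[1]) ? $argv[1] : '';
    $voice = (isset($argv[2]) && in_array($argv[2], array('g', 'm'), true)) ? $argv[2] : 'g';
    $lang  = (isset($argv[3]) && $argv[3] !== '') ? $argv[3] : 'ru';
    return array('text' => $text, 'voice' => $voice, 'lang' => $lang);
}
```

One consequence is easy to miss: in version 2.0 the language moved to the third argument, so AGI(say.php,"Habrahabr!",en) from the first example is now read as an unknown engine and falls back to g and ru. To get English from Google, the call would be AGI(say.php,"Habrahabr!",g,en).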

PS: I noticed that speech recognition sometimes returns empty text, but when you send the same file again everything works fine, which is strange :-)
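
Since resending the same file usually helps, one possible workaround is a simple retry loop. This is only a sketch: recognize_with_retry is a hypothetical name, and $recognize stands for any callable that takes a file name and returns the recognized text (for example, the curl request from voice.php wrapped in a function).

```php
<?php
// Hypothetical wrapper: call $recognize up to $attempts times and
// return the first non-empty result, or '' if every attempt was empty.
function recognize_with_retry($recognize, $filename, $attempts = 3)
{
    for ($i = 0; $i < $attempts; $i++) {
        $text = call_user_func($recognize, $filename);
        if ($text !== '' && $text !== null) {
            return $text;
        }
    }
    return '';
}
```

In voice.php the retries would have to run before the unlink() calls, because each attempt needs the flac file to still be on disk.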
