ASR and TTS technologies for the application programmer: the theoretical minimum

  • Tutorial

Introduction


In the past few years, voice interfaces have been appearing all around us. What was once shown only in films about the distant future has turned out to be quite real. It has come to the point where speech synthesis (Text-To-Speech, TTS) and recognition (Automatic Speech Recognition, ASR) engines are embedded in mobile phones. Moreover, quite affordable APIs have appeared for embedding ASR and TTS in applications.

Now anyone (willing to pay for an engine) can create programs with a voice interface. Our review is devoted specifically to the use of existing engines (Nuance, for example), not to building one. It also gives the general background that any programmer meeting speech interfaces for the first time will need. The article may likewise be useful to project managers trying to assess the feasibility of integrating voice technologies into their products.
So, let's begin...

But first, a joke to set the stage:
A Russian language lesson in a Georgian school.
The teacher says: "Children, remember: the words sol (salt), fasol (beans), and vermishel (noodles) are written with a soft sign, while vilka (fork), bulka (roll), and tarelka (plate) are written without one. Children, remember this, because it is impossible to understand!"

This joke used to seem absurd to me. Now it feels more like real life. Why? I will try to explain...

1. Phonemes


Speaking of speech (which already sounds funny), we first of all have to deal with the concept of the phoneme. Simply put, a phoneme is a distinct sound that can be pronounced and recognized by a person. But this definition is not enough, because a person can pronounce a great many sounds, while the set of phonemes in any language is limited. We would like a more rigorous definition, so we have to turn to the philologists. Alas, the philologists themselves cannot agree on what a phoneme is (and they do not really need to), but they have several approaches. One ties phonemes to meaning: the English Wiki, for example, defines a phoneme as "the smallest contrastive linguistic unit which may bring about a change of meaning." Others tie them to perception: our compatriot N. Trubetzkoy wrote, "Phonological units that, from the standpoint of a given language, cannot be decomposed into shorter successive phonological units, we call phonemes." Both definitions contain clarifications important to us. On the one hand, changing a phoneme can (but does not have to) change the meaning of a word: "kod" (code) and "kot" (cat) will be perceived as two different Russian words. On the other hand, you can pronounce "museum" with one vowel shade or another and the meaning will not change; at most, your interlocutors will be able to place your accent. The indivisibility of phonemes is also important, but, as Trubetzkoy rightly noted, it can depend on the language: where a person of one nationality hears a single sound, someone else may hear two, one after another. Still, we would like phonetic invariants suitable for all languages, not just one.

2. Phonetic alphabet


To bring some order to these definitions, back in 1888 the International Phonetic Alphabet (IPA) was created. This alphabet is good in that it does not depend on any particular language. That is, it is designed as if for a "superman" capable of pronouncing and recognizing the sounds of practically all living (and even dead) languages. The IPA kept changing right up to recent times (the latest revision dates from 2005). Since it was created mostly in the pre-computer era, philologists drew the symbols for sounds however the spirit moved them. Of course, they loosely followed the Latin alphabet, but very, very loosely. As a result, IPA characters are available in Unicode today, yet typing them from a keyboard is not easy. Here the reader may ask: why does anyone need the IPA, and where can one at least see examples of words written phonetically? My answer is that an ordinary person has no need to know the IPA, yet it is very easy to come across: many Wiki articles on geographical names, surnames, and proper names include it. Knowing the IPA, you can always check the correct pronunciation of a name in an unfamiliar language. Want to say "Paris" like a Frenchman? There you go: [paʁi].

3. Phonetic transcription


An attentive Wiki user may notice that sometimes these strange phonetic symbols sit inside square brackets, [mɐˈskva], and sometimes inside slashes, /ˈlʌndən/. What is the difference? Square brackets enclose so-called narrow transcription; in Russian-language literature it is called phonetic. Slashes enclose broad, i.e. phonemic, transcription. The practical meaning is this: phonetic transcription records an extremely precise pronunciation, ideal in a sense and independent of the speaker's accent. In other words, given a narrow transcription, we can say "this is how a Cockney would pronounce the word." Phonemic transcription allows variation: the same record between slashes may sound different in Australian and Canadian English. In truth, even narrow transcription is far from unambiguous, i.e. still quite far from a wav file: male, female, and children's voices pronounce the same phoneme differently, and the overall speech rate, loudness, and base pitch of the voice are not captured either. It is precisely these differences that make speech synthesis and recognition nontrivial. Further in the text I will always use IPA in narrow transcription unless stated otherwise, and at the same time I will try to keep direct use of IPA to a minimum.

4. Languages


Each living natural language has its own set of phonemes. More precisely, this is a property of speech, since, generally speaking, one can know a language without being able to pronounce its words (this is how the deaf and mute learn languages). The phonetic makeup of languages differs, much as their alphabets do, and so does their phonetic complexity. It consists of two components: first, the difficulty of converting graphemes into phonemes (remember how the English "write Manchester and read Liverpool"), and second, the difficulty of pronouncing the sounds (phonemes) themselves. How many phonemes does a language usually contain? A few dozen. From childhood we were taught that Russian pronunciation is dead simple and everything is read as written, unlike in European languages. Of course, we were deceived! If you read Russian words literally as they are written, you will be understood, though not always correctly, and you certainly will not be taken for a Russian. In addition, there is such a horror for Europeans as free stress: instead of sitting at the beginning of a word (as with the English) or at the end (as with the French), it wanders all over the word wherever it pleases, changing the meaning as it goes. Thus dórogi ("are dear") and dorógi ("roads") are two different words, and even different parts of speech. So how many phonemes are there in Russian? Nuance counts 54 of them. For comparison, English has only 45 phonemes, and French even fewer, 34. No wonder that a couple of centuries ago the aristocracy considered French an easy language to learn! Of course, Russian is not the most difficult language in Europe, but it is one of the more difficult ones (and note that I am saying nothing about grammar yet).

5. X-SAMPA and LH+


Since people wanted to type phonetic transcriptions on a keyboard long before Unicode became widespread, notations were developed that make do with the characters of the ASCII table. The two most common are X-SAMPA, the creation of Professor John Wells, and LH+, the internal format of Lernout & Hauspie, whose technology was later bought by Nuance Communications. There is a significant difference between the two. Formally, X-SAMPA is just a notation that, by certain rules, records the very same IPA phonemes using only ASCII. LH+ is another matter. In a sense, LH+ is an analogue of broad (phonemic) transcription: in each language, the same LH+ symbol may denote a different IPA phoneme. On the one hand this is good, since records get shorter and there is no need to encode every possible IPA character; on the other hand, ambiguity arises, and each time you translate into IPA you must keep a correspondence table in front of you. The saddest part, though, is that a string recorded in LH+ can be pronounced correctly only by a "voice" for the corresponding language.
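To make the difference tangible, here is a small hand-made illustration (it is not from any vendor's table): since X-SAMPA is plain ASCII, transcriptions fit into ordinary C string literals. The mapping below follows Wells' published IPA-to-X-SAMPA correspondence; an LH+ column is deliberately omitted, because LH+ symbols are only meaningful together with the table of a specific language.

// X-SAMPA in ordinary C strings: " marks primary stress, V = IPA ʌ, @ = IPA ə, R = IPA ʁ
static const struct {
    const char *word;
    const char *xsampa;
} _samples[] = {
    { "London", "\"lVnd@n" },   // IPA [ˈlʌndən]
    { "Paris",  "paRi"     },   // IPA [paʁi], the French pronunciation from above
};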

6. Voices


No, this is not about the voices that programmers who wrote too much bad code in the past often hear in their heads. Rather, it is about those that owners of navigators and other mobile devices so often hunt for on trackers and file dumps. These voices even have names: the words "Milena" and "Katerina" say a lot to an experienced user of voice interfaces. What are they? Roughly speaking, they are data sets prepared by various companies (such as Nuance) that allow a computer to convert phonemes into sound. Voices come male and female, and they cost serious money: depending on the platform and the developer, you may be asked to pay 2-5 thousand dollars per voice. So if you want an interface in even the 5 most common European languages, the bill can run to tens of thousands (this is, of course, about the programmatic interface). A voice, then, is language-specific, and from this follows its binding to a phonetic transcription. This is not easy to grasp at first, but the joke at the beginning of the article is the honest truth: people sharing a mother tongue are usually simply unable to pronounce phonemes of another language that are absent from their own, and, even worse, not only individual phonemes but also certain combinations of them. If in your language a word never ends in a soft "l", then you will not be able to pronounce one (at first, anyway).

The same goes for voices. A voice is designed to pronounce only those phonemes that exist in its language, and moreover in a specific dialect of that language. That is, voices for Canadian French and for the French of France will not only sound different, they will have different sets of pronounceable phonemes. Incidentally, this is convenient for the makers of ASR and TTS engines, since each language can be sold for separate money. On the other hand, one can understand them: creating a voice is quite labor-intensive and expensive. Perhaps this is precisely why widespread Open Source solutions still do not exist for most languages.

It would seem that nothing prevents creating a "universal" voice able to pronounce all IPA phonemes and thus solve the problem of multilingual interfaces once and for all. But for some reason nobody does it, and most likely it is impossible. That is, such a voice could speak, but every native listener would be unhappy with the lack of "naturalness" in its pronunciation: it would sound like Russian in the mouth of an Englishman with little practice, or English in the mouth of a Frenchman. So if you want multilingualism, get ready to pay up.

7. TTS API Example


To give the reader an idea of how working with TTS looks at the lower level (C++ is used), I will give an example of speech synthesis based on the Nuance engine. Of course, this is an incomplete example: it cannot even be compiled, let alone run, but it gives a feel for the process. All functions other than TTS_Speak() exist as scaffolding around it.

TTS_Initialize() - initializes the engine
TTS_Cleanup() - de-initializes it
TTS_SelectLanguage() - selects a language and sets the synthesis parameters
TTS_Speak() - actually generates the sound samples
TTS_Callback() - called when the next portion of audio data is ready for playback, as well as on other events

TTS and its binding
static const NUAN_TCHAR * _dataPathList[] = {
    __TEXT("\\lang\\"),
    __TEXT("\\tts\\"),
};
static VPLATFORM_RESOURCES _stResources = {
    VPLATFORM_CURRENT_VERSION,
    sizeof(_dataPathList)/sizeof(_dataPathList[0]),
    (NUAN_TCHAR **)&_dataPathList[0],
};
static VAUTO_INSTALL    _stInstall = {VAUTO_CURRENT_VERSION};
static VAUTO_HSPEECH    _hSpeech = {NULL, 0};
static VAUTO_HINSTANCE  _hTtsInst = {NULL, 0};
static WaveOut *        _waveOut = NULL;
static WaveOutBuf *     _curBuffer = NULL;
static int              _volume = 100;
static int              _speechRate = 0; // use default speech rate
static NUAN_ERROR TTS_Callback (VAUTO_HINSTANCE        hTtsInst,
                                   VAUTO_OUTDEV_HINSTANCE hOutDevInst,
                                   VAUTO_CALLBACKMSG    * pcbMessage,
                                   VAUTO_USERDATA         UserData);
static const TCHAR * _szLangTLW = NULL;
static VAUTO_PARAMID _paramID[] = {
    VAUTO_PARAM_SPEECHRATE,
    VAUTO_PARAM_VOLUME
};
static NUAN_ERROR _TTS_GetFrequency(VAUTO_HINSTANCE hTtsInst, short *pFreq) {
    NUAN_ERROR  Error = NUAN_OK;
    VAUTO_PARAM TtsParam;
    /*-- get frequency used by current voicefont --*/
    TtsParam.eID = VAUTO_PARAM_FREQUENCY;
    if (NUAN_OK != (Error = vauto_ttsGetParamList (hTtsInst, &TtsParam, 1)) ) {
        ErrorV(_T("vauto_ttsGetParamList rc=0x%1!x!\n"), Error);
        return Error;
    }
    switch(TtsParam.uValue.usValue)
    {
    case VAUTO_FREQ_8KHZ:  *pFreq = 8000;
        break;
    case VAUTO_FREQ_11KHZ: *pFreq = 11025;
        break;
    case VAUTO_FREQ_16KHZ: *pFreq = 16000;
        break;
    case VAUTO_FREQ_22KHZ: *pFreq = 22050;
        break;
    default: break;
    }
    return NUAN_OK;
}
int TTS_SelectLanguage(int langId) {
    NUAN_ERROR nrc;
    VAUTO_LANGUAGE     arrLanguages[16];
    VAUTO_VOICEINFO    arrVoices[4];
    VAUTO_SPEECHDBINFO arrSpeechDB[4];
    NUAN_U16 nLanguageCount, nVoiceCount, nSpeechDBCount;
    nLanguageCount = sizeof(arrLanguages)/sizeof(arrLanguages[0]);
    nVoiceCount    = sizeof(arrVoices)   /sizeof(arrVoices[0]);
    nSpeechDBCount = sizeof(arrSpeechDB)/sizeof(arrSpeechDB[0]);
    int nVoice = 0, nSpeechDB = 0;
    nrc = vauto_ttsGetLanguageList( _hSpeech, &arrLanguages[0], &nLanguageCount);
    if(nrc != NUAN_OK){
        TTS_ErrorV(_T("vauto_ttsGetLanguageList rc=0x%1!x!\n"), nrc);
        return 0;
    }
    if(nLanguageCount == 0 || nLanguageCount<=langId){
        TTS_Error(_T("vauto_ttsGetLanguageList: No proper languages found.\n"));
        return 0;
    }
    _szLangTLW = arrLanguages[langId].szLanguageTLW;
    NUAN_TCHAR* szLanguage      = arrLanguages[langId].szLanguage;
    nVoice = 0; // select first voice (the call that fills arrVoices was cut from this excerpt)
    NUAN_TCHAR* szVoiceName      = arrVoices[nVoice].szVoiceName;
    nSpeechDB = 0; // select first speech DB (arrSpeechDB is likewise left unfilled here)
    {
        VAUTO_PARAM stTtsParam[7];
        int cnt = 0;
        // language
        stTtsParam[cnt].eID = VAUTO_PARAM_LANGUAGE;
        _tcscpy(stTtsParam[cnt].uValue.szStringValue, szLanguage);
        cnt++;
        // voice
        stTtsParam[cnt].eID = VAUTO_PARAM_VOICE;
        _tcscpy(stTtsParam[cnt].uValue.szStringValue, szVoiceName);
        cnt++;
        // speechbase parameter - frequency
        stTtsParam[cnt].eID = VAUTO_PARAM_FREQUENCY;
        stTtsParam[cnt].uValue.usValue = arrSpeechDB[nSpeechDB].u16Freq;
        cnt++;
        // speechbase parameter - reduction type
        stTtsParam[cnt].eID = VAUTO_PARAM_VOICE_MODEL;
        _tcscpy(stTtsParam[cnt].uValue.szStringValue, arrSpeechDB[nSpeechDB].szVoiceModel);
        cnt++;
        if (_speechRate) {
            // Speech rate
            stTtsParam[cnt].eID = VAUTO_PARAM_SPEECHRATE;
            stTtsParam[cnt].uValue.usValue = _speechRate;
            cnt++;
        }
        if (_volume) {
            // Speech volume
            stTtsParam[cnt].eID = VAUTO_PARAM_VOLUME;
            stTtsParam[cnt].uValue.usValue = _volume;
            cnt++;
        }
        nrc = vauto_ttsSetParamList(_hTtsInst, &stTtsParam[0], cnt);
        if(nrc != NUAN_OK){
            ErrorV(_T("vauto_ttsSetParamList rc=0x%1!x!\n"), nrc);
            return 0;
        }
    }
    return 1;
}
int TTS_Initialize(int defLanguageId) {
    NUAN_ERROR nrc;
    nrc = vplatform_GetInterfaces(&_stInstall, &_stResources);
    if(nrc != NUAN_OK){
        Error(_T("vplatform_GetInterfaces rc=%1!d!\n"), nrc);
        return 0;
    }
    nrc = vauto_ttsInitialize(&_stInstall, &_hSpeech);
    if(nrc != NUAN_OK){
        Error(_T("vauto_ttsInitialize rc=0x%1!x!\n"), nrc);
        TTS_Cleanup();
        return 0;
    }
    nrc =  vauto_ttsOpen(_hSpeech, _stInstall.hHeap, _stInstall.hLog, &_hTtsInst, NULL);
    if(nrc != NUAN_OK){
        ErrorV(_T("vauto_ttsOpen rc=0x%1!x!\n"), nrc);
        TTS_Cleanup();
        return 0;
    }
    // Ok, time to select language
    if(!TTS_SelectLanguage(defLanguageId)){
        TTS_Cleanup();
        return 0;
    }
    // init Wave out device
    {
        short freq;
        if (NUAN_OK != (nrc = _TTS_GetFrequency(_hTtsInst, &freq)))
        {
            TTS_ErrorV(_T("_TTS_GetFrequency rc=0x%1!x!\n"), nrc);
            TTS_Cleanup();
            return 0;
        }
        _waveOut = WaveOut_Open(freq, 1, 4);
        if (_waveOut == NULL){
            TTS_Cleanup();
            return 0;
        }
    }
    // init TTS output
    {
        VAUTO_OUTDEVINFO stOutDevInfo;
        stOutDevInfo.hOutDevInstance = _waveOut;
        stOutDevInfo.pfOutNotify = TTS_Callback;              // Notify using callback!
        nrc = vauto_ttsSetOutDevice(_hTtsInst, &stOutDevInfo);
        if(nrc != NUAN_OK){
            ErrorV(_T("vauto_ttsSetOutDevice rc=0x%1!x!\n"), nrc);
            TTS_Cleanup();
            return 0;
        }
    }
    // OK TTS engine initialized
    return 1;
}
void TTS_Cleanup(void) {
    if(_hTtsInst.pHandleData){
        vauto_ttsStop(_hTtsInst);
        vauto_ttsClose(_hTtsInst);
    }
    if(_hSpeech.pHandleData){
        vauto_ttsUnInitialize(_hSpeech);
    }
    if(_waveOut){
        WaveOut_Close(_waveOut);
        _waveOut = NULL;
    }
    vplatform_ReleaseInterfaces(&_stInstall);
    memset(&_stInstall, 0, sizeof(_stInstall));
    _stInstall.fmtVersion = VAUTO_CURRENT_VERSION;
}
int  TTS_Speak(const TCHAR * const message, int length) {
    VAUTO_INTEXT stText;
    stText.eTextFormat  = VAUTO_NORM_TEXT;
    stText.szInText     = (void*) message;
    stText.ulTextLength = length * sizeof(NUAN_TCHAR);
    TraceV(_T("TTS_Speak: %1\n"), message);
    NUAN_ERROR rc = vauto_ttsProcessText2Speech(_hTtsInst, &stText);
    if (rc == NUAN_OK) {
        return 1;
    }
    if (rc == NUAN_E_TTS_USERSTOP) {
        return 2;
    }
    ErrorV(_T("vauto_ttsProcessText2Speech rc=0x%1!x!\n"), rc);
    return 0;
}
static NUAN_ERROR TTS_Callback (VAUTO_HINSTANCE        hTtsInst,
                                   VAUTO_OUTDEV_HINSTANCE hOutDevInst,
                                   VAUTO_CALLBACKMSG    * pcbMessage,
                                   VAUTO_USERDATA         UserData) {
    VAUTO_OUTDATA * outData;
    switch(pcbMessage->eMessage){
    case VAUTO_MSG_BEGINPROCESS:
        WaveOut_Start(_waveOut);
        break;
    case VAUTO_MSG_ENDPROCESS:
        break;
    case VAUTO_MSG_STOP:
        break;
    case VAUTO_MSG_OUTBUFREQ:
        outData = (VAUTO_OUTDATA *)pcbMessage->pParam;
        memset(outData, 0, sizeof(VAUTO_OUTDATA));
        {
            WaveOutBuf * buf = WaveOut_GetBuffer(_waveOut);
            if(buf){
                VAUTO_OUTDATA * outData = (VAUTO_OUTDATA *)pcbMessage->pParam;
                outData->eAudioFormat     = VAUTO_16LINEAR;
                outData->pOutPcmBuf       = WaveOutBuf_Data(buf);
                outData->ulPcmBufLen      = WaveOutBuf_Size(buf);
                _curBuffer = buf;
                break;
            }
            TTS_Trace(_T("VAUTO_MSG_OUTBUFREQ: processing was stopped\n"));
        }
        return NUAN_E_TTS_USERSTOP;
    case VAUTO_MSG_OUTBUFDONE:
        outData = (VAUTO_OUTDATA *)pcbMessage->pParam;
        WaveOutBuf_SetSize(_curBuffer, outData->ulPcmBufLen);
        WaveOut_PutBuffer(_waveOut, _curBuffer);
        _curBuffer = NULL;
        break;
    default:
        break;
    }
    return NUAN_OK;
}



As the reader can see, the code is rather cumbersome, and seemingly simple functionality requires a lot of setup. Alas, that is the flip side of the engine's flexibility. Of course, the APIs of other engines may be considerably simpler and more compact.
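To show how these pieces fit together, here is a minimal hypothetical driver (it is not part of the original listing; the message text and the language index are arbitrary):

// Hypothetical usage sketch: the call order of the binding above.
int main() {
    if (!TTS_Initialize(0))            // 0 = first language reported by the engine
        return 1;
    const TCHAR *msg = _T("Turn left in two hundred meters");
    TTS_Speak(msg, (int)_tcslen(msg)); // audio is delivered through TTS_Callback()
    // ... a real application would wait here until the WaveOut queue drains ...
    TTS_Cleanup();
    return 0;
}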

8. Phonemes again


Looking at the API, the reader may ask: why do we need phonemes at all if TTS can convert text to speech directly? It can, but there is one "but": the engine copes with "familiar" words, while things are much worse with "unfamiliar" ones, such as toponyms, proper names, and so on. This is especially noticeable in multinational countries such as Russia. The names of cities and towns across one-sixth of the world's landmass were given by different peoples, in different languages, at different times. The need to write them down in Russian letters played a cruel joke on the national languages: the phonemes of the Tatars, Nenets, Abkhazians, Kazakhs, Yakuts, and Buryats were squeezed into the Procrustean bed of Russian, which, rich in phonemes though it is, is still not rich enough to convey all the languages of the peoples of the former Union. Even worse: while the written form is usually at least somewhat similar to the original, a TTS engine reading out a name like "Kuchuk-Kainardzhi" provokes nothing but laughter.

However, it would be naive to think this is a problem of the Russian language alone. Similar difficulties exist in countries far more homogeneous in population. In French, for example, the letters p, b, d, t, s at the end of a word are usually not read. But take place names, and local traditions come into force: in the word "Paris" the final 's' is indeed not pronounced, while in "Vallauris" it is. The difference is that Paris lies in the north of France, while Vallauris is in the south, in Provence, where the pronunciation rules differ somewhat. That is why it is still desirable to have a phonetic transcription for each word. Navigation maps usually ship with transcriptions, though there is no unity of format: NavTeq traditionally uses X-SAMPA, while TomTom uses LH+. Fine if your TTS engine accepts both; but what if not? Then you have to resort to contortions, such as converting one transcription into the other, which is nontrivial in itself. If there is no phonetic information at all, the engine has its own methods of deriving it; for the Nuance engine these are "Data Driven Grapheme To Phoneme" (DDG2P) and the "Common Linguistic Component" (CLC). But resorting to those options is a last-resort measure.
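To give a feel for what such a conversion involves, here is a minimal sketch of a table-driven converter. The mapping entries are invented placeholders, not a real X-SAMPA-to-LH+ table: a real one is language-specific and must come from the vendor's documentation.

#include <string.h>
#include <string>

struct PhonemeMap { const char *from; const char *to; };

// Placeholder entries only. Order matters: longer source symbols must come
// first so that the greedy match below does not stop at their prefixes.
static const PhonemeMap _map[] = {
    { "\"", "'"  },   // hypothetical mapping for the stress mark
    { "V",  "^"  },   // hypothetical
    { "@",  "e-" },   // hypothetical
};

std::string ConvertTranscription(const char *src) {
    std::string out;
    while (*src) {
        bool matched = false;
        for (size_t i = 0; i < sizeof(_map)/sizeof(_map[0]); ++i) {
            size_t n = strlen(_map[i].from);
            if (strncmp(src, _map[i].from, n) == 0) {
                out += _map[i].to;
                src += n;
                matched = true;
                break;
            }
        }
        if (!matched) out += *src++;  // symbols identical in both notations pass through
    }
    return out;
}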

9. Special sequences


Nuance provides the ability not only to pronounce text or a phonetic record, but also to switch between the two dynamically. This is done with an escape sequence embedded right in the input string.

More generally, escape sequences let you set many parameters. The general form looks like this:
\x1b\<parameter>=<value>\

For example,

\x1b\rate=110\ - sets the pronunciation speed
\x1b\vol=5\ - sets the volume
\x1b\audio="beep.wav"\ - inserts data from a wav file into the audio stream

Similarly, you can make the engine spell a word out letter by letter, insert pauses, change the voice (say, from male to female), and much more. Not every sequence will be useful to you, but on the whole this is a very handy feature.
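For instance, here is a hypothetical sketch built on TTS_Speak() from section 7 (note the doubled backslashes that C string literals require):

// Control sequences are simply embedded in the text handed to the engine.
const TCHAR *msg =
    _T("\x1b\\rate=110\\ \x1b\\vol=5\\ ")       // faster and quieter than the defaults
    _T("Turn right after ")
    _T("\x1b\\audio=\"beep.wav\"\\");           // splice beep.wav into the audio stream
TTS_Speak(msg, (int)_tcslen(msg));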

10. Dictionaries


Sometimes a certain set of words (abbreviations, acronyms, proper names, etc.) needs to be pronounced in a specific way, but you do not want to replace the text with a phonetic transcription every time (and it is not always possible). In such cases dictionaries come to the rescue. What is a dictionary in Nuance terminology? It is a file with a set of pairs: <text> <transcription>. The file is compiled and then loaded by the engine. When speaking, the engine checks whether a word is present in the dictionary and, if it is, replaces it with its phonetic transcription. For example, here is a dictionary containing the names of the streets and squares of the Vatican:

[Header]
Name = Vaticano
Language = ITI
Content = EDCT_CONTENT_BROAD_NARROWS
Representation = EDCT_REPR_SZZ_STRING
[Data]
"Largo del Colonnato" // 'lar.go_del_ko.lo.'n: a.to
"Piazza del Governatorato" // 'pja.t & s: a_del_go.ver.na.to.'ra.to
"Piazza della Stazione" // 'pja.t & s: a_de.l: a_sta.'t & s: jo.ne
"Piazza di Santa Marta" // 'pja.t & s: a_di_'san.ta_'mar.ta
"Piazza San Pietro" // 'pja.t & s: a_'sam_'pjE.tro
"Piazzetta Châteauneuf Du Pape" // pja.'t & s: et: a_Sa.to.'nef_du_'pap
"Salita ai Giardini" // sa.'li.ta_aj_d & Zar.'di.ni
"Stradone dei Giardini" // stra.'do.ne_dej_d & Zar.'di.ni
"Via dei Pellegrini" // 'vi.a_dej_pe.l: e.'gri.ni
"Via del Fondamento" // 'vi.a_del_fon.da.'men.to
"Via del Governatorato" // 'vi.a_del_go.ver.na.to.'ra.to
"Via della Posta" // 'vi.a_de.l: a_'pOs.ta
"Via della Stazione Vaticana" // 'vi.a_de.l: a_sta.'t & s: jo.ne_va.ti.'ka.na
"Via della Tipografia" // 'vi.a_de.l: a_ti.po.gra.'fi.a
"Via di Porta Angelica" // 'vi.a_di_'pOr.ta_an.'d & ZE.li.ka
"Via Tunica" // 'vi.a_'tu.ni.ka
"Viale Centro del Bosco" // vi.'a.le_'t & SEn.tro_del_'bOs.ko
"Viale del Giardino Quadrato" // vi.'a.le_del_d & Zar.'di.no_kwa.'dra.to
"Viale Vaticano" // vi.'a.le_va.ti.'ka.no


11. Recognition


Speech recognition is an even more challenging task than speech synthesis. While synthesizers worked tolerably even in the good old days, sensible recognition has become available only recently. There are several reasons for this; the first closely resembles the problems of an ordinary living person confronted with an unfamiliar language, the second is the collision with text from an unfamiliar domain.

Perceiving sound vibrations that resemble a voice, we first try to split them into phonemes, to isolate familiar sounds that can be assembled into words. If the language is familiar, this comes easily; if not, then most likely we will not even manage to "decompose" the speech into phonemes correctly (remember the story about mishearing foreign lyrics as "Alla, I'm at the bar!"). Where we hear one thing, the speaker means quite another. This happens because over the years our brain is "trained" on a particular set of phonemes, and with time grows used to perceiving only them. Confronted with an unfamiliar sound, it tries to pick the closest phoneme of our native language(s). In a way, this resembles the vector quantization technique used in speech codecs such as CELP. There is no guarantee that such an approximation will be successful, which is why "native" phonemes will always be more "comfortable" for us.

Remember how, back in the Soviet Union, studying at school and meeting foreigners, we tried to "transliterate" our names, saying:
"May name from Boris Petroff."
The teachers scolded us for it: why distort your name? Do you think he will understand it better that way? Speak Russian!

Alas, here too they deceived us, or were themselves mistaken... If you can pronounce your name in the English/German/Chinese manner, a native speaker really will perceive it more easily. The Chinese understood this long ago, and take special "European" names for communicating with Western partners. In machine recognition, a particular language is described by a so-called acoustic model. Before recognizing text, we must load the acoustic model of a given language, thereby telling the program what phonemes to expect at its input.

The second problem is no less complicated. Let us return to the analogy with a living person. Listening to an interlocutor, we subconsciously build a model of what he will say next; in other words, we create the context of the conversation. If a word falling outside that context is SUDDENLY inserted into the narrative (say, "involute" in a conversation about football), it can cause the listener cognitive dissonance. Roughly speaking, a computer experiences this very dissonance constantly, because it never knows what to expect from a person. A person has it easier: he can simply ask again. But what is a computer to do? To solve this problem and give the computer the right context, grammars are used.

12. Grammar


Grammars (usually written in BNF form) give the computer (more precisely, the ASR engine) an idea of what to expect from the user at this particular moment. Usually these are several alternatives combined with 'or', though more complex grammars are possible. Here is an example grammar for choosing Kazan metro stations (the station names are transliterated here; the !pronounce strings give their Russian pronunciation in the engine's phonetic notation):

#BNF+EM V1.0;
!grammar test;
!start <station>;
<station>:
"Ametyevo" !id(0) !pronounce("^.'m%je.t%jjI.vo-") |
"Aviastroitelnaya" !id(1) !pronounce("^v%jI'astro-'it%jIl%jno-j^") |
"Gorki" !id(2) !pronounce("'gor.k%jI") |
"Kozya Sloboda" !id(3) !pronounce("'ko.z%jj^_slo-.b^.'da") |
"Kremlyovskaya" !id(4) !pronounce("kr%jIm.'l%jof.sko-.j^") |
"Ploshchad Gabdully Tukaya" !id(5) !pronounce("'plo.S%jIt%j_go-.bdu.'li0_'tu.ko-.j^") |
"Prospekt Pobedy" !id(6) !pronounce("pr^.'sp%jekt_p^.'b%je.di0") |
"Severny Vokzal" !id(7) !pronounce("'s%je.v%jIr.ni0j_v^g.'zal") |
"Sukonnaya Sloboda" !id(8) !pronounce("'su.ko-.no-.j^_slo-.b^.'da") |
"Yashlek" !id(9) !pronounce("ja.'Sl%jek");


As you can see, each line is one of the alternatives, consisting of the text itself, an integer id, and a transcription. The transcription is generally optional, but recognition is more accurate with it.

How big can a grammar be? Quite big: in our experiments, 37 thousand alternatives were still recognized at an acceptable level. Things are much worse with complex, branching grammars. Recognition time grows, quality falls, and the dependence on grammar size is nonlinear. So my advice is to avoid complex grammars. At least for now.

Grammars (and with them contexts) can be static or dynamic. You have already seen an example of a static grammar: it is compiled in advance and stored in the engine's internal binary representation. Sometimes, however, the context changes in the course of user interaction. A typical navigation example is selecting a city by its first letters: the set of possible recognition options changes with every letter entered, so the recognition context must be rebuilt constantly. Dynamic contexts exist for exactly this purpose. Roughly speaking, the programmer compiles grammars "on the fly" and feeds them to the engine right as the program runs. Of course, if we are talking about a mobile device, the processing speed will not be very high.
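As an illustration, here is a hypothetical sketch of assembling such a grammar source from the list of city names matching the letters typed so far. The header mirrors the example from the previous section; the engine call that compiles the resulting string is omitted, since it is API-specific.

#include <string>
#include <vector>

// Build a BNF+EM grammar with one alternative per matching city.
std::string BuildCityGrammar(const std::vector<std::string> &cities) {
    std::string g = "#BNF+EM V1.0;\n!grammar cities;\n!start <city>;\n<city>:\n";
    for (size_t i = 0; i < cities.size(); ++i) {
        g += "\"" + cities[i] + "\" !id(" + std::to_string(i) + ")";
        g += (i + 1 < cities.size()) ? " |\n" : ";\n";
    }
    return g;  // hand this string to the engine's dynamic-context compiler
}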

13. ASR API Example


Speech recognition is not as straightforward a process as synthesis. If the user simply stays silent in front of the microphone, we will end up recognizing the surrounding noise. If he says something like "ehhhhh", recognition will also most likely fail. In the general case, ASR returns a set of options (also called hypotheses), each with a certain weight. If the grammar is large, there may be quite a few of them. In that case it makes sense to read the hypotheses out one by one (say, the first five in descending order of confidence) and ask the user to pick one. In the ideal case of a short grammar ("yes" | "no"), we will get back a single option with a high confidence score.
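In code, that selection logic might look like the following engine-independent sketch (the Hypothesis structure and the threshold are assumptions for illustration, not part of the Nuance API):

#include <string>
#include <vector>

struct Hypothesis { std::string text; unsigned long conf; };

// Returns 0: accept nbest[0]; 1: let the user choose from *askUser; -1: re-prompt.
// nbest is assumed to be sorted by confidence, highest first.
int PickResult(const std::vector<Hypothesis> &nbest,
               unsigned long acceptThreshold,
               std::vector<Hypothesis> *askUser) {
    if (nbest.empty()) return -1;                    // nothing recognized at all
    if (nbest[0].conf >= acceptThreshold) return 0;  // confident enough to take as-is
    size_t n = nbest.size() < 5 ? nbest.size() : 5;  // offer at most five options
    askUser->assign(nbest.begin(), nbest.begin() + n);
    return 1;
}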

The example below contains the following functions:

ConstructRecognizer() - creates the "recognizer" and configures its parameters
DestroyRecognizer() - destroys the "recognizer"
ASR_Initialize() - initializes the ASR engine
ASR_UnInitialize() - de-initializes the ASR engine
evt_HandleEvent() - handles events generated by the recognizer thread
ProcessResult() - processes (prints out) the recognition results

ASR and its binding
typedef struct RECOG_OBJECTS_S {
    void             *pHeapInst;          // Pointer to the heap.
    const char	   *acmod;				  // path to acmod data
    const char	   *ddg2p;				  // path to ddg2p data
    const char	   *clc;				  // path to clc data
    const char	   *dct;				  // path to dct data
    const char	   *dynctx;				  // path to empty dyn ctx data
    LH_COMPONENT      hCompBase;          // Handle to the base component.
    LH_COMPONENT      hCompAsr;           // Handle to the ASR component.
    LH_COMPONENT      hCompPron;		  // Handle to the pron component (dyn ctx)
    LH_OBJECT         hAcMod;             // Handle to the AcMod object.
    LH_OBJECT         hRec;				  // Handle to the SingleThreadedRec Object
    LH_OBJECT         hLex;				  // Handle to lexicon object (dyn ctx)
    LH_OBJECT         hDdg2p;			  // Handle to ddg2p object (dyn ctx)
    LH_OBJECT		  hClc;				  // Handle to the CLC (DDG2P backup)
    LH_OBJECT         hDct;				  // Handle to dictionary object (dyn ctx)
    LH_OBJECT         hCache;			  // Handle to cache object (dyn ctx)
    LH_OBJECT         hCtx[5];            // Handle to the Context object.
    LH_OBJECT         hResults[5];        // Handle to the Best results object.
    ASRResult        *results[5];         // recognition results temporary storage
    LH_OBJECT         hUswCtx;            // Handle to the UserWord Context object.
    LH_OBJECT         hUswResult;         // Handle to the UserWord Result object.
    unsigned long     sampleFreq;         // Sampling frequency.
    unsigned long     frameShiftSamples;  // Size of one frame in samples
    int               requestCancel;      // boolean indicating user wants to cancel recognition
    // used to generate transcriptions for dyn ctx
    LH_BNF_TERMINAL		*pTerminals;
    unsigned int		terminals_count;
    unsigned int		*terminals_transtype; // array with same size as pTerminals; each value indicates the type of transcription in pTerminal: user-provided, from_ddg2p, from_dct, from_clc
    SLOT_TERMINAL_LIST	*pSlots;
    unsigned int		slots_count;
    // reco options
    int			isNumber;	// set to 1 when doing number recognition
    const char *		UswFile;	// path to file where userword should be recorded
    char * staticCtxID;
} RECOG_OBJECTS;
// store ASR objects
static RECOG_OBJECTS recogObjects;
// forward declarations for functions defined (or referenced) below;
// InitDDG2P() and InitCLCandDCT() themselves did not make it into this excerpt
static int DestroyRecognizer(RECOG_OBJECTS *pRecogObjects);
static int ProcessResult(RECOG_OBJECTS *pRecogObjects);
static int InitDDG2P(RECOG_OBJECTS *pRecogObjects);
static int InitCLCandDCT(RECOG_OBJECTS *pRecogObjects);
static int ConstructRecognizer(RECOG_OBJECTS *pRecogObjects,
                               const char *szAcModFN, const char * ddg2p, const char * clc, const char * dct, const char * dynctx) {
    LH_ERROR lhErr = LH_OK;
    PH_ERROR phErr = PH_OK;
    ST_ERROR stErr = ST_OK;
    LH_ISTREAM_INTERFACE  IStreamInterface;
    void                 *pIStreamAcMod = NULL;
    LH_ACMOD_INFO        *pAcModInfo;
    LH_AUDIOCHAINEVENT_INTERFACE    EventInterface;
    /* close old objects */
    if(!lh_ObjIsNull(pRecogObjects->hAcMod)){
        DestroyRecognizer(pRecogObjects);
    }
    pRecogObjects->sampleFreq      = 0;
    pRecogObjects->requestCancel   = 0;
    pRecogObjects->pTerminals      = NULL;
    pRecogObjects->terminals_count = 0;
    pRecogObjects->pSlots          = NULL;
    pRecogObjects->slots_count     = 0;
    pRecogObjects->staticCtxID     = NULL;
    pRecogObjects->acmod  = szAcModFN;
    pRecogObjects->ddg2p  = ddg2p;
    pRecogObjects->clc	  = clc;
    pRecogObjects->dct	  = dct;
    pRecogObjects->dynctx = dynctx;
    EventInterface.pfevent = evt_HandleEvent;
    EventInterface.pfadvance = evt_Advance;
    // Create the input stream for the acoustic model.
    stErr = st_CreateStreamReaderFromFile(szAcModFN, &IStreamInterface, &pIStreamAcMod);
    if (ST_OK != stErr) goto error;
    // Create the AcMod object.
    lhErr = lh_CreateAcMod(pRecogObjects->hCompAsr, &IStreamInterface, pIStreamAcMod, NULL, &(pRecogObjects->hAcMod));
    if (LH_OK != lhErr) goto error;
    // Retrieve some information from the AcMod object.
    lhErr = lh_AcModBorrowInfo(pRecogObjects->hAcMod, &pAcModInfo);
    if (LH_OK != lhErr) goto error;
    pRecogObjects->sampleFreq = pAcModInfo->sampleFrequency;
        pRecogObjects->frameShiftSamples = pAcModInfo->frameShift * pRecogObjects->sampleFreq/1000;
    // Create a SingleThreadRec object
    lhErr = lh_CreateSingleThreadRec(pRecogObjects->hCompAsr, &EventInterface, pRecogObjects, 3000, pRecogObjects->sampleFreq, pRecogObjects->hAcMod, &pRecogObjects->hRec);
    if (LH_OK != lhErr) goto error;
    // create DDG2P & lexicon for dyn ctx
    if (pRecogObjects->ddg2p) {
        int rc = InitDDG2P(pRecogObjects);
        if (rc<0) goto error;
    } else if (pRecogObjects->clc) {
        int rc = InitCLCandDCT(pRecogObjects);
        if (rc<0) goto error;
    } else {
        // TODO: what now?
    }
    // Return without errors.
    return 0;
error:
    // Print an error message if the error comes from the private heap or stream component.
    // Errors from the VoCon3200 component have been printed by the callback.
    if (PH_OK != phErr) {
        printf("Error from the private heap component, error code = %d.\n", phErr);
    }
    if (ST_OK != stErr) {
        printf("Error from the stream component, error code = %d.\n", stErr);
    }
    return -1;
}
static int DestroyRecognizer(RECOG_OBJECTS *pRecogObjects) {
    unsigned int curCtx;
    if (!lh_ObjIsNull(pRecogObjects->hUswResult)){
        lh_ObjClose(&pRecogObjects->hUswResult); pRecogObjects->hUswResult = lh_GetNullObj();
    }
    if (!lh_ObjIsNull(pRecogObjects->hUswCtx)){
        lh_ObjClose(&pRecogObjects->hUswCtx); pRecogObjects->hUswCtx = lh_GetNullObj();
    }
    if (!lh_ObjIsNull(pRecogObjects->hDct)){
        lh_ObjClose(&pRecogObjects->hDct); pRecogObjects->hDct = lh_GetNullObj();
    }
    if (!lh_ObjIsNull(pRecogObjects->hCache)){
        lh_ObjClose(&pRecogObjects->hCache); pRecogObjects->hCache = lh_GetNullObj();
    }
    if (!lh_ObjIsNull(pRecogObjects->hClc)){
        lh_ObjClose(&pRecogObjects->hClc); pRecogObjects->hClc = lh_GetNullObj();
    }
    if (!lh_ObjIsNull(pRecogObjects->hLex)){
        lh_LexClearG2P(pRecogObjects->hLex);
        lh_ObjClose(&pRecogObjects->hLex); pRecogObjects->hLex = lh_GetNullObj();
    }
    if (!lh_ObjIsNull(pRecogObjects->hDdg2p)){
        lh_DDG2PClearDct (pRecogObjects->hDdg2p);
        lh_ObjClose(&pRecogObjects->hDdg2p); pRecogObjects->hDdg2p = lh_GetNullObj();
    }
    for(curCtx=0; curCtx<sizeof(pRecogObjects->hCtx)/sizeof(pRecogObjects->hCtx[0]); curCtx++){
        if (!lh_ObjIsNull(pRecogObjects->hCtx[curCtx])){
            lh_RecRemoveCtx(pRecogObjects->hRec, pRecogObjects->hCtx[curCtx]);
            lh_ObjClose(&pRecogObjects->hCtx[curCtx]); pRecogObjects->hCtx[curCtx] = lh_GetNullObj();
        }
        if (!lh_ObjIsNull(pRecogObjects->hResults[curCtx])){
            lh_ObjClose(&pRecogObjects->hResults[curCtx]); pRecogObjects->hResults[curCtx] = lh_GetNullObj();
        }
    }
    if (!lh_ObjIsNull(pRecogObjects->hRec)){
        lh_ObjClose(&pRecogObjects->hRec); pRecogObjects->hRec = lh_GetNullObj();
    }
    if (!lh_ObjIsNull(pRecogObjects->hAcMod)){
        lh_ObjClose(&pRecogObjects->hAcMod); pRecogObjects->hAcMod = lh_GetNullObj();
    }
    return 0;
}
int ASR_Initialize(const char * acmod, const char * ddg2p, const char * clc, const char * dct, const char * dynctx) {
    int rc = 0;
    size_t curCtx;
    LH_HEAP_INTERFACE     HeapInterface;
    // Initialization of all handles.
    recogObjects.pHeapInst		= NULL;
    recogObjects.hCompBase		= lh_GetNullComponent();
    recogObjects.hCompAsr		= lh_GetNullComponent();
    recogObjects.hCompPron		= lh_GetNullComponent();
    recogObjects.hAcMod			= lh_GetNullObj();
    for(curCtx=0; curCtx<sizeof(recogObjects.hCtx)/sizeof(recogObjects.hCtx[0]); curCtx++){
        recogObjects.hCtx[curCtx]     = lh_GetNullObj();
        recogObjects.hResults[curCtx] = lh_GetNullObj();
    }
    // ... the heap/component creation and the ConstructRecognizer() call
    // did not survive the trimming of this example ...
    return rc;
}
// Handles events raised by the recognizer thread. The header below is
// restored from the way the parameters are used in the body; treat the
// exact types as approximations.
static int evt_HandleEvent(RECOG_OBJECTS *pRecogObjects, unsigned long type, unsigned long timeMs) {
    // LH_AUDIOCHAIN_EVENT_FX_ABNORMCONDITION
    if ( type & LH_AUDIOCHAIN_EVENT_FX_ABNORMCONDITION ) {
        LH_ERROR lhErr = LH_OK;
        LH_FX_ABNORMCONDITION abnormCondition;  // enum type name assumed
        lhErr = lh_FxGetAbnormCondition(pRecogObjects->hRec, &abnormCondition);  // call name assumed
        if (LH_OK != lhErr) goto error;
        switch (abnormCondition) {
    case LH_FX_BADSNR:
        printf ("Abnormal condition: LH_FX_BADSNR.\n");
        break;
    case LH_FX_OVERLOAD:
        printf ("Abnormal condition: LH_FX_OVERLOAD.\n");
        break;
    case LH_FX_TOOQUIET:
        printf ("Abnormal condition: LH_FX_TOOQUIET.\n");
        break;
    case LH_FX_NOSIGNAL:
        printf ("Abnormal condition: LH_FX_NOSIGNAL.\n");
        break;
    case LH_FX_POORMIC:
        printf ("Abnormal condition: LH_FX_POORMIC.\n");
        break;
    case LH_FX_NOLEADINGSILENCE:
        printf ("Abnormal condition: LH_FX_NOLEADINGSILENCE.\n");
        break;
        }
    }
    // LH_AUDIOCHAIN_EVENT_FX_TIMER
    // It usually is used to get the signal level and SNR at regular intervals.
    if ( type & LH_AUDIOCHAIN_EVENT_FX_TIMER )   {
        LH_ERROR            lhErr = LH_OK;
        LH_FX_SIGNAL_LEVELS SignalLevels;
        printf ("Receiving event LH_AUDIOCHAIN_EVENT_FX_TIMER at time %d ms.\n", timeMs);
        lhErr = lh_FxGetSignalLevels(pRecogObjects->hRec, &SignalLevels);
        if (LH_OK != lhErr) goto error;
        printf ("Signal level: %ddB, SNR: %ddB at time %dms.\n", SignalLevels.energy, SignalLevels.SNR, SignalLevels.timeMs);
    }
    // LH_AUDIOCHAIN_EVENT_RESULT
    if ( type & LH_AUDIOCHAIN_EVENT_RESULT ){
        LH_ERROR         lhErr = LH_OK;
        LH_OBJECT        hNBestRes = lh_GetNullObj();
        LH_OBJECT        hCtx      = lh_GetNullObj();
        printf ("Receiving event LH_AUDIOCHAIN_EVENT_RESULT at time %d ms.\n", timeMs);
        // Get the NBest result object and process it.
        lhErr = lh_RecCreateResult (pRecogObjects->hRec, &hNBestRes);
        if (LH_OK == lhErr) {
            if (LH_OK == lh_ResultBorrowSourceCtx(hNBestRes, &hCtx)){
                int i;
                int _ready = 0;
                for(i=0; i<sizeof(pRecogObjects->hCtx)/sizeof(pRecogObjects->hCtx[0]); i++){
                    if(!lh_ObjIsNull(pRecogObjects->hCtx[i])){
                        if(hCtx.pObj == pRecogObjects->hCtx[i].pObj){
                            if(!lh_ObjIsNull(pRecogObjects->hResults[i])){
                                lh_ObjClose(&pRecogObjects->hResults[i]);
                            }
                            pRecogObjects->hResults[i] = hNBestRes;
                            hNBestRes = lh_GetNullObj();
                            _ready = 1;
                            break;
                        }
                    } else {
                        break;
                    }
                }
                if (_ready) {
                    for (i=0; i<sizeof(pRecogObjects->hCtx)/sizeof(pRecogObjects->hCtx[0]); i++) {
                        if(!lh_ObjIsNull(pRecogObjects->hCtx[i])){
                            if(lh_ObjIsNull(pRecogObjects->hResults[i])){
                                _ready = 0;
                            }
                        }
                    }
                }
                ASSERT(lh_ObjIsNull(hNBestRes));
                if (_ready) {
                    ProcessResult (pRecogObjects);
                    for(i=0; i<sizeof(pRecogObjects->hResults)/sizeof(pRecogObjects->hResults[0]); i++){
                        if(!lh_ObjIsNull(pRecogObjects->hResults[i])){
                            lh_ObjClose(&pRecogObjects->hResults[i]);
                        }
                    }
                }
            }
            // Close the NBest result object.
        }
    }
    return 0;
error:
    return -1;
}
static int ProcessResult (RECOG_OBJECTS   *pRecogObjects) {
    LH_ERROR  lhErr = LH_OK;
    size_t    curCtx, i, k, count=0;
    size_t    nbrHypothesis;
    ASRResult *r = NULL;
    long lid;
    // get total hyp count
    for(curCtx=0; curCtx<sizeof(pRecogObjects->hCtx)/sizeof(pRecogObjects->hCtx[0]); curCtx++){
        if(!lh_ObjIsNull(pRecogObjects->hResults[curCtx])){
            if(LH_OK == lh_NBestResultGetNbrHypotheses (pRecogObjects->hResults[curCtx], &nbrHypothesis)){
                count += nbrHypothesis;
            }
        }
    }
    // traces
    printf ("\n");
    printf (" __________RESULT %3d items max_______________\n", count);
    printf ("|        |        |\n");
    printf ("| result | confi- | result string [start rule]\n");
    printf ("| number | dence  |\n");
    printf ("|________|________|___________________________\n");
    printf ("|        |        |\n");
    if (count>0) {
        r = ASRResult_New(count);
        // Get & print out the result information for each hypothesis.
        count = 0;
        curCtx = sizeof(pRecogObjects->hCtx)/sizeof(pRecogObjects->hCtx[0]);
        for(; curCtx>0; curCtx--){
            LH_OBJECT hNBestRes = pRecogObjects->hResults[curCtx-1];
            if(!lh_ObjIsNull(hNBestRes)){
                LH_HYPOTHESIS   *pHypothesis;
                if(LH_OK == lh_NBestResultGetNbrHypotheses (hNBestRes, &nbrHypothesis)){
                    for (i = 0; i < nbrHypothesis; i++) {
                        char            *szResultWords;
                        // Retrieve information on the recognition result.
                        if (LH_OK == lh_NBestResultFetchHypothesis (hNBestRes, i, &pHypothesis)){
                            // Get the result string.
                            if (LH_OK == lh_NBestResultFetchWords (hNBestRes, i, &szResultWords)){
                                printf ("| %6lu | %6lu | '%s' [%s]\n", i, pHypothesis->conf, szResultWords, pHypothesis->szStartRule);
                                // Return the fetched data to the engine.
                                lh_NBestResultReturnWords (hNBestRes, szResultWords);
                            }
                            lh_NBestResultReturnHypothesis (hNBestRes, pHypothesis);
                        }
                    }
                }
            }
        }
    }
    // traces
    printf ("|________|________|___________________________\n");
    printf ("\n");
    return 0;
}



Obviously, as with TTS, the code is quite long, and the preliminary steps take up most of the space. And this is not even fully working code: when preparing it for publication, I threw out a lot of inessential pieces. All of which shows, once again, to those who have read this far, that using voice input/output technologies requires a rather high "entry threshold".

14. Stream recognition (dictation)


The latest word in this technology is stream recognition, or dictation. It is already available on modern Android and iOS smartphones, including in the form of an API. Here the programmer no longer needs to set the recognition context by building grammars: speech goes in, recognized words come out. Unfortunately, the details of how this works are not yet available to me. The recognition itself runs not on the device but on a server, to which the voice is sent and from which the result comes back. Still, I would like to believe that in a few years the technology will also be available on the client side.

Conclusion


That's probably all I wanted to tell about ASR and TTS technologies. I hope it did not turn out too boring and quite informative. In any case, questions are welcome.
