Speech recognition from Yandex. Under the hood of Yandex.SpeechKit

    imageAt Yet another Conference 2013, we introduced the developers to our new Yandex SpeechKit library. This is a public speech recognition API that Android and iOS developers can use. Download SpeechKit, as well as read the documentation, here .

    Yandex SpeechKit allows you to directly access the backend that has been successfully used in Yandex mobile applications. We have been developing this system for a long time and now correctly recognize 94% of words in the Navigator and Mobile Maps, as well as 84% ​​of words in the Mobile Browser. At the same time, recognition takes a little more than a second. This is a very worthy quality, and we are actively working to improve it.


    It can be argued that in the near future, voice interfaces will practically not differ in reliability from classical input methods. A detailed story about how we managed to achieve such results, and how our system works, under the cut.

    Speech recognition is one of the most interesting and complex tasks of artificial intelligence. Here the achievements of very different fields are involved: from computer linguistics to digital signal processing. To understand how a machine that understands speech should be arranged, let's first figure out what we are dealing with.

    I. Fundamentals

    Sounding speech for us is, first of all, a digital signal. And if we look at the recording of this signal, then we will not see there either words or pronounced phonemes - different “speech events” smoothly flow into each other without forming clear boundaries. The same phrase, uttered by different people or in different settings, at the signal level will look different. At the same time, people somehow recognize each other's speech: therefore, there are invariants, according to which the signal can be restored, what, in fact, was said. The search for such invariants is the task of acoustic modeling.

    Suppose that human speech consists of phonemes (this is a gross simplification, but in a first approximation it is true). We define the phoneme as the minimum meaningful unit of the language, that is, sound, the replacement of which can lead to a change in the meaning of a word or phrase. Take a small portion of the signal, say 25 milliseconds. We call this section "frame". What phoneme was pronounced on this frame? It is difficult to answer this question unequivocally - many phonemes are extremely similar to each other. But if it is impossible to give an unambiguous answer, then we can reason in terms of “probabilities”: for a given signal, some phonemes are more probable, others less, others can be excluded from consideration altogether. Actually, the acoustic model is a function, receiving a small portion of the acoustic signal (frame) at the input and giving out the probability distribution of various phonemes on this frame. Thus, the acoustic model gives us the opportunity to recover from the sound what was said - with varying degrees of confidence.

    Another important aspect of acoustics is the probability of a transition between different phonemes. We know from experience that some combinations of phonemes are pronounced easily and are common, others are more difficult to pronounce and are less commonly used in practice. We can generalize this information and take it into account when assessing the “credibility” of a given sequence of phonemes.

    Now we have all the tools to construct one of the main "workhorses" of automatic speech recognition - the hidden Markov model (HMM, Hidden Markov Model). To do this, imagine for a while that we are not solving the problem of speech recognition, but the exact opposite - the conversion of text to speech. Suppose we want to get the pronunciation of the word Yandex. Let the word Yandex consist of a set of phonemes, say, [th] [a] [n] [d] [e] [k] [s]. We construct a finite state machine for the word Yandex, in which each phoneme is represented by a separate state. At each moment of time, we are in one of these states and “pronounce” the sound characteristic of this phoneme (we know how each phoneme is pronounced thanks to the acoustic model). But some phonemes last a long time (like [a] in the word Yandex), others are practically swallowed. Here, information about the probability of transition between phonemes is useful to us. Having generated a sound corresponding to the current state, we make a probabilistic decision: to remain in the same state or move on to the next one (and, accordingly, the next phoneme).


    More formally, HMM can be represented as follows. First, we introduce the concept of emission. As we remember from the previous example, each of the HMM states “generates” a sound characteristic of this particular state (ie phonemes). On each frame, the sound is “played out” from the probability distribution corresponding to the given phoneme. Secondly, transitions between states are possible, which also obey predefined probability patterns. For example, the probability that the phoneme [a] will “stretch” is high, which cannot be said about the phoneme [e]. The emission matrix and transition matrix uniquely define the hidden Markov model.

    Well, we looked at how a hidden Markov model can be used to generate speech, but how to apply it to the inverse problem of speech recognition? Viterbi's algorithm comes to the rescue. We have a set of observable quantities (in fact, sound) and a probabilistic model correlating latent states (phonemes) and observable quantities. Viterbi's algorithm allows you to restore the most likely sequence of hidden states.

    Let in our recognition dictionary there are only two words: “Yes” ([d] [a]) and “No” ([n '] [e] [t]). Thus, we have two hidden Markov models. Next, let us have a voice recording of a user who says yes or no. Viterbi's algorithm will allow us to get an answer to the question of which of the recognition hypotheses is more likely.

    Now our task is reduced to restoring the most probable sequence of states of the hidden Markov model, which “gave rise” (or rather, could give rise) to the audio recording presented to us. If the user says yes, then the corresponding sequence of states on 10 frames can be, for example, [d] [d] [d] [d] [a] [a] [a] [a] [a] [a] or [e] [a] [a] [a] [a] [a] [a] [a] [a] [a]. Similarly, various pronunciation options for "no" are possible - for example, [n '] [n'] [n '] [e] [e] [e] [e] [t] [t] [t] and [n' ] [n '] [e] [e] [e] [e] [e] [e] [t] [t]. Now we find the "best", that is, the most likely, way of pronouncing each word. On each frame, we will ask our acoustic model how likely it is that a specific phoneme sounds here (for example, [d] and [a]); in addition, we will take into account the transition probabilities ([d] -> [d], [d] -> [a], [a] -> [a]). So we get the most likely way to pronounce each of the hypothesis words; moreover, for each of them we will get a measure of how likely it is that this word was pronounced (we can consider this measure as the length of the shortest path through the corresponding graph). The “winning” (i.e. more likely) hypothesis will be returned as a result of recognition.

    Viterbi's algorithm is quite simple to implement (using dynamic programming) and works for a time proportional to the product of the number of HMM states by the number of frames. However, it is not always enough for us to know the most probable way; for example, when training an acoustic model, an estimate of the probability of each state on each frame is needed. For this, the Forward-Backward algorithm is used .

    However, the acoustic model is just one of the components of the system. What if the recognition dictionary does not consist of two words, as in the example considered above, but of hundreds of thousands or even millions? Many of them will be very similar in pronunciation or even coincide. However, in the presence of context, the role of acoustics falls: slurred, noisy or ambiguous words can be restored "in meaning". Again, probabilistic models are used to account for the context. For example, a native speaker of the Russian language understands that the naturalness (in our case, the probability) of the sentence “mom soap frame” is higher than “mom soap cyclotron” or “mother soap frame”. That is, the presence of a fixed context “soap mom ...” sets the probability distribution for the next word, which reflects both semantics and morphology. This type of language model is called n-gram language models (trigrams in the example above); Of course, there are much more complex and powerful ways of modeling the language.

    II. What's under the hood of the Yandex ASR?

    Now, when we imagine the general structure of speech recognition systems, we will describe in more detail the details of Yandex technology - the best, according to our data, Russian speech recognition system.
    When looking at the toy examples above, we intentionally made a few simplifications and omitted a number of important details. In particular, we argued that the main “building unit” of speech is the phoneme. In fact, the phoneme is too large a unit; in order to adequately model the pronunciation of a single phoneme, three separate states are used - the beginning, middle and end of the phoneme. Together they form the same HMM as presented above. In addition, phonemes are position-dependent and context-sensitive: formally, the “same” phoneme sounds significantly different depending on what part of the word it is in and which phonemes it is adjacent to. At the same time, a simple enumeration of all possible variants of context-sensitive phonemes will return a very large number of combinations, many of which are never found in real life;
    Thus, we, firstly, made phonemes context-sensitive, and secondly, we divided each of them into three parts. These objects - “parts of phonemes” - now make up our phonetic alphabet. They are also called senons. Each state of our HMM is a senon. In our model, 48 phonemes and about 4000 senons are used.

    So, our acoustic model still takes sound to the input, and at the output it gives the probability distribution over the senons. Now consider what exactly is being input. As we said, the sound is cut into sections of 25 ms ("frames"). Typically, the slicing step is 10 ms, so that adjacent frames partially overlap. It is clear that a “raw” sound - the amplitude of the oscillations in time - is not the most informative form of representation of an acoustic signal. The spectrum of this signal is already much better. In practice, a logarithmic and scaled spectrum is usually used, which corresponds to the laws of human auditory perception ( Mel-transformation ). The resulting values ​​are subjected to discrete cosine transform ( DCT ), and the result is MFCC- Mel Frequency Cepstral Coefficients. (The word Cepstral is obtained by rearranging the letters in Spectral, which reflects the presence of an additional DCT). MFCC is a vector of 13 (usually) real numbers. They can be used as the input of the acoustic model “raw”, but more often they undergo many additional transformations.

    Acoustic model training is a complex and multi-step process. Expectation-Maximization algorithms , such as the Baum-Welsh algorithm , are used for training.. The essence of algorithms of this kind is in the alternation of two steps: at the Expectation step, the existing model is used to calculate the expectation of the likelihood function, at the Maximization step, the model parameters are changed in such a way as to maximize this estimate. At the early stages of training, simple acoustic models are used: simple MFCC features are given to the input, phonemes are considered out of contextual dependence, a mixture of Gaussians with diagonal covariance matrices (Diagonal GMMs - Gaussian Mixture Models) is used to model the probability of emission in HMM. The results of each previous acoustic model are the starting point for training a more complex model, with a more complex input, output, or probability distribution function for emissions. There are many ways to improve the acoustic model,Deep Neural Network ), which almost doubles the quality of recognition. Neural networks lack many of the limitations typical of Gaussian mixtures and have better generalizing ability. In addition, acoustic models on neural networks are more resistant to noise and have better performance.

    A neural network for acoustic modeling is trained in several stages. To initialize the neural network, a stack of Restricted Boltzmann Machines is used., RBM). RBM is a stochastic neural network that trains without a teacher. Although the weights learned by her cannot be directly used to distinguish between classes of acoustic events, they reflect in detail the structure of speech. You can treat RBM as a feature extractor - the resulting generative model is an excellent starting point for constructing a discriminative model. The discriminative model is trained using the classical algorithm of back propagation of errors, and a number of techniques are used to improve convergence and prevent overfitting. As a result, at the input of the neural network there are several MFCC-features frames (the central frame is subject to classification, the rest form the context), at the output there are about 4000 neurons corresponding to different senons.


    Consider the decoding process in more detail. For the task of recognizing spontaneous speech with a large dictionary, the approach described in the first section is not applicable. A data structure is needed that combines all the possible sentences that the system can recognize. A suitable structure is the weighted finite-state transducer (WFST) - in fact, just a finite state machine with an output tape and weights on the edges. At the entrance of this machine are senons, at the exit are words. The decoding process comes down to choosing the best path in this machine and providing an output sequence of words corresponding to this path. In this case, the price of the passage along each arc is composed of two components. The first component is known in advance and is calculated at the assembly stage of the machine. It includes the cost of pronunciation, transition to a given state, likelihood assessment by the language model. The second component is calculated separately for a particular frame: this is the acoustic weight of the senon corresponding to the input symbol of the arc in question. Decoding takes place in real time, so not all possible paths are investigated: special heuristics limit the set of hypotheses to the most probable ones.

    Of course, the most interesting part from a technical point of view is the construction of such an automaton. This problem is solved offline. To move from simple HMMs for each context-sensitive phoneme to linear automata for each word, we need to use a pronunciation dictionary. Creating such a dictionary is impossible manually, and machine learning methods are used here (and the task itself in the scientific community is called Grapheme-To-Phoneme, or G2P). In turn, the words “dock” with each other into a language model, also presented in the form of a finite state machine. The central operation here is WFST composition, but various methods for optimizing WFST in terms of size and stacking efficiency are also important.


    The result of the decoding process is a list of hypotheses that can be further processed. For example, you can use a more powerful language model to rearrange the most likely hypotheses. The resulting list is returned to the user, sorted by the value of confidence - the degree of our confidence that the recognition has passed correctly. Often, only one hypothesis remains, in this case, the client application immediately proceeds to the execution of a voice command.

    In conclusion, we will address the issue of quality metrics for speech recognition systems. Most Popular Word Error Rate Metric(and its opposite Word Accuracy). In essence, it reflects the proportion of incorrectly recognized words. To calculate the Word Error Rate for a speech recognition system, manually-labeled cases of voice queries corresponding to the subject of the application using speech recognition are used.


    1. Mohri, M., Pereira, F., & Riley, M. (2008). Speech recognition with weighted finite-state transducers. In Springer Handbook of Speech Processing (pp. 559-584). Springer Berlin Heidelberg.
    2. Hinton, G., Deng, L., Yu, D., Dahl, GE, Mohamed, AR, Jaitly, N. et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29 (6), 82-97.
    3. Jurafsky D., Martin JH (2008) Speech and language processing, 2nd edition. Prentice Hall.

    Also popular now: