Speech recognition. Part 2. Typical structure of a speech recognition system
Speech recognition is a multi-level pattern recognition task in which acoustic signals are analyzed and structured into a hierarchy of elements: subword units (for example, phonemes), words, phrases, and sentences. Each level of the hierarchy may impose temporal constraints, for example, the permitted sequences of words or the known pronunciation variants, which reduce the number of recognition errors at the lower levels. The more a priori information we have (or assume) about the input signal, the better we can process and recognize it. The structure of a standard speech recognition system is shown in the figure. Consider the basic elements of this system.
- Raw speech. Typically a stream of audio data sampled at a high rate (20 kHz when recording from a microphone or 8 kHz when recording from a telephone line).
- Signal analysis. The incoming signal must first be transformed and compressed to simplify subsequent processing. Various methods extract useful parameters and compress the source data by a factor of ten or more without losing useful information. The most widely used methods are:
- Fourier analysis;
- linear prediction of speech (linear predictive coding, LPC);
- cepstral analysis.
- Speech frames. The result of signal analysis is a sequence of speech frames. Usually each frame is the result of analyzing the signal over a short interval (on the order of 10 ms) and contains information about that interval (on the order of 20 coefficients). To improve recognition quality, the first or second derivatives of these coefficients can be added to each frame to describe the dynamics of the speech.
- Acoustic models. To analyze the content of speech frames, a set of acoustic models is required. Consider the two most common kinds.
- Template model. The stored model is simply a recorded example of the unit to be recognized (a word or a command). Variability is handled by storing several pronunciation variants of the same unit (many speakers repeat the same command many times). This model is used mainly for recognizing whole words (command systems).
- State model. Each word is modeled as a sequence of states, each indicating the set of sounds that may be heard in that segment of the word, according to probabilistic rules. This approach is used in larger systems.
- Acoustic analysis. Each acoustic model is compared with each speech frame, producing a matrix of match scores between the sequence of frames and the set of acoustic models. For a template model, the matrix holds the Euclidean distances between the template frames and the frames being recognized (that is, it measures how far the received signal is from each stored template, so that the best-matching template can be found). For state-based models, the matrix holds the probabilities that a given state generates a given frame.
- Time alignment. Handles the temporal variation that occurs when words are pronounced (for example, sounds being “stretched” or “swallowed”).
- Sequence of words. As its output, the speech recognition system produces the sequence (or several candidate sequences) of words that most likely corresponds to the input speech stream.
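
The signal analysis and speech frame steps described in the list can be illustrated with a minimal sketch. It assumes NumPy, a 25 ms analysis window advanced every 10 ms, and a plain real cepstrum as the feature; these choices and the function names (frame_signal, cepstral_features, add_deltas) are illustrative assumptions, not the method of any particular system.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping frames, one frame every hop_ms."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row k of idx holds the sample indices that belong to frame k.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]

def cepstral_features(frames, n_coeffs=20):
    """Real cepstrum per frame: inverse FFT of the log magnitude spectrum."""
    windowed = frames * np.hamming(frames.shape[1])
    spectrum = np.abs(np.fft.rfft(windowed, axis=1))
    log_spectrum = np.log(spectrum + 1e-10)  # small offset avoids log(0)
    cepstrum = np.fft.irfft(log_spectrum, axis=1)
    return cepstrum[:, :n_coeffs]  # keep about 20 coefficients per frame

def add_deltas(features):
    """Append first and second time differences to every frame."""
    delta = np.gradient(features, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([features, delta, delta2])

# Synthetic one-second test tone standing in for telephone-quality speech.
sample_rate = 8000
signal = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)
frames = frame_signal(signal, sample_rate)
features = add_deltas(cepstral_features(frames))
print(features.shape)  # (number of frames, 3 * n_coeffs)
```

Each row of features is one speech frame: 20 cepstral coefficients followed by their first and second differences, which approximate the derivatives mentioned in the text.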
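
For the template model, acoustic analysis and time alignment can be sketched together. The example below uses dynamic time warping, one common way to align sequences of different lengths; the text above does not name a specific algorithm, so this choice is an assumption. The names dtw_distance, recognize, and templates are hypothetical, and the tiny "up"/"down" feature sequences are made up for illustration.

```python
import numpy as np

def dtw_distance(template, utterance):
    """Sum of Euclidean frame distances along the best monotonic alignment."""
    n, m = len(template), len(utterance)
    # local[i, j]: Euclidean distance between template frame i and
    # utterance frame j (the distance matrix of acoustic analysis).
    local = np.linalg.norm(template[:, None, :] - utterance[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # A frame may be repeated in either sequence ("stretching")
            # or both sequences may advance together.
            best_prev = min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
            cost[i, j] = local[i - 1, j - 1] + best_prev
    return cost[n, m]

def recognize(utterance, templates):
    """Return the label whose closest stored variant has the lowest distance."""
    return min(
        templates,
        key=lambda label: min(dtw_distance(t, utterance) for t in templates[label]),
    )

# Two hypothetical commands, each stored as a list of pronunciation variants.
up = np.linspace(0.0, 1.0, 5)[:, None] * np.ones((1, 3))
down = up[::-1].copy()
templates = {"up": [up], "down": [down]}

# The same "word" spoken twice as slowly: every frame is duplicated.
stretched = np.repeat(up, 2, axis=0)
print(recognize(stretched, templates))
```

Because the alignment may repeat frames, the stretched utterance still matches its template exactly, which is the kind of temporal variation that time alignment is meant to absorb.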
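
For state-based models, the matrix of state-versus-frame probabilities can be decoded into the most likely state sequence. The sketch below uses the Viterbi algorithm, a standard choice for this task that the text above does not itself name. The three-state left-to-right word model and all probability values are invented for illustration.

```python
import numpy as np

def viterbi(log_emission, log_transition, log_initial):
    """Most likely state path given per-frame state log-probabilities.

    log_emission:   (n_frames, n_states), log P(frame | state)
    log_transition: (n_states, n_states), log P(next state | current state)
    log_initial:    (n_states,), log P(first state)
    """
    n_frames, n_states = log_emission.shape
    score = np.full((n_frames, n_states), -np.inf)
    backptr = np.zeros((n_frames, n_states), dtype=int)
    score[0] = log_initial + log_emission[0]
    for t in range(1, n_frames):
        # candidate[i, j]: best score that ends in state i at t-1, then moves to j.
        candidate = score[t - 1][:, None] + log_transition
        backptr[t] = np.argmax(candidate, axis=0)
        score[t] = np.max(candidate, axis=0) + log_emission[t]
    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(np.max(score[-1]))

# Hypothetical word model: stay in a state or move on to the next one.
transition = np.array([[0.5, 0.5, 0.0],
                       [0.0, 0.5, 0.5],
                       [0.0, 0.0, 1.0]])
initial = np.array([1.0, 0.0, 0.0])

# Made-up acoustic analysis matrix: six frames, each favoring one state.
emission = np.full((6, 3), 0.05)
for t, s in enumerate([0, 0, 1, 1, 1, 2]):
    emission[t, s] = 0.9

with np.errstate(divide="ignore"):  # log(0) = -inf marks forbidden moves
    path, log_score = viterbi(np.log(emission), np.log(transition), np.log(initial))
print(path)
```

The returned path assigns one state to every frame, so a state that is held for several frames corresponds to a "stretched" sound. In this way the decoding also performs the time alignment step.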