Speech recognition. Part 1. Classification of speech recognition systems
Epigraph
In Russia, the field of speech recognition systems is indeed rather poorly developed. Google announced a system for recording and recognizing telephone conversations long ago... Unfortunately, I have not yet heard of any system of comparable scale and recognition quality for Russian.
But do not think that everyone abroad discovered everything long ago and that we will never catch up. While gathering material for this series, I had to dig through a mountain of foreign literature and dissertations. Moreover, those articles and dissertations were written by such remarkable "American" scientists as Huang Xuedong, Hisayoshi Kojima, DongSuk Yuk and others. It is clear on whom this branch of American science rests, isn't it? ;0)
In Russia, I know of only one sensible company that has managed to bring a domestic speech recognition system to the commercial level: the Center for Speech Technology. But perhaps, after this series of articles, it will occur to someone that developing such systems is both possible and necessary. Moreover, in terms of algorithms and mathematical apparatus we have hardly lagged behind at all.
Classification of speech recognition systems
Today, the term "speech recognition" covers a whole field of scientific and engineering activity. In general, every speech recognition task comes down to extracting human speech from the input audio stream, classifying it, and responding to it appropriately. This may be carrying out a specific action on a person's command, extracting a particular marker word from a large archive of telephone conversations, or a system for voice input of text.
Criteria for classifying speech recognition systems
Each such system has certain tasks it is designed to solve and a set of approaches used to solve them. Let us consider the main criteria by which human speech recognition systems can be classified, and how each criterion affects the operation of the system.
- Dictionary size. Obviously, the larger the dictionary built into the recognition system, the higher the word error rate. For example, a dictionary of 10 digits can be recognized almost without error, while the error rate on a dictionary of 100,000 words can reach 45%. On the other hand, even recognition of a small dictionary can produce many errors if the words in it are very similar to one another.
- Speaker dependence or speaker independence. By definition, a speaker-dependent system is intended for use by a single speaker, while a speaker-independent system is designed to work with any speaker. Speaker independence is an elusive goal, because during training the system is tuned to the parameters of the speaker whose voice it is trained on. The recognition error rate of speaker-independent systems is usually 3 to 5 times higher than that of speaker-dependent ones.
- Isolated or continuous speech. If each word in the speech is separated from the next by a stretch of silence, the speech is said to be isolated. Continuous speech consists of naturally pronounced sentences. Recognizing continuous speech is much harder, because the boundaries of individual words are not clearly defined and their pronunciation is strongly distorted by the slurring of spoken sounds.
- Purpose. The purpose of the system determines the level of abstraction at which speech recognition must take place. In a command system (for example, voice dialing in a cell phone), a word or phrase will most likely be recognized as a single speech element. A text dictation system requires greater recognition accuracy and, when interpreting a spoken phrase, will most likely rely not only on what was just said but also on how it relates to what was said before. The system must also have a built-in set of grammatical rules that the spoken, recognizable text must satisfy. The stricter these rules, the easier the recognition system is to implement and the more limited the set of sentences it can recognize (see the sketch after this list).
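As a concrete illustration of how a strict grammar limits a command system, here is a minimal sketch in Python; the voice-dialing commands and slot names are made up for illustration and do not come from the article:

```python
# A minimal sketch of a strict command grammar for a toy voice-dialing
# system; all command and target words here are hypothetical examples.

GRAMMAR = {
    "action": {"call", "dial", "redial"},       # first-word slot
    "target": {"home", "office", "voicemail"},  # second-word slot
}

def matches_grammar(words):
    """Accept only two-word utterances of the form <action> <target>."""
    return (len(words) == 2
            and words[0] in GRAMMAR["action"]
            and words[1] in GRAMMAR["target"])

print(matches_grammar("call home".split()))         # True
print(matches_grammar("please call home".split()))  # False: outside the grammar
```

With three actions and three targets, only nine sentences are legal, so the recognizer only ever has to discriminate among nine candidates.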
Differences in Speech Recognition Methods
When creating a speech recognition system, you need to choose a level of abstraction adequate to the task, the parameters of the sound wave that will be used for recognition, and the methods for recognizing those parameters. Let us consider the main differences in the structure and operation of various speech recognition systems.
- By type of structural unit. When analyzing speech, either whole words or parts of spoken words, such as phonemes, diphones or triphones, and allophones, can be chosen as the basic unit. Depending on which structural unit is chosen, the structure, versatility and complexity of the dictionary of recognizable elements change (the first sketch after this list illustrates the difference).
- By feature extraction. The raw sequence of sound-pressure samples is highly redundant for speech recognition systems and contains a lot of information that is unnecessary or even harmful for recognition. Therefore, to represent the speech signal, one must extract from it parameters (features) that adequately represent the signal for recognition (see the second sketch below).
- By the mechanism of operation. Modern systems use a variety of approaches to the mechanism by which recognition works. The probabilistic-network approach divides the speech signal into parts (frames, or by a phonetic criterion) and then performs a probabilistic assessment of which element of the recognition dictionary that part, and/or the whole input signal, corresponds to (the third sketch below shows the core of this approach). The approach based on solving the inverse problem of speech synthesis determines, from the input signal, the movement of the articulators of the vocal tract, and then looks up the pronounced phonemes in a special dictionary.
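The first sketch below (Python) shows how the choice of structural unit changes the dictionary: the same word decomposes into context-independent phonemes or into context-dependent triphones. The ARPAbet-style transcription of "speech" is illustrative rather than taken from a real lexicon:

```python
# Sketch: one word as context-independent phonemes vs. context-dependent
# triphones. The ARPAbet-style transcription of "speech" is illustrative.

phonemes = ["S", "P", "IY", "CH"]

def to_triphones(phones):
    """Each unit becomes left_context-phone+right_context."""
    padded = ["#"] + phones + ["#"]   # '#' marks the word boundary
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(phonemes)                 # ['S', 'P', 'IY', 'CH']
print(to_triphones(phonemes))   # ['#-S+P', 'S-P+IY', 'P-IY+CH', 'IY-CH+#']
```

Since each triphone is a phone in a specific left and right context, the triphone inventory can grow roughly as the cube of the phoneme inventory, which is exactly the trade-off in dictionary size and versatility described above.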
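The second sketch illustrates feature extraction: compressing the redundant waveform into a compact parameter sequence. MFCCs are one common choice of such features (not the only one); this assumes the librosa library is installed and uses a placeholder file name:

```python
# Sketch: reducing a raw waveform to compact MFCC features.
# Assumes the librosa library; "utterance.wav" is a placeholder file name.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)     # raw samples, 16000 per second
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 coefficients per frame

print(y.shape)     # e.g. (16000,) for one second of audio
print(mfcc.shape)  # e.g. (13, 32): roughly 40x fewer numbers to recognize
```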
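The third sketch shows the heart of the probabilistic-network approach: scoring a frame sequence against a word model, here a small left-to-right hidden Markov model evaluated with the Viterbi recursion. All probabilities are toy values invented for the example:

```python
# Sketch: Viterbi scoring of a frame sequence against one word's HMM;
# all probabilities below are toy values, not trained parameters.
import numpy as np

# Log-probability of each HMM state emitting each observed frame
# (rows: 3 states of one word model; columns: 5 frames).
log_emit = np.log(np.array([
    [0.7, 0.6, 0.1, 0.1, 0.1],
    [0.2, 0.3, 0.7, 0.3, 0.2],
    [0.1, 0.1, 0.2, 0.6, 0.7],
]))
with np.errstate(divide="ignore"):        # log(0) -> -inf for forbidden jumps
    log_trans = np.log(np.array([         # strict left-to-right transitions
        [0.6, 0.4, 0.0],
        [0.0, 0.6, 0.4],
        [0.0, 0.0, 1.0],
    ]))

n_states, n_frames = log_emit.shape
delta = np.full((n_states, n_frames), -np.inf)
delta[0, 0] = log_emit[0, 0]              # a path must start in the first state
for t in range(1, n_frames):
    for j in range(n_states):
        # best predecessor state, then emit the current frame
        delta[j, t] = np.max(delta[:, t - 1] + log_trans[:, j]) + log_emit[j, t]

# Log-score of the best path ending in the final state; comparing such
# scores across different word models yields the recognized word.
print(delta[-1, -1])
```

In a full recognizer, each word (or smaller unit) in the dictionary gets such a model, and the word whose model assigns the highest score to the input is the one recognized.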
UPD: Moved to the Artificial Intelligence hub. If there is interest, I will continue to publish there.