Speech recognition. Part 3. Vocal tract, auditory tract

    Why do we need it

    When it comes to speech recognition, it is impossible to stay purely within the realm of “signal analysis” as an isolated branch of science. You should always remember that in speech analysis we work with a special kind of signal, one produced by a particular biological system. On the one hand, this signal is limited by the amplitude-frequency response of that system; on the other hand, by the language itself and the standard set of sounds its native speakers can utter (for example, when analyzing Russian we do not need to account for clicks or whistles). Based on the task at hand, it is possible to determine the characteristics of the speech signal and its main properties quite accurately.
    On the other hand, nature has developed a near-ideal receiver for this signal: our auditory tract. So far no other system has been invented or discovered that can recognize speech as accurately and efficiently. It would be a waste to pass up the opportunity to learn this from nature. Once you become familiar with how the auditory tract works, you begin to understand that wavelets and Fourier transforms did not come to these problems out of nowhere. Systems that decompose a signal into a frequency spectrum appeared long before the first cave painting...

    Vocal tract


    A voice signal is generated by air waves emitted through the speaker's mouth and nostrils. In most languages of the world, the set of phonemes can be divided into two main classes:
    1. consonants, pronounced with a constriction of the throat or an obstruction in the oral cavity (tongue, teeth, lips);
    2. vowels, pronounced without any obstruction in the vocal tract.

    Sounds can be further classified into smaller classes on the basis of various articulatory properties. These properties are determined by the anatomy of the human articulators and their points of contact within the vocal tract. The lungs, trachea, larynx, pharynx (throat), and the oral and nasal cavities all make a significant contribution to speech production.
    The vocal tract
    • Lungs: the source of air flow during speech.
    • Vocal cords: when the vocal cords are close together and oscillate against each other during speech, the sound is said to be voiced. If the cords do not oscillate, the sound is said to be unvoiced.
    • Soft palate: acts as a flap that opens the passage of air into the nasal cavity.
    • Hard palate: the long, relatively hard surface of the roof of the mouth; in combination with the tongue it allows consonant sounds to be pronounced.
    • Tongue: a flexible articulator. Moved away from the palate, it allows vowels to be pronounced; brought close to the palate, consonants.
    • Teeth: Used in combination with the tongue when pronouncing some consonants.
    • Lips: can round or stretch, changing the sound of vowels, or close to stop the air flow when pronouncing some consonants.

    The most important distinction between sounds is whether they are voiced or unvoiced.

    Voiced sounds have a quasiperiodic component in their frequency and time structure. It appears because the vocal cords, vibrating at frequencies from about 60 Hz in an adult man to 300 Hz or higher in a woman or child, take part in producing the sound. The vibration frequency of the vocal cords is called the fundamental frequency of the sound, since it is the base frequency for the higher harmonics created in the larynx and oral cavity. The fundamental frequency also contributes more than any other factor to the perceived pitch of speech.
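    Since voiced sounds are quasiperiodic, the fundamental frequency can be estimated directly from the signal. Below is a minimal Python sketch (assuming NumPy) that finds F0 from the autocorrelation peak within the 60–300 Hz range mentioned above; the function name and the synthetic test signal are my own illustration, not part of any particular library.

        import numpy as np

        def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=300.0):
            """Estimate F0 of a voiced frame from the strongest
            autocorrelation peak inside the plausible F0 range."""
            frame = frame - np.mean(frame)                  # remove DC offset
            corr = np.correlate(frame, frame, mode="full")  # full autocorrelation
            corr = corr[len(corr) // 2:]                    # keep non-negative lags
            # Convert the F0 search range into a lag (sample) range.
            lag_min = int(sample_rate / f0_max)
            lag_max = int(sample_rate / f0_min)
            best_lag = lag_min + np.argmax(corr[lag_min:lag_max])
            return sample_rate / best_lag

        # A crude stand-in for glottal pulses: a sawtooth with F0 = 100 Hz.
        sr = 16000
        t = np.arange(0, 0.04, 1.0 / sr)       # 40 ms of signal
        signal = ((100.0 * t) % 1.0) - 0.5
        print(estimate_f0(signal, sr))          # -> 100.0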

    The figure shows the stages of the vocal cord cycle as air flow passes through them. At stage (a), the glottis is closed and the air flow is stopped in front of the vocal cords.
    Vocal cords
    At some point (stage b), the air pressure in front of the cords overcomes the barrier, and air escapes through the glottis. However, the tissues and muscles of the vocal cords, due to their natural elasticity, return to their original state, closing the glottis again (stage c). This creates a sequence of sound vibrations that is the energy source for all voiced sounds.

    When pronouncing unvoiced sounds, the vocal cords are either relaxed or very tense, and as a result they produce no sound vibrations. Air flows freely from the lungs into the oral and/or nasal cavities of the vocal tract. The interaction of the air with the various articulators transforms the air flow, producing the sound.
    Waveform of the voiced sound “O” and the unvoiced sound “T”
    The figure shows an example of a signal containing two sounds: the voiced “O” and the unvoiced “T”. Obviously, their properties differ greatly, and this must be taken into account in the analysis. A recognition problem arises when a word begins or ends with an unvoiced sound: special algorithms are then needed to distinguish this sound from background noise and to determine the exact moment when the speech signal begins (or ends). We will talk about such algorithms in the following parts.
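    As a simple illustration of how differently voiced and unvoiced sounds behave, one can compute two classic short-time features: energy (high for a voiced “O”, low for an unvoiced “T”) and the zero-crossing rate (the reverse). This is only a sketch of the idea, not one of the detection algorithms promised for the following parts; the frame sizes are arbitrary assumptions.

        import numpy as np

        def short_time_features(signal, frame_len=400, hop=160):
            """Per-frame energy and zero-crossing rate (ZCR).
            Voiced frames tend to show high energy and low ZCR;
            unvoiced frames, low energy and high ZCR."""
            energies, zcrs = [], []
            for start in range(0, len(signal) - frame_len + 1, hop):
                frame = signal[start:start + frame_len]
                energies.append(float(np.sum(frame ** 2)))
                # A zero crossing is a sign change between adjacent samples.
                zcrs.append(float(np.mean(np.abs(np.diff(np.sign(frame))) > 0)))
            return np.array(energies), np.array(zcrs)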

    Auditory tract


    The speech perception system has two main components: the peripheral auditory organs and the auditory part of the brain. The ear processes the signal carried by a sound wave, converting it into mechanical vibration of the eardrum and then mapping this vibration into a sequence of impulses transmitted along the auditory nerve. Useful information is extracted in various parts of the auditory region of the human brain.
    Auditory tract
    The human ear consists of three sections: the outer ear, the middle ear, and the inner ear. The outer ear consists of the visible part and the external auditory canal, which ends at the eardrum. Sound passing through the external auditory canal acts on the eardrum and makes it vibrate.

    The middle ear is an air-filled cavity with a volume of approximately 6 cm³. The vibrations of the eardrum are transmitted by a chain of auditory ossicles (the malleus, incus, and stapes) to a membrane called the “oval window”. This is the interface between the middle ear and the inner ear (the cochlea), since the rest of the inner ear's surface consists of bone tissue.
    The cochlea
    The structure of the inner ear most important for sound perception is the cochlea, which communicates directly with the auditory nerve. A longitudinal membrane divides the cochlear spiral into two fluid-filled parts. The inner surface of the cochlea is covered with ciliated receptor cells, which connect directly to the auditory nerve and sense the fluid pressure at a particular point in the cochlea. The inner ear is arranged so that, depending on the frequency of the incoming signal, the maximum amplitude of the change in fluid pressure is recorded at a particular distance from the base of the cochlea (see figure). Thus, the cochlea can be viewed as a bank of filters whose output signals are ordered by distance from the base of the cochlea. Filters closer to the base respond to higher frequencies.
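    This “bank of filters” view of the cochlea is easy to model in code. The sketch below (assuming NumPy and SciPy) passes a signal through 24 log-spaced bandpass filters, one output channel per “point along the cochlea”; the band edges and filter order are arbitrary choices for illustration, not physiological values.

        import numpy as np
        from scipy.signal import butter, sosfilt

        def cochlear_filterbank(signal, sample_rate, n_bands=24,
                                f_low=100.0, f_high=6000.0):
            """Crude model of the cochlea as a bank of bandpass filters.
            Bands are spaced logarithmically, mirroring the denser
            frequency resolution at the low end of hearing."""
            edges = np.geomspace(f_low, f_high, n_bands + 1)
            outputs = []
            for lo, hi in zip(edges[:-1], edges[1:]):
                sos = butter(4, [lo, hi], btype="bandpass",
                             fs=sample_rate, output="sos")
                outputs.append(sosfilt(sos, signal))
            # Channels ordered from low (apex) to high (base) frequencies.
            return np.stack(outputs)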

    The auditory nerve is a set of frequency channels. Each frequency channel comprises a group of neurons connected to one cochlear filter or to neighboring ones, that is, to filters with the same or close characteristic frequencies. This set of features serves as an instantaneous image of the signal for the human brain, where a complex neural network extracts useful information from the received signal. Unfortunately, there are no exact data on how this information is extracted inside the brain; there are only a number of theories describing, in different ways, the possible neural structures of the brain and their interactions.
    Acoustic tract analogies

    Scales


    Many elements of various speech recognition systems are modeled on the human auditory tract and try to imitate the mechanisms of its operation. Thus, the most popular characteristic features of a speech signal, the mel-frequency cepstral coefficients (MFCC), are based on the study of how the signal is transformed in the human inner ear. The development of neural network algorithms is likewise tied to studies of the human brain.
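    In practice, MFCC extraction is available out of the box in many libraries. A minimal Python example using the librosa library (the file name here is just a placeholder):

        import librosa

        # Load a speech recording; librosa resamples to 22050 Hz by default.
        y, sr = librosa.load("speech.wav")  # placeholder path
        # 13 mel-frequency cepstral coefficients per analysis frame.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        print(mfcc.shape)  # (13, number_of_frames)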

    Studies have been conducted to derive a frequency scale that would simulate the natural response of the human auditory system, in which the cochlea acts as a spectral analyzer. The complex mechanism of the inner ear and the auditory nerve suggests that the perception of sounds at different frequencies can hardly be simple or linear. It is widely known that in modern Western culture the musical scale is divided into octaves and semitones.

    The frequency f1 is an octave above the frequency f2 if and only if f1 = 2 · f2. There are 12 semitones in an octave; therefore, f1 is a semitone above f2 if and only if
    f1 = 2^(1/12) · f2
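    For a quick check of the semitone formula, here is the equal-tempered scale built up from the reference tone A4 = 440 Hz:

        # Equal temperament: the k-th semitone above 440 Hz is 440 * 2**(k / 12).
        for k in range(13):
            print(k, round(440 * 2 ** (k / 12), 2))
        # k = 12 yields 880.0 Hz, exactly one octave above 440 Hz.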

    As a result of various studies of how humans perceive sounds of different frequencies, a number of scales were developed that represent the frequency of a sound in units closer to human perception. One of the first attempts of this kind produced the Bark scale. The expectation was that processing spectral energy on the Bark scale would correspond more closely to what a person actually hears.

    The Bark scale is divided into 24 critical bands of hearing. The ear's frequency resolution is finer at low frequencies than at high ones. Frequency can be converted from Hz to the Bark scale using the following formula:
    b(f) = 13 · arctan(0.00076 f) + 3.5 · arctan((f / 7500)^2),
    where f is the sound frequency in Hz,
    b is the sound frequency in Bark.
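    In code this conversion is a one-liner. A sketch assuming NumPy (the function name is my own):

        import numpy as np

        def hz_to_bark(f):
            """Convert frequency in Hz to the Bark scale (formula above)."""
            return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

        print(hz_to_bark(1000.0))  # ~8.5 Bark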

    But another scale has come to be used more widely in human speech recognition: the mel scale, linear at frequencies below 1 kHz and logarithmic at frequencies above 1 kHz. The mel scale was derived from experiments with reference tones (sinusoids) in which subjects were asked either to divide frequency ranges into four equal intervals or to adjust the frequency of a tone until it sounded half as high as a reference. 1 mel is defined as one thousandth of the pitch of a 1 kHz tone. As with other attempts to build such scales, it is assumed that the mel scale models the sensitivity of the human ear more accurately. The conversion to mel values can be approximated by the following formula:
    B(f) = 1127 · ln(1 + f / 700),
    where f is the frequency of sound in Hz,
    B is the frequency of sound in mel.
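    The same conversion in code, together with its inverse (again a sketch assuming NumPy; note that by this formula 1 kHz maps to exactly 1000 mel):

        import numpy as np

        def hz_to_mel(f):
            """Convert frequency in Hz to mel (natural-log form of the formula)."""
            return 1127.0 * np.log(1.0 + f / 700.0)

        def mel_to_hz(m):
            """Inverse conversion, mel back to Hz."""
            return 700.0 * (np.exp(m / 1127.0) - 1.0)

        print(hz_to_mel(1000.0))  # -> 1000.0 mel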
    A number of modern speech processing techniques are based on the use of such scales.

    Links for further reading


    • Huang X., Acero A., Hon H.-W. Spoken Language Processing: A Guide to Theory, Algorithm and System Development. New Jersey: Prentice Hall PTR, 2001. 910 p. (The handbook for anyone who wants to take up speech recognition. Much of what appears in this series of notes is taken from this book. A must-have.)
    • Chistovich L.A., Ventsov A.V., Granstrom M.P. Physiology of Speech. Speech Perception by Humans. Leningrad: Nauka, 1976. (Books on speech recognition in Russian, unfortunately, stopped being published back in the 1980s, but even those that were published are worth studying. This book is where I got the information about the auditory tract and the structure of the cochlea. If anyone is interested in the performance characteristics of the auditory canal, you are welcome.)
    • DongSuk Yuk. Robust Speech Recognition Using Neural Networks and Hidden Markov Models: Adaptations Using Non-linear Transformations. New Jersey: The State University of New Jersey, 1999. (Many American scientists post their dissertation texts in open access. Many thanks to them for that.)
