The concept of a voice interface for managing a computing system to help people with speech disorders



    Currently, much attention is paid to creating an accessible environment for people with disabilities. Computer equipment and specialized information systems are an important means of ensuring accessibility and of improving quality of life, social interaction, and integration into society for people with disabilities. An analysis of the literature shows that various developments are underway to facilitate human-computer interaction, including voice interfaces for controlling a computing system. However, these developments focus on speaker-independent systems that are trained on big data and do not take into account how people with various speech impairments pronounce computer commands.

    The goal of the research is to design a speaker-dependent voice interface for controlling a computing system based on machine learning methods.

    Tasks addressed in this work:

    1. Review voice interfaces and how they are used to control computing systems;
    2. Examine approaches to personalizing voice control of a computing system;
    3. Develop a mathematical model of the voice interface for controlling a computing system;
    4. Develop an algorithm for the software implementation.

    Solution methods. To solve these tasks, methods of system analysis, mathematical modeling, and machine learning are used.

    Voice interface as a way to manage a computing system

    Creating speech recognition systems is an extremely complex task. It is especially difficult to recognize the Russian language, which has many features. All speech recognition systems can be divided into two classes:

    Speaker-dependent systems are tuned to the speaker's speech during training. To work with another speaker, such systems require complete reconfiguration.

    Speaker-independent systems work regardless of who is speaking. Such systems do not require prior training and are able to recognize the speech of any speaker.

    Systems of the first type were the first to appear on the market. In them, the sound image of a command was stored as a whole-word template. To compare an unknown utterance with a command template, dynamic programming methods were used. These systems worked well on small sets of 10–30 commands and understood only one speaker. To work with another speaker, they required complete reconfiguration.
    To understand fluent speech, it was necessary to move to much larger vocabularies, from several tens to hundreds of thousands of words. The methods used in systems of the first type were not suitable for this problem, since it is simply impossible to create templates for so many words.
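
    The template comparison used by first-generation systems can be sketched as follows. This is a minimal illustration that assumes dynamic time warping (DTW) as the dynamic programming method (the text does not name a specific one) and uses short made-up number sequences in place of real acoustic features. The names dtw_distance, recognize and templates are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(utterance, templates):
    """Return the label of the stored template closest to the utterance."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))


# Made-up one-dimensional "feature contours" standing in for recorded commands.
templates = {
    "open": [0, 1, 3, 5, 3, 1, 0],
    "close": [5, 4, 2, 0, 2, 4, 5],
}
# The same "open" contour spoken more slowly (each value held twice as long).
slow_open = [0, 0, 1, 1, 3, 3, 5, 5, 3, 3, 1, 1, 0, 0]
print(recognize(slow_open, templates))
```

    Because the warping absorbs differences in speaking rate, the slowed-down utterance still matches its template. The cost grows with the number of stored templates, which is one reason the approach does not scale to large vocabularies.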


    In addition, there was a desire to build a system that does not depend on the speaker. This is a very difficult task, since each person has an individual manner of speaking: the rate of speech, the timbre of the voice, and the peculiarities of pronunciation. Such differences are called speech variability. To account for it, new statistical methods were proposed, based mainly on hidden Markov models (HMM) or artificial neural networks. The best results are achieved by combining these two methods. Instead of creating templates for each word, templates are created for the individual sounds that make up the words, the so-called acoustic models. Acoustic models are formed by statistical processing of large speech databases containing recordings of hundreds of people. Existing speech recognition systems use two fundamentally different approaches:

    Voice tag recognition: recognition of speech fragments against a previously recorded pattern. This approach is widely used in relatively simple systems designed to execute pre-recorded speech commands.

    Recognition of lexical items: extraction of the simplest lexical elements, such as phonemes and allophones, from speech. This approach is suitable for text dictation systems, in which the spoken sounds are fully converted into text.

    A review of various Internet sources identifies the following software products that solve speech recognition problems, along with their main characteristics:

    Gorynych PROF 3.0 is an easy-to-use program for speech recognition and typing by dictation with the support of the Russian language. It is based on Russian developments in the field of speech recognition.

    • speaker dependence;
    • language dependence (Russian and English);
    • recognition accuracy depends on the core of the American program "Dragon Dictate";
    • provides voice control of individual functions of the operating system, text editors and application programs;
    • requires training.

    VoiceNavigator is a high-tech solution for contact centers, designed for building voice self-service systems (VSS). VoiceNavigator handles calls automatically using speech synthesis and speech recognition technologies.


    • speaker independence;
    • resistance to ambient noise and interference in the telephone channel;
    • Russian speech recognition works with a reliability of 97% (100-word dictionary).

    Speereo Speech Recognition - speech recognition takes place directly on the device, not on the server, which is a key advantage, according to the developers.


    • Russian speech recognition works with a reliability of about 95%;
    • speaker independence;
    • vocabulary of about 150 thousand words;
    • simultaneous support for multiple languages;
    • compact engine size.

    Sakrament ASR Engine (developed by Sakrament) is a speech recognition technology used to create speech control tools: programs that control a computer or other electronic device using voice commands, as well as to organize telephone help and information services.

    • speaker independence;
    • language independence;
    • recognition accuracy reaches 95-98%;
    • speech recognition in the form of expressions and small sentences;
    • no training capability.

    Google Voice Search - Google voice search has recently been built into the Google Chrome browser, which allows the service to be used on various platforms.


    • Russian language support;
    • the ability to embed speech recognition on web resources;
    • voice commands, phrases;
    • requires a permanent Internet connection.

    Dragon NaturallySpeaking (Nuance) is a world leader in speech recognition software. It can create new documents, send e-mail, and control popular browsers and a variety of applications through voice commands.


    • there is no support for the Russian language;
    • recognition accuracy up to 99%.

    ViaVoice (IBM) is a software product for hardware implementations. Based on this core, ProVox Technologies created VoxReports, a system for dictating radiologists' reports.


    • recognition accuracy reaches 95-98%;
    • speaker independence;
    • The system dictionary is limited to a set of specific terms.

    Sphinx is a well-known and workable open source speech recognition system. It is developed at Carnegie Mellon University, distributed under the Berkeley Software Distribution (BSD) license, and available for both commercial and non-commercial use.


    • speaker independence;
    • continuous speech recognition;
    • learnability;
    • availability of the version for embedded systems - Pocket Sphinx.

    Thus, the review showed that the market is dominated by software products that target a large number of users, are speaker-independent, and, as a rule, have a proprietary license, which significantly limits their use for controlling a computing system by people with disabilities. Systems for voice control of specialized tools, such as a smart home or an exoskeleton, are not universal. However, interest in new technologies is increasing, and there are opportunities to control various devices, including household appliances, through mobile communications and Bluetooth. User-oriented voice control technology will improve the quality of everyday life and social adaptation for people with disabilities.

    Mathematical apparatus for speaker-dependent recognition and its features

    To solve the problem posed in this work, let us analyze the system requirements.

    The system should:

    1. be speaker-dependent;
    2. be trained on the particular pronunciation of a particular user;
    3. recognize a certain number of voice tags and translate them into control commands.

    The voice interface should work with a limited vocabulary of dictated commands.

    A voice command is a sound wave. The sound wave can be represented as the spectrum of frequencies it contains. Digital sound is a way of representing an electrical signal by discrete numerical values of its amplitude. The input to the voice interface is a sound file held in RAM; the file is fed to the neural network, and the program produces the corresponding result.

    Digitization is the fixing of the signal amplitude at certain intervals and the recording of the obtained amplitude values as rounded digital values. Signal digitization includes two processes: sampling and quantization.

    Sampling is the process of taking signal values at a fixed time step, called the sampling step. The number of measurements of the signal taken in one second is called the sampling frequency, or sampling rate. The smaller the sampling step, the higher the sampling rate and the more accurate the resulting picture of the signal.

    Quantization is the process of replacing the real values of the signal amplitude with approximate values of some accuracy. Each of the 2^N possible levels (where N is the number of bits per sample) is called a quantization level, and the distance between the two closest quantization levels is called the quantization step. If the amplitude scale is divided into levels linearly, quantization is called linear, or uniform.

    The recorded amplitude values of the signal are called samples. The higher the sampling rate and the more quantization levels, the more accurate the digital representation of the signal.
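
    The two digitization steps can be illustrated with a short sketch. It assumes a test sine wave in place of a recorded voice and amplitudes normalized to the range [-1, 1]; the function names and parameter values (440 Hz, 8000 Hz, 8 bits) are illustrative choices, not values from this work.

```python
import math


def sample_sine(freq_hz, sample_rate_hz, duration_s):
    """Sample a unit-amplitude sine wave at a fixed sampling rate."""
    n_samples = round(sample_rate_hz * duration_s)
    step = 1.0 / sample_rate_hz  # sampling step in seconds
    return [math.sin(2 * math.pi * freq_hz * k * step) for k in range(n_samples)]


def quantize(value, n_bits):
    """Linearly quantize a value from [-1, 1] to one of 2**n_bits levels."""
    levels = 2 ** n_bits
    q_step = 2.0 / (levels - 1)  # quantization step between neighbouring levels
    index = round((value + 1.0) / q_step)  # level index, 0 .. levels - 1
    return index, -1.0 + index * q_step


samples = sample_sine(freq_hz=440, sample_rate_hz=8000, duration_s=0.01)
quantized = [quantize(s, n_bits=8) for s in samples]
# With linear quantization the error never exceeds half a quantization step.
max_error = max(abs(s - q) for s, (_, q) in zip(samples, quantized))
print(len(samples), max_error)
```

    Raising n_bits shrinks the quantization step and therefore the maximum error, while raising the sampling rate increases the number of samples per second.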

    As a mathematical tool for selecting characteristic features, it is advisable to use a neural network that can learn and automatically select the necessary features. This makes it possible to train the system on a particular user's pronunciation of voice commands. After comparing the mechanisms of different neural networks, we selected the two most appropriate: the Kosko network and the Kohonen network.

    The Kohonen self-organizing map is a neural network with unsupervised learning that performs visualization and clustering. It is a method of projecting a multidimensional space onto a space of lower dimension (most often two-dimensional). It is also used for modeling, forecasting, identifying sets of independent features, searching for patterns in large data arrays, and developing computer games. It is one of the variants of Kohonen's neural networks.

    Kohonen's network is suitable because it can automatically partition training examples into clusters, where the number of clusters is specified by the user. Once the network is trained, it can determine which cluster an input example belongs to and output the corresponding result.
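
    The clustering behaviour described above can be sketched as follows. This is a simplified Kohonen layer (winner-take-all competitive learning) without the neighbourhood function of a full self-organizing map, and it uses made-up two-dimensional points instead of audio features. The names train_kohonen and best_unit and all parameter values are assumptions for illustration.

```python
import random


def best_unit(weights, x):
    """Index of the neuron whose weights are closest to x (squared Euclidean)."""
    dists = [sum((w - xi) ** 2 for w, xi in zip(unit, x)) for unit in weights]
    return dists.index(min(dists))


def train_kohonen(data, n_clusters, epochs=50, lr0=0.5, seed=0):
    """Train a winner-take-all Kohonen layer and return its weight vectors."""
    rng = random.Random(seed)
    # Initialise each neuron's weights from a randomly chosen training example.
    weights = [list(x) for x in rng.sample(data, n_clusters)]
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)  # learning rate decays linearly
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            winner = best_unit(weights, x)
            # Move only the winning neuron toward the example.
            weights[winner] = [w + lr * (xi - w) for w, xi in zip(weights[winner], x)]
    return weights


cluster_a = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2)]
cluster_b = [(10.0, 9.9), (9.8, 10.0), (10.1, 10.2)]
weights = train_kohonen(cluster_a + cluster_b, n_clusters=2)
labels = [best_unit(weights, x) for x in cluster_a + cluster_b]
print(labels)
```

    The cluster indices themselves are arbitrary; what matters is that examples of the same command end up in the same cluster, so each cluster still has to be associated with a command after training.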

    Kosko's neural network, or bidirectional associative memory (BAM), is a single-layer neural network with feedback, based on two ideas: the adaptive resonance theory of Stephen Grossberg and the Hopfield auto-associative memory. The BAM is heteroassociative: the input vector is fed to one set of neurons, and the corresponding output vector is produced on a different set of neurons. Like the Hopfield network, the BAM is capable of generalization, producing correct responses despite distorted inputs. In addition, adaptive versions of the BAM can be implemented that extract a reference pattern from noisy copies. These capabilities strongly resemble the process of human thinking and allow artificial neural networks to take a step towards brain modeling.

    The advantage of this network is that, on the basis of discrete neural networks of adaptive resonance theory, a new bidirectional associative memory has been developed that can memorize new information without retraining the neural network. This allows the user to add voice tags when needed.
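
    A minimal sketch of heteroassociative recall is given below. It implements the classical Kosko BAM with Hebbian outer-product storage on bipolar (+1/-1) vectors, not the adaptive-resonance-based variant mentioned above. The class name, the toy patterns and their sizes are assumptions for illustration.

```python
def sign(v, previous):
    """Bipolar threshold; keep the previous state when the input is exactly zero."""
    return 1 if v > 0 else -1 if v < 0 else previous


class BAM:
    def __init__(self, n_in, n_out):
        self.w = [[0] * n_out for _ in range(n_in)]

    def store(self, x, y):
        """Add one bipolar pair (x, y) to the weights (Hebbian outer product)."""
        for i, xi in enumerate(x):
            for j, yj in enumerate(y):
                self.w[i][j] += xi * yj

    def recall(self, x, max_iters=10):
        """Iterate x -> y -> x until the pair stops changing; return (x, y)."""
        x = list(x)
        y = [1] * len(self.w[0])
        for _ in range(max_iters):
            new_y = [sign(sum(x[i] * self.w[i][j] for i in range(len(x))), y[j])
                     for j in range(len(y))]
            new_x = [sign(sum(self.w[i][j] * new_y[j] for j in range(len(y))), x[i])
                     for i in range(len(x))]
            if new_x == x and new_y == y:
                break
            x, y = new_x, new_y
        return x, y


x1, y1 = [1, 1, 1, 1, -1, -1, -1, -1], [1, 1, -1]
x2, y2 = [1, -1, 1, -1, 1, -1, 1, -1], [-1, 1, 1]
bam = BAM(n_in=8, n_out=3)
bam.store(x1, y1)
bam.store(x2, y2)  # a new pair is added by a simple additive update

noisy_x1 = [-1] + x1[1:]  # first component flipped
print(bam.recall(noisy_x1))
```

    In this classical form, storing a pair only adds to the existing weights, so earlier pairs do not have to be presented again. Capacity is limited, however, and recall degrades when stored patterns are strongly correlated.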


    The concept of software implementation contains three stages that are implemented in a single software product that has an ergonomic graphical interface.

    Collection of training examples.

    To train the neural network, the user is prompted to pronounce each voice tag from the vocabulary several times. Since the recorded phrases consist of one word, the file size does not matter. For further processing, the sound is recorded in WAV format, a lossless PCM recording format. It is a standard input for further sound processing in Python with the python_speech_features library. Each audio file must be accompanied by its "value" (the corresponding command), which is needed for further training of the neural network.
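
    Storing labeled recordings as PCM WAV files can be sketched with Python's standard wave module. The file naming scheme, the 16 kHz sample rate, the 16-bit mono format and the tiny hand-written sample lists are assumptions for illustration; a real recording would come from a microphone.

```python
import os
import struct
import tempfile
import wave


def save_example(path, samples, sample_rate=16000):
    """Write 16-bit mono PCM samples (ints in [-32768, 32767]) to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))


def load_example(path):
    """Read a 16-bit mono WAV file back as (samples, sample_rate)."""
    with wave.open(path, "rb") as wav:
        n = wav.getnframes()
        frames = wav.readframes(n)
        return list(struct.unpack("<%dh" % n, frames)), wav.getframerate()


# Hypothetical training set layout: each recording is paired with its command label.
workdir = tempfile.mkdtemp()
dataset = []
for label, samples in [("open", [0, 1000, -1000, 0]), ("close", [500, -500, 250, -250])]:
    path = os.path.join(workdir, label + "_01.wav")
    save_example(path, samples)
    dataset.append((path, label))

loaded, rate = load_example(dataset[0][0])
print(loaded, rate)
```

    The loaded samples and sample rate are the kind of input that a feature extractor such as the mfcc function of python_speech_features takes, although that step is not shown here.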

    Neural network training.

    The program reads the audio files and generates new ones by changing the length of the audio track as well as the pitch, volume and voice. This increases the number of examples in the training sample, which improves the quality of recognition by the neural network. In the program, the user is asked to train the network on the previously recorded voice tags. The user can also add new training voice tags to the base and continue training the neural network later.
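
    Generating extra training examples from one recording can be sketched as follows. The sketch covers only volume and speed changes; it assumes simple linear-interpolation resampling, which changes duration and pitch together, so independent pitch or voice changes would need a more elaborate method. All function names and factor values are illustrative.

```python
def change_volume(samples, gain):
    """Scale the amplitude of every sample by a constant gain."""
    return [s * gain for s in samples]


def change_speed(samples, factor):
    """Resample by linear interpolation; factor > 1 shortens the clip.

    This simple method changes duration and pitch together.
    """
    n_out = max(1, int(len(samples) / factor))
    last = len(samples) - 1
    out = []
    for k in range(n_out):
        pos = k * factor  # position in the original signal
        i = min(int(pos), last)
        frac = pos - i
        nxt = samples[min(i + 1, last)]
        out.append(samples[i] * (1 - frac) + nxt * frac)
    return out


def augment(samples):
    """Return several modified copies of one recording."""
    variants = [change_speed(samples, factor) for factor in (0.9, 1.1)]
    variants += [change_volume(samples, gain) for gain in (0.5, 1.5)]
    return variants


original = [0.0, 1.0, 0.0, -1.0] * 25  # 100 made-up samples
variants = augment(original)
print([len(v) for v in variants])
```

    Each variant keeps the label of the recording it was derived from, so one spoken example yields several training examples.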

    Using the program.

    After the program has been trained on the given words, the user can start working or add new voice tags to the training set. The trained neural network can recognize the audio files supplied to it.
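
    Translating a recognized voice tag into a control command can be sketched as a lookup table. The tag names and the placeholder actions below are hypothetical; a real implementation would call the operating system or the controlled device.

```python
def open_browser():
    return "browser opened"  # placeholder for a real action


def lock_screen():
    return "screen locked"  # placeholder for a real action


def make_dispatcher(commands):
    """Return a function that maps a recognized voice tag to its action."""
    def dispatch(label):
        action = commands.get(label)
        if action is None:
            return None  # unknown tag: do nothing
        return action()
    return dispatch


dispatch = make_dispatcher({
    "open browser": open_browser,
    "lock screen": lock_screen,
})
print(dispatch("open browser"))
```

    Keeping the mapping in a table means that adding a voice tag only requires a new entry, in line with the idea of extending the vocabulary on demand.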


    Thus, this work reviewed the current market of voice interfaces and their use. It is shown that this type of software is focused on speaker-independent voice control and does not take into account the individual characteristics of the user, which is especially important for people with disabilities and those with speech disorders.

    The requirements for a voice interface for managing a computing system to help people with speech disorders are defined.

    A mathematical apparatus suitable for the implementation of the concept is described. An algorithm for the software implementation of the voice interface has been compiled.

    Further work involves building a program with a convenient graphical interface that implements a prototype of the voice control interface. It could be used for various tasks, such as controlling household appliances, computers, and robotic technology (for example, an exoskeleton) for people with disabilities.
