Voice control for a media center

Perhaps every science fiction writer since the dawn of the genre has dreamed of controlling a computer by voice. What better than a lively dialogue with a machine to simulate the presence of the latest artificial intelligence and give us reason to believe that sooner or later the coffee grinders will go mad, take over the world and plug insignificant humans into the Matrix?

The first attempts at speech recognition date back to the middle of the last century, and once personal computers became widespread it was only natural to want to put their power to work on the task. I remember that about 15 years ago there were already Windows programs that let you create macros bound to voice commands. With their help I sent guests into raptures when, in response to being told in no uncertain terms where to go, Windows shut down and gave way to the classic message "It's now safe to turn off your computer." These programs worked by comparing the spoken command with samples recorded in advance. The comparison was done by analyzing the sound waves, and the drawback of this approach is obvious - commands have to be pronounced with the same intonation and, preferably, in the same state of consciousness.

Picture for the query “voice control of a computer”. Harrison Ford tells us “enhance 34 to 36”, whatever that means...

A more sensible approach is to analyze the phonetic features of the spoken phrase and try to match each word against a dictionary, which makes the result less dependent on the manner of speech and even on some speech impediments. So how do you get good recognition of Russian speech? The first thing that comes to mind is Google and its API. Some people have quite successfully built this API into their "smart home": a script sends every phrase it hears to the Corporation of Good and then tries to match the recognized text against one of the predefined commands. Naturally, I dismissed this option at once, otherwise I would have to switch the system off every time I needed to discuss how best to get rid of a corpse. Besides, nobody knows how long this freebie will last or whether Google will suddenly decide to block the service.
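The matching step that such scripts perform can be sketched roughly as follows. This is only an illustration in Python, not code from any of those projects: the command list, the `match_command` helper and the 0.6 similarity cutoff are all assumptions, and the recognized text is passed in as a plain string rather than fetched from any real recognition service.

```python
import difflib

# Hypothetical command phrases; a real setup would define its own.
COMMANDS = ["turn on the light", "turn off the light", "play music", "pause"]


def match_command(recognized, commands=COMMANDS, cutoff=0.6):
    """Return the known command closest to the recognized text, or None."""
    # Lowercase and collapse whitespace so trivial differences do not matter.
    normalized = " ".join(recognized.lower().split())
    # get_close_matches ranks candidates by similarity and drops those
    # scoring below the cutoff (an assumed value, tune it for your commands).
    matches = difflib.get_close_matches(normalized, commands, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A slightly misheard phrase such as "Turn on the lights" still maps to the intended command, while unrelated text is rejected instead of being forced onto the nearest command.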

So when it dawned on me one day that I wanted to talk to my HTPC, I turned to offline recognition systems. I started with one of the most popular, CMU Sphinx. The first phrase I tried to get across to it, again and again, was “turn on the light!”. Here is the log of my testing:

thinking paradise
and knows how to drink a pint
then at the top of the corpse of experience
really nose vodka to the world
again and again
the fact of
about it
first this morning
right here

In other words, it might do as a lyrics generator for Zemfira, but it is not fit for real use. Adapting the acoustic model and restricting the vocabulary did not improve things much.

At this point I concluded that, for now, the sanest way to set up voice control is to negotiate with the soulless piece of hardware in the language of the probable adversary. It is no secret that English is simpler than Russian in many respects, including phonetically, which matters most here. And the functionality needed to recognize English speech is already built into recent versions of Windows. “Excuse me, but we didn't graduate from Oxford!” some readers will object. And a good thing, too: Voronezh Construction College prepares you for life in the real world far better. As it turned out, perfect pronunciation is not needed at all. If the computers of the future can understand even Harrison Ford's indistinct mumbling, why should we be any worse? My own accent, for instance, is a cross between Borat and some crazy Russian general from a trashy Hollywood movie, as you can see in the video below. I even took the trouble to add subtitles, because I can hardly make out what I am saying there myself.

How does it work?

A product called VoxCommando (about $27) acts as a layer between the user and Windows Speech Recognition. It uses the Windows engine to recognize a phrase and then matches it against the commands defined by the user. Because the vocabulary is restricted, recognition accuracy is close to 100%.

VoxCommando comes with a large number of useful plugins, including one for XBMC, which interested me most. Besides the XBMC plugin, these also deserve attention:
  • EventGhost plugin - I use it to send IR control signals to the TV and the receiver.
  • Arbitrary HTTP request plugin - I use it to call the Yandex translator API, the very one that translated “snake scale” as a “snake of scale”.
  • Plugins for Vera and X10, which let you control home automation such as lighting.

Configuring voice commands. The left pane lists the commands with their phrases and variations. The right pane is the editor for the current command, with the list of actions to perform (in this case, calls to XBMC through its JSON-RPC API).
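For a sense of what such an action sends, here is a rough Python sketch of an XBMC JSON-RPC call. The `Player.PlayPause` method and the `/jsonrpc` HTTP endpoint come from the XBMC JSON-RPC API; the helper names, host, port and player id are assumptions for illustration, and this is not how VoxCommando itself is implemented.

```python
import json
import urllib.request


def build_request(method, params=None, request_id=1):
    """Build a JSON-RPC 2.0 request body as a JSON string."""
    payload = {"jsonrpc": "2.0", "method": method, "id": request_id}
    if params is not None:
        payload["params"] = params
    return json.dumps(payload)


def send_request(body, host="localhost", port=8080):
    """POST the request to XBMC's HTTP endpoint (host and port are assumed)."""
    req = urllib.request.Request(
        f"http://{host}:{port}/jsonrpc",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Toggle play/pause on player 1 (the player id is an assumption).
body = build_request("Player.PlayPause", {"playerid": 1})
```

Calling `send_request(body)` would only work with XBMC running and its web server enabled, so the sketch builds the request without sending it.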

VoxCommando can use the Text-to-Speech engines installed on the system, so you can try to set up a full-fledged dialogue with the machine. I did not dwell on this: I just taught the young lady to answer “I am” to the question “Who's your daddy?” and left it at that.


Another important question is the choice of microphone. Anyone who has dealt with speech recognition knows that a headset works best. But giving orders to an artificial intelligence with a heap of wires and plastic strapped to your head is not cyberpunk at all - in any science fiction film you would be laughed at for it. Some people quite successfully use a Kinect or a device such as The Voice Tracker, but these have plenty of drawbacks: the range within which speech is picked up well is rather limited, they depend heavily on background noise, and they produce false positives from whatever content is playing. It is entirely possible that the hero of a melodrama, in the middle of a declaration of love, will accidentally utter the name of a porngrind album, and the media center will take this as an unambiguous signal that it is time for some high culture.

Looking for a solution to this problem, I came across the Amulet Remote. It looks like a regular MCE remote, but in addition to the infrared transmitter it contains a wireless microphone that activates when the device is held upright.

Amulet Remote. When the device is held upright, the logo on the remote lights up red, hinting that it wants to talk.

Despite some shortcomings (shorter battery life than conventional remotes and no learning function), I think this is the most successful HTPC voice control device at the moment. The Amulet Remote currently sells for $69, but since the manufacturer ships only to the United States, you will have to use a forwarding company for delivery. Recognition quality with the Amulet Remote is very high, which is no surprise - the device was developed in Ireland and has most likely passed rigorous stress testing with an Irish accent.


The setup described above can be used not only to control a media center but also to control various smart home systems, and for most other tasks that call for automation, whether that is calling a web service, launching an application, or sending an IR signal. For example, a voice command can tell you the weather or turn on the air conditioning. You cannot send it out for beer yet, but let us hope for further technological progress in that direction.
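One way to picture this is a table that maps each recognized command to an action. The Python sketch below is purely illustrative: the phrases, handler names and returned strings are made up, and the handlers only describe what a real one would do (call a weather service, send an IR code) instead of doing it.

```python
# Registry mapping a spoken phrase to the function that handles it.
ACTIONS = {}


def command(phrase):
    """Decorator that registers a handler under the given phrase."""
    def register(func):
        ACTIONS[phrase] = func
        return func
    return register


@command("what is the weather")
def weather():
    # A real handler would query a weather web service here.
    return "http: weather lookup"


@command("turn on the air conditioning")
def air_conditioning():
    # A real handler would send an IR code, e.g. through EventGhost.
    return "ir: air conditioner power"


def dispatch(phrase):
    """Run the handler for a phrase; unknown phrases are ignored."""
    handler = ACTIONS.get(phrase)
    return handler() if handler else None
```

Adding a new capability then means registering one more handler, and a phrase with no handler simply does nothing.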
