Superfast speech recognition without servers on a real example

  • Tutorial

In this article I will show in detail how to quickly and correctly wire up Russian speech recognition based on the Pocketsphinx engine (there is also the OpenEars port for iOS), using a real Hello World example: voice control of home appliances.
Why home appliances? Because this example lets you evaluate the speed and accuracy achievable with completely local speech recognition, without servers such as Google ASR or Yandex SpeechKit.
I also attach to the article all the source code of the program and the build for Android.

Why all of a sudden?

Having recently stumbled upon an article about integrating Yandex SpeechKit into an iOS application, I asked the author why he wanted to use server-based speech recognition for his program (in my opinion it was redundant and led to some problems). In return I got a counter-question: could I describe in more detail the use of alternative approaches for projects where nothing arbitrary needs to be recognized and the dictionary consists of a finite set of words, ideally with a practical example?

Why do we need something else besides Yandex and Google?

As that very "practical application" I chose the topic of voice control of a smart home.
Why this example? Because it demonstrates several advantages of completely local speech recognition over cloud-based solutions. Namely:
  • Speed - we do not depend on servers, and therefore not on their availability, bandwidth, or other such factors
  • Accuracy - our engine works only with the dictionary that our application cares about, which improves recognition quality
  • Cost - we do not have to pay for each request to the server
  • Voice activation - as a bonus to the points above, we can "listen to the air" constantly without wasting traffic or loading servers

I must say right away that these can be considered advantages only for a certain class of projects, where we know in advance exactly which dictionary and grammar the user will operate with - that is, when we do not need to recognize arbitrary text (such as an SMS message or a search query). Otherwise, cloud recognition is indispensable.

So Android can recognize speech without the Internet!

Yes, yes... but only on Jelly Bean, and only from half a meter away, no more. This recognition is the same dictation, just using a much smaller model, so we cannot manage or configure it either. What it will return next time is anyone's guess. Although for SMS it is just right!

What do we do?

We will implement a voice control panel for home appliances that works accurately and quickly, from a few meters away, even on cheap, sluggish hardware: very inexpensive Android smartphones, tablets and watches.
The logic will be simple but very practical. We activate the microphone and say one or more device names. The application recognizes them and switches them on or off depending on their current state. Or it requests their state and announces it in a pleasant female voice - for example, the current temperature in the room.
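As a sketch, the on/off part of that logic (all names here are illustrative, not taken from the project's code) boils down to keeping a state map and flipping the entry for each recognized device:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: remembers each device's on/off state and flips it
// whenever the device name is recognized. Returns the new state.
class DeviceToggle {
    private final Map<String, Boolean> state = new HashMap<>();

    boolean toggle(String device) {
        boolean next = !state.getOrDefault(device, Boolean.FALSE);
        state.put(device, next);
        return next;
    }
}
```

In the real application the new state would also be sent to the smart home controller (see below).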

We will activate the microphone either by voice, by tapping the microphone icon, or even by simply placing a hand over the screen. The screen itself can be completely turned off.

Practical applications
In the morning, without opening your eyes, you slap your palm on the screen of the smartphone on the bedside table and say "Good morning!" - a script launches, the coffee maker switches on and buzzes, pleasant music plays, the curtains open.
We hang a cheap smartphone (around 2,000 rubles, no more) on the wall of each room. We come home after work and command into the void: "Smart home! Light, TV!" - what happens next, I think, needs no explanation.

The video shows what we ended up with. Below we will walk through the technical implementation, with excerpts from actually working code and a bit of theory.

What is Pocketsphinx?

Pocketsphinx is an open-source recognition engine for Android. It also has ports for iOS, Windows Phone, and even JavaScript.
It lets us run speech recognition directly on the device and, at the same time, tune it specifically for our tasks. It also offers a voice activation function "out of the box" (see below).

We can "feed" the recognition engine a Russian language model (you can find it in the sources) and a grammar of user queries. This is exactly what our application will recognize - and nothing else. As a result, it almost never returns something we do not expect.

Grammar JSGF
Pocketsphinx, like many other similar projects, uses the JSGF grammar format. It lets you describe with sufficient flexibility the variants of phrases the user will pronounce. In our case, the grammar will be built from the names of the devices on our network, something like this:
<commands> = лампа | монитор | температура;

Pocketsphinx can also work with a statistical language model, which makes it possible to recognize spontaneous speech not described by a context-free grammar. But for our task this is exactly what we do not need: our grammar consists only of device names. After recognition, Pocketsphinx returns an ordinary line of text in which the devices appear one after another.

#JSGF V1.0;
grammar commands;
public <command> = <commands>+;
<commands> = лампа | монитор | температура;

The plus sign indicates that the user can name not one but several devices in a row.
The application receives the list of devices from the smart home controller (see below) and generates such a grammar in the Grammar class.
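The Grammar class itself is in the repository; as a rough sketch of the generation step (class and method names below are illustrative, not the project's actual API), all it has to do is join the device names into the alternatives rule:

```java
import java.util.List;

// Illustrative sketch: builds a JSGF grammar from a list of device names.
// The real Grammar class in the repository also produces the dictionary.
class JsgfBuilder {
    static String build(List<String> deviceNames) {
        StringBuilder sb = new StringBuilder();
        sb.append("#JSGF V1.0;\n");
        sb.append("grammar commands;\n");
        sb.append("public <command> = <commands>+;\n");
        // each device name becomes one alternative in the rule
        sb.append("<commands> = ").append(String.join(" | ", deviceNames)).append(";\n");
        return sb.toString();
    }
}
```

For the device list above this produces a grammar of the same shape as the one shown earlier.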


The grammar describes what the user can say. For Pocketsphinx to know how the user will pronounce it, every word in the grammar needs an entry describing how it sounds in the corresponding language model - that is, a transcription of each word. This is called a dictionary.

Transcriptions are described using special syntax. For instance:
умный  uu m n ay j
дом  d oo m

In principle, nothing complicated. A double vowel in a transcription indicates stress; a doubled consonant denotes a soft consonant followed by a vowel. All possible combinations for all the sounds of the Russian language can be found in the language model itself.

Clearly, we cannot describe all transcriptions in our application in advance, because we do not know beforehand what names the user will give his devices. Therefore we generate the transcriptions "on the fly", following some rules of Russian phonetics. The PhonMapper class does this: it receives a string as input and generates the correct transcription for it.
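The real PhonMapper rules are considerably richer; as a deliberately naive illustration (the letter-to-phone table below is an assumption based on the transcription examples above, and it ignores stress, softness, and vowel reduction):

```java
import java.util.HashMap;
import java.util.Map;

// Naive sketch of grapheme-to-phoneme mapping for Russian.
// Real transcription must handle stress, palatalization and reduction;
// this just maps each letter to a phone symbol one-to-one.
class NaivePhonMapper {
    private static final Map<Character, String> PHONES = new HashMap<>();
    static {
        PHONES.put('д', "d"); PHONES.put('о', "o"); PHONES.put('м', "m");
        PHONES.put('у', "u"); PHONES.put('н', "n"); PHONES.put('ы', "y");
        PHONES.put('й', "j"); PHONES.put('а', "a");
        // ...the remaining letters are omitted for brevity
    }

    static String transcribe(String word) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray()) {
            String phone = PHONES.get(c);
            if (phone != null) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(phone);
            }
        }
        return sb.toString();
    }
}
```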

Voice Activation

This is the ability of the speech recognition engine to "listen to the air" all the time and react to a predetermined phrase (or phrases), discarding all other sounds and speech. It is not the same as describing a grammar and simply turning on the microphone. I will not present the theory or the mechanics of how this works here; I will only say that the programmers working on Pocketsphinx recently implemented this function, and it is now available "out of the box" in the API.

One thing must be mentioned. For the activation phrase you need not only a transcription but also a suitable sensitivity threshold. Too low a value leads to many false positives (the system triggers when you did not say the activation phrase); too high a value makes the engine fail to react to the phrase at all. So this setting is particularly important. The approximate range of values is from 1e-1 to 1e-40, depending on the activation phrase.

Proximity Sensor Activation
This task is specific to our project and is not directly related to recognition; the code can be seen right in the main activity.
The activity implements SensorEventListener: at the moment of approach (when the sensor value is less than the maximum) it starts a timer that checks, after a certain delay, whether the sensor is still blocked. This is done to eliminate false positives.
When the sensor is unblocked again, we stop recognition and receive the result (see below).
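The Android sensor plumbing aside, the debounce itself is a tiny state machine. A sketch of that logic (the class is illustrative, not the project's code):

```java
// Illustrative sketch of the proximity debounce: recognition starts only
// if the sensor stays covered for the whole delay, which eliminates
// false positives from an accidental wave of the hand.
class ProximityDebouncer {
    private final long delayMs;
    private long coveredSince = -1;

    ProximityDebouncer(long delayMs) { this.delayMs = delayMs; }

    // Feed each sensor reading with its timestamp; returns true
    // once the sensor has been continuously covered for delayMs.
    boolean onSensorValue(boolean covered, long nowMs) {
        if (!covered) { coveredSince = -1; return false; }
        if (coveredSince < 0) { coveredSince = nowMs; return false; }
        return nowMs - coveredSince >= delayMs;
    }
}
```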

Launch recognition

Pocketsphinx provides a convenient API for configuring and starting the recognition process: the SpeechRecognizer and SpeechRecognizerSetup classes.
Here is how the configuration and recognition start look like:

PhonMapper phonMapper = new PhonMapper(getAssets().open("dict/ru/hotwords"));
Grammar grammar = new Grammar(names, phonMapper);
DataFiles dataFiles = new DataFiles(getPackageName(), "ru");
File hmmDir = new File(dataFiles.getHmm());
File dict = new File(dataFiles.getDict());
File jsgf = new File(dataFiles.getJsgf());
saveFile(jsgf, grammar.getJsgf());
saveFile(dict, grammar.getDict());
mRecognizer = SpeechRecognizerSetup.defaultSetup()
    .setAcousticModel(hmmDir)       // path to the acoustic model
    .setDictionary(dict)            // dictionary with transcriptions
    .setBoolean("-remove_noise", false)
    .setKeywordThreshold(1e-7f)     // sensitivity threshold; this value is illustrative
    .getRecognizer();
mRecognizer.addKeyphraseSearch(KWS_SEARCH, hotword);   // activation-phrase search
mRecognizer.addGrammarSearch(COMMAND_SEARCH, jsgf);    // grammar search

Here we first copy all the necessary files to disk (Pocketsphinx requires the acoustic model, grammar, and transcription dictionary to be on disk). Then the recognition engine itself is configured: the paths to the model and dictionary files are specified, along with some parameters (the sensitivity threshold for the activation phrase). Next, the path to the grammar file and the activation phrase itself are configured.

As can be seen from this code, one engine is configured at once for both the grammar and the recognition of the activation phrase. Why is this done? So that we can quickly switch between what needs to be recognized at the moment. Here is how the activation-phrase recognition process starts:

mRecognizer.startListening(KWS_SEARCH);

And so - speech recognition according to a given grammar:

mRecognizer.startListening(COMMAND_SEARCH, 3000);

The second (optional) argument is the number of milliseconds after which recognition will automatically stop if nothing is being said.
As you can see, you can use only one engine to solve both problems.

How to get recognition result

To get the recognition result, you must also register an event listener implementing the RecognitionListener interface.
It has several methods that Pocketsphinx calls when one of the following events occurs:
  • onBeginningOfSpeech - the engine heard some sound, possibly speech (possibly not)
  • onEndOfSpeech - the sound has ended
  • onPartialResult - there are intermediate recognition results; for an activation phrase this means it triggered; the hypothesis argument contains the recognition data (string and score)
  • onResult - the final recognition result; this method is called after the stop method of SpeechRecognizer is called; the hypothesis argument contains the recognition data (string and score)

By implementing the onPartialResult and onResult methods in one way or another, you can change the recognition logic and get the final result. Here is how it is done in the case of our application:

public void onEndOfSpeech() {
  Log.d(TAG, "onEndOfSpeech");
  if (mRecognizer.getSearchName().equals(COMMAND_SEARCH)) {
    mRecognizer.stop();  // forces onResult with the final hypothesis
  }
}

public void onPartialResult(Hypothesis hypothesis) {
  if (hypothesis == null) return;
  String text = hypothesis.getHypstr();
  if (KWS_SEARCH.equals(mRecognizer.getSearchName())) {
    startRecognition();  // the activation phrase was detected
  } else {
    Log.d(TAG, text);
  }
}

public void onResult(Hypothesis hypothesis) {
  String text = hypothesis != null ? hypothesis.getHypstr() : null;
  Log.d(TAG, "onResult " + text);
  if (COMMAND_SEARCH.equals(mRecognizer.getSearchName())) {
    if (text != null) {
      Toast.makeText(this, text, Toast.LENGTH_SHORT).show();
      // ...the recognized string is then handed to the Controller for execution
    }
    mRecognizer.startListening(KWS_SEARCH);  // back to activation-phrase mode
  }
}

When we receive the onEndOfSpeech event while recognizing a command, we need to stop recognition, after which onResult will be called immediately.
In onResult we check what has just been recognized. If it is a command, we send it for execution and switch the engine back to recognizing the activation phrase.
In onPartialResult we are only interested in the activation phrase; if we detect it, we immediately start the command recognition process. Here is what that looks like:

private synchronized void startRecognition() {
  if (mRecognizer == null || COMMAND_SEARCH.equals(mRecognizer.getSearchName())) return;
  // short beep: we heard the activation phrase and are ready for a command
  new ToneGenerator(AudioManager.STREAM_MUSIC, ToneGenerator.MAX_VOLUME).startTone(ToneGenerator.TONE_CDMA_PIP, 200);
  post(400, new Runnable() {
    public void run() {
      mRecognizer.startListening(COMMAND_SEARCH, 3000);
      Log.d(TAG, "Listen commands");
      post(4000, mStopRecognitionCallback);  // forced stop if the user talks too long
    }
  });
}

Here we first play a short tone to notify the user that we have heard him and are ready for his command. The microphone should be off while the tone plays, so we start recognition after a short timeout (slightly longer than the tone, so as not to pick up its echo). We also post a callback that will forcibly stop recognition if the user speaks for too long (in the code above, after 4 seconds).

How to turn a recognized string into commands

Well, here everything is specific to the particular application. In our bare-bones example we simply pull the device names out of the recognized string, look up the corresponding devices, and either change their state with an HTTP request to the smart home controller or report their current state (as with the thermostat). This logic can be seen in the Controller class.
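A minimal sketch of that extraction step (the class is illustrative; the real logic lives in the Controller class):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: pulls known device names out of the recognized string, in order.
class CommandParser {
    private final List<String> knownDevices;

    CommandParser(List<String> knownDevices) { this.knownDevices = knownDevices; }

    List<String> parse(String recognized) {
        List<String> found = new ArrayList<>();
        // Pocketsphinx returns the devices one after another, space-separated
        for (String word : recognized.trim().split("\\s+")) {
            if (knownDevices.contains(word)) found.add(word);
        }
        return found;
    }
}
```

Each device found would then be toggled via an HTTP request to the controller or queried for its state.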

How to synthesize speech

Speech synthesis is the opposite of recognition: here we need to turn a line of text into speech so the user can hear it.
In the case of the thermostat, we must make our Android device speak the current temperature. With the TextToSpeech API this is quite simple (thanks to Google for the beautiful female Russian TTS voice):

private void speak(String text) {
  synchronized (mSpeechQueue) {
    mRecognizer.stop();  // recognition must be off while the synthesizer speaks
    mSpeechQueue.add(text);
    HashMap<String, String> params = new HashMap<String, String>(2);
    params.put(TextToSpeech.Engine.KEY_PARAM_UTTERANCE_ID, UUID.randomUUID().toString());
    params.put(TextToSpeech.Engine.KEY_PARAM_STREAM, String.valueOf(AudioManager.STREAM_MUSIC));
    params.put(TextToSpeech.Engine.KEY_FEATURE_NETWORK_SYNTHESIS, "true");
    mTextToSpeech.speak(text, TextToSpeech.QUEUE_ADD, params);
  }
}

It is probably a banality, but before starting synthesis you must turn off recognition. On some devices (for example, all Samsungs) it is simply impossible to listen to the microphone and synthesize speech at the same time.
The end of speech synthesis (that is, the moment the synthesizer finishes speaking the text) can be tracked in a listener:

private final TextToSpeech.OnUtteranceCompletedListener mUtteranceCompletedListener = new TextToSpeech.OnUtteranceCompletedListener() {
  public void onUtteranceCompleted(String utteranceId) {
    synchronized (mSpeechQueue) {
      mSpeechQueue.poll();  // remove the utterance that just finished
      if (mSpeechQueue.isEmpty()) mRecognizer.startListening(KWS_SEARCH);
    }
  }
};

In it we simply check whether anything else is queued for synthesis, and if not, turn the activation-phrase recognition back on.

And it's all?

Yes! As you can see, quickly and efficiently recognizing speech directly on the device is not difficult at all, thanks to wonderful projects like Pocketsphinx. It provides a very convenient API for solving tasks that involve recognizing voice commands.

In this example we attached recognition to a quite concrete task: voice control of smart home devices. Thanks to local recognition we achieved very high speed and minimized errors.
Clearly, the same code can be used for other voice-related tasks; it does not have to be a smart home.

All sources, as well as the assembly of the application, can be found in the repository on GitHub .
Also on my YouTube channel you can see some other implementations of voice control, and not just smart home systems.
