Speech recognition in Python using pocketsphinx, or how I tried to make a voice assistant
This is a tutorial on using the pocketsphinx library in Python. I hope it helps you get to grips with this library quickly and avoid stepping on the same rakes I did.
It all started when I wanted to make myself a voice assistant in Python. Initially I decided to use the speech_recognition library, and as it turned out, I'm not the only one. For recognition I used Google Speech Recognition, since it was the only backend that did not require any keys, passwords, and so on. For speech synthesis I took gTTS. In the end it turned out to be almost a clone of this assistant, which is why I could not let it rest.
Truth be told, that was not the only reason I could not let it rest:
- responses took a long time (the recording did not stop immediately, and sending speech to the server for recognition and text for synthesis also took a while);
- speech was not always recognized correctly;
- I had to shout from half a meter away from the microphone and enunciate clearly;
- the speech synthesized by Google sounded terrible;
- there was no activation phrase, so sound was constantly being recorded and sent to the server.
The first improvement was speech synthesis using Yandex SpeechKit Cloud:
```python
import requests

URL = ('https://tts.voicetech.yandex.net/generate?text=' + text +
       '&format=wav&lang=ru-RU&speaker=ermil&key=' + key +
       '&speed=1&emotion=good')
response = requests.get(URL)
if response.status_code == 200:
    with open(speech_file_name, 'wb') as file:
        file.write(response.content)
```
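One caveat with the snippet above: the text is concatenated into the URL as-is, so spaces and Cyrillic characters should really be URL-encoded first. A small helper for building the same request URL, using only the standard library (the parameter defaults mirror the snippet above):

```python
from urllib.parse import quote

def build_tts_url(text, key, speaker='ermil', speed='1', emotion='good'):
    """Build the Yandex SpeechKit TTS request URL with the text URL-encoded."""
    return ('https://tts.voicetech.yandex.net/generate?text=' + quote(text) +
            '&format=wav&lang=ru-RU&speaker=' + speaker +
            '&key=' + key + '&speed=' + speed + '&emotion=' + emotion)
```
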
Then it was recognition's turn. I was immediately intrigued by the note "CMU Sphinx (works offline)" on the library page. I will not go over the basic concepts of pocketsphinx, since chubakur already did that before me (many thanks to him) in this post.
I must say right away that installing pocketsphinx is not that easy (at least, it did not go smoothly for me). Installing via pip only works if you have swig installed; otherwise `pip install pocketsphinx` will fail with an error about wheel. Without swig, you need to go here and download the installer (msi). Please note: the installer is only for Python 3.5!
Speech recognition with pocketsphinx
Pocketsphinx can recognize speech from both a microphone and a file. It can also search for hot phrases (this did not quite work out for me: for some reason the code that should run when the hot word is detected executes several times, even though I pronounced it only once). Unlike cloud solutions, pocketsphinx works offline and can work with a restricted dictionary, which increases accuracy. If you are interested, there are examples on the library page; pay attention to the "Default config" section.
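For hot-phrase search, the library's examples show `LiveSpeech` accepting `keyphrase` and `kws_threshold` arguments with the language model disabled. A sketch of how I would set it up (the keyphrase itself is my own hypothetical choice; the configuration is kept in a plain dict so the pure part can be inspected separately from the library call):

```python
import os

def keyphrase_config(model_path, keyphrase, threshold=1e+20):
    """Build LiveSpeech keyword arguments for hot-phrase search.
    The language model is disabled (lm=False): it is not used in keyphrase mode."""
    return dict(
        hmm=os.path.join(model_path, 'zero_ru.cd_cont_4000'),
        lm=False,
        dic=os.path.join(model_path, 'ru.dic'),
        keyphrase=keyphrase,
        kws_threshold=threshold,
    )

if __name__ == '__main__':
    # Requires pocketsphinx and the Russian model installed as described below.
    from pocketsphinx import LiveSpeech, get_model_path
    for _ in LiveSpeech(**keyphrase_config(get_model_path(), 'ассистент')):
        print('Hot phrase detected!')
```
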
Russian language and acoustic model
Out of the box, pocketsphinx comes with an English acoustic model, language model, and dictionary. A Russian one can be downloaded at this link. Unpack the archive, then move the acoustic model folder, `zero_ru.cd_cont_4000`, from the unpacked archive into the pocketsphinx `model` folder. Do the same with the `ru.lm` and `ru.dic` files: `ru.lm` is the language model and `ru.dic` is the dictionary. If you did everything correctly, the following code should work.
```python
import os
from pocketsphinx import LiveSpeech, get_model_path

model_path = get_model_path()
speech = LiveSpeech(
    verbose=False,
    sampling_rate=16000,
    buffer_size=2048,
    no_search=False,
    full_utt=False,
    hmm=os.path.join(model_path, 'zero_ru.cd_cont_4000'),
    lm=os.path.join(model_path, 'ru.lm'),
    dic=os.path.join(model_path, 'ru.dic')
)
print("Say something!")
for phrase in speech:
    print(phrase)
```
First, check that the microphone is connected and working. If the `Say something!` line does not appear for a long time, that is normal: most of this time goes into creating the `LiveSpeech` instance, which takes so long because the Russian language model weighs more than 500 (!) MB. For me, the `LiveSpeech` instance took about 2 minutes to create.

This code should recognize almost any phrase you utter. Admittedly, the accuracy is dreadful, but it can be fixed, and the creation of `LiveSpeech` can be sped up.
Instead of a language model, you can make pocketsphinx work with a simplified grammar, described in a `jsgf` file. Using one also speeds up the instantiation of `LiveSpeech`. How to write grammar files is described here. If a language model is present, the `jsgf` file is ignored, so if you want to use your own grammar file, you need to write it like this:
```python
speech = LiveSpeech(
    verbose=False,
    sampling_rate=16000,
    buffer_size=2048,
    no_search=False,
    full_utt=False,
    hmm=os.path.join(model_path, 'zero_ru.cd_cont_4000'),
    lm=False,
    jsgf=os.path.join(model_path, 'grammar.jsgf'),
    dic=os.path.join(model_path, 'ru.dic')
)
```
Naturally, the grammar file must be created in the `C:/Users/tutam/AppData/Local/Programs/Python/Python35-32/Lib/site-packages/pocketsphinx/model` folder. One more thing: when using `jsgf`, you will have to speak more clearly and separate the words.
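For reference, a grammar file is plain text in JSGF format. A hypothetical `grammar.jsgf` for a two-command assistant might look like this (the grammar name, rule names, and words are my own invention; every word used must also be present in the dictionary):

```
#JSGF V1.0;

grammar commands;

public <command> = <action> <object>;
<action> = включи | выключи;
<object> = свет | музыку;
```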
Create your own dictionary
A dictionary is a set of words and their transcriptions; the smaller it is, the higher the recognition accuracy. To create a dictionary with Russian words, you need the ru4sphinx project. Download and unpack it. Then open a text editor and write out the words that should be in the dictionary, one per line; save the file as `my_dictionary.txt` in the `text2dict` folder, in UTF-8 encoding. Then open a console and run:

```
C:\Users\tutam\Downloads\ru4sphinx-master\ru4sphinx-master\text2dict> perl dict2transcript.pl my_dictionary.txt my_dictionary_out.txt
```

Open `my_dictionary_out.txt` and copy its contents. Open the editor again, paste the copied text, and save the file as `my_dict.dic` (choose "all files" instead of "text file"), again in UTF-8 encoding.
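The word-list step can be sketched in Python (the words here are hypothetical; the point is one word per line and explicit UTF-8):

```python
# Hypothetical word list for a small yes/no style assistant.
words = ['да', 'нет', 'привет', 'пока']

with open('my_dictionary.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(words) + '\n')
```
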
Now point `LiveSpeech` at the new dictionary:

```python
speech = LiveSpeech(
    verbose=False,
    sampling_rate=16000,
    buffer_size=2048,
    no_search=False,
    full_utt=False,
    hmm=os.path.join(model_path, 'zero_ru.cd_cont_4000'),
    lm=os.path.join(model_path, 'ru.lm'),
    dic=os.path.join(model_path, 'my_dict.dic')
)
```
Some transcriptions may need to be tweaked.
Using pocketsphinx via speech_recognition
Using pocketsphinx through speech_recognition makes sense only if you are recognizing English speech. In speech_recognition you cannot pass an empty language model and use jsgf, so you will have to wait about 2 minutes for each fragment to be recognized. Verified.
After burning a few evenings on this, I realized I had wasted my time. Even with a two-word dictionary (yes and no), Sphinx manages to make mistakes, and often. It eats 30-40% of a Celeron's CPU, and with the language model a hefty chunk of memory on top. Yandex, meanwhile, recognizes almost any speech accurately without hogging memory or CPU. So decide for yourself whether it is worth attempting at all.
P.S.: this is my first post, so I welcome advice on the article's style and content.
Which speech recognition solution do you like more?
- CMU Sphinx — 22.7% (25 votes)
- Yandex SpeechKit Cloud — 36.3% (40 votes)
- Google Cloud Speech API — 28.1% (31 votes)
- Custom — 12.7% (14 votes)