Why does a robot need ears? (Poll: do we need OpenTod?)


    The Second Law of Robotics, formulated by the famous American science fiction writer Isaac Asimov, states that a robot must obey the orders given to it by a human. But how do you give a robot orders? If most science fiction films are to be believed, the most natural way to communicate with a robot is ordinary human speech. That is why we gave the robot Tod, as befits a true servant of man, the long-awaited ability to understand voice commands and to synthesize speech in Russian. Now it is enough to say, for example, "Robot, go to the kitchen" for the robot to carry out the corresponding task. Under the cut, we describe the software used for speech recognition and synthesis on the robot, and in the videos we show examples of voice commands in use.
    The direction our project takes depends on the opinion of the Habr community. Are you interested in using Tod as an open-source development platform? Please vote in our poll.


    Speech recognition with PocketSphinx


    Most owners of modern smartphones have already tried some kind of voice search and appreciated its advantages over traditional touch input. And some automation enthusiasts have taught their PCs to understand voice commands; fortunately, there are plenty of tutorials on the subject both on Habr and elsewhere on the net.
    If your robot runs Linux, teaching it to understand speech is not much harder than doing the same on a home PC. You can use any of the open-source speech recognition engines. Unlike cloud speech recognition services, this lets the robot keep taking commands even when there is no Internet connection.
    Our robot uses the open-source CMU Sphinx speech engine, developed at Carnegie Mellon University and actively supported by the Massachusetts Institute of Technology and Sun Microsystems. One advantage of this engine is the ability to adapt its acoustic model to a specific speaker. And, importantly for us, the engine integrates easily with ROS, the robotics framework our Tod is built on.
    CMU Sphinx consists of three main components:
    • acoustic model - transforms sound into phonemes
    • phonetic dictionary - maps words to their phoneme sequences
    • language model - assembles sentences from the recognized words

    The acoustic model is a corpus of speech recordings segmented into phonetic units. For a small dictionary you could record such a corpus yourself, but it is better to use the acoustic base of the VoxForge.org project, which contains more than ten hours of dictated Russian speech.
    The next step, adapting the acoustic model, is optional, but it improves recognition of your particular voice: the phrases you dictate are added to the base acoustic model, letting the recognizer account for the peculiarities of your pronunciation.
    A dictionary in CMU Sphinx is just a text file listing words and their corresponding phoneme sequences. Our dictionary consists of various robot control commands (the entries are English glosses of Russian command words, each followed by its Russian phoneme transcription, so several distinct Russian words may share one gloss; numbered variants such as "(2)" are alternative pronunciations):
    without bb je s
    without (2) bb iz
    without (3) bb is
    without (4) bb je z
    without (5) bb je s
    forward f pp i rr jo t
    time v rr je mm i
    where g dd je
    two dv aa
    two or three dv aa t rr ii
    day dd je nn
    tomorrow z aa ftr ay
    hall z aa l
    hello zdr aa stvuj you
    know zn aa i sh
    name is zav uu t
    like k aa k
    what k aa k ay i
    what (2) kak aa i
    what kak oo j
    end kanc aa
    who kt oo
    cuisine k uu h nn uj
    love ll ju bb i sh
    me mm i nn ja
    cute mm ii lyj
    me m nn je
    can m oo zh y sh
    my m oo j
    find naj tt ii
    weeks nn i dd je ll i
    look back ag ll ja t kk i
    one a dd ii n
    dad p aa pp i
    beer pp ii v ay
    you got p ay zh yv aa i sh
    play p ay igr aa im
    while pak aa
    weather pag oo d ay
    item p rr id mm je t
    bring p rr i vv i zz ii
    hi p rr i vv je t
    tell r ay ska zh yy
    today ss iv oo d nn i
    now ss ij ch ja s
    now (2) ss i ch ja s
    now (3) sch ja s
    how much sk oo ll k ay
    you tt i bb ja
    you (2) tt ja
    you (3) tt i
    point t oo ch k ay
    three t rr ii
    three-four t rr ii ch it yy rr i
    you t yy know
    how u mm je i sh
    four ch it yy rr i
    anything sh t oo nn ib uu tt
    anything (2) ch t oo nn ib uu tt
    anything (3) ch t oo nn ibu tt

    The dictionary is then converted into a language model that the CMU Sphinx engine understands. The resulting recognition process runs through the full chain: the acoustic model turns audio into phonemes, the dictionary maps phonemes to words, and the language model assembles the words into sentences.
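    To make this concrete, here is a minimal sketch of how the three components plug into the engine, using the pocketsphinx Python bindings; the model paths are placeholders, and this is an illustration rather than the exact code running on Tod.

    # Minimal PocketSphinx decoding sketch (pocketsphinx Python bindings).
    # The model paths are placeholders for your own Russian models.
    from pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string('-hmm', 'model/ru/acoustic')  # acoustic model directory
    config.set_string('-dict', 'model/ru/tod.dic')  # phonetic dictionary
    config.set_string('-lm', 'model/ru/tod.lm')     # language model
    decoder = Decoder(config)

    # Feed raw 16 kHz, 16-bit mono PCM, e.g. one recorded spoken command.
    with open('command.raw', 'rb') as f:
        decoder.start_utt()
        decoder.process_raw(f.read(), False, True)  # whole utterance at once
        decoder.end_utt()

    if decoder.hyp() is not None:
        print('Recognized:', decoder.hyp().hypstr)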



    In ROS, any node that subscribes to the /recognizer/output topic can now receive, as text, the sentences built by the CMU Sphinx language model. We wrote a small voice-control node that takes the recognized phrases and converts them into patrol commands or synthesizes the robot's spoken replies. Below you will find a video on this topic.
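    A simplified sketch of such a node is shown below. It assumes the recognizer publishes std_msgs/String messages on /recognizer/output; the /cmd_vel topic and the phrase mapping are illustrative, not Tod's actual command set.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Simplified voice-control node: turns recognized phrases into motion
    # commands. The /cmd_vel topic and phrase mapping are illustrative.
    import rospy
    from std_msgs.msg import String
    from geometry_msgs.msg import Twist

    def on_phrase(msg):
        twist = Twist()
        if 'вперёд' in msg.data:      # "forward": drive ahead
            twist.linear.x = 0.2
        elif 'стой' in msg.data:      # "stop": publish zero velocities
            pass
        else:
            return                    # ignore phrases we do not handle
        cmd_pub.publish(twist)

    rospy.init_node('voice_control')
    cmd_pub = rospy.Publisher('/cmd_vel', Twist, queue_size=1)
    rospy.Subscriber('/recognizer/output', String, on_phrase)
    rospy.spin()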

    Speech synthesis with Festival


    For full-fledged communication with the robot, one thing was still missing: voice feedback. Our robot Tod learned to speak with the help of the Festival speech synthesis package available on Linux. Festival is likewise a joint development of several large universities; it provides high-quality speech synthesis and supports Russian. With the Sphinx/Festival combination, you can implement a complete dialogue. And here is a video demonstrating the use of our robot's voice commands.
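    The feedback side can be as simple as piping text to Festival. Here is a minimal sketch, assuming a Russian Festival voice (for example, one of the msu_ru voices) is installed and selected as the default:

    # -*- coding: utf-8 -*-
    # Minimal voice feedback sketch: pipe text to `festival --tts`, which
    # reads it from stdin and plays the synthesized speech. Assumes a
    # Russian Festival voice is installed and set as the default.
    import subprocess

    def say(text):
        festival = subprocess.Popen(['festival', '--tts'], stdin=subprocess.PIPE)
        festival.communicate(text.encode('utf-8'))

    say(u'Привет! Я робот Тод.')  # "Hi! I am the robot Tod."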



    What else can you hear?


    Speaking of sound-related tasks, one cannot fail to mention HARK, a Japanese audio-processing package that greatly extends what a robot can do with sound. Here are some of its capabilities:
    • sound source localization
    • separation of several useful sound sources (for example, the phrases of several people talking at the same time)
    • noise filtering to extract "clean" speech from the audio stream
    • creation of a three-dimensional audio effect for telepresence tasks




    There is little point in using HARK with a single microphone, since most sound-processing tasks are solved with a so-called microphone array. Here the Kinect comes in very handy, with its array of four microphones mounted on the front.
    Of course, we did not miss the opportunity to use HARK in our project. While patrolling the premises, the robot must react to surrounding events, including a person addressing it. The sound source localization module provided by HARK can help the robot find its interlocutor even when the person is out of its line of sight. The task reduces to localizing the sound source and turning the head so that it faces the speaker. See how this looks in our video.
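    Here is a sketch of that behaviour. It assumes HARK's ROS integration publishes localized sources as hark_msgs/HarkSource messages (each source carrying an azimuth) and that the head's pan servo accepts std_msgs/Float64 position commands; the topic names are assumptions, not Tod's actual configuration.

    #!/usr/bin/env python
    # Sketch of "turn the head toward the speaker". Topic names and message
    # fields are assumptions based on typical HARK-ROS and servo-controller
    # setups, not Tod's actual configuration.
    import math
    import rospy
    from std_msgs.msg import Float64
    from hark_msgs.msg import HarkSource

    def on_sources(msg):
        if not msg.src:
            return
        loudest = max(msg.src, key=lambda s: s.power)  # face the loudest source
        pan_pub.publish(Float64(math.radians(loudest.azimuth)))

    rospy.init_node('face_the_speaker')
    pan_pub = rospy.Publisher('/head_pan_controller/command', Float64, queue_size=1)
    rospy.Subscriber('/HarkSource', HarkSource, on_sources)
    rospy.spin()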



    Regular readers of our blog have probably noticed that since the last post the Tod robot has not only grown wiser but has also literally grown: it has acquired a manipulator and a second Kinect. In the next post, we will talk about how to control the manipulator and use it to grasp objects. See you again on our blog.


    Are you interested in creating an open community of Tod robot developers, with the project's source code and hardware open, a forum, tutorials, webinars, and documentation in Russian?

    • 28.2% (33 votes): I would like to take an active part in developing the Tod robot's software and hardware, writing documentation, supporting the forum, and helping the community grow in every way I can.
    • 62.3% (73 votes): I would like to use the Tod robot's open-source software and hardware in my programming practice, radio engineering, or scientific research.
    • 5.9% (7 votes): I would be interested in the project developing in another direction (please say which in the comments).
    • 3.4% (4 votes): I do not consider your project promising (please say why in the comments).
