
User Feedback Based on Intel RealSense Technology

Software helps people overcome their limitations. Programs let people with visual impairments read, they helped humans land on the Moon and return, and they let us exchange information on a global scale with incredible ease. A few decades ago, all of this seemed like science fiction. Yet despite the power software has in our lives, the ways we interact with it are far from perfect.
With the advent of natural user interfaces (NUIs) such as Intel® RealSense™ technology, we can interact with programs in a new, more natural way. NUIs let us work more conveniently, simply, and efficiently. But these new ways of interacting with programs require a new language.
This article describes Chronosapien Interactive's experience developing natural game interfaces, with a particular focus on user feedback in this environment.
User Expectations
Software in its current state is inflexible and, if you will, unforgiving. It accepts nothing without explicit user actions and expects complete commands before performing anything. We are well trained and have adapted to the demands of our programs. With natural interfaces, though, the picture changes. Everything we have learned about computers, and about how they perceive the world, disappears the moment we say hello to one. When we are told to wave a hand in front of the screen and the computer does not respond immediately, we are confused, because from our point of view we did exactly what was asked. Part of this misunderstanding stems from a lack of knowledge about the technology, but mostly it is because users are asked to communicate with the computer naturally, which leads them to "humanize" it. Users behave as if they were talking to a person, but they do not receive the cues natural communication provides: facial expressions, eye contact, gestures, and so on. The absence of these signals has to be compensated for by giving obvious responses to user actions, answers like "We received your message," "I did not understand this, because..." and "Got it, working on a response." In addition, users need some training to form the right expectations. Treat it like meeting a new person from another country.
Have you ever had a conversation with someone who paused for a long time in the middle of a phrase to better articulate their thoughts? Or imagine you waved at someone and, in return, they awkwardly half-raised a hand. Or you were in a very noisy room and caught only fragments when a friend shouted to you: "It's time to leave!" In such situations, relying on contextual cues and past experience, you could correctly interpret people's intentions from only partial information. But moments like these create serious difficulties for natural interfaces.
In the examples above, some of the information was missing, but in most cases we could reconstruct it from other, related information. When someone stops in the middle of a phrase to collect their thoughts, you do not forget what was said earlier and respond to the half-spoken phrase; you let them finish. The reason is that you know, from indirect cues such as intonation, facial expression, and eye contact, that the person you are talking to has more to say. If someone waves at you awkwardly, you are not confused just because the gesture does not fully match the accepted standard. Instead, you interpret it based on the most likely behavior in that context, and perhaps form some assumptions about the person so you can better adapt to them in the future. If you hear only part of a phrase in a noisy, crowded room, you do not need the complete sentence to guess that it is time to leave. These examples highlight two important things: context and related information. In my examples of user feedback in natural interfaces, you will constantly come across the following premise: it is better to give too much information than not enough.
The analogy of trying to talk to someone in a noisy, crowded room describes working with natural interfaces very well. The situation is made worse by the fact that your interlocutor (the computer) has the short-term memory of a newborn and perceives context roughly at the level of a fruit fly. Listed below are some of the main issues with creating user feedback from data in Intel RealSense applications.
- Often you do not know when the user began interacting with the application.
- You cannot easily distinguish between the user interacting with the application and the user doing something completely unrelated.
- You cannot, without significant effort, teach the program to distinguish between a user who is interacting with the application and another person who simply appears in the camera's field of view.
- There will be a lot of noise in the interaction data, and sometimes the data will simply be wrong.
- The data has no limitations related to the real world.
- Processing the data takes time, which results in awkward pauses between issuing a command and getting a response.
These interaction issues are discussed in the sections below for various applications of Intel RealSense technology. There are a number of general principles to keep in mind when designing both the feedback and the interaction itself. In the course of my work I managed to solve some of these problems, but they remain a serious obstacle to using computers naturally. When developing with or for natural interfaces, be prepared for huge amounts of testing and many iterations. Some of the problems you will encounter are related to the hardware, others to the SDK, and still others to natural interfaces in general.
Hand Tracking in Intel RealSense SDK
The ability of programs to interpret hand movements opens up new possibilities for their creators. Besides providing an intuitive platform on which to build human-computer interaction, using the hands offers a level of immersion in the application that is otherwise unattainable. Using the Intel RealSense SDK, developers can work with a variety of tracked hand joints, the hand's current degree of "openness," and various poses, movements, and gestures. These capabilities, of course, come with certain limitations, as in the other modes of Intel RealSense applications, and these limitations have to be worked around in some way. Below I talk about these limitations and describe the different methods of hand control that we tried.
Hand Interaction Limitations
Tracking Volume

The tracked volume of an Intel® RealSense™ application in hand-tracking mode is finite and may limit what the application can do
One of the problems with hand interaction in the SDK is the hardware-limited tracking range. Since the range of motion of a person's hands is quite large, the hands often leave this range. Going beyond the tracked volume is the most common problem new users encounter when they try to interact with Intel RealSense applications with their hands.
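Checking for this condition in code is straightforward. The sketch below assumes the hand position arrives as normalized image coordinates plus a depth value in meters; the margins and depth limits are illustrative guesses rather than values taken from the SDK.

```cpp
// Sketch: classify how close a tracked hand is to the edge of the tracked volume
// so the application can warn the user before tracking is lost.
#include <cstdio>

struct HandSample {
    float x;      // normalized horizontal position, 0..1
    float y;      // normalized vertical position, 0..1
    float depth;  // distance from the camera, meters
    bool  tracked;
};

enum class VolumeStatus { Inside, NearEdge, Lost };

VolumeStatus ClassifyVolume(const HandSample& h) {
    if (!h.tracked) return VolumeStatus::Lost;

    const float edgeMargin = 0.1f;   // how close to the frame border counts as "near"
    const float minDepth   = 0.25f;  // assumed near limit of the usable hand range
    const float maxDepth   = 0.55f;  // assumed far limit

    bool nearEdge =
        h.x < edgeMargin || h.x > 1.0f - edgeMargin ||
        h.y < edgeMargin || h.y > 1.0f - edgeMargin ||
        h.depth < minDepth || h.depth > maxDepth;

    return nearEdge ? VolumeStatus::NearEdge : VolumeStatus::Inside;
}

// In a game this status would drive on-screen hints rather than console output.
void ReportVolume(VolumeStatus s) {
    if (s == VolumeStatus::NearEdge)
        std::printf("Hint: your hand is close to the edge of the tracked volume\n");
    else if (s == VolumeStatus::Lost)
        std::printf("Hint: hand not detected; move it back in front of the camera\n");
}
```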
Occlusion

Hands occluding each other, from Robust Arm and Hand Tracking by Unsupervised Context Learning
The second most common limitation of the SDK and of other image-based tracking systems is occlusion. Simply put, occlusion is when one object blocks another. This problem matters most in hand tracking, because in many natural poses and gestures one hand is in front of the other from the camera's point of view. Conversely, if the screen is used as the viewer, the hands often block the screen from the user.
Hand size relative to screen size
When interacting with an application using the hands, it seems natural to build the interface as if the user were touching the viewer, that is (most commonly), the screen. However, if the hands are used in this way, there is almost no room left on the screen for anything else. This creates problems both for the graphical user interface and for the application itself.
Hand fatigue
Controlling the digital world with your hands brings a new degree of freedom, but it is easy to overdo it. One of the most important problems noted both in our applications and in others: when using the hands to work with an application, users begin to feel tired after 60–90 seconds. It helps a little if there is a table the user can rest their elbows on, but this does not fully solve the problem.
Lack of tactile feedback
Of everything we lose when we abandon traditional computer interfaces, tactile feedback is the most important. When you make hand gestures in the air, the simplest form of feedback is gone: the mechanical sensation of pressing a button. This means that the application, since tactile feedback is impossible, must provide clear visual and audible feedback instead.
Hands as a pointer

Our implementation of hands as a pointer in the game Space Between. The pointer is the glowing ball next to the sharks
In our game Space Between, we found that the application can be controlled conveniently by using the hands as a pointer. This provides an intuitive bridge between controlling the application the traditional way (with the mouse) and the new way (with the hands). Below I describe some of the problems we encountered with this approach, our implementation, and our successes and failures in terms of usability.
Our challenges
Here are the problems we discovered while trying to use hands as a pointer.
Users did not understand exactly what they controlled.
In Space Between, users directly control a glowing ball that follows the position of their hands on the screen in real time. In our games, the character controlled by the player follows the pointer, which results in somewhat indirect control. Many times, when users first tried our game, it took them quite a while to realize that they were controlling the pointer, not the character itself.
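A minimal sketch of this scheme is shown below, assuming the tracked hand position is already available as normalized coordinates; the easing constant is an illustrative value, not the one used in Space Between.

```cpp
// Sketch: hand position drives a pointer, and the character eases toward the
// pointer each frame instead of snapping to it, which is what makes the
// control feel indirect.
#include <cmath>

struct Vec2 { float x, y; };

// Map a normalized hand position (0..1 in each axis) to screen pixels.
Vec2 HandToScreen(Vec2 handNorm, float screenW, float screenH) {
    return { handNorm.x * screenW, handNorm.y * screenH };
}

// Move the character a fraction of the way toward the pointer every frame.
// followSpeed is an illustrative tuning constant; dt is the frame time.
Vec2 FollowPointer(Vec2 character, Vec2 pointer, float followSpeed, float dt) {
    float t = 1.0f - std::exp(-followSpeed * dt);  // frame-rate independent easing
    return { character.x + (pointer.x - character.x) * t,
             character.y + (pointer.y - character.y) * t };
}
```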
Users did not understand what the pointer controls.
Since in our game pointer control is used in different contexts and in different ways, sometimes users could not understand what exactly the pointer should control.
Users' hands often left the tracked volume.
As mentioned earlier, this is the most common problem when using hands to interact with Intel RealSense applications. Even when the visible pointer was at the edge of the screen, users did not connect this with the fact that their hands had reached the boundary of the tracked volume.
Pointer in Space Between
In Space Between, we used a two-dimensional pointer in three ways.
Gust of wind

A pointer in the form of a gust of wind in the game Space Between
What worked
Of the three options, the gust of wind was the most abstract. Fortunately, its amorphous outline made it possible to mask most of the noise in the position data that inevitably occurs in Intel RealSense applications. In addition, we added audio whose volume varied with the speed of the pointer. This was convenient because users could tell whether their hand movements were being tracked or not (this could also be seen from the movement of the clouds on the screen).
What didn't work
The amorphous outline was convenient for masking noise, but it made it impossible to point at a precise spot on the screen. Because of this, difficulties arose when, for example, trying to select a particular game by moving the pointer over objects on the screen.
Glowing ball

Another pointer in the game Space Between
What worked
The pointer cast light onto the environment but was drawn on top of it. Thanks to this, users knew exactly where in the environment their character would move, and at the same time there were no problems like "the pointer got lost among the walls." Because of its relatively small size, we could also see how accurate the SDK's hand-tracking module was. Initially we used the ball itself as the pointer, but a problem arose: it was easy to lose sight of it during quick hand movements. To handle this, we created a particle trail that lingered behind the pointer for about a second. This solution had a nice side effect: it was fun to simply move the pointer around and draw shapes with it. Finally, to connect the pointer to the player's character, we created a trail linking the two. It was especially helpful for making the relationship between the pointer and the character obvious.
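A trail like this can be kept in a small ring buffer of recent pointer positions that fade out over roughly a second. The sketch below is an illustration of the idea, not the code used in Space Between.

```cpp
// Sketch: a fixed-size ring buffer of recent pointer positions. The renderer
// draws each point with an alpha that fades to zero over about one second,
// which keeps a fast-moving pointer visible as a trail.
#include <array>
#include <cstddef>

struct TrailPoint { float x, y, age; };

class PointerTrail {
public:
    void Update(float pointerX, float pointerY, float dt) {
        for (auto& p : points_) p.age += dt;           // age existing points
        points_[next_] = { pointerX, pointerY, 0.0f }; // overwrite the oldest slot
        next_ = (next_ + 1) % points_.size();
    }
    // Opacity for a point: 1 when fresh, 0 once it is older than the lifetime.
    float Alpha(const TrailPoint& p) const {
        float a = 1.0f - p.age / lifetimeSeconds_;
        return a > 0.0f ? a : 0.0f;
    }
    const std::array<TrailPoint, 32>& Points() const { return points_; }

private:
    std::array<TrailPoint, 32> points_{};
    std::size_t next_ = 0;
    float lifetimeSeconds_ = 1.0f;  // roughly the one-second trail described above
};
```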
What did not work
The main problem with the glowing ball in our games: sometimes users did not realize that they were controlling the pointer rather than the character itself. There was another problem with the glowing ball. Besides controlling the character's position, we also tried to use it for another function: showing whether the palm was open or closed. To do this, we increased the light's intensity and made the ball brighter. In the future, we will rework the pointer so that it visibly changes when the palm opens. In addition, we could briefly display an image of a hand next to it so that users understand exactly what they are controlling.
Hand pointer

A hand pointer in Space Between is used to interact with the menu
What worked
The hand pointer was the simplest and most intuitive of the three options. Since the pointer was shaped like a hand (and was the right size), users immediately understood what they were operating and how. We went further and created animated transitions between the different hand positions corresponding to the current state. This was very convenient, because the user immediately saw that the system had recognized the change and was responding to it. Even if some action was not used in the current context, players could easily see what the application was interpreting and how well.
What didn't work
From a usability standpoint the hand-shaped pointer was excellent, but stylistically it did not fit the game environment at all. We either broke the player's immersion in the game world, or we had to restrict this pointer to application-management contexts, such as the pause and settings menus.
Conclusions
Our experience shows that there is no single answer to the question of how to implement a pointer when using hands: it all depends on the context and application. Nevertheless, there are a number of universal rules regarding the provision of feedback to users.
- Wherever possible, give the user visual and audible feedback when the state of the hands changes. This helps players understand what the system is tracking and lets them discover the game's capabilities naturally.
- If the user's hands leave the tracked volume, report it immediately and clearly. We do not have this in Space Between at the moment, but it would solve many usability problems, for example when users do not understand why the game has stopped tracking their gestures or responds with a delay when their hands return to the camera's field of view.
Hands and gestures

The first stage of the raise gesture in The Risen
Gestures are a powerful way to express thoughts and perform actions. The consistency of gestures makes very precise control possible and creates sensations unique to the environment in which they are used. Using gestures helped our Intel RealSense games, Space Between and The Risen, connect players with the actions they perform. As before, I will first talk about the problems we encountered when using gestures, then describe how we implemented them, and finish with what, in our opinion, worked and what did not.
Our challenges
Gestures are harder than just tracking. Here are some of the problems we found while working on gesture control.
There is no way to determine the beginning of a gesture.
This depends, of course, on the particular gesture used, but in general the ready-made gestures supported by the corresponding Intel RealSense mode give no indication of having begun until the gesture has actually been completed. This may not sound serious, but with complex gestures you have to wait for the whole gesture to finish, only then find out that it did not register, and repeat it again.
Many users perform gestures correctly, but not precisely enough for the application to recognize them
As I said above, the gesture recognition software is very exacting. Scrolling gestures must travel a certain distance, hands must move in a certain way, at a certain distance from the camera, and so on. All of this sometimes makes using gestures very uncomfortable.
Some hand angles are not optimized for tracking using Intel RealSense technology
One of the biggest drawbacks of the hand-tracking algorithms is their inability to track certain orientations. Currently, the system detects hands very well when the palms face the camera, but detection is much worse when the palms are perpendicular to it. This affects many gestures, but especially gestures with complex movements. For example, in The Risen we created a gesture for raising skeletons: first the user shows their palms to the camera, then lowers the hands while turning the palms upward, and then raises them. During the part of this gesture where the palms become horizontal, the application often stops tracking them, which interrupts the gesture partway through.
The raise gesture in The Risen

The second stage of the raise gesture in The Risen
The Risen uses a custom raise gesture that is important for drawing players into the game's atmosphere and making them feel part of the game world. Here is what we learned along the way.
What worked
We managed to get players to fully understand the required movements, because the gesture is used many times in the game. In addition, we wanted to avoid long texts describing in minute detail how the hand positions change over time. Our solution was to show animated hands in the scene during the tutorial section, so that players could see exactly how the gesture should be performed. The hands in the animation were the same size as the user's hands in the scene, so users immediately understood what was required of them.
When creating the gesture, we knew that users' hands would very likely not be positioned correctly. We also took into account the limitations of hand tracking in the SDK. To address this, we chose an initial gesture position that the tracking module recognizes well. In addition, we tell the user, in effect, "the raise gesture is starting now": the user receives a visual and audible notification that the system knows the gesture has begun. This avoids unnecessary repetitions and tells the player exactly what the system needs.
Following the principle of dividing the gesture into parts for the sake of usability, we also trigger visual and sound effects when the second part of the gesture is reached. Since this gesture is quite complex (and non-standard), this signaled to players that they were doing everything right.
We divided the gesture into parts for technical and usability reasons, but it can still be performed in one continuous motion. The parts are used only to show prompts about correct execution and to indicate errors, if any.
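A rough sketch of this staged approach is shown below. The pose checks are placeholder booleans standing in for tests built on real joint data, and the feedback call stands in for the game's visual and sound effects.

```cpp
// Sketch: the raise gesture split into stages, with a feedback hook fired at
// every transition so the player knows each part of the gesture registered.
#include <cstdio>

struct HandPose {
    bool palmsFacingCamera;  // stage 1: palms shown to the camera
    bool palmsUpLow;         // stage 2: hands lowered, palms turned upward
    bool handsRaised;        // stage 3: hands raised
};

enum class RaiseStage { Idle, Started, MidPoint, Completed };

class RaiseGesture {
public:
    void Update(const HandPose& pose) {
        switch (stage_) {
            case RaiseStage::Idle:
                if (pose.palmsFacingCamera)
                    Advance(RaiseStage::Started, "Raise gesture started");
                break;
            case RaiseStage::Started:
                if (pose.palmsUpLow)
                    Advance(RaiseStage::MidPoint, "Second stage reached, keep going");
                break;
            case RaiseStage::MidPoint:
                if (pose.handsRaised)
                    Advance(RaiseStage::Completed, "Raise gesture complete");
                break;
            case RaiseStage::Completed:
                stage_ = RaiseStage::Idle;  // ready for the next use
                break;
        }
    }

private:
    void Advance(RaiseStage next, const char* feedback) {
        stage_ = next;
        TriggerFeedback(feedback);
    }
    void TriggerFeedback(const char* message) {
        std::printf("%s\n", message);  // the game would play effects instead
    }
    RaiseStage stage_ = RaiseStage::Idle;
};
```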
What didn’t work
Our main problem with this gesture was the tracking limitation: when the palms become perpendicular to the camera during the gesture, tracking often stops and the gesture is canceled halfway through. We still cannot control this, but informing users about this quirk helps.
Conclusions
Here are some things to keep in mind when creating feedback for input using gestures.
- Proper preparation and explanation are crucial so that users understand how to perform gestures. We used an animated image of three-dimensional hands, and this method seems optimal, because users see exactly what needs to be done.
- Providing feedback at the different stages of complex gestures helps avoid frustration. As users get used to the technology, telling them that the system is working (or not working) saves them from repeating gestures over and over.
Virtual hands

Using virtual hands to interact with the environment in The Risen
The ability to reach into the virtual world and interact with it as we do with our own is an unforgettable experience. The level of immersion it provides cannot be compared with anything else. In The Risen, we give users the opportunity to reach into the game world to open doors or trigger traps. Below I list some of the problems associated with hand interaction, describe our implementation of virtual hands in The Risen, and discuss how successful it has been.
Detected problems
Managing virtual hands is very cool, but implementing such control using the out-of-the-box SDK features is difficult. Here are some problems that you will have to solve in some way.
A lot of noise in the data
When displaying and driving a hand with the hand data from the SDK, a lot of noise shows up. The SDK has smoothing algorithms, but they do not remove all of the unwanted noise.
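One simple way to add smoothing on top of what the SDK already provides is an exponential moving average per joint, sketched below; the smoothing factor is a guess and trades responsiveness for stability.

```cpp
// Sketch: per-joint exponential moving average applied on top of the SDK's
// own smoothing. Lower alpha suppresses more jitter but adds lag.
struct Vec3 { float x, y, z; };

Vec3 SmoothJoint(Vec3 previousSmoothed, Vec3 rawSample, float alpha = 0.3f) {
    return { previousSmoothed.x + alpha * (rawSample.x - previousSmoothed.x),
             previousSmoothed.y + alpha * (rawSample.y - previousSmoothed.y),
             previousSmoothed.z + alpha * (rawSample.z - previousSmoothed.z) };
}
```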
Data has no restrictions associated with the real world.
Besides the noise, the nodes (corresponding to the joints of the hand) are sometimes assigned positions that are physically impossible in the real world. They also sometimes jump across the screen at the speed of light for a few frames; this happens when part of the hand is not sufficiently visible.
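A cheap sanity check helps here: if a joint appears to move faster than a hand plausibly can between frames, keep the previous position. The speed limit in the sketch below is an assumption, not an SDK constant.

```cpp
// Sketch: reject a joint update if it implies an impossibly fast movement.
#include <cmath>

struct Joint { float x, y, z; };

Joint RejectImpossibleJump(Joint previous, Joint current, float dt,
                           float maxSpeedMetersPerSec = 6.0f) {
    float dx = current.x - previous.x;
    float dy = current.y - previous.y;
    float dz = current.z - previous.z;
    float speed = std::sqrt(dx * dx + dy * dy + dz * dz) / (dt > 0.0f ? dt : 1.0f);
    // A jump "at the speed of light" is almost certainly a tracking glitch,
    // so keep the previous position instead of the new one.
    return speed > maxSpeedMetersPerSec ? previous : current;
}
```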
Small interactions are very difficult to perform and detect.
We wanted to let players interact with objects that are relatively small compared to the size of their hands. But because of the significant noise in the data, the fuzzy sense of depth, and the lack of tactile feedback, this turned out to be almost impossible.
Virtual Hands in The Risen
In The Risen, players can interact with the world using the hands of a ghostly skeleton. With these hands, the player helps the skeletons in various ways, for example by opening doors or triggering traps for enemies. Implementing virtual hands taught us a lot.
What worked

The Risen game interface: displaying the detected face and right hand
The first thing to note is the graphical user interface we created for The Risen. The skull in the upper left corner represents the player and the inputs currently being tracked. When the system detects the hands, they are shown on screen in animated form, so the player knows the system recognizes them. It sounds very simple, but it is genuinely useful when the player can tell what is working and what is not. For example, if the system detects the player's head but not their hands, it means the hands are outside the tracked volume.
To indicate which objects in the world can be used with the hands, we show an icon that hovers over objects when they first appear on screen and shows how they can be interacted with. We wanted players to know how to use different things and also to be able to discover the interactive possibilities of the environment on their own. Showing the icon in the early stages of the game turned out to be a well-balanced solution.
I describe our initial approach to interacting with environment objects below in the section "What didn't work," but in the end we achieved what we wanted: a simple grab gesture using the whole hand worked quite acceptably. This went some way toward solving the two problems mentioned above (the fuzzy sense of depth and the lack of tactile feedback) without significantly hurting the game. At the same time, however, we had to be stricter about which objects can be interacted with this way, because if two or more objects are within the hand's reach, all of them are affected at once.
To indicate to users that their hands are in the interaction state (palm clenched), we changed the color of the hands. Using the hands then became similar to using buttons: there was an inactive state and an active state in which it was quite obvious what the application expected. From this, users could work out where to interact.
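The sketch below illustrates this kind of whole-hand grab. It assumes an openness value in the 0–100 range similar to what the hand module reports, picks only the nearest object within reach (one possible way to avoid triggering several overlapping objects at once, not necessarily how The Risen resolves it), and flags the hand for recoloring when it is closed.

```cpp
// Sketch: whole-hand grab. A closed hand interacts with the nearest
// interactable object within an assumed reach radius.
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

struct Interactable {
    Vec3 position;
    bool active;  // set when the object has been grabbed or triggered
};

static float Dist(Vec3 a, Vec3 b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// openness: 0 = fist, 100 = fully open (assumed scale).
// Returns the index of the grabbed object, or -1 if nothing was grabbed.
// handHighlighted tells the renderer to recolor the hand in the "active" state.
int UpdateGrab(Vec3 handPos, float openness, std::vector<Interactable>& objects,
               bool& handHighlighted) {
    const float closedThreshold = 25.0f;  // hand counts as closed below this
    const float reachRadius     = 0.15f;  // meters; assumed hand reach

    handHighlighted = openness < closedThreshold;
    if (!handHighlighted) return -1;

    int best = -1;
    float bestDist = reachRadius;
    for (int i = 0; i < static_cast<int>(objects.size()); ++i) {
        float d = Dist(handPos, objects[i].position);
        if (d < bestDist) { bestDist = d; best = i; }  // nearest object only
    }
    if (best >= 0) objects[best].active = true;
    return best;
}
```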
What didn't work
When we first thought about using the hands to interact with the environment, we imagined movements like "pull the target" or "move the book," as if these objects were right in front of the player. The problem was that such movements were very difficult to perform precisely. Grabbing a chain with your fingers when you cannot judge depth correctly and get no tactile feedback proved to be a very difficult task with a huge number of failed attempts. More accurate tracking would mitigate this problem somewhat, but really solving it would take a stereoscopic display and tactile feedback for the hands.
Conclusions
A brief summary of key findings from trying to use virtual hands.
- Interaction using the simplest gestures works best. Perhaps when the technology improves (or with other display devices) finer gestures will become usable, but for now it is better to stick to the simplest ones.
- Provide visual and audible feedback when the hands are in the "interacting" state. This lets the user know that the system has found objects within reach, which makes interaction easier.
Head Tracking in the Intel® RealSense™ SDK

Head tracking in the Intel RealSense SDK shows the orientation of the user's head.
Suppose the application "knows" where the user's head is. So what? The benefits are not always obvious. Unlike with the hands, you probably should not build elaborate control schemes on head tracking, or users will end up dizzy. However, limited use of head tracking can give an application a distinctive character and immerse users deeper in the virtual world. The Intel RealSense SDK has several limitations when tracking the head, and they should be kept in mind, especially when designing user feedback in an application.
Head Tracking Limitations
Tracking Volume
As mentioned in the first part of this article, understanding that the Intel RealSense SDK only tracks what is in the camera's field of view is essential to using the device correctly. This is truly the most important problem users run into with any Intel RealSense application, and it applies to every Intel RealSense mode except speech recognition. The restriction is less pronounced with the head-tracking module, because users usually sit at a table in front of the computer anyway. Still, difficulties are possible here too, for example if the user has to lean to the side while working with the application.
Most of the time users sit facing the camera, so detecting the face is not particularly difficult for the program. Face detection is needed mainly at the start and whenever tracking is lost. It is hardest for the program to find the user's face when the camera is positioned above or below the head (so that the face appears in the camera's field of view at a sharp angle). The solution, as with the hands, is to show the camera what it needs, namely the face. The need for face detection also imposes certain limits on head tracking; for example, the back of the head cannot be tracked if the user turns away from the camera.
Head as a pointer

At the Twilight Zone stage of the Space Between game, a luminous pointer was used to track the user's head.
In Space Between, we used a head pointer representing the two-dimensional position of the player's head on the screen. Our users did not have to select objects on the screen with their heads, as is usually done with a pointer. In the end, we built the whale controls on the "hands as a pointer" principle described earlier.
Now let's talk about some of the difficulties in developing this interaction; then we will move on to our implementation and discuss what did and did not work in terms of usability.
Our challenges
People often lean beyond the tracking boundaries.
This is one of the most common problems when people first use Intel RealSense applications, but different people immerse themselves in an application in different ways, and some leaned much farther than we expected. Leaving the tracked area would not be a problem in itself, except that Intel RealSense then has to re-detect the user. We found that in such cases all the benefits of natural control are lost, so a workaround was needed.
The up and down movement was not as intuitive as we expected.
Horizontal movement via a sideways lean of the head was easy to implement, but vertical movement turned out to be more complicated. Literally raising and lowering the head to control vertical position made no sense, since users would have to stand up or crouch for their head to move up or down relative to the camera. Instead, we decided to use the distance from the camera (the user leans toward the camera or away from it), but found that this method was not very intuitive.
Pointer in Space Between
In Space Between, we used the head pointer to control the whale in a phase called the Twilight Zone. Here the player can lean left and right to swim in the corresponding direction; leaning forward makes the whale dive, and leaning back makes it surface. The whale had to move along a glowing track, and each head tilt moved it a certain distance along the track, letting the player score points along the way.
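The mapping itself can be as simple as the sketch below, which assumes a normalized horizontal head position and a depth in meters, with a rest depth calibrated when the player sits down; the dead zone and scale are invented values.

```cpp
// Sketch: mapping head position to whale input. Sideways lean steers left or
// right; leaning toward or away from the camera dives or surfaces.
#include <algorithm>
#include <cmath>

struct HeadSample {
    float x;      // normalized horizontal position, 0..1 (0.5 = centered)
    float depth;  // distance from the camera, meters
};

struct WhaleInput {
    float horizontal;  // -1..1, negative = left
    float vertical;    // -1..1, negative = dive
};

WhaleInput HeadToWhale(const HeadSample& head, float restDepth) {
    const float deadZone   = 0.05f;  // ignore tiny head movements
    const float depthScale = 4.0f;   // how strongly leaning maps to diving

    float h = std::clamp((head.x - 0.5f) * 2.0f, -1.0f, 1.0f);
    // Leaning toward the camera gives a depth below restDepth, i.e. a negative
    // value, which we interpret as diving.
    float v = std::clamp((head.depth - restDepth) * depthScale, -1.0f, 1.0f);

    if (std::abs(h) < deadZone) h = 0.0f;
    if (std::abs(v) < deadZone) v = 0.0f;
    return { h, v };
}
```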
What worked
From the standpoint of user feedback, tying the pointer to the position of the head helped players understand how the head is tracked. Each time the game loaded, instructions and graphic prompts showing which input modes are used appeared on screen. This helped players get comfortable with the controls. Once users understood that the pointer showed their head position, they immediately began to intuitively understand the camera's field of view: a pointer at the edge of the screen meant that the player's head was at the border of the camera's field of view.

A fragment of instructions in the game Space Between, showing the input used from the Intel RealSense SDK.
What did not work
We had an animation of the whale turning when the user tilted their head left or right, but the animation was not very noticeable, and sometimes people did not realize they had already turned in the right direction. A clearer visual indication was needed that the tilt of the user's head directly drives the whale's movement left, right, up, or down. In addition, there was initial confusion about what exactly the pointer represented; it would have been better to show or explain that the pointer corresponded to the position of the head.
Conclusions
- It is important to prepare users for the type of input that will be applied.
- To avoid losing tracking, show in some way where the boundary of the camera's field of view is relative to the input being used.
- Control becomes much more intuitive when the visual feedback is tied to what the user is controlling.
Speech Recognition in the Intel RealSense SDK
I saved the hardest part for dessert, and I will start with a confession: most of what we learned concerns limitations and things that did not work. For those new to the Intel RealSense SDK, there are two types of speech recognition: commands and dictation. In command mode, the user issues specific commands that the Intel RealSense SDK listens for. In dictation mode, a string of recognized text is returned. We managed to significantly improve the user experience of the voice module, but it still turned out to be (by a wide margin) the most inconvenient mode, both for users and for us. The challenge is to use user feedback to overcome the technical limitations of speech recognition.
Speech Recognition Limitations
The accuracy of the module does not meet the expectations of users.
Many users already have experience with speech recognition software such as Apple Siri*, Google Now*, or Microsoft Cortana*. All of these solutions are cloud based and work with huge amounts of data and complex algorithms. None of that is available to a local solution such as the one in the Intel RealSense SDK. User expectations are shaped by the more powerful cloud solutions, so you have to compensate for the limitations by giving users feedback and instructions.
Sometimes there are significant delays between the pronunciation of commands and their recognition.
Depending on the application, sometimes there are significant delays between the moment the user pronounces the command and the moment the Intel RealSense SDK recognizes it and returns it as text.
The tone of the voice, its timbre, and its volume play an important role in the accuracy of speech recognition.
Our experience shows that adult male speech is recognized best, while quiet speech or higher-pitched voices are recognized less reliably.
Accent affects the accuracy of speech recognition.
For English, the Intel RealSense SDK lets you choose between two variants: American and British. Of course, this choice is not enough to cover all the varieties of pronunciation, so the speech of people with other accents is recognized less reliably.
The quality of the microphone significantly affects the accuracy of speech recognition.
The microphone integrated into the Intel RealSense (F200) camera works about as well as a typical webcam's omnidirectional microphone, but a headset microphone is better for speech recognition.
Ambient noise significantly affects the accuracy of speech recognition.
This is the most serious problem for any application that uses speech recognition. Ambient noise varies greatly from one environment to another. Speech recognition works best in a quiet environment where speech is heard loud and clear. The problem is partly solved by using a headset microphone, but in general you should not expect satisfactory speech recognition anywhere outside a quiet home office.
Speech recognition as an input controller

This is where I command my skeletons in The Risen.
Using voice commands in applications is one of the most powerful ways to break down the barriers between people and computers. When it works, it is wonderful. When it does not, it is very annoying. In The Risen, players can give voice commands to their subordinates, the skeletons. Now let's talk about our challenges and how we approached them from the standpoint of user feedback.
Our challenges
Often, voice commands are simply not recognized by the system.
This fact by itself is reason enough to question whether voice control is worth using at all. Designing user feedback that can overcome the technical limitations of the Intel RealSense SDK is also a daunting task.
Users often do not understand why their command was not recognized.
Did they not speak loudly enough, or clearly enough? Was the module not initialized? Or did the program simply not like the user's voice? These are just a few of the many questions people have when trying to use speech recognition.
When playing for the first time, it is easy to forget what the commands were.
We tried to represent the voice commands with icons, but if players only ever see words, the commands are easy to forget.
Speech Recognition in The Risen
In The Risen, you can give the skeletons simple commands: forward, back, attack, defend. Each of these commands puts the skeletons into a certain state, which lets the player direct their actions at a high level. The skeletons' state is shown with colored icons in the graphical user interface and is reflected on the skeletons themselves.
We also built an interface to give users feedback about the beginning and end of speech recognition, along with a slider that controls the microphone volume. As feedback during command detection, we play a mouth-movement animation on the skeleton portrait when we receive the LABEL_SPEECH_BEGIN notification and stop it when we receive LABEL_SPEECH_END. The microphone volume slider is meant to improve recognition quality; its color also changes to indicate that the voice is too loud or too quiet.

Microphone slider in The Risen game
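The sketch below illustrates both feedback paths. The event names mirror the LABEL_SPEECH_BEGIN and LABEL_SPEECH_END notifications mentioned above, but they are plain enum values here rather than the SDK's types, and the volume thresholds for the slider color are illustrative.

```cpp
// Sketch: feedback hooks for speech input. Speech begin/end events drive the
// skeleton's mouth animation; the microphone level drives the slider color.
enum class SpeechEvent { SpeechBegin, SpeechEnd };

struct SpeechFeedback {
    bool mouthAnimationPlaying = false;

    // Called when the speech module reports that speech has started or ended.
    // Starting the animation on "begin" tells the player they were heard long
    // before the command has actually been recognized.
    void OnSpeechEvent(SpeechEvent e) {
        if (e == SpeechEvent::SpeechBegin) mouthAnimationPlaying = true;
        if (e == SpeechEvent::SpeechEnd)   mouthAnimationPlaying = false;
    }
};

enum class SliderColor { TooQuiet, Good, TooLoud };

// Map a 0..1 input level to the color of the microphone slider.
SliderColor VolumeColor(float level) {
    if (level < 0.2f) return SliderColor::TooQuiet;  // thresholds are guesses
    if (level > 0.8f) return SliderColor::TooLoud;
    return SliderColor::Good;
}
```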
What worked
In terms of showing which state the skeletons are in, the most informative approach was the visual effects applied to the skeletons themselves. Before we implemented them, users would issue several commands in a row without realizing the skeletons were already in the desired state, or would not understand why the state of some skeletons differed from the rest (as dictated by the game mechanics). The visual effects also helped in debugging the skeletons' artificial intelligence.
The microphone volume slider turned out to be so useful that I strongly recommend implementing a similar feature in every game that uses speech recognition. It not only lets players dynamically adjust the microphone volume and increase the likelihood of commands being recognized correctly, it also tells users why voice control might not be working. This is very important for fighting user frustration: players get implicit confirmation that the microphone works and commands are being heard, and they understand how to make the control more reliable.
What didn't work
The player's animated skeleton portrait was supposed to indicate that command recognition was in progress, but in practice this almost never worked. I think the reason is that the interface has many different elements to watch, so players simply did not notice yet another animated image. Note, however, that we only built a short demo level for this game, and users simply did not have enough time to get comfortable with the interface.
I also think the icons representing the skeletons' state were often ignored. For a regular game (without voice control) this would be fine, but we needed to tell the user that a command had just been detected (and exactly when), and the icons did not do that well. It seems to me that to confirm recognition of a voice command, you should flash the recognized word on screen for a second or so to reliably get the user's attention. As a bonus, this approach would also help users remember the available commands.
Conclusions
- Inform the user that speech has been detected even before processing finishes, to avoid frustration and repeated commands while the speech is being processed.
- It should be obvious what the user is controlling with speech, and state changes should be clearly visible.
- Give users a microphone volume slider that also shows when they are speaking too quietly or too loudly.
- You can display the available commands on screen to help users remember them.
- Inform the user that the command has been recognized.
Man’s new friend
Computers are gradually penetrating every area of our lives. As technology advances, we find new ways to use them and entrust them with increasingly important tasks. The time is fast approaching when small computers will be built into our own bodies. Our relationship with computers is becoming more and more natural.
Intel RealSense technology and other natural user interfaces are the first step in this direction. These technologies give us the chance to take a genuinely new approach to interacting with the world. But this is only the beginning, and we as developers and creators are responsible for making sure the technology develops in the right direction. One day computers will be like our best friends, able to anticipate our intentions before we even begin to express them. For now they are closer to pets: they need our help just to tell us they want to go for a walk.