How to generate binaural sound from a mono audio track: video will help

Researchers at the University of Texas at Austin (UT Austin) have developed a neural network that takes the mono audio track of a video and recreates “surround” sound for it.

Here is how it works.

Photo by marneejill / CC BY-SA

A new method for creating 3D sound

Surround sound is common in games and movies, but it is rare in ordinary online videos. Recording it requires expensive equipment that video creators often do not have: many of them shoot exclusively on smartphones.

An audio track recorded this way limits our perception of the video: it cannot convey where sound sources are located in space or how they move. As a result, the sound of the video can feel "flat."

Kristen Grauman, a professor at UT Austin, and student Ruohan Gao took up this problem. They created a machine learning system that turns the mono audio track of a video into a “volumetric” one. The technology is called "2.5D Visual Sound".

This is not full-fledged spatial sound but a “simulated” one. However, according to the developers, an ordinary listener will hardly notice the difference.

How the technology works

The system developed at UT Austin uses two neural networks.

The first neural network is based on the ResNet architecture, which researchers from Microsoft presented in 2015. It recognizes objects in the video and collects information about their movement in the frame. As output, the network produces a matrix, called a feature map, with the coordinates of the objects in each frame of the video.

This information is passed to the second neural network, Mono2Binaural, which was developed at the University of Texas. This network also takes as input spectrograms of the audio recording, obtained using the short-time (windowed) Fourier transform with a Hann window.
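To make the spectrogram step more concrete, here is a minimal sketch of a short-time Fourier transform with a Hann window, written with NumPy only. The frame length, hop size, sample rate and test tone are illustrative assumptions, not the values used by the UT Austin system.

```python
import numpy as np

def stft(signal, frame_len=512, hop=128):
    """Short-time Fourier transform with a Hann window.

    Returns a complex array of shape (frame_len // 2 + 1, n_frames).
    frame_len and hop are illustrative defaults, not the authors' settings.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping frames and taper each one with the window
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real-valued signal
    return np.fft.rfft(frames, axis=1).T

# One second of a 440 Hz tone sampled at 16 kHz (assumed test input)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
mono = np.sin(2 * np.pi * 440 * t)

spec = stft(mono)
magnitude = np.abs(spec)

# The strongest frequency bin should sit next to 440 Hz
peak_bin = int(magnitude.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / 512
print(spec.shape, peak_hz)
```

The complex array `spec` holds one spectrum per time frame. Its real and imaginary parts, or its magnitude, are the kind of two-dimensional input that a convolutional network can process.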

Mono2Binaural consists of ten convolutional layers. Each of them is followed by a batch normalization block, which improves the prediction accuracy of the algorithm, and a rectified linear unit (ReLU) activation.
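The layer pattern described here (convolution, then batch normalization, then ReLU) can be sketched in plain NumPy. This is a sketch of one such block under stated assumptions, not the Mono2Binaural implementation: the kernel size, stride, channel counts and random weights are all hypothetical, and the normalization omits the learned scale and shift that real batch normalization layers have.

```python
import numpy as np

def conv2d(x, weights, stride=2):
    """Valid 2D convolution (cross-correlation, as in most deep learning libraries).

    x: (in_channels, height, width); weights: (out_channels, in_channels, k, k).
    """
    out_ch, in_ch, k, _ = weights.shape
    h_out = (x.shape[1] - k) // stride + 1
    w_out = (x.shape[2] - k) // stride + 1
    out = np.zeros((out_ch, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i * stride : i * stride + k, j * stride : j * stride + k]
            # Contract every filter with the patch over channels and both spatial axes
            out[:, i, j] = np.tensordot(weights, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def batch_norm(x, eps=1e-5):
    """Normalize each channel to zero mean and unit variance.

    Simplification: statistics come from a single example, with no learned scale or shift.
    """
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    """Rectified linear unit: negative values become zero."""
    return np.maximum(x, 0.0)

def conv_block(x, weights):
    """One convolution -> batch normalization -> ReLU block."""
    return relu(batch_norm(conv2d(x, weights)))

rng = np.random.default_rng(0)
spectrogram = rng.standard_normal((2, 64, 64))     # e.g. real and imaginary parts (assumed)
weights = rng.standard_normal((8, 2, 4, 4)) * 0.1  # hypothetical 4x4 filters, 8 output channels
features = conv_block(spectrogram, weights)
print(features.shape)
```

Stacking ten such blocks, each with its own trained weights, gives the overall shape of the network described above. In practice this would be written with a deep learning framework rather than explicit loops.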

The convolutional layers analyze frequency changes in the spectrogram and build a matrix that describes which part of the spectrogram should go to the left audio channel and which to the right. A new audio recording is then generated using the inverse short-time Fourier transform.
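The last step, splitting a spectrogram between two channels and converting each part back to audio, can be illustrated as follows. This is only a sketch: the mask here is hand-made and hypothetical (it sends low frequencies mostly to the left and high frequencies mostly to the right), whereas in the real system it would be predicted by the network. The window size, hop and test tones are assumptions as well.

```python
import numpy as np

FRAME, HOP = 512, 128            # illustrative window and hop sizes
WINDOW = np.hanning(FRAME)

def stft(x):
    """Forward short-time Fourier transform, shape (n_frames, FRAME // 2 + 1)."""
    n_frames = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP : i * HOP + FRAME] * WINDOW for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, length):
    """Inverse transform by windowed overlap-add, normalized by the summed squared window."""
    frames = np.fft.irfft(spec, n=FRAME, axis=1) * WINDOW
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP : i * HOP + FRAME] += frame
        norm[i * HOP : i * HOP + FRAME] += WINDOW ** 2
    return out / np.maximum(norm, 1e-12)

# Assumed test input: a low tone and a high tone mixed into one mono signal
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
low = np.sin(2 * np.pi * 440 * t)
high = np.sin(2 * np.pi * 2000 * t)
mono = low + high

spec = stft(mono)

# Hypothetical per-frequency mask: share of each bin that goes to the left channel
cutoff_bin = int(1000 * FRAME / sample_rate)
mask = np.full(spec.shape[1], 0.1)
mask[:cutoff_bin] = 0.9

left = istft(spec * mask, len(mono))
right = istft(spec * (1.0 - mask), len(mono))
stereo = np.stack([left, right])   # shape (2, n_samples)
print(stereo.shape)
```

Because the two masks sum to one, the left and right channels add back up to the original mono signal (apart from the first and last few samples, where the window is close to zero), while each tone is louder in one ear than in the other.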

Mono2Binaural can also reproduce spatial sound for each object in the video separately. For example, the neural network can recognize two instruments in a video clip, a drum and a pipe, and create a separate audio track for each of them.

Opinions on "2.5D Visual Sound"

According to the developers, they managed to create a technology that recreates a "realistic spatial sensation." Mono2Binaural performed well in testing, so the authors are confident that their project has great potential.

To demonstrate the effectiveness of their technology, the researchers conducted a series of experiments. They invited a group of people to compare the sound of two tracks: one created using Mono2Binaural and the other using the Ambisonics method.

The latter was developed at the University of California, San Diego. This method also creates “surround” audio from mono sound but, unlike the new technology, works only with 360-degree video.

Most listeners chose the Mono2Binaural audio as the closest to the actual sound. Testing also showed that in 60% of cases, users correctly located the source of the sound by ear.

The algorithm still has some drawbacks. For example, the neural network has trouble distinguishing the sounds of a large number of objects. It also, obviously, cannot determine the position of a sound source that does not appear in the video. However, the developers plan to solve these problems.

Similar technologies

There are several similar projects in the field of recognizing sound from video. We wrote about one of them earlier: the “visual microphone” from MIT specialists. Their algorithm detects, in silent video, the microscopic vibrations of objects caused by acoustic waves and uses this data to reconstruct the sound that was heard in the room. The scientists managed to “read” the melody of the song Mary Had a Little Lamb from a bag of chips, a houseplant and even a brick.

Photo by Quinn Dombrowski / CC BY-SA

Other projects are developing technologies for recording sound in 360-degree videos. One of them is Ambisonics, which we mentioned earlier. The algorithm works on a principle similar to Mono2Binaural: it analyzes moving objects in the frame and correlates them with changes in the sound. However, the Ambisonics technology has several limitations: the neural network works only with 360-degree video and does not produce sound if the recording contains an echo.

Another project in this area is Sol VR360 from G-Audio. Unlike the other developments, this technology has already been implemented in Sol, a consumer sound processing service. It creates spatial audio for 360-degree videos of concerts or sporting events. The drawback of the service is that the generated videos can be played only in Sol applications.

Conclusions

The developers of systems for creating spatial sound see the main application of the technology in VR and AR applications, where it can immerse a person as fully as possible in the atmosphere of a game or film. If they manage to overcome a number of difficulties they face, the technology could also be used to help the visually impaired. With such systems, they will be able to understand in more detail what is happening in the frame of a video.

