Architecture and algorithms for indexing VKontakte audio recordings



    We’ll tell you how the search for similar tracks works across all VKontakte audio recordings.

    Why is all this necessary?


    We really do have a lot of music. "A lot" means more than 400 million tracks, which take up about 4 PB. If you downloaded all the music on VKontakte onto 64 GB iPhones and stacked them on top of each other, you would get a tower taller than the Eiffel Tower. Every day another 25 iPhones have to be added to that pile: 150 thousand new audio recordings totaling 1.5 TB.

    Of course, not all of these files are unique. Each audio recording has artist and title information (and optionally lyrics and genre), which the user fills in when uploading a song to the site. There is no pre-moderation. As a result, we get the same songs under different names, remixes, concert and studio recordings of the same compositions, and, of course, completely misnamed tracks.

    If we learn how to accurately find identical (or very similar) audio recordings, we can put it to good use, for example:

    • avoid showing the same track under different names in search results;
    • offer to listen to a favorite song in higher quality;
    • add covers and lyrics to all variations of a song;
    • improve the recommendation mechanism;
    • improve the handling of complaints from content owners.

    Perhaps the first thing that comes to mind is ID3 tags. Every mp3 file carries a set of metadata, and that information could be trusted over what the user entered in the site's interface when uploading a track. This is the simplest solution, and not a very good one: tags can be edited manually, and they do not necessarily correspond to the content.

    So all the information associated with the file is human-supplied and may be unreliable. That means we need to work with the file itself.

    We set ourselves the task of identifying tracks that are identical or sound very similar, analyzing only the contents of the file.

    Hasn't someone already done this?


    Searching for similar audio is a fairly well-studied problem. The by-now-classic solution is used by everyone from Shazam to biologists studying wolf howls. It is based on acoustic fingerprints.

    An acoustic fingerprint is a representation of an audio signal as a set of values describing its physical properties.

    Simply put, a fingerprint contains some information about the sound, and this information is compact: its size is much smaller than that of the original file. Songs that sound alike will have matching fingerprints, and conversely, fingerprints of tracks that sound different will not match.

    We started by trying one of the ready-made C++ solutions for generating acoustic fingerprints. We hooked our search up to it, tested it on real files, and realized that for a significant part of the sample the results were poor. The same track is successfully "masked" by an equalizer, added background noise or jingles, or splicing with a fragment of another track.

    Here's how it looks (compared to the original track):



    Live performance



    Echo



    Remix

    In all these cases a listener will easily recognize that it is the same composition. We have many files with such distortions, and it is important to get good results on them. It became clear that we needed our own implementation of the fingerprint generator.

    Fingerprint Generation


    So, we have an audio recording in the form of an mp3 file. How do we turn it into a compact fingerprint?

    We need to start by decoding the audio signal packed into this file. An MP3 is a chain of frames (blocks) containing encoded audio data; decoding gives us PCM (pulse code modulation), i.e. uncompressed digital audio.

    To get PCM from MP3, we used the C library libmad and our own Go wrapper for it. Later we switched to using ffmpeg directly.
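    As a rough illustration, here is a minimal Go sketch of the ffmpeg route: shell out to the binary and read raw PCM from its stdout. The sample rate and mono downmix below are illustrative choices, not the parameters used in production.

        package fingerprint

        import (
            "bytes"
            "encoding/binary"
            "os/exec"
        )

        // decodeToPCM decodes an mp3 file into mono 16-bit PCM samples by calling
        // the ffmpeg binary (assumed to be on PATH).
        func decodeToPCM(path string) ([]int16, error) {
            cmd := exec.Command("ffmpeg",
                "-i", path,     // input mp3
                "-f", "s16le",  // raw signed 16-bit little-endian PCM
                "-ac", "1",     // downmix to mono
                "-ar", "11025", // resample (illustrative rate)
                "pipe:1",       // write the result to stdout
            )
            var out bytes.Buffer
            cmd.Stdout = &out
            if err := cmd.Run(); err != nil {
                return nil, err
            }
            raw := out.Bytes()
            samples := make([]int16, len(raw)/2)
            for i := range samples {
                samples[i] = int16(binary.LittleEndian.Uint16(raw[2*i:]))
            }
            return samples, nil
        }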

    One way or another, as a result we have an audio signal in the form of an array of values describing amplitude as a function of time. It can be pictured as a graph like this:

    Audio signal

    This is the sound our ears hear. A person perceives it as a whole, but in fact a sound wave is a combination of many elementary waves of different frequencies, something like a chord consisting of several notes.

    We want to know which frequencies are present in our signal, and especially which of them are most "characteristic" of it. We use the canonical method for obtaining such data: the fast Fourier transform (FFT).

    A detailed description of the mathematical apparatus is beyond the scope of this article. You can learn more about applying the Fourier transform to digital signal processing, for example, in this publication.

    Our implementation uses the go-dsp (Digital Signal Processing) package: github.com/mjibson/go-dsp/fft for the FFT itself and github.com/mjibson/go-dsp/window for the Hann window function.
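    A minimal sketch of how these packages can be wired together, assuming a fixed frame size with 50% overlap (both parameters are purely illustrative):

        package fingerprint

        import (
            "github.com/mjibson/go-dsp/fft"
            "github.com/mjibson/go-dsp/window"
        )

        const (
            frameSize = 4096 // samples per FFT frame (illustrative)
            step      = 2048 // hop between frames, i.e. 50% overlap (illustrative)
        )

        // spectra applies a Hann window to each frame of the signal and returns
        // the FFT of every frame: one slice of complex values per moment in time.
        func spectra(samples []float64) [][]complex128 {
            var frames [][]complex128
            for start := 0; start+frameSize <= len(samples); start += step {
                frame := make([]float64, frameSize)
                copy(frame, samples[start:start+frameSize])

                // The Hann window smooths the frame edges and reduces spectral leakage.
                window.Apply(frame, window.Hann)

                frames = append(frames, fft.FFTReal(frame))
            }
            return frames
        }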

    At the output we get a set of complex numbers which, when plotted on a plane, form what is called a spectrogram.

    A spectrogram is a visual representation of all three acoustic dimensions: the time, frequency, and amplitude of a signal. It shows the amplitude value for a specific frequency at a particular point in time.

    For example:

    Spectrogram of the reference track.

    Time is plotted along the X axis, frequency along the Y axis, and the amplitude value is indicated by the pixel intensity. The illustration shows the spectrogram of a "reference" signal with uniformly increasing frequency. For an ordinary song, the spectrogram looks like this:



    Spectrogram of an ordinary track

    This is a rather detailed "portrait" of the audio track, from which the original can be restored (with some approximation). In terms of resources, storing such a "portrait" is completely impractical: in our case it would take 10 PB, two and a half times more than the audio recordings themselves.

    We select key points (peaks) in the spectrogram based on spectral intensity, keeping only the values most characteristic of the track. As a result, the amount of data is reduced by about 200 times.


    Key values on the spectrogram

    It remains to collect this data in a convenient form. Each peak is uniquely determined by two numbers: its frequency and time values. Adding all the peaks of a track into one array, we get the desired acoustic fingerprint.
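    To illustrate the idea (not the exact production criteria), here is a sketch in which a "peak" is simply one of the strongest bins of its time frame; the per-frame count and the threshold are assumptions.

        package fingerprint

        import "math/cmplx"

        // peak is one fingerprint element: a characteristic point of the spectrogram.
        type peak struct {
            frame int // time index
            bin   int // frequency index
        }

        // makeFingerprint keeps, for every time frame, its perFrame strongest
        // frequency bins above threshold and collects them into one array:
        // the acoustic fingerprint.
        func makeFingerprint(spec [][]complex128, perFrame int, threshold float64) []peak {
            var fp []peak
            for t, frame := range spec {
                mags := make([]float64, len(frame)/2) // only the first half is informative for a real signal
                for i := range mags {
                    mags[i] = cmplx.Abs(frame[i])
                }
                // naive selection of the perFrame loudest bins, for brevity
                for n := 0; n < perFrame; n++ {
                    best, bestVal := -1, threshold
                    for f, m := range mags {
                        if m > bestVal {
                            best, bestVal = f, m
                        }
                    }
                    if best < 0 {
                        break
                    }
                    fp = append(fp, peak{frame: t, bin: best})
                    mags[best] = 0 // exclude the chosen bin from the next pass
                }
            }
            return fp
        }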

    Fingerprint Comparison


    Suppose we have done all of the above for two tracks and now have their fingerprints. Back to the original task: compare these tracks using their fingerprints and find out whether they are similar (identical) or not.

    Each fingerprint is an array of values. Let's compare them element by element, shifting the tracks along the time axis relative to each other (the shift is needed, for example, to account for silence at the beginning or end of a track). At some offsets there will be more matches between the fingerprints, at others fewer. It looks something like this:



    Tracks with a common fragment and different tracks

    Looks plausible. For tracks with a common fragment, that fragment was found and shows up as a spike in the number of matches at a specific time offset. The result of the comparison is a "similarity coefficient", which depends on the number of matches with the offset taken into account.
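    A sketch of that comparison, reusing the peak type from the fingerprint sketch above: every pair of peaks that shares a frequency bin votes for its time offset, and the offset with the most votes gives the match count.

        // matchScore counts, for every relative time shift, how many peaks of
        // fingerprint a line up with peaks of fingerprint b, and returns the
        // best shift together with its number of matches.
        func matchScore(a, b []peak) (bestOffset, bestMatches int) {
            // index fingerprint b by frequency bin for quick lookup
            byBin := make(map[int][]int) // bin -> frames where it peaks in b
            for _, p := range b {
                byBin[p.bin] = append(byBin[p.bin], p.frame)
            }
            // every pair of peaks with the same bin votes for its time offset
            votes := make(map[int]int)
            for _, p := range a {
                for _, f := range byBin[p.bin] {
                    votes[f-p.frame]++
                }
            }
            for off, n := range votes {
                if n > bestMatches {
                    bestOffset, bestMatches = off, n
                }
            }
            return bestOffset, bestMatches
        }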

    Our Go library for generating and comparing fingerprints is available on GitHub. You can look at the graphs and results for your own examples.

    Now we need to integrate all this into our infrastructure and see what happens.

    Architecture



    Fingerprint generation and indexing / search engines in the VK architecture.

    The fingerprint generation engine runs on every audio upload server (there are about 1000 of them now). It receives an mp3 file as input, processes it (decoding, FFT, peak extraction) and produces an acoustic fingerprint of the audio.

    The load is parallelized at the file level: each track is processed in a separate goroutine. For an average audio recording of 5-7 minutes, processing takes 2-4 seconds. Processing time grows linearly with audio duration.
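    A generic sketch of this kind of file-level parallelism (the worker count and the generate callback are illustrative):

        package fingerprint

        import "sync"

        // processAll fans file paths out to a fixed pool of goroutines, each of
        // which runs the full pipeline (decode, FFT, peak extraction) for its track.
        func processAll(paths []string, workers int, generate func(path string)) {
            jobs := make(chan string)
            var wg sync.WaitGroup
            for i := 0; i < workers; i++ {
                wg.Add(1)
                go func() {
                    defer wg.Done()
                    for p := range jobs {
                        generate(p)
                    }
                }()
            }
            for _, p := range paths {
                jobs <- p
            }
            close(jobs)
            wg.Wait()
        }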

    The acoustic fingerprints of all tracks, even with some loss of accuracy, take up about 20 TB. All this data needs to be stored somewhere with fast access so that we can search in it. This problem is solved by a separate indexing and search engine.

    It stores fingerprint data in the form of inverted indices:


    Inverted index

    To gain speed and save memory, we take advantage of the structure of the fingerprint itself. A fingerprint is an array, and we can work with its individual elements (hashes), which, as you remember, correspond to the peaks of the spectrum.

    Instead of storing the mapping "track" → "fingerprint", we break each fingerprint into hashes and store the mapping "hash" → "list of tracks whose fingerprints contain it". The index is much sparser, and 20 TB of fingerprints in index form take up about 100 GB.
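    A toy version of such an index, reusing the peak type from above. The hash here is simply the peak's frequency bin, which is a deliberate simplification (the real hashes are more discriminative), but the indexing idea is the same.

        // invertedIndex maps a hash to the list of tracks whose fingerprints contain it.
        type invertedIndex struct {
            postings map[uint32][]uint32 // hash -> track IDs
        }

        // peakHash is a deliberately simplified hash: just the frequency bin.
        func peakHash(p peak) uint32 { return uint32(p.bin) }

        func (idx *invertedIndex) add(trackID uint32, fp []peak) {
            for _, p := range fp {
                h := peakHash(p)
                idx.postings[h] = append(idx.postings[h], trackID)
            }
        }

        // candidates returns the tracks that share at least minHits hashes with the query.
        func (idx *invertedIndex) candidates(fp []peak, minHits int) []uint32 {
            hits := make(map[uint32]int)
            for _, p := range fp {
                for _, id := range idx.postings[peakHash(p)] {
                    hits[id]++
                }
            }
            var out []uint32
            for id, n := range hits {
                if n >= minHits {
                    out = append(out, id)
                }
            }
            return out
        }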

    How does it work in practice? The search engine receives a request with an audio recording for which similar tracks need to be found. The fingerprint of this audio is fetched from storage. In the index, the rows containing the hashes of this fingerprint are selected. From those rows we pick the tracks that occur most frequently and fetch their fingerprints from storage. These fingerprints are compared with the fingerprint of the source file. The result is the most similar tracks, along with the matching fragments and a conditional "similarity coefficient" for those fragments.
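    Putting the previous sketches together, the query flow looks roughly like this; fetchFingerprint stands in for the fingerprint storage and is hypothetical.

        // searchSimilar returns candidate tracks ranked by full fingerprint comparison.
        // fetchFingerprint is a hypothetical accessor for the fingerprint storage.
        func searchSimilar(idx *invertedIndex, query []peak, minHits int,
            fetchFingerprint func(id uint32) []peak) map[uint32]int {
            results := make(map[uint32]int) // track ID -> match count at the best offset
            for _, id := range idx.candidates(query, minHits) {
                _, matches := matchScore(query, fetchFingerprint(id))
                results[id] = matches
            }
            return results
        }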

    The indexing and search engine runs on 32 machines and is written in pure Go. Goroutines and pools of internal workers are used to the fullest here; work with the network and with the global index is parallelized.

    So, all the logic is ready. Let's generate fingerprints for all the music, index them and start working with them. By the way, how long will that take?

    We started indexing, waited a couple of days and estimated the timeline. As it turned out, the result would be ready in about a year. That is unacceptable; something had to change.

    Using sync.Pool wherever possible shaved off 2 months. New estimate: 10 months. Better, but still too long.
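    The sync.Pool change is about reusing allocations between requests instead of allocating anew every time. A generic illustration of the idea (the buffer type and size are arbitrary, not taken from the actual code):

        package index

        import "sync"

        // bufPool reuses scratch slices between requests, which takes pressure
        // off the garbage collector.
        var bufPool = sync.Pool{
            New: func() interface{} { return make([]uint32, 0, 4096) },
        }

        func handleRequest() {
            buf := bufPool.Get().([]uint32)[:0]
            // ... fill and use buf while handling the request ...
            bufPool.Put(buf) // hand the slice back for the next request
        }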

    We optimize the data structures: track selection from the index was implemented by merging arrays. Using container/heap instead of an array promises to cut the time in half. We get an estimated 6 months to completion. Can it be even better?

    We specialize container/heap for our own data type instead of the standard interfaces and win another month. But that is still not enough for us (that is, it is still too long).

    We go beyond the stdlib and write our own container/heap implementation: minus another 2 months, 3 in total. Four times less than the first estimate!

    And finally, updating Go from 1.5 to 1.6.2 brought us to the final result: building the index took 2.5 months.
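    For reference, here is a sketch of the kind of heap-based merge of sorted posting lists that the container/heap steps above were concerned with, shown in its plain container/heap form with the standard interfaces; the cursor type and the counting logic are illustrative, not taken from the actual code.

        package index

        import "container/heap"

        // cursor points into one sorted posting list.
        type cursor struct {
            head uint32   // current track ID
            rest []uint32 // remaining IDs in this posting list
        }

        type cursorHeap []cursor

        func (h cursorHeap) Len() int            { return len(h) }
        func (h cursorHeap) Less(i, j int) bool  { return h[i].head < h[j].head }
        func (h cursorHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
        func (h *cursorHeap) Push(x interface{}) { *h = append(*h, x.(cursor)) }
        func (h *cursorHeap) Pop() interface{} {
            old := *h
            n := len(old)
            c := old[n-1]
            *h = old[:n-1]
            return c
        }

        // mergeCounts walks several sorted posting lists in ascending order of
        // track ID and counts how many times each ID occurs across the lists.
        func mergeCounts(lists [][]uint32) map[uint32]int {
            h := &cursorHeap{}
            for _, l := range lists {
                if len(l) > 0 {
                    *h = append(*h, cursor{head: l[0], rest: l[1:]})
                }
            }
            heap.Init(h)
            counts := make(map[uint32]int)
            for h.Len() > 0 {
                c := heap.Pop(h).(cursor)
                counts[c.head]++
                if len(c.rest) > 0 {
                    heap.Push(h, cursor{head: c.rest[0], rest: c.rest[1:]})
                }
            }
            return counts
        }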

    What happened?


    Production testing revealed several cases that we had not initially taken into account. For example, a copy of a track with slightly altered playback speed:



    Accelerated track

    To the listener it is almost the same thing: a small speedup is not perceived as a significant difference. Unfortunately, our fingerprint comparison algorithm considered such tracks completely different.

    To fix this, we added a bit of preprocessing: a search for the longest common subsequence (LCS) of the two fingerprints. Indeed, the amplitude and frequency do not change; only the corresponding time values change, and the overall order in which the points follow each other is preserved.


    LCS

    Finding the LCS lets us determine the "compression" or "stretching" coefficient of the signal along the time axis. After that it's simple: we compare the fingerprints as usual, applying the found coefficient to one of them.

    Using the LCS search algorithm significantly improved the results: many tracks that previously could not be matched by fingerprint are now processed successfully.
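    The core of the LCS step can be illustrated with the classic dynamic-programming algorithm over, say, the sequences of peak frequency bins of the two fingerprints; the production variant also recovers the time-stretch coefficient from the matched pairs, which is omitted here.

        // lcsLength returns the length of the longest common subsequence of a and b
        // using two rolling rows of the DP table.
        func lcsLength(a, b []int) int {
            prev := make([]int, len(b)+1) // LCS lengths for the previous row
            curr := make([]int, len(b)+1) // LCS lengths for the current row
            for i := 1; i <= len(a); i++ {
                for j := 1; j <= len(b); j++ {
                    switch {
                    case a[i-1] == b[j-1]:
                        curr[j] = prev[j-1] + 1
                    case prev[j] >= curr[j-1]:
                        curr[j] = prev[j]
                    default:
                        curr[j] = curr[j-1]
                    }
                }
                prev, curr = curr, prev
            }
            return prev[len(b)]
        }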

    Another interesting case is matching by fragments. For example, the backing track of a popular song with amateur vocals recorded on top of it.



    Matching track fragments

    We lay out the comparison result over time and look at the number of matches for each second of the track. The picture above is exactly that: an amateur recording over a backing track compared with the original backing track. Intervals with no matches are vocals; peaks of matches are the pauses in the vocals (i.e. the pure backing track, which matches the original). In such a situation we count the number of fragments with matches and compute the conditional "similarity coefficient" from the matches themselves.
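    A sketch of that per-second analysis: bucket the matched peaks by the second of the track they fall into and count contiguous runs of "good" seconds as fragments. The frame rate and threshold are illustrative parameters.

        // matchedFragments takes the time-frame indices of matched peaks and counts
        // contiguous runs of seconds that contain at least minPerSecond matches.
        func matchedFragments(matchFrames []int, framesPerSecond, minPerSecond int) int {
            perSecond := make(map[int]int)
            maxSec := 0
            for _, f := range matchFrames {
                s := f / framesPerSecond
                perSecond[s]++
                if s > maxSec {
                    maxSec = s
                }
            }
            fragments, inFragment := 0, false
            for s := 0; s <= maxSec; s++ {
                if perSecond[s] >= minPerSecond {
                    if !inFragment {
                        fragments++
                        inFragment = true
                    }
                } else {
                    inFragment = false
                }
            }
            return fragments
        }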

    After clustering similar tracks, some clusters turned out to be much larger than the rest. What is in them? There are interesting situations where it is not obvious how to handle them correctly. For example, the well-known Happy Birthday to You. We have several dozen variations of this song that differ only in the name of the person being congratulated. Should they be considered different or not? The same applies to versions of a track in different languages. These distortions and their combinations became a serious problem at the launch stage.

    We had to substantially rework the mechanism to handle these and other not-quite-ordinary situations. The system now works properly, including on such complex clusters, and we keep improving it as we encounter new unusual transformations. We can find songs under made-up names, sped up, with jingles and without vocals. The task was accomplished and will undoubtedly come in handy more than once in further work on the recommendation service, music search and the audio section as a whole.

    PS This is the first article in a series of long-promised stories about the technical side of VKontakte.

    In addition to Habr, we will publish them on our blog in Russian and English. You can also ask the authors questions not only here, but also in a separate VK community.
