How does music search work? A lecture at Yandex

    Music search is usually understood as the ability to respond to text queries about music. The search must understand, for example, that "Friday" is not always a day of the week, or find a song by the words "want sweet oranges". But the tasks of music search are not limited to this. Sometimes you need to recognize a song the user sang, or the one playing in a cafe. You can also find what compositions have in common in order to recommend music to the user's taste. Elena Kornilina and Yevgeny Crofto told students of the Small ShAD how this is done and what difficulties it involves.






    How does text-based music search work?


    According to some estimates, about thirty million different music tracks have accumulated in the world today. This figure comes from the various databases in which tracks are listed and cataloged. When working on music search, you deal with familiar objects: artists, albums and tracks. To begin with, let's figure out how to tell musical queries from non-musical ones. First of all, of course, by the keywords in the query. Markers of musicality can be words that clearly indicate that the user is looking for music, such as "listen" or "who sings the song", or blurrier ones: "download", "online". Even in a seemingly very simple case, when the query contains part of the name of a track, artist or album, problems related to ambiguity may arise.
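    As a rough illustration, a marker-based check might look like the sketch below (the word lists, weights and the musicality_score function are illustrative assumptions, not Yandex's actual classifier):

    # A minimal sketch: flag a query as likely musical if it contains
    # strong or weak "musicality" markers. Word lists and weights are
    # made-up example values.
    STRONG_MARKERS = {"listen", "song", "lyrics", "who sings"}
    WEAK_MARKERS = {"download", "online", "mp3"}

    def musicality_score(query: str) -> float:
        """Return a rough score in [0, 1] for how 'musical' a text query looks."""
        q = query.lower()
        score = 0.0
        if any(marker in q for marker in STRONG_MARKERS):
            score += 0.8
        if any(marker in q for marker in WEAK_MARKERS):
            score += 0.3
        return min(score, 1.0)

    print(musicality_score("who sings friday"))  # 0.8 -> likely a music query
    print(musicality_score("friday weather"))    # 0.0 -> probably not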

    For example, let's take the names of five Russian artists:
    • Friday
    • Pizza
    • Kino
    • 02/30
    • Aquarium

    If a user submits such a query, it is not so easy to understand right away what exactly they want to find. Therefore, the probabilities of each possible user intent are estimated, and the results of several vertical searches are shown on the results page.

    On the other hand, performers are very fond of distorting words in their own names and in track and album titles. Because of this, many of the linguistic extensions used in search simply stop working. For instance:
    • TI NAKAYA ODNA00 (Dorn)
    • Sk8 (Nerves)
    • Crawled
    • N1nt3nd0
    • Oxxxymiron
    • dom!No
    • P!nk
    • Sk8er Boi (Avril Lavigne)

    In addition, there are many artists with similar or even identical names. Suppose the query [aguilera] comes in. At first glance everything is clear: the user is looking for Christina Aguilera. But there is still some chance that the user needs a completely different artist, Paco Aguilera.



    Joint performances by two or more artists are also very common. For example, the song Can't Remember to Forget You is attributed to two performers at once: Shakira and Rihanna. Accordingly, the database should support adding joint performers to a track.
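    A toy data model for this might look like the sketch below (the Artist and Track classes are assumptions for illustration, not Yandex's actual schema): a track simply references a list of performers instead of a single one.

    # Toy data model: a track holds a list of artists, so joint
    # performances can be stored directly.
    from dataclasses import dataclass, field

    @dataclass
    class Artist:
        id: int
        name: str

    @dataclass
    class Track:
        id: int
        title: str
        artists: list[Artist] = field(default_factory=list)  # one or many performers

    track = Track(10, "Can't Remember to Forget You",
                  [Artist(1, "Shakira"), Artist(2, "Rihanna")])
    print([a.name for a in track.artists])  # ['Shakira', 'Rihanna']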

    Another feature is regionality. In different regions there may be artists with the same name. Therefore, it is necessary to take into account which region the query came from and to show the artist most popular in that region in the search results. For example, there are performers named Zara both in Russia and in Turkey, and their audiences practically do not overlap.

    With track names, too, not everything is so simple. Cover versions, where one artist records their own interpretation of another artist's track, are very common. A cover version may even be more popular than the original. For example, in most cases the version by the Zdob si Zdub group will be more relevant for the query [seen night] than the original track by the Kino group. The situation with remixes is similar.

    It is very important to be able to find tracks by a quote from the lyrics. Often the user remembers neither the name of the track nor the name of the artist, only some line from the song. However, it should be borne in mind that some common phrases occur in many tracks at once.
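    One simple way to search by a quote, sketched here with a made-up two-track corpus and a hypothetical find_by_quote helper, is an inverted index from words to track ids whose posting lists are intersected for the query words:

    # Build an inverted index over lyrics and intersect posting lists.
    from collections import defaultdict

    lyrics = {
        1: "I want sweet oranges and a quiet evening",
        2: "sweet dreams are made of this",
    }

    index = defaultdict(set)
    for track_id, text in lyrics.items():
        for word in text.lower().split():
            index[word].add(track_id)

    def find_by_quote(quote: str) -> set:
        """Return ids of tracks whose lyrics contain every word of the quote."""
        words = quote.lower().split()
        if not words:
            return set()
        result = index[words[0]].copy()
        for word in words[1:]:
            result &= index[word]
        return result

    print(find_by_quote("want sweet oranges"))  # {1}
    print(find_by_quote("sweet"))               # {1, 2}: common words match many tracks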

    To solve this problem, several methods are used at once. Firstly, the user's query history is taken into account: which genres and performers they prefer. Secondly, the popularity of the track and the artist plays an important role: how often a particular result is clicked in the search and for how long the track is listened to.
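    A toy ranking function over such candidates could combine these signals like this (the weights and the particular signals - click-through rate, listened share, the user's affinity to the artist - are illustrative assumptions, not the production formula):

    # Combine popularity and personalization signals into one score.
    def rank_score(ctr: float, listened_share: float, user_affinity: float) -> float:
        return 0.5 * ctr + 0.3 * listened_share + 0.2 * user_affinity

    candidates = {
        1: dict(ctr=0.12, listened_share=0.80, user_affinity=0.9),
        2: dict(ctr=0.30, listened_share=0.40, user_affinity=0.1),
    }
    ranked = sorted(candidates, key=lambda tid: rank_score(**candidates[tid]), reverse=True)
    print(ranked)  # [1, 2]: personalization outweighs raw popularity in this toy example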

    Translation difficulties


    Very often, different spellings of the names of performers and composers become established in different countries. Therefore, it should be borne in mind that the user may search for any of these variants, and different spellings may be found in the database. The name of Pyotr Ilyich Tchaikovsky has about 140 spellings. Here are just a few of them:
    • Pyotr Ilyich Chaikovsky
    • Peter Ilych Tchaikovsky
    • Pyotr Ilyich Tchaikovsky
    • Pyotr Il'ic Ciaikovsky
    • PI Tchaikovski
    • Pyotr Il'yich Tchaïkovsky
    • Piotr I. Tchaikovsky
    • Pyotr İlyiç Çaykovski
    • Peter Iljitsch Tschaikowski
    • Pjotr Iljitsch Tschaikowski
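    One simple way to match such spelling variants, sketched here without the curated alias tables and transliteration rules that real systems rely on, is to normalize the names (lowercase, strip diacritics and punctuation) and compare them with an edit-distance ratio; the 0.75 threshold below is an arbitrary example value.

    # Normalize names and compare them with a similarity ratio.
    import unicodedata
    from difflib import SequenceMatcher

    def normalize(name: str) -> str:
        decomposed = unicodedata.normalize("NFKD", name)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        return "".join(c for c in stripped.lower() if c.isalnum() or c.isspace())

    def same_artist(a: str, b: str, threshold: float = 0.75) -> bool:
        return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

    print(same_artist("Pyotr Ilyich Chaikovsky", "Peter Ilych Tchaikovsky"))  # True
    print(same_artist("Tchaikovsky", "Shostakovich"))                         # False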

    By the way, when we talked about the main objects that music search works with, we did not mention composers. In most cases this object does not play an important role, but classical music is the exception. There the situation is almost the opposite: you have to rely on the author of the work, while the performer is a secondary object, although it cannot be ignored either. Album and track names in classical music also have their own peculiarities.

    Audio analysis


    In addition to track metadata and lyrics, music search can use data obtained by analyzing the audio signal directly. This allows several problems to be solved at once:
    • Recognition of music by a fragment recorded on a microphone;
    • Recognition by humming;
    • Search for fuzzy duplicates;
    • Search for cover versions and remixes;
    • Highlighting a melody from a polyphonic signal;
    • Music classification;
    • Auto tagging;
    • Search for similar tracks / recommendations.

    A digital audio signal can be represented as an image of a sound wave:



    If we increase the detail of this image, i.e. stretch it along the time axis, sooner or later we will see points spaced at equal intervals. These points are the moments at which the amplitude of the sound wave is measured:



    The narrower the intervals between the points, the higher the sampling frequency of the signal, and the wider the frequency range that can be encoded in this way.
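    Here is a small sketch of what sampling means in code (44.1 kHz and a 440 Hz tone are just common example values): the amplitude is measured at equal time steps, and a sampling rate of sr hertz can represent frequencies only up to sr / 2.

    import numpy as np

    sr = 44100                                     # sampling rate, samples per second
    duration = 0.01                                # seconds
    t = np.arange(int(sr * duration)) / sr         # the moments at which amplitude is measured
    signal = 0.5 * np.sin(2 * np.pi * 440.0 * t)   # a 440 Hz tone (note A4)

    print(len(signal))   # 441 samples in 10 ms
    print(sr / 2)        # 22050.0 - the highest frequency representable at this rate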

    We can see that the amplitude of the oscillations depends on time and correlates with loudness, while the oscillation frequency is directly related to pitch. How do we get information about the oscillation frequency, i.e. convert the signal from the time domain to the frequency domain? Here the Fourier transform comes to our aid. It allows a periodic function to be decomposed into a sum of harmonics with different frequencies; the coefficients of these terms give us the frequency content we were after. However, we want to obtain the spectrum of the signal without losing its time component. For this, the windowed Fourier transform is used. In essence, we divide the audio signal into short segments, called frames, and instead of a single spectrum we get a set of spectra, one per frame. Putting them together, we get something like this:



    Time is plotted on the horizontal axis and frequency on the vertical axis. Color encodes the amplitude, i.e. how much signal power there is at a given moment in a given frequency band.
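    The windowed Fourier transform itself can be sketched in a few lines (the frame and hop sizes are typical example values, not the lecture's): split the signal into overlapping frames, apply a window, take the FFT of each frame and stack the magnitude spectra into a matrix.

    import numpy as np

    def spectrogram(signal: np.ndarray, frame_size: int = 1024, hop: int = 512) -> np.ndarray:
        window = np.hanning(frame_size)
        frames = []
        for start in range(0, len(signal) - frame_size + 1, hop):
            frame = signal[start:start + frame_size] * window
            frames.append(np.abs(np.fft.rfft(frame)))  # magnitude spectrum of one frame
        return np.array(frames).T                      # rows: frequency bins, columns: frames

    sr = 44100
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 440.0 * t)               # one second of a 440 Hz tone
    print(spectrogram(tone).shape)                     # (513, 85): frequency bins x frames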

    Feature classification


    Such spectrograms are used by most of the analysis methods applied in music search. Before moving on, we should understand which spectrogram features can be useful to us. There are two ways to classify features. Firstly, by time scale:
    • Frame-level - features relating to a single column of the matrix.
    • Segment-level - features combining several frames.
    • Global-level - features describing the entire track.

    Secondly, features can be classified by representation level, i.e. by how high-level the abstractions and concepts describing them are.
    • Low-level:
      • Zero Crossing Rate - allows you to distinguish between music and speech;
      • Short-time energy - reflects the change in energy over time;
      • Spectral Centroid - the center of mass of the spectrum;
      • Spectral Bandwidth - scatter relative to the center of mass;
      • Spectral Flatness Measure - characterizes the "smoothness" of the spectrum; it helps to distinguish noise-like signals from signals with a pronounced tonality (a sketch of these low-level features is given after this list).
    • Middle-level:
      • Beat tracker;
      • Pitch Histogram;
      • Rhythm Patterns.
    • High-level:
      • Music genres;
      • Mood: cheerful, sad, aggressive, calm;
      • Vocal / Instrumental;
      • Perceived speed of music (slow, fast, medium);
      • Gender of the vocalist.
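    Here is the promised sketch of several of the low-level features from the list above, computed for a single frame (the exact formulas used in production systems may differ; this is only an illustration):

    import numpy as np

    def frame_features(frame: np.ndarray, sr: int) -> dict:
        # Zero crossing rate: share of adjacent samples that change sign
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
        # Short-time energy of the frame
        energy = np.sum(frame ** 2)
        # Magnitude spectrum and the frequency of each bin
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        power = spectrum ** 2 + 1e-12               # avoid division by / log of zero
        # Spectral centroid: "center of mass" of the spectrum
        centroid = np.sum(freqs * power) / np.sum(power)
        # Spectral bandwidth: spread of the spectrum around the centroid
        bandwidth = np.sqrt(np.sum(((freqs - centroid) ** 2) * power) / np.sum(power))
        # Spectral flatness: geometric mean / arithmetic mean of the power spectrum
        flatness = np.exp(np.mean(np.log(power))) / np.mean(power)
        return dict(zcr=zcr, energy=energy, centroid=centroid,
                    bandwidth=bandwidth, flatness=flatness)

    sr = 44100
    t = np.arange(1024) / sr
    tone = np.sin(2 * np.pi * 440.0 * t)            # tonal frame -> very low flatness
    noise = np.random.randn(1024)                   # noise-like frame -> much higher flatness
    print(frame_features(tone, sr)["flatness"] < frame_features(noise, sr)["flatness"])  # True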

    Having watched the lecture to the end, you will find out exactly how these features help solve music search problems, and also what computer vision and machine learning have to do with it.
