Do androids dream of electropunk? How I taught a neural network to write music

    In the Artezio machine learning course, I got acquainted with a tutorial model capable of creating music. Music is an essential part of my life: I've played in bands for many years (punk rock, reggae, hip-hop, rock) and I'm a fanatical listener.

    Unfortunately, many of the bands I was a big fan of in my youth broke up for various reasons. Or they didn't break up, but what they're recording now... well, it would be better if they had.

    I wondered whether there is a ready-made model that could learn from the tracks of one of my favorite bands and create similar compositions. Since the musicians themselves aren't doing so well, maybe a neural network can handle it for them?


    While studying existing models, I quickly came across an article reviewing the six best-known options (all of them, of course, dealing with digital music formats). From the article it's clear that there are two main approaches to music generation: one based on the digitized audio stream (the sound we hear from the speakers: raw audio, wav files) and one based on MIDI (a notated representation of the music).

    I ruled out the raw audio options, and here's why.

    • The results are not impressive: applying such models to polyphonic music gives a very specific result. It's unusual, and you can create interesting soundscapes with it, but it didn't suit my purposes: it sounds strange, and I wanted to hear something similar to the original.


    A good example with piano music:

    And with orchestral music or rock it sounds much stranger:

    And here some folks tried their hand at black metal, and not only with raw audio.

    • The compositions of my favorite bands feature different instruments: vocals, drums, bass, guitars, synthesizers. Each instrument sounds together with the rest. I was looking for a model that would work the same way, that is, one that handles not just individual instruments but also takes their combined sound into account.

      When a musician needs to learn an instrument's part by ear, he tries to pick out the instrument he needs from the overall sound stream, then plays it back until he achieves a similar result. This is not the easiest task even for someone with a good ear: the music can be complex, and the instruments "blend together".


    I came across software that tries to solve a similar problem, and there are several projects that do it with machine learning. For example, while I was writing this text, Magenta released a new tool, Wave2Midi2Wave, which can "lift" piano notes from a recording and play them back realistically. There are other tools as well, although in general this problem remains unsolved.

    So, to learn a part from a piece, the easiest way is to take ready-made sheet music. It's logical to assume that a neural network will also find it easier to work with a notated representation of music, where each instrument is a separate track.

    • With raw audio the result is a mix of all the instruments: the parts can't be individually loaded into a sequencer (audio editor), corrected, re-voiced, and so on. I'd be quite satisfied if the neural network composed a hit but made a mistake in a couple of notes: with notation I can easily fix them, with raw audio that's almost impossible.

    Musical notation has its drawbacks too. It misses many nuances of performance. And with MIDI files it's not always known who made them or how close they are to the original. Maybe the transcriber simply made a mistake; after all, "lifting" a part by ear is not an easy task.

    When working with polyphonic scores, you have to ensure that the instruments are consonant at every moment in time, and also that the sequence of those moments forms logical music from a human point of view.

    It turned out there aren't that many solutions that can work with notation, and not just with a single instrument but with several sounding at once. At first I overlooked the Magenta project from Google TensorFlow because it was described as "non-polyphonic". The MusicVAE library had not yet been published at the time, so I settled on the BachBot project.



    It turned out that a solution to my problem already existed. Listen to the Happy Birthday melody, arranged by BachBot to sound like a Bach chorale.

    A chorale is a specific kind of music: it consists of four voices (soprano, alto, tenor, and bass), and each voice produces one note at a time. Here we'll have to dig a little deeper into the music. We'll be talking about music in 4/4 time.

    In musical notation, a note has two properties: pitch (do, re, mi...) and duration (whole, half, quarter, eighth, sixteenth, thirty-second). Accordingly, a whole note sounds for the entire bar, two half notes fill the entire bar, and sixteen sixteenth notes fill the entire bar.
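    The arithmetic of durations is easy to check; here is a quick illustration using Python's fractions module (my own example, just to make the subdivision concrete):

```python
from fractions import Fraction

# Durations as fractions of a 4/4 bar
whole = Fraction(1, 1)
half = Fraction(1, 2)
quarter = Fraction(1, 4)
sixteenth = Fraction(1, 16)

# Different subdivisions all add up to one full bar
assert 2 * half == whole
assert 4 * quarter == whole
assert 16 * sixteenth == whole

# A quarter note spans exactly four sixteenths, which is why
# sixteenths make a convenient discretization step
print(quarter / sixteenth)  # 4
```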

    In preparing the data to train the neural network, the creators of BachBot did the following:

    • so as not to confuse the model with chords from different keys, which would not sound sweet together, all the chorales were transposed to the same key;
    • a neural network must be fed discrete values, while music is a continuous process, which means discretization is needed. One instrument can hold a long whole note while another plays several sixteenths at the same time. To solve this, all notes were split into sixteenths. In other words, if the score contains a quarter note, it goes to the input four times as the same sixteenth: the first time with a flag saying the note was struck, and the next three times with a flag saying it continues.

    The data format looks like this: (pitch, new note | continuation of a previously sounding note)

    (56, True) # Soprano
    (52, False) # Alto
    (47, False) # Tenor
    (38, False) # Bass
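    As a sketch (my own illustration, not BachBot's actual code), splitting a note into such sixteenth-note frames might look like this:

```python
def encode_note(pitch, duration_in_sixteenths):
    """Split one note into sixteenth-note frames.

    Each frame is (pitch, is_new): is_new is True only on the frame
    where the note is struck, and False while it is held.
    """
    return [(pitch, i == 0) for i in range(duration_in_sixteenths)]

# A soprano quarter note (MIDI pitch 56) becomes four frames:
print(encode_note(56, 4))
# [(56, True), (56, False), (56, False), (56, False)]
```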

    Having run all the chorales from the popular music21 dataset through this procedure, the authors of BachBot found that, once the chorales are transposed to one key, only 108 distinct four-note combinations occur in them, although in principle there could be 128 × 128 × 128 × 128 (MIDI uses 128 pitch values). So the size of this conditional vocabulary is not that big. This is a curious observation, and we'll come back to it when we talk about MusicVAE. So, we have Bach chorales recorded as sequences of such quadruples.
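    Counting such a vocabulary takes only a few lines; a sketch with made-up frame data (not the real music21 pipeline):

```python
# Each time step is a quadruple of (pitch, is_new) pairs, one per voice:
# soprano, alto, tenor, bass. The number of distinct quadruples is the
# size of the "chord vocabulary" the model actually has to learn.
frames = [
    ((56, True), (52, False), (47, False), (38, False)),
    ((56, False), (52, False), (47, False), (38, False)),
    ((56, True), (52, False), (47, False), (38, False)),  # repeat of frame 1
]

vocabulary = set(frames)
print(len(vocabulary))  # 2 distinct quadruples out of 3 frames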

    It is often said that music is a language, so it's not surprising that the creators of BachBot applied a technology popular in NLP (Natural Language Processing): they trained an LSTM network on the dataset and got a model that can complete one or more parts, or even create chorales from scratch. That is, you supply the alto, tenor, and bass, BachBot writes a soprano melody for you, and together it all sounds like Bach.
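    The language-model framing can be illustrated even without an LSTM. Here is a toy sketch that treats each chord as a token and learns next-token statistics (a bigram model, far simpler than BachBot's LSTM, but the same idea of predicting what comes next):

```python
import random
from collections import defaultdict

# A toy "corpus": each token stands in for one chord quadruple.
corpus = ["C", "F", "G", "C", "F", "G", "C", "Am", "F", "G", "C"]

# Count which chords follow which (bigram statistics).
transitions = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev].append(nxt)

# Generate a continuation by repeatedly sampling a plausible next chord.
random.seed(0)
chord = "C"
sequence = [chord]
for _ in range(7):
    chord = random.choice(transitions[chord])
    sequence.append(chord)
print(sequence)
```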

    Here is another example:  

    Sounds great!

    You can see more detail in this video. It also includes some amusing analytics collected from a survey on the project's website.

    Users were invited to distinguish original Bach chorales from music created by the neural network. The results mention that when the neural network composes the bass line with all the other parts given, only half of the users can tell its chorales from the originals. Funnily enough, the music experts get confused most of all. With the other voices things go a little better. As a bass player I find this insulting: violinists, it seems, are still needed, but it's time for us bassists to brush up on our drywalling skills.


    While studying BachBot, I discovered that it had been absorbed into the Magenta project (Google TensorFlow). I decided to take a closer look at Magenta and found that several interesting models had been developed within it, one of which is devoted to working with polyphonic compositions. Magenta has built some wonderful tools and even released a plug-in for the Ableton audio editor, which is especially nice from a musician's practical standpoint.

    My favorites are Beat Blender (creates variations on a given drum part) and
    Latent Loops (creates transitions between melodies).

    The main idea of MusicVAE, the tool I decided to use, is to combine an LSTM-based model with a variational autoencoder (VAE).

    If you remember, when discussing BachBot we noticed that the chord vocabulary consists not of 128 × 128 × 128 × 128 elements but of only 108. The creators of MusicVAE noticed the same thing and decided to use a compressed latent space.

    By the way, and characteristically, training MusicVAE does not require transposing the sources into one key. Transposition is unnecessary, I suppose, because the source is transformed by the autoencoder anyway and the information about the key disappears.

    A VAE is designed so that the decoder can efficiently reconstruct data from the training dataset, while the latent space forms a smooth distribution of the input data's features.

    This is a very important point: it makes it possible to create similar objects and to interpolate between them meaningfully. In the original space we have 128 × 128 × 128 × 128 possible four-note combinations, but in fact not all of them are used (sound pleasant to the human ear). The variational autoencoder maps them into a much smaller set in a hidden space, and on that space one can devise mathematical operations that are meaningful in terms of the original space; for example, neighboring points will be similar musical fragments.
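    Interpolation in latent space can be sketched with plain NumPy (a toy illustration with made-up latent vectors, not MusicVAE's actual encoder):

```python
import numpy as np

# Pretend these are latent codes produced by an encoder for two bars of music.
z_a = np.array([0.0, 1.0, -0.5])
z_b = np.array([1.0, -1.0, 0.5])

# Linear interpolation between the codes: decoding each intermediate
# point should yield a fragment "in between" the two originals.
for t in np.linspace(0.0, 1.0, 5):
    z = (1 - t) * z_a + t * z_b
    print(np.round(z, 2))
```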

    A good example of using an autoencoder to draw glasses onto a photo can be found in this article. You can read more about how MusicVAE works on the official Magenta website in this article; there is also a link to the arXiv paper.

    So, the tool was chosen; it remained to apply it to my original goal: create new music based on already-recorded tracks and evaluate how much the result resembles the sound of the original band. Magenta doesn't run on my Windows laptop, and without a GPU a model takes quite a long time to train. After struggling with virtual machines, a Docker container, and so on, I decided to use the cloud.

    Google provides Colab notebooks in which you can quite happily tinker with the Magenta models. In my case, however, training the model proved impossible: the process kept failing because of various restrictions, such as the amount of available memory, shutdown on timeout, and the lack of a normal command line and root rights for installing the necessary libraries. Hypothetically there is even a way to use a GPU, but, again, I didn't manage to install the model and get it running.

    I was thinking about renting a server when, oh luck, I found that Google provides Google Cloud services with GPUs, and there is even a free trial period. True, it turned out that in Russia they are officially available only to legal entities, but I was let in under the free trial.

    So, I created a virtual machine in Google Cloud with one GPU, found several MIDI files by one of my favorite bands on the Internet, and uploaded them to a midi folder in the cloud.

    Install Magenta:

    pip install magenta-gpu

    Great that all of this installs with a single command, I thought, but... errors. Looks like I'll have to touch the command line after all, sorry.

    We look at the errors: the rtmidi library, without which Magenta doesn't work, is not installed on the cloud machine.

    And it, in turn, fails to install because the libasound2-dev package is missing, and I have no root privileges either.

    Not so scary:

    sudo su root
    apt-get install libasound2-dev

    Hooray, now pip install rtmidi completes without errors, as does pip install magenta-gpu.

    We find the source files on the Internet and download them into the midi folder. They sound like this.

    Convert the MIDI files into a data format the network can work with:

    convert_dir_to_note_sequences \
    --input_dir=midi \
    --output_file=notesequences_R2Midi.tfrecord \
    --log=DEBUG
    and start training:

    music_vae_train \
    --config=hier-multiperf_vel_1bar_med \
    --run_dir=/home/RNCDtrain/ \
    --num_steps=5000 \
    --checkpoints_to_keep=2 \
    --hparams=sampling_rate=1000.0 \
    --hparams=batch_size=32,learning_rate=0.0005 \
    --mode=train

    A problem again. TensorFlow crashes with an error: it can't find a library. Fortunately, someone had already described this error a few days earlier, and the Python source can be patched.

    We go into the folder with the package


    and replace the import line as described in the GitHub bug report.

    Launch music_vae_train again and... hooray! Training is underway!


    hier-multiperf_vel_1bar_med: I used the polyphonic model (up to 8 instruments), which produces one measure at a time.

    An important parameter is checkpoints_to_keep=2: disk space in the cloud is limited, and one of my problems was that training kept getting interrupted by a full disk; checkpoints are quite heavy, 0.6–1 GB each.

    Somewhere around epoch 5000, the error settles into jumping between about 40 and 70. I don't know whether that's a good result or not, but it seems that with such a small training set the network would simply overfit beyond that point, and there's no sense in wasting more of the free GPU time in Google's data centers. On to generation.

    For some reason the Magenta installation didn't include the generation script itself; I had to drop it into the folder by hand:

    curl -o

    Finally, create the fragments:

    music_vae_generate \
    --config=hier-multiperf_vel_1bar_med \
    --checkpoint_file=/home/RNCDtrain/train/ \
    --mode=sample \
    --num_outputs=32 \
    --output_dir=/home/andrey_shagal/ \
    --temperature=0.3

    config: the generation type, must match the one used during training (multitrack, one measure)
    checkpoint_file: the folder to take the latest trained model file from
    mode: sample creates standalone samples (the other option, interpolate, creates a transitional measure between two fragments)
    num_outputs: how many fragments to generate
    temperature: the randomization parameter for sampling, from 0 to 1. At 0 the result is more predictable, closer to the source material; at 1 it's "I'm an artist, that's how I see it."
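    The effect of temperature is easy to demonstrate in isolation. A sketch of temperature-scaled sampling from a toy distribution (my own illustration, not Magenta's code):

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Sample an index from logits softened or sharpened by temperature.

    Low temperature concentrates probability on the most likely option
    (predictable output); high temperature flattens the distribution
    (more surprising output).
    """
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.1])

cold = [sample_with_temperature(logits, 0.1, rng) for _ in range(100)]
hot = [sample_with_temperature(logits, 10.0, rng) for _ in range(100)]
# Cold draws almost always pick option 0; hot draws spread across all three.
print(sorted(set(cold)), sorted(set(hot)))
```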

    At the output I get 32 one-measure fragments. After running the generator several times, I listen to the fragments and splice the best ones into a single track: neurancid.mp3.

    And that's "how I spent my summer." I'm pleased. Of course, Radio Maximum is unlikely to put it on the playlist, but if you listen, it really does resemble the original band, Rancid. The sound, of course, differs from a studio recording, but then we were working primarily with notes. From here there's room for action: process the MIDI with various VST plug-ins, re-record the parts with live musicians, or wait until the folks from Wave2Midi2Wave get around to overdriven guitars.

    As for the notes themselves, I have no complaints. Ideally I'd like the neural network to create a masterpiece, or at least a hit for the Billboard Hot 100, but so far all it has learned from the rockers is to use alcohol and drugs and to play a whole measure of the same eighth note (actually not only that: I'm as proud as a father of its transition from second 20 to second 22). There are reasons for this, and here they are.

    1. A small amount of data.
    2. The model I used produces fragments one measure long, and in punk rock, as a rule, not that much happens within a single measure.
    3. Interesting transitions and melodic work happen precisely against the background of driving riffs and chord changes, and the autoencoder, combined with the small amount of data, seems to have lost most of the melodies and reduced all the riffs to two consonant chords plus a few atonal ones. It would be worth trying the model that works with 16 measures; unfortunately, only three voices are available in it.

    I contacted the developers, and they recommended trying to reduce the dimensionality of the latent space, since they had trained their network on 200,000 tracks and I on 15. I couldn't achieve a visible effect by shrinking the z-space, but I still have plenty of experiments ahead.
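    If I read the MusicVAE configs correctly, the latent dimensionality is exposed as the z_size hyperparameter, so shrinking it should look roughly like this (the value 128 is my illustrative guess, not a figure from the developers):

```shell
music_vae_train \
  --config=hier-multiperf_vel_1bar_med \
  --run_dir=/home/RNCDtrain/ \
  --hparams=z_size=128,batch_size=32,learning_rate=0.0005 \
  --mode=train
```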

    By the way, monotony and repetitiveness are not always a minus. From shamanic rituals to a techno party is, as we know, one step. I should try training the model on something in that vein: rave, techno, dub, reggae, hip-hop backing tracks. Surely there's a chance of creating something pleasantly mesmerizing. I found 20 or so Bob Marley songs in MIDI and, voilà, a very nice loop:

    On top of the MIDI, the parts were re-recorded with live bass and guitars and processed with VST synthesizers so that the piece sounds juicier. Originally the network produced just the notes. Played with a standard MIDI player, they sound like this:

    Surely, if you create a number of basic themed drum patterns, run them through Beat Blender, and add basic bass and synth parts via Latent Loops (above), you could launch an algorithm for a techno radio station that would continuously create new tracks, or even one endless track. Eternal bliss!

    MusicVAE also lets you train the network to generate 16-measure trio fragments: drums, bass, and lead. Interesting as well. The input data are multitrack MIDI files: the system splits them into trios in all possible combinations and trains the model on those. Such a network demands far more resources, but the result is a whole 16 measures! I couldn't resist. I tried to imagine how a band might sound that plays something halfway between Rancid and NOFX, feeding in roughly equal numbers of tracks by each band for training:

    Here, too, the MIDI parts were re-recorded with live guitars. With a standard MIDI player it sounds like this:

    Interesting! This is definitely better than my first band! And, by the way, the same model produces quite decent free jazz:

    Problems I encountered:

    1. The lack of a good, convenient setup that would cut the waiting time for training. The model runs only under Linux; training takes a long time, and without a GPU a very long time, while I constantly wanted to tweak the parameters and see what happened. For example, a cloud server with a single GPU ground through 100 epochs of the "16-measure trio" model in 8 hours.
    2. A typical machine learning problem: lack of data. A mere 15 MIDI files is very little for understanding music. The neural network, unlike my younger self, didn't wear out 6 Rancid albums or go to the concerts; this result was obtained from 15 MIDI tracks composed by unknown people and far from the original. Now, if you could hang sensors on the guitarist and capture every overtone of every note... Let's see how the Wave2Midi2Wave idea develops; maybe in a few years notation can be dropped from this problem altogether.
    3. A musician should play in time, but not perfectly. In the resulting MIDI scores (in the drums, for example) there is no dynamics: everything is played at the same volume, exactly on the click (as musicians say, dead on the grid). If you vary the notes even slightly at random, the music starts to sound livelier and more pleasant. Again, Wave2Midi2Wave is already tackling this issue.
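    Such "humanization" is simple to sketch. A toy example (my own, with made-up note tuples) that randomly varies velocity and micro-timing:

```python
import random

def humanize(notes, vel_jitter=10, time_jitter=0.01, seed=42):
    """Randomly vary velocity and onset time of (onset, pitch, velocity) notes.

    MIDI velocity is kept in 1..127; onsets are shifted by a few
    milliseconds so the part no longer sits dead on the grid.
    """
    rng = random.Random(seed)
    out = []
    for onset, pitch, velocity in notes:
        v = max(1, min(127, velocity + rng.randint(-vel_jitter, vel_jitter)))
        t = max(0.0, onset + rng.uniform(-time_jitter, time_jitter))
        out.append((round(t, 4), pitch, v))
    return out

# Four kick-drum hits (MIDI pitch 36), identical velocity, exactly on the beat:
kick = [(0.0, 36, 100), (0.5, 36, 100), (1.0, 36, 100), (1.5, 36, 100)]
print(humanize(kick))
```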

    Now you have some idea of what AI can do in music creation, and of my musical preferences. What role do you think awaits AI in the creative process of the future? Can a machine create music on a par with, or even better than, a human, or be an assistant in the creative process? Or will artificial intelligence become famous in music only for primitive handicrafts?
