
Experiments with malloc and neural networks

More than a year ago, when I worked as an anti-spam engineer at Mail.Ru Group, an idea struck me, and I wrote about experiments with malloc. Back then I was helping to run the AKOS (computer architecture and operating systems) seminars at MIPT's FIVT for fun, and one of the topics was memory allocation. It is a large and very interesting topic that covers both low-level kernel details and fairly algorithm-heavy data structures. Every textbook says that one of the main problems of dynamic memory allocation is its unpredictability. As the saying goes, if only we knew the cards in advance, we would be living in Sochi. If an oracle told us the whole plan of future allocations and frees up front, we could devise an optimal strategy that minimizes heap fragmentation, peak memory consumption, and so on. Hence all the fuss with manual allocators. While thinking about this, I ran into the lack of tools for logging malloc()
and free()
calls. I had to write them! That is what the first article was about (I also explored macOS there). Two parts were planned, but life took a sharp turn and malloc()
had to wait. So, it is time to restore justice and deliver what was promised: strike at heap prediction with deep learning.
Inside:
- Improving libtracemalloc, the malloc() interceptor.
- Building an LSTM in Keras - a deep recurrent network.
- Training the model on a real application (vcmi/vcmi - and you thought, what does Heroes III have to do with it?).
- Being surprised by the unexpectedly good results.
- Fantasizing about practical applications of the technology.
- Source code.
Interested? Welcome under the cut.
libtracemalloc
In the first article we logged only malloc(). That implementation had several shortcomings:
- Multithreading was ignored when serializing the calls. In other words, we lost the information about which thread allocated the memory.
- Multithreading was ignored when writing to the file: write() was called concurrently, without a mutex. Although POSIX requires write() to be atomic, Linux before version 3.14 had an unpleasant bug with the file position that was fixed by Linus himself, and some people still test this property in practice.
- free() was not logged at all.
The following code fixes (1) and (2):
#define _GNU_SOURCE  /* for RTLD_NEXT */
#include <dlfcn.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* LOG is the log file path - a compile-time constant. */
int fd = 0;
void* (*__malloc)(size_t) = NULL;
pthread_mutex_t write_sync = PTHREAD_MUTEX_INITIALIZER;

/* fills in the current wall-clock time; implementation elided */
inline void get_time(long* sec, long* mcsec) { /* ... */ }

void* malloc(size_t size) {
    if (!__malloc) {
        /* look up the real malloc() further down the link chain */
        __malloc = (void*(*)(size_t)) dlsym(RTLD_NEXT, "malloc");
    }
    if (!fd) {
        /* O_EXCL: only one of the forked processes gets to write the log */
        fd = open(LOG, O_WRONLY | O_CREAT | O_EXCL, 0666);
        if (fd < 0) {
            return __malloc(size);
        }
    }
    long sec, mcsec;
    get_time(&sec, &mcsec);
    void* ptr = __malloc(size);
    char record[64];
    pthread_mutex_lock(&write_sync);
    write(fd, record, sprintf(record, "%ld.%06ld\t%ld\t%zu\t%p\n",
                              sec, mcsec, pthread_self(), size, ptr));
    pthread_mutex_unlock(&write_sync);
    return ptr;
}
Accordingly, we use pthread_self() and perform write() under a mutex. If you look closely, you can also spot O_EXCL among the open() flags. I added it to handle an early fork(), i.e. one that happens before any heap activity: otherwise two forked processes would open the file simultaneously and overwrite each other's records.
It remains to fix (3):
void (*__free)(void*) = NULL;

void free(void* ptr) {
    if (!__free) {
        __free = (void(*)(void*)) dlsym(RTLD_NEXT, "free");
    }
    if (!fd) {
        /* the log was never opened - just pass through */
        __free(ptr);
        return;
    }
    long sec, mcsec;
    get_time(&sec, &mcsec);
    char record[64];
    pthread_mutex_lock(&write_sync);
    /* the size column is set to -1 to mark a deallocation */
    write(fd, record, sprintf(record, "%ld.%06ld\t%ld\t-1\t%p\n",
                              sec, mcsec, pthread_self(), ptr));
    pthread_mutex_unlock(&write_sync);
    __free(ptr);
}
The logging is done exactly by analogy with malloc().
As a result, we get records of the form "timestamp, thread id, size (-1 for free()), pointer":
0.000000 140132355127680 552 0x2874040
0.000047 140132355127680 120 0x2874270
0.000052 140132355127680 1024 0x28742f0
0.000079 140132355127680 -1 0x2874270
0.000089 140132355127680 -1 0x28742f0
0.000092 140132355127680 -1 0x2874040
0.000093 140132355127680 -1 (nil)
0.000101 140132355127680 37 0x2874040
0.000133 140132355127680 32816 0x2874070
0.000157 140132355127680 -1 0x2874070
0.000162 140132355127680 8 0x2874070
LSTM in Keras
First, let's decide what exactly we want to predict. Ideally, given the history of malloc() and free() calls, we would predict the entire future sequence of calls together with their sizes. It would be very cool to apply attention and point directly at which memory regions will be freed, but I leave that to the inquisitive reader - it would make an excellent graduation thesis. Here I take the easier route and predict only how much memory is allocated or freed next.
The next point: predicting the exact size with RMSE is a rotten, doomed undertaking. I tried it: the network treacherously converges to the mean value over the whole dataset. Therefore I introduce size classes at powers of two, just like the famous buddy allocator. In other words, the prediction problem is posed as a classification problem. Example:
size | class
---|---
1 | 1
2 | 2
3 | 2
4 | 3
5 | 3
6 | 3
In total we get 32 classes, which imposes an implicit (and strictly speaking incorrect) restriction of malloc(uint32_t size). In reality size_t is 64-bit, and you can allocate more than 4 GiB.
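In code this mapping is simply the bit length of the size; a tiny illustration (the helper name is mine, not from the tracer):
# Illustrative only: the class of an allocation is the bit length of its size,
# i.e. floor(log2(size)) + 1, which reproduces the table above.
def size_class(size):
    return size.bit_length()

assert [size_class(s) for s in (1, 2, 3, 4, 5, 6)] == [1, 2, 2, 3, 3, 3]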
For free() we take the class from the corresponding malloc() call, since we know the pointer to the freed memory. To distinguish free() classes from malloc() classes, we add 32 to the former. That makes 64 classes in total, which is, frankly, not very many: if there are any patterns in the data, even the dumbest network should be able to learn something from them. The following code parses the log produced by libtracemalloc.
import gzip
from collections import defaultdict

# args.input is the path to the gzipped log produced by libtracemalloc
threads = defaultdict(list)
ptrs = {}
with gzip.open(args.input) as fin:
    for line in fin:
        # columns: timestamp, thread id, size (-1 for free), pointer
        parts = line[:-1].split(b"\t")
        thread = int(parts[1])
        size = int(parts[2])
        ptr = parts[3]
        if size > -1:
            # allocation: remember the class and the live pointer
            threads[thread].append(size.bit_length())
            ptrs[ptr] = size
        else:
            # deallocation: recover the size from the matching malloc()
            size = ptrs.get(ptr, 0)
            if size > 0:
                del ptrs[ptr]
                threads[thread].append(32 + size.bit_length())
As you can see, we record new pointers and delete the old ones, effectively emulating the heap. Now we need to turn the threads dictionary into digestible x and y.
import logging
import numpy

log = logging.getLogger(__name__)

train_size = sum(max(0, len(v) - maxlen) for v in threads.values())
try:
    x = numpy.zeros((train_size, maxlen), dtype=numpy.int8)
except MemoryError as e:
    log.error("failed to allocate %d bytes", train_size * maxlen)
    raise e from None
y = numpy.zeros((train_size, 64), dtype=numpy.int8)
offset = 0
for _, allocs in sorted(threads.items()):
    for i in range(maxlen, len(allocs)):
        x[offset] = allocs[i - maxlen:i]
        # allocs[i] is already a class index computed during parsing
        y[offset, allocs[i]] = 1
        offset += 1
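A quick sanity check of the resulting tensors (an illustrative aside, not part of the original script):
# Each row of x is a window of maxlen preceding classes, and y one-hot
# encodes the class of the event that immediately follows that window.
assert x.shape == (train_size, maxlen)
assert y.shape == (train_size, 64)
assert (y.sum(axis=1) == 1).all()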
Here maxlen is the length of the context from which we predict; by default I set it to 100. It remains to build the model itself: two LSTM layers with a fully connected layer on top, the standard setup.
from keras import models, layers, regularizers, optimizers

model = models.Sequential()
# frozen identity embedding: turns class indices into one-hot vectors on the fly
embedding = numpy.zeros((64, 64), dtype=numpy.float32)
numpy.fill_diagonal(embedding, 1)
model.add(layers.embeddings.Embedding(
    64, 64, input_length=x[0].shape[-1], weights=[embedding],
    trainable=False))
# layer_type is "LSTM" or "GRU"; neurons, dropout, etc. are script parameters
model.add(getattr(layers, layer_type)(
    neurons, dropout=dropout, recurrent_dropout=recurrent_dropout,
    return_sequences=True))
model.add(getattr(layers, layer_type)(
    neurons // 2, dropout=dropout, recurrent_dropout=recurrent_dropout))
model.add(layers.Dense(y[0].shape[-1], activation="softmax"))
optimizer = getattr(optimizers, optimizer)(lr=learning_rate, clipnorm=1.)
model.compile(loss="categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy", "top_k_categorical_accuracy"])

To be careful with memory, I use an untrainable Embedding layer instead of one-hot encoding the data in advance. Thus x and y are 8-bit (64 < 256), and only the minibatch gets expanded to 32-bit floats. This matters, because even very short-running programs manage to generate tens of millions of samples. neurons defaults to 128, and there are no dropouts. Let's start training!
model.fit(x, y, batch_size=batch_size, epochs=epochs,
          validation_split=validation)
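To give a sense of why the 8-bit storage matters, here is a rough back-of-the-envelope estimate (the ten-million sample count is an illustrative assumption, not a measurement):
# ~10 million samples with a context of 100:
samples, maxlen, classes = 10 * 10**6, 100, 64
print(samples * maxlen / 2**30)                # ~0.9 GiB as int8
print(samples * maxlen * classes * 4 / 2**30)  # ~238 GiB as one-hot float32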
I had to export TF_CPP_MIN_LOG_LEVEL=1, otherwise TensorFlow spammed the log with PoolAllocator messages.
Results
As the test subject, let's take vcmi/vcmi - a GPL implementation of the Heroes of Might and Magic III engine that I used to contribute to. A very complete and quite good implementation, by the way, though the project urgently needs automated tests. And while we are on the subject: I would really like to bolt a Python API onto the engine and try to implement a neural-network AI. People sometimes object that Heroes has too many degrees of freedom. I answer that nobody forces you to build AlphaGo right away - you can start small, with the battles.
We clone and build the vcmi code, fetch the original game resources, run the game and collect the heap call log:
# ... easy ...
LD_PRELOAD=/path/to/libtracemalloc.so /path/to/vcmiclient
I launched Arrogance, wandered around for a couple of in-game weeks and collected resources. The speed sags a lot, but then again, this is not valgrind. The resulting log was uploaded to Google Drive - 94 MiB. The distribution of the first 32 classes:
Thus our baseline is 0.26 accuracy; let's try to beat it. Training the model took 10.5 hours per epoch on a Pascal Titan (2 epochs, 21 hours, validation_split
= 15%). The second epoch does not change the result much, so a single epoch is really enough. The resulting metrics:
loss: 0.0512
val_loss: 0.1160
acc: 0.9837
val_acc: 0.9711
top_k_categorical_accuracy: 1.0000
val_top_k_categorical_accuracy: 0.9978
In my opinion, it turned out very cool: we achieved 97% accuracy on validation, which means roughly 20 correct predictions in a row with probability above 50% (0.97^20 ≈ 0.54). If we replace the LSTM with a GRU, the epoch time can be cut to 8 hours, with slightly better metrics.
Here is how I see the development vectors for the model:
- Brute-force increasing the number of neurons may well improve the metrics.
- Experiments with the architecture and hyperparameters. At a minimum, the dependence on the context length and the number of layers needs to be investigated.
- Enriching the prediction context by unwinding the call stack and applying debug info - anything from a chain of function names to the complete source code.
The trained model is about 1 megabyte in size, much smaller than the full history - 98 MB gzipped. I am sure that with proper tuning the network can be shrunk further without sacrificing quality.
Future
“Wait,” the reader will say, “nobody in their right mind would run a forward pass of the network on every malloc() call.” Of course, when every cycle counts, multiplying 128-by-100-by-64 matrices looks silly. But I have objections:
- You only need to run the network once every ~20 calls, or even less often (see the sketch after this list); as the technique improves, this number will grow.
- Multi-core processors have long been commonplace, and they are far from 100% utilized. While the program idles, blocked in some read(), we have a tangible slice of time and resources for inference.
- Neural accelerators are just around the corner: Google, IBM and Samsung are already making TPUs. Why not use them to speed up existing "ordinary" programs by analyzing their behavior at run time and dynamically adapting the runtime to them?
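To illustrate the first point, here is a minimal sketch (my own illustration, not code from the project) of how, between synchronization points, one could roll the trained model forward and guess the next ~20 heap events; maxlen = 100 and the class encoding described above are assumed:
import numpy

def predict_future(model, context, steps=20, maxlen=100):
    # context: the last maxlen observed class indices, encoded exactly
    # as in the training data (malloc and free size classes)
    context = list(context)
    future = []
    for _ in range(steps):
        window = numpy.array(context[-maxlen:], dtype=numpy.int8)[numpy.newaxis]
        probs = model.predict(window, verbose=0)[0]
        cls = int(probs.argmax())  # the most likely next event class
        future.append(cls)
        context.append(cls)        # feed the prediction back in
    return future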
TL;DR
An RNN model for predicting dynamic memory allocation and deallocation in the game Heroes III was trained to 97% accuracy, which is quite unexpected and cool. There is a real prospect of applying the idea in practice to build a new generation of memory allocators.
Machine learning on source code is a new, exciting topic. Come to the source{d} tech talks on June 3rd, where Charles Sutton, author of half of all the papers in the field, and Georgios Gousios, star of Mining Software Repositories and founder of GHTorrent, will speak. Participation is free, seats are limited.