Olegchir March 11, 2019 at 14:12

Homemade Garbage Collector for OpenJDK

Transfer

This is a translation of the article by Alexey Shipilev “Do It Yourself (OpenJDK) Garbage Collector” , published with the consent of the author. Report any typos and other bugs in PM - we will fix them.

The process of creating something in runtime is a fun exercise. At least the creation of the first version! To build a reliable, high-performance, fail-safe runtime subsystem, the behavior of which can be conveniently observed and debugged, is a very, very difficult task.

Making a simple garbage collector is deceptively simple, and now I want to do this in this article. Roman Kennke at FOSDEM 2019 made a talk and demo titled “Writing a GC in 20 Minutes,” using an earlier version of this patch. Despite the fact that the code implemented there demonstrates a lot and is plentifully commented on, there is a need for a good high-level description of what is happening - this is how this article appeared.

A basic understanding of the work of garbage collectors will greatly help in understanding what is written here. The article will use specifics and ideas in a specific implementation of HotSpot, but there will be no introductory course on GC design here. Take the GC Handbook and read the first chapters about the very basics of GC, and even faster will start the Wikipedia article .

Content

1. What GC consists of

Now that a lot of different GCs have been written, it’s quite simple to make your own - many already written elements can be (re) used to shift some of the worries about implementation details to proven and tested code.

1.1. Epsilon gc

OpenJDK 11 introduces a new JEP 318: "Epsilon: A No-Op Garbage Collector (Experimental) . " Its task is to provide a minimal implementation for the case when freeing memory is not necessary or even prohibited. JEP discusses in more detail why it might be useful.

From the point of view of implementation, “garbage collector” is a bad name, it would be more correct to use the term “automatic memory manager” , which is responsible for both allocating and freeing memory. Epsilon GC implements only "allocation", and does not deal with "release" at all. Therefore, you can take Epsilon GC and begin to implement the “release” algorithms from scratch.

1.1.1. Memory allocation

The most developed part of the Epsilon GC is responsible for allocating memory . It serves external requests for allocating memory of arbitrary size and creating Thread-Local Allocation Buffer (TLAB) of the desired size. The implementation itself is trying not to extend the TLAB too much, since there will be no free memory and no one will return the lost bytes.

1.1.2. Barriers

Some garbage collectors require interaction with the application to maintain GC invariants, forcing the runtime and application to create so-called barriers when trying to access the heap. This is true for all multi-threaded collectors, as well as for many collectors with generations and stopping the world.

Epsilon does not require barriers, but runtime and the compiler still want to know that barriers do nothing. Handling it every time everywhere can be tiring. Fortunately, starting with OpenJDK 11, there is a new JEP-304: “Garbage Collection Interface” , which makes it much, much easier to insert barriers. In particular, the barrier set in Epsilon is empty , and all the trivial work — save, load, CAS, arraycopy — can be delegated to implementations of trivial barriers from an existing superclass. If you are making a GC that also doesn’t need barriers, you can simply reuse the code from Epsilon.

1.1.3. Monitoring connection

The last tedious part of the GC implementation is hooks to a bunch of monitoring mechanisms inside the JVM: MX-bins, diagnostic commands, etc. should work. Epsilon has already done all this for you.

1.2. Rantime and GC

1.2.1. Root elements

The garbage collector, in the general case, needs to know what exactly in the Java runtime has heap references. These root elements, called GC roots , can be slots on stream stacks and local variables (including those found in JIT-compiled code!), Native classes and class-loaders, references in JNI, and so on. Attempts to identify these elements can be very complex and tedious. But in Hotspot, they are all tracked using the appropriate VM subsystems, so you can simply learn how existing GC implementations work with them. Further in the text we will see it.

1.2.2. Object Crawl

The garbage collector should bypass outbound links in Java objects. This operation is found everywhere, so the common parts of the runtime provide ready-made workarounds; you don’t need to write anything yourself. Below in the text there will be a section with a specific implementation, and there you can meet, for example, challenges obj→oop_iterate.

1.2.3. Displacements

The moving garbage collector needs to write down the new addresses of the moved objects somewhere. There are several places where you can write this forwarding data .

You can reuse the “marker word” in the object itself (Serial, Parallel, etc.). After the world stops, all accesses to the object are controlled, and it is guaranteed that no Java thread can see the temporary data that we decided to enter in the marker word. You can reuse it to store forwarding data.
You can maintain a separate native movement table ( ZGC , C4, and others). This completely isolates the GC from runtime and the rest of the application, since only the GC knows about the existence of such a table. Competitive assemblers usually use just such a scheme - they do not want to suffer with a bunch of unnecessary problems.
You can add another word to the object ( Shenandoah and others). This combination of the two previous approaches not only allows the runtime and application to work with existing headers without problems, but also saves forwarding data.

1.2.4. Marker data

The garbage collector needs to write marking data somewhere . And again, there are several ways to save them:

You can reuse the marker word in the object itself (Serial, Parallel, etc.). Again, in world stop mode, you can use the bits in the marker word to encode the fact of a label. Further, if you need to go around all living objects, we go along the heap, object after object - this is possible due to the fact that the heap is parsable .
You can maintain a separate structure for storing marking data (G1, Shenandoah, etc.). This is usually done using a separate bitmap , which maps every N bytes of the heap to 1 bit of the card. Usually, Java objects are aligned by 8 bytes , so the card maps every 64 bits from the heap to 1 bit of the card, occupying 1/64 of the heap size in native memory. These overheads pay off well when scanning the heap for the presence of living objects, especially sparse ones: bypassing the map is often much faster than bypassing the heap being disassembled object by object.
Encode labels into links themselves (ZGC, C4 and others). This requires coordination with the application, then you need to cut out all these labels from the links or perform some other tricks to maintain correctness. In other words, we need either barriers or some additional work from the GC.

2. General plan

Most likely, the easiest to implement on top of Epsilon is the Mark-Compact, in the LISP2 style. The basic idea of this GC is described both on Wikipedia and in the GC Handbook (chapter 3.2). A sketch of the algorithm will be in the section with the implementation below, but I strongly recommend reading a little Wikipedia or the GC Handbook to understand what we are going to do.

The algorithm in question is the shifting GC: the moving objects move in an array to the very beginning of the heap. It has its pros and cons:

It maintains the order of memory allocations. This is very good for controlling the layout in memory, if it matters to you (control freaks, it's your time!). The downside is that you won’t get automatic link locality this way.
Its complexity is O (N) of the number of objects. However, linearity comes at a price: GC is required to bypass a bunch of 4 times for each build cycle.
It does not require free memory on the heap! There is no need to reserve memory on the heap to evacuate live objects, so you can even work with a heap that is overflowed by 99. (9)%. If we take on other ideas of simple collectors, for example, a scavenger with a semi-space (semi-space scavenger), we will have to slightly rewrite the presentation of the heap and reserve a little space for evacuation, but this is beyond the scope of this exercise.
If you work a little on the issue, you can achieve zero memory and time consumption during periods when the GC is inactive. It starts on a memory in an arbitrary state, and stops, significantly compacting it. This fits very well with how Epsilon works: it just keeps highlighting right after the last object. This is also a minus: a few dead objects at the beginning of the heap lead to a large number of movements.
It simply does not require new barriers, it can be reused EpsilonBarrierSetas is.

For simplicity, the GC implementation will use a full stop of the world (stop-the-world, STW), it will not have generations or multithreading. For this case, it makes sense to use a bitmap to store marks and reuse the marker word to store movement data.

3. Implementation of the GC core

Reading and understanding the whole implementation may be too complicated for an ignorant person. In this section, we will understand it step by step.

3.1. Prologue

The garbage collector usually needs to do a couple of things to prepare for the collection. Read the comments, they should speak for themselves:

{
  GCTraceTime(Info, gc) time("Step 0: Prologue", NULL);
  // Выделить память на битовую карту меток. Существует несколько причин сделать это
  // до начала цикла: память не тратится, если нет сборки, память
  // «очищается» при первом использовании, а незатронутые части карты отображаются
  // на нулевую страницу, улучшая производительность на разреженных кучах.
  if (!os::commit_memory((char*)_bitmap_region.start(), _bitmap_region.byte_size(), false)) {
    log_warning(gc)("Could not commit native memory for marking bitmap, GC failed");
    return;
  }
  // Хорошо разгребаемой кучи для этого алгоритма не нужно, но хочется,
  // чтобы потоки отдали нам свои текущие TLAB-ы.
  ensure_parsability(true);
  // Сообщить разным системам рантайма о том, что мы делаем GC.
  CodeCache::gc_prologue();
  BiasedLocking::preserve_marks();
  // Косвенные ссылки будут заново искаться на фазе маркировки.
  // Нужно почистить и активировать таблицу для них.
  DerivedPointerTable::clear();
}

Since we use a bitmap to track the reachability of objects, we need to clear it before use. Or in our case, since we aim to never ask for resources before starting the GC cycle, we will have to commit the bitmap to memory in advance . This provides several interesting advantages, at least on Linux, where most of the bitmap will point to page zero, especially for sparse heaps.

Threads should free their TLABs and ask GC for new ones after the build is complete.

Do not confuse TLAB and java.lang.ThreadLocal. From the point of view of the GC, ThreadLocal are ordinary objects, and they will not be compiled by the GC unless specifically required otherwise in Java code.

Some parts of the runtime, especially those holding links to the Java heap, will break during garbage collection, so you need to specifically warn them that the GC will start working soon. This will allow the respective subsystems to prepare and save part of their state before the GC makes its move.

3.2. Marking

Marking in stopping the world mode becomes quite simple when almost everything has already been done for us. Labeling is pretty standard, and most likely, in many implementations, GC is the first step.

{
  GCTraceTime(Info, gc) time("Step 1: Mark", NULL);
  // Маркировочный стек и замыкание, которое делает большую часть работы. Замыкание
  // просканирует исходящие ссылки, пометит их, и протолкнет свежепомеченные
  // объекты на стек для дальнейшей обработки.
  EpsilonMarkStack stack;
  EpsilonScanOopClosure cl(&stack, &_bitmap);
  // Засеять маркировку ссылками из корневых элементов.
  process_roots(&cl);
  stat_reachable_roots = stack.size();
  // Сканировать оставшуюся часть кучи, пока объекты не закончатся. 
  // Этот процесс гарантировано закончится, поскольку существует момент, 
  // когда все живые объекты окажутся помеченными.
  while (!stack.is_empty()) {
    oop obj = stack.pop();
    obj->oop_iterate(&cl);
    stat_reachable_heap++;
  }
  // После завершения маркировки косвенных ссылок не осталось.
  DerivedPointerTable::set_active(false);
}

This works exactly the same as for any other graph: you start the traversal with the initial set of reachable vertices, go along the outgoing edges and record all the visited vertices. The traversal continues until all the unvisited peaks end. In GC, “vertices” are objects, and “edges” are links between them.

Technically, we could just recursively go through the graph of objects, but this is a bad idea for arbitrary graphs that can have very large diameters. Imagine a linked list of a billion peaks! Therefore, to limit the depth of recursion, we use a marking stack that records detected objects.

The initial set of reachable objects is GC roots. Now do not dwell on what it is process_roots, more on that later. For now, let's just say that it bypasses all reachable links from the VM side.

Bitmap stamped and serves as a tool for recording marking Front ( marking wavefront ) (the set of already visited sites), and in the end - as the desired result storage, the set of all reachable objects. The real work takes place in EpsilonScanOopClosure, it is applied to all interesting objects and iterated over all the links of the selected object.

Look, Java knew how to close (closure) before it became fashionable!

class EpsilonScanOopClosure : public BasicOopIterateClosure {
private:
  EpsilonMarkStack* const _stack;
  MarkBitMap* const _bitmap;
  template 
  void do_oop_work(T* p) {
    // p - это ссылка на место памяти, где расположен oop, нужно загрузить
    // оттуда значение и распаковать сжатую ссылку, если необходимо:
    T o = RawAccess<>::oop_load(p);
    if (!CompressedOops::is_null(o)) {
      oop obj = CompressedOops::decode_not_null(o);
      // Объект найден. Посмотреть, отмечен ли он. Если нет,
      // пометить и сбросить на маркировочный для дальнейшего обхода. 
      // Здесь подойдёт неатомарная проверка+запись, 
      // поскольку замыкание выполняется строго в одном треде.
      if (!_bitmap->is_marked(obj)) {
        _bitmap->mark((HeapWord*)obj);
        _stack->push(obj);
      }
    }
  }
};

After completing this step, _bitmapcontains bits indicating the location of live objects. Thanks to this, it is possible to bypass all living objects, for example:

// Пройти по маркировочному битмапу и позвать замыкание на каждый помеченный объект.
// Это гораздо быстрее, чем пообъектный обход (очень разреженной) разбираемой кучи, но 
// на размещение битмапа тратится вплоть до 1/64 размера кучи.
void EpsilonHeap::walk_bitmap(ObjectClosure* cl) {
   HeapWord* limit = _space->top();
   HeapWord* addr = _bitmap.get_next_marked_addr(_space->bottom(), limit);
   while (addr < limit) {
     oop obj = oop(addr);
     assert(_bitmap.is_marked(obj), "sanity");
     cl->do_object(obj);
     addr += 1;
     if (addr < limit) {
       addr = _bitmap.get_next_marked_addr(addr, limit);
     }
   }
}

3.3. Calculate new addresses

This is also a fairly simple step, and it implements exactly what the algorithm says.

// Мы собираемся хранить forwarding data (место, где размещена новая копия)
// в маркировочном слове. Часть этих маркировочных слов нужно очень аккуратно сохранить.
// Здесь мы будем хранить и поддерживать список таких специальных слов.
PreservedMarks preserved_marks;
// Новый конец памяти после GC.
HeapWord* new_top;
{
  GCTraceTime(Info, gc) time("Step 2: Calculate new locations", NULL);
  // Обойти все живые объекты, вычислить их новые адреса и сохранить эти
  // адреса в маркировочные слова. Возможно, сохранить какие-то отметки.
  EpsilonCalcNewLocationObjectClosure cl(_space->bottom(), &preserved_marks);
  walk_bitmap(&cl);
  // После вычисления адресов мы знаем положение новой вершины памяти. 
  // Мы не можем прямо сейчас использовать её, ибо внутренние проверки
  // в следующих фазах всё ещё ожидают, что объекты всё ещё лежат "ниже"
  // этой вершины.
  new_top = cl.compact_point();
  stat_preserved_marks = preserved_marks.size();
}

The only thing that catches your eye is that we decided to store new addresses in the marking word of Java objects, and this word can already be occupied by something important, for example, information about locks. Fortunately, such nontrivial marking words are quite rare, and we can simply store them separately, if at all necessary: this is what is used PreservedMarks.

Real algorithmic work does EpsilonCalcNewLocationObjectClosure:

class EpsilonCalcNewLocationObjectClosure : public ObjectClosure {
private:
  HeapWord* _compact_point;
  PreservedMarks* const _preserved_marks;
public:
  EpsilonCalcNewLocationObjectClosure(HeapWord* start, PreservedMarks* pm) :
                                      _compact_point(start),
                                      _preserved_marks(pm) {}
  void do_object(oop obj) {
    // Записываем новое местоположение объекта: это текущая точка уплотнения.
    // Если объект остаётся на том же месте (это верно для объектов в плотном префиксе,
    // которое обычно и получается), не стоит беспокоиться о записи перемещения,
    // позволим последующему коду проигнорировать его.
    if ((HeapWord*)obj != _compact_point) {
      markOop mark = obj->mark_raw();
      if (mark->must_be_preserved(obj)) {
        _preserved_marks->push(obj, mark);
      }
      obj->forward_to(oop(_compact_point));
    }
    _compact_point += obj->size();
  }
  HeapWord* compact_point() {
    return _compact_point;
  }
};

forward_to- the most important part, because it stores the "moving address" in the marker word of the object. This will be needed in the next steps.

3.4. Fix pointers

Now you need to go through the heap again and rewrite all the links with their new addresses according to the following algorithm:

{
  GCTraceTime(Info, gc) time("Step 3: Adjust pointers", NULL);
  // Пройти все живые объекты _и их ссылочные поля_, и записать в них
  // «новые адреса». Мы знаем новые адреса из forwarding data,
  // хранящейся в маркировочном слове. Вначале позаботимся об объектах на куче.
  EpsilonAdjustPointersObjectClosure cl;
  walk_bitmap(&cl);
  // Теперь сделаем то же самое, но для всех корневых элементов VM, которые
  // самостоятельно держат ссылки на объекты: все эти ссылки тоже нужно обновить.
  EpsilonAdjustPointersOopClosure cli;
  process_roots(&cli);
  // Ну и наконец, нужно сообщить сохранённым меткам о том,
  // что объекты скоро начнут двигаться.
  preserved_marks.adjust_during_full_gc();
}

There are two kinds of references to shifted objects: outgoing either from objects on the heap itself, or from GC roots. You need to update both link classes. Some saved labels also store references to objects, so you need to ask them to update. PreservedMarksknows how to do this because it expects “forwarding data” in the same place where we saved it, in the marking word of the object.

Closures are divided into two types: some take objects and bypass their contents, others update these addresses. Here you can make a small performance optimization: if the object does not move, you can save a couple of records in a bunch:

class EpsilonAdjustPointersOopClosure : public BasicOopIterateClosure {
private:
  template 
  void do_oop_work(T* p) {
    // p - указатель на адрес в памяти, где расположен oop.
    // нужно загрузить оттуда значение и распаковать сжатую ссылку, если нужно:
    T o = RawAccess<>::oop_load(p);
    if (!CompressedOops::is_null(o)) {
      oop obj = CompressedOops::decode_not_null(o);
      // Перезаписать текущий указатель на объект на его новое положение.
      // Пропустить запись, если обновление не требуется.
      if (obj->is_forwarded()) {
        oop fwd = obj->forwardee();
        assert(fwd != NULL, "just checking");
        RawAccess<>::oop_store(p, fwd);
      }
    }
  }
};
class EpsilonAdjustPointersObjectClosure : public ObjectClosure {
private:
  EpsilonAdjustPointersOopClosure _cl;
public:
  void do_object(oop obj) {
    //Выполнить обновление для всех ссылок, достижимых из текущего объекта:
    obj->oop_iterate(&_cl);
  }
};

After completing this step, we essentially broke the heap: links point to the “wrong” addresses at which objects do not yet lie. Let's fix it!

3.5. We move objects

Time to move objects to new addresses, in accordance with the algorithm:

Again we go EpsilonMoveObjectsObjectClosurearound the heaps and apply the closure to all living objects:

{
  GCTraceTime(Info, gc) time("Step 4: Move objects", NULL);
  // Передвинуть все живые объекты на новые адреса. 
  // Все ссылки уже переписаны на эти адреса в предыдущем шаге.
  EpsilonMoveObjectsObjectClosure cl;
  walk_bitmap(&cl);
  stat_moved = cl.moved();
  // Так как все объекты уже передвинуты по новым адресам, можно урезать
  // «верх» кучи ровно до конца уплотнённого префикса.
  _space->set_top(new_top);
}

Immediately after that, you can drag the heap of the compaction point heap, making it possible to allocate memory directly from this place, immediately after the garbage collection cycle ends.

Note that in the shifting assembly we can overwrite the contents of existing objects, but since the scanning goes in the same direction, the overwritten objects are already copied to the right place.

The old and new locations of the same facility may intersect. For example, if you shift a 100-byte object by 8 bytes. The copying procedure should work it out by itself, and the intersecting content should be copied correctly, pay attention to Copy::aligned_*conjoint*_words.

The closure itself will simply move the moved objects to the new addresses:

class EpsilonMoveObjectsObjectClosure : public ObjectClosure {
public:
  void do_object(oop obj) {
    // Копируем объект по новому адресу, если нужно. Этот шаг - последний,
    // поэтому необходимо ре-инициализировать его mark word, 
    // и выбросить из него всю forwarding data.
    if (obj->is_forwarded()) {
      oop fwd = obj->forwardee();
      assert(fwd != NULL, "just checking");
      Copy::aligned_conjoint_words((HeapWord*)obj, (HeapWord*)fwd, obj->size());
      fwd->init_mark_raw();
    }
  }
};

3.6. Epilogue

Garbage collection is finished, the heap is again almost consistent, the last finishing touches are left:

{
  GCTraceTime(Info, gc) time("Step 5: Epilogue", NULL);
  // Восстановить специальные маркерные слова.
  preserved_marks.restore();
  // Сообщить остальному рантайму, что мы завершили сборку.
  DerivedPointerTable::update_pointers();
  BiasedLocking::restore_marks();
  CodeCache::gc_epilogue();
  JvmtiExport::gc_epilogue();
  // Карта маркировки больше не нужна.
  if (!os::uncommit_memory((char*)_bitmap_region.start(), _bitmap_region.byte_size())) {
    log_warning(gc)("Could not uncommit native memory for marking bitmap");
  }
  // Вернуть назад всю память, если нужно.
  // На больших хипах это может занять кучу времени.
  if (EpsilonUncommit) {
    _virtual_space.shrink_by((_space->end() - new_top) * HeapWordSize);
    _space->set_end((HeapWord*)_virtual_space.high());
  }
}

We inform the rest of the runtime that they should start post-assembly procedures. We restore the special marker words that we saved earlier. Farewell kiss to our marker card - it is no longer needed.

And, if you really want to, you can reduce the memory for allocations to a new size, thereby returning the memory to the operating system!

4. Connect the GC to the VM

4.1. Root Traversal

Remember, you need to bypass special, reachable links from VM? You can ask each special VM subsystem to bypass links hidden from other Java objects. An exhaustive list of such root elements in the current Hotspot looks something like this:

void EpsilonHeap::do_roots(OopClosure* cl) {
  // Нужно сказать рантайму, что мы собираемся обходить корневые элементы в 1 потоке.
  StrongRootsScope scope(1);
  // Нужно применить замыкание к нескольким специальным видам корневых элементов.
  CLDToOopClosure clds(cl, ClassLoaderData::_claim_none);
  MarkingCodeBlobClosure blobs(cl, CodeBlobToOopClosure::FixRelocations);
  // Обходим всевозможные части корневых элементов рантайма.
  // Некоторые элементы требуют держать блокировку в момент обхода.
  {
    MutexLockerEx lock(CodeCache_lock, Mutex::_no_safepoint_check_flag);
    CodeCache::blobs_do(&blobs);
  }
  {
    MutexLockerEx lock(ClassLoaderDataGraph_lock);
    ClassLoaderDataGraph::cld_do(&clds);
  }
  Universe::oops_do(cl);
  Management::oops_do(cl);
  JvmtiExport::oops_do(cl);
  JNIHandles::oops_do(cl);
  WeakProcessor::oops_do(cl);
  ObjectSynchronizer::oops_do(cl);
  SystemDictionary::oops_do(cl);
  Threads::possibly_parallel_oops_do(false, cl, &blobs);
}

There are extensions that can bypass root elements competitively or in parallel. For our single-threaded GC, a simple workaround is enough.

4.2. Safepoints and stopping the world

Since our GC works in the mode of stopping the world, we need to ask VM to execute this very pause of stopping the world. In Hotspot, this is achieved by implementing a new VM_Operationone that will call our GC code and ask for a special VM thread to execute it:

// VM_operation, выполняющая цикл сборки под сейфпоинтом
class VM_EpsilonCollect: public VM_Operation {
private:
  const GCCause::Cause _cause;
  EpsilonHeap* const _heap;
  static size_t _last_used;
public:
  VM_EpsilonCollect(GCCause::Cause cause) : VM_Operation(),
                                            _cause(cause),
                                            _heap(EpsilonHeap::heap()) {};
  VM_Operation::VMOp_Type type() const { return VMOp_EpsilonCollect; }
  const char* name()             const { return "Epsilon Collection"; }
  virtual bool doit_prologue() {
    // Перед тем как управлять хранилищем, нужно взять блокировку на кучу.
    // Это естественным образом приводит к сериализации запросов к GC,
    // позволяя объединять запросы на обработку кончившейся памяти из нескольких потоков.
    // Можно проигнорировать запросы, до которого не прошло аллокаций со времен прошлой
    // полной сборки. Если перед началом сборки подождать,
    // пока куча заполнится хотя бы на 1%, это, скорей всего, 
    // решит большинство проблем с гонками.
    Heap_lock->lock();
    size_t used = _heap->used();
    size_t capacity = _heap->capacity();
    size_t allocated = used > _last_used ? used - _last_used : 0;
    if (_cause != GCCause::_allocation_failure || allocated > capacity / 100) {
      return true;
    } else {
      Heap_lock->unlock();
      return false;
    }
  }
  virtual void doit() {
    _heap->entry_collect(_cause);
  }
  virtual void doit_epilogue() {
    _last_used = _heap->used();
    Heap_lock->unlock();
  }
};
size_t VM_EpsilonCollect::_last_used = 0;
void EpsilonHeap::vmentry_collect(GCCause::Cause cause) {
  VM_EpsilonCollect vmop(cause);
  VMThread::execute(&vmop);
}

And it will also help to resolve several races when all threads simultaneously want to make GC - which usually happens when the memory comes to an end.

4.3. Memory Allocation Errors

It's good that we learned how to make a GC on demand, but it's even better that the GC reacts to heap overflows when there is no more memory left. In fact, it’s enough to replace most of the calls allocate_workwith this convenient wrapper that starts the GC at the time of the memory allocation error:

HeapWord* EpsilonHeap::allocate_or_collect_work(size_t size) {
  HeapWord* res = allocate_work(size);
  if (res == NULL && EpsilonSlidingGC) {
    vmentry_collect(GCCause::_allocation_failure);
    res = allocate_work(size);
  }
  return res;
}

That's all!

5. Assembly

Our patch should roll over the fresh OpenJDK repository without any problems.

$ hg clone https://hg.openjdk.java.net/jdk/jdk/ jdk-jdk
$ cd jdk-jdk
$ curl https://shipilev.net/jvm/diy-gc/webrev/jdk-jdk-epsilon.changeset | patch -p1

Well, then we build OpenJDK as usual:

$ ./configure --with-debug-level=fastdebug
$ make images

We also start in the usual way:

$ build/linux-x86_64-server-fastdebug/images/jdk/bin/java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -XX:+EpsilonSlidingGC -version
openjdk version "13-internal" 2019-09-17
OpenJDK Runtime Environment (build 13-internal+0-adhoc.shade.jdk-jdk-epsilon)
OpenJDK 64-Bit Server VM (build 13-internal+0-adhoc.shade.jdk-jdk-epsilon, mixed mode, sharing

6. Testing

How to verify that our GC implementation is not broken? There are several useful tools:

Assertions. A bunch of assertions. The Hotspot code already has a lot of assertions, so running the JVM in fastdebug mode usually shows a bunch of interesting bugs here and here, including those related to the GC crash.
Internal checks. Our patch implements the last step in the build cycle, which goes through all living objects and checks their correctness. Usually, in this way you can stumble upon the most egregious errors (a broken heap) even before the runtime or application sees them at the end of the build cycle.
Tests. Assets and checks are useless if the code where they are written does not start at all. It’s useful to have a fair amount of unit tests and integration tests, and run them as soon as possible and more often.

For example, here's how to verify that our patch did not fall apart in a monstrous way:

$ CONF=linux-x86_64-server-fastdebug make images run-test TEST=gc/epsilon/
Building targets 'images run-test' in configuration 'linux-x86_64-server-fastdebug'
Test selection 'gc/epsilon/', will run:
* jtreg:test/hotspot/jtreg/gc/epsilon
Running test 'jtreg:test/hotspot/jtreg/gc/epsilon'
Passed: gc/epsilon/TestAlwaysPretouch.java
Passed: gc/epsilon/TestAlignment.java
Passed: gc/epsilon/TestElasticTLAB.java
Passed: gc/epsilon/TestEpsilonEnabled.java
Passed: gc/epsilon/TestHelloWorld.java
Passed: gc/epsilon/TestLogTrace.java
Passed: gc/epsilon/TestDieDefault.java
Passed: gc/epsilon/TestDieWithOnError.java
Passed: gc/epsilon/TestMemoryPools.java
Passed: gc/epsilon/TestMaxTLAB.java
Passed: gc/epsilon/TestPrintHeapSteps.java
Passed: gc/epsilon/TestArraycopyCheckcast.java
Passed: gc/epsilon/TestClasses.java
Passed: gc/epsilon/TestUpdateCountersSteps.java
Passed: gc/epsilon/TestDieWithHeapDump.java
Passed: gc/epsilon/TestByteArrays.java
Passed: gc/epsilon/TestManyThreads.java
Passed: gc/epsilon/TestRefArrays.java
Passed: gc/epsilon/TestObjects.java
Passed: gc/epsilon/TestElasticTLABDecay.java
Passed: gc/epsilon/TestSlidingGC.java
Test results: passed: 21
TEST SUCCESS

Are you satisfied? Now try to run the real application on the fastdebug assembly with verification enabled. Didn't break? You can already hope for something.

7. Performance

Let's run spring-petclinic , download Apache Bench and plug in our toy GC! There will be very little live data in this test, so it doesn’t matter if there are generations in our GC.

It is necessary to run with the parameters -Xlog:gc -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -XX:+EpsilonSlidingGC::

Exhaust:

Heap: 20480M reserved, 20480M (100.00%) committed, 19497M (95.20%) used
GC(2) Step 0: Prologue 2.085ms
GC(2) Step 1: Mark 51.005ms
GC(2) Step 2: Calculate new locations 71.207ms
GC(2) Step 3: Adjust pointers 49.671ms
GC(2) Step 4: Move objects 22.839ms
GC(2) Step 5: Epilogue 1.008ms
GC(2) GC Stats: 70561 (8.63%) reachable from roots, 746676 (91.37%) reachable from heap, 91055 (11.14%) moved, 2237 (0.27%) markwords preserved
GC(2) Heap: 20480M reserved, 20480M (100.00%) committed, 37056K (0.18%) used
GC(2) Lisp2-style Mark-Compact (Allocation Failure) 20479M->36M(20480M) 197.940ms

200 milliseconds? Not bad for a single-threaded single-threaded GC! As you can see, all four main phases are performed in the same order. In fact, if you play with a different heap filling and its size, you will notice a pattern: an increase in the amount of live data means a significant slowdown in assembly (to tap through all living objects consistently is a sad procedure if there are really a lot of them). Increasing the size of the heap slows down the assembly only slightly (long-distance running, even over the allowed heap, has its price and its contribution to streaming performance).

For comparison, GC with generations or scavengers can easily cope with such a load. For example, let's run it -Xlog:gc -XX:+UseSerialGC- it collects mainly the young generation:

GC(46) Pause Young (Allocation Failure) 575M->39M(1943M) 2.603ms
GC(47) Pause Young (Allocation Failure) 575M->39M(1943M) 2.606ms
GC(48) Pause Young (Allocation Failure) 575M->39M(1943M) 2.747ms
GC(49) Pause Young (Allocation Failure) 575M->39M(1943M) 2.578ms

Wow, 2 milliseconds. This is because most of the objects in the young generation are dead, and such a GC has nothing to do here. If we turn off the -Xlog:gc -XX:+UseSerialGCextensions that are responsible for generations and run the full assembly, we will see a much less rosy picture:

GC(3) Pause Full (Allocation Failure) 16385M->34M(18432M) 1969.694ms
GC(4) Pause Full (Allocation Failure) 16385M->34M(18432M) 2261.405ms
GC(5) Pause Full (Allocation Failure) 16385M->34M(18432M) 2327.577ms
GC(6) Pause Full (Allocation Failure) 16385M->34M(18432M) 2328.976ms

There are tons of metrics and scripts to play with. Think of it as homework.

8. What next?

Then you can go in different ways. Many of these paths will build what GC from OpenJDK already knows - and what is already well tested, and therefore one should relate to such improvements in an educational way.

What can be improved:

Implement weak link processing. The current implementation ignores the fact that there are soft / weak / phantom links. She also ignores the existence of finalizable objects. This is not ideal in performance, but it is rather safe from the point of view of correctness, since the general code “only” considers all such links to be always reachable, therefore they will move and update in the same way as any others.
From the point of view of the GC, java.lang.ref.Reference.referentthis is just another Java field, always available, unless we go around the heap in some special way. Finalized objects have their own synthetic wrappers FinalReferencethat hold onto them.
Correct implementation should include connecting a common ReferenceProcessormarking code and marking / cleaning all surviving / dead links after the end of this marking.
Implement class unloading and other ways to clean VM. The current implementation never unloads classes and never cleans up data structures inside VMs that hold objects inaccessible from the heap, and therefore there is certainly something to clean up there. Realizing all this will require taking care of the weak and strong root elements. By default, only strong root elements are marked , and subsequently, if at the end of the labeling any of the weak elements are left unmarked, they can be demolished.
Add parallelism. The simplest way to make a parallel version is to divide the heap into regions corresponding to individual GC threads and do the same sequential compaction inside these regions. As a result, holes will appear between regions, so the memory allocation code needs to be modified so that it knows about the presence of many regions.
Parallel versions of this mark-compact GC are implemented as Full GC fallbacks in Shenandoah (starting with OpenJDK 8) and G1 (starting with OpenJDK 10, immediately after the release of JEP 307: “Parallel Full GC for G1” ).
Implement dense prefix processing . It often happens that after several assemblies in the heap a "sedimentary" layer is formed of always accessible objects, so you can save a couple of cycles by indicating that some kind of heap prefix does not move at all. Then you can avoid calculating addresses and moving objects to this zone. However, we still need to label it and correct the pointers there.
Extend a dense prefix to a full assembly with generations. By adding a bit of work with barriers, you can determine which part of the prefix is most interesting, and thereby save time on its marking and substitution of pointers. In the end, all this will end with the invention of “generations” - we will assemble the “young” generation immediately after the prefix and sometimes make the “complete” assembly to compact the prefix itself too.
Take some algorithm from the GC Handbook and try to implement it yourself.

9. Conclusions

What conclusions can be drawn from this exercise? Implementing a toy GC is fun, informative, and perhaps even a good idea for a university GC course.

Producing such an implementation will be a tedious and time-consuming process, so it’s easier to take some ready-made GC and configure it. It is noteworthy that if we continue to develop further, this implementation will eventually turn into one of the existing GCs (for example, Serial GC or Parallel GC), which makes such efforts a little less than completely useless.

Minute of advertising. In a month, April 5-6, 2019, JPoint will be held - the largest Java conference in Russia. The program expects a lot of reports on the details of the work of modern technologies - OpenJDK, GraalVM, Kotlin and others. You can familiarize yourself with the program and purchase tickets on the official website .

Tags: