Accelerating the creation of ConcurrentReferenceHashMap

    Greetings! In this article I will show how to speed up the creation of org.springframework.util.ConcurrentReferenceHashMap with minor effort.


    Interested in squeezing out performance? Welcome!


    Reconnaissance


    We begin, of course, with measurements, trying to understand exactly what we are going to improve. For this we take JMH 1.21, JDK 8 and JDK 11, as well as async-profiler.


    To find out how long it takes to create an empty map, let's set up a simple experiment:


    @Benchmark
    public Object original() {
      return new ConcurrentReferenceHashMap();
    }

    The profile looks like this:


    55.21%  2429743  o.s.u.ConcurrentReferenceHashMap.calculateShift
    20.30%   891404  o.s.u.ConcurrentReferenceHashMap$Segment.<init>
     8.79%   387198  o.s.u.ConcurrentReferenceHashMap.<init>
     3.35%   147651  java.util.concurrent.locks.ReentrantLock.<init>
     2.34%   102804  java.lang.ref.ReferenceQueue.<init>
     1.61%    70748  o.s.u.ConcurrentReferenceHashMap.createReferenceManager
     1.53%    67265  o.s.u.ConcurrentReferenceHashMap$Segment.createReferenceArray
     0.78%    34493  java.lang.ref.ReferenceQueue$Lock.<init>
     0.76%    33546  o.s.u.ConcurrentReferenceHashMap$ReferenceManager.<init>
     0.36%    15948  o.s.u.Assert.isTrue

    The direction is clear, so let's proceed.


    Maths


    So, we spend the lion's share of the time in the method calculateShift. Here it is:


    protected static int calculateShift(int minimumValue, int maximumValue) {
      int shift = 0;
      int value = 1;
      while (value < minimumValue && value < maximumValue) {
        value <<= 1;
        shift++;
      }
      return shift;
    }
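To make the method's behavior concrete: it returns the number of left shifts needed to reach the smallest power of two not below minimumValue, capped by maximumValue. A small self-contained demo (the method body is copied verbatim from the listing above; the class name is mine):

```java
public class CalculateShiftDemo {

    // copied verbatim from ConcurrentReferenceHashMap (access modifier relaxed)
    static int calculateShift(int minimumValue, int maximumValue) {
        int shift = 0;
        int value = 1;
        while (value < minimumValue && value < maximumValue) {
            value <<= 1;
            shift++;
        }
        return shift;
    }

    public static void main(String[] args) {
        // 16 is already a power of two: shift = 4, since 1 << 4 == 16
        System.out.println(calculateShift(16, 1 << 16));          // prints 4
        // 17 is rounded up to the next power of two: 1 << 5 == 32
        System.out.println(calculateShift(17, 1 << 16));          // prints 5
        // the second argument acts as a cap: stops at 1 << 3 == 8
        System.out.println(calculateShift(Integer.MAX_VALUE, 8)); // prints 3
    }
}
```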

    It’s hard to come up with anything better here, so let's look at how the method is used instead:


    public ConcurrentReferenceHashMap(/*...*/ int concurrencyLevel, /*...*/) {
      //...
      this.shift = calculateShift(concurrencyLevel, MAXIMUM_CONCURRENCY_LEVEL);
      //...
    }

    // ConcurrentReferenceHashMap$Segment
    public Segment(int initialCapacity) {
      this.referenceManager = createReferenceManager();
      this.initialSize = 1 << calculateShift(initialCapacity, MAXIMUM_SEGMENT_SIZE);
      this.references = createReferenceArray(this.initialSize);
      this.resizeThreshold = (int) (this.references.length * getLoadFactor());
    }

    Note how the Segment constructor is invoked:


    int roundedUpSegmentCapacity = (int) ((initialCapacity + size - 1L) / size);
    //...
    for (int i = 0; i < this.segments.length; i++) {
      this.segments[i] = new Segment(roundedUpSegmentCapacity);
    }

    The value of roundedUpSegmentCapacity is constant across the loop iterations, therefore the expression 1 << calculateShift(initialCapacity, MAXIMUM_SEGMENT_SIZE) executed in the Segment constructor will also always be constant. Thus we can hoist that expression out of the constructor and out of the loop.


    The same holds for the expression (int) (this.references.length * getLoadFactor()): the references array is created with a size derived from the loop-invariant initialCapacity, so its length is the same for every segment. We hoist this expression out of the constructor and the loop as well.
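The hoisting described above can be sketched as follows. This is a simplified, self-contained model, not the actual Spring patch (the real changes are in the pull requests linked at the end of the article); class and field names are reduced to the essentials:

```java
// Simplified sketch of hoisting the two loop-invariant computations
// (initial array size and resize threshold) out of the Segment constructor.
public class HoistingSketch {

    static final int MAXIMUM_SEGMENT_SIZE = 1 << 30;

    // same algorithm as ConcurrentReferenceHashMap.calculateShift
    static int calculateShift(int minimumValue, int maximumValue) {
        int shift = 0;
        int value = 1;
        while (value < minimumValue && value < maximumValue) {
            value <<= 1;
            shift++;
        }
        return shift;
    }

    static class Segment {
        final Object[] references;
        final int resizeThreshold;

        // before the patch each Segment recomputed calculateShift and the
        // threshold itself; now both values are computed once by the caller
        Segment(int initialSize, int resizeThreshold) {
            this.references = new Object[initialSize];
            this.resizeThreshold = resizeThreshold;
        }
    }

    static Segment[] createSegments(int initialCapacity, int segmentCount, float loadFactor) {
        int roundedUpSegmentCapacity = (int) ((initialCapacity + segmentCount - 1L) / segmentCount);
        // hoisted out of the constructor and the loop:
        int initialSize = 1 << calculateShift(roundedUpSegmentCapacity, MAXIMUM_SEGMENT_SIZE);
        int resizeThreshold = (int) (initialSize * loadFactor);
        Segment[] segments = new Segment[segmentCount];
        for (int i = 0; i < segments.length; i++) {
            segments[i] = new Segment(initialSize, resizeThreshold);
        }
        return segments;
    }

    public static void main(String[] args) {
        Segment[] segments = createSegments(64, 16, 0.75f);
        System.out.println(segments[0].references.length); // prints 4
        System.out.println(segments[0].resizeThreshold);   // prints 3
    }
}
```

Every segment still ends up with exactly the same size and threshold; the work is simply done once instead of `segmentCount` times.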


    Arrays


    Consider the method createReferenceArray:


    private Reference<K, V>[] createReferenceArray(int size) {
      return (Reference<K, V>[]) Array.newInstance(Reference.class, size);
    }

    The use of Array::newInstance is clearly redundant here; nothing prevents us from creating the array with a plain constructor:


    private Reference<K, V>[] createReferenceArray(int size) {
      return new Reference[size];
    }
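Both allocation paths produce an array with the same runtime class and length; the reflective call is just a slower way to get there on a not-yet-JIT-compiled path. A minimal illustration, where Reference is a simplified stand-in for Spring's inner interface of the same name:

```java
import java.lang.reflect.Array;

public class ArrayCreationDemo {

    interface Reference<K, V> {}  // stand-in, not Spring's actual type

    @SuppressWarnings("unchecked")
    static <K, V> Reference<K, V>[] viaNewInstance(int size) {
        // the original createReferenceArray: reflective allocation plus cast
        return (Reference<K, V>[]) Array.newInstance(Reference.class, size);
    }

    @SuppressWarnings("unchecked")
    static <K, V> Reference<K, V>[] viaConstructor(int size) {
        // the patched version: raw-typed allocation, unchecked but safe here
        return new Reference[size];
    }

    public static void main(String[] args) {
        Reference<String, Integer>[] a = viaNewInstance(8);
        Reference<String, Integer>[] b = viaConstructor(8);
        System.out.println(a.getClass() == b.getClass()); // prints true
        System.out.println(a.length == b.length);         // prints true
    }
}
```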

    At the C2 level the plain constructor performs no worse than the Array::newInstance call, but it considerably outperforms it for small arrays in C1 mode (-XX:TieredStopAtLevel=1) and in the interpreter (-Xint):


    //C2
                   length  Mode  Cnt    Score   Error  Units
    constructor        10  avgt   50      5,6   ± 0,0  ns/op
    constructor       100  avgt   50     29,7   ± 0,1  ns/op
    constructor      1000  avgt   50    242,7   ± 1,3  ns/op
    newInstance        10  avgt   50      5,5   ± 0,0  ns/op
    newInstance       100  avgt   50     29,7   ± 0,1  ns/op
    newInstance      1000  avgt   50    249,3   ± 9,6  ns/op
    //C1
                   length  Mode  Cnt    Score   Error  Units
    constructor        10  avgt   50      6,8   ± 0,1  ns/op
    constructor       100  avgt   50     36,3   ± 0,6  ns/op
    constructor      1000  avgt   50    358,6   ± 6,4  ns/op
    newInstance        10  avgt   50     91,0   ± 2,4  ns/op
    newInstance       100  avgt   50    127,2   ± 1,8  ns/op
    newInstance      1000  avgt   50    322,8   ± 7,2  ns/op
    //-Xint
                   length  Mode  Cnt    Score   Error  Units
    constructor        10  avgt   50    126,3  ±  5,9  ns/op
    constructor       100  avgt   50    154,7  ±  2,6  ns/op
    constructor      1000  avgt   50    364,2  ±  6,2  ns/op
    newInstance        10  avgt   50    251,2  ± 11,3  ns/op
    newInstance       100  avgt   50    287,5  ± 11,4  ns/op
    newInstance      1000  avgt   50    486,5  ±  8,5  ns/op

    This replacement will not show up in our benchmark, but it speeds up the code at application startup, while C2 has not yet kicked in. More on this mode at the end of the article.


    Important details


    Let’s go back to the ConcurrentReferenceHashMap constructor:


    ConcurrentReferenceHashMap(/*...*/) {
      Assert.isTrue(initialCapacity >= 0, "Initial capacity must not be negative");
      Assert.isTrue(loadFactor > 0f, "Load factor must be positive");
      Assert.isTrue(concurrencyLevel > 0, "Concurrency level must be positive");
      Assert.notNull(referenceType, "Reference type must not be null");
      this.loadFactor = loadFactor;
      this.shift = calculateShift(concurrencyLevel, MAXIMUM_CONCURRENCY_LEVEL);
      int size = 1 << this.shift;
      this.referenceType = referenceType;
      int roundedUpSegmentCapacity = (int) ((initialCapacity + size - 1L) / size);
      this.segments = (Segment[]) Array.newInstance(Segment.class, size);
      for (int i = 0; i < this.segments.length; i++) {
        this.segments[i] = new Segment(roundedUpSegmentCapacity);
      }
    }

    What is curious for us here: replacing Array.newInstance with a constructor would lead to a compilation error, so we pass it by. But the loop is very interesting, or more precisely the access to the segments field. To see how destructive (sometimes) such access can be for performance, I recommend Nitsan Wakart's article The volatile read surprise.


    The case described in that article, I believe, matches the code in question. Let's focus on segments:


    this.segments = (Segment[]) Array.newInstance(Segment.class, size);
    for (int i = 0; i < this.segments.length; i++) {
      this.segments[i] = new Segment(roundedUpSegmentCapacity);
    }

    Immediately after the array is created it is stored in the ConcurrentReferenceHashMap.segments field, and it is this field that the loop interacts with. Inside the Segment constructor there is a write to the volatile field references:


    private volatile Reference<K, V>[] references;

    public Segment(int initialCapacity) {
      //...
      this.references = createReferenceArray(this.initialSize);
      //...
    }

    This means the access to the segments field cannot be optimized away; in other words, its contents are re-read on every iteration of the loop. How can we check whether this claim is true? The easiest way is to copy the code into a separate package and remove volatile from the Segment.references declaration:


    protected final class Segment extends ReentrantLock {
      // before
      // private volatile Reference<K, V>[] references;
      // after
      private Reference<K, V>[] references;
    }

    Check if something has changed:


    @Benchmark
    public Object original() {
      return new tsypanov.map.original.ConcurrentReferenceHashMap();
    }

    @Benchmark
    public Object nonVolatileSegmentReferences() {
      return new tsypanov.map.nonvolatile.ConcurrentReferenceHashMap();
    }

    We find significant performance gains (JDK 8):


    Benchmark                      Mode  Cnt  Score   Error  Units
    original                       avgt  100  732,1  ± 15,8  ns/op
    nonVolatileSegmentReferences   avgt  100  610,6  ± 15,4  ns/op

    On JDK 11, the elapsed time has decreased, but the relative gap has not changed much:


    Benchmark                      Mode  Cnt  Score   Error  Units
    original                       avgt  100  473,8  ± 11,2  ns/op
    nonVolatileSegmentReferences   avgt  100  401,9  ± 15,5  ns/op
    

    Of course, volatile has to be put back, so we need another way. The bottleneck has been found: it is the access to the field. If so, we can create a local segments variable, fill the array, and only then write it to the field:


    Segment[] segments = (Segment[]) Array.newInstance(Segment.class, size);
    for (int i = 0; i < segments.length; i++) {
      segments[i] = new Segment(roundedUpSegmentCapacity);
    }
    this.segments = segments;

    As a result, even such simple improvements gave us quite a good gain:


    JDK 8


    Benchmark                           Mode  Cnt  Score   Error  Units
    originalConcurrentReferenceHashMap  avgt  100  712,1   ± 7,2  ns/op
    patchedConcurrentReferenceHashMap   avgt  100  496,5   ± 4,6  ns/op

    JDK 11


    Benchmark                           Mode  Cnt  Score    Error  Units
    originalConcurrentReferenceHashMap  avgt  100  536,0   ±  8,4  ns/op
    patchedConcurrentReferenceHashMap   avgt  100  486,4   ±  9,3  ns/op

    What does replacing Array::newInstance with new T[] give us?


    When launching Spring Boot applications from IDEA, developers often enable the 'Enable launch optimizations' flag, which adds -XX:TieredStopAtLevel=1 -noverify to the VM arguments, speeding up startup by disabling profiling and C2. Let's measure with these arguments:


    // JDK 8 -XX:TieredStopAtLevel=1 -noverify
    Benchmark                           Mode  Cnt   Score  Error  Units
    originalConcurrentReferenceHashMap  avgt  100  1920,9 ± 24,2  ns/op
    patchedConcurrentReferenceHashMap   avgt  100   592,0 ± 25,4  ns/op
    // JDK 11 -XX:TieredStopAtLevel=1 -noverify
    Benchmark                           Mode  Cnt   Score  Error  Units
    originalConcurrentReferenceHashMap  avgt  100  1838,9 ±  8,0  ns/op
    patchedConcurrentReferenceHashMap   avgt  100   549,7 ±  6,7  ns/op

    More than a threefold speedup!


    What is it for?


    In particular, this is needed to speed up queries that return projections in Spring Data JPA.



    The JMC profile shows that the creation of ConcurrentReferenceHashMap takes almost a fifth of the time spent executing a query like


    public interface SimpleEntityRepository extends JpaRepository<SimpleEntity, Long> {
      List<HasIdAndName> findAllByName(String name);
    }

    where HasIdAndName is the projection:


    public interface HasIdAndName {
      int getId();
      String getName();
    }

    Besides, ConcurrentReferenceHashMap is used a few dozen other times in Spring's code, so the improvement will certainly not be superfluous.


    Conclusions


    • improving performance is not as hard as it seems at first glance
    • volatile access in the vicinity of a loop is one of the possible bottlenecks
    • look for invariants and hoist them out of loops

    What to read


    Nitsan Wakart's article The volatile read surprise


    Example code


    Changes:
    https://github.com/spring-projects/spring-framework/pull/1873
    https://github.com/spring-projects/spring-framework/pull/2051

