Accelerating the creation of ConcurrentReferenceHashMap
Greetings! In this article I will show how to speed up the creation of org.springframework.util.ConcurrentReferenceHashMap with little effort.
Interested in improving performance? Welcome!
Reconnaissance
We begin, of course, with measurements, trying to understand exactly what we are going to improve. For this we take JMH 1.21, JDK 8 and JDK 11, as well as async-profiler.
To find out how long it takes to create an empty map, let's set up a simple experiment:
@Benchmark
public Object original() {
    return new ConcurrentReferenceHashMap();
}
The profile looks like this:
55.21% 2429743 o.s.u.ConcurrentReferenceHashMap.calculateShift
20.30% 891404 o.s.u.ConcurrentReferenceHashMap$Segment.<init>
8.79% 387198 o.s.u.ConcurrentReferenceHashMap.<init>
3.35% 147651 java.util.concurrent.locks.ReentrantLock.<init>
2.34% 102804 java.lang.ref.ReferenceQueue.<init>
1.61% 70748 o.s.u.ConcurrentReferenceHashMap.createReferenceManager
1.53% 67265 o.s.u.ConcurrentReferenceHashMap$Segment.createReferenceArray
0.78% 34493 java.lang.ref.ReferenceQueue$Lock.<init>
0.76% 33546 o.s.u.ConcurrentReferenceHashMap$ReferenceManager.<init>
0.36% 15948 o.s.u.Assert.isTrue
The direction is clear; we can proceed.
Maths
So, we spend the lion's share of the time in the calculateShift method. Here it is:
protected static int calculateShift(int minimumValue, int maximumValue) {
    int shift = 0;
    int value = 1;
    while (value < minimumValue && value < maximumValue) {
        value <<= 1;
        shift++;
    }
    return shift;
}
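To get a feel for what the method computes, here is a small self-contained sketch (the maximum value 1 << 16 is my assumption, chosen only for illustration): calculateShift returns the number of left shifts needed to reach the first power of two not less than minimumValue, capped by maximumValue.

```java
// Self-contained sketch of calculateShift's behavior: it finds the smallest
// shift such that (1 << shift) >= minimumValue, bounded by maximumValue.
public class CalculateShiftDemo {
    static int calculateShift(int minimumValue, int maximumValue) {
        int shift = 0;
        int value = 1;
        while (value < minimumValue && value < maximumValue) {
            value <<= 1;
            shift++;
        }
        return shift;
    }

    public static void main(String[] args) {
        System.out.println(calculateShift(16, 1 << 16)); // 4, since 1 << 4 == 16
        System.out.println(calculateShift(17, 1 << 16)); // 5, rounds up to 32
        System.out.println(calculateShift(1, 1 << 16));  // 0, 1 is already enough
    }
}
```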
It's hard to come up with anything new here, so let's look at how the method is used:
public ConcurrentReferenceHashMap(/*...*/ int concurrencyLevel, /*...*/) {
    // ...
    this.shift = calculateShift(concurrencyLevel, MAXIMUM_CONCURRENCY_LEVEL);
    // ...
}

// ConcurrentReferenceHashMap$Segment
public Segment(int initialCapacity) {
    this.referenceManager = createReferenceManager();
    this.initialSize = 1 << calculateShift(initialCapacity, MAXIMUM_SEGMENT_SIZE);
    this.references = createReferenceArray(this.initialSize);
    this.resizeThreshold = (int) (this.references.length * getLoadFactor());
}
Note how the Segment constructor is called:
int roundedUpSegmentCapacity = (int) ((initialCapacity + size - 1L) / size);
// ...
for (int i = 0; i < this.segments.length; i++) {
    this.segments[i] = new Segment(roundedUpSegmentCapacity);
}
The value of roundedUpSegmentCapacity is constant across loop iterations, so the expression 1 << calculateShift(initialCapacity, MAXIMUM_SEGMENT_SIZE) evaluated in the Segment constructor is also always constant. Thus, we can move that expression out of the constructor and the loop.
The same statement holds for the expression (int) (this.references.length * getLoadFactor()), since the references array is created from the initialCapacity variable and its size is the same for every segment. This expression, too, can be hoisted out of the constructor and the loop.
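The hoisted version might look like the following sketch (the names and the value of MAXIMUM_SEGMENT_SIZE are my assumptions for illustration, not the actual Spring patch): the per-segment size and resize threshold are computed once, before the loop, and every segment would receive the ready-made values.

```java
// Sketch: compute loop-invariant values once, outside the constructor
// and the loop, instead of recomputing them for every Segment.
public class HoistingSketch {
    static final int MAXIMUM_SEGMENT_SIZE = 1 << 30; // assumption for illustration

    static int calculateShift(int minimumValue, int maximumValue) {
        int shift = 0;
        int value = 1;
        while (value < minimumValue && value < maximumValue) {
            value <<= 1;
            shift++;
        }
        return shift;
    }

    public static void main(String[] args) {
        int roundedUpSegmentCapacity = 11; // same for every segment
        float loadFactor = 0.75f;

        // Invariant expressions hoisted out of the constructor and the loop:
        int initialSize = 1 << calculateShift(roundedUpSegmentCapacity, MAXIMUM_SEGMENT_SIZE);
        int resizeThreshold = (int) (initialSize * loadFactor);

        // Each segment would now receive these precomputed values.
        System.out.println(initialSize);     // 16
        System.out.println(resizeThreshold); // 12
    }
}
```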
Arrays
Consider the createReferenceArray method:

private Reference<K, V>[] createReferenceArray(int size) {
    return (Reference<K, V>[]) Array.newInstance(Reference.class, size);
}
Using Array::newInstance here is clearly redundant; nothing prevents us from creating the array with a plain constructor:

private Reference<K, V>[] createReferenceArray(int size) {
    return new Reference[size];
}
The constructor's performance is on par with Array::newInstance when compiled by C2, but considerably exceeds it for small arrays under C1 (flag -XX:TieredStopAtLevel=1) and in the interpreter (flag -Xint):
//C2
length Mode Cnt Score Error Units
constructor 10 avgt 50 5,6 ± 0,0 ns/op
constructor 100 avgt 50 29,7 ± 0,1 ns/op
constructor 1000 avgt 50 242,7 ± 1,3 ns/op
newInstance 10 avgt 50 5,5 ± 0,0 ns/op
newInstance 100 avgt 50 29,7 ± 0,1 ns/op
newInstance 1000 avgt 50 249,3 ± 9,6 ns/op
//C1
length Mode Cnt Score Error Units
constructor 10 avgt 50 6,8 ± 0,1 ns/op
constructor 100 avgt 50 36,3 ± 0,6 ns/op
constructor 1000 avgt 50 358,6 ± 6,4 ns/op
newInstance 10 avgt 50 91,0 ± 2,4 ns/op
newInstance 100 avgt 50 127,2 ± 1,8 ns/op
newInstance 1000 avgt 50 322,8 ± 7,2 ns/op
//-Xint
length Mode Cnt Score Error Units
constructor 10 avgt 50 126,3 ± 5,9 ns/op
constructor 100 avgt 50 154,7 ± 2,6 ns/op
constructor 1000 avgt 50 364,2 ± 6,2 ns/op
newInstance 10 avgt 50 251,2 ± 11,3 ns/op
newInstance 100 avgt 50 287,5 ± 11,4 ns/op
newInstance 1000 avgt 50 486,5 ± 8,5 ns/op
The replacement will not show up in our benchmark, but it speeds up the code at application start-up, before C2 has kicked in. More about this mode will be said at the end of the article.
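As a sanity check, the two allocation forms are behaviorally equivalent; both produce an array of the same runtime class and size (a tiny demo using Integer instead of Reference, since the latter is internal to the map):

```java
import java.lang.reflect.Array;

// Both forms allocate an array of the same runtime type;
// only the allocation path (reflective vs. direct) differs.
public class ArrayAllocDemo {
    public static void main(String[] args) {
        Integer[] viaReflection = (Integer[]) Array.newInstance(Integer.class, 10);
        Integer[] viaConstructor = new Integer[10];

        System.out.println(viaReflection.getClass() == viaConstructor.getClass()); // true
        System.out.println(viaReflection.length == viaConstructor.length);         // true
    }
}
```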
Key details
Let's go back to the ConcurrentReferenceHashMap constructor:

ConcurrentReferenceHashMap(/*...*/) {
    Assert.isTrue(initialCapacity >= 0, "Initial capacity must not be negative");
    Assert.isTrue(loadFactor > 0f, "Load factor must be positive");
    Assert.isTrue(concurrencyLevel > 0, "Concurrency level must be positive");
    Assert.notNull(referenceType, "Reference type must not be null");
    this.loadFactor = loadFactor;
    this.shift = calculateShift(concurrencyLevel, MAXIMUM_CONCURRENCY_LEVEL);
    int size = 1 << this.shift;
    this.referenceType = referenceType;
    int roundedUpSegmentCapacity = (int) ((initialCapacity + size - 1L) / size);
    this.segments = (Segment[]) Array.newInstance(Segment.class, size);
    for (int i = 0; i < this.segments.length; i++) {
        this.segments[i] = new Segment(roundedUpSegmentCapacity);
    }
}
What is curious for us here: replacing Array.newInstance with a constructor call would lead to a compilation error (generic array creation), so we pass it by. The loop, however, is very interesting, more precisely the access to the segments field. To see how destructive (at times) such access can be for performance, I recommend Nitsan Wakart's article The volatile read suprise.
The case described in that article, I believe, matches the code in question. Let's focus on the segments:
this.segments = (Segment[]) Array.newInstance(Segment.class, size);
for (int i = 0; i < this.segments.length; i++) {
    this.segments[i] = new Segment(roundedUpSegmentCapacity);
}
Immediately after the array is created, it is written to the ConcurrentReferenceHashMap.segments field, and it is this field that the loop then works with. Inside the Segment constructor there is a write to the volatile field references:
private volatile Reference<K, V>[] references;

public Segment(int initialCapacity) {
    // ...
    this.references = createReferenceArray(this.initialSize);
    // ...
}
This means that access to the segments field cannot be optimized; in other words, its contents are re-read on every iteration of the loop. How can we verify this claim? The easiest way is to copy the code into a separate package and remove the volatile modifier from the Segment.references declaration:
protected final class Segment extends ReentrantLock {
    // before
    // private volatile Reference<K, V>[] references;
    // after
    private Reference<K, V>[] references;
}
Check if something has changed:
@Benchmark
public Object original() {
    return new tsypanov.map.original.ConcurrentReferenceHashMap();
}

@Benchmark
public Object nonVolatileSegmentReferences() {
    return new tsypanov.map.nonvolatile.ConcurrentReferenceHashMap();
}
We see a significant performance gain (JDK 8):
Benchmark Mode Cnt Score Error Units
original avgt 100 732,1 ± 15,8 ns/op
nonVolatileSegmentReferences avgt 100 610,6 ± 15,4 ns/op
On JDK 11, the elapsed time has decreased, but the relative gap has not changed much:
Benchmark Mode Cnt Score Error Units
original avgt 100 473,8 ± 11,2 ns/op
nonVolatileSegmentReferences avgt 100 401,9 ± 15,5 ns/op
Of course, the volatile modifier has to be put back, so we need another approach. The bottleneck has been found: it is the field access. That being so, we can fill a local segments variable with the array and only then write it to the field:
Segment[] segments = (Segment[]) Array.newInstance(Segment.class, size);
for (int i = 0; i < segments.length; i++) {
    segments[i] = new Segment(roundedUpSegmentCapacity);
}
this.segments = segments;
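The same "fill locally, publish once" pattern can be shown in isolation (a minimal runnable sketch with made-up names; the slow variant mirrors the original constructor, the fast one the fix):

```java
// Sketch: a volatile field is written once, after the array has been fully
// populated through a local variable, so the loop touches no volatile state.
public class PublishOnceDemo {
    volatile String[] slots;

    void initSlow(int size) {
        this.slots = new String[size];
        for (int i = 0; i < this.slots.length; i++) { // re-reads the volatile field each iteration
            this.slots[i] = "slot-" + i;
        }
    }

    void initFast(int size) {
        String[] local = new String[size];
        for (int i = 0; i < local.length; i++) {      // plain local variable, no volatile reads
            local[i] = "slot-" + i;
        }
        this.slots = local;                           // a single volatile write publishes the array
    }

    public static void main(String[] args) {
        PublishOnceDemo demo = new PublishOnceDemo();
        demo.initFast(3);
        System.out.println(java.util.Arrays.toString(demo.slots)); // [slot-0, slot-1, slot-2]
    }
}
```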
As a result, even such simple improvements yield quite a good gain:
JDK 8
Benchmark Mode Cnt Score Error Units
originalConcurrentReferenceHashMap avgt 100 712,1 ± 7,2 ns/op
patchedConcurrentReferenceHashMap avgt 100 496,5 ± 4,6 ns/op
JDK 11
Benchmark Mode Cnt Score Error Units
originalConcurrentReferenceHashMap avgt 100 536,0 ± 8,4 ns/op
patchedConcurrentReferenceHashMap avgt 100 486,4 ± 9,3 ns/op
What replacing 'Array::newInstance' with 'new T[]' gives us
When launching Spring Boot applications from IDEA, developers often set the 'Enable launch optimizations' flag, which adds -XX:TieredStopAtLevel=1 -noverify
to the VM arguments and speeds up startup by disabling profiling and C2. Let's measure with these arguments:
// JDK 8 -XX:TieredStopAtLevel=1 -noverify
Benchmark Mode Cnt Score Error Units
originalConcurrentReferenceHashMap avgt 100 1920,9 ± 24,2 ns/op
patchedConcurrentReferenceHashMap avgt 100 592,0 ± 25,4 ns/op
// JDK 11 -XX:TieredStopAtLevel=1 -noverify
Benchmark Mode Cnt Score Error Units
originalConcurrentReferenceHashMap avgt 100 1838,9 ± 8,0 ns/op
patchedConcurrentReferenceHashMap avgt 100 549,7 ± 6,7 ns/op
A more than threefold speed-up!
What is it for?
In particular, this is needed to speed up queries returning projections in Spring Data JPA.
A JMC profile shows that creating ConcurrentReferenceHashMap takes almost a fifth of the time spent executing a query like
public interface SimpleEntityRepository extends JpaRepository<SimpleEntity, Long> {
    List<HasIdAndName> findAllByName(String name);
}
where HasIdAndName is a projection interface:

public interface HasIdAndName {
    int getId();
    String getName();
}
ConcurrentReferenceHashMap is also used a few dozen other times in Spring's code, so the improvement will certainly not be superfluous.
Conclusions
- improving performance is not as hard as it seems at first glance
- volatile access in the vicinity of a loop is a potential bottleneck
- look for invariants and hoist them out of loops
What to read
Changes:
https://github.com/spring-projects/spring-framework/pull/1873
https://github.com/spring-projects/spring-framework/pull/2051