Finding Strange Performance Causes
Introduction
I recently had occasion to study Java bytecode in detail, and almost immediately an interesting question came to mind. The bytecode has a NOP instruction that does nothing. So how does this "nothing" affect performance? This post describes the process of finding out.
Disclaimer
The story is, first of all, not about how things actually work, but about the kinds of errors you should watch out for when measuring performance.
Tools
Let's start with the main thing: how the measurements were carried out. The ASM library was used to generate the code, and JMH to build the benchmark itself.
In order not to use reflection, a small interface was created:
public interface Getter {
int get();
}
Next, a class implementing the get method was generated:
public get()I
NOP
...
NOP
LDC 20
IRETURN
You can insert an arbitrary number of NOPs.
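For reference, NOPs aside, the generated class behaves exactly like this hand-written Java counterpart (a hypothetical class shown only to make the bytecode concrete; the NOPs have no source-level equivalent):

```java
public class EquivalentDemo {
    interface Getter {
        int get();
    }

    // Hypothetical hand-written counterpart of the generated class:
    // same interface, same constant return value.
    static class GeneratedEquivalent implements Getter {
        @Override
        public int get() {
            return 20; // corresponds to the LDC 20 / IRETURN pair
        }
    }

    public static void main(String[] args) {
        Getter g = new GeneratedEquivalent();
        System.out.println(g.get()); // prints 20
    }
}
```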
Full generator code
public class SimpleGetterClassLoader extends ClassLoader {
    private static final String GENERATED_CLASS_NAME = "other.GeneratedClass";
    private static final ClassLoader myClassLoader = new SimpleGetterClassLoader();

    @SuppressWarnings("unchecked")
    public static Getter newInstanceWithNOPs(int nopCount) throws Exception {
        Class clazz = Class.forName(GENERATED_CLASS_NAME + "_" + nopCount, false, myClassLoader);
        return (Getter) clazz.newInstance();
    }

    @NotNull
    @Override
    protected Class findClass(@NotNull String name) throws ClassNotFoundException {
        if (!name.startsWith(GENERATED_CLASS_NAME))
            throw new ClassNotFoundException(name);
        int nopCount = Integer.parseInt(name.substring(GENERATED_CLASS_NAME.length() + 1));
        ClassWriter cw = new ClassWriter(0);
        cw.visit(V1_5, ACC_PUBLIC, name.replace('.', '/'), null, getInternalName(Object.class), new String[]{getInternalName(Getter.class)});
        {
            // Default constructor: invoke super() and return
            MethodVisitor mv = cw.visitMethod(ACC_PUBLIC, "<init>", "()V", null, null);
            mv.visitCode();
            mv.visitVarInsn(ALOAD, 0);
            mv.visitMethodInsn(INVOKESPECIAL, getInternalName(Object.class), "<init>", "()V");
            mv.visitInsn(RETURN);
            mv.visitMaxs(1, 1);
            mv.visitEnd();
        }
        {
            // get(): nopCount NOPs, then "return 20"
            MethodVisitor mv = cw.visitMethod(ACC_PUBLIC, "get", "()I", null, null);
            mv.visitCode();
            for (int i = 0; i < nopCount; i++) {
                mv.visitInsn(NOP);
            }
            mv.visitLdcInsn(20);
            mv.visitInsn(IRETURN);
            mv.visitMaxs(1, 1);
            mv.visitEnd();
        }
        cw.visitEnd();
        byte[] bytes = cw.toByteArray();
        return defineClass(name, bytes, 0, bytes.length);
    }
}
Benchmark
@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class Bench {
    private Getter nop_0;
    private Getter nop_10;
    ...

    @Setup
    public void setup() throws Exception {
        nop_0 = newInstanceWithNOPs(0);
        nop_10 = newInstanceWithNOPs(10);
        ...
    }

    @GenerateMicroBenchmark
    public int nop_0() {
        return nop_0.get();
    }

    @GenerateMicroBenchmark
    public int nop_10() {
        return nop_10.get();
    }

    ...
}
The search for truth
To start, two tests were run: without NOPs and with 2000 of them.
Benchmark Mode Samples Mean Mean error Units
b.Bench.nop_0 thrpt 5 838.753 48.962 ops/us
b.Bench.nop_2000 thrpt 5 298.428 7.965 ops/us
And I immediately jumped to a very bold conclusion: "The stupid JIT doesn't strip the NOPs out, but translates them into machine NOPs."
Question to experts:
If this were true, would the measurements look like this? Or would they show something completely different?
Still, this was only a hypothesis, and I very much wanted to test it. First I verified that these methods really are JIT-compiled, then I looked at what they compile to. Naturally, the assembly was completely identical. And then I realized I didn't understand something: the executable code is exactly the same, yet the performance differs by a factor of 2.5. Strange.
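For those repeating the experiment: the compilation check and the assembly dump can be done with standard HotSpot diagnostic flags (the flag names are real HotSpot options; the jar name is a placeholder for your JMH build, and PrintAssembly additionally requires the hsdis disassembler plugin):

```shell
# Watch which methods get JIT-compiled (prints method name, level, bytecode size)
java -XX:+PrintCompilation -jar benchmarks.jar

# Dump the generated machine code (needs the hsdis plugin on the JVM's library path)
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -jar benchmarks.jar
```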
Then I really wanted to see the shape of the dependence.
Benchmark Mode Samples Mean Mean error Units
b.Bench.nop_0 thrpt 5 813.010 71.510 ops/us
b.Bench.nop_2000 thrpt 5 302.589 12.360 ops/us
b.Bench.nop_10000 thrpt 5 0.268 0.017 ops/us
Hidden knowledge
This measurement is simply gorgeous: three points, each from a completely different regime.
It's worth noting separately that for the new point I did not check at all whether compilation happens or what the output looks like. I automatically assumed everything was the same as at 0/2k. Which was a mistake.
Looking at this, I drew the next far-reaching conclusion: "the dependence is strongly non-linear." But, more importantly, at this point I began to suspect that the real cause is not the NOPs themselves but the size of the method.
The next thought was that we have virtual methods, which means they are dispatched through a virtual method table. Maybe the table itself is sensitive to method size? To check, I simply moved the code to static methods, and, of course, nothing changed at all.
Question to experts 2
Was this thought completely stupid? Or was there something sensible in it?
Still puzzled, it was time to look at what the size of the method has to do with anything. The answer turned up in the OpenJDK sources:
develop(intx, HugeMethodLimit, 8000, \
"Don't compile methods larger than this if " \
"+DontCompileHugeMethods")
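One way to check the limit on your own JVM is to dump the final flag values. Note the caveat: HugeMethodLimit is a develop-level flag, so it typically only appears in debug builds of HotSpot, while the related product flag DontCompileHugeMethods is always listed:

```shell
# List final VM flag values and filter for the huge-method settings
java -XX:+PrintFlagsFinal -version | grep -i HugeMethod
```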
Interesting — right between 2k and 10k. Let's compute the size of my method: "return 20" takes 3 bytes (LDC plus IRETURN), which leaves 7997 for NOPs.
Benchmark Mode Samples Mean Mean error Units
b.Bench.nop_0 thrpt 5 797.376 12.998 ops/us
b.Bench.nop_2000 thrpt 5 306.795 0.243 ops/us
b.Bench.nop_7997 thrpt 5 303.314 7.161 ops/us
b.Bench.nop_7998 thrpt 5 0.335 0.001 ops/us
b.Bench.nop_10000 thrpt 5 0.269 0.000 ops/us
Guessed it — this boundary is now clear. It remains to understand what happens below 8000 bytes. Let's add more points:
Benchmark Mode Samples Mean Mean error Units
b.Bench.nop_0 thrpt 5 853.499 61.847 ops/us
b.Bench.nop_10 thrpt 5 845.861 112.504 ops/us
b.Bench.nop_100 thrpt 5 867.068 20.681 ops/us
b.Bench.nop_500 thrpt 5 304.116 1.665 ops/us
b.Bench.nop_1000 thrpt 5 299.295 8.745 ops/us
b.Bench.nop_2000 thrpt 5 306.495 0.578 ops/us
b.Bench.nop_7997 thrpt 5 301.322 7.992 ops/us
b.Bench.nop_7998 thrpt 5 0.335 0.005 ops/us
b.Bench.nop_10000 thrpt 5 0.269 0.004 ops/us
b.Bench.nop_25000 thrpt 5 0.105 0.007 ops/us
b.Bench.nop_50000 thrpt 5 0.053 0.001 ops/us
The first pleasing thing here is that once the JIT drops out, a linear dependence is clearly visible. This exactly matches our expectation, because the interpreter has to process each NOP explicitly.
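That expectation can be checked against the table directly: converting each interpreted throughput to time per call and taking the slope between points should give a near-constant per-NOP cost. A quick stdlib-only sanity check (the numbers are copied from the table above):

```java
public class LinearityCheck {
    public static void main(String[] args) {
        // Interpreted data points from the table: NOP count -> throughput (ops/us)
        int[] nops = {10_000, 25_000, 50_000};
        double[] throughput = {0.268, 0.105, 0.053};

        // Time per call in microseconds = 1 / throughput
        double t10k = 1 / throughput[0]; // ~3.73 us
        double t25k = 1 / throughput[1]; // ~9.52 us
        double t50k = 1 / throughput[2]; // ~18.87 us

        // Per-NOP cost estimated from two independent increments
        double slope1 = (t25k - t10k) / (nops[1] - nops[0]);
        double slope2 = (t50k - t25k) / (nops[2] - nops[1]);

        System.out.printf("per-NOP cost: %.6f us vs %.6f us%n", slope1, slope2);
        // The two estimates agree within a few percent -> linear dependence
    }
}
```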
The next thing that catches the eye is a strong impression that below 8k there is not one kind of dependence but simply two constants. Another five minutes of manual binary search, and the boundary is found.
Benchmark Mode Samples Mean Mean error Units
b.Bench.nop_0 thrpt 5 805.466 10.074 ops/us
b.Bench.nop_10 thrpt 5 862.027 4.756 ops/us
b.Bench.nop_100 thrpt 5 861.462 9.881 ops/us
b.Bench.nop_322 thrpt 5 863.176 22.385 ops/us
b.Bench.nop_323 thrpt 5 303.677 5.130 ops/us
b.Bench.nop_500 thrpt 5 299.368 11.143 ops/us
b.Bench.nop_1000 thrpt 5 302.884 3.373 ops/us
b.Bench.nop_2000 thrpt 5 306.682 3.598 ops/us
b.Bench.nop_7997 thrpt 5 301.457 4.209 ops/us
b.Bench.nop_7998 thrpt 5 0.337 0.001 ops/us
b.Bench.nop_10000 thrpt 5 0.268 0.004 ops/us
b.Bench.nop_25000 thrpt 5 0.107 0.002 ops/us
b.Bench.nop_50000 thrpt 5 0.053 0.000 ops/us
Almost done — it remains to understand what this boundary is. Let's compute: 3 + 322 == 325. Searching for what magic number 325 might be turns up the flag -XX:FreqInlineSize
FreqInlineSize is 325 on modern 64bit Linux
and its description from the docs:
Integer specifying maximum number of bytecode instructions in a frequently executed method which gets inlined.
Hurrah! Everything finally came together. In total, we have found the dependence of performance on method size (all other things being equal, naturally):
1. JIT + inlining
2. JIT
3. plain interpretation
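The whole picture can be summarized in a small sketch: the regime is determined by the bytecode size of get(), which is nopCount + 3 bytes (one byte per NOP, two for LDC, one for IRETURN). The threshold constants below are the values found above; the class and enum names are my own, and the limits may differ on other JVM versions and platforms:

```java
public class RegimeClassifier {
    // HotSpot limits found above (defaults; may differ per JVM version/platform)
    static final int FREQ_INLINE_SIZE = 325;   // -XX:FreqInlineSize
    static final int HUGE_METHOD_LIMIT = 8000; // HugeMethodLimit (develop flag)

    enum Regime { JIT_INLINED, JIT_COMPILED, INTERPRETED }

    // get() bytecode size: one byte per NOP, plus LDC (2 bytes) + IRETURN (1 byte)
    static int methodSize(int nopCount) {
        return nopCount + 3;
    }

    static Regime classify(int nopCount) {
        int size = methodSize(nopCount);
        if (size <= FREQ_INLINE_SIZE) return Regime.JIT_INLINED;
        if (size <= HUGE_METHOD_LIMIT) return Regime.JIT_COMPILED;
        return Regime.INTERPRETED;
    }

    public static void main(String[] args) {
        // Boundaries matching the measured jumps in the tables above
        System.out.println(classify(322));  // JIT_INLINED  (~850 ops/us)
        System.out.println(classify(323));  // JIT_COMPILED (~300 ops/us)
        System.out.println(classify(7997)); // JIT_COMPILED
        System.out.println(classify(7998)); // INTERPRETED  (~0.3 ops/us)
    }
}
```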
Conclusion
As I said at the beginning, the main point is not the actual behavior. It turned out to be quite trivial, and I'm sure it is described in the docs (I haven't read them, so I don't know). My main message is that it is very important to trust common sense, and if the measurement results diverge from it even slightly, or just seem incomprehensible, you absolutely must check and double-check everything.
I hope someone found this post interesting.
PS
All along I counted both 8000 and 325 in bytes, inclusive. It seems this should have been done in instructions, non-inclusive.
Question to experts 3
Why exactly 325 and 8000? Are these arbitrary numbers, or is there something behind them?