Finding the Causes of Strange Performance

Introduction


I finally got around to studying Java bytecode in detail, and almost immediately an interesting question popped into my head: there is a NOP instruction that does nothing, so how does this “nothing” affect performance? The process of investigating exactly that is described in this post.

Disclaimer


This story is, first of all, not about how things actually work, but about the kinds of mistakes you should watch out for when measuring performance.

Tools


Let's start with the main thing: how all the measurements were made. The ASM library was used to generate the code, and JMH to build the benchmark itself.

In order not to use reflection, a small interface was created:
public interface Getter {
    int get();
}
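To see why this matters: without the interface, a natural way to call the generated method would be reflectively, and a hypothetical reflective call like the one below adds lookup, boxing and access-check overhead that would pollute a nanosecond-scale measurement (the class name is illustrative and assumes the class is loadable by the current class loader):

// Hypothetical reflective alternative that the Getter interface avoids.
Class<?> clazz = Class.forName("other.GeneratedClass_0");
Object instance = clazz.getDeclaredConstructor().newInstance();
int value = (Integer) clazz.getMethod("get").invoke(instance);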


Next, a class implementing the get method was generated:
  public get()I
    NOP
    ...
    NOP
    LDC 20
    IRETURN

An arbitrary number of NOPs can be inserted.

Full generator code
// Imports assumed for this snippet: ASM core plus a @NotNull annotation
// (e.g. org.jetbrains.annotations); adjust to whatever is actually on the classpath.
import org.jetbrains.annotations.NotNull;
import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;

import static org.objectweb.asm.Opcodes.*;
import static org.objectweb.asm.Type.getInternalName;

public class SimpleGetterClassLoader extends ClassLoader {
    private static final String GENERATED_CLASS_NAME = "other.GeneratedClass";
    private static final ClassLoader myClassLoader = new SimpleGetterClassLoader();
    @SuppressWarnings("unchecked")
    public static Getter newInstanceWithNOPs(int nopCount) throws Exception {
        Class clazz = Class.forName(GENERATED_CLASS_NAME + "_" + nopCount, false, myClassLoader);
        return (Getter) clazz.newInstance();
    }
    @NotNull
    @Override
    protected Class findClass(@NotNull String name) throws ClassNotFoundException {
        if (!name.startsWith(GENERATED_CLASS_NAME))
            throw new ClassNotFoundException(name);
        int nopCount = Integer.parseInt(name.substring(GENERATED_CLASS_NAME.length() + 1));
        ClassWriter cw = new ClassWriter(0);
        cw.visit(V1_5, ACC_PUBLIC, name.replace('.', '/'), null, getInternalName(Object.class), new String[]{getInternalName(Getter.class)});
        {
            MethodVisitor mv = cw.visitMethod(ACC_PUBLIC, "<init>", "()V", null, null);
            mv.visitCode();
            mv.visitVarInsn(ALOAD, 0);
            mv.visitMethodInsn(INVOKESPECIAL, getInternalName(Object.class), "<init>", "()V");
            mv.visitInsn(RETURN);
            mv.visitMaxs(1, 1);
            mv.visitEnd();
        }
        {
            MethodVisitor mv = cw.visitMethod(ACC_PUBLIC, "get", "()I", null, null);
            mv.visitCode();
            for (int i = 0; i < nopCount; i++) {
                mv.visitInsn(NOP);
            }
            mv.visitLdcInsn(20);
            mv.visitInsn(IRETURN);
            mv.visitMaxs(1, 1);
            mv.visitEnd();
        }
        cw.visitEnd();
        byte[] bytes = cw.toByteArray();
        return defineClass(name, bytes, 0, bytes.length);
    }
}
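As a quick sanity check (not part of the original measurements), the generator can be exercised directly, reusing only the classes defined above:

public class GeneratorDemo {
    public static void main(String[] args) throws Exception {
        // Generate a Getter whose get() body is padded with 2000 NOPs
        // and verify that it still simply returns 20.
        Getter padded = SimpleGetterClassLoader.newInstanceWithNOPs(2000);
        System.out.println(padded.get()); // expected output: 20
    }
}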


Benchmark
@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class Bench {
    private Getter nop_0;
    private Getter nop_10;
    ...
    @Setup
    public void setup() throws Exception {
        nop_0 = newInstanceWithNOPs(0);
        nop_10 = newInstanceWithNOPs(10);
        ...
    }
    @GenerateMicroBenchmark
    public int nop_0() {
        return nop_0.get();
    }
    @GenerateMicroBenchmark
    public int nop_10() {
        return nop_10.get();
    }
   ...
}
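For completeness, a minimal launcher sketch, assuming a reasonably recent JMH where the annotation is @Benchmark and the Runner API is available (the @GenerateMicroBenchmark-era JMH was typically run via the generated microbenchmarks.jar instead):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchRunner {
    public static void main(String[] args) throws Exception {
        // Run all benchmark methods declared in Bench in a single fork.
        Options opts = new OptionsBuilder()
                .include(Bench.class.getSimpleName())
                .forks(1)
                .build();
        new Runner(opts).run();
    }
}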



The search for truth


To start, two benchmarks were run: one without NOPs and one with 2000 of them.

Benchmark            Mode   Samples         Mean   Mean error    Units
b.Bench.nop_0       thrpt         5      838,753       48,962   ops/us
b.Bench.nop_2000    thrpt         5      298,428        7,965   ops/us

And I immediately drew a very bold conclusion: “The dumb JIT doesn't strip the NOPs out, it translates them straight into machine NOPs.”
Question to experts:
If that were true, would the measurements look like this, or would they be completely different?

Still, this was only a hypothesis, and I really wanted to test it. First I made sure that both methods are actually compiled by the JIT, then I looked at what they compile into (typically done with -XX:+PrintCompilation and, for the generated code, -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly). Naturally, the assembly was completely identical. And that's when I realized I didn't understand something: the executable code is exactly the same, yet the performance differs by a factor of 2.5. Strange.

Then I really wanted to see the shape of the dependence:
Benchmark             Mode   Samples         Mean   Mean error    Units
b.Bench.nop_0        thrpt         5      813,010       71,510   ops/us
b.Bench.nop_2000     thrpt         5      302,589       12,360   ops/us
b.Bench.nop_10000    thrpt         5        0,268        0,017   ops/us

Hidden knowledge
This measurement is simply gorgeous: three points, and each one from a different regime.

It is worth noting separately that for the new point I did not check at all whether compilation happens or what comes out of it; I simply assumed everything was the same as at 0 / 2k. That was a mistake.

Looking at this, I drew the next far-reaching conclusion: "the dependence is strongly non-linear." But, more importantly, at this point I began to suspect that the real cause was not the NOPs themselves but the size of the method.

The next thought was that these are virtual methods, which means they live in the virtual method table. Maybe the table itself is sensitive to size? To check, I simply moved the code into static methods (roughly as sketched below), and, of course, nothing changed at all.
Question to experts 2
Was this thought completely silly, or was there something sensible in it?
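Roughly, the check boiled down to emitting the same NOP-padded body as a static method, so that no virtual dispatch is involved at all; a sketch (the method name and exact shape are illustrative, not necessarily what was actually generated):

// Same body as get(), but emitted as a static method: no vtable lookup.
MethodVisitor mv = cw.visitMethod(ACC_PUBLIC | ACC_STATIC, "getStatic", "()I", null, null);
mv.visitCode();
for (int i = 0; i < nopCount; i++) {
    mv.visitInsn(NOP);
}
mv.visitLdcInsn(20);
mv.visitInsn(IRETURN);
mv.visitMaxs(1, 0); // stack depth 1; a static no-arg method needs no local slots
mv.visitEnd();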


Next, out of sheer confusion, it occurred to me to look into what the size of the method could possibly have to do with it. The answer was found in the OpenJDK sources:
  develop(intx, HugeMethodLimit,  8000,                                     \
          "Don't compile methods larger than this if "                      \
          "+DontCompileHugeMethods")      

Interestingly, it sits right between 2k and 10k. Let's compute the size of my method: "return 20" takes 3 bytes of bytecode (LDC is 2 bytes, IRETURN is 1), which leaves room for 7997 NOPs within the limit.
Benchmark             Mode   Samples         Mean   Mean error    Units
b.Bench.nop_0        thrpt         5      797,376       12,998   ops/us
b.Bench.nop_2000     thrpt         5      306,795        0,243   ops/us
b.Bench.nop_7997     thrpt         5      303,314        7,161   ops/us
b.Bench.nop_7998     thrpt         5        0,335        0,001   ops/us
b.Bench.nop_10000    thrpt         5        0,269        0,000   ops/us

Guessed right: this boundary is now clear. It remains to understand what happens below 8000 bytes. Let's add more points:
Benchmark             Mode   Samples         Mean   Mean error    Units
b.Bench.nop_0        thrpt         5      853,499       61,847   ops/us
b.Bench.nop_10       thrpt         5      845,861      112,504   ops/us
b.Bench.nop_100      thrpt         5      867,068       20,681   ops/us
b.Bench.nop_500      thrpt         5      304,116        1,665   ops/us
b.Bench.nop_1000     thrpt         5      299,295        8,745   ops/us
b.Bench.nop_2000     thrpt         5      306,495        0,578   ops/us
b.Bench.nop_7997     thrpt         5      301,322        7,992   ops/us
b.Bench.nop_7998     thrpt         5        0,335        0,005   ops/us
b.Bench.nop_10000    thrpt         5        0,269        0,004   ops/us
b.Bench.nop_25000    thrpt         5        0,105        0,007   ops/us
b.Bench.nop_50000    thrpt         5        0,053        0,001   ops/us

The first pleasant thing here is that once the JIT drops out, a linear dependence is clearly visible, which exactly matches our expectation, because the interpreter has to process each NOP explicitly. (For instance, going from 25000 to 50000 NOPs drops throughput from 0,107 to 0,053 ops/us, almost exactly a factor of two.)

The next thing that catches the eye is the strong impression that below 8k there is not a single kind of dependence but simply two constants. Another five minutes of manual binary search, and the boundary is found:
Benchmark             Mode   Samples         Mean   Mean error    Units
b.Bench.nop_0        thrpt         5      805,466       10,074   ops/us
b.Bench.nop_10       thrpt         5      862,027        4,756   ops/us
b.Bench.nop_100      thrpt         5      861,462        9,881   ops/us
b.Bench.nop_322      thrpt         5      863,176       22,385   ops/us
b.Bench.nop_323      thrpt         5      303,677        5,130   ops/us
b.Bench.nop_500      thrpt         5      299,368       11,143   ops/us
b.Bench.nop_1000     thrpt         5      302,884        3,373   ops/us
b.Bench.nop_2000     thrpt         5      306,682        3,598   ops/us
b.Bench.nop_7997     thrpt         5      301,457        4,209   ops/us
b.Bench.nop_7998     thrpt         5        0,337        0,001   ops/us
b.Bench.nop_10000    thrpt         5        0,268        0,004   ops/us
b.Bench.nop_25000    thrpt         5        0,107        0,002   ops/us
b.Bench.nop_50000    thrpt         5        0,053        0,000   ops/us

Almost done; it only remains to understand what this boundary is. Let's compute: 3 + 322 == 325. Searching for what this magic 325 could be, we find a certain flag, -XX:FreqInlineSize:
FreqInlineSize is 325 on modern 64bit Linux

and its description from the docs:
Integer specifying maximum number of bytecode instructions in a frequently executed method which gets inlined.
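As an aside (not something I did at the time), the effective value on a given HotSpot JVM can be checked with -XX:+PrintFlagsFinal, or read from Java itself via the HotSpot-specific diagnostic MXBean, roughly like this:

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class ShowInlineLimit {
    public static void main(String[] args) {
        // Query HotSpot for the FreqInlineSize flag and print its value.
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        System.out.println(hotspot.getVMOption("FreqInlineSize"));
    }
}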


Hurray! Everything finally came together. All in all, we found a dependence of performance on the size of the method (all other things being equal, naturally):
1. JIT + inlining
2. JIT without inlining
3. plain interpretation

Conclusion


As I said at the beginning, the main thing to pay attention to is not the actual behavior. It turned out to be fairly trivial, and I'm sure it is described in the docs (I haven't read them, so I don't know). My main message is that it is very important to trust common sense, and if the measurement results diverge from it even slightly, or simply seem puzzling, you absolutely must check and re-check everything.

I hope someone found this post interesting.

PS


All along I was counting both 8000 and 325 in bytes, inclusive. It seems I should have been counting bytecode instructions, non-inclusive.

Question to experts 3
Why exactly 325 and 8000? Are these random numbers, or is there something behind them?
