JDK-8366042 : Create Object Headers Microbenchmark Set
  • Type: Enhancement
  • Component: performance
  • Sub-Component: libraries
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: generic
  • CPU: generic
  • Submitted: 2025-08-24
  • Updated: 2025-12-02
Description
1) Create a set of microbenchmarks covering every elementary JVM operation that uses the object header (a sketch of one such micro follows at the end of this description). 
Examples of such operations: virtual method invocation (including inline caches), instanceof, type casts, identity hash code, array store checks, synchronization, etc. 

2) Extend the Object Headers Microbenchmark Set to cover "in-progress" projects. 
For example, Project Valhalla adds object header checks for acmp/aastore/aaload operations.

Such a set of microbenchmarks would make it possible to perform performance evaluation, performance analysis, and performance regression tracking for every change or extension in the object header area. 

For example, it's known that Compact Object Headers cause -20% performance overhead for virtual method invocation. 

The goal is to get a comprehensive picture of the performance impact of Compact Object Headers (as well as Valhalla and other projects).
Ideally, every object header access should be covered by the Object Headers Microbenchmark Set, except for GC; covering GC operations is a non-goal.
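
As a rough illustration, here is a minimal JMH sketch of what one entry in such a set might look like. All package, class, and method names are hypothetical, not taken from any existing benchmark suite; each method isolates one elementary operation that reads or writes the object header, so COH-on and COH-off runs can be compared directly:

```
// Hypothetical sketch; names are illustrative only.
package org.openjdk.bench.objectheaders;

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Fork(3)
public class HeaderOps {

    static class A {}
    static class B extends A {}

    A[] objs;
    Object lock;

    @Setup
    public void setup() {
        objs = new A[1024];
        for (int i = 0; i < objs.length; i++) {
            // Mixed runtime types so the klass pointer must actually be
            // loaded and decoded instead of being constant-folded.
            objs[i] = (i % 2 == 0) ? new A() : new B();
        }
        lock = new Object();
    }

    @Benchmark
    public int typeCheck() {            // instanceof: header -> klass compare
        int hits = 0;
        for (A o : objs) if (o instanceof B) hits++;
        return hits;
    }

    @Benchmark
    public int identityHash() {         // identity hash is kept in the mark word
        int sum = 0;
        for (A o : objs) sum += System.identityHashCode(o);
        return sum;
    }

    @Benchmark
    public void monitor() {             // lock state is encoded in the mark word
        synchronized (lock) { }
    }
}
```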




Comments
"But I cannot imagine this costing 20% performance". When I said 20%, I meant 20% of the invocation costs. Invocation cost is also cheap. Cheap divided by cheap may give 20%. It highly depends on HW. In June, I had 20%, now I have newer HW, and I am getting ~10%. - The overhead of the decoding is small and can be visible on specifically designed microbenchmarks. That is why I marked the performance as "OK" in https://bugs.openjdk.org/browse/JDK-8355513. For real apps, the overhead is usually absorbed by the code around. I found only one real app (crypto AES encoding/decoding) that gives a visible slowdown.
02-12-2025

So, I wondered about the "That is caused by additional ASM instructions generated for class pointer decoding." statement in https://cr.openjdk.org/~skuksenko/jep519/performance.txt.

For non-COH, for example:
```
0x00007f69502512ac: mov 0x8(%rdx),%edi
;; decode_klass_not_null {
0x00007f69502512af: mov $0x51000000,%r10d
0x00007f69502512b5: add %r10,%rdi
;; } decode_klass_not_null
```
and for COH:
```
0x00007fadc42d0cac: mov (%rdx),%rdi
0x00007fadc42d0caf: shr $0x2a,%rdi
;; decode_klass_not_null {
0x00007fadc42d0cb3: shl $0x9,%rdi
0x00007fadc42d0cb7: mov $0x50000000,%r10d
0x00007fadc42d0cbd: add %r10,%rdi
;; } decode_klass_not_null
```

Compared to stock, COH needs two additional shift instructions: one to extract the narrow klass pointer from the mark word and one for the decoding shift. But I cannot imagine this costing 20% performance; shift operations are very, very cheap. Moreover, with COH we can load the narrow klass pointer from the mark word, so if the mark word is already in a register, we save a load. Non-COH always needs a separate load for this, which is at least 5 cycles if the content at the oop* is in L1.

------------------------

Side note: I think there is room for improvement; maybe we're not fully leveraging COH in some cases. E.g., I see that with COH we often make two memory accesses: 1) checking markword == 0 and 2) loading the narrow klass pointer from the mark word. Like this:
```
0x00007fadc42d0cf8: cmp (%rsi),%rax   // check markword == 0
;; check_receiver {
0x00007fadc42d0cfb: mov (%rsi),%r10   // load markword into R10
0x00007fadc42d0cfe: shr $0x2a,%r10
```
The first two operations could be reversed; then the CMP could be done against R10 instead of a memory location. Then we would have only one memory load, which is better than non-COH, which always needs two.

----------------------

[~skuksenko] you write:
"2. Performance improvements. Only 3 benchmarks got undeniable performance improvement (on all platforms):
Renaissance-FjKmeans: ~7%
SPECjbb2005: ~5%
SPECjbb2015: ~8%"

IMHO, performance gains in compound benchmarks like these are a huge deal and easily outweigh micro-level losses. 8% in SPECjbb is a lot. I saw even better improvements in my tests; in fact, I measured improvements in memory throughput of up to 25% in SPECjbb. Here we can clearly see the benefit of improved memory bandwidth and decreased footprint, something you will not see in micros that don't really stress memory bandwidth. Have you seen compound benchmarks that show decreased performance with COH?
02-12-2025
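
To make the footprint point concrete, here is a hedged, hypothetical JMH sketch (illustrative names, not from any existing suite) of a micro that is bound by memory traffic rather than by header decoding, the kind of workload where COH's smaller headers can plausibly help even though each individual header access is slightly more expensive:

```
// Hypothetical sketch: with 8-byte COH headers instead of 16-byte headers,
// each Node shrinks, so more nodes fit per cache line and the pointer-chasing
// traversal below may speed up under COH.
package org.openjdk.bench.objectheaders;

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@Fork(3)
public class FootprintBound {

    static final class Node {
        int payload;
        Node next;
    }

    Node head;

    @Setup
    public void setup() {
        // ~4M nodes: large enough to fall out of the last-level cache,
        // so every traversal step pays for real memory traffic.
        head = new Node();
        Node cur = head;
        for (int i = 0; i < 4_000_000; i++) {
            cur.payload = i;
            cur.next = new Node();
            cur = cur.next;
        }
    }

    @Benchmark
    public long traverse() {
        long sum = 0;
        for (Node n = head; n != null; n = n.next) sum += n.payload;
        return sum;
    }
}
```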

Here is what was reported about COH performance in June. Roman saw it, so I can't understand why the last point came as a surprise. https://cr.openjdk.org/~skuksenko/jep519/performance.txt
02-12-2025

Here is an example:
```
# JMH version: 1.38-SNAPSHOT
# VM version: JDK 25.0.1, Java HotSpot(TM) 64-Bit Server VM, 25.0.1+8-LTS-27
# VM invoker: /home/skuksenk/jdk-25.0.1/bin/java
# VM options: -XX:-UseCompactObjectHeaders

Benchmark                            Mode  Cnt  Score   Error  Units
o.o.b.m.i.array.Identity.target0     avgt   15  0.169 ± 0.001  ns/op
o.o.b.m.i.array.Identity.target1_i   avgt   15  0.241 ± 0.005  ns/op
o.o.b.m.i.array.Identity.target1_r   avgt   15  0.235 ± 0.004  ns/op
o.o.b.m.i.array.Identity.target1_ri  avgt   15  0.233 ± 0.006  ns/op
o.o.b.m.i.array.Identity.target2_i   avgt   15  0.325 ± 0.004  ns/op
o.o.b.m.i.array.Identity.target2_r   avgt   15  0.326 ± 0.003  ns/op
o.o.b.m.i.array.Identity.target2_ri  avgt   15  0.324 ± 0.003  ns/op
o.o.b.m.i.array.Identity.target3_i   avgt   15  1.463 ± 0.022  ns/op
o.o.b.m.i.array.Identity.target3_r   avgt   15  1.470 ± 0.017  ns/op
o.o.b.m.i.array.Identity.target3_ri  avgt   15  1.474 ± 0.011  ns/op
o.o.b.m.i.field.Identity.target0     avgt   15  0.241 ± 0.002  ns/op
o.o.b.m.i.field.Identity.target1     avgt   15  0.342 ± 0.006  ns/op
o.o.b.m.i.field.Identity.target2     avgt   15  0.389 ± 0.004  ns/op
o.o.b.m.i.field.Identity.target3     avgt   15  1.672 ± 0.023  ns/op
```
```
# JMH version: 1.38-SNAPSHOT
# VM version: JDK 25.0.1, Java HotSpot(TM) 64-Bit Server VM, 25.0.1+8-LTS-27
# VM invoker: /home/skuksenk/jdk-25.0.1/bin/java
# VM options: -XX:+UseCompactObjectHeaders

Benchmark                            Mode  Cnt  Score   Error  Units
o.o.b.m.i.array.Identity.target0     avgt   15  0.169 ± 0.001  ns/op
o.o.b.m.i.array.Identity.target1_i   avgt   15  0.254 ± 0.005  ns/op
o.o.b.m.i.array.Identity.target1_r   avgt   15  0.254 ± 0.003  ns/op
o.o.b.m.i.array.Identity.target1_ri  avgt   15  0.251 ± 0.002  ns/op
o.o.b.m.i.array.Identity.target2_i   avgt   15  0.374 ± 0.004  ns/op
o.o.b.m.i.array.Identity.target2_r   avgt   15  0.373 ± 0.003  ns/op
o.o.b.m.i.array.Identity.target2_ri  avgt   15  0.376 ± 0.006  ns/op
o.o.b.m.i.array.Identity.target3_i   avgt   15  1.497 ± 0.018  ns/op
o.o.b.m.i.array.Identity.target3_r   avgt   15  1.501 ± 0.030  ns/op
o.o.b.m.i.array.Identity.target3_ri  avgt   15  2.463 ± 1.511  ns/op
o.o.b.m.i.field.Identity.target0     avgt   15  0.242 ± 0.004  ns/op
o.o.b.m.i.field.Identity.target1     avgt   15  0.355 ± 0.005  ns/op
o.o.b.m.i.field.Identity.target2     avgt   15  0.406 ± 0.005  ns/op
o.o.b.m.i.field.Identity.target3     avgt   15  1.672 ± 0.026  ns/op
```
While the "Object Header Microbenchmark Set" does not exist yet, you may take the identity part of the micros from https://github.com/openjdk/valhalla/tree/lworld/test/micro/org/openjdk/bench/valhalla/invoke

A JBS issue was not filed because this is a regression, not a bug; not every performance difference should become a JBS issue. The regression is caused by the extra instructions for vtable address decoding in the COH-on case. That can't be fixed.
01-12-2025
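
For readers unfamiliar with those micros, here is a simplified sketch of their general shape (hypothetical names, not the actual Valhalla sources): each targetN exercises a call site that has seen N distinct receiver classes, so target1 stays monomorphic, target2 bimorphic (inline cache), and target3 goes megamorphic, where dispatch loads and decodes the klass pointer:

```
// Simplified, hypothetical sketch of the invoke/identity micro shape.
package org.openjdk.bench.objectheaders;

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Fork(3)
public class InvokeIdentity {

    interface Val { int f(); }
    static final class V1 implements Val { public int f() { return 1; } }
    static final class V2 implements Val { public int f() { return 2; } }
    static final class V3 implements Val { public int f() { return 3; } }

    Val[] mono, bi, mega;

    @Setup
    public void setup() {
        mono = new Val[1024];
        bi   = new Val[1024];
        mega = new Val[1024];
        for (int i = 0; i < 1024; i++) {
            mono[i] = new V1();
            bi[i]   = (i % 2 == 0) ? new V1() : new V2();
            mega[i] = (i % 3 == 0) ? new V1()
                    : (i % 3 == 1) ? new V2() : new V3();
        }
    }

    // Each benchmark keeps its own loop so the type profiles of the three
    // call sites stay independent (a shared helper would pollute them).

    @Benchmark
    public int target1() {              // monomorphic call site
        int s = 0;
        for (Val v : mono) s += v.f();
        return s;
    }

    @Benchmark
    public int target2() {              // bimorphic call site (inline cache)
        int s = 0;
        for (Val v : bi) s += v.f();
        return s;
    }

    @Benchmark
    public int target3() {              // megamorphic call site (itable dispatch)
        int s = 0;
        for (Val v : mega) s += v.f();
        return s;
    }
}
```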

I may have a suspicion, but before voicing it, I'd like to see a JBS issue and a repro case if possible.
01-12-2025

Is this issue blocking enabling COH by default?
01-12-2025

"For example, it's known that Compact Object Headers cause -20% performance overhead for virtual method invocation." Is it? It is new to me, tbh. Where have you seen this? I also don't know why this should happen - it is not really plausible.
01-12-2025

[~skuksenko] " For example, it's known that Compact Object Headers cause -20% performance overhead for virtual method invocation. " Known where? Do we have a JBS report for this?
01-12-2025