Currently, C1 sometimes emits runs of single-byte nops, e.g. when aligning the call immediate:
0.14% 0.16% 0x00007efd65652ba0: nop
0.16% 0.10% 0x00007efd65652ba1: nop
0.09% 0.04% 0x00007efd65652ba2: nop
10.35% 10.32% 0x00007efd65652ba3: nop
0.14% 0.10% 0x00007efd65652ba4: nop
0.14% 0.17% 0x00007efd65652ba5: nop
0.13% 0.14% 0x00007efd65652ba6: nop
10.59% 5.78% 0x00007efd65652ba7: callq 0x00007efd65046160 ; ImmutableOopMap{[192]=Oop [176]=Oop [184]=Oop }
;*invokespecial sink
; {optimized virtual_call}
This is due to code patterns like:
  while (offset++ % BytesPerWord != 0) {
    __ nop();
  }
...even though we have Assembler::nop(int i), which can cover the same padding with multi-byte nops. We might need to revisit the places where C1 emits nops, and refactor them to use the nop(int i) method instead of looping.
A proof-of-concept patch shows ~20% improvement on targeted call benchmarks, even on a modern Haswell CPU.
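To illustrate the difference between the two padding strategies, here is a minimal standalone sketch. It is not HotSpot code: CodeBuffer, kNops, padding_to_align and both pad_* functions are hypothetical names made up for this example, and the byte sequences are the multi-byte NOP encodings recommended in the Intel manuals, which may differ from what Assembler::nop(int i) actually emits on a given CPU.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a code buffer; NOT a HotSpot class.
struct CodeBuffer {
  std::vector<uint8_t> bytes;
  int instructions = 0;  // how many separate instructions were emitted

  void emit(const uint8_t* p, size_t n) {
    bytes.insert(bytes.end(), p, p + n);
    instructions++;
  }
};

// Intel-recommended NOP encodings for lengths 1..9 (row i holds length i+1).
static const uint8_t kNops[9][9] = {
  {0x90},
  {0x66, 0x90},
  {0x0F, 0x1F, 0x00},
  {0x0F, 0x1F, 0x40, 0x00},
  {0x0F, 0x1F, 0x44, 0x00, 0x00},
  {0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00},
  {0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00},
  {0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
  {0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
};

// Bytes needed to bring 'offset' up to the next multiple of 'alignment'.
size_t padding_to_align(size_t offset, size_t alignment) {
  return (alignment - offset % alignment) % alignment;
}

// The looping pattern from the report: one single-byte nop per padding byte.
void pad_with_single_byte_nops(CodeBuffer& cb, size_t padding) {
  for (size_t i = 0; i < padding; i++) {
    cb.emit(kNops[0], 1);
  }
}

// The alternative: cover the padding with as few nop instructions as possible.
void pad_with_multi_byte_nops(CodeBuffer& cb, size_t padding) {
  while (padding > 0) {
    size_t n = padding < 9 ? padding : 9;
    cb.emit(kNops[n - 1], n);
    padding -= n;
  }
}
```

For the disassembly above, the call immediate would sit at an offset ending in ...ba1 without padding, so padding_to_align(0xba1, 8) gives 7 bytes. The looping variant fills them with seven separate instructions, while the multi-byte variant fills them with a single 7-byte nop, so the CPU has fewer instructions to decode and retire before the call.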