I have long suspected that C2Stubs take up a significant part of the code cache. Most of them look different "only" because they carry Labels that bind their return addresses. The meat of these stubs is roughly the same. And since these stubs often call into the runtime, the bulk of that meat is stack pushes/pops.
This is not exactly a problem for hot code paths: these stubs are out-lined for a reason. But it becomes a problem when code density matters anyway, in at least two cases: a) interprocedural calls prefer targets in near memory (at least on some AArch64 cores); b) AOTCache would eventually store generated code, and archive size would impact startup/warmup.
To estimate how bad it is, I whipped up a simple patch:
https://github.com/openjdk/jdk/compare/master...shipilev:wip-count-nmethod-size
Here is the result for SpringBoot PetClinic running on x86_64:
Parallel:   5670480 in instructions,       0 in GC stubs,  96094 in C2 stubs ( 1.7% are stubs)
G1:         6437416 in instructions,  623779 in GC stubs, 101031 in C2 stubs (10.1% are stubs)
Shenandoah: 7466346 in instructions,       0 in GC stubs,  93806 in C2 stubs ( 1.2% are stubs)
ZGC:        5530769 in instructions, 2018482 in GC stubs,  92234 in C2 stubs (27.6% are stubs)
Looks like late barrier expansion in G1 and ZGC contributes most of the stub code. Shenandoah would end up in the same situation once it implements late barrier expansion.
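For anyone cross-checking the percentages: they appear consistent with (GC stubs + C2 stubs) / (instructions + GC stubs + C2 stubs), i.e. "instructions" excluding the stubs. That formula is inferred from the numbers above, not read out of the patch, so treat it as an assumption. A minimal sketch that recomputes the shares from the reported counts:

```python
# Counts copied from the PetClinic results above:
# (instructions, GC stubs, C2 stubs) per collector.
RESULTS = {
    "Parallel":   (5670480, 0, 96094),
    "G1":         (6437416, 623779, 101031),
    "Shenandoah": (7466346, 0, 93806),
    "ZGC":        (5530769, 2018482, 92234),
}


def stub_share(instructions, gc_stubs, c2_stubs):
    """Percentage of code taken by stubs.

    Assumption: the 'instructions' count does not include stub code,
    so the denominator is instructions + all stubs.
    """
    stubs = gc_stubs + c2_stubs
    return 100.0 * stubs / (instructions + stubs)


# Rounded to one decimal place, like the report.
SHARES = {gc: round(stub_share(*sizes), 1) for gc, sizes in RESULTS.items()}

if __name__ == "__main__":
    for gc, share in SHARES.items():
        print(f"{gc:>10}: {share:4.1f}% are stubs")
```

Under that assumption the recomputed shares match all four reported figures.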
I wonder if we can make it a bit better by de-duplicating the stubs, and maybe _calling_ them with interior near calls (like we do with trampolines), so that we "record" the return address using the machine state itself? It remains to be seen whether this is possible without affecting fast-path performance much. The current stub code is also smart about pushing/popping only the registers that are actually in use at the point where the stub is used, and that might tip the scale significantly.
Example tail of an nmethod with G1:
; C2Stub 1
0x00007f40e42e730f: mov (%r10),%r11d
0x00007f40e42e7312: shl $0x3,%r11
0x00007f40e42e7316: cmp $0x0,%r11
0x00007f40e42e731a: je 0x00007f40e42e7183
0x00007f40e42e7320: mov 0x38(%r15),%rdi
0x00007f40e42e7324: test %rdi,%rdi
0x00007f40e42e7327: je 0x00007f40e42e7341
0x00007f40e42e732d: sub $0x8,%rdi
0x00007f40e42e7331: mov %rdi,0x38(%r15)
0x00007f40e42e7335: add 0x40(%r15),%rdi
0x00007f40e42e7339: mov %r11,(%rdi)
0x00007f40e42e733c: jmp 0x00007f40e42e7183
0x00007f40e42e7341: sub $0x40,%rsp
0x00007f40e42e7345: mov %r10,0x38(%rsp)
0x00007f40e42e734a: mov %r8,0x30(%rsp)
0x00007f40e42e734f: mov %r9,0x28(%rsp)
0x00007f40e42e7354: mov %rcx,0x20(%rsp)
0x00007f40e42e7359: mov %rdx,0x18(%rsp)
0x00007f40e42e735e: mov %rsi,0x10(%rsp)
0x00007f40e42e7363: mov %rax,0x8(%rsp)
0x00007f40e42e7368: mov %r11,%rdi
0x00007f40e42e736b: mov %r15,%rsi
0x00007f40e42e736e: call 0x00007f40f616ea70 ; {runtime_call G1BarrierSetRuntime::write_ref_field_pre_entry(oopDesc*, JavaThread*)}
0x00007f40e42e7373: mov 0x8(%rsp),%rax
0x00007f40e42e7378: mov 0x10(%rsp),%rsi
0x00007f40e42e737d: mov 0x18(%rsp),%rdx
0x00007f40e42e7382: mov 0x20(%rsp),%rcx
0x00007f40e42e7387: mov 0x28(%rsp),%r9
0x00007f40e42e738c: mov 0x30(%rsp),%r8
0x00007f40e42e7391: mov 0x38(%rsp),%r10
0x00007f40e42e7396: vzeroupper
0x00007f40e42e7399: add $0x40,%rsp
0x00007f40e42e739d: jmp 0x00007f40e42e7183
; C2Stub 2
0x00007f40e42e73a2: mov (%r10),%r11d
0x00007f40e42e73a5: shl $0x3,%r11
0x00007f40e42e73a9: cmp $0x0,%r11
0x00007f40e42e73ad: je 0x00007f40e42e7196
0x00007f40e42e73b3: mov 0x38(%r15),%rdi
0x00007f40e42e73b7: test %rdi,%rdi
0x00007f40e42e73ba: je 0x00007f40e42e73d4
0x00007f40e42e73c0: sub $0x8,%rdi
0x00007f40e42e73c4: mov %rdi,0x38(%r15)
0x00007f40e42e73c8: add 0x40(%r15),%rdi
0x00007f40e42e73cc: mov %r11,(%rdi)
0x00007f40e42e73cf: jmp 0x00007f40e42e7196
0x00007f40e42e73d4: sub $0x40,%rsp
0x00007f40e42e73d8: mov %r10,0x38(%rsp)
0x00007f40e42e73dd: mov %r8,0x30(%rsp)
0x00007f40e42e73e2: mov %r9,0x28(%rsp)
0x00007f40e42e73e7: mov %rcx,0x20(%rsp)
0x00007f40e42e73ec: mov %rdx,0x18(%rsp)
0x00007f40e42e73f1: mov %rsi,0x10(%rsp)
0x00007f40e42e73f6: mov %rax,0x8(%rsp)
0x00007f40e42e73fb: mov %r11,%rdi
0x00007f40e42e73fe: mov %r15,%rsi
0x00007f40e42e7401: call 0x00007f40f616ea70 ; {runtime_call G1BarrierSetRuntime::write_ref_field_pre_entry(oopDesc*, JavaThread*)}
0x00007f40e42e7406: mov 0x8(%rsp),%rax
0x00007f40e42e740b: mov 0x10(%rsp),%rsi
0x00007f40e42e7410: mov 0x18(%rsp),%rdx
0x00007f40e42e7415: mov 0x20(%rsp),%rcx
0x00007f40e42e741a: mov 0x28(%rsp),%r9
0x00007f40e42e741f: mov 0x30(%rsp),%r8
0x00007f40e42e7424: mov 0x38(%rsp),%r10
0x00007f40e42e7429: vzeroupper
0x00007f40e42e742c: add $0x40,%rsp
0x00007f40e42e7430: jmp 0x00007f40e42e7196
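
To make the "different only in return addresses" point concrete, here is a rough sketch that normalizes the two stubs and compares them. It is an illustration, not part of the patch: the listings are an excerpt of the dump above (the branch lines plus a few others), and the `normalize` helper and its `stub+...` / `<retN>` naming are made up for this example. Branch targets inside the stub are rewritten as stub-relative offsets, and targets outside the stub are assumed to be return addresses into the method body.

```python
import re

# Excerpts from the two stubs in the dump above.
STUB_1 = """\
0x00007f40e42e730f: mov (%r10),%r11d
0x00007f40e42e731a: je 0x00007f40e42e7183
0x00007f40e42e7327: je 0x00007f40e42e7341
0x00007f40e42e733c: jmp 0x00007f40e42e7183
0x00007f40e42e7341: sub $0x40,%rsp
0x00007f40e42e736e: call 0x00007f40f616ea70
0x00007f40e42e739d: jmp 0x00007f40e42e7183
"""

STUB_2 = """\
0x00007f40e42e73a2: mov (%r10),%r11d
0x00007f40e42e73ad: je 0x00007f40e42e7196
0x00007f40e42e73ba: je 0x00007f40e42e73d4
0x00007f40e42e73cf: jmp 0x00007f40e42e7196
0x00007f40e42e73d4: sub $0x40,%rsp
0x00007f40e42e7401: call 0x00007f40f616ea70
0x00007f40e42e7430: jmp 0x00007f40e42e7196
"""

LINE = re.compile(r"^(0x[0-9a-f]+):\s+(.*)$")
BRANCH = re.compile(r"^(j\w+)\s+(0x[0-9a-f]+)$")  # je/jmp/...; calls are left alone


def normalize(listing):
    """Return [(offset_in_stub, instruction)] with position-independent branches."""
    rows = [LINE.match(line).groups() for line in listing.strip().splitlines()]
    addrs = [int(addr, 16) for addr, _ in rows]
    start, last = addrs[0], addrs[-1]
    external = {}  # outside target -> symbolic name, in first-seen order
    out = []
    for addr, (_, insn) in zip(addrs, rows):
        m = BRANCH.match(insn)
        if m:
            op, target = m.group(1), int(m.group(2), 16)
            if start <= target <= last:
                # Branch within the stub: express it relative to the stub start.
                insn = f"{op} stub+{target - start:#x}"
            else:
                # Assumed to be a return address back into the method body.
                name = external.setdefault(target, f"<ret{len(external)}>")
                insn = f"{op} {name}"
        out.append((addr - start, insn))
    return out


if __name__ == "__main__":
    print(normalize(STUB_1) == normalize(STUB_2))
    for offset, insn in normalize(STUB_1):
        print(f"+{offset:#04x}  {insn}")
```

After normalization both excerpts come out identical: every instruction sits at the same offset, and the only thing that differed was the return address. Going by the addresses in the dump, stub 1 spans 0x93 (147) bytes, from 0x00007f40e42e730f to the start of stub 2 at 0x00007f40e42e73a2, and the spill/call/restore path that starts at +0x32 accounts for 97 of them.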