Comparing the allocation codepaths in 64-bit mode with and without compressed oops reveals some interesting codegen issues, caused by the interaction between the BiasedLocking prototype header mechanism and compressed oops.
For example, "new Object()" yields this allocation sequence:
; load and unpack metadata for java.lang.Object
mov $0x200001d5,%r11d
movabs $0x0,%r10
lea (%r10,%r11,8),%r10
; get prototype mark word, and store it into object
mov 0xa8(%r10),%r10
mov %r10,(%rax)
; store class word
movl $0x200001d5,0x8(%rax)
Doing either -XX:-UseCompressedOops or -XX:-UseBiasedLocking improves allocation performance by around 7% -- apparently because we strip away the decoding part.
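For reference, a microbenchmark of this shape is easy to express with JMH; this is a minimal sketch (the package, class, and method names here are made up, not the actual benchmark used for the numbers above):

package org.sample;

import org.openjdk.jmh.annotations.Benchmark;

public class AllocBench {
    @Benchmark
    public Object alloc() {
        // The allocation under test; returning the object lets JMH
        // consume it, so the JIT cannot elide the allocation entirely.
        return new Object();
    }
}

Running it with and without the flag, e.g. something like "java -jar benchmarks.jar AllocBench -jvmArgs -XX:-UseCompressedOops", should reproduce the difference.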
It does not seem simple or sane to rework the biased locking machinery to avoid polling the prototype header during allocation, so we may want to just improve the generated code quality there. Indeed, -XX:-UseCompressedOops seems to provide a good boost on the targeted microbenchmark. There are two things we might try to improve the code quality here:
a) Since we know the narrow class address statically, we might as well unpack it statically and emit the pre-decoded address right away, e.g.:
; load pre-decoded metadata for java.lang.Object
movabs $0x100000EA8,%r10
; get prototype mark word, and store it into object
mov 0xa8(%r10),%r10
mov %r10,(%rax)
; store class word
movl $0x200001d5,0x8(%rax)
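Note that the pre-decoded address no longer fits in 32 bits, so this costs a 10-byte movabs, and it bakes the decoded class address directly into the generated code.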
b) Keep the narrow class constant, but generate better code:
; load narrow class word for java.lang.Object
mov $0x200001d5,%r11d
; get prototype mark word, and store it into object
mov 0xa8(%r12,%r11,8),%r10
mov %r10,(%rax)
; store class word
mov %r11d,0x8(%rax)
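Option b) looks preferable: there is no 64-bit immediate, the decode folds into the addressing mode off the reserved heap base register %r12, and the narrow class constant in %r11d is reused for both the mark word load and the class word store.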