JDK-8231561 : [lworld] C2 generates inefficient code for acmp
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: repo-valhalla
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • CPU: x86
  • Submitted: 2019-09-26
  • Updated: 2021-07-30
  • Resolved: 2019-10-11
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other
repo-valhallaFixed
Related Reports
Relates :  
Relates :  
Description
I'd like to describe a set of issues with exists in 'acmp' codegeneration.
Benchmarks: http://cr.openjdk.java.net/~skuksenko/valhalla/cleanup/
(Params, FieldsO, FieldsI)

1. Clearing ArrayStorageProperties bits clazz ptr.
Obviously. We check types equality for both arguments and to do it - performs clearing flattened_array & non_null array.
That clearing is not required, because of we already know that one argument is not an array (and free from both bits).

2. In case when null never passed as an argument for our comparison, explicit null checks are not generated, traps to recompilation are used.
In that case load the first class ptr is always moved before checking markword for always locked pattern. That is useless for all non-inline types arguments:

-XX:-UseCompressedOops:
0.29%         0x00007f70001be624:   cmp    %rdx,%rbp
              ���      0x00007f70001be627:   je     0x00007f70001be671
  0.40%  ���      0x00007f70001be629:   mov    0x8(%rbp),%r10               ; implicit exception: dispatches to 0x00007f70001be80f
  4.77%  ���      0x00007f70001be62d:   shl    $0x3,%r10
  3.80%  ���      0x00007f70001be631:   sar    $0x3,%r10

SK: ^^ thats load should be moved to basic block where it's required (and arrays properties bits clearing is also moved here).

3.48%  ���      0x00007f70001be635:   mov    $0x405,%r11d
            ���      0x00007f70001be63b:   and    0x0(%rbp),%r11
0.25%  ���      0x00007f70001be63f:   cmp    $0x405,%r11
            ������     0x00007f70001be646:   jne    0x00007f70001be66d
            ������     0x00007f70001be648:   mov    0x8(%rdx),%r11               ; implicit exception: dispatches to 0x00007f70001be893
            ������     0x00007f70001be64c:   shl    $0x3,%r11
            ������     0x00007f70001be650:   sar    $0x3,%r11
            ������     0x00007f70001be654:   cmp    %r11,%r10
            ���������    0x00007f70001be657:   jne    0x00007f70001be66d           ;*if_acmpne {reexecute=0 rethrow=0 return_oop=0 return_vt=0}
 
-XX:+UseCompressedOops:
0x00007f0d241bf543:   cmp    %r9d,%ebp
0x00007f0d241bf546:   je     0x00007f0d241bf599               ;*if_acmpne {reexecute=0 rethrow=0 return_oop=0 return_vt=0}
                                                                                                ; - acmp.FieldsO::cmp0@8 (line 34)
0x00007f0d241bf548:   mov    0x8(%r12,%rbp,8),%r10d       ; implicit exception: dispatches to 0x00007f0d241bf737
0x00007f0d241bf54d:   lea    (%r12,%rbp,8),%rsi                  ;*getfield f {reexecute=0 rethrow=0 return_oop=0 return_vt=0}
                                                                                               ; - acmp.FieldsO::cmp0@5 (line 34)
0x00007f0d241bf551:   mov    $0x405,%r11d
0x00007f0d241bf557:   and    (%rsi),%r11
0x00007f0d241bf55a:   cmp    $0x405,%r11
0x00007f0d241bf561:   jne    0x00007f0d241bf595
0x00007f0d241bf563:   mov    0x8(%r12,%r9,8),%r11d        ; implicit exception: dispatches to 0x00007f0d241bf7bb
0x00007f0d241bf568:   mov    %r11d,%r8d
0x00007f0d241bf56b:   and    $0x1fffffff,%r8d
0x00007f0d241bf572:   and    $0x1fffffff,%r10d
0x00007f0d241bf579:   mov    %r10d,%r11d
0x00007f0d241bf57c:   cmp    %r8d,%r11d
0x00007f0d241bf57f:    jne    0x00007f0d241bf595           ;*if_acmpne {reexecute=0 rethrow=0 return_oop=0 return_vt=0}
                                                                                            ; - acmp.FieldsO::cmp0@8 (line 34)
0x00007f0d241bf581:   lea    (%r12,%r9,8),%rdx              ;*getfield f {reexecute=0 rethrow=0 return_oop=0 return_vt=0}
                                                                                           ; - acmp.FieldsO::cmp0@1 (line 34)
0x00007f0d241bf585:   mov    %r9d,(%rsp)
0x00007f0d241bf589:   xchg   %ax,%ax
0x00007f0d241bf58b:   callq  0x00007f0d1c717e80           ; ImmutableOopMap {rbp=NarrowOop [0]=NarrowOop }

In case of compressed oops the picture is same. Besides, there are many questions, why sometimes uncompressed pointer stored to register, but in other cases complex memory load is used. 


Comments
http://hg.openjdk.java.net/valhalla/valhalla/rev/85f25a764824
11-10-2019

http://cr.openjdk.java.net/~thartmann/8231561/webrev.00/ This patch includes fixes for the following performance issues that Sergey found: (1) Clearing of array property bits includes useless mov instructions. (2) When loading the klass of the acmp operands, we already know that one operand is an inline type and therefore don't need to clear the storage properties of the klass pointer. In fact, there are many other places in the code where the LoadKlassNode implementation does not need to clear the storage property bits because we know that either the object can't be an array or we don't load the klass ptr from the object header but from some other location. (3) Implicit null checking does not work for the 'and' instruction that is used for the is_always_locked check because the corresponding MachNode references a constant load of the mask that prevents hoisting. As a result, in the acmp implementation, a later load of the klass is converted to an implicit null check and hoisted to before that check (i.e. it's always executed although it's only needed if the first operand is an inline type). (4) When loading the acmp operands from a field, complex memory loads should be used for decoding when loading the mark word for the is_always_locked check. In addition, I've noticed that the inline type guard for the System.identityHashCode intrinsic is useless because inline types always have the always_locked_pattern set and therefore a subsequent guard will always trigger.
08-10-2019

Thanks Sergey, I've found the problem and fixed it.
08-10-2019

> This is most likely due to missing IR matching rules in C2's backend. I.e., we can only translate certain hard-coded IR shapes to complex memory loads. Do you have examples? c2, level 4, acmp.FieldsI::cmp0, version 466 (113 bytes) [Verified Entry Point] [Verified Value Entry Point] [Verified Value Entry Point (RO)] # this: rsi:rsi = 'acmp/FieldsI' # parm0: rdx:rdx = object # parm1: rcx:rcx = object # [sp+0x30] (sp of caller) 3.88% 0x00007f2eb81c1810: mov %eax,-0x14000(%rsp) 5.17% 0x00007f2eb81c1817: push %rbp 1.62% 0x00007f2eb81c1818: sub $0x20,%rsp ;*synchronization entry ; - acmp.FieldsI::cmp0@-1 (line 35) 4.86% 0x00007f2eb81c181c: mov 0xc(%rdx),%r9d ; implicit exception: dispatches to 0x00007f2eb81c1903 ;*getfield f {reexecute=0 rethrow=0 return_oop=0 return_vt=0} ; - acmp.FieldsI::cmp0@1 (line 35) 2.30% 0x00007f2eb81c1820: mov 0xc(%rcx),%ebp ; implicit exception: dispatches to 0x00007f2eb81c1987 ;*getfield f {reexecute=0 rethrow=0 return_oop=0 return_vt=0} ; - acmp.FieldsI::cmp0@5 (line 35) SK: now r9 & rbp contains our two uncompressed refs 1.18% 0x00007f2eb81c1823: cmp %r9d,%ebp ��� 0x00007f2eb81c1826: je 0x00007f2eb81c186d 1.49% ��� 0x00007f2eb81c1828: test %ebp,%ebp ������ 0x00007f2eb81c182a: je 0x00007f2eb81c1842 ;*if_acmpne {reexecute=0 rethrow=0 return_oop=0 return_vt=0} ������ ; - acmp.FieldsI::cmp0@8 (line 35) 2.16% ������ 0x00007f2eb81c182c: lea (%r12,%rbp,8),%rsi ;*getfield f {reexecute=0 rethrow=0 return_oop=0 return_vt=0} ������ ; - acmp.FieldsI::cmp0@5 (line 35) SK: uncompress rbp 0.87% ������ 0x00007f2eb81c1830: mov $0x405,%r10d 0.02% ������ 0x00007f2eb81c1836: and (%rsi),%r10 SK: the only usage of uncompressed rbp 10.90% ������ 0x00007f2eb81c1839: cmp $0x405,%r10 3.76% ��������� 0x00007f2eb81c1840: je 0x00007f2eb81c1846 4.44% ��������� ��������� 0x00007f2eb81c1842: xor %eax,%eax 0.06% ��� ��������������� 0x00007f2eb81c1844: jmp 0x00007f2eb81c1872 ��� ��������������� 0x00007f2eb81c1846: test %r9d,%r9d ��� ������������ 0x00007f2eb81c1849: je 0x00007f2eb81c1842 ��� ��� ������ 0x00007f2eb81c184b: mov 0x8(%r12,%r9,8),%r10d SK: embedded uncompressing r9 ��� ��� ������ 0x00007f2eb81c1850: mov 0x8(%r12,%rbp,8),%r8d SK: embedded uncompressing rbp (rsi already has it) ��� ��� ������ 0x00007f2eb81c1855: cmp %r10d,%r8d ��� ��� ������ 0x00007f2eb81c1858: jne 0x00007f2eb81c1842 ;*if_acmpne {reexecute=0 rethrow=0 return_oop=0 return_vt=0} ��� ��� ��� ; - acmp.FieldsI::cmp0@8 (line 35) ��� ��� ��� 0x00007f2eb81c185a: lea (%r12,%r9,8),%rdx ;*getfield f {reexecute=0 rethrow=0 return_oop=0 return_vt=0} SK: uncompressing r9, but above we used complex load.
02-10-2019

> 1. Clearing ArrayStorageProperties bits clazz ptr. > Obviously. We check types equality for both arguments and to do it - performs clearing flattened_array & non_null array. > That clearing is not required, because of we already know that one argument is not an array (and free from both bits). Good catch. With my fix, the code looks like this: // Null check first operand 0x00007f8468aeb7a8: test %rcx,%rcx 0x00007f8468aeb7ab: je 0x00007f8468aeb7e1 // Check if first operand is an inline type (always locked pattern) 0x00007f8468aeb7ad: mov $0x405,%r10d 0x00007f8468aeb7b3: and (%rcx),%r10 0x00007f8468aeb7b6: cmp $0x405,%r10 0x00007f8468aeb7bd: jne 0x00007f8468aeb7e1 // Null check second operand 0x00007f8468aeb7bf: test %rdx,%rdx 0x00007f8468aeb7c2: je 0x00007f8468aeb7e1 // Load and compare klass ptrs 0x00007f8468aeb7c4: mov 0x8(%rdx),%r11d 0x00007f8468aeb7c8: mov 0x8(%rcx),%r10d 0x00007f8468aeb7cc: cmp %r11d,%r10d > 2. In case when null never passed as an argument for our comparison, explicit null checks are not generated, traps to recompilation are used. > In that case load the first class ptr is always moved before checking markword for always locked pattern. That is useless for all non-inline types arguments: Right, in this case we generate *implicit* null checks when loading the klass ptrs: // Implicitly null check first operand and load klass ptr 0x00007f8468aeabc8: mov 0x8(%rcx),%r11d ; implicit exception: dispatches to 0x00007f8468aeac8f // Check if first operand is an inline type 0x00007f8468aeabcc: mov $0x405,%r10d 0x00007f8468aeabd2: and (%rcx),%r10 0x00007f8468aeabd5: cmp $0x405,%r10 0x00007f8468aeabdc: jne 0x00007f8468aeabf5 // Implicitly null check second operand and load klass ptr 0x00007f8468aeabde: mov 0x8(%rdx),%r10d ; implicit exception: dispatches to 0x00007f8468aead13 // Compare klass ptrs 0x00007f8468aeabe2: cmp %r10d,%r11d Indeed, it would be better to do the implicit null check at "and (%rcx),%r10". I think this is an existing limitation of the implicit null check optimization. I'll investigate. > In case of compressed oops the picture is same. Besides, there are many questions, why sometimes uncompressed pointer stored to register, but in other cases complex memory load is used. This is most likely due to missing IR matching rules in C2's backend. I.e., we can only translate certain hard-coded IR shapes to complex memory loads. Do you have examples?
02-10-2019