Investigation of 6544710 shows that bad scheduling in gcm causes performance problems.
On Apr 11, 2007, at 10:50 AM, Vladimir Kozlov wrote:
> John Rose wrote:
>> On Apr 11, 2007, at 1:03 AM, Vladimir Kozlov wrote:
>>> 2. Replace set_memory(adr_type) in arraycopy code with
>>> set_all_memory(reset_memory()). It eliminates all the weird
>>> address-expression movement and spills.
>>>
>> I don't like this part, if it makes array copy calls into
>> barriers for every other kind of memory operation.
>> That would inhibit value numbering of field loads around
>> the memory copy. Such field loads (both before
>> and after) are common in Java collections and
>> similar classes involving header/array data structures.
>
> John,
>
> I think this is exactly what is causing the problem.
> Fields are loaded too early (for example, they are loaded before
> arraycopy calls but used only after). With an infinite number of
> registers that may be fine, since a value will be available for use
> even if it misses in the cache. But with a low number of registers,
> or when some registers are killed by calls, it causes
> register spilling to the stack.
So the combination of a register-squeezing stub call
with field shuffling causes a problem. That does not
seem unique to arraycopy.
To fend off this problem for the moment, I suggest
a point-fix: Put a MemBarCPUOrder after the intrinsic.
Keep the memory slices as they are. If the arraycopy
can be optimized away (which it can, sometimes)
then we can manually remove the membar also.
And add a comment describing the observed problem
and the microbenchmark next to the memory barrier.
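A minimal sketch of what that point-fix might look like inside the
arraycopy intrinsic (not a verbatim patch; it assumes the GraphKit
insert_mem_bar() helper and the Op_MemBarCPUOrder opcode, and the exact
placement is illustrative only):

  // ... fast- and slow-path copy code has been emitted above, with the
  // per-slice memory state left intact (no set_all_memory(reset_memory())) ...

  // Pin a CPU-order membar behind the copy so GCM cannot float later
  // field loads up across the register-hungry stub call and spill them.
  insert_mem_bar(Op_MemBarCPUOrder);

  // If the arraycopy itself is optimized away, this membar has to be
  // removed by hand as well.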
> The weird thing is it also causes explicit null check generation.
Yes, that's very weird. Something is going wrong with
frequency estimates, maybe. Tom pointed out the
inconsistencies in the block and branch counts.
On Apr 11, 2007, at 12:32 PM, Tom Rodriguez wrote:
> Try disabling the call to hoist_to_cheaper_block in gcm.cpp. This will force everything to the latest possible block.
Or give hoist_to_cheaper_block some inertia (hysteresis?)
Something similar to (but maybe more clever than):
- if (LCA_freq < least_freq || // Better Frequency
+ if (LCA_freq < least_freq * 0.25 || // Better Frequency
The idea is to hoist only if there's a decisive advantage,
such as moving into a loop header. (I think that's the
only case that code contemplates. Only with loops can
a dominator have a lower frequency than a 'dominee'.)
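A self-contained illustration of that hysteresis idea (this is not gcm.cpp
code and the helper name is made up; the 0.25 margin is the same arbitrary
factor as in the diff above):

  #include <cstdio>

  // Only treat a candidate block as "cheaper" if its frequency beats the
  // current best by a decisive margin, so small frequency differences do
  // not pull a node out of its latest possible block.
  static bool decisively_cheaper(double candidate_freq, double least_freq,
                                 double margin = 0.25) {
    return candidate_freq < least_freq * margin;
  }

  int main() {
    // Hoisting from a loop body (freq 100) to a block at freq 10 is a
    // decisive win; hoisting to a block at freq 90 is not worth it.
    std::printf("%d\n", decisively_cheaper(10.0, 100.0));  // prints 1
    std::printf("%d\n", decisively_cheaper(90.0, 100.0));  // prints 0
    return 0;
  }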
> I think we want to keep John's change. It looks to me like it enabled the elimination of some extra loads of the array length and maybe some others.
Yes, that's true. It helps us fold up user-written range checks if we GVN indexes and limits of all sorts.
User-written range checks occur irregularly both before and after arraycopies.
> I am a bit confused why it really would make any difference. I always assumed that MergeMems would be eliminated if they weren't doing anything and set_all_memory(reset_memory()) doesn't seem like it does anything meaningful.
It acts as a memory barrier: since the slow-path call (if I understand Vladimir's
proposal) will set all of memory, the all-memory phi at the bottom will prevent
GVN from commoning up later memory references with earlier ones.
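A rough analogy in ordinary C++ (this is not C2 IR): the question is whether
the read of src->length after the copy can share the value of the read before
it.

  struct Array { int length; int* body; };

  int length_twice(Array* dst, Array* src, int n) {
    int before = src->length;                    // field load before the copy
    for (int i = 0; i < n; i++)                  // the "arraycopy"
      dst->body[i] = src->body[i];
    int after = src->length;                     // field load after the copy
    // 'after' can be folded into 'before' only if the compiler knows the
    // stores above cannot touch src->length.  Keeping precise memory slices
    // on the copy gives C2 that knowledge; set_all_memory(reset_memory())
    // throws it away and makes the copy an all-memory barrier.
    return before + after;
  }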
> I think that's the Phi of AddP problem I mentioned before. Actually if we were smarter I don't see why we couldn't fold null checks in that case too. They would still work, I just think the code for recognizing opportunities doesn't look for that case.
Good point.
> The real problem is the extra add we emit which could better be performed in the address expression, without consuming an extra live register.
Yes. There's some code in the matcher which clones address expressions; didn't you recently work on that, Tom?
Tom Rodriguez wrote:
I didn't write it, but it only works for AddPs which feed into memory ops. In the case I saw, it was (LoadI (Phi (AddP base 8) (AddP base2 8))) instead of (LoadI (AddP (Phi base base2) 8)), which as far as I can tell is the remains of an attempt to split_thru_phi a load. I think it happens commonly when base is nonnull but base2 is proven nonnull by an explicit check. The extra control flow from the explicit check defeats split_thru_phi, but it doesn't clean up after itself. I was talking to Ross about this since it's causing me problems, so maybe fixing my problem would also remove the silly phi. I'll look into this.
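A rough C++ analogy of the two shapes (illustrative only; a real C++ compiler
may well re-associate these itself, the point is simply where the +8 sits
relative to the merge):

  // Shape that defeats address-mode folding: each branch computes base+8, so
  // the merged value is already a displaced pointer and the add is a separate
  // instruction holding an extra live register; this is the
  // (LoadI (Phi (AddP base 8) (AddP base2 8))) case above.
  int load_split(char* base, char* base2, int use_second) {
    char* p8 = use_second ? (base2 + 8) : (base + 8);
    return *(int*)p8;
  }

  // Shape the matcher can fold: merge the bases first and keep the +8 on the
  // load, so it becomes part of the memory operand ([reg + 8]); this is the
  // (LoadI (AddP (Phi base base2) 8)) case.
  int load_merged(char* base, char* base2, int use_second) {
    char* p = use_second ? base2 : base;
    return *(int*)(p + 8);
  }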