JDK-8221358 : need intrinsics for loop control that can guide iteration range splitting
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: repo-panama
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2019-03-23
  • Updated: 2019-04-28
Description
The optimizing JIT performs a standard transformation on counted loops called iteration range splitting, which classifies ("splits") the indexes of the loop counter (the "iteration range") into three consecutive categories.  This RFE asks for an intrinsic method (private to java.base) to allow library code to help the JIT perform this classification. 

The first and last categories of index, in iteration range splitting, are the ones for which it is more difficult to create optimized code; the middle category (the majority of loop iterations) is the one that is easier to optimize, since the loop counter is in the middle of its range and there are no edge effects (such as array boundaries) to worry about.

The loop transformation makes three (or more) consecutive copies of the loop, re-optimizing each copy in the environment of its split-range category.  The loop that gets optimized the best, and runs the longest, is called the "main loop".  Before that is the "pre-loop" (or "spin-up" loop) and after is the "post-loop" (or "cleanup" loop); these loop copies are less optimized because they must be more careful about edge effects in the iteration range.
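
As an illustration (not drawn from this report itself), a simple counted loop might be decomposed by hand as follows; PRE_LIMIT and MAIN_LIMIT are schematic placeholders for bounds the compiler would compute:

// decomposition of: for (int i = 0; i < a.length; i++) a[i] *= k;
int i = 0;
for (; i < PRE_LIMIT; i++)      // pre-loop ("spin-up"): a few careful
  a[i] *= k;                    //   iterations near the lower edge
for (; i < MAIN_LIMIT; i++)     // main loop: no edge effects, so checks
  a[i] *= k;                    //   hoist out and it optimizes best
for (; i < a.length; i++)       // post-loop ("cleanup"): the remaining
  a[i] *= k;                    //   iterations near the upper edge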

There are a number of uses of pre- and post-loops, and there might be more than one of either kind.  For example, if the main loop is also unrolled 8 ways, there might be a one-iteration pre-loop (which as a special case is called a "peeled" loop iteration), followed by another pre-loop that runs up to 7 times to perform some kind of alignment operation.  Or, the post-loop might run up to 7 times to process the iterations left over by the 8-way unrolled main loop.
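
For concreteness, here is a hand-written sketch of that 8-way-unrolled shape; body(i) stands for the loop kernel and ALIGN_LIMIT is a hypothetical alignment bound:

int i = 0;
if (i < n) { body(i); i++; }          // peeled iteration (one-shot pre-loop)
for (; i < ALIGN_LIMIT && i < n; i++)
  body(i);                            // alignment pre-loop: up to 7 iterations
int main = n - (n - i) % 8;
for (; i < main; i += 8) {            // main loop, unrolled 8 ways
  body(i);     body(i + 1); body(i + 2); body(i + 3);
  body(i + 4); body(i + 5); body(i + 6); body(i + 7);
}
for (; i < n; i++)
  body(i);                            // cleanup post-loop: up to 7 iterations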

This optimization is especially useful with vector processing, since the main loop may be vectorizable (as well as profitable to unroll), pushing the responsibility for handling a partial vector of data into a post-loop, one scalar at a time.  There might also be a pre-loop over scalars which runs up to some desirable alignment point (such as a cache line boundary).  If aligned vector operations are cheaper than unaligned ones, the pre-loop might be vectorized but unaligned.  If unmasked vector operations are cheaper than masked ones, but the latter are still efficient, then either the pre- or post-loop might use a single iteration of a masked, vectorized operation, so that the main loop can use unmasked (as well as perhaps aligned and unrolled) operations.
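
For example, here is a sketch of such a loop using the Vector API as it later stabilized in jdk.incubator.vector (the repo-panama API of the time differs in detail): the main loop is unmasked, and a single masked iteration serves as the post-loop.

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

class AddLoop {
  static final VectorSpecies<Integer> S = IntVector.SPECIES_PREFERRED;

  static void add(int[] a, int[] b, int[] c) {
    int n = a.length;
    int upper = S.loopBound(n);            // largest multiple of VL <= n
    int i = 0;
    for (; i < upper; i += S.length()) {   // main loop: unmasked
      IntVector va = IntVector.fromArray(S, a, i);
      IntVector vb = IntVector.fromArray(S, b, i);
      va.add(vb).intoArray(c, i);
    }
    if (i < n) {                           // post-loop: one masked iteration
      var m = S.indexInRange(i, n);        // lanes for indexes i..n-1 active
      IntVector va = IntVector.fromArray(S, a, i, m);
      IntVector vb = IntVector.fromArray(S, b, i, m);
      va.add(vb).intoArray(c, i, m);
    }
  }
}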

Such pre/main/post decompositions can be done purely in library code, if the loop kernel is present as a lambda, but this form of source code is presently difficult to optimize reliably (due to inlining vagaries).  The decomposition is often done by hand instead, especially for a scalar post-loop, but this is awkward and hard to maintain.
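
A library-side decomposition might look like the following sketch (the helper and its names are hypothetical, not an existing API); its performance depends on both kernel call sites inlining reliably, which is exactly the fragility noted above:

import java.util.function.IntFunction;

class LoopSplitter {
  interface MaskedKernel<M> { void accept(int offset, M mask); }

  // usage: splitLoop(a.length, VL, FULL_MASK,
  //                  r -> makeIntervalMask(0, r),
  //                  (off, m) -> { /* masked kernel at offset off */ });
  static <M> void splitLoop(int limit, int vl, M fullMask,
                            IntFunction<M> shortMask,
                            MaskedKernel<M> kernel) {
    int i = 0;
    int mainLimit = limit - limit % vl;
    for (; i < mainLimit; i += vl)
      kernel.accept(i, fullMask);                    // main loop: trivial mask
    if (i < limit)
      kernel.accept(i, shortMask.apply(limit - i));  // post-loop: short mask
  }
}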

If a loop is coded using a fixed-size vector type with an optional mask, then the computation of the mask can be characterized as follows:

Inputs:  The current index and the index limit, plus the vector lane count VL.
Output:  A mask of VL lanes, with active lanes covering the smaller of VL and the number of remaining indexes to loop over.
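
In scalar terms (an illustrative restatement, using the mask constructors of the example further below):

int remaining = limit - count;
int active = Math.min(VL, remaining);  // the "min" discussed next
var mask = (active == VL)
    ? FULL_MASK                        // trivial mask: all lanes active
    : makeIntervalMask(0, active);     // short mask: some lanes active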

The "max" operator is the place where the logic choses between two fundamentally different masks:  A trivial mask of "all lanes active" and a short mask of "some lanes active".  The trivial mask should be the one used by the main loop, while a new copy of the loop, deployed as a post-loop, should make one iteration with the short mask.  The trivial mask should be  clearly optimizable as a no-op, and any vector instruction which uses it is a candidate for selection as the unmasked (simpler, faster) version.

One missing bit to make all this optimize well is a way for the mask generation function to inform the JIT that one of its outputs pertains to the main loop, while the other output pertains to the post-loop.

The present RFE asks for an advisory method that conveys this information, as follows:

/**
 * Returns its argument unchanged.  In addition, if the value is
 * used as a branch predicate and it is true, then compilers are
 * invited to inspect the following code as a candidate for
 * inclusion in a post-loop of an iteration range splitting
 * transformation of an enclosing loop.  Likewise, if the value
 * is false, compilers are invited to inspect the following code
 * (on the other branch of the conditional) as a candidate for
 * inclusion in a more heavily optimized main loop.
 */
@HotSpotIntrinsicCandidate
static boolean forPostLoop(boolean z) { return z; }

Example:

@ForceInline
Vector.Mask nextIterationMask(int count, int limit) {
  int remaining = limit - count;
  if (forPostLoop(remaining < VL))
    return makeIntervalMask(0, remaining);
  return FULL_MASK;
}

// hand-written:
for (int count = 0; count < limit; count += VL) {
  var mask = nextIterationMask(count, limit);
  ...do masked stuff at offset count...
}

// transformed main-loop
int count = 0;
var limit1 = limit - limit % VL;
for (; count < limit1; count += VL) {
  var mask = FULL_MASK;
  ...do masked stuff at offset count...
}
// transformed post-loop
if (count < limit) {
  int remaining = limit - count;
  var mask = makeIntervalMask(0, remaining);
  ...do masked stuff at offset count...
}


Comments
Relevant discussion on vector loop shapes on the mailing list: http://mail.openjdk.java.net/pipermail/panama-dev/2018-August/002525.html
23-03-2019