JDK-8223051 : support loops with long (64b) trip counts
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 13
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2019-04-27
  • Updated: 2024-07-17
  • Resolved: 2020-10-19
JDK 16
16 b21 Fixed
Description
Many core loop transformations apply to counted loops, which
are those with a calculated trip count.  Transformations
include unrolling, iteration range splitting (array RCE),
and strip mining (JDK-8186027).  The optimizer performs many
complicated pattern matches to detect and transform counted
loops.

Most or all of these pattern matches and transformations
apply to loops with 32-bit control variables and arithmetic.
This makes sense as long as bulk operations apply only to
Java arrays, since those arrays can only span a 31-bit index
range.  Newer APIs for larger blocks of bulk data will
introduce 64-bit indexes, such as Panama's native arrays and
(possibly) range-expanded byte buffers.  Under the hood, the
Unsafe API routinely works with 64-bit addresses and address
arithmetic.  Loops which work on such data structures
naturally use 64-bit values, either as direct Java longs, or
as wrapped cursor structures with incrementing long
components (Panama pointers).

There needs to be a story for transforming such long-running
loops.  This RFE is a request for that story.

A complicating factor is that sometimes counted loops have
no safepoints, on the assumption that the largest possible
iteration (across 32 bits of dynamic range) won't cause the
JVM's safepoint mechanism to malfunction due to a
non-responsive thread stuck in such a counted loop.  This
assumption is invalid in the 64-bit case.  Luckily, we have
a (relatively new) optimization which can address this
problem, by strip-mining a single very long running loop
into a sequence (outer loop) of loops with appropriately
bounded trip counts.

This observation leads to a suggested approach to this RFE.
Generalize the strip-mining transform to operate on
trip-counted loops which use 64-bit arithmetic.  In the
inner loops, use a normal 32-bit counter, constrained (of
course) to avoid overflow.  This should enable all the other
loop transforms to work on the inner loops, without them
"knowing" that they are working on a strip-mined 64-bit
loop.

The strip mining transform, along with the index variable
size reduction, would look something like this:

```
void processRange(long start, long end) {
  for (long li = start; li < end; li++) {
    checkIndexLong(li, a.length());
    a.put(li, a.get(li) + 1);
  }
}
  ==STRIPMINE==>
void processRange(long start, long end) {
  for (long li = start; li < end; ) {
    int stripMax = LoopStripMiningIter;
    int strip = (li + stripMax < end) ? stripMax : (int)(end - li);
    for (int si = 0; si < strip; si++) {
      long li_ = li + si;
      checkIndexLong(li_, a.length());
      a.put(li_, a.get(li_) + 1);
    }
    li += strip;
  }
}
```
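As a sanity check, the strip-mined shape can be written out as ordinary Java and run directly. This is a hedged sketch, not the C2 transform: the data structure is modeled as a plain `long[]`, `STRIP_MAX` stands in for the JVM's `LoopStripMiningIter` flag, and the bounds are assumed to be valid array indices (full-range 64-bit bounds would additionally need the overflow guards discussed in the comments on this RFE).

```java
public class StripMineDemo {
    static final int STRIP_MAX = 1000;  // stands in for LoopStripMiningIter

    // Strip-mined version of the 64-bit counted loop: the outer loop
    // advances a long cursor; the inner loop is an ordinary int loop.
    static void processRange(long[] a, long start, long end) {
        for (long li = start; li < end; ) {
            // Computing the remainder first (rather than testing
            // li + stripMax < end) avoids overflow near Long.MAX_VALUE.
            long remaining = end - li;
            int strip = remaining < STRIP_MAX ? (int) remaining : STRIP_MAX;
            for (int si = 0; si < strip; si++) {
                int i = Math.toIntExact(li + si);  // demo backing store is int-indexed
                a[i] += 1;
            }
            li += strip;
        }
    }
}
```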

In the future, Valhalla will introduce primitive-like
"inline types" that can wrap hardware vectors of 128 bits or
larger.  The possibility exists (in theory) of treating
those larger types as capable of arithmetic.  In that case,
the JIT may encounter loops over other types besides Java
longs, and wider than 64 bits.  The initial version of this
RFE need not support that, but it should be factored so that
adding more kinds of index types (besides T_INT and T_LONG)
will not require a total rewrite.  I say "should" not "must"
because this point is a goal not a requirement of this RFE.

A separate part of getting the benefit from this work is
recognizing the places where range checking occurs, so that
iteration range splitting can be done (inside each strip of
the outer loop).  The JVM does this in a hardwired way for
Java arrays, and also (as of JDK-8042997) on library-coded
data structures which happen to use the intrinsic method
`Objects.checkIndex` to check their indexes.

For library-coded data structures with 64-bit indexes, the
JIT can be set up to look for a 64-bit version of
`checkIndex`.  (That is TBD, but see `checkIndexLong`
above.)  Range checks on the long index can be reduced to
range checks on the shortened inner loop index.
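For illustration, such a 64-bit check could look like the following sketch. `checkIndexLong` is the hypothetical name used in this RFE's examples, not an existing JDK method (a long overload of `Objects.checkIndex` was in fact added later, in JDK 16):

```java
public class CheckIndexDemo {
    // Hypothetical 64-bit analog of Objects.checkIndex, as used in the
    // examples here: returns the index if 0 <= index < length, else throws.
    static long checkIndexLong(long index, long length) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException(
                "Index " + index + " out of bounds for length " + length);
        }
        return index;
    }
}
```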

Here is an example, based on the previous one, of splitting
the inner loop into pre-, main, and post-loops:

```
  ==RCE==>
void processRange(long start, long end) {
  long mainStartLI = Long.max(start, 0);
  long mainLimitLI = Long.min(end, a.length());
  for (long li = start; li < end; ) {
    int stripMax = LoopStripMiningIter;
    int strip = (li+stripMax < end) ? stripMax : (int)(end-li);
    int si = 0;
    int mainStartSI, mainLimitSI;
    if (li+strip <= mainStartLI)
      mainStartSI = mainLimitSI = strip; //or 0
    else if (li >= mainLimitLI)
      mainStartSI = mainLimitSI = 0; //or strip
    else {
      mainStartSI = (int)Long.max(mainStartLI-li, 0);
      mainLimitSI = (int)Long.min(mainLimitLI-li, strip);
    }
    for (; si < mainStartSI; si++) { //PreLoop
      long li_ = li + si;
      checkIndexLong(li_, a.length());
      a.put(li_, a.get(li_) + 1);
    }
    for (; si < mainLimitSI; si++) { //MainLoop
      long li_ = li + si;
      //RCE!// checkIndexLong(li_, a.length());
      a.put(li_, a.get(li_) + 1);
    }
    for (; si < strip; si++) { //PostLoop
      (... same as PreLoop ...)
    }
    li += strip;
  }
}
```
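The per-strip clamping arithmetic above can be exercised on its own. The following hedged helper (names follow the pseudocode, not any JDK API) computes the `[mainStartSI, mainLimitSI)` window for one strip:

```java
public class RceClampDemo {
    // For a strip covering long indices [li, li+strip), return the sub-range
    // [mainStartSI, mainLimitSI) of local indices within [0, strip) that lies
    // fully inside the checked bounds [mainStartLI, mainLimitLI).
    static int[] mainRange(long li, int strip, long mainStartLI, long mainLimitLI) {
        int mainStartSI, mainLimitSI;
        if (li + strip <= mainStartLI) {
            mainStartSI = mainLimitSI = strip;   // strip entirely before bounds
        } else if (li >= mainLimitLI) {
            mainStartSI = mainLimitSI = 0;       // strip entirely after bounds
        } else {
            mainStartSI = (int) Math.max(mainStartLI - li, 0);
            mainLimitSI = (int) Math.min(mainLimitLI - li, strip);
        }
        return new int[] { mainStartSI, mainLimitSI };
    }
}
```

With bounds `[0, 100)` and strips of 10, a strip starting at -5 gets a main window of `[5, 10)`, one starting at 95 gets `[0, 5)`, and strips entirely outside the bounds get an empty window, so the main loop never needs a check.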

Or it may be more reasonable to split the outer loop itself
into pre/main/post copies, in which case the 32-bit versions
of the split values are not computed.

(Note: The gathering of constraints is a complex matter.
See `PhaseIdealLoop::add_constraint` in C2 for more details.
It seems it would need to be generalized to handle
constraints on 64-bit values.)

The request to generalize iteration range splitting over
64-bit ranges should be broken out into a separate RFE or
subtask, when work starts on this RFE.

Comments
Changeset: e76de189 Author: Roland Westrelin <roland@openjdk.org> Date: 2020-10-19 11:30:13 +0000 URL: https://git.openjdk.java.net/jdk/commit/e76de189
19-10-2020

At least for use cases with Panama memory segments, it is reasonable to assume that a loop with a 64-bit trip count will almost always have a normal number of trips, say under a million. The starting and ending "pointers" of such a loop might possibly be full-range 64-bit integers (encoding virtual addresses or file offsets). But the difference between start and end is likely to be small, except in cases where a very large block of memory (multiple gigabytes) is being processed. Therefore, supporting such loops does not need to be an invasive change which generalizes the pattern matching and transformation of loop IR. The loop could be transformed like this:

```
// input:
long start = ..., end = ...;
for (long i = start; i < end; i++) { body(i); }

// transformed:
long n = (end > start) ? (end - start) : 0;
int BIG = 1 << 30;
if (n < BIG && n >= 0) {
  for (int i = 0; i < (int)n; i++) { body(start+i); }
} else {
  // maybe uncommon trap here
  long mid = start;
  while (mid < end) {
    long m = (end - mid);
    if (m > BIG || m < 0)  m = BIG;
    for (int i = 0; i < (int)m; i++) { body(mid+i); }
    mid += m;
  }
}
```

Some of the if-statements might seem unnecessary. They are present to cover rare cases of overflow. Specifically, if end is very positive and start is very negative, then end-start can overflow to a negative number. This is the root cause of the comparisons against zero (n >= 0, m < 0). What's really happening there is that we intend to perform an unsigned comparison against a possibly overflowing difference (end-start, end-mid), but if using signed comparisons we must make a separate comparison against zero. Of course, a compiler can select unsigned comparisons to perform the test in one instruction.

The uncommon trap, if present, would allow the JIT to get away with pretending that the loop is really just a 32-bit loop with 64-bit base addresses. This is probably common. When this assumption fails, the more general strip-mined loop can be used as the only version of the loop, or else both versions can be kept if (a) there might be wide dynamic variation in the size of the trip count and (b) the original 32-bit code is somehow more efficient.
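The transform described in this comment can be written out as plain Java and tested. This is a hedged sketch with the loop body abstracted as a `LongConsumer`; the uncommon trap is just the else branch here:

```java
import java.util.function.LongConsumer;

public class LongTripDemo {
    static final int BIG = 1 << 30;

    // Fast path: if the trip count provably fits an int loop, run one.
    // Otherwise, strip-mine the long range into int-sized chunks.
    static void run(long start, long end, LongConsumer body) {
        long n = (end > start) ? (end - start) : 0;
        if (n < BIG && n >= 0) {                 // n >= 0 guards a wrapped difference
            for (int i = 0; i < (int) n; i++) body.accept(start + i);
        } else {                                 // a JIT might place an uncommon trap here
            long mid = start;
            while (mid < end) {
                long m = end - mid;
                if (m > BIG || m < 0) m = BIG;   // clamp; m < 0 means the difference wrapped
                for (int i = 0; i < (int) m; i++) body.accept(mid + i);
                mid += m;
            }
        }
    }
}
```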
19-10-2019

And some ideas from a conversation with kvn:

If we convert the JIT to internally prefer intptr_t for loop transforms, the treatment of bytecoded loops will differ on 32-bit and 64-bit platforms. On 32-bit platforms, int loops will "just work" as today, and long loops will not be specially optimized (as today also). This is the setting in which C2 loop transforms (and Java arrays) were designed.

On 64-bit platforms (the common case today), today's int loops from source code will have to be internally raised by the JIT to use 64-bit registers, but those register values will continue to carry TypeLong range information that makes it clear that the 64-bit values take only a 32-bit dynamic range. Loops that cause 32-bit overflow (a pathological, uncommon case) will have to be carefully managed after lifting iteration variables to the natural 64-bit size. On the other hand, most loop transforms avoid overflow by predicating and range splitting, and this is sometimes done with 64-bit helper logic. That logic should continue to work smoothly as long as we predicate that the long temporary values will not overflow.

Vladimir (both of them, I think) suggests that this could be done by speculating that *all* 64-bit temps involved in loop transforms are within some generous subrange, such as a 56-bit dynamic range (reserving the top 8 bits for "slop" and overflow). This is realistic as long as the iteration variables in question are offsets into reasonably sized virtual memory structures (less than exabyte scale), or (if they are unsafe hardware addresses) the virtual memory addresses have all-one or all-zero bits near the top 8 bits of the pointer. Both are reasonable working assumptions.

Note that we have a lot of bug-prone logic in C2 which corrects errors from extending 32-bit offsets into 64-bit addressing modes (base plus scaled offset, for example). I hope this logic can be simplified if we simply standardize on native offset sizes as well as native pointer sizes in our loop transform logic.
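As a hedged illustration (not C2 code), the 56-bit-range predicate itself is cheap to express: a value is in range exactly when it survives a sign-extending round trip that drops the top 8 bits, and sums or differences of two such values stay far from 64-bit overflow.

```java
public class RangeSpeculationDemo {
    // Sketch of the speculated predicate: true iff v fits in 56 signed bits.
    // The shift left then arithmetic shift right reconstructs v only when
    // the top 8 bits were pure sign extension ("slop" bits).
    static boolean in56BitRange(long v) {
        return v == ((v << 8) >> 8);
    }
}
```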
16-07-2019

Some ideas for fixing this (courtesy of vlivanov):

(1) Implement a loop transformation (similar to strip mining) which converts a loop with a long trip count into a nested loop whose inner loop has an int trip count. Apply all existing loop transformations to the inner loop with the int trip count.

(2) Speculatively insert a loop predicate when a long serves as a trip counter, speculating that it stays in the int dynamic range. This limits use cases to indexes under 2G.

(3) Change C2 loop optimizations to work on intptr_t instead of int, long, or int-and-long. That is more of a 1-1 change than adding support for longs next to ints. Today's int loops can use widened intptr_t iteration variables. On 32-bit platforms, the speculative trick (2) might be useful, or maybe we just don't care.

(1) requires more work on the C2/Graal side, while (2) may be a temporary workaround on the library/JDK side until a better solution exists on the JVM side. (3) is, in my opinion, a semi-principled fix. It would have the beneficial side effect of removing complicated workarounds for generating strided addressing modes on 64-bit platforms.
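Idea (2) can be sketched in plain Java. This is a hedged, illustrative sketch: in a JIT the failing branch would be an uncommon trap that triggers recompilation, not a fallback loop.

```java
import java.util.function.LongConsumer;

public class SpeculatePredicateDemo {
    // Idea (2): predicate that the whole long range fits the int dynamic
    // range, then run a normal int counted loop; otherwise fall back
    // (where a JIT would deoptimize and recompile without the predicate).
    static void run(long start, long end, LongConsumer body) {
        long n = end - start;                    // may wrap for extreme bounds
        if (start < end && 0 < n && n <= Integer.MAX_VALUE) {
            for (int i = 0; i < (int) n; i++) body.accept(start + i);
        } else {
            for (long i = start; i < end; i++) body.accept(i);
        }
    }
}
```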
11-07-2019