Relates :
|
|
Relates :
|
|
Relates :
|
|
Relates :
|
|
Relates :
|
|
Relates :
|
|
Relates :
|
|
Relates :
|
|
Relates :
|
|
Relates :
|
|
Relates :
|
|
Relates :
|
|
Relates :
|
Many core loop transformations apply to counted loops, which are those with a calculated trip count. Transformations include unrolling, iteration range splitting (array RCE), and strip mining (JDK-8186027). The optimizer performs many complicated pattern matches to detect and transform counted loop. Most or all of these pattern matches and transformations apply to loops with 32-bit control variables and arithmetic. This makes sense as long as bulk operations apply only to Java arrays, since those arrays can only span a 31-bit index range. Newer APIs for larger blocks of bulk data will introduce 64-bit indexes, such as Panama's native arrays and (possibly) range-expanded byte buffers. Under the hood, the Unsafe API routinely works with 64-bit addresses and address arithmetic. Loops which work on such data structures naturally use 64-bit values, either as direct Java longs, or as wrapped cursor structure with incrementing long components (Panama pointers). There needs to be a story for transforming such long-running loops. This RFE is a request for that story. A complicating factor is that sometimes counted loops have no safepoints, on the assumption that the largest possible iteration (across 32 bits of dynamic range) won't cause the JVM's safepoint mechanism to malfunction due to a non-responsive thread stuck in such a counted loop. This assumption is invalid in the 64-bit case. Luckily, we have a (relatively new) optimization which can address this problem, by strip-mining a single very long running loop into a sequence (outer loop) of loops of with appropriately bounded trip counts. This observation leads to a suggested approach to this RFE. Generalize the strip-mining transform operate on trip-counted loops which use 64-bit arithmetic. In the inner loops, use a normal 32-bit counter, constrained (of course) to avoid overflow. This should enable all the other loop transforms to work on the inner loops, without them "knowing" that they are working on a strip-mined 64-bit loop. The strip mining transform, along with the index variable size reduction, would look something like this: ``` void processRange(long start, long end) { for (long li = start; li < end; li++) { checkIndexLong(li, a.length()); a.put(li, a.get(li) + 1); } ==STRIPMINE==> void processRange(long start, long end) { for (long li = start; li < end; ) { int stripMax = LoopStripMiningIter; int strip = (li + stripMax < end) ? stripMax : (int)(end - li); for (int si = 0; si < strip; si++) { long li_ = li + si; checkIndexLong(li_, a.length()); a.put(li_, a.get(li_) + 1); } li += strip; } ``` In the future, Valhalla will introduce primitive-like "inline types" that can wrap hardware vectors of 128 bits or larger. The possibility exists (in theory) of treating those larger types as capable of arithmetic. In that case, the JIT may encounter loops over other types besides Java longs, and wider than 64 bits. The initial version of this RFE need to support that, but it should be factored so that adding more kinds of index types (besides T_INT and T_LONG) will not require a total rewrite. I say "should" not "must" because this point is a goal not a requirement of this RFE. A separate part of getting the benefit from this work is recognizing the places where range checking occurs, so that iteration range splitting can be done (inside each strip of the outer loop). The JVM does this in a hardwired way for Java arrays, and also (as of JDK-8042997) on library-coded data structures which happen to use the intrinsic method `Objects.checkIndex` to check their indexes. For library-coded data structures with 64-bit indexes, the JIT can be set up to look for a 64-bit version of `checkIndex`. (That is TBD, but see `checkIndexLong` above.) Range checks on the long index can be reduced to range checks on the shortened inner loop index. Here is an example, based on the previous one, of splitting the inner loop into pre-, main, and post-loops: ``` ==RCE==> void processRange(long start, long end) { long mainStartLI = Long.max(start, 0); long mainLimitLI = Long.min(end, a.length()); for (long li = start; li < end; ) { int stripMax = LoopStripMiningIter; int strip = (li+stripMax < end) ? stripMax : (int)(end-li); int si = 0; int mainStartSI, mainLimitSI; if (li+strip <= mainStartLI) mainStartSI = mainLimitSI = strip; //or 0 else if (li >= mainLimitLI) mainStartSI = mainLimitSI = 0; //or strip else { mainStartSI = (int)Long.max(mainStartLI-li, 0); mainLimitSI = (int)Long.min(mainLimitLI-li, strip); } for (; si < mainStartSI; si++) { //PreLoop long li_ = li + si; checkIndexLong(li_, a.length()); a.put(li_, a.get(li_) + 1); } for (; si < mainLimitSI; si++) { //MainLoop long li_ = li + si; //RCE!// checkIndexLong(li_, a.length()); a.put(li_, a.get(li_) + 1); } for (; si < strip; si++) { //PostLoop (... same as PreLoop ...) } li += strip; } ``` Or it may be more reasonable to split the outer loop itself into pre/main/post copies, in which case the 32-bit versions of the split values are not computed. (Note: The gathering of constraints is a complex matter. See `PhaseIdealLoop::add_constraint` in C2 for more details. It seems it would need to be generalized to handle constraints on 64-bit values.) The request to generalize iteration range splitting over 64-bit ranges should be broken out into a separate RFE or subtask, when work starts on this RFE.
|