JDK-8259396 : Investigate nested irreducible loops creation in 8253353 case
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 16,17
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2021-01-07
  • Updated: 2022-11-25
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other
tbdUnresolved
Related Reports
Relates :  
Description
Running test case from JDK-8253353 shows that a lot (around 400) of nested  irreducible loops were created after JDK-8240576 fix.

Before JDK-8240576 I see next loops from TraceLoopOpts with only 2 IRREDUCIBLE outer loops: 

Loop: N0/N0 has_call has_sfpt 
  Loop: N6764/N6752 IRREDUCIBLE has_call 
    Loop: N2317/N3581 IRREDUCIBLE sfpts={ 4042 5697 } 
    Loop: N6785/N4367 limit_check profile_predicated predicated counted [int,int),+1 (-1 iters) has_call has_sfpt sfpts={ 5860 } 
    Loop: N6794/N4380 limit_check profile_predicated predicated counted [int,int),+1 (-1 iters) has_call has_sfpt sfpts={ 5674 } 
    Loop: N6803/N4566 limit_check profile_predicated predicated counted [int,int),+1 (-1 iters) has_call has_sfpt sfpts={ 5687 } 
    Loop: N6812/N4577 limit_check profile_predicated predicated counted [int,int),+1 (-1 iters) has_call has_sfpt sfpts={ 5971 } 
    Loop: N6821/N4722 limit_check profile_predicated predicated counted [int,int),+1 (-1 iters) has_call has_sfpt sfpts={ 6047 } 

After that fix there is very long nested list of IRREDUCIBLE loops: 

Loop: N0/N0 has_call has_sfpt 
  Loop: N6752/N3584 IRREDUCIBLE has_call 
    Loop: N6764/N3585 IRREDUCIBLE has_call 
      Loop: N6776/N3586 IRREDUCIBLE has_call 
        Loop: N6788/N3587 IRREDUCIBLE has_call 
          Loop: N6800/N3588 IRREDUCIBLE has_call 
            Loop: N6812/N3589 IRREDUCIBLE has_call 
              Loop: N6824/N3590 IRREDUCIBLE has_call 
                Loop: N6836/N3591 IRREDUCIBLE has_call 
.... 
    Loop: N10592/N3581 IRREDUCIBLE 
      Loop: N10604/N3580 sfpts={ 4042 5697 } 
        Loop: N10614/N4367 limit_check profile_predicated predicated counted [int,int),+1 (-1 iters) has_call has_sfpt sfpts={ 5860 } 
        Loop: N10623/N4380 limit_check profile_predicated predicated counted [int,int),+1 (-1 iters) has_call has_sfpt sfpts={ 5674 } 
        Loop: N10632/N4566 limit_check profile_predicated predicated counted [int,int),+1 (-1 iters) has_call has_sfpt sfpts={ 5687 } 
        Loop: N10641/N4577 limit_check profile_predicated predicated counted [int,int),+1 (-1 iters) has_call has_sfpt sfpts={ 5971 } 
        Loop: N10650/N4722 limit_check profile_predicated predicated counted [int,int),+1 (-1 iters) has_call has_sfpt sfpts={ 6047 } 

Comments
[~epeter] Thank you very much for such deep and thorough analysis not only of this issue but also JDK-8240576 which led to it. I am fine with putting this issue investigation on hold as you suggested.
25-11-2022

My current best guess is that maybe we could fix merge_many_backedges for irreducible loops. But the benefit is fairly limited. A simple fix worth trying is to simply tag all of the backedges of the old loop-head as irreducible, which prevents LoopNode insertion for any of them. But maybe that has some unforseen consequences. My suggestion: for now we leave this issue rest, and I try to re-investigate once I have the general issue of "dead-loop" detection fixed. We should also add an assert that checks that if we find an irreducible loop entry/loop head during build_loop_tree, all of these regions are just RegionNodes and not LoopNodes. LoopNodes guarantee that we have a canonical reducible loop head.
25-11-2022

I would still like to know why we decided to avoid merge_many_backedges for irreducible loops. JDK-8240576, its change: https://hg.openjdk.java.net/jdk/jdk/rev/0238badf51bc And the review: https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2020-March/037314.html I just analyzed it myself, see the "before_JDK-8240586.zip" with a series of images of the graph, where we do merge_man_backedges also for irreducible graphs. Before escape-analysis, we perform a first round of loop-opts. It is there where we perform the merge_many_backedges on "316 Region". Sadly, as analyzed in the review of JDK-8240576, we do not properly keep the _irreducible flag in the loop. Before merge_many_backedges: Loop: N316/N313 IRREDUCIBLE Loop: N316/N359 sfpts={ 325 } After merge_many_backedges: Loop: N316/N407 IRREDUCIBLE Loop: N316/N313 This is wrong, N316/N313 is definately irreducible, and this creates issues later. Maybe this _irreducible flag should just be set for both, but I would have to do a deeper analysis to be sure about this. We then perform a split_fall_in on "316 Region", which creates a "413 Region", that takes the "407" backedge, and "316 Region" only keeps the "313" backedge. Sadly, "316/313" has _irreducible=false, so now that "316 Region" only has two control inputs, and it is a loop head, we assume it is a "canonicalized" reducible loop head, and insert a "419 Loop" to replace "316 Region". This is a bad idea, because the loop has a secondary entry at "299 Region". So once we lose entry-control to "419 Loop", we will decide the loop is dead, even though it might still be alive with the control entering through "299 Region". Something like this happens, though it is a bit more complicated. After escape-analysis, during IGVN, we lose the 386->413 entry control for "413 Region", and collapse it. Its backedge "413/359" is transferred to "419 Loop" as entry control. For now, the control flow does not realize that things have gone wrong. But we then try to transform "320 Phi", which belongs to "419 Loop". We see that has a unique input (uin). A unique input in a Phi of a LoopNode can mean: backedge has died (TOP), we only have values from entry. They share the same value, backedge does not modify the value. Entry has died (TOP), and the uin value is not equals to the entry-value. In PhiNode::is_data_loop, we check if uin is equal to the entry-value. uin here is "303 Phi", the uncast-entry-value happens to be "320 Phi". They are not equal, so we decide the loop has died (bit it has not died!!). The phi is replaced with TOP. It starts to rip out parts of the graph, and leaves it in an inconsistent state. Eventually, we attempt to transform an "311 If" that only has a "312 IfTrue" left, "313 IfFalse" was already ripped out in an inconsistent way. This triggers the assert "failed: bad if #1".
25-11-2022

[~kvn] asked me some more questions, that I will answer here: 1. does TestNestedIrreducibleLoops::test really contain irreducible loops? TestNestedIrreducibleLoopsMain.java says that it tests deep nested irreducible loops. Structure: Start -> L3135, L31 L22 -> L3135, L31 L31 -> L3135 (about 300x), L3084 L3084 -> L3104 (2x), L3084 (self) L3104 -> L3135 (3x), return L3130 -> L3153, L3135 L3135 -> L3130, return L3153 -> L22, return We have L3135 loop head. Loop via L3130 back to L3135. And via L3130, L3153, L22, L31, back to L3135. Start enters the loop at 2 points, thus we have irreducibility. 2. What is a better representation: with or without merge_many_backedges? With reducible loops, we want to make loops "canonical", that means we want a LoopNode with a single fall-in edge (for that we have split_fall_in), and with a single backedge (for that we have merge_many_backedges). Both of these are done in IdealLoopTree::beautify_loops. Once we can guarantee that a loop can only be entered through the one edge to the loop head, and we only have a singe backedge, and we know it is a reducible loop, then we replace the RegionNode with a LoopNode. When a LoopNode loses its entry-control, we know the loop is dead. If it loses the backedge, we know the loop has broken. For irreducible loops things get significantly trickier. We have at least one cycle of nodes that has at least two regions where we can reach the cycle from outside. I call those regions that can be reached from the outside "entries". During build_loop_tree, one of the entries is picked as "head", though which one is picked depends on the DF traversal order, which can change over time. Thus, over a compilation, more than one of the entries might be "head" at some time. I define "secondary entry" as an entry that is not "head". When a secondary entry suddenly becomes head (because the DF traversal order changed), the forward edges into it suddenly become backedges. And the backedge of the previous head becomes a forward edge. In principle, it would still be nice to have every entry ensure it only has one fall-in, and one loop-internal edge (backedge/forward-edge depending if it is head or secondary entry at the time). But other than in the reducible loop, if we lose the fall-in edge, this does not mean that the loop is unreachable, there may be another entry that has a live fall-in. If the loop-internal edge dies, this means that the entry now is no longer part of the loop (it floats outside the loop). There are multiple possible outcomes: a) The loop is still irreducible, and the fall-in edge now connects below the old entry into the irreducible loop, through possibly a new entry region. b) The loop is now reducible, if it only has a single entry remaining. c) The loop is broken, the internal edge was used by all possible cycles of the loop. Thus, even if we "canonicalized" the irreducible loop entries, we cannot say much useful if we either lose the fall-in edge or the loop-internal edge. All of this becomes even more complicated once we have nested reducible and irreducible loops. Generally, I cannot claim to understand what are all the possibilities. Also, build_loop_tree might find different regions to be "irreducible loop entries", depending on the DF traversal order (see irred.entry.df.order.jpg). 3. Is there a performance difference? Reducible loops gain something from being canonicalized this way, since we have a simpler LoopNode to deal with, which also collapses in a predictable way once the fall-in or backedge is deleted. This is nice because we do (as far as I understand) not have to do reachability checks. For irreducible loops, we do not have these compile-time benefits. For the generated code: As far as I know, we do most optimizations on Loops that we have detected to be not irreducible. So we probably generate less optimized code for irreducible loops, generally speaking - and we will not change that by canonicalizing the entries. The only benefit I see currently: merging backedges reduces the loop-nesting, as many loops become one loop (instead of many backedges we only have one backedge). 4. What could we fix to improve performance? My guess is that the best option is to remove irrducibility before we start with loop-opts. As Roberto has presented at some point, there are 2 general approaches: duplicating nodes (can lead to exponential explosion of nodes) or the dispatch approach (only a few additional nodes, but dispatching means indirection, we have overhead at runtime). Playing with these, and heuristics to choose is a larger project. The benefit is that we simplify the compiler, and can start optimizing the formerly irreducible loops with our reducible-loops-only optimizations. 5. Is this really a relevant / generalizable case, or a special case we should not waste time on? I think that merge_many_backedges is not really that useful for irreducible loops. As far as I understand merge_many_backedges is to canonicalize reducible loops, and eventually insert the LoopNode, and then perform loop-opts on it. Irreducible loops seem to generally be a bit of a special case. Though a more in-depth analysis would have to look at the structure of irreducible loops that we actually often encounter - during OSR or from non-Java languages (eg. Kotlin coroutines). My guess is that with OSR or Kotlin coroutines we can generate arbitrarily complex irreducible loops - but the more complex the less common. I have played with both a bit in the recent weeks: OSR can create irreducible loops for example when we are in a nested loop, and start OSR there. We can use "continue" / "break" to jump to arbitrary loop levels, especially given that one can label the loops in Java. Kotlin coroutines work like this: It is basically a function that can yield values many times. The usage is to iterate over all yielded values. So we might have the coroutine function iteratively compute the next value and yielding it. We might for example have a loop that computes powers of 3. The outside code iterates over these values, and does some additional computation/aggregation. The coroutine is implemented as a function where we can jump into and out of every yield location. This means that if we have a loop in the coroutine code, we may jump in or out of that loop at multiple locations. This produces irreducibility very easily. Interesting read: this is why Graal removes irreducibility: https://github.com/oracle/graal/issues/1330 They do this with loop-duplication, which means the code grows linearly by the number of "suspension-points" (i.e. the number of yield statements, or number of irreducible-loop entries) https://github.com/oracle/graal/commit/4662877b8ce214528f09553f21776ab97e97c1a8
25-11-2022

Issue still present. Reproduce like this: copy TestNestedIrreducibleLoopsMain.java and TestNestedIrreducibleLoops.jasm to local directory, compile them: javac TestNestedIrreducibleLoopsMain.java java -jar ~/Documents/asmtools-7.0-build/release/lib/asmtools.jar jasm TestNestedIrreducibleLoops.jasm run: ./java -Xbatch -XX:CompileCommand=dontinline,TestNestedIrreducibleLoops::* -XX:CompileCommand=exclude,TestNestedIrreducibleLoopsMain::main -XX:+TraceLoopOpts TestNestedIrreducibleLoopsMain
22-11-2022

I coded a parse-block viewer, where we can see both the blocks, their pre-visit-order, and the loop-tree generated during parsing. patch: print_blocks_and_loops.diff result: parse-blocks.txt Analysis: Block L3135 (pre_order 1) has about 300 predecessor blocks. One of them is the start block (pre_order 0). The others all come from the many blocks in the middle. Since we visit L3135 before all those other blocks (except the start block), those other blocks will create a back-edge to L3135 as a loop-head. In the parse-loop-tree, we see that we have nested loops [1<-314 .... 1<-5], so over 300 nested loops. If these loops were reducible, merge_many_backedges would take all of the backedges and merge them in a new region at the end of the loop, and then only have a single backedge from that merge-region up to the loop head. Then we only see one loop the next time we run build_loop_tree. But as we see, in JDK-8240576 (https://hg.openjdk.java.net/jdk/jdk/rev/0238badf51bc) we do not call merge_many_backedges if we have an irreducible loop. Hence, we leave the backedges unmerged, and they all directly point into the loop-head, and during PhaseIdealLoop, we still see them as individual loops. During parsing, we have a single loop-head-block for all the loops. Before loop-opts, we still have only a single Region. But during loop-opts (PhaseIdealLoop::build_and_optimize -> recursive IdealLoopTree::beautify_loops -> IdealLoopTree::split_fall_in) we split some edges into separate regions, hence we see different loop-heads afterwards. I am currently unsure why we do split_fall_in so often, it seems that for each remaining sub-loop, we always have two fall-in edges, which we split off into a separate region, reducing the remaining region by one input.
22-11-2022