JDK-8059241 : C2: Excessive RemoveUseless passes during incremental inlining
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 9,10,11,12,13
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2014-09-26
  • Updated: 2019-07-01
  • Resolved: 2019-02-01
  • Fix Version: JDK 13 b07 (Fixed)
If you build the VM with JDK-8058968 included, and run the full suite of Nashorn/Octane with:

~/trunks/jdk9-dev/build/linux-x86_64-normal-server-release/images/j2sdk-image/bin/java -XX:+CITime -jar ~/trunks/jdk9-dev/build/linux-x86_64-normal-server-release/images/j2sdk-image/jre/lib/ext/nashorn.jar -Dnashorn.typeInfo.disabled=false --class-cache-size=0 --persistent-code-cache=false -scripting --log=time test/script/basic/run-octane.js -- --iterations 1 

Then you will see output like the following. Notice how much time Incremental Inline consumes:

  Total compilation time   : 480.712 s
    C2 Compile Time:      376.791 s
         Incremental Inline:   84.199 s
           IdealLoop:             1.061 s
           IGVN:                 10.713 s
           Inline:               28.659 s
           Prune Useless:        43.611 s
           Other:                 0.155 s

This arguably has an effect on Nashorn warmup, and more generally on how fast MH-heavy workloads ramp up.
Unfortunately, the proposed solution doesn't work. Incremental inlining fails miserably when a call site dies while waiting to be inlined. This happens when a previously inlined call causes some branches to be eliminated, but that information hasn't been propagated yet.

Suggestion from John: Some complex optimizations can become buggy when the graph changes due to dead-path elimination. One way we can defend against this is to delay the removal of dead paths to a cleanup phase. There are several ways to do this trick. To maintain certain connectivity properties, we introduce not-really-conditional NeverBranch nodes, which are rewritten extremely late. A "washable" version of NeverBranch nodes (one that "washes out" sooner than the current ones) could perhaps be used to delay path-cutting. More commonly, in loop opts, we use various kinds of opaque nodes to block constant folding when it would be inconvenient. Perhaps during incremental inlining we could block path-cutting by introducing some kind of opaque data or control node, which would wash out in the PhaseIterGVN that you originally intended to run after several rounds of incremental inlining. I'm not sure if this would be profitable in this case, but the idea of delaying path-cutting has been useful in the past.
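The failure mode described above can be shown with plain Java (a hedged illustration, not HotSpot internals; all names below are made up):

```java
// Illustration only: once alwaysTrue() is inlined down to the constant true,
// the else branch becomes provably dead. A late-inline decision that was
// already queued for onFalse() now points at a call site on a dead path --
// the situation in which incremental inlining was failing.
public class DeadCallSite {
    static boolean alwaysTrue() { return true; }
    static int onTrue()  { return 1; }
    static int onFalse() { return 2; }  // queued for late inlining, then dies

    static int f() {
        if (alwaysTrue())        // inlining this folds the branch...
            return onTrue();
        else
            return onFalse();    // ...leaving this call on a dead path
    }

    public static void main(String[] args) {
        System.out.println(f()); // only the true path is ever reachable
    }
}
```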

Proposed fix: http://cr.openjdk.java.net/~vlivanov/8059241/webrev.00

Two observations:
* PhaseRemoveUseless is performed too frequently (after every successfully inlined tree), even though it gets more expensive as the IR grows;
* inlining happens in smaller and smaller steps as live_nodes() approaches LiveNodeCountInliningCutoff.

So, the fix is two-fold:
* reduce PhaseRemoveUseless frequency: inline in larger chunks until the IR size reaches LiveNodeCountInliningCutoff, then eliminate dead nodes;
* keep a small (10%) gap between LiveNodeCountInliningCutoff and the actual limit at which an inlining step finishes, to give the algorithm some room to "breathe" (the smallest inlining chunk produces at least 10% * LiveNodeCountInliningCutoff nodes).

It leads to a significant reduction in incremental inline times (Box2D):

Before:
  C2 Compile Time:       163.114 s
    Optimize:             80.724 s
      Incremental Inline: 35.036 s
        Prune Useless:    21.808 s

After:
  C2 Compile Time:       195.139 s
    Optimize:            106.982 s
      Incremental Inline: 16.356 s
        Prune Useless:     2.154 s
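The two-fold fix can be sketched as a toy simulation (hedged: LiveNodeCountInliningCutoff and live_nodes() are real names from this report, but the node counts and the pruning model below are invented for illustration):

```java
// Toy model (not HotSpot code) of the chunked strategy: inline in large
// chunks up to the cutoff minus a 10% gap, then run dead-node elimination
// once per chunk instead of once per successful inline.
public class ChunkedInlining {
    int live = 1_000;     // current live-node count (made-up starting size)
    int pending;          // late-inline call sites still queued
    int prunePasses = 0;  // how often we paid for a PhaseRemoveUseless-like pass

    ChunkedInlining(int sites) { pending = sites; }

    void inlineOne() { live += 500; pending--; }               // each inline grows the IR
    void prune()     { live = live * 7 / 10; prunePasses++; }  // drop ~30% as dead nodes

    void inlineIncrementally(int cutoff) {
        int gap = cutoff / 10;  // headroom: the smallest chunk is >= 10% of the cutoff
        while (pending > 0) {
            while (pending > 0 && live < cutoff - gap)
                inlineOne();
            prune();            // one pruning pass per chunk, not per inline
        }
    }

    public static void main(String[] args) {
        ChunkedInlining chunked = new ChunkedInlining(100);
        chunked.inlineIncrementally(40_000);

        ChunkedInlining perInline = new ChunkedInlining(100);
        while (perInline.pending > 0) { perInline.inlineOne(); perInline.prune(); }

        System.out.println("chunked prune passes:    " + chunked.prunePasses);
        System.out.println("per-inline prune passes: " + perInline.prunePasses);
    }
}
```

With these made-up numbers, the chunked scheme pays for pruning a handful of times instead of once per inlined call site, which mirrors the Prune Useless drop in the Box2D numbers above.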

+ 84.089s (5%) Compile::inline_incrementally(PhaseIterGVN&)
| + 77.634s (5%) Compile::inline_incrementally_one(PhaseIterGVN&)
| | + 48.614s (3%) PhaseRemoveUseless::PhaseRemoveUseless(PhaseGVN*, Unique_Node_List*)
| | | + 24.997s (2%) Compile::identify_useful_nodes(Unique_Node_List&)
| | | + 5.964s (0%) Compile::update_dead_node_list(Unique_Node_List&)
| | + 24.147s (2%) LateInlineCallGenerator::do_late_inline()
| | | + 23.476s (2%) ParseGenerator::generate(JVMState*)
| | + 4.593s (0%) PhaseIterGVN::PhaseIterGVN(PhaseGVN*)

If it is of any help, here's a profile from running Nashorn/Octane, focusing on incremental inlining: http://cr.openjdk.java.net/~shade/8059241/nashorn-octane-profile-1.txt

In any case, it is premature to discuss the impact on "standard" applications. Should there be an improvement for incremental inlining, we can then check how it affects lambda-heavy benchmarks.

It is true that incremental inlining is currently only used for MH call sites, but that doesn't mean it could not affect non-JSR 292 applications: lambdas use JSR 292 (invokedynamic), and lambdas are used more and more in the core library. I'm not saying standard Java applications are affected, but they might be.
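As a concrete illustration of that point, the snippet below contains no explicit MethodHandle code, yet javac compiles the lambda to an invokedynamic instruction (visible with javap -c), so it exercises the same JSR 292 machinery that incremental inlining targets:

```java
import java.util.List;

// Ordinary-looking Java: no java.lang.invoke imports, no MethodHandles.
// The lambda in mapToInt is still bootstrapped via invokedynamic and
// LambdaMetafactory at first execution.
public class LambdaIndy {
    public static int sumOfSquares(List<Integer> xs) {
        return xs.stream().mapToInt(x -> x * x).sum();  // lambda -> invokedynamic
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(List.of(1, 2, 3)));  // 1 + 4 + 9 = 14
    }
}
```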

The data point CR is filed about is interesting and a good starting point for an investigation. I plan to look into what's going on during incremental inlining once I'm done with other JSR292-related stuff.

Incremental inlining is special, isn't it? It only applies to MethodHandle call sites now, right? If so, "standard" Java applications are not affected either way. Anyhow, most "standard" applications we have do not depend (much) on compiler performance, while the Nashorn case does depend on it for warmup.

My experience with C2 tells me there are not many things that can be done within a couple of hours. Changing inlining heuristics is delicate and requires extensive testing to ensure we do not introduce regressions. Do we have data for "standard" Java applications (SpecJBB, SpecJVM, etc.)?

Experience tells me the first low-hanging fruit can be identified within hours by a person experienced with the code and with basic profiling skills.

I'm not sure that is the definition of low-hanging. Yes, we can send an engineer to look into this, but it will take time, and it is far from certain that we can remove anything.

This is an exploratory task. It all starts when you actually look into the code and try to minimize the time spent there. Start with the obvious thing in the profile: are we sure we need that much pruning? Is it required in all those places in incremental inlining? Can we avoid some of them? More questions and answers will come as you break the ice on the task.

What would be a low-hanging fruit here?

Albert, see above: of the 376 seconds to compile, and 154 seconds to optimize, Incremental Inlining takes 85 seconds. In my view, a reasonable overhead for Incremental Inlining is one where we can say "well, we can't make it any faster without going really hardcore". I do believe we can shave off quite some time with low-hanging fruit there.

What does it mean for inlining to be "too hot"? Does that mean we inline too much? What would be a reasonable value?

I would say that this is a P3 and a bug, from our team's perspective.