JDK-8059241 : C2: Excessive RemoveUseless passes during incremental inlining
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 9,10,11,12,13
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2014-09-26
  • Updated: 2019-07-01
  • Resolved: 2019-02-01
  • Fix Version: JDK 13 b07 (Fixed)
If you build the VM with JDK-8058968 included, and run the full suite of Nashorn/Octane with:

~/trunks/jdk9-dev/build/linux-x86_64-normal-server-release/images/j2sdk-image/bin/java -XX:+CITime -jar ~/trunks/jdk9-dev/build/linux-x86_64-normal-server-release/images/j2sdk-image/jre/lib/ext/nashorn.jar -Dnashorn.typeInfo.disabled=false --class-cache-size=0 --persistent-code-cache=false -scripting --log=time test/script/basic/run-octane.js -- --iterations 1 

Then you will see output like the following. Notice how much time Incremental Inline consumes:

  Total compilation time   : 480.712 s
    C2 Compile Time:      376.791 s
         Incremental Inline:   84.199 s
           IdealLoop:             1.061 s
           IGVN:                 10.713 s
           Inline:               28.659 s
           Prune Useless:        43.611 s
           Other:                 0.155 s

This arguably has an effect on Nashorn warmup, and more generally on how fast MH-heavy workloads ramp up.
Unfortunately, the proposed solution doesn't work. Incremental inlining fails miserably when a call site dies while waiting to be inlined. This happens when a previously inlined call causes some branches to be eliminated, but that information hasn't been propagated yet.

Suggestion from John: Some complex optimizations can become buggy when the graph changes due to dead-path elimination. One way we can defend against this is to delay the removal of dead paths to a cleanup phase. There are several ways to do this trick. To maintain certain connectivity properties, we introduce not-really-conditional NeverBranch nodes, which are rewritten extremely late. A "washable" version of NeverBranch nodes (one that "washes out" sooner than the current ones) could perhaps be used to delay path-cutting. More commonly, in loop opts, we use various kinds of opaque nodes to block constant folding when it would be inconvenient. Perhaps during incremental inlining we could block path-cutting by introducing some kind of opaque data or control node, which would wash out in the PhaseIterGVN that you originally intended to run after several rounds of incremental inlining. I'm not sure if this would be profitable in this case, but the idea of delaying path-cutting has been useful in the past.
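The failure mode described above can be shown with plain Java (a hedged illustration, not HotSpot internals; all names below are made up):

```java
// Illustration only: once alwaysTrue() is inlined down to the constant true,
// the else branch becomes provably dead. A late-inline decision that was
// already queued for onFalse() now points at a call site on a dead path --
// the situation in which incremental inlining was failing.
public class DeadCallSite {
    static boolean alwaysTrue() { return true; }
    static int onTrue()  { return 1; }
    static int onFalse() { return 2; }  // queued for late inlining, then dies

    static int f() {
        if (alwaysTrue())        // inlining this folds the branch...
            return onTrue();
        else
            return onFalse();    // ...leaving this call on a dead path
    }

    public static void main(String[] args) {
        System.out.println(f()); // only the true path is ever reachable
    }
}
```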

Proposed fix: http://cr.openjdk.java.net/~vlivanov/8059241/webrev.00

Two observations:
* PhaseRemoveUseless is performed too frequently (after every successfully inlined tree), even though it gets more expensive as the IR grows;
* inlining happens in smaller and smaller steps as live_nodes() approaches LiveNodeCountInliningCutoff.

So, the fix is two-fold:
* reduce PhaseRemoveUseless frequency: inline in larger chunks until the IR size reaches LiveNodeCountInliningCutoff, then eliminate dead nodes;
* keep a small (10%) gap between LiveNodeCountInliningCutoff and the actual limit at which an inlining step finishes, to give the algorithm some room to "breathe" (the smallest inlining chunk produces at least 10% * LiveNodeCountInliningCutoff nodes).

It leads to a significant reduction in incremental inline times (Box2D):

Before:
  C2 Compile Time:       163.114 s
    Optimize:             80.724 s
      Incremental Inline: 35.036 s
        Prune Useless:    21.808 s

After:
  C2 Compile Time:       195.139 s
    Optimize:            106.982 s
      Incremental Inline: 16.356 s
        Prune Useless:     2.154 s
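The two-fold fix can be sketched as a toy simulation (hedged: LiveNodeCountInliningCutoff and live_nodes() are real names from this report, but the node counts and the pruning model below are invented for illustration):

```java
// Toy model (not HotSpot code) of the chunked strategy: inline in large
// chunks up to the cutoff minus a 10% gap, then run dead-node elimination
// once per chunk instead of once per successful inline.
public class ChunkedInlining {
    int live = 1_000;     // current live-node count (made-up starting size)
    int pending;          // late-inline call sites still queued
    int prunePasses = 0;  // how often we paid for a PhaseRemoveUseless-like pass

    ChunkedInlining(int sites) { pending = sites; }

    void inlineOne() { live += 500; pending--; }               // each inline grows the IR
    void prune()     { live = live * 7 / 10; prunePasses++; }  // drop ~30% as dead nodes

    void inlineIncrementally(int cutoff) {
        int gap = cutoff / 10;  // headroom: the smallest chunk is >= 10% of the cutoff
        while (pending > 0) {
            while (pending > 0 && live < cutoff - gap)
                inlineOne();
            prune();            // one pruning pass per chunk, not per inline
        }
    }

    public static void main(String[] args) {
        ChunkedInlining chunked = new ChunkedInlining(100);
        chunked.inlineIncrementally(40_000);

        ChunkedInlining perInline = new ChunkedInlining(100);
        while (perInline.pending > 0) { perInline.inlineOne(); perInline.prune(); }

        System.out.println("chunked prune passes:    " + chunked.prunePasses);
        System.out.println("per-inline prune passes: " + perInline.prunePasses);
    }
}
```

With these made-up numbers, the chunked scheme pays for pruning a handful of times instead of once per inlined call site, which mirrors the Prune Useless drop in the Box2D numbers above.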

+ 84.089s (5%) Compile::inline_incrementally(PhaseIterGVN&)
| + 77.634s (5%) Compile::inline_incrementally_one(PhaseIterGVN&)
| | + 48.614s (3%) PhaseRemoveUseless::PhaseRemoveUseless(PhaseGVN*, Unique_Node_List*)
| | | + 24.997s (2%) Compile::identify_useful_nodes(Unique_Node_List&)
| | | + 5.964s (0%) Compile::update_dead_node_list(Unique_Node_List&)
| | + 24.147s (2%) LateInlineCallGenerator::do_late_inline()
| | | + 23.476s (2%) ParseGenerator::generate(JVMState*)
| | + 4.593s (0%) PhaseIterGVN::PhaseIterGVN(PhaseGVN*)

If it is of any help, here's a profile from running Nashorn/Octane, focusing on incremental inlining: http://cr.openjdk.java.net/~shade/8059241/nashorn-octane-profile-1.txt

In any case, it is premature to discuss the impact on "standard" applications. Should there be an improvement for incremental inlining, we can then check how it affects lambda-heavy benchmarks.

It is true that incremental inlining is currently only used for MH call sites, but that doesn't mean it could not affect non-JSR 292 applications: lambdas use JSR 292 (invokedynamic), and lambdas are used more and more in the core library. I'm not saying standard Java applications are affected, but they might be.
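As a concrete illustration of that point, the snippet below contains no explicit MethodHandle code, yet javac compiles the lambda to an invokedynamic instruction (visible with javap -c), so it exercises the same JSR 292 machinery that incremental inlining targets:

```java
import java.util.List;

// Ordinary-looking Java: no java.lang.invoke imports, no MethodHandles.
// The lambda in mapToInt is still bootstrapped via invokedynamic and
// LambdaMetafactory at first execution.
public class LambdaIndy {
    public static int sumOfSquares(List<Integer> xs) {
        return xs.stream().mapToInt(x -> x * x).sum();  // lambda -> invokedynamic
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(List.of(1, 2, 3)));  // 1 + 4 + 9 = 14
    }
}
```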

The data point CR is filed about is interesting and a good starting point for an investigation. I plan to look into what's going on during incremental inlining once I'm done with other JSR292-related stuff.

Incremental inlining is special, isn't it? It only applies to MethodHandle call sites now, right? If so, "standard" Java applications are not affected either way. Anyhow, most "standard" applications we have do not depend (much) on compiler performance, while the Nashorn case does depend on it for warmup.

My experience with C2 tells me there are not many things that can be done within a couple of hours. Changing inlining heuristics is delicate and requires extensive testing to ensure we do not introduce regressions. Do we have data for "standard" Java applications (SpecJBB, SpecJVM, etc.)?

Experience tells me the first low-hanging fruit can be identified within hours by a person experienced with the code and with basic profiling skills.

I'm not sure that is the definition of low-hanging. Yes, we can send an engineer to look into this, but it will take time, and it is far from certain that we can remove anything.

This is an exploratory task. It all starts when you actually look into the code and try to minimize the time spent there. Start with the obvious thing in the profile: are we sure we need that much pruning? Is it required in all those places in incremental inlining? Can we avoid some of them? More questions and answers will come as you break the ice on the task.

What would be a low-hanging fruit here?

Albert, see above: of the 376 seconds to compile, and 154 seconds to optimize, Incremental Inlining takes 85 seconds. In my view, a reasonable overhead for Incremental Inlining is one where we can say "well, we can't make it any faster without going really hardcore". I do believe we can shave off quite some time with low-hanging fruit there.

What does it mean for inlining to be "too hot"? Does that mean we inline too much? What would be a reasonable value?

I would say that this is a P3 and a bug, from our team's perspective.