JDK-8195118 : LoopStripMining causes performance regression
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 10
  • Priority: P2
  • Status: Closed
  • Resolution: Won't Fix
  • OS: generic
  • CPU: generic
  • Submitted: 2018-01-16
  • Updated: 2019-09-13
  • Resolved: 2018-01-26
Related Reports
Duplicate :  
Relates :  
Description
All DSA crypto benchmarks show a 6-7% regression on Solaris-SPARC with 10-b36... The regression holds at least through b39...

• ~1.5% in SPECjbb2005-G1 on Solaris-SPARC

• 2-5% in SHA1PRNG crypto benchmarks on Linux-x64, Mac-x64, and Solaris-SPARC:
   • micro.benchmarks.crypto.small.SecureRandomBench.nextBytes-algorithm:SHA1PRNG-dataSize:64-provider:-shared:false
   • micro.benchmarks.crypto.small.SecureRandomBench.nextBytes-algorithm:SHA1PRNG-dataSize:64-provider:-shared:true

Reproduces in triage (repro run on same machine).
Reproduces in re-triage (repro run on different machine).
VMswap results indicate the cause is in the JVM.
Comments
Agree.
26-01-2018

[~kvn] it's only necessary on SPARC and only works with the patch. But yes, we might want to re-evaluate the default settings for the flags for JDK 11. Also, LoopStripMiningIterShortLoop is not set to the correct value by default (see JDK-8196294). To summarize, I think we've agreed that these regressions are edge cases and are expected with Loop Strip Mining. The benefits outweigh the performance impact, and Loop Strip Mining can always be turned off. I'll close this as "Won't Fix".
26-01-2018

[~roland] You said that with LoopStripMiningIterShortLoop=200 the regression is gone. Is that with your patch, or only with the changed LoopStripMiningIterShortLoop value and without the patch? Should we just increase LoopStripMiningIterShortLoop?
{code}
// blind guess
- LoopStripMiningIterShortLoop = LoopStripMiningIter / 10;
+ LoopStripMiningIterShortLoop = LoopStripMiningIter / 5;
{code}
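A minimal sketch of the arithmetic behind the heuristic discussed above. It assumes the default value of -XX:LoopStripMiningIter is 1000 (an assumption; check globals.hpp for the actual default): dividing by 10 gives a short-loop cutoff of 100 iterations, while dividing by 5 gives 200, the value that was reported to make the regression disappear.

```java
// Hypothetical illustration (not HotSpot code): the short-loop cutoff
// below which the outer strip-mined loop would be removed, derived from
// LoopStripMiningIter as in the "blind guess" patch above.
public class ShortLoopCutoff {
    static long shortLoopCutoff(long loopStripMiningIter, long divisor) {
        return loopStripMiningIter / divisor;
    }

    public static void main(String[] args) {
        long iter = 1000; // assumed default of -XX:LoopStripMiningIter
        System.out.println("divide by 10: " + shortLoopCutoff(iter, 10)); // 100
        System.out.println("divide by 5:  " + shortLoopCutoff(iter, 5));  // 200
    }
}
```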
26-01-2018

Thank you, Roland, for analyzing the problem and confirming that it is the overhead of the strip mining code (an additional outer loop with a safepoint) and not a bug which prevents some other optimizations. As I said in previous comments, we can accept a small regression in some cases. For G1 GC, small safepoint pauses are more important. And users can switch off strip mining if they want. Note that this issue is present only when the number of loop iterations is small and not known statically (passed as an argument, for example).
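A hypothetical illustration of the affected pattern described above: a counted loop whose trip count is a method argument, so the compiler cannot prove statically that it is short and wraps it in an outer strip-mined loop with a safepoint. When n is small, the outer-loop bookkeeping becomes measurable overhead. The class and method names here are invented for illustration.

```java
// Sketch of the pattern affected by the regression: the trip count n is
// not a compile-time constant, so C2 strip-mines the loop even when n is
// small at run time.
public class ShortCountedLoop {
    static long sum(int[] data, int n) {
        long s = 0;
        for (int i = 0; i < n; i++) { // trip count passed as an argument
            s += data[i];
        }
        return s;
    }

    public static void main(String[] args) {
        int[] data = new int[64];
        for (int i = 0; i < data.length; i++) data[i] = i;
        System.out.println(sum(data, data.length)); // 0 + 1 + ... + 63 = 2016
    }
}
```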
26-01-2018

{quote}
If we decide not to use strip mining based on profiling, a loop that changes behavior after compilation could run for too long without a safepoint, so I would rather not do it.
{quote}
Another approach would be to insert a deopt and recompile if the behavior changes and the loop iteration count grows over some limit.
25-01-2018

I agree with Roland's analysis. I've performed benchmark runs that show that with the experimental patch the SPECjbb2005 regression on SPARC [1] and the microbenchmark regressions on x86_64 [2] are gone. I.e. the regressions are due to the problem that Roland described above. I'm fine with closing the bug. [~kvn], what do you think?
25-01-2018

We experimented with the patch below. It removes the outer strip mined loop and its safepoint if profiling tells us that, on average, the number of iterations of the inner strip mined loop is small. With -XX:LoopStripMiningIterShortLoop=200, all regressions disappear on all platforms. So it seems the root cause of the regressions is the overhead of maintaining the strip mined loop (and not, for instance, an optimization that doesn't trigger when loop strip mining is enabled). Note that profiling only gives an average number of iterations (not a maximum). If we decide not to use strip mining based on profiling, a loop that changes behavior after compilation could run for too long without a safepoint, so I would rather not do it. My recommendation is to do nothing and live with some small regressions.
{code}
diff --git a/src/hotspot/share/opto/loopTransform.cpp b/src/hotspot/share/opto/loopTransform.cpp
--- a/src/hotspot/share/opto/loopTransform.cpp
+++ b/src/hotspot/share/opto/loopTransform.cpp
@@ -2795,6 +2795,16 @@
       phase->do_maximally_unroll(this,old_new);
       return true;
     }
+    if (cl->is_strip_mined() &&
+        cl->profile_trip_cnt() != COUNT_UNKNOWN &&
+        cl->profile_trip_cnt() < LoopStripMiningIterShortLoop) {
+      Node* outer_sfpt = cl->outer_safepoint();
+      Node* outer_out = cl->outer_loop_exit();
+      phase->lazy_replace(outer_out, outer_sfpt->in(0));
+      phase->_igvn.replace_input_of(outer_sfpt, 0, phase->C->top());
+      cl->clear_strip_mined();
+      return true;
+    }
   }
 
   // Skip next optimizations if running low on nodes. Note that
{code}
25-01-2018

Can we share the perfasm output with Roland?
25-01-2018

Added information from JDK-8195117.
19-01-2018

Just checked again with Eric. Roland, I think you can safely ignore the open jmh benchmark, it differs too much from the closed one that's required to reproduce the regression. We are checking if we can share that.
18-01-2018

Yes my runs were with the default G1.
18-01-2018

"run with default arguments" = benchmarks are run with G1 not parallel gc?
18-01-2018

Yes, that doesn't make sense. [~ecaspole], any ideas? Looking at the scatter plot, I guess the regressions with the open jmh benchmark could just be variance. I've checked the original runs with the (closed) jmh benchmark and there we run with default arguments. I don't think the performance difference is due to variance here. Eric, any chance we can get Roland access to the closed jmh benchmark?
18-01-2018

That doesn't make much sense as loop strip mining is off by default with parallel gc so running with -XX:-UseCountedLoopSafepoint -XX:LoopStripMiningIter=0 doesn't change anything.
18-01-2018

You are right. I've double checked and in Eric's run, there is a regression between two runs of 10-b36 with these options:
Run 1: -XX:+IgnoreUnrecognizedVMOptions -XX:-UseCountedLoopSafepoint -XX:LoopStripMiningIter=0 -Xms1024m -Xmx1024m -Xmn768m -XX:+UseParallelGC
Run 2: -XX:+IgnoreUnrecognizedVMOptions -Xms1024m -Xmx1024m -Xmn768m -XX:+UseParallelGC
18-01-2018

When I run with no options, it says:
VM options: -Xms1024m -Xmx1024m -Xmn768m -XX:+UseParallelGC
And when I enable loop strip mining I don't see a regression. The hot code path seems identical with or without loop strip mining. There doesn't seem to be a strip mined loop in the hot code path either.
18-01-2018

I think it was executed without any non-default arguments (which should be G1 with loop strip mining, right?).
18-01-2018

I can run that. Thanks. Does it have to be run with G1, or should the default options plus loop strip mining enabled show a regression?
18-01-2018

Okay, looking at Eric's run, it seems that jmh file contains different benchmark names. According to his run, the largest regression (8.89%) on x86_64 shows up with openjdk.bench.javax.crypto.Crypto.decrypt-cipherName:DES-length:16384. Can you run that?
18-01-2018

The only crypto benchmarks that I see are:
org.openjdk.bench.javax.crypto.AES.testBaseline
org.openjdk.bench.javax.crypto.AES.testUseAes
org.openjdk.bench.javax.crypto.AES.testUseAesIntrinsics
org.openjdk.bench.javax.crypto.Crypto.decrypt
org.openjdk.bench.javax.crypto.Crypto.encrypt
18-01-2018

These showed a regression in the run that I've executed:
micro.benchmarks.crypto.small.KeyPairGeneratorBench.generateKeyPair-algorithm:DSA-keyLength:2048-provider:
micro.benchmarks.crypto.small.KeyPairGeneratorBench.generateKeyPair-algorithm:RSA-keyLength:2048-provider:
micro.benchmarks.crypto.small.MessageDigestBench.digest-algorithm:SHA1-dataSize:1048576-provider:
micro.benchmarks.crypto.small.SecureRandomBench.nextBytes-algorithm:SHA1PRNG-dataSize:64-provider:-shared:false
micro.benchmarks.crypto.small.SecureRandomBench.nextBytes-algorithm:SHA1PRNG-dataSize:64-provider:-shared:true
micro.benchmarks.crypto.small.SignatureBench.DSA.sign-algorithm:SHA256withDSA-dataSize:1024-keyLength:1024-provider:
micro.benchmarks.crypto.small.SignatureBench.DSA.verify-algorithm:SHA256withDSA-dataSize:1024-keyLength:1024-provider:
micro.benchmarks.crypto.small.SignatureBench.RSA.sign-algorithm:SHA256withRSA-dataSize:1024-keyLength:2048-provider:
18-01-2018

What are the names of the tests that regress? I built the jmh benchmarks from the patch and when I run with -l, I don't see any benchmark whose name contains DSA.
18-01-2018

We can always advise users who complain to switch the optimization off with a flag.
16-01-2018

Yes, when profiling doesn't provide enough information and we have loops with a small number of iterations.
16-01-2018

[~roland], I'm assigning this to you for further investigation. The crypto microbenchmarks are available here (they are planned to be pushed into the open soon): http://cr.openjdk.java.net/~ecaspole/jmh-jdk-microbenchmarks-first-export.txt Please let me know if you don't have time to work on this or have problems reproducing it. EDIT: Vladimir K. mentioned that we might want to tolerate the regression in these cases because Loop Strip Mining is expected to cause regressions in some edge cases.
16-01-2018

ILW = Medium performance regression, crypto benchmarks on Solaris Sparc, disable loop strip mining -XX:-UseCountedLoopSafepoint -XX:LoopStripMiningIter=0 = HMM = P2
16-01-2018

I've executed a run with and without loop strip mining (-XX:-UseCountedLoopSafepoint -XX:LoopStripMiningIter=0). The results show an improvement of up to 7.6% which proves that the regression is due to JDK-8186027.
16-01-2018

Relevant compiler fixes in JDK 10 b36: https://bugs.openjdk.java.net/issues/?jql=project%20%3D%20JDK%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%20%2210%22%20AND%20component%20%3D%20hotspot%20AND%20%22Resolved%20In%20Build%22%20%3D%20b36%20AND%20Subcomponent%20%3D%20compiler%20AND%20labels%20not%20in%20(testbug%2C%20openjdk%2C%20aot%2C%20graal%2C%20noreg-self)%20ORDER%20BY%20resolved%20DESC%2C%20summary%20ASC%2C%20updatedDate%20DESC This regression could be due to Loop Strip Mining (JDK-8186027). I'm investigating.
16-01-2018

The regression is also reproducible on x86_64 (up to 5.6% improvement when disabling loop strip mining).
16-01-2018