JDK-8195118 : LoopStripMining causes performance regression
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 10
  • Priority: P2
  • Status: Closed
  • Resolution: Won't Fix
  • OS: generic
  • CPU: generic
  • Submitted: 2018-01-16
  • Updated: 2019-09-13
  • Resolved: 2018-01-26
Related Reports
Duplicate :  
Relates :  
Description
All DSA crypto benchmarks show a 6-7% regression on Solaris-SPARC with 10-b36... The regression holds at least through b39...

• ~1.5% in SPECjbb2005-G1 on Solaris-SPARC

• 2-5% in SHA1PRNG crypto benchmarks on Linux-x64, Mac-x64, and Solaris-SPARC:
   • micro.benchmarks.crypto.small.SecureRandomBench.nextBytes-algorithm:SHA1PRNG-dataSize:64-provider:-shared:false
   • micro.benchmarks.crypto.small.SecureRandomBench.nextBytes-algorithm:SHA1PRNG-dataSize:64-provider:-shared:true

Reproduces in triage (repro run on same machine).
Reproduces in re-triage (repro run on different machine).
VMswap results indicate the cause is in the JVM.
Comments
Agree.
26-01-2018

[~kvn] it's only necessary on SPARC and only works with the patch. But yes, we might want to re-evaluate the default settings for the flags for JDK 11. Also, LoopStripMiningIterShortLoop is not set to the correct value by default (see JDK-8196294). To summarize, I think we've agreed that these regressions are edge cases and are expected with Loop Strip Mining. The benefits outweigh the performance impact, and Loop Strip Mining can always be turned off. I'll close this as "Won't Fix".
26-01-2018

[~roland] You said that with LoopStripMiningIterShortLoop=200 the regression is gone. Is that with your patch, or only with the changed LoopStripMiningIterShortLoop value and without the patch? Should we just increase LoopStripMiningIterShortLoop?
{code}
// blind guess
- LoopStripMiningIterShortLoop = LoopStripMiningIter / 10;
+ LoopStripMiningIterShortLoop = LoopStripMiningIter / 5;
{code}
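A minimal sketch of the arithmetic behind the heuristic discussed above. It assumes the default value of -XX:LoopStripMiningIter is 1000 (an assumption; check globals.hpp for the actual default): dividing by 10 gives a short-loop cutoff of 100 iterations, while dividing by 5 gives 200, the value that was reported to make the regression disappear.

```java
// Hypothetical illustration (not HotSpot code): the short-loop cutoff
// below which the outer strip-mined loop would be removed, derived from
// LoopStripMiningIter as in the "blind guess" patch above.
public class ShortLoopCutoff {
    static long shortLoopCutoff(long loopStripMiningIter, long divisor) {
        return loopStripMiningIter / divisor;
    }

    public static void main(String[] args) {
        long iter = 1000; // assumed default of -XX:LoopStripMiningIter
        System.out.println("divide by 10: " + shortLoopCutoff(iter, 10)); // 100
        System.out.println("divide by 5:  " + shortLoopCutoff(iter, 5));  // 200
    }
}
```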
26-01-2018

Thank you, Roland, for analyzing the problem and confirming that it is the overhead of the strip mining code (an additional outer loop with a safepoint) and not a bug which prevents some other optimizations. As I said in previous comments, we can accept a small regression in some cases. For G1 GC, small safepoint pauses are more important. And users can switch off strip mining if they want. Note that this issue is present only when the number of loop iterations is small and not known statically (passed as an argument, for example).
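A hypothetical illustration of the affected pattern described above: a counted loop whose trip count is a method argument, so the compiler cannot prove statically that it is short and wraps it in an outer strip-mined loop with a safepoint. When n is small, the outer-loop bookkeeping becomes measurable overhead. The class and method names here are invented for illustration.

```java
// Sketch of the pattern affected by the regression: the trip count n is
// not a compile-time constant, so C2 strip-mines the loop even when n is
// small at run time.
public class ShortCountedLoop {
    static long sum(int[] data, int n) {
        long s = 0;
        for (int i = 0; i < n; i++) { // trip count passed as an argument
            s += data[i];
        }
        return s;
    }

    public static void main(String[] args) {
        int[] data = new int[64];
        for (int i = 0; i < data.length; i++) data[i] = i;
        System.out.println(sum(data, data.length)); // 0 + 1 + ... + 63 = 2016
    }
}
```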
26-01-2018

{quote}
If we decide not to use strip mining based on profiling, a loop that changes behavior after compilation could run for too long without a safepoint, so I would rather not do it.
{quote}
Another approach would be to insert a deopt and recompile if the behavior changes and the loop iteration count grows over some limit.
25-01-2018

I agree with Roland's analysis. I've performed benchmark runs that show that with the experimental patch the SPECjbb2005 regression on SPARC [1] and the microbenchmark regressions on x86_64 [2] are gone. I.e. the regressions are due to the problem that Roland described above. I'm fine with closing the bug. [~kvn], what do you think?
25-01-2018

We experimented with the patch below. It removes the outer strip mined loop and its safepoint if profiling tells us that, on average, the number of iterations of the inner strip mined loop is small. With -XX:LoopStripMiningIterShortLoop=200, all regressions disappear on all platforms. So it seems the root cause of the regressions is the overhead of maintaining the strip mined loop (and not, for instance, an optimization that doesn't trigger when loop strip mining is enabled). Note that profiling only gives an average number of iterations (not a maximum). If we decide not to use strip mining based on profiling, a loop that changes behavior after compilation could run for too long without a safepoint, so I would rather not do it. My recommendation is to do nothing and live with some small regressions.
{code}
diff --git a/src/hotspot/share/opto/loopTransform.cpp b/src/hotspot/share/opto/loopTransform.cpp
--- a/src/hotspot/share/opto/loopTransform.cpp
+++ b/src/hotspot/share/opto/loopTransform.cpp
@@ -2795,6 +2795,16 @@
       phase->do_maximally_unroll(this,old_new);
       return true;
     }
+    if (cl->is_strip_mined() &&
+        cl->profile_trip_cnt() != COUNT_UNKNOWN &&
+        cl->profile_trip_cnt() < LoopStripMiningIterShortLoop) {
+      Node* outer_sfpt = cl->outer_safepoint();
+      Node* outer_out = cl->outer_loop_exit();
+      phase->lazy_replace(outer_out, outer_sfpt->in(0));
+      phase->_igvn.replace_input_of(outer_sfpt, 0, phase->C->top());
+      cl->clear_strip_mined();
+      return true;
+    }
   }
 
   // Skip next optimizations if running low on nodes. Note that
{code}
25-01-2018

Can we share the perfasm output with Roland?
25-01-2018

Added information from JDK-8195117.
19-01-2018

Just checked again with Eric. Roland, I think you can safely ignore the open jmh benchmark, it differs too much from the closed one that's required to reproduce the regression. We are checking if we can share that.
18-01-2018

Yes my runs were with the default G1.
18-01-2018

"run with default arguments" = benchmarks are run with G1 not parallel gc?
18-01-2018

Yes, that doesn't make sense. [~ecaspole], any ideas? Looking at the scatter plot, I guess the regressions with the open jmh benchmark could just be variance. I've checked the original runs with the (closed) jmh benchmark and there we run with default arguments. I don't think the performance difference is due to variance here. Eric, any chance we can get Roland access to the closed jmh benchmark?
18-01-2018

That doesn't make much sense as loop strip mining is off by default with parallel gc so running with -XX:-UseCountedLoopSafepoint -XX:LoopStripMiningIter=0 doesn't change anything.
18-01-2018

You are right. I've double checked and in Eric's run, there is a regression between two runs of 10-b36 with these options:
Run 1: -XX:+IgnoreUnrecognizedVMOptions -XX:-UseCountedLoopSafepoint -XX:LoopStripMiningIter=0 -Xms1024m -Xmx1024m -Xmn768m -XX:+UseParallelGC
Run 2: -XX:+IgnoreUnrecognizedVMOptions -Xms1024m -Xmx1024m -Xmn768m -XX:+UseParallelGC
18-01-2018

When I run with no options, it says:
VM options: -Xms1024m -Xmx1024m -Xmn768m -XX:+UseParallelGC
And when I enable loop strip mining I don't see a regression. The hot code path seems identical with or without loop strip mining. There doesn't seem to be a strip mined loop in the hot code path either.
18-01-2018

I think it was executed without any non-default arguments (which should be G1 with loop strip mining, right?).
18-01-2018

I can run that. Thanks. Does it have to be run with G1, or should the default options plus loop strip mining enabled show a regression?
18-01-2018

Okay, looking at Eric's run, it seems that jmh file contains different benchmark names. According to his run, the largest regression (8.89%) on x86_64 shows up with openjdk.bench.javax.crypto.Crypto.decrypt-cipherName:DES-length:16384. Can you run that?
18-01-2018

The only crypto benchmarks that I see are:
org.openjdk.bench.javax.crypto.AES.testBaseline
org.openjdk.bench.javax.crypto.AES.testUseAes
org.openjdk.bench.javax.crypto.AES.testUseAesIntrinsics
org.openjdk.bench.javax.crypto.Crypto.decrypt
org.openjdk.bench.javax.crypto.Crypto.encrypt
18-01-2018

These showed a regression in the run that I've executed:
micro.benchmarks.crypto.small.KeyPairGeneratorBench.generateKeyPair-algorithm:DSA-keyLength:2048-provider:
micro.benchmarks.crypto.small.KeyPairGeneratorBench.generateKeyPair-algorithm:RSA-keyLength:2048-provider:
micro.benchmarks.crypto.small.MessageDigestBench.digest-algorithm:SHA1-dataSize:1048576-provider:
micro.benchmarks.crypto.small.SecureRandomBench.nextBytes-algorithm:SHA1PRNG-dataSize:64-provider:-shared:false
micro.benchmarks.crypto.small.SecureRandomBench.nextBytes-algorithm:SHA1PRNG-dataSize:64-provider:-shared:true
micro.benchmarks.crypto.small.SignatureBench.DSA.sign-algorithm:SHA256withDSA-dataSize:1024-keyLength:1024-provider:
micro.benchmarks.crypto.small.SignatureBench.DSA.verify-algorithm:SHA256withDSA-dataSize:1024-keyLength:1024-provider:
micro.benchmarks.crypto.small.SignatureBench.RSA.sign-algorithm:SHA256withRSA-dataSize:1024-keyLength:2048-provider:
18-01-2018

What are the names of the tests that regress? I built the jmh benchmarks from the patch and when I run with -l, I don't see any benchmark whose name contains DSA.
18-01-2018

We can always advise users who complain to switch the optimization off with a flag.
16-01-2018

Yes, when profiling doesn't provide enough information and we have loops with a small number of iterations.
16-01-2018

[~roland], I'm assigning this to you for further investigation. The crypto microbenchmarks are available here (they are planned to be pushed into the open soon): http://cr.openjdk.java.net/~ecaspole/jmh-jdk-microbenchmarks-first-export.txt Please let me know if you don't have time to work on this or have problems reproducing it. EDIT: Vladimir K. mentioned that we might want to tolerate the regression in these cases because Loop Strip Mining is expected to cause regressions in some edge cases.
16-01-2018

ILW = Medium performance regression, crypto benchmarks on Solaris Sparc, disable loop strip mining -XX:-UseCountedLoopSafepoint -XX:LoopStripMiningIter=0 = HMM = P2
16-01-2018

I've executed a run with and without loop strip mining (-XX:-UseCountedLoopSafepoint -XX:LoopStripMiningIter=0). The results show an improvement of up to 7.6% which proves that the regression is due to JDK-8186027.
16-01-2018

Relevant compiler fixes in JDK 10 b36: https://bugs.openjdk.java.net/issues/?jql=project%20%3D%20JDK%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%20%2210%22%20AND%20component%20%3D%20hotspot%20AND%20%22Resolved%20In%20Build%22%20%3D%20b36%20AND%20Subcomponent%20%3D%20compiler%20AND%20labels%20not%20in%20(testbug%2C%20openjdk%2C%20aot%2C%20graal%2C%20noreg-self)%20ORDER%20BY%20resolved%20DESC%2C%20summary%20ASC%2C%20updatedDate%20DESC This regression could be due to Loop Strip Mining (JDK-8186027). I'm investigating.
16-01-2018

The regression is also reproducible on x86_64 (up to 5.6% improvement when disabling loop strip mining).
16-01-2018