Bug ID: JDK-8044729 Intermittent SIGBUS on SPECjbb2013-CMS on Solaris SPARC with 9-b12

JDK-8041744 is a similar failure but with SEGV instead of bus error.
10-07-2014
Reproduced the problem with product 8u20 binaries on our T4 after 11 iterations. The CT base is again %l7. Trying with fastdebug now.
07-07-2014
SQE is OK with deferring this issue, since the failure is intermittent and does not appear to be a regression in 8u20.
02-07-2014
Ran it on a T4 we have with the same command line options over the weekend 33 times - didn't see any failures.
30-06-2014
Yep... Here's the SIGBUS message showing that (from iteration #1).
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGBUS (0xa) at pc=0xffffffff6af4c648, pid=2925, tid=2830
#
# JRE version: Java(TM) SE Runtime Environment (8.0_20-b20) (build 1.8.0_20-ea-b20)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.20-b20 mixed mode solaris-sparc compressed oops)
# Problematic frame:
# J 7277 C2 org.spec.jbb.hq.tx.InstallmentPurchaseVoidTransaction.execute()Lorg/spec/jbb/core/comm/Response; (98 bytes) @ 0xffffffff6af4c648 [0xffffffff6af4ba40+0xc08]
27-06-2014
It still happens exclusively in org.spec.jbb.hq.tx.InstallmentPurchaseVoidTransaction::execute(), right?
27-06-2014
Rob asked: "Brian.. can you confirm?" From what I can dig up, Rob's data is up to date. There may have been some incidents in earlier 8u20 or 9 builds, but our process back then was to re-run the jobs, which ended up overwriting the log files and changing the job status from Failed to Finished, so those records got lost. Once we figured this out, we changed the process for dealing with failed jobs. As we get time, we are running back-builds through aurora-performance. But again, due to the intermittent nature of the bug, it could have slipped into a build where it just happened not to reproduce during the benchmark run for that build.
27-06-2014
FYI... the SIGBUS is still present, this time against 8u20-b20. One run of jbb2013-CMS on SPARC failed. The other 4 runs succeeded. No new core files, but wanted to log the fact that it continues to occur.
27-06-2014
Igor asked: "So has this problem started occurring only after 9b12?" We started jdk9 performance coverage around b12, and 8u20 coverage around b16, so our focus has been on the builds since then. However, in the background we are starting to work through the older builds. The older builds for 8u20 and 9 have now all been run, but not fully examined yet. I'll ask Brian to confirm this, but as far as I know, the earliest we've seen the SIGBUS is 9-b12 and 8u20-b18. However, keep in mind that the SIGBUS is intermittent. To my knowledge, we have now seen the SIGBUS in:
9: b12, b15, b16 <did not show in b19, b18, b17, b14, b13>
8u20: b18 <did not show in b19>
So it's possible it's in earlier builds, but just didn't show itself. Brian.. can you confirm?
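To put a rough number on "it just didn't show itself": the sketch below computes the chance that an intermittent failure stays hidden across a batch of independent runs. The per-run failure probability of 0.2 is purely a hypothetical value chosen for illustration, as are the class and method names; the real rate is not known from the data in this report.

```java
final class IntermittentMiss {
    // Probability that n independent runs all pass when each run
    // fails with probability p.
    static double probAllPass(double p, int n) {
        return Math.pow(1.0 - p, n);
    }

    public static void main(String[] args) {
        double p = 0.2; // hypothetical per-run failure probability (assumption)
        int n = 5;      // hypothetical number of iterations per build (assumption)
        System.out.printf("P(no failure in %d runs) = %.3f%n", n, probAllPass(p, n));
    }
}
```

Under these assumed numbers, roughly one build in three would look clean even though the bug is present, so a build that "did not show" the SIGBUS cannot be treated as known-good.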
25-06-2014
So, what this code is doing is a field store followed by a post-barrier. The ldx with a negative offset is a load of the card table base address from the constant table. The value of %l7 at this point looks bogus; we need to find out how it gets killed.
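For anyone not familiar with the post-barrier, here is a minimal Java sketch of what the srlx/clrb pair in the crashing sequence does once the card table base has been loaded. It assumes 512-byte cards (from the shift by 9) and a dirty value of 0 (from the clrb); the class name, heap start, and heap size are hypothetical and only serve the illustration. It is a model of the idea, not the VM's actual code.

```java
import java.util.Arrays;

final class CardMarkSketch {
    static final int CARD_SHIFT = 9;      // assumption: 512-byte cards, as in "srlx %i3, 9, %l0"
    static final byte DIRTY = 0;          // assumption: dirty card is 0, as in "clrb"
    static final byte CLEAN = (byte) -1;

    final byte[] cards;
    // Plays the role of the base loaded into %l1: biased so that
    // base + (address >>> CARD_SHIFT) indexes the card directly.
    final long biasedBase;

    CardMarkSketch(long heapStart, long heapBytes) {
        cards = new byte[(int) (heapBytes >>> CARD_SHIFT)];
        Arrays.fill(cards, CLEAN);
        biasedBase = -(heapStart >>> CARD_SHIFT);
    }

    // Post-barrier: after a field store into the object at objAddr,
    // mark the card covering that address as dirty.
    void postBarrier(long objAddr) {
        long cardOffset = objAddr >>> CARD_SHIFT;        // srlx %i3, 9, %l0
        cards[(int) (biasedBase + cardOffset)] = DIRTY;  // clrb [%l1 + %l0]
    }

    boolean isDirty(long addr) {
        return cards[(int) (biasedBase + (addr >>> CARD_SHIFT))] == DIRTY;
    }

    public static void main(String[] args) {
        long heapStart = 0x10000L;                        // hypothetical heap start
        CardMarkSketch table = new CardMarkSketch(heapStart, 1L << 20);
        table.postBarrier(heapStart + 1500);              // address inside the third card
        System.out.println(table.isDirty(heapStart + 1024));
        System.out.println(table.isDirty(heapStart));
    }
}
```

In this model, a bogus base corresponds to a bogus %l7 in the compiled code: the load of the card table base goes through a bad address before the card mark is ever reached.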
24-06-2014
So has this problem started occurring only after 9-b12? Actually, I don't see anything suspicious going into b12 (and no compiler changes at all). It would be nice to narrow down the build in which the failure started occurring.
24-06-2014
The analysis above is not quite correct: the cwbe %g0, %g0, 0xffffffff69805160 instruction always branches, so in the context of ldx [%l7 - 136], %l1 the address base (%l7) does not necessarily contain the polling page address.
24-06-2014
Did an initial analysis and it looks like a register allocation issue. The crash happens in a compiled frame, see comments inline below:
[...]
0xffffffff698050fc: sethi %hi(0x83cffc00), %g3   <--- Load safepoint check address
0xffffffff69805100: btog -1024, %g3
0xffffffff69805104: sra %l4, %g0, %l5
0xffffffff69805108: sllx %l5, 2, %l5
0xffffffff6980510c: add %i1, %l5, %l5
0xffffffff69805110: ld [%l5 + 16], %g1
0xffffffff69805114: ldx [%g3], %g0   <--- Safepoint check
0xffffffff69805118: cwbe %g1, %g0, 0xffffffff698051c0 ! 0xffffffff698051c0
0xffffffff6980511c: sllx %g1, 3, %i3
0xffffffff69805120: add %i3, %g6, %i3
0xffffffff69805124: cwbe %g0, %g0, 0xffffffff698050ac ! 0xffffffff698050ac
0xffffffff69805128: sethi %hi(0x83cffc00), %l7   <--- Load safepoint check address again
0xffffffff6980512c: btog -1024, %l7
0xffffffff69805130: srl %l2, 1, %l2
0xffffffff69805134: ldx [%l7], %g0   <--- Safepoint check (so far so good....)
0xffffffff69805138: cwbne %l2, 0x0, 0xffffffff698052a0 ! 0xffffffff698052a0
0xffffffff6980513c: ld [%i2 + 16], %l1
0xffffffff69805140: cmp %l3, %l1
0xffffffff69805144: bge,pn %icc,0xffffffff69805b94 ! 0xffffffff69805b94
0xffffffff69805148: nop
0xffffffff6980514c: cwbe %g0, %g0, 0xffffffff69805160 ! 0xffffffff69805160
0xffffffff69805150: ldx [%l7 - 136], %l1   <--- Here we crash, we dereference [%l7 - 136] here, but %l7 contains the safepoint check address here. A register allocation bug?
0xffffffff69805154: st %l3, [%i3 + 28]
0xffffffff69805158: srlx %i3, 9, %l0
0xffffffff6980515c: clrb [%l1 + %l0]
0xffffffff69805160: ldx [%g2 + 96], %i0
0xffffffff69805164: ldx [%g2 + 112], %l0
[...]
Moving bug to the compiler team for further analysis.
24-06-2014
Right... We run SPECjbb2013 with CMS, G1, and ParallelGC on 4 platforms: Solaris-SPARC, Solaris-x64, Linux-x64, and Windows Server 2008-x64. So far, we have seen this SIGBUS error only while using CMS on Solaris-SPARC. We have seen it happen with both JDK9 and 8u20.
19-06-2014
The performance machines only run Solaris 11, so I’m sure that’s the only version we’ve tried. This benchmark also runs on Solaris x86, Linux and Windows (all 64-bit) without problems.
16-06-2014
The performance team is only seeing this with CMS. It is reported on solaris_11. I'll ask if it has been seen with other OS's.
16-06-2014
Problem reproduced with 8u20-b18, again with Solaris SPARC and SPECjbb2013-CMS. No core files from these runs, but hs_err_pid files are available.
Here are the two aurora batches that failed:
http://aurora.se.oracle.com/faces/Batch.xhtml?batchName=29357.perfSubmit
http://aurora.se.oracle.com/faces/Batch.xhtml?batchName=29542.perfSubmit
And here are links to the log directories for the corresponding failures within these two batches:
http://sthaurora-ds.se.oracle.com:10501/runs/29357.perfSubmit-1/logs.specjbb2013/results.specjbb2013/results_2/
http://sthaurora-ds.se.oracle.com:10501/runs/29542.perfSubmit-1/logs.specjbb2013/results.specjbb2013/results_3/
16-06-2014
Got 3 failures in a 5-iteration run. Iterations 1, 2, and 4 failed with SIGBUS. This is a different machine from the earlier failure. This run was with a 9-b13 variant, the baseline comparison build for the Sparc-Solaris compiler switch (i.e., not the new compiler).
Batch: http://aurora.se.oracle.com/faces/Job.xhtml?job_id=26335.perfSubmit.perf-specjbb2013-SolarisSparc64.refworkload
As with the previous failure, this is SPECjbb2013-CMS on Solaris-SPARC. This time we captured the core files. They can be found in the results_* directory for each iteration at:
http://sthaurora-ds.se.oracle.com:10501/runs/26335.perfSubmit-1/logs.specjbb2013/
--Resii
11-06-2014
Because core files are generated on Solaris with different names than Refworkload expected, the core files were not captured. When either of these issues happens again and we get the core file, we will be able to look into this more. Until then, we're in a holding pattern.
23-05-2014
Restarting the run resulted in a failure in one of the 5 runs. This time it was a SIGSEGV that terminated the benchmark, but we did get a full hs_err_pid file. The failed iteration results are here:
http://sthaurora-ds.se.oracle.com:10501/runs/22728.perfSubmit-1/logs.specjbb2013/results.specjbb2013/results_1/
The hs_err_pid16955.log file and a log file contain the relevant data.
15-05-2014