JDK-8044729 : Intermittent SIGBUS on SPECjbb2013-CMS on Solaris SPARC with 9-b12
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 8u20,9
  • Priority: P3
  • Status: Resolved
  • Resolution: Duplicate
  • OS: solaris_11
  • CPU: sparc_64
  • Submitted: 2014-05-14
  • Updated: 2014-10-06
  • Resolved: 2014-10-06
Related Reports
Duplicate :  
The 9-b12 promotion performance run on Aurora Performance had 2 of 5 iterations of SPECjbb2013-CMS fail with SIGBUS errors. The hs_err_pid files in both cases are truncated, suggesting that we may have even taken a fault on a fault.

This could be a hardware condition.
[ed. Not a hardware condition, the SIGBUS has occurred on a different machine.  See the June 2 comment.]

Please do not access the machine indicated in this bug report. Any access to this production performance machine needs to be coordinated with the Aurora Performance team. 

JDK-8041744 is a similar failure but with SEGV instead of bus error.

Reproduced the problem with product 8u20 binaries on our T4 after 11 iterations. The card table (CT) base is again %l7. Trying with fastdebug now.

SQE is OK to defer the issue, since these are intermittent failures and this does not seem to be a regression in 8u20.

Ran it 33 times over the weekend on a T4 we have, with the same command line options - didn't see any failures.

Yep... Here's the SIGBUS message showing that (from iteration #1).

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGBUS (0xa) at pc=0xffffffff6af4c648, pid=2925, tid=2830
#
# JRE version: Java(TM) SE Runtime Environment (8.0_20-b20) (build 1.8.0_20-ea-b20)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.20-b20 mixed mode solaris-sparc compressed oops)
# Problematic frame:
# J 7277 C2 org.spec.jbb.hq.tx.InstallmentPurchaseVoidTransaction.execute()Lorg/spec/jbb/core/comm/Response; (98 bytes) @ 0xffffffff6af4c648 [0xffffffff6af4ba40+0xc08]

It still happens exclusively in org.spec.jbb.hq.tx.InstallmentPurchaseVoidTransaction::execute(), right?

Rob asked: "Brian.. can you confirm?" From what I can dig up, Rob's data is up to date. There may have been some incidents in earlier 8u20 or 9 builds, but our process back then was to re-run the jobs, which ended up overwriting the log files and changing the job status from Failed to Finished, so those records got lost. Once we figured this out, we changed our process for dealing with failed jobs. As we get time, we are running back-builds through aurora-performance. But again, due to the intermittent nature of the bug, it could have slipped into a build where it just happened not to reproduce during the benchmark run for that build.

FYI... the SIGBUS is still present, this time against 8u20-b20. One run of jbb2013-CMS on SPARC failed. The other 4 runs succeeded. No new core files, but wanted to log the fact that it continues to occur.

Igor asked: "So has this problem started occurring only after 9b12?" We started jdk9 performance coverage around b12, and 8u20 coverage around b16, so our focus has been on the builds since then. However, in the background we are starting to work through the older builds. The older builds for 8u20 and 9 have now all been run, but not fully examined yet. I'll ask Brian to confirm this, but from what I know, the earliest we've seen the SIGBUS is 9-b12 and 8u20-b18. However, keep in mind that the SIGBUS is intermittent. To my knowledge, we have seen the SIGBUS in:
9: b12, b15, b16 <did not show in b19, b18, b17, b14, b13>
8u20: b18 <did not show in b19>
So it's possible it's in earlier builds but just didn't show itself. Brian.. can you confirm?

So, what this code is doing is making a field store and a post-barrier. The ldx with a negative offset is a load, from the constant table, of the card table base address. The value of %l7 at this point looks bogus; we need to find out how it gets killed.
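For context, a card-table post-barrier boils down to: after storing into an object field, shift the field's address right by the card-size shift (9 here, matching the srlx %i3, 9 in the disassembly, i.e. 512-byte cards) and clear the corresponding byte in the card table. A minimal C++ sketch of that idea, where the names (CARD_SHIFT, card_table, post_barrier) and the toy table size are illustrative rather than HotSpot's actual code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of a card-table post-barrier. In the generated code,
// the card table base is loaded from the nmethod's constant table (the
// 'ldx [%l7 - 136], %l1' in the disassembly); here it is just a global.
static const int CARD_SHIFT = 9;           // 512-byte cards ('srlx %i3, 9')
static const unsigned TABLE_SIZE = 1u << 16;  // toy size for the sketch
static uint8_t card_table[TABLE_SIZE];

void post_barrier(const void* field_addr) {
    uintptr_t index = (uintptr_t)field_addr >> CARD_SHIFT;
    // 'clrb [%l1 + %l0]': the dirty-card value is 0, hence a byte clear.
    // The modulo only keeps the toy table in bounds; a real card table
    // covers the whole heap.
    card_table[index % TABLE_SIZE] = 0;
}
```

If the base register (%l1, derived from %l7) is bogus at the clrb, the byte store lands at a wild address, which is consistent with the SIGBUS.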

So has this problem started occurring only after 9b12? Actually, I don't see anything suspicious going into b12 (and no compiler changes at all). It'd be nice to narrow down the build with which the failure started occurring.

The analysis above is a bit incorrect: the cwbe %g0, %g0, 0xffffffff69805160 instruction always branches, so in the context of ldx [%l7 - 136], %l1 the address base (%l7) does not necessarily contain the polling page address.

Did an initial analysis and it looks like a register allocation issue. The crash happens in a compiled frame, see comments inline below:

[...]
0xffffffff698050fc: sethi %hi(0x83cffc00), %g3          <-- Load safepoint check address
0xffffffff69805100: btog -1024, %g3
0xffffffff69805104: sra %l4, %g0, %l5
0xffffffff69805108: sllx %l5, 2, %l5
0xffffffff6980510c: add %i1, %l5, %l5
0xffffffff69805110: ld [%l5 + 16], %g1
0xffffffff69805114: ldx [%g3], %g0                      <-- Safepoint check
0xffffffff69805118: cwbe %g1, %g0, 0xffffffff698051c0 ! 0xffffffff698051c0
0xffffffff6980511c: sllx %g1, 3, %i3
0xffffffff69805120: add %i3, %g6, %i3
0xffffffff69805124: cwbe %g0, %g0, 0xffffffff698050ac ! 0xffffffff698050ac
0xffffffff69805128: sethi %hi(0x83cffc00), %l7          <-- Load safepoint check address again
0xffffffff6980512c: btog -1024, %l7
0xffffffff69805130: srl %l2, 1, %l2
0xffffffff69805134: ldx [%l7], %g0                      <-- Safepoint check (so far so good....)
0xffffffff69805138: cwbne %l2, 0x0, 0xffffffff698052a0 ! 0xffffffff698052a0
0xffffffff6980513c: ld [%i2 + 16], %l1
0xffffffff69805140: cmp %l3, %l1
0xffffffff69805144: bge,pn %icc, 0xffffffff69805b94 ! 0xffffffff69805b94
0xffffffff69805148: nop
0xffffffff6980514c: cwbe %g0, %g0, 0xffffffff69805160 ! 0xffffffff69805160
0xffffffff69805150: ldx [%l7 - 136], %l1                <-- Here we crash: we dereference [%l7 - 136], but %l7 contains the safepoint check address here. A register allocation bug?
0xffffffff69805154: st %l3, [%i3 + 28]
0xffffffff69805158: srlx %i3, 9, %l0
0xffffffff6980515c: clrb [%l1 + %l0]
0xffffffff69805160: ldx [%g2 + 96], %i0
0xffffffff69805164: ldx [%g2 + 112], %l0
[...]

Moving bug to the compiler team for further analysis.
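As a toy analogy (plain C++, not HotSpot code), the suspected bug is that a value which must stay live across the safepoint poll, the constant-table base in %l7, is clobbered when the poll sequence materializes the polling-page address into the same register. All names below are hypothetical stand-ins:

```cpp
#include <cassert>
#include <cstdint>

uint64_t constant_table[32];        // stand-in for the nmethod constant table
volatile uint64_t polling_page = 0; // stand-in for the safepoint polling page

// Returns the address the post-barrier would load the card table base from.
// In correct code, "l7" still holds the constant-table base at the final
// step; here it has been reused for the poll address, so the computed
// address [%l7 - 136] is bogus.
uint64_t* buggy_sequence() {
    uint64_t* l7 = constant_table + 17;       // base expected by 'ldx [%l7 - 136]'
    l7 = (uint64_t*)&polling_page;            // 'sethi'/'btog': poll address clobbers %l7
    (void)*(volatile uint64_t*)l7;            // 'ldx [%l7], %g0' -- the poll itself is fine
    return (uint64_t*)((char*)l7 - 136);      // 'ldx [%l7 - 136]' -- now points nowhere useful
}
```

In the sketch, the correct result would be &constant_table[0] (offset 17*8 - 136 = 0); after the clobber, the computed address is 136 bytes below the polling page instead, which on SPARC is exactly the kind of wild load/store that surfaces as SIGBUS.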

Right... We run SPECjbb2013 with CMS, G1, ParallelGC on 4 platforms: Solaris-SPARC, Solaris-x64, Linux-x64, Windows Server 2008-x64 So far, we have seen this SIGBUS error only while using CMS on Solaris-SPARC. We have seen it happen with both JDK9 and 8u20.

The performance machines only run Solaris 11, so I'm sure that's the only version we've tried. This benchmark also runs on Solaris x86, Linux and Windows (all 64-bit) without problems.

The performance team is only seeing this with CMS. It is reported on solaris_11. I'll ask if it has been seen with other OS's.

Problem reproduced with 8u20-b18, again with Solaris SPARC and SPECjbb2013-CMS. No core files from these runs, but hs_err_pid files are available. Here are the two aurora batches that failed: http://aurora.se.oracle.com/faces/Batch.xhtml?batchName=29357.perfSubmit http://aurora.se.oracle.com/faces/Batch.xhtml?batchName=29542.perfSubmit And here are links to the log directories for corresponding failures within these two batches: http://sthaurora-ds.se.oracle.com:10501/runs/29357.perfSubmit-1/logs.specjbb2013/results.specjbb2013/results_2/ http://sthaurora-ds.se.oracle.com:10501/runs/29542.perfSubmit-1/logs.specjbb2013/results.specjbb2013/results_3/

Got 3 failures in a 5-iteration run. Iterations 1, 2, and 4 failed with SIGBUS. This is a different machine from the earlier failure. This run was with a 9-b13 variant, the baseline comparison build for the SPARC-Solaris compiler switch (i.e., not the new compiler). Batch: http://aurora.se.oracle.com/faces/Job.xhtml?job_id=26335.perfSubmit.perf-specjbb2013-SolarisSparc64.refworkload As with the previous failure, this is SPECjbb2013-CMS on Solaris-SPARC. This time we captured the core files. They can be found in the results_* directory for each iteration at: http://sthaurora-ds.se.oracle.com:10501/runs/26335.perfSubmit-1/logs.specjbb2013/ --Resii

Because the core files were generated with different names on Solaris than Refworkload expected, they were not captured. When either of these issues happens again and we get the core file, we will be able to look into this more. Until then, we're in a holding pattern.

Restarting the run resulted in a failure in one of the 5 runs. This time it was a SIGSEGV that terminated the benchmark, but we did get a full hs_err_pid file. The failed iteration results are here: http://sthaurora-ds.se.oracle.com:10501/runs/22728.perfSubmit-1/logs.specjbb2013/results.specjbb2013/results_1/ The hs_err_pid16955.log file and a log file contain the relevant data.