JDK-6859466 : Java 6 u13 (64-bit) crashes on RHEL 5.2 (64-bit) in CMS; Need analysis of core file
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 6u13
  • Priority: P2
  • Status: Closed
  • Resolution: Duplicate
  • OS: linux_redhat_5.2
  • CPU: unknown
  • Submitted: 2009-07-12
  • Updated: 2010-05-10
  • Resolved: 2009-10-06
Related Reports
Duplicate :  
Description
As the customer explains it, they introduced new code to their application about a week ago (so roughly 6 July 2009).  The new code causes a high number of CMS collections in the Eden space (about every 2 seconds) for over half an hour upon startup.

After about an hour of operation, the application crashes, leaving a core file and an hs_err log.  The appserver in which it runs does not display any errors before the crash.

Review of the hs_err logs suggested to CFEs that bug 6793611 was at fault.  Lawrence Chow, engaged for the latest crash, disagrees.  While the problem is in CMS, he feels that 6793611 is not the culprit.  Analysis of the core file is required to determine the root cause.

Core file is at /net/cores.central/cores/71275408/12July/core.6220.  Libraries are in /net/cores.central/cores/71275408/12July/libs.  All associated data collected from the latest crash (jstack, jmap, gclogs) are in /net/cores.central/cores/71275408/12July
Customer is using 64-bit Java 6 update 13 on Red Hat Linux 5.2.  Since the coding changes made last week, which increased the amount of GC in Eden, the application crashes after about an hour of uptime.

Comments
EVALUATION

Looking at hs_err_pid6220.txt (attached), the symptom appears to be a bad count value passed from copy_to_survivor_space_avoiding_promotion_undo() to Copy::aligned_disjoint_words() to Copy::pd_disjoint_words(): possibly a negative value or an underflow, either of which is huge when treated as unsigned. The relevant instructions are

  ;; 00002ad5344d412f 48 c1 e9 03    shr $0x3,%rcx
  ;; ---------------
  ;; 00002ad5344d4133 f3 48 a5       repz movsq %ds:(%rsi),%es:(%rdi)

The fault occurred on repz movsq. rcx is the count (divided by 8 to convert from bytes to 64-bit quadwords), rsi is the source address, and rdi is the destination address:

  RCX=0x1fffffffb744fa9b
  RSI=0x00002aacb18a9598
  RDI=0x00002aacc8237000

Both rsi and rdi point past the end of the heap; rcx is huge. The next question is where the bad count originated.
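The "huge when treated as unsigned" reasoning can be checked with a little arithmetic. The following is only a sketch in Python, not HotSpot code: the helper names are made up for illustration, and the starting byte count of -8 is a hypothetical value chosen to show the effect, not something recovered from the core file. Only the RCX value is taken from the hs_err log.

```python
MASK64 = (1 << 64) - 1  # width of a 64-bit register such as rcx


def as_unsigned64(value: int) -> int:
    """Reinterpret a (possibly negative) integer as an unsigned 64-bit value."""
    return value & MASK64


def shr3(byte_count: int) -> int:
    """Mimic `shr $0x3,%rcx`: a logical (zero-filling) right shift by 3."""
    return as_unsigned64(byte_count) >> 3


# RCX as reported in the hs_err log at the time of the fault.
RCX_AT_FAULT = 0x1FFFFFFFB744FA9B

# Hypothetical input: a small negative byte count (-8 is an assumption).
hypothetical_count = shr3(-8)
print(hex(hypothetical_count))

# The logged value has its top three bits clear and the bits just below
# them set, the pattern a logical shift by 3 produces when the high bits
# of the input were set (as they are for a negative number).
print(RCX_AT_FAULT >> 61)
print(hex(RCX_AT_FAULT >> 32))

# Converted back to bytes, the remaining count exceeds 2**63, far more
# than any real heap could hold.
print(RCX_AT_FAULT * 8 > 2**63)
```

Because rcx is decremented as repz movsq copies each quadword, the logged value is whatever remained when the fault hit, so it cannot by itself reveal the original count. It is merely consistent with a negative or underflowed count having been passed in.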
13-07-2009