JDK-6847956 : G1: crash in oopDesc*G1ParCopyHelper::copy_to_survivor_space(oopDesc*)
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 6u14,6u16
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic,solaris_10
  • CPU: generic,sparc
  • Submitted: 2009-06-04
  • Updated: 2013-09-18
  • Resolved: 2009-11-11
JDK 6:  6u17-rev (Fixed)
JDK 7:  7 (Fixed)
Other:  hs14.3 (Fixed)
A customer got a crash in G1 immediately after startup:

# A fatal error has been detected by the Java Runtime Environment:
#  SIGSEGV (0xb) at pc=0xfe87933c, pid=8596, tid=7
# JRE version: 6.0_14-b08
# Java VM: Java HotSpot(TM) Server VM (14.0-b16 mixed mode solaris-sparc )
# Problematic frame:
# V  [libjvm.so+0x47933c]
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
 O0=0x00036900 O1=0x0002cd80 O2=0x00000000 O3=0x00000000
 O4=0x00000001 O5=0x00000000 O6=0xfbd7d428 O7=0xfe8a67fc
 G1=0x01ba85ea G2=0x00000000 G3=0xcb5e0000 G4=0xffffe25c
 G5=0x00000000 G6=0x00000000 G7=0xfe261000 Y=0x00000000
 PC=0xfe87933c nPC=0xfe879340
Instructions: (pc=0xfe87933c)
0xfe87932c:   97 31 70 3f da 06 c0 13 98 9a e0 01 d8 23 a0 64
0xfe87933c:   de 03 61 14 96 03 e0 01 02 40 00 08 d6 23 a0 60
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x47933c] oopDesc*G1ParCopyHelper::copy_to_survivor_space(oopDesc*)+0xb0
V [libjvm.so+0x4814e0] void G1ParCopyClosure::do_oop_work(oopDesc**)+0xa0
V [libjvm.so+0x47db98] void BufferingOopClosure::process_buffer()+0x48
V [libjvm.so+0x47b10c] void G1CollectedHeap::g1_process_strong_roots(bool,SharedHeap::ScanningOption,OopClosure*,OopsInHeapRegionClosure*,OopsInHeapRegionClosure*,OopsInGenClosure*,int)+0x25c
V [libjvm.so+0x47d688] void G1ParTask::work(int)+0x520
V [libjvm.so+0x810388] void GangWorker::loop()+0x8c
V [libjvm.so+0x6f54f0] java_start+0x234 
---------------  S Y S T E M  ---------------

OS:                       Solaris 10 11/06 s10s_u3wos_10 SPARC
           Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                           Assembled 14 November 2006

uname:SunOS 5.10 Generic_120011-14 sun4u  (T2 libthread)
rlimit: STACK 8192k, CORE infinity, NOFILE 65536, AS infinity
load average:0,34 0,31 0,29

CPU:total 14 has_v8, has_v9, has_vis1, has_vis2, is_ultra3

Memory: 8k page, physical 83886080k(56274704k free)

vm_info: Java HotSpot(TM) Server VM (14.0-b16) for solaris-sparc JRE (1.6.0_14-b
08), built on May 21 2009 01:43:32 by "" with Workshop 5.8

time: Thu Jun  4 11:10:53 2009
elapsed time: 0 seconds

The full hs_err_pid8596.log is attached to this report.

EVALUATION http://hg.openjdk.java.net/jdk7/hotspot-gc/hotspot/rev/1f19207eefc2

EVALUATION For what it is worth: It would be really nice not to have to mark objects as we copy them to the survivors (to avoid the extra overhead during the GC pause, as well as avoid having to notify the marking phase that those objects have moved). Note that, if we do several GC pauses during a marking phase, the majority of objects in the survivors would be objects that were allocated since the start of the marking phase which, according to the SATB invariants, we do not have to visit during the marking phase; it's only the objects in the survivors after the initial-mark pause we really need to visit. I'll open a CR to track this idea (it's CR 6888336).

EVALUATION I should have added: Typically, I could get the test to fail within 30 mins and after 3 marking cycles at most (typically, it'd fail after the first). I ran with the fix overnight for 12+ hours and 360+ marking cycles with no failures.

WORK AROUND -XX:MaxTenuringThreshold=0 (with a tenuring threshold of 0, objects are promoted directly to the old generation instead of being copied to survivor regions, so the faulty survivor-region handling is never exercised).

SUGGESTED FIX The fix is straightforward, in heapRegion.hpp:

      void note_end_of_copying() {
    -   assert(top() >= _next_top_at_mark_start,
    -          "Increase only");
    -   // Survivor regions will be scanned on the start of concurrent
    -   // marking.
    -   if (!is_survivor()) {
    +   assert(top() >= _next_top_at_mark_start, "Increase only");
        _next_top_at_mark_start = top();
    -   }
      }

EVALUATION The incomplete-marking issue is caused by the way we deal with the survivor spaces while marking is in progress. In G1, there are two ways in which an object is considered live: first, if it is marked in the bitmap; second, if it lies above the "TAMS" (top-at-mark-start) field of its containing region. We keep two copies of this liveness information: the "previous" one (the last one obtained, which is known to be consistent) and the "next" one (the one currently in progress, which might be inconsistent). Here we deal with the next marking info, as it's the one being obtained during the marking cycle.

One more thing to point out: in G1, when we evacuate objects during evacuation pauses, if they are considered live we also have to explicitly mark them in their new location (typically by marking them in the bitmap). In some cases we also have to notify the marking threads that an object has been evacuated.

The bug is caused because, during marking, we explicitly set the NTAMS (next TAMS) field of each region that contains survivors to bottom, thus making all of its contents implicitly live. Consider the following scenario: we have a -> b -> c, with a and b in a survivor space and c, say, in the old generation. Let's also assume that, when the evacuation pause starts, a is marked while b and c are not. When we copy a and b to a survivor region, we propagate a's mark to the bitmap, notify the marking threads that they have to visit it, and then set the NTAMS field of that region to bottom, making both objects implicitly marked (note that a is both explicitly and implicitly marked at this point). When marking finally comes across a and examines its reference to b, it says "ah, b is already live" (because b is over NTAMS) and incorrectly doesn't process it further. As a result, b is never visited by the marking threads and c is never marked.
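To make the scenario concrete, here is a minimal, self-contained C++ sketch (not HotSpot code; the region layout, addresses, and helper names are all invented for illustration). It models the two liveness tests described above and shows how a survivor region's NTAMS at bottom hides b from marking, while NTAMS at top (as in the fix) lets marking reach c through b:

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy model of G1's "next" liveness information during concurrent
// marking. An object is considered live by the next marking info if
// it is either explicitly marked in the bitmap or lies at/above its
// region's NTAMS (next top-at-mark-start).

struct Region {
  int bottom;
  int top;
  int ntams;  // addresses >= ntams are implicitly live
};

// The chain from the evaluation: a -> b -> c.
// a and b sit in a survivor region, c in an old region.
const int A = 10, B = 11, C = 20;

// Run a tiny marking cycle seeded with the bitmap-marked object a,
// with the survivor region's NTAMS set as given. Returns true iff c
// ends up marked, i.e. iff marking reached it through b.
bool marking_reaches_c(int survivor_ntams) {
  Region survivor{10, 12, survivor_ntams};
  Region old_gen{20, 30, 30};  // c is below NTAMS: needs an explicit mark

  std::set<int> bitmap = {A};  // a's mark was propagated during evacuation

  auto region_of = [&](int addr) -> Region& {
    return addr < 20 ? survivor : old_gen;
  };
  auto is_live_next = [&](int addr) {
    return bitmap.count(addr) != 0 || addr >= region_of(addr).ntams;
  };
  // The single outgoing reference of each object (c has none).
  auto reference_of = [](int addr) -> int {
    if (addr == A) return B;
    if (addr == B) return C;
    return -1;
  };

  std::vector<int> stack = {A};  // marking worklist
  while (!stack.empty()) {
    int obj = stack.back();
    stack.pop_back();
    int ref = reference_of(obj);
    if (ref < 0) continue;
    // The flawed step: anything already considered live is assumed to
    // have been (or to be) scanned elsewhere and is not pushed.
    if (is_live_next(ref)) continue;
    bitmap.insert(ref);
    stack.push_back(ref);
  }
  return bitmap.count(C) != 0;
}
```

With the survivor NTAMS at bottom (10), b reads as implicitly live, marking never pushes it, and c stays unmarked; with NTAMS at top (12), b is below NTAMS, gets marked and scanned, and c is reached.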