Bug ID: JDK-8009940 G1: assert(_finger == _heap_end) failed, concurrentMark.cpp:809

Details
Type: Bug
Submit Date: 2013-03-12
Status: Resolved
Updated Date: 2013-05-14
Project Name: JDK
Resolved Date: 2013-03-19
Component: hotspot
OS: generic
Sub-Component: gc
CPU: generic
Priority: P3
Resolution: Fixed
Affected Versions:
Fixed Versions: hs25 (b24)

Description
While trying out various solutions for 8009536, I obtained the following crash when running the test case with parallel reference processing enabled:

[junit4:junit4] # To suppress the following error report, specify this argument
[junit4:junit4] # after -XX: or in .hotspotrc:  SuppressErrorAt=/concurrentMark.cpp:809
[junit4:junit4] #
[junit4:junit4] # A fatal error has been detected by the Java Runtime Environment:
[junit4:junit4] #
[junit4:junit4] #  Internal Error (/export/workspaces/8009536_3/src/share/vm/gc_implementation/g1/concurrentMark.cpp:809), pid=16314, tid=14
[junit4:junit4] #  assert(_finger == _heap_end) failed: only way to get here
[junit4:junit4] #
[junit4:junit4] # JRE version: Java(TM) SE Runtime Environment (8.0-b79) (build 1.8.0-ea-fastdebug-b79)
[junit4:junit4] # Java VM: Java HotSpot(TM) Server VM (25.0-b23-internal-jvmg mixed mode solaris-x86 )
[junit4:junit4] # Core dump written. Default location: /export/bugs/8009536/lucene-5.0-2013-03-05_15-37-06/build/analysis/uima/test/J0/core or core.16314
[junit4:junit4] #
[junit4:junit4] # An error report file with more information is saved as:
[junit4:junit4] # /export/bugs/8009536/lucene-5.0-2013-03-05_15-37-06/build/analysis/uima/test/J0/hs_err_pid16314.log
[junit4:junit4] #
[junit4:junit4] # If you would like to submit a bug report, please visit:
[junit4:junit4] #   http://bugreport.sun.com/bugreport/crash.jsp
[junit4:junit4] #
[junit4:junit4] Current thread is 14
[junit4:junit4] Dumping core ...

                                    

Comments
The stack trace for the assertion failure is:

Current thread (0x08353400):  VMThread [stack: 0xd817f000,0xd81ff000] [id=14]

Stack: [0xd817f000,0xd81ff000],  sp=0xd81fe34c,  free space=508k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xdcaeac]  void VMError::report(outputStream*)+0xdec
V  [libjvm.so+0xdcc205]  void VMError::report_and_die()+0x735
V  [libjvm.so+0x6af46b]  void report_vm_error(const char*,int,const char*,const char*)+0x8b
V  [libjvm.so+0x65e6a3]  void ConcurrentMark::set_phase(unsigned,bool)+0x1a3
V  [libjvm.so+0x660534]  void G1CMRefProcTaskExecutor::execute(AbstractRefProcTaskExecutor::ProcessTask&)+0xc4
V  [libjvm.so+0xc5275c]  void ReferenceProcessor::process_discovered_reflist(DiscoveredList*,ReferencePolicy*,bool,BoolObjectClosure*,OopClosure*,VoidClosure*,AbstractRefProcTaskExecutor*)+0x25c
V  [libjvm.so+0xc50af6]  void ReferenceProcessor::process_discovered_references(BoolObjectClosure*,OopClosure*,VoidClosure*,AbstractRefProcTaskExecutor*)+0x266
V  [libjvm.so+0x660975]  void ConcurrentMark::weakRefsWork(bool)+0x2d5
V  [libjvm.so+0x65f593]  void ConcurrentMark::checkpointRootsFinal(bool)+0x153
V  [libjvm.so+0x6983e8]  void CMCheckpointRootsFinalClosure::do_void()+0x18
V  [libjvm.so+0xe0ecd1]  void VM_CGC_Operation::doit()+0xe1
V  [libjvm.so+0xe0b82d]  void VM_Operation::evaluate()+0x7d
V  [libjvm.so+0xe0938c]  void VMThread::evaluate_operation(VM_Operation*)+0x9c
V  [libjvm.so+0xe09960]  void VMThread::loop()+0x4d0
V  [libjvm.so+0xe08fc1]  void VMThread::run()+0x111
V  [libjvm.so+0xbb1490]  java_start+0x210
C  [libc.so.1+0xa73b7]  _thr_setup+0x4e
C  [libc.so.1+0xa76a0]  __moddi3+0x60

And here's the assert:

void ConcurrentMark::set_phase(uint active_tasks, bool concurrent) {
  assert(active_tasks <= _max_worker_id, "we should not have more");

  _active_tasks = active_tasks;
  // Need to update the three data structures below according to the
  // number of active threads for this phase.
  _terminator   = ParallelTaskTerminator((int) active_tasks, _task_queues);
  _first_overflow_barrier_sync.set_n_workers((int) active_tasks);
  _second_overflow_barrier_sync.set_n_workers((int) active_tasks);

  _concurrent = concurrent;
  // We propagate this to all tasks, not just the active ones.
  for (uint i = 0; i < _max_worker_id; ++i)
    _tasks[i]->set_concurrent(concurrent);

  if (concurrent) {
    set_concurrent_marking_in_progress();
  } else {
    // We currently assume that the concurrent flag has been set to
    // false before we start remark. At this point we should also be
    // in a STW phase.
    assert(!concurrent_marking_in_progress(), "invariant");
    assert(_finger == _heap_end, "only way to get here");
    update_g1_committed(true);
  }
}

What happens is that the worker threads are executing their drain closure (an instance of G1CMDrainMarkingStackClosure) when they see that they should abort their marking task because the marking stack has overflowed. To handle the overflow, the workers enter the first sync barrier in CMTask::do_marking_step():

  if (has_aborted()) {
    // The task was aborted for some reason.

    statsOnly( ++_aborted );

    if (_has_timed_out) {
      double diff_ms = elapsed_time_ms - _time_target_ms;
      // Keep statistics of how well we did with respect to hitting
      // our target only if we actually timed out (if we aborted for
      // other reasons, then the results might get skewed).
      _marking_step_diffs_ms.add(diff_ms);
    }

    if (_cm->has_overflown()) {
      // This is the interesting one. We aborted because a global
      // overflow was raised. This means we have to restart the
      // marking phase and start iterating over regions. However, in
      // order to do this we have to make sure that all tasks stop
      // what they are doing and re-initialise in a safe manner. We
      // will achieve this with the use of two barrier sync points.

      if (_cm->verbose_low()) {
        gclog_or_tty->print_cr("[%u] detected overflow", _worker_id);
      }

      if (!is_serial) {
        // We only need to enter the sync barrier if being called
        // from a parallel context
        _cm->enter_first_sync_barrier(_worker_id);
        
        // When we exit this sync barrier we know that all tasks have
        // stopped doing marking work. So, it's now safe to
        // re-initialise our data structures. At the end of this method,
        // task 0 will clear the global data structures.
      }

After worker 0 enters the first sync barrier, it resets the marking state:

  // let the task associated with worker 0 do this
  if (worker_id == 0) {
    // task 0 is responsible for clearing the global data structures
    // We should be here because of an overflow. During STW we should
    // not clear the overflow flag since we rely on it being true when
    // we exit this method to abort the pause and restart concurrent
    // marking.
    reset_marking_state(concurrent() /* clear_overflow */);
    force_overflow()->update();

    if (G1Log::fine()) {
      gclog_or_tty->date_stamp(PrintGCDateStamps);
      gclog_or_tty->stamp(PrintGCTimeStamps);
      gclog_or_tty->print_cr("[GC concurrent-mark-reset-for-overflow]");
    }
  }

Resetting the marking state resets the value of the global finger from _heap_end to _heap_start:

void ConcurrentMark::reset_marking_state(bool clear_overflow) {
  _markStack.set_should_expand();
  _markStack.setEmpty();        // Also clears the _markStack overflow flag
  if (clear_overflow) {
    clear_has_overflown();
  } else {
    assert(has_overflown(), "pre-condition");
  }
  _finger = _heap_start;

  for (uint i = 0; i < _max_worker_id; ++i) {
    CMTaskQueue* queue = _task_queues->queue(i);
    queue->set_empty();
  }
}

When the parallel task for the next phase of parallel reference processing is created, set_phase is called:

void G1CMRefProcTaskExecutor::execute(ProcessTask& proc_task) {
  assert(_workers != NULL, "Need parallel worker threads.");
  assert(_g1h->ref_processor_cm()->processing_is_mt(), "processing is not MT");

  G1CMRefProcTaskProxy proc_task_proxy(proc_task, _g1h, _cm);

  // We need to reset the phase for each task execution so that
  // the termination protocol of CMTask::do_marking_step works.
  _cm->set_phase(_active_workers, false /* concurrent */);
  _g1h->set_par_threads(_active_workers);
  _workers->run_task(&proc_task_proxy);
  _g1h->set_par_threads(0);
}

causing the assertion failure.
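The failing sequence can be condensed into a toy model (simplified stand-ins, not HotSpot code): the overflow handler moves the global finger back to the bottom of the heap, which is exactly the state the non-concurrent set_phase() asserts against.

```cpp
#include <cassert>

// Toy model of the race (simplified stand-ins, not HotSpot code).
struct ToyMarkState {
  int heap_start = 0;
  int heap_end   = 100;
  int finger     = 100;  // normally == heap_end by the time remark runs

  // What reset_marking_state() does to the global finger after overflow.
  void reset_for_overflow() { finger = heap_start; }

  // The invariant set_phase() asserts in the non-concurrent case.
  bool finger_at_heap_end() const { return finger == heap_end; }
};
```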
                                     
2013-03-12
The solution is a three-pronged approach:

1. Skip reference processing if the marking stack overflowed during the actual remark task.

This is safe to do because reference objects are only discovered once: once a reference object is discovered, it remains on the discovered list until it is processed. Since marking is going to restart as a result of the overflow, the references will be processed after a successful remark. We could use the routine abandon_partial_discovery() to depopulate the discovered lists, but the only benefit of doing so would be discovering fewer references during the subsequent restart.
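Fix point 1 can be sketched as a toy model (a hedged sketch, not the actual patch): reference processing is skipped entirely when the mark stack has overflowed, and the discovered references stay on their lists until marking restarts and remark succeeds.

```cpp
#include <cassert>

// Toy model of fix point 1 (simplified stand-ins, not the actual patch).
struct ToyRemark {
  bool overflown      = false;  // did the global mark stack overflow?
  bool refs_processed = false;

  void weak_refs_work() { refs_processed = true; }

  void do_remark() {
    // Skip reference processing if the mark stack overflowed during the
    // remark task; the discovered lists are left intact and will be
    // processed after marking restarts and a remark succeeds.
    if (!overflown) {
      weak_refs_work();
    }
  }
};
```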

2. Remove the set_phase() calls when we execute a new ref processing proxy task.

To do this we need to call set_phase() from within weakRefsWork() and call the terminator's reset_for_reuse() routine when we execute a new ref processing task. This will remove the path that checks the assert.
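A toy model of fix point 2 (hedged, not the actual patch): the phase is set once up front, and each subsequent proxy task only resets the terminator for reuse, so the finger invariant in set_phase() is never re-checked in the middle of reference processing.

```cpp
#include <cassert>

// Toy model of fix point 2 (simplified stand-ins, not the actual patch).
struct ToyTerminator {
  int n_workers  = 0;
  int terminated = 0;
  void reset_for_reuse() { terminated = 0; }  // keep n_workers as-is
};

struct ToyRefProcExecutor {
  ToyTerminator term;
  int phase_checks = 0;         // counts how often set_phase() runs

  void set_phase(int workers) { // re-checks invariants; call it once
    term.n_workers  = workers;
    term.terminated = 0;
    ++phase_checks;
  }

  void execute_task() {         // per proxy task: no set_phase() call
    term.reset_for_reuse();
  }
};
```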

3. Do not have worker 0 reset the marking state during the non-concurrent (STW) phase of marking.

During the STW remark pause there is another call to reset the marking state (in the event of overflow) immediately after the discovered references are processed:

  double start = os::elapsedTime();

  checkpointRootsFinalWork();

  double mark_work_end = os::elapsedTime();

  weakRefsWork(clear_all_soft_refs);

  if (has_overflown()) {
    // Oops.  We overflowed.  Restart concurrent marking.
    _restart_for_overflow = true;
    // Clear the marking state because we will be restarting
    // marking due to overflowing the global mark stack.
    reset_marking_state();
    if (G1TraceMarkStackOverflow) {
      gclog_or_tty->print_cr("\nRemark led to restart for overflow.");
    }
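Fix point 3 can likewise be sketched as a toy model (simplified, not the actual patch): worker 0 only resets the global state at the first sync barrier while marking is concurrent; during the STW remark, the reset after weakRefsWork() shown above is the one that runs.

```cpp
#include <cassert>

// Toy model of fix point 3 (simplified stand-ins, not the actual patch).
struct ToyOverflowHandler {
  bool concurrent = false;  // are we in the concurrent phase of marking?
  int  heap_start = 0;
  int  heap_end   = 100;
  int  finger     = 100;

  // Overflow handling at the first sync barrier: worker 0 resets the
  // global state only during concurrent marking; during the STW remark
  // the reset is deferred until after reference processing.
  void handle_overflow(unsigned worker_id) {
    if (worker_id == 0 && concurrent) {
      finger = heap_start;
    }
  }
};
```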

                                     
2013-03-12
URL:   http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/e864cc14ca75
User:  johnc
Date:  2013-03-19 10:31:54 +0000

                                     
2013-03-19
URL:   http://hg.openjdk.java.net/hsx/hsx25/hotspot/rev/e864cc14ca75
User:  amurillo
Date:  2013-03-23 20:20:47 +0000

                                     
2013-03-23
The test for this CR is the Apache Lucene test suite, run with -XX:+ParallelRefProcEnabled and -XX:MarkStackSize=32K.
                                     
2013-05-07


