JDK-8009536 : G1: Apache Lucene hang during reference processing
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2013-03-06
  • Updated: 2014-01-14
  • Resolved: 2013-03-19
  • Fixed in: 7u40 (JDK 7), 8 (JDK 8), hs24 (Other)
Description
Uwe Schindler at Apache reports a hang during Apache Lucene testing. From the stack trace it looks like we are hanging during reference processing.

The stack trace indicates that we are using single-threaded reference processing and that this is a recent build that includes this changeset:

changeset:   4062:f90b9bceb8e5
parent:      4060:84304a77c4e3
user:        johnc
date:        Tue Feb 05 09:13:05 2013 -0800
summary:     8005032: G1: Cleanup serial reference processing closures in concurrent marking

With this change we use the same closures for single-threaded and multi-threaded reference processing. My guess is that this might be what is causing the problem.

I'm guessing that we are stuck in this loop in WorkGangBarrierSync::enter():

    while (n_completed() != n_workers()) {
      monitor()->wait(/* no_safepoint_check */ true);
    }
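
For illustration, here is a minimal stand-alone sketch of that failure mode (this is not HotSpot code; BarrierSync and its members below are simplified stand-ins): the barrier's expected worker count is left over from an earlier parallel phase, only one thread ever arrives, and so the wait can never be satisfied.

    #include <chrono>
    #include <condition_variable>
    #include <cstdlib>
    #include <future>
    #include <iostream>
    #include <mutex>

    // Simplified stand-in for the worker barrier described above.
    class BarrierSync {
      std::mutex              _mon;
      std::condition_variable _cv;
      unsigned                _n_workers   = 0;  // threads the barrier expects
      unsigned                _n_completed = 0;  // threads that have arrived

     public:
      void set_n_workers(unsigned n) { _n_workers = n; }

      void enter() {
        std::unique_lock<std::mutex> lock(_mon);
        _n_completed++;
        // Analogous to the loop in WorkGangBarrierSync::enter(): block until
        // every expected worker has entered the barrier.
        _cv.wait(lock, [this] { return _n_completed == _n_workers; });
        _cv.notify_all();
      }
    };

    int main() {
      BarrierSync barrier;
      barrier.set_n_workers(4);  // stale count, e.g. set for a 4-worker parallel remark

      // Only one thread ever calls enter(), so the wait can never be satisfied.
      auto stuck = std::async(std::launch::async, [&] { barrier.enter(); });

      if (stuck.wait_for(std::chrono::seconds(1)) == std::future_status::timeout) {
        std::cout << "single thread is stuck in enter(); this models the hang" << std::endl;
      }
      // Exit the process directly; the blocked thread would otherwise wait forever.
      std::_Exit(0);
    }

In the report itself the stale count comes from the preceding parallel remark, while serial reference processing is executed by the VMThread alone (see the analysis in the Comments section below).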


Attaching the full thread dump supplied by Uwe. Here is the stack trace that I think is relevant:

Thread 10 (Thread 0xcf9efb40 (LWP 22951)):
#0  0xf7743430 in __kernel_vsyscall ()
#1  0xf771e96b in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/i386-linux-gnu/libpthread.so.0
#2  0xf6ec849c in os::PlatformEvent::park() ()
   from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#3  0xf6e98b82 in Monitor::IWait(Thread*, long long) ()
   from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#4  0xf6e99370 in Monitor::wait(bool, long, bool) ()
   from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#5  0xf704f094 in WorkGangBarrierSync::enter() ()
   from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#6  0xf6b62a09 in ConcurrentMark::enter_first_sync_barrier(unsigned int) ()
   from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#7  0xf6b65adc in CMTask::do_marking_step(double, bool, bool) ()
   from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#8  0xf6b69dc9 in G1CMDrainMarkingStackClosure::do_void() ()
   from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#9  0xf6f2bdff in ReferenceProcessor::process_phase3(DiscoveredList&, bool, BoolObjectClosure*, OopClosure*, VoidClosure*) ()
   from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#10 0xf6f2cf6d in ReferenceProcessor::process_discovered_reflist(DiscoveredList*, ReferencePolicy*, bool, BoolObjectClosure*, OopClosure*, VoidClosure*, AbstractRefProcTaskExecutor*) ()
   from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#11 0xf6f2d33c in ReferenceProcessor::process_discovered_references(BoolObjectClosure*, OopClosure*, VoidClosure*, AbstractRefProcTaskExecutor*) () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#12 0xf6b61ea5 in ConcurrentMark::weakRefsWork(bool) ()
   from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#13 0xf6b66638 in ConcurrentMark::checkpointRootsFinal(bool) ()
   from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#14 0xf6b82024 in CMCheckpointRootsFinalClosure::do_void() ()
   from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#15 0xf703dc65 in VM_CGC_Operation::doit() () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#16 0xf703d2a7 in VM_Operation::evaluate() () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#17 0xf703b44e in VMThread::evaluate_operation(VM_Operation*) ()
   from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#18 0xf703b9e8 in VMThread::loop() () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#19 0xf703c095 in VMThread::run() () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#20 0xf6ecee79 in java_start(Thread*) () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#21 0xf771ad4c in start_thread () from /lib/i386-linux-gnu/libpthread.so.0
#22 0xf763ed3e in clone () from /lib/i386-linux-gnu/libc.so.6

Here are the steps to reproduce according to Uwe:

To reproduce:
- Use a 32-bit JDK 8 b78 or b79 (tested on 64-bit Linux, but this should not matter)
- Download the Lucene source code (e.g. the snapshot version we were testing with: https://builds.apache.org/job/Lucene-Artifacts-trunk/2212/artifact/lucene/dist/)
- Change to the directory lucene/analysis/uima and run:
	ant -Dargs="-server -XX:+UseG1GC" -Dtests.multiplier=3 -Dtests.jvms=1 test
After a while the test framework prints "stalled" messages because the child VM actually running the test no longer responds; the PID is also printed. Trying to get a stack trace or kill the process gets no response; only kill -9 helps. Choosing another garbage collector in the above command line makes the test finish after a few seconds, e.g. -Dargs="-server -XX:+UseConcMarkSweepGC".
Comments
The problem is that the marking stack overflows and CMTask-0/the VMThread (the thread executing the serial reference processing code) enters the overflow handling code in CMTask::do_marking_step(). The overflow handling code makes the thread enter the first sync barrier and wait for the other threads. Unfortunately there are no other threads, but the sync barriers think there are. The number of threads in the barriers was last updated just before executing the parallel remark task, so the barriers expect the same number of threads as were involved in the remark and will wait indefinitely for that many threads to arrive. As mentioned, only the VMThread is running, and so we see the hang.

Proposed solutions:

* Use set_phase() in weakRefsWork() to update the number of threads in the parallel task terminator and the worker barrier syncs. This solution is not correct: if the VMThread enters the barrier sync we will get an assert that the current thread is NOT a concurrent GC thread, and the VMThread must not enter the barrier sync code. We would also have to update the number of workers in the parallel task terminator each time the VMThread executes the drain closure and offers termination.

* Skip offering termination and entering the sync barriers when reference processing is serial. This looks like the best solution and was the first solution suggested by Bengt (a simplified stand-alone sketch of this approach follows this comment).
12-03-2013
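
For illustration, here is a minimal stand-alone sketch of the second approach (simplified names, not the actual HotSpot fix; the real patch is shown in the next comment): when there is no work stealing and no termination protocol, i.e. the serial case, the sync barriers are skipped entirely, so the single thread never waits for workers that will not arrive.

    #include <condition_variable>
    #include <iostream>
    #include <mutex>

    // Simplified barrier whose expected count is stale (left at the parallel
    // remark's worker count); entering it from a single thread would hang.
    struct BarrierSync {
      std::mutex m;
      std::condition_variable cv;
      unsigned expected = 4;
      unsigned arrived  = 0;
      void enter() {
        std::unique_lock<std::mutex> lock(m);
        ++arrived;
        cv.wait(lock, [this] { return arrived == expected; });  // blocks forever if entered alone
      }
    };

    // Sketch of the tail of the overflow handling in CMTask::do_marking_step():
    // only enter the sync barriers when stealing/termination are actually in use.
    void handle_overflow(BarrierSync& first, BarrierSync& second,
                         bool do_stealing, bool do_termination) {
      if (do_stealing || do_termination) {
        first.enter();
      }
      // ... re-initialise the task's local data structures here ...
      if (do_stealing || do_termination) {
        second.enter();
      }
    }

    int main() {
      BarrierSync first, second;
      // Serial reference processing: no stealing and no termination protocol,
      // so the guarded barriers are skipped and the call returns immediately.
      handle_overflow(first, second, /*do_stealing=*/false, /*do_termination=*/false);
      std::cout << "serial overflow handling completed without blocking" << std::endl;
      return 0;
    }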

I can reproduce the hang on my Windows x64 workstations using a 32 bit binary. It hangs every run. With this setup I could confirm that the issue is due to the change for 8005032 as mentioned above. The test passes before the change, but hangs with the change. I can attach with Visual Studio and debug. This confirms that the hang is indeed due to the fact that we enter the termination protocol even though we are single threaded. I tried a simple patch that guards the entry to the termination code. With this patch the test passes again.

diff --git a/src/share/vm/gc_implementation/g1/concurrentMark.cpp b/src/share/vm/gc_implementation/g1/concurrentMark.cpp
--- a/src/share/vm/gc_implementation/g1/concurrentMark.cpp
+++ b/src/share/vm/gc_implementation/g1/concurrentMark.cpp
@@ -4336,7 +4336,9 @@
       gclog_or_tty->print_cr("[%u] detected overflow", _worker_id);
     }
+    if (do_stealing || do_termination) {
     _cm->enter_first_sync_barrier(_worker_id);
+    }
     // When we exit this sync barrier we know that all tasks have
     // stopped doing marking work. So, it's now safe to
     // re-initialise our data structures. At the end of this method,
@@ -4347,8 +4349,10 @@
     // We clear the local state of this task...
     clear_region_fields();
+    if (do_stealing || do_termination) {
     // ...and enter the second barrier.
     _cm->enter_second_sync_barrier(_worker_id);
+    }
     // At this point everything has bee re-initialised and we're
     // ready to restart.
   }
07-03-2013

I think we may be overflowing the marking stack, causing the marking task to enter the syncing protocol that handles the overflow (worker 0 resets the global marking data structures). Possible workarounds: -XX:+ParallelRefProcEnabled, or a larger marking stack, e.g. -XX:MarkStackSize=2m (128 * 16K).
06-03-2013
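
For reference, applying either workaround to the reproduction command from the description would look something like this (assuming the same ant invocation shown above; the 2m value is the 128 * 16K mentioned in the previous comment):

    ant -Dargs="-server -XX:+UseG1GC -XX:+ParallelRefProcEnabled" -Dtests.multiplier=3 -Dtests.jvms=1 test

or

    ant -Dargs="-server -XX:+UseG1GC -XX:MarkStackSize=2m" -Dtests.multiplier=3 -Dtests.jvms=1 test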