Uwe Schindler at Apache reports a hang during Apache Lucene testing. From the stack trace it looks like we are hanging during reference processing.
The stack trace indicates that we are using single-threaded reference processing, and that the build is recent enough to include this changeset:
changeset: 4062:f90b9bceb8e5
parent: 4060:84304a77c4e3
user: johnc
date: Tue Feb 05 09:13:05 2013 -0800
summary: 8005032: G1: Cleanup serial reference processing closures in concurrent marking
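With this change we are using the same closures for single-threaded and multi-threaded reference processing. My guess is that this is what is causing the problem: in the trace below, the VM thread reaches WorkGangBarrierSync::enter() through G1CMDrainMarkingStackClosure::do_void() and CMTask::do_marking_step(), and if that barrier expects more workers than the one thread doing serial reference processing, it would never be released.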
I'm guessing that we are stuck in this loop in WorkGangBarrierSync::enter():
  while (n_completed() != n_workers()) {
    monitor()->wait(/* no_safepoint_check */ true);
  }
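To illustrate the suspected failure mode, here is a minimal standalone sketch. It is not HotSpot code: SketchBarrier, its enter(timeout) method and lone_caller_passes() are hypothetical names made up for this example. It assumes the barrier is sized for several workers while only one thread ever enters it, and it adds a timeout purely so the example terminates (the loop above has none).

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a worker barrier; NOT the HotSpot implementation.
class SketchBarrier {
 public:
  explicit SketchBarrier(unsigned n_workers)
      : _n_workers(n_workers), _n_completed(0) {}

  // Returns true if all expected workers arrived, false if the timeout
  // expired first. The timeout is an assumption added for this sketch only.
  bool enter(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(_mutex);
    ++_n_completed;
    assert(_n_completed <= _n_workers);
    if (_n_completed == _n_workers) {
      // Last expected worker: release everyone waiting on the barrier.
      _cv.notify_all();
      return true;
    }
    // Same condition as the quoted loop, but bounded by the timeout.
    return _cv.wait_for(lock, timeout,
                        [this] { return _n_completed == _n_workers; });
  }

 private:
  std::mutex _mutex;
  std::condition_variable _cv;
  unsigned _n_workers;    // number of threads the barrier expects
  unsigned _n_completed;  // number of threads that have entered so far
};

// A single thread enters a barrier sized for n_workers threads.
inline bool lone_caller_passes(unsigned n_workers) {
  SketchBarrier barrier(n_workers);
  return barrier.enter(std::chrono::milliseconds(50));
}
```

In this sketch a lone caller passes a barrier sized for one worker but times out on a barrier sized for four. Without the timeout the second case would block indefinitely, which would match the hang seen in the VM thread if the real barrier is sized for the parallel case.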
Attaching the full thread dump supplied by Uwe. Here is the stack trace that I think is relevant:
Thread 10 (Thread 0xcf9efb40 (LWP 22951)):
#0 0xf7743430 in __kernel_vsyscall ()
#1 0xf771e96b in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/i386-linux-gnu/libpthread.so.0
#2 0xf6ec849c in os::PlatformEvent::park() ()
from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#3 0xf6e98b82 in Monitor::IWait(Thread*, long long) ()
from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#4 0xf6e99370 in Monitor::wait(bool, long, bool) ()
from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#5 0xf704f094 in WorkGangBarrierSync::enter() ()
from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#6 0xf6b62a09 in ConcurrentMark::enter_first_sync_barrier(unsigned int) ()
from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#7 0xf6b65adc in CMTask::do_marking_step(double, bool, bool) ()
from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#8 0xf6b69dc9 in G1CMDrainMarkingStackClosure::do_void() ()
from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#9 0xf6f2bdff in ReferenceProcessor::process_phase3(DiscoveredList&, bool, BoolObjectClosure*, OopClosure*, VoidClosure*) ()
from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#10 0xf6f2cf6d in ReferenceProcessor::process_discovered_reflist(DiscoveredList*, ReferencePolicy*, bool, BoolObjectClosure*, OopClosure*, VoidClosure*, AbstractRefProcTaskExecutor*) ()
from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#11 0xf6f2d33c in ReferenceProcessor::process_discovered_references(BoolObjectClosure*, OopClosure*, VoidClosure*, AbstractRefProcTaskExecutor*) () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#12 0xf6b61ea5 in ConcurrentMark::weakRefsWork(bool) ()
from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#13 0xf6b66638 in ConcurrentMark::checkpointRootsFinal(bool) ()
from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#14 0xf6b82024 in CMCheckpointRootsFinalClosure::do_void() ()
from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#15 0xf703dc65 in VM_CGC_Operation::doit() () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#16 0xf703d2a7 in VM_Operation::evaluate() () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#17 0xf703b44e in VMThread::evaluate_operation(VM_Operation*) ()
from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#18 0xf703b9e8 in VMThread::loop() () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#19 0xf703c095 in VMThread::run() () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#20 0xf6ecee79 in java_start(Thread*) () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#21 0xf771ad4c in start_thread () from /lib/i386-linux-gnu/libpthread.so.0
#22 0xf763ed3e in clone () from /lib/i386-linux-gnu/libc.so.6
Steps to reproduce, according to Uwe:
- Use a 32-bit JDK 8 b78 or b79 (tested on 64-bit Linux, but this should not matter)
- Download the Lucene source code (e.g. the snapshot version we were testing with: https://builds.apache.org/job/Lucene-Artifacts-trunk/2212/artifact/lucene/dist/)
- Change to the directory lucene/analysis/uima and run:
ant -Dargs="-server -XX:+UseG1GC" -Dtests.multiplier=3 -Dtests.jvms=1 test
After a while the test framework prints "stalled" messages because the child VM running the test no longer responds. The PID is also printed. Attempts to get a stack trace from the process or to kill it normally get no response; only kill -9 works. Choosing another garbage collector in the above command line, e.g. -Dargs="-server -XX:+UseConcMarkSweepGC", makes the test finish after a few seconds.