A DESCRIPTION OF THE REQUEST :
Symptom:
Frequent 60 second pauses of the application that feel like GC pauses but are not reported as a GC pause. Running G1GC on a 200GB JVM heap. The machine is 16 cores x2 threads Intel Xeon CPU E5-2670 v2 @ 2.50GHz.
Combining information from pstack, jstack, and the openjdk code base, the following appears to explain the problem:
1) We have an array of 500M strings in memory.
2) There is a gang worker that is probably scanning this array as part of a concurrent mark task. The worker processes the entire array without yielding for a safepoint. The scan appears to take roughly 60 seconds, 12 microseconds per entry, or 2.5e9*60/500e6 = 300 cycles per entry. Given the depth of the stack over the array scan and the random access to string objects, this seems believable. In isolation, this is not a problem, but combined with the following, it is the cause of application pause.
3) The VMThread is blocking waiting to reach a safepoint, so blocked by the above worker.
4) Many application threads are blocked waiting for the VMThread to revoke and rebias locks. In particular it is blocking Jetty and Akka threads from processing in our application.
It is possible that the stalled VMThread is blocking other critical services as well.
This bug relates to https://bugs.openjdk.java.net/browse/JDK-8152199 , which tries to run more VM_Operations while waiting for a safepoint to be reached.
Gang worker stack:
Thread 440 (Thread 0x7f91219cf700 (LWP 51452)):
#0 klass (this=0x7f9f81ff0fb8) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/oops/oop.inline.hpp:80
#1 blueprint (this=0x7f9f81ff0fb8) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/oops/oop.inline.hpp:150
#2 oopDesc::size (this=0x7f9f81ff0fb8) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/oops/oop.inline.hpp:521
#3 0x00007fc5e61d9fa4 in count_object (task_card_bm=0x7fc5e009ea10, marked_bytes_array=0x7fc5e00d1730, hr=0x7fc5e152afd0, obj=0x7f9f81ff0fb8, this=0x7fc5e009bfc0) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/gc_implementation/g1/concurrentMark.inline.hpp:141
#4 par_mark_and_count (task_card_bm=0x7fc5e009ea10, marked_bytes_array=0x7fc5e00d1730, hr=0x7fc5e152afd0, obj=0x7f9f81ff0fb8, this=0x7fc5e009bfc0) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/gc_implementation/g1/concurrentMark.inline.hpp:165
#5 CMTask::deal_with_reference (this=0x7fc5e00ddf40, obj=0x7f9f81ff0fb8) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/gc_implementation/g1/concurrentMark.inline.hpp:343
#6 0x00007fc5e63d7531 in do_oop_nv<oopDesc*> (this=0x7f91219ceda0, p=0x7fb8fabf8640) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/gc_implementation/g1/g1OopClosures.inline.hpp:119
#7 objArrayKlass::oop_oop_iterate_nv (this=<optimized out>, obj=<optimized out>, closure=0x7f91219ceda0) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/oops/objArrayKlass.cpp:450
#8 0x00007fc5e6079894 in oop_iterate (blk=<optimized out>, this=<optimized out>) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/oops/oop.inline.hpp:808
#9 scan_object (obj=<optimized out>, this=0x7fc5e00ddf40) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/gc_implementation/g1/concurrentMark.cpp:3232
#10 CMTask::drain_local_queue (this=this@entry=0x7fc5e00ddf40, partially=partially@entry=true) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/gc_implementation/g1/concurrentMark.cpp:3653
#11 0x00007fc5e607f6f3 in drain_local_queue (partially=true, this=0x7fc5e00ddf40) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/gc_implementation/g1/concurrentMark.hpp:1187
#12 do_bit (offset=<optimized out>, this=<synthetic pointer>) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/gc_implementation/g1/concurrentMark.cpp:3262
#13 iterate (mr=..., cl=<synthetic pointer>, this=0x7fc5e009c100) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/gc_implementation/g1/concurrentMark.inline.hpp:257
#14 CMTask::do_marking_step (this=this@entry=0x7fc5e00ddf40, time_target_ms=10, do_termination=do_termination@entry=true, is_serial=is_serial@entry=false) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/gc_implementation/g1/concurrentMark.cpp:4036
#15 0x00007fc5e6083856 in CMConcurrentMarkingTask::work (this=0x7fc5851d0d30, worker_id=4) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/gc_implementation/g1/concurrentMark.cpp:968
#16 0x00007fc5e6585b6a in GangWorker::loop (this=0x7fc5e01c5800) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/utilities/workgroup.cpp:326
#17 0x00007fc5e63f0552 in java_start (thread=0x7fc5e01c5800) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/os/linux/vm/os_linux.cpp:916
#18 0x00007fc5e7420dc5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007fc5e6d3cbdd in clone () from /lib64/libc.so.6
VMThread stack:
Thread 436 (Thread 0x7f9121419700 (LWP 51456)):
#0 0x00007fc5e74246d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fc5e63f759b in os::PlatformEvent::park (this=this@entry=0x7f90bcbd6b00) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/os/linux/vm/os_linux.cpp:5820
#2 0x00007fc5e63ba337 in ParkCommon (timo=0, ev=<optimized out>) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/runtime/mutex.cpp:428
#3 Monitor::IWait (this=this@entry=0x7fc5e005db30, Self=Self@entry=0x7f90bcbd5800, timo=timo@entry=0) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/runtime/mutex.cpp:806
#4 0x00007fc5e63baf4f in Monitor::wait (this=0x7fc5e005db30, no_safepoint_check=no_safepoint_check@entry=true, timeout=timeout@entry=0, as_suspend_equivalent=as_suspend_equivalent@entry=false) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/runtime/mutex.cpp:1115
#5 0x00007fc5e60789d5 in suspend_all (this=0x7fc5e6a0a8e0 <ConcurrentGCThread::_sts>) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/gc_implementation/shared/concurrentGCThread.cpp:155
#6 ConcurrentGCThread::safepoint_synchronize () at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/gc_implementation/shared/concurrentGCThread.cpp:47
#7 0x00007fc5e6483fc5 in SafepointSynchronize::begin () at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/runtime/safepoint.cpp:130
#8 0x00007fc5e657b062 in VMThread::loop (this=this@entry=0x7f90bcbd5800) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/runtime/vmThread.cpp:513
#9 0x00007fc5e657b4a2 in VMThread::run (this=0x7f90bcbd5800) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/runtime/vmThread.cpp:288
#10 0x00007fc5e63f0552 in java_start (thread=0x7f90bcbd5800) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/os/linux/vm/os_linux.cpp:916
#11 0x00007fc5e7420dc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fc5e6d3cbdd in clone () from /lib64/libc.so.6
One of many application threads:
----------------- 119109 -----------------
0x00007fc5e74246d5 __pthread_cond_wait + 0xc5
0x00007fc5e63ba337 _ZN7Monitor5IWaitEP6Threadl + 0xc7
0x00007fc5e63bade6 _ZN7Monitor4waitEblb + 0xc6
0x00007fc5e657bb7c _ZN8VMThread7executeEP12VM_Operation + 0x34c
0x00007fc5e5f2afe1 _ZN13BiasedLocking17revoke_and_rebiasE6HandlebP6Thread + 0x1e1
0x00007fc5e64f193f _ZN18ObjectSynchronizer10fast_enterE6HandleP9BasicLockbP6Thread + 0x3f
0x00007fc5e6208496 _ZN18InterpreterRuntime12monitorenterEP10JavaThreadP15BasicObjectLock + 0x96
0x00007fc5dd01f49d * org.apache.hadoop.ipc.Client$Connection$3.run() bci:10 line:1002 (Interpreted frame)
0x00007fc5ddada494 * java.util.concurrent.Executors$RunnableAdapter.call() bci:4 line:471 (Compiled frame)
JUSTIFICATION :
Long pauses of application threads is the bane of the JVM...
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The concurrent mark scan of a large array should yield regularly (say every 10K entries). Ideally it would also be parallelized.
CUSTOMER SUBMITTED WORKAROUND :
Using -XX:-UseBiasedLocking to avoid the biased locks use of the VMThread.