JDK-8239024 : Kitchensink24HStress.java failed due to timeout
  • Type: Bug
  • Component: hotspot
  • Sub-Component: jfr
  • Affected Version: 15
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: linux
  • CPU: x86_64
  • Submitted: 2020-02-13
  • Updated: 2024-10-17
  • Resolved: 2020-06-11
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 15 JDK 16
15 b27Fixed 16Fixed
Related Reports
Relates :  
Description
The following test timed out in the JDK15 CI:

applications/kitchensink/Kitchensink24HStress.java

The log file is not helpful for this one:

result: Error. Program `/opt/mach5/mesos/work_dir/jib-master/install/jdk-15+10-287/linux-x64-debug.jdk/jdk-15/fastdebug/bin/java' timed out (timeout set to 90000000ms, elapsed time including timeout handling was 90484392ms).

90484392ms == 1508 minutes == 25.1 hours
Comments
Changeset: 83d0318e Author: Markus Grönlund <mgronlun@openjdk.org> Date: 2020-06-11 10:56:59 +0000 URL: https://git.openjdk.java.net/lanai/commit/83d0318e
02-07-2020

URL: https://hg.openjdk.java.net/jdk/jdk/rev/f88f43024fe0 User: mgronlun Date: 2020-06-11 08:58:22 +0000
11-06-2020

Review mail sent: https://mail.openjdk.java.net/pipermail/hotspot-jfr-dev/2020-June/001527.html
10-06-2020

The JFRStream_lock was repurposed in connection with Event Streaming: With Event Streaming, the relation, and relative order, between events and their checkpoint data needs to be carefully coordinated on per-flushpoint basis. This was previously done on a per-chunk basis where no order requirements existed. For the Old Object Samples, this situation caused a problem: Sample events are written to the stream, just as before. In parallel, there is now a continuous, quite frequent flush operation that writes events out to disk. As a consequence, the events could be written to the stream before its associated checkpoint data, violating the relative ordering. Streaming would find an event but did not find its associated checkpoint event. In an attempt to solve this, instead of using the non-global RotationLock, the JFRStream_lock was introduced to coordinate this activity between the old object sample event emitter thread and the flushpoint thread. This was a bad decision leading to this problem. We need to remove the JFRStream_lock as the coordinator, and reinstate the customized non-global RotationLock. If we invert the serialization order of old object event samples so that checkpoint information is written before the events, we will satisfy the order condition and no cross-thread coordination will be necessary.
10-06-2020

[~rehn] thanks Robbin for the analysis!
09-06-2020

[~mgronlun] can you have a look please.
09-06-2020

Thread 22 (Thread 0x7efbf24b1700 (LWP 10957)): #0 0x00007efcca05fbf9 in syscall () from /lib64/libc.so.6 #1 0x00007efcc9b4f682 in LinuxWaitBarrier::wait(int) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1006/linux-x64.jdk/jdk-15/lib/server/libjvm.so #2 0x00007efcc99cc8ba in SafepointSynchronize::block(JavaThread*) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1006/linux-x64.jdk/jdk-15/lib/server/libjvm.so #3 0x00007efcc99d2d15 in SafepointMechanism::block_if_requested_slow(JavaThread*) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1006/linux-x64.jdk/jdk-15/lib/server/libjvm.so #4 0x00007efcc98e3e24 in Mutex::lock_contended(Thread*) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1006/linux-x64.jdk/jdk-15/lib/server/libjvm.so #5 0x00007efcc98e3fa3 in Mutex::lock() () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1006/linux-x64.jdk/jdk-15/lib/server/libjvm.so #6 0x00007efcc957b560 in JfrCheckpointManager::flush_type_set() () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1006/linux-x64.jdk/jdk-15/lib/server/libjvm.so #7 0x00007efcc95b0f2e in JfrRecorderService::flush() () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1006/linux-x64.jdk/jdk-15/lib/server/libjvm.so #8 0x00007efcc95b1757 in JfrRecorderService::invoke_flush() () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1006/linux-x64.jdk/jdk-15/lib/server/libjvm.so #9 0x00007efcc95b1dc8 in JfrRecorderService::flushpoint() () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1006/linux-x64.jdk/jdk-15/lib/server/libjvm.so #10 0x00007efcc95b3180 in recorderthread_entry(JavaThread*, Thread*) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1006/linux-x64.jdk/jdk-15/lib/server/libjvm.so #11 0x00007efcc9ac90ce in JavaThread::thread_main_inner() () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1006/linux-x64.jdk/jdk-15/lib/server/libjvm.so #12 0x00007efcc9ace02b in Thread::call_run() () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1006/linux-x64.jdk/jdk-15/lib/server/libjvm.so #13 0x00007efcc9927927 in thread_native_entry(Thread*) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1006/linux-x64.jdk/jdk-15/lib/server/libjvm.so #14 0x00007efcca540ea5 in start_thread () from /lib64/libpthread.so.0 #15 0x00007efcca0658dd in clone () from /lib64/libc.so.6 JfrRecorderService::flushpoint() takes JfrStream_lock no safepoint check. JfrCheckpointManager::flush_type_set() takes: MutexLocker cld_lock(ClassLoaderDataGraph_lock); MutexLocker module_lock(Module_lock); With safepoint check, and stops for safepoint. Thread 54 (Thread 0x7efc6189d700 (LWP 10899)): #0 0x00007efcca54754d in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007efcca542e9b in _L_lock_883 () from /lib64/libpthread.so.0 #2 0x00007efcca542d68 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007efcc98e3ff6 in Mutex::lock_without_safepoint_check() () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1006/linux-x64.jdk/jdk-15/lib/server/libjvm.so #4 0x00007efcc93fb73e in EventEmitter::emit(ObjectSampler*, long, bool) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1006/linux-x64.jdk/jdk-15/lib/server/libjvm.so #5 0x00007efcc9784b13 in LeakProfiler::emit_events(long, bool) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1006/linux-x64.jdk/jdk-15/lib/server/libjvm.so #6 0x00007efcc9595d0f in jfr_emit_old_object_samples () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1006/linux-x64.jdk/jdk-15/lib/server/libjvm.so #7 0x00007efcacd43143 in ?? () #8 0x0000000000000000 in ?? () jfr_emit_old_object_samples (JNI method) waits on JfrStream_lock, so we can't safepoint and JfrStream_lock will not be released until after the safepoint. The general rule is that you may never take safepoint checked lock after you taken a non-safepoint checked lock.
09-06-2020

The message about emtpy jfr file means only that jcmd was not able to get JFR info. It is very common when JVM doesn't respond.
11-05-2020

I also looked, it's not clear why everything seems blocked and no progress on the safepoint. I think the timeout handler should core-dump the VM because this don't look debuggable from only logs. This happens often that the issue would have been debuggable if there was a core, someone should fix the timeout handler. I'll start working on: "8198730: Improve detection of failure to reach a safepoint" to make sure at least this 'deadlock' will be easier to debug.
18-03-2020

ILW = HLM = P3
18-02-2020

Attached the stack dump as stack-iteration-0. The timeout handler actually takes 6 stack samples, but they are identical except for time stamps: $ diff stack-iteration-[01] 2c2 < [2020-02-13 01:05:06] [/bin/gdb, --pid=20395, -batch, -ex, thread apply all backtrace] timeout=20000 --- > [2020-02-13 01:05:07] [/bin/gdb, --pid=20395, -batch, -ex, thread apply all backtrace] timeout=20000 1051c1051 < [2020-02-13 01:05:07] exit code: 0 time: 805 ms --- > [2020-02-13 01:05:08] exit code: 0 time: 438 ms $ diff stack-iteration-[02] 2c2 < [2020-02-13 01:05:06] [/bin/gdb, --pid=20395, -batch, -ex, thread apply all backtrace] timeout=20000 --- > [2020-02-13 01:05:08] [/bin/gdb, --pid=20395, -batch, -ex, thread apply all backtrace] timeout=20000 1051c1051 < [2020-02-13 01:05:07] exit code: 0 time: 805 ms --- > [2020-02-13 01:05:09] exit code: 0 time: 440 ms $ diff stack-iteration-[03] 2c2 < [2020-02-13 01:05:06] [/bin/gdb, --pid=20395, -batch, -ex, thread apply all backtrace] timeout=20000 --- > [2020-02-13 01:05:09] [/bin/gdb, --pid=20395, -batch, -ex, thread apply all backtrace] timeout=20000 1051c1051 < [2020-02-13 01:05:07] exit code: 0 time: 805 ms --- > [2020-02-13 01:05:10] exit code: 0 time: 434 ms $ diff stack-iteration-[04] 2c2 < [2020-02-13 01:05:06] [/bin/gdb, --pid=20395, -batch, -ex, thread apply all backtrace] timeout=20000 --- > [2020-02-13 01:05:10] [/bin/gdb, --pid=20395, -batch, -ex, thread apply all backtrace] timeout=20000 1051c1051 < [2020-02-13 01:05:07] exit code: 0 time: 805 ms --- > [2020-02-13 01:05:11] exit code: 0 time: 484 ms $ diff stack-iteration-[05] 2c2 < [2020-02-13 01:05:06] [/bin/gdb, --pid=20395, -batch, -ex, thread apply all backtrace] timeout=20000 --- > [2020-02-13 01:05:11] [/bin/gdb, --pid=20395, -batch, -ex, thread apply all backtrace] timeout=20000 1051c1051 < [2020-02-13 01:05:07] exit code: 0 time: 805 ms --- > [2020-02-13 01:05:12] exit code: 0 time: 487 ms The VMThread is trying to bring the system to a safepoint: Thread 55 (Thread 0x7f96ece41700 (LWP 20421)): #0 0x00007f97529e5e9d in nanosleep () from /lib64/libpthread.so.0 #1 0x00007f97519d1f8f in os::naked_short_sleep(long) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+10-287/linux-x64-debug.jdk/jdk-15/fastdebug/lib/server/libjvm.so #2 0x00007f9751b0008a in SafepointSynchronize::synchronize_threads(long, int, int*) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+10-287/linux-x64-debug.jdk/jdk-15/fastdebug/lib/server/libjvm.so #3 0x00007f9751b052c8 in SafepointSynchronize::begin() () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+10-287/linux-x64-debug.jdk/jdk-15/fastdebug/lib/server/libjvm.so #4 0x00007f9751d74641 in VMThread::loop() () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+10-287/linux-x64-debug.jdk/jdk-15/fastdebug/lib/server/libjvm.so #5 0x00007f9751d74aea in VMThread::run() () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+10-287/linux-x64-debug.jdk/jdk-15/fastdebug/lib/server/libjvm.so #6 0x00007f9751c8fa86 in Thread::call_run() () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+10-287/linux-x64-debug.jdk/jdk-15/fastdebug/lib/server/libjvm.so #7 0x00007f97519c4a0e in thread_native_entry(Thread*) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-15+10-287/linux-x64-debug.jdk/jdk-15/fastdebug/lib/server/libjvm.so #8 0x00007f97529deea5 in start_thread () from /lib64/libpthread.so.0 #9 0x00007f97525038cd in clone () from /lib64/libc.so.6 I took a quick look at the other threads, but it's not clear which thread is keeping us in SafepointSynchronize::synchronize_threads().
13-02-2020