JDK-8318109 : Writing JFR records while a CHT has taken its lock asserts in rank checking
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 21,22
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2023-10-13
  • Updated: 2024-04-19
  • Resolved: 2023-10-23
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 21 JDK 22
21.0.4-oracleFixed 22 b21Fixed
Related Reports
Relates :  
Relates :  
Description
The following test failed in the JDK22 CI:

applications/kitchensink/Kitchensink.java

Here's a snippet from the log file:

[2023-10-13T18:30:10.562876800Z] Gathering output for process 26612
[2023-10-13T18:30:10.917412600Z] Waiting for completion for process 26612
[2023-10-13T18:30:10.934455800Z] Waiting for completion finished for process 26612
[stress.process.out] #
[stress.process.out] # A fatal error has been detected by the Java Runtime Environment:
[stress.process.out] #
[stress.process.out] #  Internal Error (c:\\sb\\prod\\1697208132\\workspace\\open\\src\\hotspot\\share\\runtime\\mutex.cpp:448), pid=39360, tid=41536
[stress.process.out] #  assert(false) failed: Attempting to acquire lock JfrStacktrace_lock/stackwatermark-1 out of order with lock ConcurrentHashTableResize_lock/service-1 -- possible deadlock
[stress.process.out] #
[stress.process.out] # JRE version: Java(TM) SE Runtime Environment (22.0+20) (fastdebug build 22-ea+20-1513)
[stress.process.out] # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 22-ea+20-1513, mixed mode, sharing, tiered, compressed class ptrs, z gc, windows-amd64)
[stress.process.out] # Core dump will be written. Default location: C:\\sb\\prod\\1697220906\\testoutput\\test-support\\jtreg_closed_test_hotspot_jtreg_applications_kitchensink_Kitchensink_java\\scratch\\0\\hs_err_pid39360.mdmp
[stress.process.out] #
[stress.process.out] # JFR recording file will be written. Location: C:\\sb\\prod\\1697220906\\testoutput\\test-support\\jtreg_closed_test_hotspot_jtreg_applications_kitchensink_Kitchensink_java\\scratch\\0\\hs_err_pid39360.jfr
[stress.process.out] #
[stress.process.out] Unsupported internal testing APIs have been used.
[stress.process.out] 
[stress.process.out] # An error report file with more information is saved as:
[stress.process.out] # C:\\sb\\prod\\1697220906\\testoutput\\test-support\\jtreg_closed_test_hotspot_jtreg_applications_kitchensink_Kitchensink_java\\scratch\\0\\hs_err_pid39360.log
[stress.process.out] #
[stress.process.out] # If you would like to submit a bug report, please visit:
[stress.process.out] #   https://bugreport.java.com/bugreport/crash.jsp
[stress.process.out] #
[2023-10-13T18:31:10.986027400Z] Gathering output for process 41560
[2023-10-13T18:31:12.106450700Z] Waiting for completion for process 41560
[2023-10-13T18:31:12.106450700Z] Waiting for completion finished for process 41560


Here's the crashing thread's stack:

---------------  T H R E A D  ---------------

Current thread (0x000001bd05db61a0):  JavaThread "JFR Periodic Tasks" daemon [_thread_in_vm, id=41536, stack(0x0000008f77900000,0x0000008f77a00000) (1024K)]

Stack: [0x0000008f77900000,0x0000008f77a00000]
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [jvm.dll+0xc88c31]  os::win32::platform_print_native_stack+0x101  (os_windows_x86.cpp:236)
V  [jvm.dll+0xf25a0f]  VMError::report+0x147f  (vmError.cpp:1000)
V  [jvm.dll+0xf27f05]  VMError::report_and_die+0x645  (vmError.cpp:1819)
V  [jvm.dll+0xf28624]  VMError::report_and_die+0x64  (vmError.cpp:1584)
V  [jvm.dll+0x559ecb]  report_vm_error+0x5b  (debug.cpp:191)
V  [jvm.dll+0xc225ed]  Mutex::check_rank+0x22d  (mutex.cpp:447)
V  [jvm.dll+0xc22c93]  Mutex::lock_without_safepoint_check+0x43  (mutex.cpp:153)
V  [jvm.dll+0x858604]  JfrStackTraceRepository::add_trace+0x24  (jfrStackTraceRepository.cpp:202)
V  [jvm.dll+0x858e15]  JfrStackTraceRepository::record+0x155  (jfrStackTraceRepository.cpp:166)
V  [jvm.dll+0xfa7e09]  JfrEvent<EventZPageAllocation>::write_event+0x119  (jfrEvent.hpp:209)
V  [jvm.dll+0x10070b3]  ZPageAllocator::alloc_page+0x403  (zPageAllocator.cpp:744)
V  [jvm.dll+0xfe1ccf]  ZHeap::alloc_page+0x3f  (zHeap.cpp:227)
V  [jvm.dll+0x100467f]  ZObjectAllocator::alloc_page+0x5f  (zObjectAllocator.cpp:60)
V  [jvm.dll+0x10044cc]  ZObjectAllocator::alloc_object_in_shared_page+0x6c  (zObjectAllocator.cpp:94)
V  [jvm.dll+0x10041c6]  ZObjectAllocator::alloc_object+0x66  (zObjectAllocator.cpp:162)
V  [jvm.dll+0x1017ece]  ZRelocate::relocate_object+0xfe  (zRelocate.cpp:379)
V  [jvm.dll+0x181dc]  ZBarrier::load_barrier_on_oop_field_preloaded+0x3c  (zBarrier.inline.hpp:479)
V  [jvm.dll+0x2d2931]  AccessInternal::PostRuntimeDispatch<ZBarrierSet::AccessBarrier<593988,ZBarrierSet>,2,593988>::oop_access_barrier+0x31  (access.inline.hpp:86)
V  [jvm.dll+0x4abc3d]  WeakHandle::peek+0x3d  (weakHandle.inline.hpp:39)
V  [jvm.dll+0xda87b1]  ConcurrentHashTable<StringTableConfig,11>::statistics_calculate<SizeFunc>+0xc1  (concurrentHashTable.inline.hpp:1239)
V  [jvm.dll+0xdaaaf3]  StringTable::get_table_statistics+0x113  (stringTable.cpp:628)
V  [jvm.dll+0x8463e5]  JfrPeriodicEventSet::requestStringTableStatistics+0x25  (jfrPeriodic.cpp:642)
V  [jvm.dll+0x8321a5]  jfr_emit_event+0x85  (jfrJniMethod.cpp:231)
C  0x000001bd58940ce1  (no source info available)

The last pc belongs to native nmethod (printed below).
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 12002  jdk.jfr.internal.JVM.emitEvent(JJJ)Z jdk.jfr@22-ea (0 bytes) @ 0x000001bd58940c46 [0x000001bd58940bc0+0x0000000000000086]
J 36336 c1 jdk.jfr.internal.periodic.JVMEventTask.execute(JLjdk/jfr/internal/periodic/PeriodicType;)V jdk.jfr@22-ea (50 bytes) @ 0x000001bd505c5794 [0x000001bd505c53e0+0x00000000000003b4]
J 36645 c2 jdk.jfr.internal.periodic.PeriodicTask.run(JLjdk/jfr/internal/periodic/PeriodicType;)V jdk.jfr@22-ea (55 bytes) @ 0x000001bd5821c8bc [0x000001bd5821c840+0x000000000000007c]
J 36852 c1 jdk.jfr.internal.periodic.PeriodicEvents.doPeriodic()J jdk.jfr@22-ea (349 bytes) @ 0x000001bd51940284 [0x000001bd5193edc0+0x00000000000014c4]
j  jdk.jfr.internal.PlatformRecorder.periodicTask()V+47 jdk.jfr@22-ea
j  jdk.jfr.internal.PlatformRecorder.lambda$startDiskMonitor$1()V+1 jdk.jfr@22-ea
j  jdk.jfr.internal.PlatformRecorder$$Lambda+0x000001c1640e0668.run()V+4 jdk.jfr@22-ea
j  java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@22-ea
j  java.lang.Thread.run()V+19 java.base@22-ea
v  ~StubRoutines::call_stub 0x000001bd578a106f
Lock stack of current Java thread (top to bottom):


[error occurred during error reporting (printing code blobs if possible), id 0xe0000000, Internal Error (c:\sb\prod\1697208132\workspace\open\src\hotspot\share\runtime\mutex.cpp:448)]


This looks like the same failure mode that was addressed via:

JDK-8317440 Lock rank checking fails when code root set is modified with the Servicelock held after JDK-8315503

The fix for JDK-8317440 is integrated in jdk-22+19-1444 and
this failure sighting occurred in jdk-22+20-1513-tier3.

I'm starting this bug off in hotspot/gc and at P2 just like JDK-8317440.
Comments
Fix request [21u] I backport this for parity with 21.0.4-oracle. Medium to low risk. Lock ordering is sensible, but it should basically affect jfr which is not that central to the JVM. Clean backport. SAP nightly testing passed.
06-04-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk21u-dev/pull/454 Date: 2024-04-05 07:51:48 +0000
05-04-2024

The fix for this bug is integrated in jdk-22+21-1629.
23-10-2023

Changeset: 9f767aa4 Author: Thomas Schatzl <tschatzl@openjdk.org> Date: 2023-10-23 14:00:07 +0000 URL: https://git.openjdk.org/jdk/commit/9f767aa44b4699ed5404b934ac751f2cdd0ba824
23-10-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/16242 Date: 2023-10-18 10:28:46 +0000
19-10-2023

The problem is that current lock ranks are not set up to write out JFR stack traces while any ConcurrentHashMap is locking the resize map. In this case, while gathering CHT statistics, there is a relocation that triggers ZGC page allocation which causes JFR to write out data to its recording.
16-10-2023