JDK-8293864 : Kitchensink24HStress.java fails with SIGSEGV in JfrCheckpointManager::lease
  • Type: Bug
  • Component: hotspot
  • Sub-Component: jfr
  • Affected Version: 19,20
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: linux
  • CPU: x86_64
  • Submitted: 2022-09-15
  • Updated: 2022-12-05
  • Resolved: 2022-10-10
Fix Version: JDK 20, build 20 b19 (Fixed)
Description
The following test failed in the JDK20 CI:

applications/kitchensink/Kitchensink24HStress.java

Here's a snippet from the log file:

The tail of stress stdout is:
For random generator using seed: -4554880919762113761
To re-run test with same seed value please add "-Djdk.test.lib.random.seed=-4554880919762113761" to command line.
Stress process main method is started.
[9750.598s][warning][gc] GC locker is held; pre-dump GC was skipped
[13671.035s][warning][gc] GC locker is held; pre-dump GC was skipped
[14124.153s][warning][gc] GC locker is held; pre-dump GC was skipped
[24159.340s][warning][gc] GC locker is held; pre-dump GC was skipped
[24591.979s][warning][gc] GC locker is held; pre-dump GC was skipped
[30703.766s][warning][gc] GC locker is held; pre-dump GC was skipped
[33326.421s][warning][gc] GC locker is held; pre-dump GC was skipped
[35903.100s][warning][gc] GC locker is held; pre-dump GC was skipped
[40289.618s][warning][gc] GC locker is held; pre-dump GC was skipped
[54716.844s][warning][gc] GC locker is held; pre-dump GC was skipped
[58636.492s][warning][gc] GC locker is held; pre-dump GC was skipped
[71193.498s][warning][gc] GC locker is held; pre-dump GC was skipped
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f2d934064e0, pid=1852074, tid=2755448
#
# JRE version: Java(TM) SE Runtime Environment (20.0+15) (build 20-ea+15-995)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20-ea+15-995, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x8c64e0]  JfrCheckpointManager::lease(Thread*, bool, unsigned long)+0x70
#
# Core dump will be written. Default location: Core dumps may be processed with "/opt/core.sh %p" (or dumping to /opt/mach5/mesos/work_dir/slaves/0c72054a-24ab-4dbb-944f-97f9341a1b96-S61378/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/5a65d1a0-d544-4113-9107-e10dbe41799f/runs/16080598-87bb-4f8d-831b-bf4ed05a21fe/testoutput/test-support/jtreg_closed_test_hotspot_jtreg_applications_kitchensink_Kitchensink24HStress_java/scratch/0/core.1852074)
#
# JFR recording file will be written. Location: /opt/mach5/mesos/work_dir/slaves/0c72054a-24ab-4dbb-944f-97f9341a1b96-S61378/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/5a65d1a0-d544-4113-9107-e10dbe41799f/runs/16080598-87bb-4f8d-831b-bf4ed05a21fe/testoutput/test-support/jtreg_closed_test_hotspot_jtreg_applications_kitchensink_Kitchensink24HStress_java/scratch/0/hs_err_pid1852074.jfr
#
Unsupported internal testing APIs have been used.

# An error report file with more information is saved as:
# /opt/mach5/mesos/work_dir/slaves/0c72054a-24ab-4dbb-944f-97f9341a1b96-S61378/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/5a65d1a0-d544-4113-9107-e10dbe41799f/runs/16080598-87bb-4f8d-831b-bf4ed05a21fe/testoutput/test-support/jtreg_closed_test_hotspot_jtreg_applications_kitchensink_Kitchensink24HStress_java/scratch/0/hs_err_pid1852074.log
[thread 2685319 also had an error][thread 2685318 also had an error]


------ Timeout during error reporting after 120 s. ------
----------System.err:(777/68210)----------

Here's the crashing thread's stack:

---------------  T H R E A D  ---------------

Current thread (0x00007f2b74a91270):  JavaThread "MemAccessWorkerThread" [_thread_new, id=2755448, stack(0x00007f2cb0f82000,0x00007f2cb1083000)]

Stack: [0x00007f2cb0f82000,0x00007f2cb1083000],  sp=0x00007f2cb1081c90,  free space=1023k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x8c64e0]  JfrCheckpointManager::lease(Thread*, bool, unsigned long)+0x70  (jfrIterator.hpp:44)
V  [libjvm.so+0x8cdb2c]  JfrCheckpointWriter::JfrCheckpointWriter(Thread*, bool, JfrCheckpointType, bool)+0x2c  (jfrCheckpointWriter.cpp:49)
V  [libjvm.so+0x92f6ea]  JfrTypeManager::write_checkpoint(Thread*, unsigned long, oopDesc*)+0x8a  (jfrTypeManager.cpp:122)
V  [libjvm.so+0x92342c]  JfrThreadLocal::on_start(Thread*)+0x9c  (jfrThreadLocal.cpp:119)
V  [libjvm.so+0xe4756c]  Thread::call_run()+0x6c  (thread.cpp:211)
V  [libjvm.so+0xc7fe18]  thread_native_entry(Thread*)+0xd8  (os_linux.cpp:710)


siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000008dd8
Comments
Changeset: 35d17a00
Author: Markus Grönlund <mgronlun@openjdk.org>
Date: 2022-10-10 12:39:10 +0000
URL: https://git.openjdk.org/jdk/commit/35d17a00ab4028071a8fc7cd781b3306e6811970
10-10-2022

A pull request was submitted for review.
URL: https://git.openjdk.org/jdk/pull/10467
Date: 2022-09-28 12:33:20 +0000
28-09-2022

With JDK-8289692, a need arose to introduce a mutex mechanism for the buffers on the global list, to prevent the JFR Recorder Thread from resetting buffers that are currently in use. The mutex mechanism was adequate for preventing that race condition. However, its protection was overestimated, which led to the mistake of introducing a release (removal) operation on a current epoch list as part of periodic flushing.

The problem is that although the callback operations are safe, the underlying list type used with the global mspace does not allow concurrent excision of nodes. The crash occurs when a thread dereferences the next pointer of a node that was excised and deleted after the thread loaded it.

I have reviewed how better to handle the introduction of virtual threads and their checkpoint data. I have concluded that the best way is to preserve the old (pre-Loom) system as much as possible, keeping the global and thread-local mspaces as before. Instead, an additional mspace dedicated solely to virtual threads is introduced. This categorization keeps each operation matched to a list infrastructure that supports it. It also provides flexibility in handling virtual threads separately from regular, or "carrier", threads. The separate mspace for virtual threads is more dynamic because it does not need to preallocate buffers, which saves memory on systems that do not use virtual threads. The sizes of the different buffer types can also be controlled better.

With this restructuring, there are no longer any list concurrency issues; hence the mutex mechanism introduced with JDK-8289692 becomes obsolete. Applying a release operation to the current epoch list of the global mspace as part of flushing breaks an invariant and is therefore removed.
28-09-2022
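
For readers unfamiliar with the invariant mentioned in the comment above, here is a minimal, single-threaded sketch of the idea that nodes may be added only to the current epoch list and released only from the previous epoch list. It is not HotSpot code: `Node`, `EpochLists`, and every member name are hypothetical and chosen for illustration only. A real implementation would also have to ensure that no thread is still traversing the old list when the epoch shifts; that synchronization is outside the scope of this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified model. None of these names are HotSpot's.
struct Node {
  int payload;
  Node* next;
};

class EpochLists {
 public:
  EpochLists() : heads_{nullptr, nullptr}, current_(0) {}
  ~EpochLists() {
    release_all(0);
    release_all(1);
  }
  EpochLists(const EpochLists&) = delete;
  EpochLists& operator=(const EpochLists&) = delete;

  // Writers only ever add to the current epoch list.
  void add(int payload) {
    heads_[current_] = new Node{payload, heads_[current_]};
  }

  // Flip the epoch: the old current list becomes the previous epoch list.
  void shift_epoch() { current_ ^= 1; }

  // Excision and deletion are permitted only on the previous epoch list,
  // which writers no longer add to. Returns the number of nodes released.
  std::size_t release_previous_epoch() { return release_all(current_ ^ 1); }

  std::size_t current_size() const { return count(heads_[current_]); }
  std::size_t previous_size() const { return count(heads_[current_ ^ 1]); }

 private:
  // Deletes every node on the list with the given index.
  std::size_t release_all(int idx) {
    std::size_t released = 0;
    Node* n = heads_[idx];
    while (n != nullptr) {
      Node* next = n->next;  // load next before deleting the node
      delete n;
      n = next;
      ++released;
    }
    heads_[idx] = nullptr;
    return released;
  }

  static std::size_t count(const Node* n) {
    std::size_t c = 0;
    for (; n != nullptr; n = n->next) {
      ++c;
    }
    return c;
  }

  Node* heads_[2];
  int current_;
};
```

In this model there is deliberately no operation that removes nodes from the current epoch list. A caller would add nodes, call `shift_epoch()`, and only then call `release_previous_epoch()` to reclaim the nodes written during the earlier epoch.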