JDK-8274196 : Crashes in VM_HeapDumper::work after JDK-8252842
  • Type: Bug
  • Component: core-svc
  • Sub-Component: tools
  • Affected Version: 18
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2021-09-23
  • Updated: 2021-10-04
  • Resolved: 2021-09-30
Fixed in: JDK 18 (18, master)
Sub Tasks
JDK-8274216
JDK-8274312
Description
serviceability/dcmd/gc/HeapDump*Test.java started failing with:

V  [jvm.dll+0xadea11]  os::platform_print_native_stack+0xf1  (os_windows_x86.cpp:235)
V  [jvm.dll+0xcfd625]  VMError::report+0x1005  (vmError.cpp:742)
V  [jvm.dll+0xcfefce]  VMError::report_and_die+0x7fe  (vmError.cpp:1552)
V  [jvm.dll+0xcff754]  VMError::report_and_die+0x64  (vmError.cpp:1333)
V  [jvm.dll+0x4cdab7]  report_vm_error+0xb7  (debug.cpp:282)
V  [jvm.dll+0x1de1f]  oopDesc::klass+0x9f  (oop.inline.hpp:85)
V  [jvm.dll+0x402374]  oopDesc::is_instance+0x14  (oop.inline.hpp:206)
V  [jvm.dll+0x660229]  JNIGlobalsDumper::do_oop+0x29  (heapDumper.cpp:1634)
V  [jvm.dll+0x77b4ee]  JNIHandles::oops_do+0x1ce  (jniHandles.cpp:168)
V  [jvm.dll+0x665815]  VM_HeapDumper::work+0x545  (heapDumper.cpp:2317)
V  [jvm.dll+0xd4360a]  GangWorker::loop+0x8a  (workgroup.cpp:239)
V  [jvm.dll+0xd436ad]  GangWorker::run+0x1d  (workgroup.cpp:206)

or

Stack: [0x00007f68407db000,0x00007f68408db000],  sp=0x00007f68408d9c20,  free space=1019k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x8ae7df]  CompressedKlassPointers::decode_not_null(unsigned int, unsigned char*)+0x8f
V  [libjvm.so+0xd80152]  JNIGlobalsDumper::do_oop(oop*)+0x242
V  [libjvm.so+0xfe6c26]  JNIHandles::oops_do(OopClosure*)+0xd6
V  [libjvm.so+0xd8a13b]  VM_HeapDumper::work(unsigned int)+0x18b
V  [libjvm.so+0x19f94f5]  GangWorker::run_task(WorkData)+0x85

Probably a regression from the integration of JDK-8252842 (Extend JMap to support parallel heap dump).
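
Both traces crash while a dumper worker thread walks the JNI global handles. Purely for illustration (this is not the actual heapDumper.cpp source), the failing shape looks roughly like the raw-load closure below, which under ZGC can dereference a pointer that no load barrier has healed:

  // Illustrative sketch, not the real JNIGlobalsDumper: an OopClosure that
  // reads each JNI global handle slot with a plain load. Under ZGC the
  // stored pointer may be stale until a load barrier heals it, so the
  // subsequent is_instance()/klass() call can touch an invalid address,
  // matching the crash frames above.
  #include "memory/iterator.hpp"   // OopClosure

  class JNIGlobalsDumperSketch : public OopClosure {
  public:
    virtual void do_oop(oop* obj_p) {
      oop o = *obj_p;              // raw load, no GC load barrier
      if (o == NULL) return;
      if (o->is_instance()) {      // may dereference a stale address
        // ... write a JNI global-ref record for o ...
      }
    }
    virtual void do_oop(narrowOop* obj_p) { /* not used for JNI globals */ }
  };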
Comments
Changeset: bfd61634 Author: Lin Zang <lzang@openjdk.org> Date: 2021-09-30 14:44:59 +0000 URL: https://git.openjdk.java.net/jdk/commit/bfd616347126a802c641326a6be5a14c4cd7af90
30-09-2021

I tried to write down the plan for what I'm doing with the lock rankings in JDK-8176393. Essentially, checking for safepoints and lock ranking are going to be tightly coupled, and you can only take out locks with a rank lower than your own. The heap-dumping code is a bit of a mess with respect to this. My suggestion is to leave them at the nosafepoint and nosafepoint-1 ranks, and I'll get rid of the assert that waiting on the lower-ranked one hits.
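
To make the ranking rule concrete, here is a toy model with entirely hypothetical names (HotSpot's real checks live in the Mutex and Thread code): a thread may only acquire a lock whose rank is strictly lower than the rank of every lock it already holds, which forces a global acquisition order and rules out rank cycles.

  // Toy model of the rank discipline, not HotSpot's Mutex implementation.
  #include <cassert>
  #include <vector>

  struct RankedLock {
    int rank;   // e.g. analogous to nosafepoint, nosafepoint-1, leaf
  };

  struct ThreadLocks {
    std::vector<const RankedLock*> held;   // locks held, in acquisition order

    void lock(const RankedLock* l) {
      // Ranking rule: the new lock must rank below everything already held.
      for (const RankedLock* h : held)
        assert(l->rank < h->rank && "lock rank order violated");
      held.push_back(l);
    }
    void unlock(const RankedLock* l) {
      assert(!held.empty() && held.back() == l);
      held.pop_back();
    }
  };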
24-09-2021

Root cause has been identified; I will submit a fix soon. The root cause of the crash with ZGC is that the JNIHandles are processed before object iteration, while ZGC only updates the JNIHandles during object iteration via its read barrier. The crash is therefore caused by accessing an invalid address, which can contain stale data after a ZGC cycle. The lock rank issue can be fixed because the related mutexes are acquired at a safepoint, so safepoint_check_required can be safepoint_check_always. The Epsilon issue is caused by a wrong _num_dumper_thread value being calculated when gang == NULL. Sorry for introducing these issues. Testing is still WIP; I will update with the test results when it completes. Thanks, Lin
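
The committed fix is the changeset linked at the top of the comments. Purely as an illustration of the barrier point above, one conventional way in HotSpot to make such a root scan safe is to load handle slots through the Access API so the collector's load barrier runs; whether the actual fix takes this exact route is not claimed here:

  // Sketch: barrier-safe variant of the handle-slot load. NativeAccess
  // routes the load through the GC's access barriers; AS_NO_KEEPALIVE
  // avoids artificially keeping the referent alive while peeking.
  #include "oops/access.hpp"

  void do_oop_sketch(oop* obj_p) {
    oop o = NativeAccess<AS_NO_KEEPALIVE>::oop_load(obj_p);  // healed pointer
    if (o != NULL && o->is_instance()) {
      // ... safe to inspect and dump o ...
    }
  }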
24-09-2021

I am starting to investigate this issue. Will update ASAP.
24-09-2021

This change also runs into Mutex rank checking assertions due to colliding changes.
24-09-2021

[~coleenp], no problem, I will do this. BTW, do you happen to have any materials about the design of lock ranking in HotSpot? I want to study it further. Thanks, Lin
24-09-2021

Also, I renamed your locks to have (roughly) the standard-format names that many of the mutexLocker.cpp locks have, which is nice for logging. Can you do this?

- _lock = new (std::nothrow) PaddedMonitor(Mutex::leaf, "Parallel HProf writer lock", Mutex::_safepoint_check_never);
+ _lock = new (std::nothrow) PaddedMonitor(Mutex::nosafepoint, "ParallelHProfWriter_lock", Mutex::_safepoint_check_never);

and

- _lock(new (std::nothrow) PaddedMonitor(Mutex::leaf, "Dumper Controller lock",
+ _lock(new (std::nothrow) PaddedMonitor(Mutex::nosafepoint, "DumperController_lock",
24-09-2021

[~lzang] Ok, I'm going to assign JDK-8274245 to you and close my PR. I'll file another bug for the special mutex wait problem that this ran into, but it won't affect you if you make these locks safepoint_check_always. Pick the 'leaf' mutex rank for now. I'm removing that rank, but I'll fix these locks when I do that afterwards.
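
Mirroring the constructor calls quoted in the comment above, the suggested change would look roughly like this (a sketch; only the rank and the last argument differ from the quoted lines):

  // Sketch: 'leaf' rank with safepoint checking enabled, per the suggestion.
  _lock = new (std::nothrow) PaddedMonitor(Mutex::leaf, "ParallelHProfWriter_lock",
                                           Mutex::_safepoint_check_always);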
24-09-2021

I have been able to reproduce the serviceability/dcmd/gc/HeapDumpAllTest.java crash with ZGC readily on my Ubuntu 20.04 test machine. I'm not able to reproduce the serviceability/dcmd/gc/HeapDumpTest.java crash on the same machine. I have not been able to reproduce the HeapDumpAllTest.java failure in a repo without the fix for JDK-8252842 (Extend JMap to support parallel heap dump), so I believe this bug is a regression caused by JDK-8252842.
23-09-2021

So far, all failure sightings for serviceability/dcmd/gc/HeapDumpAllTest.java and serviceability/dcmd/gc/HeapDumpTest.java are with ZGC on linux-aarch64, linux-x64, and win-x64. I have no sightings on macosx-aarch64 or macosx-x64 (yet?).
23-09-2021

Tests are failing with ZGC and Epsilon GC
23-09-2021