JDK-8334594 : Generational ZGC: Deadlock after OopMap rewrites in 8331572
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 21,23
  • Priority: P2
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2024-06-20
  • Updated: 2024-08-09
  • Resolved: 2024-06-24
Fixed Versions
  • JDK 23: 23 (Fixed)
  • JDK 24: 24 b04 (Fixed)
Related Reports
Relates :  
Description
In one of our internal tests we have found this deadlock:

0) Various threads are waiting for the GC to hand out freed memory.

1) The GC is blocked on the Service_lock:

  Mutex::lock_without_safepoint_check()
  OopMapCache::trigger_cleanup()
  ZGenerationYoung::pause_relocate_start()
  ZGenerationYoung::collect(ZYoungType, ConcurrentGCTimer*)
  ZDriverMinor::gc(ZDriverRequest const&)
  ZThread::run_service()

2) The Service_lock is held by the ServiceThread, which is stalled because it is waiting for the GC to give it memory:

  PlatformMonitor::wait(unsigned long)
  ZRelocateQueue::add_and_wait(ZForwarding*)
  ZRelocate::relocate_object(ZForwarding*, zaddress_unsafe) 
  ZUncoloredRootProcessWeakOopClosure::do_root(zaddress_unsafe*)
  ZNMethod::nmethod_oops_do_inner(nmethod*, OopClosure*)
  ZBarrierSetNMethod::nmethod_entry_barrier(nmethod*)
  JvmtiDeferredEventQueue::oops_do(OopClosure*, NMethodClosure*)
  // vvv Lock taken in this frame vvv  
  ServiceThread::oops_do_no_frames(OopClosure*, NMethodClosure*)
  ZStackWatermark::process_head(void*)
  ZStackWatermark::start_processing_impl(void*)
  StackWatermark::start_processing()
  StackWatermark::on_safepoint()
  SafepointMechanism::process(JavaThread*, bool, bool) 
  StringTable::clean_dead_entries(JavaThread*) 
  StringTable::do_concurrent_work(JavaThread*)
  ServiceThread::service_thread_entry
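
The cycle in the steps above can be made explicit as a wait-for graph. The following is a minimal sketch for illustration only, not HotSpot code: `WaitForGraph`, `has_cycle`, and `reported_deadlock` are hypothetical names, and the node labels are simply taken from the description.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical model, not HotSpot code.
// An edge A -> B means "A cannot make progress until B does".
class WaitForGraph {
 public:
  void add_wait(const std::string& waiter, const std::string& blocker) {
    edges_[waiter].push_back(blocker);
  }

  // Depth-first search: reaching a node that is still on the DFS stack
  // means the wait-for relation contains a cycle, i.e. a deadlock.
  bool has_cycle() const {
    std::set<std::string> done;
    std::set<std::string> on_stack;
    for (const auto& entry : edges_) {
      if (visit(entry.first, done, on_stack)) return true;
    }
    return false;
  }

 private:
  bool visit(const std::string& node,
             std::set<std::string>& done,
             std::set<std::string>& on_stack) const {
    if (on_stack.count(node) != 0) return true;   // back edge closes a cycle
    if (done.count(node) != 0) return false;      // already fully explored
    on_stack.insert(node);
    auto it = edges_.find(node);
    if (it != edges_.end()) {
      for (const auto& next : it->second) {
        if (visit(next, done, on_stack)) return true;
      }
    }
    on_stack.erase(node);
    done.insert(node);
    return false;
  }

  std::map<std::string, std::vector<std::string>> edges_;
};

// Encodes the situation from the description.
inline WaitForGraph reported_deadlock() {
  WaitForGraph g;
  g.add_wait("allocating threads", "GC");   // 0) waiting for freed memory
  g.add_wait("GC", "Service_lock");         // 1) GC blocked in trigger_cleanup()
  g.add_wait("Service_lock", "ServiceThread");  // lock is held by the ServiceThread
  g.add_wait("ServiceThread", "GC");        // 2) stalled in add_and_wait()
  return g;
}
```

In this model, `reported_deadlock().has_cycle()` reports the GC → Service_lock → ServiceThread → GC cycle; removing the GC's dependency on the Service_lock breaks it.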
Comments
[jdk21u-fix-request] Approval Request from Aleksey Shipilëv Follow-up for JDK-8331572. Applies cleanly. All tests pass.
08-08-2024

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk21u-dev/pull/610 Date: 2024-05-28 09:44:28 +0000
09-07-2024

Post-integration triaging: I(mpact)L(ikelihood)W(orkaround) = HLH => P2 (Could maybe also be HMH => P1)
26-06-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/19851 Date: 2024-06-24 09:02:30 +0000
24-06-2024

Changeset: 05ff3185 Author: Aleksey Shipilev <shade@openjdk.org> Date: 2024-06-24 08:46:10 +0000 URL: https://git.openjdk.org/jdk/commit/05ff3185edd25b381a97f6879f496e97b62dddc2
24-06-2024

> An alternative is to figure out if we can stop guarding _jvmti_service_queue with the Service_lock, and instead use another lock. It feels safer to not do any oop processing while holding the Service_lock.

True. I felt queasy about taking the Service_lock when doing the original JDK-8331572, even for a simple notification. I thought the intention was for the Service_lock to be a low-level lock that is almost always safe to take for a short time. This is apparently violated by the _jvmti_service_queue locking. That said, this surprise tells me there might be other, similar landmines lurking. So I have now PRed a fix that side-steps the notification problem, getting us back to the state before JDK-8331572. This would also allow us to cleanly backport JDK-8331572 later.
20-06-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/19800 Date: 2024-06-20 09:11:04 +0000
20-06-2024

An alternative is to figure out if we can stop guarding _jvmti_service_queue with the Service_lock, and instead use another lock. It feels safer to not do any oop processing while holding the Service_lock.

As for whether this only happens with GenZGC: I'm not sure. I think at least single-generation ZGC might be affected by this as well. Say that it runs a relocation start safepoint, and before it runs the doit_epilogue, the ServiceThread runs the code above and runs out of memory. The GC then never gets to the part where it can start relocating and hand back memory to the stalling ServiceThread.
20-06-2024
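
The "use another lock" alternative mentioned above can be sketched as follows. This is only an illustration of the idea under stated assumptions, not the fix that was integrated and not HotSpot code: `Locks`, `deferred_queue_lock`, and `DeferredQueue` are hypothetical names, `std::mutex` stands in for HotSpot's own lock types, and plain integers stand in for deferred events.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical stand-ins; none of these types exist in HotSpot.
struct Locks {
  std::mutex service_lock;         // models Service_lock: held briefly, for notification only
  std::mutex deferred_queue_lock;  // hypothetical dedicated lock guarding the event queue
};

class DeferredQueue {
 public:
  explicit DeferredQueue(Locks& locks) : locks_(locks) {}

  void enqueue(int event) {
    std::lock_guard<std::mutex> guard(locks_.deferred_queue_lock);
    events_.push_back(event);
  }

  // Walks the queue (the analogue of oops_do) under the dedicated lock only.
  // service_lock is never taken here, so a thread that needs it merely to
  // deliver a notification is not blocked while the walk is stalled.
  template <typename Visitor>
  void visit_all(Visitor visitor) {
    std::lock_guard<std::mutex> guard(locks_.deferred_queue_lock);
    for (int event : events_) {
      visitor(event);
    }
  }

 private:
  Locks& locks_;
  std::deque<int> events_;
};
```

With this split, a stalled walk still holds a lock, but only one that the GC's notification path never needs.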

Have a candidate fix, will PR shortly. AFAIU, this realistically only affects GenZGC, since it overlaps service thread oop walks in one collection with GC safepoints from another concurrent collection?
20-06-2024

Oooof. I think we can relax our opportunistic notification of `ServiceThread` in `OopMapCache::trigger_cleanup`? Something like "try to lock ServiceLock for notification, and give up delivering the notification if not currently possible". Something like:

```
void OopMapCache::trigger_cleanup() {
  if (has_cleanup_work() && Service_lock->try_lock()) {
    Service_lock->notify_all();
    Service_lock->unlock();
  }
}
```
20-06-2024
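
The try-lock idea suggested above can be tried outside the JVM with standard C++ primitives. The sketch below is an assumption-laden model, not HotSpot code and not the change that was eventually integrated: `ServiceMonitor` and `try_notify_cleanup` are hypothetical names, and a `std::mutex` plus `std::condition_variable` pair stands in for the Service_lock monitor.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the Service_lock monitor.
struct ServiceMonitor {
  std::mutex lock;
  std::condition_variable cv;
};

// Opportunistic notification: never blocks the caller.
// Returns true if the notification was delivered, false if there was no
// work or the lock was busy and the notification was skipped.
inline bool try_notify_cleanup(ServiceMonitor& monitor,
                               const std::atomic<bool>& has_cleanup_work) {
  if (!has_cleanup_work.load()) {
    return false;
  }
  // std::try_to_lock attempts the lock without waiting for the holder.
  std::unique_lock<std::mutex> guard(monitor.lock, std::try_to_lock);
  if (!guard.owns_lock()) {
    return false;  // lock is held elsewhere: give up instead of blocking
  }
  monitor.cv.notify_all();
  return true;     // guard releases the lock on scope exit
}
```

The trade-off is that a skipped notification is simply lost, so the waiting side would need some other way to discover pending work eventually, for example a timed wait or a later trigger.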

The jvmti code was there first, and then we piled on the other cleanup actions. They really should be separate.
20-06-2024