Bug ID: JDK-8334594 Generational ZGC: Deadlock after OopMap rewrites in 8331572

Type: Bug
Component: hotspot
Sub-Component: gc
Affected Version: 21,23

Priority: P2
Status: Resolved
Resolution: Fixed

Submitted: 2024-06-20
Updated: 2025-01-08
Resolved: 2024-06-24

JDK 21	JDK 23	JDK 24
21.0.5 Fixed	23 Fixed	24 b04 Fixed

In one of our internal tests we have found this deadlock:

0) Various threads are waiting for the GC to hand out freed memory.

1) The GC is blocked on the Service_lock:

  Mutex::lock_without_safepoint_check()
  OopMapCache::trigger_cleanup()
  ZGenerationYoung::pause_relocate_start()
  ZGenerationYoung::collect(ZYoungType, ConcurrentGCTimer*)
  ZDriverMinor::gc(ZDriverRequest const&)
  ZThread::run_service()

2) The Service_lock is held by the ServiceThread, which is stalled waiting for the GC to give it memory:

  PlatformMonitor::wait(unsigned long)
  ZRelocateQueue::add_and_wait(ZForwarding*)
  ZRelocate::relocate_object(ZForwarding*, zaddress_unsafe) 
  ZUncoloredRootProcessWeakOopClosure::do_root(zaddress_unsafe*)
  ZNMethod::nmethod_oops_do_inner(nmethod*, OopClosure*)
  ZBarrierSetNMethod::nmethod_entry_barrier(nmethod*)
  JvmtiDeferredEventQueue::oops_do(OopClosure*, NMethodClosure*)
  // vvv Lock taken in this frame vvv  
  ServiceThread::oops_do_no_frames(OopClosure*, NMethodClosure*)
  ZStackWatermark::process_head(void*)
  ZStackWatermark::start_processing_impl(void*)
  StackWatermark::start_processing()
  StackWatermark::on_safepoint()
  SafepointMechanism::process(JavaThread*, bool, bool) 
  StringTable::clean_dead_entries(JavaThread*) 
  StringTable::do_concurrent_work(JavaThread*)
  ServiceThread::service_thread_entry
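
The three numbered states above form a wait-for cycle between the GC and the ServiceThread. As a minimal, illustrative model (plain standard C++, not HotSpot code; the `WaitForGraph` type and both function names are hypothetical), the cycle can be checked like this:

```cpp
#include <map>
#include <set>
#include <string>

// Each entry reads "key is blocked until value makes progress".
// Hypothetical model type, not part of HotSpot.
using WaitForGraph = std::map<std::string, std::string>;

// Follow wait-for edges from `start`. Returns true if the chain runs into a
// cycle (the thread can never make progress), false if the chain ends at a
// thread that is not blocked.
bool waits_in_cycle(const WaitForGraph& g, const std::string& start) {
    std::set<std::string> seen;
    std::string cur = start;
    while (seen.insert(cur).second) {
        auto it = g.find(cur);
        if (it == g.end()) return false;  // cur is not blocked; chain ends
        cur = it->second;
    }
    return true;  // revisited a node: the chain loops
}

// The edges as described in this report.
WaitForGraph reported_deadlock() {
    return {
        {"application threads", "GC"},  // 0) waiting for freed memory
        {"GC", "ServiceThread"},        // 1) blocked on the Service_lock
        {"ServiceThread", "GC"},        // 2) holds the Service_lock, waits for relocation
    };
}
```

Under this model, every chain ends in the GC/ServiceThread cycle. Erasing the "GC" edge, which is roughly what a notification that never blocks on the Service_lock would achieve, lets the remaining chains terminate.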

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/19800 Date: 2024-06-20 09:11:04 +0000
08-01-2025
[jdk21u-fix-request] Approval Request from Aleksey Shipilëv Follow-up for JDK-8331572. Applies cleanly. All tests pass.
08-08-2024
A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk21u-dev/pull/610 Date: 2024-05-28 09:44:28 +0000
09-07-2024
Post-integration triaging: I(mpact)L(ikelihood)W(orkaround) = HLH => P2 (could maybe also be HMH => P1)
26-06-2024
A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/19851 Date: 2024-06-24 09:02:30 +0000
24-06-2024
Changeset: 05ff3185 Author: Aleksey Shipilev <shade@openjdk.org> Date: 2024-06-24 08:46:10 +0000 URL: https://git.openjdk.org/jdk/commit/05ff3185edd25b381a97f6879f496e97b62dddc2
24-06-2024
> An alternative is to figure out if we can stop guarding _jvmti_service_queue with the Service_lock, and instead use another lock. It feels safer to not do any oop processing while holding the Service_lock.

True. I felt queasy about taking the Service_lock when doing the original JDK-8331572, even for a simple notification. I thought the Service_lock was intended to be a low-level lock that is almost always safe to take for a short time. That expectation is apparently violated by the _jvmti_service_queue locking. That said, this surprise tells me there might be other, similar landmines lurking. So I have now PRed a fix that side-steps the notification problem, getting us back to the state before JDK-8331572. This would also allow us to cleanly backport JDK-8331572 later.
20-06-2024
An alternative is to figure out if we can stop guarding _jvmti_service_queue with the Service_lock, and instead use another lock. It feels safer to not do any oop processing while holding the Service_lock.

As for whether this only happens with GenZGC: I'm not sure. I think at least single-generation ZGC might be affected as well. Say that it runs a relocation start safepoint, and before it runs the doit_epilogue the ServiceThread runs the code above and runs out of memory. The GC then never gets to the part where it can start relocating and hand back memory to the stalling ServiceThread.
20-06-2024
Have a candidate fix, will PR shortly. AFAIU, this realistically only affects GenZGC, since it overlaps service thread oop walks in one collection with GC safepoints from another concurrent collection?
20-06-2024
Oooof. I think we can relax our opportunistic notification of `ServiceThread` in `OopMapCache::trigger_cleanup`? The idea: "try to lock Service_lock for notification, and give up delivering the notification if not currently possible". Something like:

```
void OopMapCache::trigger_cleanup() {
  if (has_cleanup_work() && Service_lock->try_lock()) {
    Service_lock->notify_all();
    Service_lock->unlock();
  }
}
```
20-06-2024
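
For readers outside HotSpot, the "notify only if the lock is free" idea from the comment above can be sketched in standard C++. This is an analogy under stated assumptions, not the HotSpot `Monitor` API: `ServiceMonitor`, `try_notify`, and `notify_while_lock_is_held` are hypothetical names, and `std::mutex` stands in for the Service_lock.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the Service_lock and its waiters.
struct ServiceMonitor {
    std::mutex lock;
    std::condition_variable cv;
    int notifications = 0;  // delivered notifications, kept for illustration
};

// Opportunistic notification: never blocks. Returns true if the notification
// was delivered, false if there was no work or the lock was unavailable.
bool try_notify(ServiceMonitor& m, bool has_cleanup_work) {
    if (!has_cleanup_work) return false;
    std::unique_lock<std::mutex> guard(m.lock, std::try_to_lock);
    if (!guard.owns_lock()) return false;  // give up instead of waiting
    ++m.notifications;
    m.cv.notify_all();
    return true;  // guard releases the lock on scope exit
}

// Contended case: while this thread holds the lock, a second thread's attempt
// returns right away rather than blocking. Returns what that attempt reported.
bool notify_while_lock_is_held(ServiceMonitor& m) {
    std::lock_guard<std::mutex> held(m.lock);
    bool delivered = true;
    std::thread other([&] { delivered = try_notify(m, true); });
    other.join();
    return delivered;
}
```

In this sketch the notifying thread can no longer stall behind whoever holds the lock; the cost is that a notification may be dropped, so the pattern only fits work that is picked up again on a later attempt. Note that `std::mutex::try_lock` is allowed to fail spuriously, which is tolerable under the same assumption.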
The jvmti code was there first, and then we piled on the other cleanup actions. They really should be separate.
20-06-2024