JDK-8319048 : Monitor deflation unlink phase prolongs time to safepoint
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 17, 21, 22
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2023-10-27
  • Updated: 2024-08-27
  • Resolved: 2023-11-28
  • Fixed in JDK 17: 17.0.12-oracle
  • Fixed in JDK 21: 21.0.4-oracle
  • Fixed in JDK 22: 22 b26
Description
We have seen a case in our production workloads where we get time-to-safepoint (TTSP) hiccups. Profiling shows the hiccup is in the monitor deflation thread, which is busy unlinking deflated monitors. Unlinking is batched by `MonitorDeflationMax` (default: 1M) and only checks for a pending safepoint after each batch.

However, the current default batch size is unfortunately very high. Even at just 1 ns per unlinked monitor, a 1M batch means we would not respond to a safepoint request for about 1 ms. In practice, the delays appear to be up to 30 ms. This is a major latency hiccup.

Here is the simplest reproducer I have:

```
public class MonitorChurn {
   static final int THREAD_COUNT = Integer.getInteger("threads", 10);
   static final int MONITORS_PER_THREAD = Integer.getInteger("monitors", 1_000_000);

   static class HolderThread extends Thread {
     Object[] objs = new Object[MONITORS_PER_THREAD];

     public void run() {
       try {
         for (int c = 0; c < objs.length; c++) {
            objs[c] = new Object();
         }
         while (true) {
           for (int c = 0; c < objs.length; c++) {
             synchronized (objs[c]) {}
           }
         }
       } catch (Exception e) {
         e.printStackTrace();
       }
     }
   }

   static Object sink;

   public static void main(String... args) throws Exception {
     for (int c = 0; c < THREAD_COUNT; c++) {
       Thread t = new HolderThread();
       t.start();
     }
     while (true) {
       sink = new byte[1000000];
       Thread.sleep(1);
     }
   }
}
```

I added two logging statements to the deflation/unlinking code to see when we actually hit the deflation and unlinking paths.

```
diff --git a/src/hotspot/share/runtime/synchronizer.cpp b/src/hotspot/share/runtime/synchronizer.cpp
index c83df922e0d..a5a16c56968 100644
--- a/src/hotspot/share/runtime/synchronizer.cpp
+++ b/src/hotspot/share/runtime/synchronizer.cpp
@@ -132,6 +132,7 @@ size_t MonitorList::unlink_deflated(Thread* current, LogStream* ls,
   ObjectMonitor* prev = nullptr;
   ObjectMonitor* head = Atomic::load_acquire(&_head);
   ObjectMonitor* m = head;
+  log_warning(monitorinflation)("IN UNLINK PATH: " SIZE_FORMAT " MONITORS TO UNLINK", count());
   // The in-use list head can be null during the final audit.
   while (m != nullptr) {
     if (m->is_being_async_deflated()) {
@@ -1688,6 +1689,8 @@ size_t ObjectSynchronizer::deflate_idle_monitors(ObjectMonitorsHashtable* table)
   size_t unlinked_count = 0;
   size_t deleted_count = 0;
   if (deflated_count > 0 || is_final_audit()) {
+    log_warning(monitorinflation)("IN DEFLATION PATH: " SIZE_FORMAT " MONITORS DEFLATED", deflated_count);
+
     // There are ObjectMonitors that have been deflated or this is the
     // final audit and all the remaining ObjectMonitors have been
     // deflated, BUT the MonitorDeflationThread blocked for the final

```

The result is like this:

```
$ /java -XX:+UnlockDiagnosticVMOptions -Xmx256m -XX:GuaranteedAsyncDeflationInterval=1000 -XX:+UseParallelGC -XX:LockingMode=0 -XX:MonitorDeflationMax=1000000 -Xlog:safepoint+stats MonitorChurn.java

[10.589s][info   ][safepoint,stats ] VM Operation                 [ threads: total initial_running ][ time:       sync    cleanup       vmop      total ] page_trap_count
[10.589s][info   ][safepoint,stats ] ParallelGCFailedAllocation   [             20              10 ][            62375       2541   62300334   62365250 ]               0
[10.705s][info   ][safepoint,stats ] ParallelGCFailedAllocation   [             20              10 ][            42875       1375   62180458   62224708 ]               0
[10.756s][warning][monitorinflation] IN DEFLATION PATH: 1000000 MONITORS DEFLATED
[10.757s][warning][monitorinflation] IN UNLINK PATH: 2399912 MONITORS TO UNLINK
[10.834s][info   ][safepoint,stats ] ParallelGCFailedAllocation   [             20              11 ][         12407500       7208   63076667   75491375 ]               1
[10.953s][info   ][safepoint,stats ] ParallelGCFailedAllocation   [             20              11 ][            59208       2042   65406500   65467750 ]               0
[11.067s][info   ][safepoint,stats ] ParallelGCFailedAllocation   [             20              10 ][            26292       1917   61031375   61059584 ]               0
[11.182s][info   ][safepoint,stats ] ParallelGCFailedAllocation   [             20              10 ][            40334       1708   62094458   62136500 ]               0
[11.294s][info   ][safepoint,stats ] ParallelGCFailedAllocation   [             20              10 ][            43083       2167   58604166   58649416 ]               0
[11.404s][info   ][safepoint,stats ] ParallelGCFailedAllocation   [             20              10 ][            43083       2542   56941542   56987167 ]               0
```

Around the unlinking of ~2.4M monitors we incurred 12.4 ms of TTSP latency (the `sync` column above, in nanoseconds).
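As a back-of-envelope check using only the numbers from the log above: attributing the whole 12.4 ms sync time to unlinking ~2.4M monitors implies roughly 5 ns per unlinked monitor on this machine. That attribution is an approximation, and the 0.1 ms TTSP budget below is an arbitrary example, but it gives a feel for how small the unlink batch would have to be:

```
#include <cstdio>

int main() {
  // Numbers from the log above; the "sync" column is in nanoseconds.
  const double monitors = 2399912.0;    // "MONITORS TO UNLINK"
  const double sync_ns  = 12407500.0;   // sync time of the affected safepoint

  const double ns_per_monitor = sync_ns / monitors;    // roughly 5 ns
  const double budget_ns      = 100000.0;              // example 0.1 ms budget
  const double batch          = budget_ns / ns_per_monitor;

  std::printf("~%.1f ns per unlinked monitor\n", ns_per_monitor);
  std::printf("a batch of ~%.0f monitors keeps unlinking under %.2f ms\n",
              batch, budget_ns / 1e6);
}
```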

Naively, one can tune down `MonitorDeflationMax`, but that also caps the number of monitors we actually deflate per deflation phase, which unfortunately lets the monitor population creep up, as the deflater thread becomes less effective.

We need to take a deeper look into the safepoint checking policy during unlinking. Interestingly, the deflation loop already checks for a pending safepoint on every monitor. It is not immediately clear why unlinking requires batching. If it does require batching for a good reason, we might want to split the thresholds into `MonitorDeflationMax` (which would continue to drive the deflation batch) and a separate `MonitorUnlinkMax` (which would drive the batching for unlinking).
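As a rough illustration of the direction suggested above - polling for a pending safepoint every N unlinked monitors instead of every `MonitorDeflationMax` of them - here is a standalone toy sketch. It is not HotSpot code: `Monitor`, `poll_safepoint` and `kUnlinkBatch` are made-up names, and the real `MonitorList::unlink_deflated` also has to cope with concurrent insertions at the list head, which this toy ignores.

```
#include <atomic>
#include <cstddef>
#include <cstdio>

struct Monitor {
  Monitor* next = nullptr;
  bool deflated = false;    // stands in for is_being_async_deflated()
};

static std::atomic<bool> safepoint_requested{false};

// Placeholder for a real safepoint poll; here it only reports.
static void poll_safepoint() {
  if (safepoint_requested.load(std::memory_order_acquire)) {
    std::printf("would block for safepoint here\n");
    safepoint_requested.store(false, std::memory_order_release);
  }
}

// Unlink deflated monitors, polling after every kUnlinkBatch unlinked nodes.
static size_t unlink_deflated(Monitor*& head, size_t kUnlinkBatch) {
  size_t unlinked = 0;
  Monitor* prev = nullptr;
  Monitor* m = head;
  while (m != nullptr) {
    Monitor* next = m->next;
    if (m->deflated) {
      if (prev == nullptr) {
        head = next;          // unlink from the list head
      } else {
        prev->next = next;    // unlink from the middle
      }
      if (++unlinked % kUnlinkBatch == 0) {
        poll_safepoint();     // bounds the time between safepoint polls
      }
    } else {
      prev = m;
    }
    m = next;
  }
  return unlinked;
}

int main() {
  // Build a small list where every other monitor is deflated.
  const size_t n = 10;
  Monitor nodes[n];
  for (size_t i = 0; i + 1 < n; i++) nodes[i].next = &nodes[i + 1];
  for (size_t i = 0; i < n; i += 2) nodes[i].deflated = true;
  Monitor* head = &nodes[0];
  safepoint_requested = true;
  std::printf("unlinked %zu monitors\n", unlink_deflated(head, 2));
}
```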
Comments
[jdk17u-fix-request] Approval Request from Aleksey Shipilëv Unclean backport to eliminate another safepoint hiccup caused by monitor deflation. The uncleanliness comes from the usual `NULL` vs `nullptr` differences. Otherwise applies cleanly. We are not at risk with interleaving with anything else in JDK 17: JDK-8273107 is not present, so we don't collide with VM_ThreadDump; final audit (removed in JDK 21+ with JDK-8319896) should not be risky. All tests pass.
18-04-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk17u-dev/pull/2309 Date: 2024-03-19 18:33:35 +0000
21-03-2024

[jdk21u-fix-request] Approval Request from Aleksey Shipilëv Clean backport to improve TTSP under heavy monitor deflation, and provide better parity with 21.0.4-oracle. Applies cleanly. All tests pass. The risk is medium: it touches the fiddly monitor code. We have seen no bug tail since the original integration in Nov 2023.
19-03-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk21u-dev/pull/369 Date: 2024-03-18 09:15:14 +0000
18-03-2024

Thanks for confirming, [~shade]
14-03-2024

Yes, that is my understanding as well. I am waiting for JDK-8318757 to land in jdk21u-dev, after which I am planning to propose this fix to jdk21u-dev. I plan to propose this fix to jdk17u-dev as well.
14-03-2024

Backporting considerations (summary of my understanding, based on comments in JDK-8318757 and https://github.com/openjdk/jdk/pull/16412)

Background: Monitor deflation is the process of removing the explicit native monitors that are created for objects used in synchronization. Since any object can be synchronized on, there can in theory be as many native monitors as there are objects. In the past, monitor deflation was always done at a safepoint. This fell out of fashion: deflation started being done concurrently in a separate thread, and in the JDK 17 timeframe there was in principle no deflation done at safepoints.

Timeline: JDK-8273107 changed this - introduced in JDK 18, it added monitor deflation inside a safepoint again, in the VM_ThreadDump VM operation. JDK-8318757 addressed a problem with this: when concurrent deflation was interleaved with in-safepoint deflation, crashes could happen. This did not happen often in practice, because part of the concurrent deflation process was not polling for safepoints that much. JDK-8318757 removed the deflation happening inside VM_ThreadDump. JDK-8319048 (this issue) makes the last part of concurrent deflation poll for safepoints more often. This reduces time-to-safepoint, but it would have hit the issue from JDK-8273107 much more often, had JDK-8318757 not been introduced.

Conclusions: If/when backporting JDK-8319048, care must be taken to check whether JDK-8273107 exists in the backport target. If it does, then JDK-8318757 must be backported before JDK-8319048. Thanks to [~stefank] [~eosterlund] for bringing this to my attention and helping me understand the interplay.
14-03-2024

Changeset: efc39225 Author: Aleksey Shipilev <shade@openjdk.org> Date: 2023-11-28 09:49:03 +0000 URL: https://git.openjdk.org/jdk/commit/efc392259c64986bbbe880259e95b09058b9076a
28-11-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/16412 Date: 2023-10-30 08:20:37 +0000
30-10-2023

I see why batching is needed: we have to modify the MonitorList, and batching basically removes entire runs of unlinkable monitors in one list update. Then we probably just need to make sure those batches are not overly large.
30-10-2023
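To make the point in the comment above concrete, here is a standalone toy (again not HotSpot's MonitorList code; all names are illustrative). When deflated monitors sit in contiguous runs, the whole run can be spliced out with a single predecessor-pointer (or head) update, so the number of list updates can be much smaller than the number of unlinked monitors; the remaining question is how long a run to process before polling for a safepoint.

```
#include <cstddef>
#include <cstdio>

struct Monitor {
  Monitor* next = nullptr;
  bool deflated = false;
};

// Splice out runs of deflated monitors; returns how many list updates we did.
static size_t unlink_runs(Monitor*& head, size_t* unlinked_out) {
  size_t updates = 0, unlinked = 0;
  Monitor* prev = nullptr;
  Monitor* m = head;
  while (m != nullptr) {
    if (m->deflated) {
      Monitor* run_end = m;
      size_t run_len = 1;
      while (run_end->next != nullptr && run_end->next->deflated) {
        run_end = run_end->next;   // extend the run of deflated monitors
        run_len++;
      }
      // One pointer update removes the whole run.
      if (prev == nullptr) {
        head = run_end->next;
      } else {
        prev->next = run_end->next;
      }
      updates++;
      unlinked += run_len;
      m = run_end->next;
    } else {
      prev = m;
      m = m->next;
    }
  }
  *unlinked_out = unlinked;
  return updates;
}

int main() {
  Monitor nodes[8];
  for (int i = 0; i + 1 < 8; i++) nodes[i].next = &nodes[i + 1];
  // Mark a contiguous run of five deflated monitors in the middle.
  for (int i = 1; i <= 5; i++) nodes[i].deflated = true;
  Monitor* head = &nodes[0];
  size_t unlinked = 0;
  size_t updates = unlink_runs(head, &unlinked);
  std::printf("%zu monitors unlinked with %zu list update(s)\n",
              unlinked, updates);   // expect: 5 monitors, 1 update
}
```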