We have seen a case in our production systems with TTSP (time-to-safepoint) hiccups. Profiling shows the hiccup is in the monitor deflation thread, which is busy unlinking deflated monitors. That work is batched by `MonitorDeflationMax` (default: 1M), with a safepoint check after each batch.
Unfortunately, the current default batch size is very high. Even at just 1 ns per unlinked monitor, a 1M batch means we would not respond to a safepoint request for about 1 ms. In practice, the delays reach up to 30 ms, which is a major latency hiccup.
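To illustrate the shape of the problem, here is a minimal sketch of a batched list walk that polls for a safepoint only once per batch. This is not the actual HotSpot code: the `Monitor` type, `poll_safepoint` helper, and the `unlink_deflated` signature are all hypothetical stand-ins.
```
#include <cstddef>

// Hypothetical stand-ins for illustration; these are not the real
// HotSpot identifiers.
struct Monitor { Monitor* next; bool deflated; };

static void poll_safepoint() {
  // In HotSpot, the thread would block here if a safepoint is pending.
}

// Walk the in-use list and unlink deflated monitors, honoring a
// pending safepoint only once per 'batch' unlinked monitors.
static void unlink_deflated(Monitor** head, size_t batch) {
  size_t unlinked = 0;
  for (Monitor** link = head; *link != nullptr; ) {
    Monitor* m = *link;
    if (m->deflated) {
      *link = m->next;      // unlink; the monitor is reclaimed later
      if (++unlinked % batch == 0) {
        poll_safepoint();   // with batch == 1M, polls can be ~1 ms apart
      }
    } else {
      link = &m->next;
    }
  }
}
```
With `batch` at the current `MonitorDeflationMax` default of 1M, each poll-to-poll stretch covers a million list nodes, which is where the ~1 ms (and, in practice, much worse) TTSP comes from.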
Here is the simplest reproducer I have:
```
public class MonitorChurn {
    static final int THREAD_COUNT = Integer.getInteger("threads", 10);
    static final int MONITORS_PER_THREAD = Integer.getInteger("monitors", 1_000_000);

    static class HolderThread extends Thread {
        // Keep the objects reachable so their monitors stay on the in-use list.
        Object[] objs = new Object[MONITORS_PER_THREAD];

        public void run() {
            try {
                for (int c = 0; c < objs.length; c++) {
                    objs[c] = new Object();
                }
                // Keep synchronizing to churn the monitors.
                while (true) {
                    for (int c = 0; c < objs.length; c++) {
                        synchronized (objs[c]) {}
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    static Object sink;

    public static void main(String... args) throws Exception {
        for (int c = 0; c < THREAD_COUNT; c++) {
            Thread t = new HolderThread();
            t.start();
        }
        // Allocate steadily to trigger periodic GC safepoints.
        while (true) {
            sink = new byte[1000000];
            Thread.sleep(1);
        }
    }
}
```
I added two logging statements to the deflation/unlinking code to see when we actually hit those paths.
```
diff --git a/src/hotspot/share/runtime/synchronizer.cpp b/src/hotspot/share/runtime/synchronizer.cpp
index c83df922e0d..a5a16c56968 100644
--- a/src/hotspot/share/runtime/synchronizer.cpp
+++ b/src/hotspot/share/runtime/synchronizer.cpp
@@ -132,6 +132,7 @@ size_t MonitorList::unlink_deflated(Thread* current, LogStream* ls,
ObjectMonitor* prev = nullptr;
ObjectMonitor* head = Atomic::load_acquire(&_head);
ObjectMonitor* m = head;
+ log_warning(monitorinflation)("IN UNLINK PATH: " SIZE_FORMAT " MONITORS TO UNLINK", count());
// The in-use list head can be null during the final audit.
while (m != nullptr) {
if (m->is_being_async_deflated()) {
@@ -1688,6 +1689,8 @@ size_t ObjectSynchronizer::deflate_idle_monitors(ObjectMonitorsHashtable* table)
size_t unlinked_count = 0;
size_t deleted_count = 0;
if (deflated_count > 0 || is_final_audit()) {
+ log_warning(monitorinflation)("IN DEFLATION PATH: " SIZE_FORMAT " MONITORS DEFLATED", deflated_count);
+
// There are ObjectMonitors that have been deflated or this is the
// final audit and all the remaining ObjectMonitors have been
// deflated, BUT the MonitorDeflationThread blocked for the final
```
The result looks like this (note that `-XX:LockingMode=0` forces every `synchronized` operation to use an inflated monitor, and `-XX:GuaranteedAsyncDeflationInterval=1000` guarantees a deflation pass at least every second):
```
$ /java -XX:+UnlockDiagnosticVMOptions -Xmx256m -XX:GuaranteedAsyncDeflationInterval=1000 -XX:+UseParallelGC -XX:LockingMode=0 -XX:MonitorDeflationMax=1000000 -Xlog:safepoint+stats MonitorChurn.java
[10.589s][info ][safepoint,stats ] VM Operation [ threads: total initial_running ][ time: sync cleanup vmop total ] page_trap_count
[10.589s][info ][safepoint,stats ] ParallelGCFailedAllocation [ 20 10 ][ 62375 2541 62300334 62365250 ] 0
[10.705s][info ][safepoint,stats ] ParallelGCFailedAllocation [ 20 10 ][ 42875 1375 62180458 62224708 ] 0
[10.756s][warning][monitorinflation] IN DEFLATION PATH: 1000000 MONITORS DEFLATED
[10.757s][warning][monitorinflation] IN UNLINK PATH: 2399912 MONITORS TO UNLINK
[10.834s][info ][safepoint,stats ] ParallelGCFailedAllocation [ 20 11 ][ 12407500 7208 63076667 75491375 ] 1
[10.953s][info ][safepoint,stats ] ParallelGCFailedAllocation [ 20 11 ][ 59208 2042 65406500 65467750 ] 0
[11.067s][info ][safepoint,stats ] ParallelGCFailedAllocation [ 20 10 ][ 26292 1917 61031375 61059584 ] 0
[11.182s][info ][safepoint,stats ] ParallelGCFailedAllocation [ 20 10 ][ 40334 1708 62094458 62136500 ] 0
[11.294s][info ][safepoint,stats ] ParallelGCFailedAllocation [ 20 10 ][ 43083 2167 58604166 58649416 ] 0
[11.404s][info ][safepoint,stats ] ParallelGCFailedAllocation [ 20 10 ][ 43083 2542 56941542 56987167 ] 0
```
Near the 2.4M-monitor unlink, TTSP jumps to about 12.4 ms (the 12407500 ns `sync` entry at 10.834s), against the usual tens of microseconds.
Naively, one could tune down `MonitorDeflationMax`, but that also caps the number of monitors we actually deflate per deflation pass, which lets the monitor population creep up, as the deflater thread makes less progress per cycle.
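For illustration, that stopgap tuning would look like this (the value is arbitrary, not a recommendation):
```
$ java ... -XX:MonitorDeflationMax=100000 ... MonitorChurn.java
```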
We need to take a deeper look at the safepoint checking policy during unlinking. Interestingly, the deflation loop already checks for a pending safepoint after every monitor, so it is not immediately clear why unlinking requires batching at all. If it does require batching for a good reason, we might want to split the threshold in two: `MonitorDeflationMax` (which would continue to drive the deflation batch) and a new `MonitorUnlinkMax` (which would drive the batching for unlinking).
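In terms of the sketch above, the split would amount to the unlink loop polling on its own, smaller threshold; `MonitorUnlinkMax` is hypothetical and does not exist today:
```
if (++unlinked % MonitorUnlinkMax == 0) {  // hypothetical new flag,
  poll_safepoint();                        // decoupled from MonitorDeflationMax
}
```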