JDK-8317960 : [17u] Excessive CPU usage on AbstractQueuedSynchronized.isEnqueued
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.concurrent
  • Affected Version: 17
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2023-10-11
  • Updated: 2024-01-31
  • Resolved: 2023-11-29
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 17
17.0.11 b01Fixed
Related Reports
Relates :  
Relates :  
Description
We've seen several services running JDK17 with excessive CPU wasted on AbstractQueuedSynchronizer.isEnqueued. Calls to this method may represent up to 25% of CPU usage in some services, with more hovering around the 10% mark. We did not see this with JDK11 (AQS relevant methods were added in JDK14 as part of JDK-8229442) or JDK21

Top of the stack for that block consists of:
java/util/concurrent/locks/AbstractQueuedSynchronizer.isEnqueued (24.60%)
java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject.canReacquire (24.60%)
java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject.await (26.39%)
java/util/concurrent/ScheduledThreadPoolExecutor$DelayedWorkQueue.take (27.75%)

Testing the same applications with JDK21 causes all that CPU usage to go away, with AbstractQueuedSynchronizer$ConditionObject.await taking just (0.11% of cpu).

It seems the performance fix comes from JDK-8277090. Backporting the AQS related changes to JDK17 seems to fix the performance issue, although it is not clear based on JDK-8277090 if those changes are dependent on other
Comments
Changeset: 14e68123 Author: David Alvarez <alvdavi@amazon.com> Committer: Paul Hohensee <phh@openjdk.org> Date: 2023-11-29 17:38:17 +0000 URL: https://git.openjdk.org/jdk17u/commit/14e681234d4557170937fe7d2f4306cfd59b8242
31-01-2024

Changeset: 14e68123 Author: David Alvarez <alvdavi@amazon.com> Committer: Paul Hohensee <phh@openjdk.org> Date: 2023-11-29 17:38:17 +0000 URL: https://git.openjdk.org/jdk17u-dev/commit/14e681234d4557170937fe7d2f4306cfd59b8242
29-11-2023

[jdk17u-fix-request] Approval Request from David Alvarez Change is present in JDK21 and considered safe.
20-11-2023

It's hard to come up with calling contexts cases where a precheck in canReacquire would be worthwhile. So it wouldn't hurt, but adds another conditional that would almost never trigger.
26-10-2023

Update on reproducer: it would seem that the initial filter in `canReacquire`, `node.prev != null` only opens up the window to dive into `isEnqueued` when we are actually in the middle of enqueuing. So coming up with the reproducer would require some careful stress-ing / timing. Not sure if it is worthwhile to pursue.
26-10-2023

Oh, while you are here, Doug! David and me have a related question: should we check for `node == head` in `canReacquire` too? As it stands, if `ConditionNode` is the first node in the queue, then `node.prev == null`, and we would potentially dive into `isEnqueued` again, which exposes us to similar performance pitfall?
26-10-2023

IIRC, we have only seen this in internal project stress test, and it would be nice to have a clean standalone reproducer. Not required, arguably, but would be nice to have for any point fix for 17u.
26-10-2023

Thanks Aleksey for the nice explanation, but I don't see a need to create another performance test here because this test already serves as one?!
26-10-2023

The risk for this patch was not clear to me, so I dived deeper into this. Let me see if I understand what is going on here. AQS maintains the CLH-style sync queue for waiters. The nodes in that queue are linked through `prev` and `next`. One can always walk the sync queue from `TAIL` via `prev`-s. `next` is used to notify the next waiter. Assume we have queue with a single node. Global `TAIL` points to `N1`. The enqueue of new node `N2` happens by: ``` N2.prev = N1; if (CAS(&TAIL, N1, N2)): N1.next = N2; ``` The key thing about this issue, it is about conditions. The condition behavior is encapsulated in `ConditionNode`. On creation, `ConditionNode` node *is not* on the (main) sync queue. The conditions are linked on their own special condition queue. When someone signals the condition, we transfer that condition to the (main) sync queue. So the conditional wait loops basically take a form of: void await() { ConditionNode node = newConditionNode(); ... insert to conditions queue ... while (!canReacquire(node)) { // are we on sync queue yet? ... blocked wait ... } ... acquire the lock ... } What `canReacquire` does: it checks that the condition node in question is on the sync queue. That is, it tests if the condition was signaled. This nominally requires walking the sync queue from `tail` via `prev`-s looking for the node -- that is what `isEnqueued` does. But there is also a shortcut: if we know that predecessor node on sync queue points back at our `ConditionNode`, this means it was successfully enqueued by someone (see enqueue code above)! Therefore, we do not need to walk the main sync queue. This is the shortcut that this change adds, and it looks self-contained. This probably gives us a clearer picture how to reproduce the issue: many threads block on the lock, thus a large sync queue, then a pair of thread ping-pong through wait/notify, so that we keep walking the sync queue in `canReacquire` all the time on signal and re-acquisition. Let me see if I can do a microbenchmark out of this.
26-10-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk17u-dev/pull/1875 Date: 2023-10-13 11:44:22 +0000
13-10-2023

As I recall, the AQS change was motivated by virtual threads and blocking queues but could be more widely useful.
11-10-2023

yes, applying only that patch shouldn't cause any other incompatibilities.
11-10-2023

[~dl], what do you think about this one? Can we do a limited performance fix for 17u, like the "jdk17-aqs.patch" in the attachment?
11-10-2023

Added the portions of JDK-8277090 related to AQS we backported to JDK17 for testing as jdk17-aqs.patch
11-10-2023