JDK-8342605 : C2: Tighten up polling page loads for safepoint polls
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 24
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2024-10-18
  • Updated: 2024-11-08
Description
A long time ago, safepoint polls worked simply by polling a page that the runtime would mprotect to PROT_NONE when needed. This meant we could treat the polling page address as a constant; we even used RIP-relative encoding for it. C2 code simply emitted the constant as the Safepoint node input.

With JDK-8189941 (Thread-local handshakes), the VM handles safepoints by creating two pages, one readable and one unreadable: https://github.com/openjdk/jdk/blob/8174cbd5cb797a80d48246a686897ef6fe64ed57/src/hotspot/share/runtime/safepointMechanism.cpp#L65-L75

This is smart, as we do not have to mprotect anything when a safepoint is needed; we only need to swap the _address_ of the unreadable page into the thread-local field: https://github.com/openjdk/jdk/blob/8174cbd5cb797a80d48246a686897ef6fe64ed57/src/hotspot/share/runtime/safepointMechanism.inline.hpp#L94-L102

...and then we just read the address of the current page and poll it. If it is unreadable, we trap. So far so good.

Now we go to C2 code. C2 emits the load of the page address from the thread-local field and feeds that address into the Safepoint node: https://github.com/openjdk/jdk/blob/8174cbd5cb797a80d48246a686897ef6fe64ed57/src/hotspot/share/opto/parse1.cpp#L2271-L2276

Correctness question: What guarantees that the load of the page address and the safepoint poll stay close? In other words, what guarantees that we _do not_ hoist the load of the *currently readable page* away from the Safepoint node, and thus miss the transition of that thread-local location to the *unreadable page*? This seems to work well currently, since the polling page load and the safepoint share control, and the safepoint produces control as well.

Performance question: Are we risking pulling something heavy-weight in between the read of the polling page address and the poll, so that we delay reacting to a safepoint?

I think both can be solved if we push the polling address load down into the Safepoint match rules. Example for AArch64:
 https://github.com/openjdk/jdk/compare/master...shipilev:jdk:JDK-8342605-c2-safepoint-poll
Comments
Noob question: Is it possible to replace the unreadable page address with null instead? A compare-and-branch is more efficient than a load, since the compare and branch macro-fuse into a single instruction; on x86 we can even fuse the load of the field into the compare.
08-11-2024

I think there is indeed no correctness issue; I tried searching for cases where any suspicious commoning might have occurred, and found none so far. So I am going to pursue this as a performance optimization: a) for platforms that could use scratch registers for safepoint polls, like AArch64; b) for all other platforms, to minimize the window between reading the polling page address and the safepoint poll.
23-10-2024

Since safepoint nodes produce control, sharing control between the polling page address load and the corresponding safepoint node should be enough to ensure no commoning happens between polling page address loads in C2. Otherwise, safepoint polls could become redundant when TLH is enabled. Keeping the safepoint and the load separate may benefit code generation by enabling early load scheduling (even if the address resides in L1D cache), but I have some doubts it matters on modern hardware.
23-10-2024

ILW = possible correctness/performance issues with safepoint polling; never seen; no workaround = MLH
23-10-2024

Recording a debugging breadcrumb here. I think I see another reason why C2 does this two-stage thing: oop maps need to be bound to the safepoint poll itself, not to the polling page load. So, if we push the polling page load down into the Safepoint node, then the oop map would be associated with the polling page load inside it.
21-10-2024

> It seems like with the current approach, if the method had multiple safepoint polls, then we could load the polling page once, resulting in slightly faster and more compact code.

I have not seen this load commoning in practice, but assuming it happens, this is exactly what we _do not_ want to happen, right? It would have made sense if we loaded the constant address of the polling page that would get mprotected later when a safepoint fires. But by commoning the "current" polling page addresses, we render subsequent safepoint polls that use that "optimized" address ineffective: they are guaranteed not to trap if the first safepoint poll did not trap. I.e.:

  ldr x8, [x28, #48]  ; <--- load polling value
  ldr wzr, [x8]       ; <--- safepoint poll 1 (assume did not trap)
  ... do something ...
  ldr wzr, [x8]       ; <--- safepoint poll 2 (guaranteed not to trap)

In other words, you might have inserted safepoint poll 2 for some good reason, but it is no longer possible for it to work as a real safepoint poll. At that point, polling page address load commoning effectively eliminates it, badly, still leaving the unnecessary, always-succeeding poll. This would be very surprising, I think.
21-10-2024

A few observations: It seems like with the current approach, if the method had multiple safepoint polls, then we could load the polling page once, resulting in slightly faster and more compact code. Also, on AArch64, the current sequence seems to be equivalent to something like:

  ldr x8, [x28, #48]  ; <--- load polling value
  cbz x8, safepoint_stub

if we store 0 in the thread-local value instead of the protected page address.
19-10-2024

Yes to a): they stay together because of control and the RAW load, which doesn't move across safepoints. b) It may show up on machines with a different cache hierarchy (Ampere?) on first touch; after that it will be in cache anyway. c) Looks reasonable. What about other platforms (x86)?
18-10-2024

Some more investigation on AArch64. The simplest JMHSample_01_Hello produces this:

# Baseline
10.44% ↗ 0x0000ffff83efb440: ldarb w11, [x10]
11.01% │ 0x0000ffff83efb444: ldr x12, [x28, #48] ; <--- load polling page
 0.81% │ 0x0000ffff83efb448: add x19, x19, #0x1
 9.78% │ 0x0000ffff83efb44c: ldr wzr, [x12]      ; <--- touch
58.96% ╰ 0x0000ffff83efb450: cbz w11, 0x0000ffff83efb440

# Patched
11.58% ↗ 0x0000ffff83efbac0: ldarb w11, [x10]
 1.52% │ 0x0000ffff83efbac4: add x19, x19, #0x1
10.65% │ 0x0000ffff83efbac8: ldr x8, [x28, #48]  ; <--- load polling page
10.82% │ 0x0000ffff83efbacc: ldr wzr, [x8]       ; <--- touch
55.97% ╰ 0x0000ffff83efbad0: cbz w11, 0x0000ffff83efbac0

Observations:
a) The load of the polling page indeed floats around, although I suspect it would still be in the same block, as both the load and the SafepointNode have the same control().
b) There might be pipelining benefits from loading the polling page a bit farther from the poll itself, but they do not show up on this benchmark, at least on Graviton 3, which manages to execute the entire loop iteration in 1 cycle on average.
c) With a Safepoint match rule, we are able to leverage rscratch1, which AFAICS is not normally allocatable, so this way of polling would reduce register pressure in tight loops a bit.
18-10-2024

That's sensible; I don't think there is any benefit in separating them into two nodes anyway. It would also allow us to use a load for polling instead of a load-test, which is a win regardless. Please correct me if I am wrong.
18-10-2024