Bug ID: JDK-8242115 C2 SATB barriers are not safepoint-safe

JDK-8242115 : C2 SATB barriers are not safepoint-safe

Type: Bug
Component: hotspot
Sub-Component: compiler
Affected Version: 11,15,17,19,20

Priority: P3
Status: Resolved
Resolution: Fixed

Submitted: 2020-04-03
Updated: 2024-07-27
Resolved: 2022-10-03

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

JDK 11	JDK 17	JDK 20
11.0.19-oracleFixed	17.0.7-oracleFixed	20 b18Fixed

Related Reports

Relates :

JDK-8295066 - Folding of loads is broken in C2 after JDK-8242115

Description

In ZGC we are very sensitive to accesses and their GC barriers being separated by safepoints. We never managed to tame the sea of nodes to ensure this, and hence elected to go with expanding the barriers as late as humanly possible, right before assembly.

For some reason it has been believed that despite this being a huge problem for ZGC, it is not at all a problem for, e.g., SATB collectors (like G1 and Shenandoah). But they face the same issue. Consider the Reference.get() intrinsic. It loads the referent, and enqueues it into its thread-local SATB buffer. The load of the referent, and the store, that publishes it to the thread-local SATB buffer may *not* be separated by any safepoints.
At safepoint polls, we can also deoptimize. Most of the time G1 is fine, because of re-marking when concurrent marking terminates in a safepoint. If it finds something not yet marked on the stack, it marks it during remark. That is probably why we do not see any crashes. However, consider a Java call being between the referent load and the SATB-barrier. The nmethod may then get deoptimized when the callee returns. In such a scenario, nobody will have marked the referent, and when we roll into the interpreter, it may store it into some field, and return. Now the object graph has been corrupted with an object that will never get marked.

There are similar issues involving stack walkers that can catch the local storing the referent, before it has become SATB enqueued, and then the nmethod deoptimizes.

Not your every day race, but it is possible in theory, and is quite annoying.

This similarly applies to the "is marking active" load expanded in the pre-write store barriers for G1.

I have written some verification code for G1 (only works without compressed oops for now):
http://cr.openjdk.java.net/~eosterlund/to_kim/webrev.00/
This patch tags the referent loads, and the SATB store with the same unique number. Then it checks the mach nodes before generating machine code, matching the load and store, traversing the store and its dominators up to the load, asserting that there are no safepoints in these blocks. With SPECjbb2015 the assertion fails after a few seconds, which gives a hint that this scary stuff is probably happening for real, and has been broken since forever.

Comments

Fix request [17u] I backport this for parity with 17.0.7-oracle. A bigger change fixing a bug in C2 for GC. Typical C2 risk. I guess we should go along. I had to do a trivial resolve in memnode.hpp. SAP nightly testing passed.
30-12-2022
A pull request was submitted for review. URL: https://git.openjdk.org/jdk17u-dev/pull/999 Date: 2022-12-29 12:55:42 +0000
29-12-2022
Changeset: c6e3daa5 Author: Igor Veresov <iveresov@openjdk.org> Date: 2022-10-03 17:40:10 +0000 URL: https://git.openjdk.org/jdk/commit/c6e3daa5fa0bdbe70e5bb63302bbce1abc5453fe
03-10-2022
A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/10517 Date: 2022-09-30 21:22:47 +0000
30-09-2022
Deferring to JDK 20 for now because we are close to RDP 2 for JDK 19 and we need more time to investigate.
19-07-2022
I'll dig around and see what I can find.
21-06-2022
Right. Notably last time I needed such loads to not float, I ended up not solving the problem, even with pinned control flow. That’s why ZGC has super late barrier expansion today. So I don’t expect using pinned control dependency to be enough, but it’s a start for investigating what more can go wrong.
15-06-2022
Based on discussion with Erik we may need to mark pre-value Load nodes with LoadNode::ControlDependency::Pinned and check it in LoadNode::Ideal() to make sure these loads kept in place (no control edge removal, no SafePoint bypassing).
15-06-2022
I strongly believe runtime/cds/appcds/dynamicArchive/LotsUnloadTest.java fails due to that with the attached patch (enables verification, adds some more runs to the test) with a reproduction rate of ~3% or so. I.e. the verification always finds exactly two instance of java.lang.invoke.BoundMethodHandle$Species_L in the stack frame that are not live any more, causing "random" followup crashes. Applying the patch from [~eosterlund] makes that test fail in almost all runs due to the introduced verification. There is still the possibility that this is only a naked oop/missing barrier somewhere.
14-06-2022
After investigation the conclusion is that this was a false-positive. There are indications that this might be a problem - but we have no reproducer at the moment. Reopening if a reproducer is found.
05-06-2020
As far as I can tell, remarking (G1RemarkThreadsClosure) only drains the thread-local SATB buffers and visits the code-cache. It doesn't re-mark the thread stacks (and thus active registers, etc). That doesn't seem to explain why we haven't seen any problems with this, yet.
17-04-2020