JDK-8253183 : Fragile memory barrier selection for some weak memory model platforms
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 15,16
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • CPU: aarch32
  • Submitted: 2020-09-15
  • Updated: 2020-10-05
  • Resolved: 2020-09-30
Fixed in: JDK 16 b18
Description
JDK-8153224 introduced two usages of "support_IRIW_for_not_multiple_copy_atomic_cpu" to select a stronger memory barrier for platforms which are not multi-copy atomic.

This looked wrong at first glance for arm32, because it is not multi-copy atomic but uses support_IRIW_for_not_multiple_copy_atomic_cpu = false.
However, it seems to work, because OrderAccess::loadload() is implemented in a much stronger way than on other platforms. Relying on this in shared code is probably very unexpected to most developers. There should be a more comprehensible implementation.

In addition, if support_IRIW_for_not_multiple_copy_atomic_cpu gets changed on PPC64 in the future (e.g. for experiments or for evaluating options for JEP 188: Java Memory Model Update), this will become incorrect for that platform, too.
(There are no plans to support this officially. Just for experiments.)
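
For illustration, the fragile selection in shared code has roughly this shape (a minimal sketch based on the description above, not the exact source touched by JDK-8153224):

    // Sketch: how shared code currently picks the barrier (illustrative only).
    if (support_IRIW_for_not_multiple_copy_atomic_cpu) {
      // Opted-in nMCA platforms (PPC64): a full fence makes the stores we
      // may be loading from globally visible before the subsequent load.
      OrderAccess::fence();
    } else {
      // Everyone else: a plain load-load barrier. On arm32 this is only
      // correct because loadload() happens to be a full "dmb sy".
      OrderAccess::loadload();
    }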
Comments
Changeset: dc3a0f5f
Author: Martin Doerr <mdoerr@openjdk.org>
Date: 2020-09-30 09:03:06 +0000
URL: https://git.openjdk.java.net/jdk/commit/dc3a0f5f
30-09-2020

OrderAccess::loadload_for_iriw() sounds good to me too.
22-09-2020

Something like loadload_enforce_globally_consistent_order would probably be too long. OrderAccess::loadload_for_iriw() sounds good to me.
22-09-2020

I'll assume you meant OrderAccess::iriw_fence()? :) The problem I have with iriw_fence() is that it doesn't convey anything about the semantics that might help you understand when you need to insert it. We need this in very specific (and hard to recognise) situations where we have a load-load ordering constraint but also require the stores we may be loading to have become globally visible. Would loadload_fence() be terrible? It tries to reflect a stronger load-load barrier. Or maybe loadload_with_iriw_fence() or loadload_for_iriw()?
21-09-2020

How about Atomic::iriw_fence()?
21-09-2020

Thanks, guys, for figuring this out. And sorry for the confusion regarding the jcstress tests and the JMM. Yeah, we didn't know much about the arm 32 port (which was closed source) when "support_IRIW_for_not_multiple_copy_atomic_cpu" was introduced. We'd probably have chosen a better name if we had known the arm 32 folks wanted to use a different trick (loadload = fence, which sounds horrible, but is valid). I agree: hiding the CPU_MULTI_COPY_ATOMIC-sensitive selection behind an abstraction sounds like the best option to me. The next challenge is to find a name everybody can live with.
21-09-2020

The history behind the introduction of support_IRIW_for_not_multiple_copy_atomic_cpu can be found in the following discussion threads for the PPC64 port, as referenced in JDK-8032366 "Implement C1 support for IRIW conformance on non-multiple-copy-atomic platforms":

http://mail.openjdk.java.net/pipermail/hotspot-dev/2013-November/011803.html

and continuing:

http://mail.openjdk.java.net/pipermail/hotspot-dev/2013-December/011888.html
http://mail.openjdk.java.net/pipermail/hotspot-dev/2014-January/012130.html
http://mail.openjdk.java.net/pipermail/hotspot-dev/2014-January/012186.html

----

So this was only about correct volatile access on non-MCA systems that needed (and opted in to) special handling. And yes, we did decide that certain ARM ports did not actually need to opt in to this. Today, with everything open, it would make sense to revisit this and use clearer names etc. In the reviews for the async monitor deflation changes my head was in such a spin that I didn't recognise we even had an IRIW-like situation, nor that using support_IRIW_for_not_multiple_copy_atomic_cpu as a proxy for is-nMCA was wrong. Mea culpa on that.

Given we have CPU_MULTI_COPY_ATOMIC, it seems to me that the simple fix for the monitor code is:

#ifndef CPU_MULTI_COPY_ATOMIC
  OrderAccess::fence();
#else
  OrderAccess::loadload();
#endif

though we could hide that behind a more suitable abstraction (if we can agree on one).
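
To make that concrete, one possible shape for such an abstraction (a hypothetical sketch; the loadload_for_iriw() name is the one proposed later in this thread, and this is not the committed fix):

    // Hypothetical wrapper hiding the nMCA-sensitive barrier selection.
    inline void OrderAccess::loadload_for_iriw() {
    #ifndef CPU_MULTI_COPY_ATOMIC
      // Not multi-copy atomic: a full fence is needed so that the stores we
      // are about to load from are globally visible (IRIW-style consistency).
      fence();
    #else
      // Multi-copy atomic: an ordinary load-load barrier suffices.
      loadload();
    #endif
    }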
21-09-2020

Another option (that I also described in the review) is to actually use seq_cst loads. That would result in one unnecessary fence on PPC, but on the other hand, this is the more sensible programming model/style that most other projects would use. I'm not convinced the optimization opportunity of eliding one fence is worth digging up the street for, and having explicit IRIW fences that nobody understands WTH they do. I always thought we should have seq_cst available in Atomic. That is why we have it in the Access API. But it belongs in Atomic really. I would also be okay with an iriw_fence or something like that, though. I think we all agree that explicit selection of fence vs loadload depending on the nMCA property does not belong in shared code. The shared code should describe the semantics of the fencing, not dig into its implementation domain.
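
In C++11 terms, the seq_cst alternative described here would look like this (a sketch with std::atomic, since HotSpot's Atomic did not expose a seq_cst load at the time; g_state is a made-up stand-in for the shared field):

    #include <atomic>

    std::atomic<int> g_state{0};  // stand-in for the shared field

    int read_state() {
      // A seq_cst load lets each platform pick its own convention (leading
      // sync on PPC64, "ldr; dmb sy" on ARMv7) instead of shared code
      // choosing between fence() and loadload() explicitly.
      return g_state.load(std::memory_order_seq_cst);
    }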
18-09-2020

@Erik: First things first :) I fully agree we have to err on the side of caution with IRIW -- that was always my position in these debates. But that is by and large irrelevant for implementations: as you say, the spec says it should be supported; IRIW should be supported, full stop.

This is how I understand the problem. I think "support_IRIW_for_not_multiple_copy_atomic_cpu" is actually a barrier style selection flag. When you implement the conservative implementation of a JMM-like memory model, you put something like "volatile load [LoadLoad|LoadStore]" + "[StoreStore] volatile store", and *then* you need to decide where to put "[StoreLoad]". To maintain SC, it has to be at least between volatile stores and volatile loads. So you either do StoreLoad after the volatile store, or before the volatile load. In most practical implementations, loads outnumber the stores, so it makes sense to attach it to the store. *Unless* you run into the nMCA thingie, and it turns out you need the barrier before volatile loads, which _just so happens_ to be the same thing you'd emit for a full fence or StoreLoad! In that case, you can swap the place where you put StoreLoad, and leave the volatile store without it. IIRC, that is what the current compiler (at least C2) code does with support_IRIW_for_not_multiple_copy_atomic_cpu. This way is still bad, because loads still outnumber the stores, but at least stores are not penalized unnecessarily. Of course, you can just leave the StoreLoad after the volatile store as well, and that would be even more fool-proof.

It irritates me that the flag even mentions IRIW. It feels like a much more general property: which global barrier style you are doing for the platform. We can disregard that and have an explicit "Atomic::iriw_fence" *and* other fences doing the thing they do now, but that AFAICS would mean we do excessive barriers near the stores. And it would be tempting for someone to go in and hack that around with a flag... again.

This is less of a problem for declarative-style loads and stores. "All you need to do" (c) is make sure there are SC atomic loads and stores that emit the barriers in either of the styles. It is harder for explicit fences, because their placement inevitably depends on the choice of the barrier style, at every use-site. Maybe just renaming the flag to "seqcst_loads_need_fence" captures the intent better, and gives enough of a clue to users of explicit fences. It has the added benefit of commenting the code that emits stores about why it does things differently on "seqcst_loads_need_fence" (= "optimization, we have the fence near the load, don't need it here").
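
The two placement styles described above, side by side (a sketch in the same bracketed barrier notation, written as comments):

    // Conservative volatile implementation; the only real choice is where
    // the StoreLoad between a volatile store and a later volatile load goes.
    //
    // Trailing style (attach StoreLoad to the store):
    //   volatile load:   load; [LoadLoad|LoadStore]
    //   volatile store:  [StoreStore] store; [StoreLoad]
    //
    // Leading style (what support_IRIW_... selects, e.g. on PPC64):
    //   volatile load:   [StoreLoad] load; [LoadLoad|LoadStore]
    //   volatile store:  [StoreStore] store;
    //
    // Either way there is a StoreLoad between every volatile store and a
    // subsequent volatile load, which is what SC requires.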
18-09-2020

Right. So we agree that there is no real bug now, and it's more a matter of interpretation what "support_IRIW_for_not_multiple_copy_atomic_cpu" really means. [~mdoerr] you were arguing that its purpose is to opt out of IRIW for Java volatile, as you thought that is optional. Now it sounds like we are all on the same page that this can't possibly be what this boolean means. Not only is the name completely misleading if that were the case (no mention of Java volatile), but it really is not optional at all, whether you support IRIW or not.

So first of all, I think having a better description of what this boolean means is in order (as we now only agree about what it does not mean). The way I think of it is "loadload_insufficient_for_IRIW", or something like that. If you think of it that way, it becomes clear why it is used the way it is for Java volatiles. The load acquire the shared code generates when it is false implies loadload, which implies it is enough for IRIW. That also explains why it is true for PPC, and also makes it generally useable in shared hotspot code.

So now you are saying that some people might want to put on dirty patches that relax the IRIW requirements on Java volatiles, violating the specification by doing so, and that if you did that, you would get a bug. To that I can only reply "don't do that". Breaking the spec is a bug on its own, and should simply not be allowed; this is not a point that is in any way optional. IRIW must be supported for volatiles, full stop. The VM code can't cater for and have hooks for people that want to violate the spec, and then relax whatever other bugs they might discover. That really can't be what this switch is for.

You are also saying that if somebody wanted to relax the implementation of loadload() to perform e.g. dmb ld on ARMv7, instead of dmb sy (or dmb ish), then this would also break. This is indeed true. I think such a change would have to go through a lot of scrutiny as well though, as the code has already been marked in a way that makes it obvious that this simply isn't a valid optimization right now. By that I mean that loadload() calls dmb_ld(), which calls dmb_sy(), as truly using a "dmb ld" instruction has obviously been considered not safe, probably for similar reasons. So it would likely already be an invalid change that would break. I don't think anyone can go in and just change it without fully understanding the ramifications of doing so, and why it has not been valid to do so already.

Having said that, I do agree that it would be useful to have a user-facing IRIW fence function for shared code, so you don't have to go digging up the street and have these kinds of lengthy discussions every time you need some kind of SC behaviour. Something like Atomic::iriw_fence() might be nice to have as the user-facing API. It is really quite nasty to have these platform switches in shared code, hooking up to either a fence or a loadload conditionally, at all. Perhaps if we did that, everyone would be satisfied?

Oh and [~shade], the monitor usage for the IRIW fence is indeed one reason for the philosophical question to swing in the direction of "of course we have to support IRIW". Many people argue that IRIW is never a problem in practice for real algorithms. But that mindset of "this probably works unless proven otherwise" is really bad. You then stop looking for the problems. The way you should think about it is "there is no way this works, unless proven otherwise", and actually put the IRIW glasses on and analyze the code deeply to see if it *really* works, even though the odds are in your favour that you most likely do not have an IRIW issue, probability-wise. In this project, the odds were not in our favour, and we caught not 1 but 2 IRIW conditions. Because we looked for them. If you don't look, you won't ever find any IRIW problems. ;-)
18-09-2020

This bug started off discussing changes made by JDK-8153224, which affects Java object monitors (the ObjectMonitor class). Generally, when we talk about "VM locking" we are discussing the Mutex and Monitor classes, i.e. internal VM locks.
18-09-2020

JMM insufficiency for IRIW was explained here (search for IRIW):
http://www.andreas-lochbihler.de/pub/lochbihler14toplas.pdf

Also see here: all cases were made acceptable (not in the test code, but some JVM implementation accepted it):
https://openjdk-aarch32.osci.io/results/2018_10_31/jdk_jdk_c401c536cea1/jcstress/results/org.openjdk.jcstress.tests.volatiles.VolatileIRIWTest.html
18-09-2020

> But in this VM locking code, support_IRIW_for_not_multiple_copy_atomic_cpu only makes barriers *stronger*, which cannot violate correctness.

The problem I see here is mainly for arm32: support_IRIW_for_not_multiple_copy_atomic_cpu = false => we use OrderAccess::loadload() => we rely on loadload being implemented as a "Full system DMB operation". I'm really not happy about that!

For PPC64: Seems like some people want to optimize for their workload, use support_IRIW_for_not_multiple_copy_atomic_cpu = false and ignore that the JMM is not fully supported. Of course, I don't want to support this officially! But I've already seen such requests on the ML. If they do so, Erik's barrier will be too weak.

> It would be, IMO, a massive problem if VM locking code starts _relaxing_ the barriers based on support_IRIW_for_not_multiple_copy_atomic_cpu.

That's one of the reasons I don't like this constant to be used for anything other than Java volatiles.
18-09-2020

@Martin. Java volatiles have been sequentially consistent since at least Java 5, so IRIW cannot break :) What you mention is a different thing: reliably observing the volatile store from the constructor when the host object is published via a race (like you would expect for finals). That behavior is -- "SURPRISE!" -- not mandated by the JLS; on most hardware it just appeared to behave as expected. And that is very surprising, e.g. initializing AtomicInteger(42), passing it via a race, and then discovering the "atomic" has 0 as its value. IIRC, the only time we have seen it happening was on PPC, so jcstress relaxed this requirement to match the actual spec, and the PPC port did some strengthening to never show these surprises to users. I think this is how we got the blurb in Parse::do_exits().

PPC (with support_IRIW_for_not_multiple_copy_atomic_cpu = true) chooses to emit the barrier before the volatile load, not after the store. This is a very important bit: support_IRIW_for_not_multiple_copy_atomic_cpu *relaxes* the barriers after volatile Java stores. For constructors, that means there is no barrier before the racy publication, and everything goes downhill from there. So special handling is needed in Java object constructor code: emit another release barrier.

But in this VM locking code, support_IRIW_for_not_multiple_copy_atomic_cpu only makes barriers *stronger*, which cannot violate correctness. It would be, IMO, a massive problem if VM locking code starts *relaxing* the barriers based on support_IRIW_for_not_multiple_copy_atomic_cpu. At that point, we would need to verify that every SC-read gets the heavyweight barrier. From the looks of it, if PPC64 decides that it is finally MCA and flips support_IRIW_for_not_multiple_copy_atomic_cpu = false, then the barriers around stores would be intact, and the -- now unnecessary -- stronger barriers around the loads would relax to the defaults.
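
The constructor surprise described above, sketched as pseudocode in comments (this is only an illustration of the scenario, not test code):

    // Thread 1                          // Thread 2
    //   tmp = new AtomicInteger(42);    //   a = shared;     (racy read)
    //   shared = tmp;  (racy publish)   //   if (a != null)
    //                                   //     r = a.get();  // may see 0!
    //
    // With the leading-fence convention there is no barrier between the
    // constructor's volatile store of 42 and the racy store to 'shared',
    // so r == 0 was observable unless the port adds an extra release
    // barrier at the end of the constructor.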
18-09-2020

@Aleksey: Thanks a lot for your help. I'm glad that the current JLS enforces that. Was that different in the past? I remember very lengthy discussions. And also that some JVM didn't pass the VolatileIRIWTest and it was not treated as a bug. (It was a different, but a bit related, test which was not required by the JMM: http://hg.openjdk.java.net/code-tools/jcstress/rev/e29395e8f369) Yeah, support_IRIW_for_not_multiple_copy_atomic_cpu was introduced to support PPC64's needs wrt. Java volatiles in shared code. David has explained the meaning pretty well above.

@Erik: On arm32, there are several implementations of loadload. Maybe all of them are sufficient to prevent IRIW reorderings. So you may be right with the statement that your new arm 32 barriers are sufficient to prevent errors. I wonder why that was never optimized. I have set the priority to P4, but I still think it should get fixed because this implementation is very fragile. If someone optimizes the arm 32 barriers, your usage will break. And if people play with support_IRIW_for_not_multiple_copy_atomic_cpu on PPC64, it will also break. (There were already people playing with it for the reasons Aleksey has mentioned!) It'd be great if we could have an implementation like the one in "GenericTaskQueue<E, F, N>::pop_global", with a description of the memory access pattern we need to fix.
18-09-2020

VolatileIRIWTest still forbids (1, 0, 0, 1): https://hg.openjdk.java.net/code-tools/jcstress/file/ad66703e2ed0/tests-custom/src/main/java/org/openjdk/jcstress/tests/volatiles/VolatileIRIWTest.java#l39

The recurring discussion about IRIW is usually framed as: "Do we really need some results of IRIW forbidden, for any practical reason?". That is mostly a philosophical question about memory model theory when MM improvements are discussed. But it is _not_ something the implementations should have an opinion about. The current JLS is specified in such a way that volatiles are sequentially consistent, and so (1, 0, 0, 1) is pretty much forbidden by spec.

I thought that the support_IRIW_for_not_multiple_copy_atomic_cpu flag was basically PPC's way to opt in to heavy-weight barriers on "volatile" loads, where no other platform requires them. Or, more specifically, to mark all these weird places in the code where nMCA platforms might need another barrier. On these grounds, I think the current code is correct. I think Martin's argument boils down to the expectation that support_IRIW_for_not_multiple_copy_atomic_cpu is _only_ about enforcing IRIW-like SC for Java volatiles. That argument has merit, but I also see it applies to generic synchronization code in VM locking. (It would be fun to discover *that* is the answer to the "practical reason" in the philosophical question above.)
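
For reference, the IRIW shape the test checks (sketched here with C++ std::atomic as a stand-in for Java volatiles; the linked jcstress test is the authoritative version):

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2, r3, r4;

    void writer1() { x.store(1, std::memory_order_seq_cst); }
    void writer2() { y.store(1, std::memory_order_seq_cst); }

    // Both readers read x first, then y.
    void reader1() { r1 = x.load(std::memory_order_seq_cst);
                     r2 = y.load(std::memory_order_seq_cst); }
    void reader2() { r3 = x.load(std::memory_order_seq_cst);
                     r4 = y.load(std::memory_order_seq_cst); }

    // Forbidden under sequential consistency: (r1, r2, r3, r4) == (1, 0, 0, 1),
    // i.e. reader1 saw the write to x first while reader2 saw the write to y
    // first -- the two readers disagree on the order of independent writes.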
18-09-2020

I guess the reason we are discussing the JMM semantics of volatile is that we are discussing what the "support_IRIW_for_not_multiple_copy_atomic_cpu" flag really means. The observation is that the current code is correct and will ensure the appropriate fencing is performed on all supported nMCA platforms. On PPC we will call fence(), resulting in a sync instruction, which is fine. On ARMv7 we will call loadload(), which will produce a dmb sy, which is also fine. So the fencing is already correct.

The question then is whether there is still a semantic bug here, because we should have encoded the dmb sy on ARMv7 through a fence() instead. That depends on why it is set to false on ARMv7. You are arguing it is false because supporting IRIW is optional, and I'm saying that can't be right. I'm also saying that ARMv7 does support IRIW, but with a trailing sync convention instead. A volatile load is ldr; dmb sy; so pairs of volatile loads are interleaved with dmb sy, which acts as a full fence and acquire at the same time. And that is why it is false - because its acquire (and loadload) is way stronger and also ensures IRIW is handled right. And that is also precisely the reason why, in our case here, loadload() is sufficient.
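
The two conventions at the instruction level, as described in this comment (a sketch, not generated code):

    // ARMv7 "trailing sync" volatile load (acquire and IRIW fence in one):
    //   ldr   r0, [field]
    //   dmb   sy             // full barrier after the load
    //
    // PPC64 "leading sync" volatile load:
    //   sync                 // heavyweight fence before the load
    //   ld    r3, field
    //   lwsync               // (or branch + isync) cheaper trailing acquire
    //
    // On ARMv7 any pair of volatile loads is separated by a dmb sy, so
    // loadload() there is already strong enough for IRIW.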
17-09-2020

The referenced test seemingly does not allow (1, 0, 0, 1) in the table, which, if I read it right, would be the case possible without the fence on PPC. So I guess I don't see that the test is allowing IRIW. What am I missing?
17-09-2020

I don't remember the details of the discussion which led to allowing IRIW reorderings in VolatileIRIWTest. Probably Aleksey does. Nevertheless, this is out of scope for this issue. What does JDK-8153224 have to do with the JMM?
17-09-2020

The linked paper has the IRIW case in Figure 4. It states "However, no SC execution can produce such a result. By the DRF guarantee, the JMM must not allow this result either". My interpretation of this is that IRIW must be respected, according to the paper. And regardless of that, JSR 133 is pretty clear about volatile loads being synchronization actions, and that all synchronization actions have a total order. There is something I am not following here...
17-09-2020

[~mdoerr] Do you have a reference supporting that full fence is not needed between volatile loads (on PPC)? Last time I read JSR 133, it defines volatile reads and writes as "synchronization actions", and further on defines that all synchronization actions have a total order. In other words, volatile loads have a total order. To me, that implies that not supporting IRIW is a direct violation of these constraints. Is there any other specification that states otherwise, that I have missed?
17-09-2020

If true "support_IRIW_for_not_multiple_copy_atomic_cpu" means the platform is nMCA _and_ it is _choosing_ to implement the IRIW semantics for volatile accesses. There was a lot of debate around IRIW and whether it was actually necessary to support it and the answer (at the time at least) was no for some nMCA platforms - hence the variable. IIUC the recent use of that variable was more a check for is-nMCA?
17-09-2020

Ah, yes, this seems wrong. I suppose I was confused by the name "support_IRIW_for_not_multiple_copy_atomic_cpu", assuming it means that this platform is nMCA and needs to explicitly support IRIW. That is what the name implies. Our situation is precisely that - we need this fence on nMCA platforms that need to explicitly support IRIW. However, it seems that the way this variable is used (if we include ARMv7) is to denote whether the volatile bindings (specifically) of nMCA machines use a leading fence on loads instead of a trailing fence on stores. Perhaps the name should somehow hint that this is a property of volatile bindings only, and not anything else. But maybe that's a task for another day...
17-09-2020

"support_IRIW_for_not_multiple_copy_atomic_cpu" was introduced because the JMM doesn't strictly require full fence between volatile loads (yet), but PPC64 implementors have chosen to do so (unlike arm 32 implementors). So this constant can only be used for Java volatiles. The macro "CPU_MULTI_COPY_ATOMIC" is defined on platforms which specify this property. It can be used to select full fence operations between non-Java variable loads.
17-09-2020

So I guess if we think of the flag as a "do I need a full fence between volatile loads to support IRIW" flag, then it is also semantically sound in our use case, in some strange way. Because we don't need it to be true, for the very same reason the atomic bindings do not need it to be true. But it's really weird in a way.
17-09-2020

I don't understand how it could be possible that ARMv7 does not need to support IRIW for volatile accesses, if that is what we are saying here. It would seem that not supporting IRIW in volatile implies not supporting what is required by the JMM. Certainly, it must be that IRIW support was implemented in ARMv7, but with "trailing sync" semantics, instead of "leading sync". That allows a load acquire and load volatile to both be ldr; dmb; as an optimization of the otherwise needed dmb; ldr; dmb; for a "leading sync" load volatile (the first dmb being the fence, the second dmb being the acquire). This specific quirk does not hold on PPC, where there is a large difference between the leading fence (sync) and the trailing acquire (lwsync or branch + isync) of a load volatile.

However, the name of the variable does not depict this very well; ARMv7 still (presumably) has to support IRIW for volatile accesses (as not doing that would be a violation of the JMM), and is indeed nMCA (at least in the spec, but perhaps not so much in practice). Yet it sets it to false - that is very confusing to me. At least judging by the name. But maybe it's just me finding that misleading.

The slightly hilarious thing, though, is that for the same reason this just happens to slip in right for the volatile bindings, it actually happens to be right in our use case too (at least implementation-wise, but the semantics are off). Since it reports false, we go through the else branch where we do loadload() instead. The loadload() is bound to a "dmb sy" in the end, and that "dmb sy" is also exactly what fence() produces, and does solve the IRIW problem. So for ARMv7, it does not matter if we report true or false in this flag, as both loadload() and fence() are equivalent in our atomic bindings.

It looks like in orderAccess_linux_arm.hpp there was an attempt to use "dmb ld" instead, but the dmb_ld() function actually calls dmb_sy() instead. Possibly because the trick I just described about using trailing sync semantics does not work otherwise. Therefore, there is actually not a "real" bug, other than in terms of semantics, on ARMv7. Oh, the spaghetti goes deep in this one...
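
The chaining described here, sketched out (paraphrased from the description of orderAccess_linux_arm.hpp above, not verbatim source):

    // arm32 OrderAccess shape as described above (ARMv7 path, paraphrased).
    static inline void dmb_sy() {
      __asm__ __volatile__ ("dmb sy" : : : "memory");
    }
    static inline void dmb_ld() {
      dmb_sy();  // a real "dmb ld" was evidently considered unsafe here
    }

    inline void OrderAccess::loadload() { dmb_ld(); }  // ends up as dmb sy
    inline void OrderAccess::fence()    { dmb_sy(); }  // identical in effect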
17-09-2020

[~eosterlund] - I sent the following reply to Martin's original email flare:

-----Original Message-----
From: Daniel D. Daugherty <daniel.daugherty@oracle.com>
Sent: Tuesday, 15 September 2020 16:59
To: Doerr, Martin <martin.doerr@sap.com>; Carsten Varming <varming@gmail.com>; Erik Österlund <erik.osterlund@oracle.com>
Cc: Roman Kennke <rkennke@redhat.com>; hotspot-runtime-dev@openjdk.java.net
Subject: Re: RFR(L) 8153224 Monitor deflation prolong safepoints (CR14/v2.14/17-for-jdk15)

Hi Martin,

I believe that the support_IRIW_for_not_multiple_copy_atomic_cpu stuff came from Erik O. so I'm adding him to this email thread. Yes, please create an issue that describes the problem and we'll figure out who should take the issue...

Dan
15-09-2020