JDK-7143664 : Clean up OrderAccess implementations and usage
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 9
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2012-02-08
  • Updated: 2020-09-17
  • Resolved: 2015-03-04
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 9
9 b56Fixed
Related Reports
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Description
David Holmes writes:
Okay Dave Dice has managed to untangle all this for me. Thanks Dave!

Basic problem is that we have three separate, but related, API's defined in OrderAccess:

1. Membars

These are basic membar operations to provide storestore, storeload, loadstore and loadload.

2. acquire()/release()/fence()

These are "traditional" acquire/release semantics that provide the basic "roach-motel" property:

   // A
   acquire();
   // B
   release();
   // C

accesses in A and C can move into section B. But nothing in B can move into A or C.

Plus we have the heavyweight fence().

3. load_acquire/release_store/store_fence/release_store_fence

These are/were intended primarily to map to the ia64 instruction ld.acq and st.rel (load with acquire semantics, store with release semantics), plus additional fences as needed - and as explained on orderAccess.hpp.

The critical point here is that:

release_store(addr, v) IS NOT THE SAME AS release(); store(addr, v)

because the ia64 release semantics are stronger than general release() semantics in that the store (region C) can not be moved up ( to region B). So release_store(lock, 0) is exactly what we want when doing the IUnlock because it is equivalent to:

  storeStore|storeload; store(lock, 0);

with regards to the store to the lock. Implementing release_store as "storeStore|storeload; store" is valid but provides stronger semantics (the barriers are with respect to all subsequent stores) than the ia64 st.rel.

The problem is that "we" have confused the implementation by defining some of these methods in terms of the others, in an incorrect way eg:

storestore() != release()
release_store(addr, v) != release(); *addr=v;

So we need to fix this so as not to confuse others - particularly other porters.

It seems to me that having explained the above, the release() and acquire() functions are not useful as we will typically want load_acquire() and store_release() for Java volatile semantics; release_store for monitor unlock, and the CAS on monitor locking gives us a full fence().

Note: on TSO systems there is no runtime problem because these store barriers are no-ops anyway. On other platforms it is not an issue if a strong enough h/w barrier is used for the actual implementation eg: DMB on ARM, lwsync on PPC (see JSR133 Compiler Writers Cookbook). But it is important that the correct semantics are expressed in the code.

Comments
Further commentary from Erik: Reported bugs that I know of that are fixed with my changes: 1) https://bugs.openjdk.java.net/browse/JDK-806196 <- crash due to lack of compiler barriers, was only hotfixed on linux_x86, but can crash at any time on other platforms unless fixed 2) https://bugs.openjdk.java.net/browse/JDK-7143664 <- original bug report I hijacked, originally only considering the awkwardness of release() store() != release_store() Actual bugs fixed: A. Memory model has been fixed to make more sense. It was arguably wrong before but now it should be fine. Big thanks to David for going through this one with me! B. Compiler barriers were (sometimes) omitted for acquire/release, leading to crashes due to not following the described memory model. This was fixed for all platforms that previously did not consider this and relied on volatile semantics. This includes all TSO platforms except linux_x86 that already fixed it as a hotfix by Mikael Gerdin. The exception is windows where volatile semantics has acquire/release semantics for compiler reorderings. C. release_store_fence for jfloat even on linux_x86 was overlooked, probably due to code replication: it is now fixed too. D. fence() did not use compiler barriers for uniprocessor machines, only when actual fence instruction was required. Now it does even for uniprocessor systems, to avoid single processor concurrency problems due to context switches and lack of compiler barriers. E. Introduced inline asm for solaris. Previously it used .il-files because solaris studio did not support inline asm. And I don't know if it gives any compiler reordering guarantees at all. Therefore I now use inline asm instead so that I can explicitly enforce compiler reordering guarantees so that it won't cause trouble for us. F. Removed dummy stores/loads for acquire/release on TSO. This code seems (to me) to have been not right all along: even when trying to exploit volatile semantics that in certain context could have made sense, their inline assembly would often (e.g. linux x86) come with a compiler barrier, removing the whole point of using volatile instead of compiler barrier in the first place. And for hardware reorderings it did nothing at all (as the store and load planes move as freely with as without the dummy load/store due to store buffers). G. Inconsistencies where PPC on zero flavors use isync and sync for load(load|store)/acquire which seems like a massive over approximation compared to linux/aix that uses lwsync instead for these same scenarios (which seems more reasonable). Made this consistent by using lwsync like aix/linux on zero builds too. Then of course there were a bunch of changes I needed to make to be able to fix this, such as refactoring into a generalization/specialization architecture. And some things I couldn't help but change such as outdated documentation talking about platforms no longer supported instead of platforms actually supported, inconsistencies such as __asm__ __volatile__ vs __asm__ volatile in the same files, awkwardness such as loadstore/loadload being implemented in terms of acquire, removal of store_fence that per long discussion above seems unnecessary and almost misleading, etc.
19-02-2015

Erik Osterlund has also been looking at this in detail and points out a number of issues with the current model and implementation: 1. OrderAccess for TSO lacks compiler barriers where C++ volatile semantics as described in section 1.9 of the C++ standard defined was relied upon instead. Unfortunately it does not work (e.g. as explicitly stated by GCC) on any known platformsxcept windows, leading to reported bugs like https://bugs.openjdk.java.net/browse/JDK-8061964 still being unresolved for all TSO platforms except a few (x86 linux/windows). This goes both for stand-alone acquire()/release() using volatile dummy loads/stores and joined variants using purely volatile loads/stores. 2. It describes a memory model that has issues as stated, leading to unintuitive problems like release_store() != release() store(), and other strange behavior, that could be solved by using a new model instead. 3. Moreover, most of the OrderAccess implementation is shared as the semantics are shared. Yet all the boilerplate code that looks the same for every platform except a few exceptions to the rule (for optimization purposes), is reimplemented for every platform in os_cpu. A generalized variant helps making the code more easily maintainable (by having less platform specific duplicated code, almost exclusively for optimizations), more reliable (by having reliable shared code that can always be used for the right semantics), more portable (as a new platform can be safely defined with only ~7 LoC). 4. store_fence() family of operations should be removed because they a) don't fit either acquire/release or Java volatile semantics, b) become awkward exceptions by not being volatile for some reason, c) are note being used anywhere in hotspot at all, d) lead to a lot of duplicated inline asm nobody uses --- A webrev of the proposed change is here: http://cr.openjdk.java.net/~dholmes/7143664/webrev/ Here is the proposed modified model: // Memory Access Ordering Model // // This interface is based on the JSR-133 Cookbook for Compiler Writers. // // In the following, the terms 'previous', 'subsequent', 'before', // 'after', 'preceding' and 'succeeding' refer to program order. The // terms 'down' and 'below' refer to forward load or store motion // relative to program order, while 'up' and 'above' refer to backward // motion. // // // We define four primitive memory barrier operations. // // LoadLoad: Load1(s); LoadLoad; Load2 // // Ensures that Load1 completes (obtains the value it loads from memory) // before Load2 and any subsequent load operations. Loads before Load1 // may *not* float below Load2 and any subsequent load operations. // // StoreStore: Store1(s); StoreStore; Store2 // // Ensures that Store1 completes (the effect on memory of Store1 is made // visible to other processors) before Store2 and any subsequent store // operations. Stores before Store1 may *not* float below Store2 and any // subsequent store operations. // // LoadStore: Load1(s); LoadStore; Store2 // // Ensures that Load1 completes before Store2 and any subsequent store // operations. Loads before Load1 may *not* float below Store2 and any // subsequent store operations. // // StoreLoad: Store1(s); StoreLoad; Load2 // // Ensures that Store1 completes before Load2 and any subsequent load // operations. Stores before Store1 may *not* float below Load2 and any // subsequent load operations. // // We define two further barriers: acquire and release. // // Conceptually, acquire/release semantics form unidirectional and // asynchronous barriers w.r.t. a synchronizing load(X) and store(X) pair. // They should always be used in pairs to publish (release store) and // access (load acquire) some implicitly understood shared data between // threads in a relatively cheap fashion not requiring storeload. If not // used in such a pair, it is adviced to use a membar instead: // acquire/release only make sense as pairs. // // T1: access_shared_data // T1: ]release // T1: (...) // T1: store(X) // // T2: load(X) // T2: (...) // T2: acquire[ // T2: access_shared_data // // It is guaranteed that if T2: load(X) synchronizes with (observes the // value written by) T1: store(X), then the memory accesses before the T1: // ]release happen before the memory accesses after the T2: acquire[. // // Total Store Order (TSO) machines can be seen as machines issuing a // release store for each store and a load acquire for each load. Therefore // there is an inherent resemblence between TSO and acquire/release // semantics. TSO can be seen as an abstract machine where loads are // executed immediately when encountered (hence loadload reordering not // happening) but enqueues stores in a FIFO queue // for asynchronous serialization (neither storestore or loadstore // reordering happening). The only reordering happening is storeload due to // the queue asynchronously serializing stores (yet in order). // // Acquire/release semantics essentially exploits this asynchronicity: when // the load(X) acquire[ observes the store of ]release store(X), the // accesses before the release must have happened before the accesses after // acquire. // // The API offers both stand-alone acquire() and release() as well as joined // load_acquire() and release_store(). It is guaranteed that these are // semantically equivalent w.r.t. the defined model. However, since // stand-alone acquire()/release() does not know which previous // load/subsequent store is considered the synchronizing load/store, they // may be more conservative in implementations. We advice using the joined // variants whenever possible. // // Finally, we define a "fence" operation, as a bidirectional barrier. // It guarantees that any memory access preceding the fence is not // reordered w.r.t. any memory accesses subsequent to the fence in program // order. This may be used to prevent sequences of loads from floating up // above sequences of stores. // // The following table shows the implementations on some architectures: // // Constraint x86 sparc ppc // --------------------------------------------------------------------------- // fence LoadStore | lock membar #StoreLoad sync // StoreStore | addl 0,(sp) // LoadLoad | // StoreLoad // // release LoadStore | lwsync // StoreStore // // acquire LoadLoad | lwsync // LoadStore // // release_store <store> <store> lwsync // <store> // // release_store_fence xchg <store> lwsync // membar #StoreLoad <store> // sync // // // load_acquire <load> <load> <load> // lwsync // // Ordering a load relative to preceding stores requires a fence, // which implies a membar #StoreLoad between the store and load under // sparc-TSO. A fence is required by x86. On x86, we use explicitly // locked add. // // 4. store, load <= is constrained by => store, fence, load // // Use store, fence to make sure all stores done in an 'interesting' // region are made visible prior to both subsequent loads and stores. // // Conventional usage is to issue a load_acquire for ordered loads. Use // release_store for ordered stores when you care only that prior stores // are visible before the release_store, but don't care exactly when the // store associated with the release_store becomes visible. Use // release_store_fence to update values like the thread state, where we // don't want the current thread to continue until all our prior memory // accesses (including the new thread state) are visible to other threads. // This is equivalent to the volatile semantics of the Java Memory Model.
22-01-2015

PUBLIC COMMENTS Upon further inspection my comment regarding Java volatiles need to be amended. The Java Memory Model defines the necessary memory barriers between different types of access (ie volatile/non-volatile loads and stores). However hotspot doesn't (can't) examine the instruction pairings so it has to issue the heaviest barriers needed to account for whatever accesses may precede or follow the volatile access. Consequently a volatile write ends up as: release() store fence() in some of the runtime code. However I think the release() in the above is incorrect as per previous discussion - the store can float above it. The C1 code issues: membar_release() store membar() The details of these operations are platform specific but the necessary barrier equivalences are: membar_acquire := LoadLoad + LoadStore membar_release := LoadStore + StoreStore membar := StoreLoad There are also sun.misc.Unsafe operations for volatile accesses that can then be intrinsified by C1 and C2. The runtime operations use the following OrderAccess::methods: GET_FIELD_VOLATILE OrderAccess::loadAcquire() PUT_FIELD_VOLATILE OrderAccess::release_store_fence() GET_OOP_FIELD_VOLATILE load; OrderAccess::acquire() Unsafe_SetObjectVolatile OrderAccess::release(); store; OrderAccess::fence() [Edited to remove incorrect comment regarding use of release_store_fence()]
10-02-2012

EVALUATION See description.
08-02-2012