Bug ID: JDK-8187809 UseMembar should be set true and deprecate the flag

Type: Enhancement
Component: hotspot
Sub-Component: runtime
Affected Version: 10

Priority: P3
Status: Resolved
Resolution: Fixed

Submitted: 2017-09-22
Updated: 2018-11-06
Resolved: 2017-11-10

JDK 10
10 b33 Fixed

Using mprotect for IPIs has several problems (admittedly it's a cool hack):
- It's not guaranteed to work on future hardware/OSes (we could start using membarrier() with MEMBARRIER_CMD_SHARED_EXPEDITED to be future-proof)
- It doesn't work on arm/arm64 (again, MEMBARRIER_CMD_SHARED_EXPEDITED would solve this)
- Event-based tracing must read thread states often (which causes performance issues)
- The complexity in the code is costly
- The thread serialization is unstable on certain workloads/platforms/OSes (last noticed this week on win x86)
- Fences are becoming cheaper
- Thread-local handshakes are expected to increase how often thread state is read
- Scalability
- JNI performance - false sharing

Note that some applications can show a performance regression.
This is especially true for applications with few threads that make many short native calls.

Were any before/after performance numbers collected for this change on various hardware platforms, and can they be shared here?
27-06-2018
In the totalMemory() microbench there are several transitions, java->native->fence->vm->fence->native->fence->java: 3 fences when using +UseMembar (on java->native we skip the fence because false positives are fine). Since nothing in the benchmark reads the thread state, it never calls the 'serialization' and never causes any IPIs, so you are comparing the worst case for fences against the best case (no synchronization needed) for asymmetric synchronization. Given all the arguments against using mprotect, I do not consider this microbench a blocker for turning on UseMembar, deprecating the flag, and in the long run removing the asymmetric synchronization.
27-09-2017
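For readers unfamiliar with the Dekker-style handshake behind these fences, here is a minimal sketch in Java. It is not HotSpot's code: the class, field, and method names are hypothetical, and VarHandle.fullFence() merely stands in for the fence a thread executes on a state transition with +UseMembar. Each side publishes its own flag, fences, and then reads the other side's flag, so at least one of the two is expected to observe the other's write.

```java
import java.lang.invoke.VarHandle;

public class DekkerFenceSketch {
    // Hypothetical stand-ins for a thread's state and the VM's safepoint request.
    static int threadState = 0;        // 0 = in native, 1 = back in Java/VM
    static int safepointRequested = 0; // set by the VM side

    // Mutator side: publish the new state, fence, then poll for a pending request.
    static boolean transitionAndPoll() {
        threadState = 1;
        VarHandle.fullFence(); // the per-transition cost discussed above
        return safepointRequested != 0;
    }

    // VM side: publish the request, fence, then read the thread's state.
    static boolean requestAndObserve() {
        safepointRequested = 1;
        VarHandle.fullFence();
        return threadState == 1;
    }
}
```

The asymmetric (mprotect-based) variant removes the fence from the mutator side and makes the VM side force serialization of the other threads instead, which is why a benchmark that never reads thread state only ever sees its cheap path.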
Oracle blogs recently went through a somewhat tortuous conversion from Apache Roller to something new. The conversion was billed as automatic and transparent, but many of my blog entries are broken and have dead links. I'll try to fix that. Much of the relevant text is also in the patent: https://www.google.com/patents/US7644409. The membar elision tricks reflect a point in time where fences -- and atomics -- were exceptionally expensive. The good news is that the general trend on x86 is toward faster fencing, implemented via deeper speculation on the other side of the fence. (Biased locking is a similar response: https://www.google.com/patents/US7814488). Above and beyond the correctness issues for systems with weaker memory models, I'd say the real reason to phase out the mprotect()-based approach is scalability. Specifically, the TLB shootdowns underlying mprotect() simply don't scale well on large systems. I think it's also excessively hopeful to think that the RCU helper sys_membarrier() flavors will be of any help in that regard. The IPIs required aren't fundamentally different from those required via mprotect(). Having said that, I'm keeping my eye on the performance of the new 'expedited' flavors, where some of the proposed optimizations echo ideas from the patents. But the existing variants are far worse than mprotect(). Somewhat perversely, modern SPARC CPUs can perform synchronous shootdown without the need for full IPIs, although the pipelines of all victim CPUs still need to be serialized and the shooter has to wait for acknowledgement. It's like an IPI without quite as much interrupt overhead, but it still impacts the execution of other CPUs and impairs scalability, although not quite as badly as a classic shootdown. While interesting, this doesn't change the basic economic model.
Sadly, I could also envision a mode of operation where the existing mprotect()-based approach was used on small systems (CPUS < N) and classic fences were used on bigger systems (or for applications with lots of threads on big systems). That just increases the complexity, however, and makes performance even less predictable.
26-09-2017
You can find some good background information on this topic in Dave Dice's blogs:
JNI performance - false sharing on the "-UseMembar" serialization page (Dave Dice)
https://blogs.oracle.com/dave/jni-performance-false-sharing-on-the-usemembar-serialization-page
QPI Quiescence (Dave Dice)
https://blogs.oracle.com/dave/qpi-quiescence
They both reference "Asymmetric Dekker Synchronization" by Dave Dice, Hui Huang and Mingyao Yang, which unfortunately isn't available any more at the original location but can be retrieved thanks to the web archive at:
http://web.archive.org/web/20080220051535/http://blogs.sun.com/dave/resource/Asymmetric-Dekker-Synchronization.txt
26-09-2017
Fences are not cheap on contemporary multi-socket systems. E.g. calling Runtime.totalMemory() in a loop is almost twice as fast with -XX:-UseMembar on a 40-core Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (2-socket).
26-09-2017
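The benchmark source was not posted with this comment. A loop of the kind described might look like the sketch below; the class name, method name, and iteration counts are assumptions, and a harness such as JMH would give more trustworthy numbers than a hand-rolled timer.

```java
public class TotalMemoryLoop {
    // Calls Runtime.totalMemory() `iterations` times and returns the elapsed nanoseconds.
    static long timeCalls(int iterations) {
        Runtime rt = Runtime.getRuntime();
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += rt.totalMemory(); // native method in HotSpot, so each call makes thread-state transitions
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.println(sink); // keep the result live so the loop is not optimized away
        return elapsed;
    }

    public static void main(String[] args) {
        timeCalls(1_000_000); // warm-up (iteration counts are arbitrary)
        int n = 10_000_000;
        long ns = timeCalls(n);
        System.out.printf("%.1f ns/call%n", (double) ns / n);
    }
}
```

Running it once with -XX:+UseMembar and once with -XX:-UseMembar, on a JDK build that still accepts the flag, would compare the two modes for this particular call pattern.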

CSR :	JDK-8187812 - UseMembar should be set true and deprecate the flag
Relates :	JDK-8152292 - Consider using proper OS APIs for os::serialize_thread_states
Relates :	JDK-8143878 - Memory serialization page can become a bottleneck
Relates :	JDK-8213436 - Obsolete UseMembar