JDK-8061964 : Insufficient compiler barriers for GCC in OrderAccess functions
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 8u40,9
  • Priority: P2
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2014-10-23
  • Updated: 2015-06-03
  • Resolved: 2014-11-06
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 9
9 b42Fixed
Related Reports
Relates :  
Relates :  
Relates :  
Relates :  
Description
In JDK-8058209 I discovered that two stores separated by a call to OrderAccess::storestore were in fact incorrectly reordered by GCC, exposing a race condition and causing crashes in the G1 collector.

The suggested fix is to add a proper "compiler barrier" to the acquire() and release() functions in orderAccess_linux_x86.hpp to ensure that no compiler reordering occurs.

Something along the lines of:
static inline void compiler_barrier() {
__asm__ volatile ("" : : : "memory");
}
Comments
noreg-hard justification: The bug concerns the exact ordering of machine instructions in the JVM, it is all but impossible to create a test for this fix.
11-11-2014

For future reference, the steps to verify the correctness of the particular G1 code where this was discovered are: Use objdump (or another disassembler) find the function called G1OffsetTableContigSpace::set_saved_mark, later renamed to record_top_and_timestamp. The C++ source for the function is inlined here for reference: void G1OffsetTableContigSpace::set_saved_mark() { G1CollectedHeap* g1h = G1CollectedHeap::heap(); unsigned curr_gc_time_stamp = g1h->get_gc_time_stamp(); if (_gc_time_stamp < curr_gc_time_stamp) { // The order of these is important, as another thread might be // about to start scanning this region. If it does so after // set_saved_mark and before _gc_time_stamp = ..., then the latter // will be false, and it will pick up top() as the high water mark // of region. If it does so after _gc_time_stamp = ..., then it // will pick up the right saved_mark_word() as the high water mark // of the region. Either way, the behaviour will be correct. ContiguousSpace::set_saved_mark(); // _saved_mark_word = _top OrderAccess::storestore(); _gc_time_stamp = curr_gc_time_stamp; // The following fence is to force a flush of the writes above, but // is strictly not needed because when an allocating worker thread // calls set_saved_mark() it does so under the ParGCRareEvent_lock; // when the lock is released, the write will be flushed. // OrderAccess::fence(); } } The memory accesses in the disassembly are annotated with "<-" below. Before JDK-6973570 in JDK 7 b108 the instruction sequence is correct, the store of this->_gc_time_stamp is performed after the store of this->_saved_mark_word mgerdin@mgerdin03:~$ objdump -d -C /net/jre.us.oracle.com/onestop/jdk/1.7.0/promoted/all//b108/binaries/linux-x64/jre/lib/amd64/server/libjvm.so | grep -A 30 -E 'G1OffsetTableContigSpace::set_saved_mark.*:' 00000000003d9f30 <G1OffsetTableContigSpace::set_saved_mark()>: 3d9f30: 55 push %rbp 3d9f31: 48 89 e5 mov %rsp,%rbp 3d9f34: 53 push %rbx 3d9f35: 48 89 fb mov %rdi,%rbx 3d9f38: 48 83 ec 08 sub $0x8,%rsp 3d9f3c: e8 df 13 fb ff callq 38b320 <G1CollectedHeap::heap()> 3d9f41: 8b 90 b0 01 00 00 mov 0x1b0(%rax),%edx <- g1->_gc_time_stamp 3d9f47: 8b 83 38 01 00 00 mov 0x138(%rbx),%eax <- this->_gc_time_stamp 3d9f4d: 39 d0 cmp %edx,%eax 3d9f4f: 73 1c jae 3d9f6d <G1OffsetTableContigSpace::set_saved_mark()+0x3d> 3d9f51: 48 8b 43 58 mov 0x58(%rbx),%rax <- this->_top 3d9f55: 48 89 43 18 mov %rax,0x18(%rbx) <- this->_saved_mark_word 3d9f59: 48 8b 05 40 f9 70 00 mov 0x70f940(%rip),%rax # ae98a0 <_DYNAMIC+0x12f8> 3d9f60: 48 c7 00 00 00 00 00 movq $0x0,(%rax) <- OrderAccess::storestore() 3d9f67: 89 93 38 01 00 00 mov %edx,0x138(%rbx) this->_gc_time_stamp 3d9f6d: 48 83 c4 08 add $0x8,%rsp 3d9f71: 5b pop %rbx 3d9f72: c9 leaveq 3d9f73: c3 retq 3d9f74: 66 66 66 2e 0f 1f 84 data32 data32 nopw %cs:0x0(%rax,%rax,1) 3d9f7b: 00 00 00 00 00 00000000003d9f80 <G1OffsetTableContigSpace::saved_mark_word() const>: (...) In b109 the store of this->_saved_mark_word has been moved to after the store of this->_gc_time_stamp, breaking the semantics of the storestore() operation: mgerdin@mgerdin03:~$ objdump -d -C /net/jre.us.oracle.com/onestop/jdk/1.7.0/promoted/all//b109/binaries/linux-x64/jre/lib/amd64/server/libjvm.so | grep -A 30 -E 'G1OffsetTableContigSpace::set_saved_mark.*:' 00000000003da040 <G1OffsetTableContigSpace::set_saved_mark()>: 3da040: 55 push %rbp 3da041: 48 89 e5 mov %rsp,%rbp 3da044: 53 push %rbx 3da045: 48 89 fb mov %rdi,%rbx 3da048: 48 83 ec 18 sub $0x18,%rsp 3da04c: e8 5f 1f fb ff callq 38bfb0 <G1CollectedHeap::heap()> 3da051: 8b 90 b0 01 00 00 mov 0x1b0(%rax),%edx <- g1->_gc_time_stamp 3da057: 8b 83 38 01 00 00 mov 0x138(%rbx),%eax <- this->_gc_time_stamp 3da05d: 39 d0 cmp %edx,%eax 3da05f: 73 15 jae 3da076 <G1OffsetTableContigSpace::set_saved_mark()+0x36> 3da061: 48 8b 43 58 mov 0x58(%rbx),%rax <- this->_top 3da065: c7 45 f4 00 00 00 00 movl $0x0,-0xc(%rbp) <- OrderAccess::storestore() 3da06c: 89 93 38 01 00 00 mov %edx,0x138(%rbx) <- this->_gc_time_stamp 3da072: 48 89 43 18 mov %rax,0x18(%rbx) <- this->_saved_mark_word 3da076: 48 83 c4 18 add $0x18,%rsp 3da07a: 5b pop %rbx 3da07b: c9 leaveq 3da07c: c3 retq 3da07d: 90 nop 3da07e: 66 90 xchg %ax,%ax 00000000003da080 <G1OffsetTableContigSpace::saved_mark_word() const>: (...) In the (as of now) latest build of JDK 9, the generated code looks almost the same, but with a differently named method and offsets. mgerdin@mgerdin03:~$ objdump -d -C /net/jre.us.oracle.com/onestop/jdk/1.9.0/promoted/all/b37/binaries/linux-x64/jre/lib/amd64/server/libjvm.so | grep -A 30 -E 'G1OffsetTableContigSpace::record_top_and_timestamp.*:' 00000000005d04f0 <G1OffsetTableContigSpace::record_top_and_timestamp()>: 5d04f0: 55 push %rbp 5d04f1: 48 89 e5 mov %rsp,%rbp 5d04f4: 53 push %rbx 5d04f5: 48 89 fb mov %rdi,%rbx 5d04f8: 48 83 ec 18 sub $0x18,%rsp 5d04fc: e8 bf 72 f8 ff callq 5577c0 <G1CollectedHeap::heap()> 5d0501: 8b 80 2c 13 00 00 mov 0x132c(%rax),%eax <- g1->_gc_time_stamp 5d0507: 8b 93 18 01 00 00 mov 0x118(%rbx),%edx <- this->_gc_time_stamp 5d050d: 39 c2 cmp %eax,%edx 5d050f: 73 15 jae 5d0526 <G1OffsetTableContigSpace::record_top_and_timestamp()+0x36> 5d0511: 48 8b 53 58 mov 0x58(%rbx),%rdx <- this->_top 5d0515: c7 45 ec 00 00 00 00 movl $0x0,-0x14(%rbp) <- OrderAccess::storestore() 5d051c: 89 83 18 01 00 00 mov %eax,0x118(%rbx) <- this->_gc_time_stamp 5d0522: 48 89 53 18 mov %rdx,0x18(%rbx) <- this->_saved_mark_word 5d0526: 48 83 c4 18 add $0x18,%rsp 5d052a: 5b pop %rbx 5d052b: 5d pop %rbp 5d052c: c3 retq 5d052d: 90 nop 5d052e: 66 90 xchg %ax,%ax 00000000005d0530 <G1OffsetTableContigSpace::G1OffsetTableContigSpace(G1BlockOffsetSharedArray*, MemRegion)>: (...) With my suggested fix using compiler_barrier the output is: mgerdin@mgerdin03:/localhome/hotspots$ objdump -d -C compiler_barrier/jre/lib/amd64/server/libjvm.so | grep -A 30 -E 'G1OffsetTableContigSpace::record_top_and_timestamp.*:' 00000000005cc9b0 <G1OffsetTableContigSpace::record_top_and_timestamp()>: 5cc9b0: 55 push %rbp 5cc9b1: 48 89 e5 mov %rsp,%rbp 5cc9b4: 53 push %rbx 5cc9b5: 48 89 fb mov %rdi,%rbx 5cc9b8: 48 83 ec 18 sub $0x18,%rsp 5cc9bc: e8 5f 80 f8 ff callq 554a20 <G1CollectedHeap::heap()> 5cc9c1: 8b 80 6c 23 00 00 mov 0x236c(%rax),%eax <- g1->_gc_time_stamp 5cc9c7: 8b 93 18 01 00 00 mov 0x118(%rbx),%edx <- this->_gc_time_stamp 5cc9cd: 39 c2 cmp %eax,%edx 5cc9cf: 73 15 jae 5cc9e6 <G1OffsetTableContigSpace::record_top_and_timestamp()+0x36> 5cc9d1: 48 8b 53 58 mov 0x58(%rbx),%rdx <- this->_top 5cc9d5: 48 89 53 18 mov %rdx,0x18(%rbx) <- this->_saved_mark_word 5cc9d9: c7 45 ec 00 00 00 00 movl $0x0,-0x14(%rbp) -> OrderAccess::storestore() 5cc9e0: 89 83 18 01 00 00 mov %eax,0x118(%rbx) <- this->_gc_time_stamp 5cc9e6: 48 83 c4 18 add $0x18,%rsp 5cc9ea: 5b pop %rbx 5cc9eb: 5d pop %rbp 5cc9ec: c3 retq 5cc9ed: 90 nop 5cc9ee: 66 90 xchg %ax,%ax 00000000005cc9f0 <G1OffsetTableContigSpace::G1OffsetTableContigSpace(G1BlockOffsetSharedArray*, MemRegion)>: (...)
04-11-2014

Changing "set_saved_mark_word" to do *((HeapWord *volatile *) &_saved_mark_word) = p; seems to be enough to enforce the correct ordering with GCC. Verifying that the actual ordering was maintained is indeed difficult and time consuming, I just stumbled across this bug because of an "Impossible" condition which still persists even with this reordering issue seemingly fixed.
24-10-2014

Mikael: is it easy for you to verify if the bug is gone if _top and _saved_mark_word are declared volatile? That would ease some concern about the broader impact of this problem. As a general rule any variables accessed in lock-free algorithms should be volatile (either at the point of declaration, or at the point of update), but even so we require these barriers to order all accesses. It looks like the validation that was done for 6973570 was concerned with the elision of the dummy stores rather than checking actual ordering was maintained. Though unless this particular case was examined the problem may not have been apparent. For linux x86 this needs to be fixed in release() (which in turn fixes storestore) and also all the release_store variants. The resulting compiler barrier is stronger than required by the release semantics but that is okay.
23-10-2014

The G1 code attempts to do an ordered publishing of two values: _saved_mark_word = _top; OrderAccess::storestore(); _gc_time_stamp = curr_gc_time_stamp; The types involved are HeapWord* _top, _saved_mark_word; volatile unsigned _gc_time_stamp; GCC decides to order the store of _saved_mark_word to after the store of _gc_time_stamp. I've verified that this is the case with our current RE and JPRT compilers and has been the case since at least the GA build of JDK 7. The problem affects both 32 and 64 bit X86 on Linux, I haven't investigated all other platforms yet but MSVC and SolarisStudio seem to behave as expected. Editing the code to call OrderAccess::release_store does not resolve the issue. This problem seems to have been introduced by the fix of JDK-6973570, according to JBS, JDK-6973570 was integrated into 7 with backport id JDK-2198043 and was resolved in build b109. The corresponding C++ code snipplet in JDK7 is: ContiguousSpace::set_saved_mark(); // _saved_mark_word = _top OrderAccess::storestore(); _gc_time_stamp = curr_gc_time_stamp; Below, _top is at offset 0x58, _saved_mark_word at 0x18 and _gc_time_stamp at 0x138, %rbx is "this". /net/jre/onestop/jdk/1.7.0/promoted/all//b108/binaries/linux-x64/jre/lib/amd64/server/libjvm.so: 3d9f4d: 39 d0 cmp %edx,%eax 3d9f4f: 73 1c jae 3d9f6d <G1OffsetTableContigSpace::set_saved_mark()+0x3d> 3d9f51: 48 8b 43 58 mov 0x58(%rbx),%rax 3d9f55: 48 89 43 18 mov %rax,0x18(%rbx) 3d9f59: 48 8b 05 40 f9 70 00 mov 0x70f940(%rip),%rax # ae98a0 <_DYNAMIC+0x12f8> 3d9f60: 48 c7 00 00 00 00 00 movq $0x0,(%rax) 3d9f67: 89 93 38 01 00 00 mov %edx,0x138(%rbx) /net/jre/onestop/jdk/1.7.0/promoted/all//b109/binaries/linux-x64/jre/lib/amd64/server/libjvm.so 3da05d: 39 d0 cmp %edx,%eax 3da05f: 73 15 jae 3da076 <G1OffsetTableContigSpace::set_saved_mark()+0x36> 3da061: 48 8b 43 58 mov 0x58(%rbx),%rax 3da065: c7 45 f4 00 00 00 00 movl $0x0,-0xc(%rbp) 3da06c: 89 93 38 01 00 00 mov %edx,0x138(%rbx) 3da072: 48 89 43 18 mov %rax,0x18(%rbx) In b109 the store of %rax to 0x18(%rbx) has been ordered after the store of %edx to 0x138(%rbx) in the same build as the local volatile change was integrated.
23-10-2014