Bug ID: JDK-8203466 intermittent crash at jdk.internal.misc.Unsafe::getObjectVolatile (native)

JDK 12
12 b12 Fixed

[~thartmann] can you review since this is really compiler code? I almost gave it back to you :)
14-09-2018
What about r12? There aren't a lot of registers!
    BasicType type = is_oop ? T_OBJECT : T_LONG;
    BarrierSetAssembler *bs = BarrierSet::barrier_set()->barrier_set_assembler();
    bs->arraycopy_prologue(_masm, decorators, type, from, to, qword_count);
r12 is the heapbase register. Does this load_heap_oop()? No, r12 doesn't look good. Saving to Thread as suggested above.
14-09-2018
Well. R10 is rscratch1, the single most clobbered register in the whole VM. So picking that as the register to safely stash away your callee saved register, hoping it won't get clobbered, is literally the worst idea since the tiny assistant thing in Microsoft Office. Sure, we could chase after every single implicit and accidental rscratch1 clobbering in this large and continuously changing block of code, including every use of a symbol in libjvm.so (like a call to a symbol that implicitly clobbers r10 once in a blue moon, or lea symbol). And get an almost impossible to track down crash every time that happens. Or we could make the code less fragile by saving the callee saved register literally anywhere else than rscratch1.
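A minimal sketch of the "save it anywhere else" idea, modelled as a toy C++ simulation rather than real stub code. Everything here is hypothetical: StubRegs, ThreadSlots, saved_rsi and the *_via_thread helpers are illustration names, not HotSpot identifiers, and this is not a claim about what the actual fix looks like.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the two registers under discussion (hypothetical type).
struct StubRegs {
    std::uint64_t rsi = 0;
    std::uint64_t r10 = 0;  // rscratch1: assume anything may overwrite it
};

// Hypothetical per-thread save area; the field name is an assumption.
struct ThreadSlots {
    std::uint64_t saved_rsi = 0;
};

// Save the Windows callee saved rsi in the thread slot, not in r10.
inline void setup_arg_regs_via_thread(StubRegs& r, ThreadSlots& t) {
    t.saved_rsi = r.rsi;
}

// Restore rsi from the thread slot.
inline void restore_arg_regs_via_thread(StubRegs& r, const ThreadSlots& t) {
    r.rsi = t.saved_rsi;
}

// Simulates a stub body that reuses rsi and clobbers r10 along the way.
inline std::uint64_t run_stub_with_thread_slot(std::uint64_t caller_rsi,
                                               std::uint64_t junk) {
    StubRegs r;
    ThreadSlots t;
    r.rsi = caller_rsi;
    setup_arg_regs_via_thread(r, t);
    r.rsi = 0;     // stub uses rsi as an argument register
    r.r10 = junk;  // implicit rscratch1 clobber somewhere in the stub
    restore_arg_regs_via_thread(r, t);
    return r.rsi;  // what the Windows caller sees after the stub returns
}

// Usage: the caller's rsi survives no matter what lands in r10.
inline void demo_thread_slot() {
    assert(run_stub_with_thread_slot(0xABCD, 0xDEADBEEF) == 0xABCD);
}
```

The point of the sketch is only that the saved value no longer shares storage with a scratch register, so clobbers of r10 inside the region become harmless.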
13-09-2018
Really dumb question. Why can't we just push rdi and rsi to the stack and restore them? This code appears to be leaf code so shouldn't care about stack format. Well, this doesn't work, but I don't know why the Barrier code shouldn't save and restore r10 if it's going to scratch it.
12-09-2018
Reassigning back to runtime. The above analysis indicates incorrect register usage by StubGenerator::setup_arg_regs and restore_arg_regs. This can lead to arbitrary failures due to unexpected register changes. While we don't have a direct link to that as the root cause for the failure, it looks very plausible. The failed test involves concurrent hash tables (which can involve object array copies), and the jvm.dll was loaded "far away" from references generated by the compiler, which we think happens very rarely.
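To make "far away" concrete, here is a rough sketch of the reachability test implied by the analysis: a target is "near" if the distance from the referencing instruction fits in a signed 32-bit displacement. The helper names fits_simm32 and reachable_rip_relative are made up for this illustration and are not the HotSpot functions; the two addresses in the usage check are the native wrapper address and the crashing pc quoted elsewhere in this report.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// True if v fits in a sign-extended 32-bit immediate or displacement.
inline bool fits_simm32(std::int64_t v) {
    return v >= std::numeric_limits<std::int32_t>::min() &&
           v <= std::numeric_limits<std::int32_t>::max();
}

// True if target can be addressed relative to next_pc with a 32-bit
// displacement. When this is false, generated code has to materialize
// the full 64-bit address in a scratch register first.
inline bool reachable_rip_relative(std::uint64_t next_pc,
                                   std::uint64_t target) {
    const std::int64_t disp = static_cast<std::int64_t>(target - next_pc);
    return fits_simm32(disp);
}

// Usage: the wrapper in the code heap vs. the pc inside jvm.dll.
inline void demo_reachability() {
    assert(!reachable_rip_relative(0x000000a162150212ULL,
                                   0x00007ffe62edeb16ULL));
    assert(reachable_rip_relative(0x1000, 0x2000));
}
```

With those two addresses the distance is far beyond 2 GB, which is consistent with the library being out of direct reach of the code heap in this crash.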
25-07-2018
It is worth noting that libjvm.so seems to have been mapped in "far away" from the code heap (Kim helped me verify this is really the case in the core file). As a result, all references to symbols in libjvm.so will clobber r10. It is AFAIK very rare for libjvm.so to be mapped in "far away", and it is probably a significantly less tested scenario. This is even more fragile on Windows because Windows has a slightly different calling convention that gets hurt even more by r10 clobbering. In particular, I noticed that a bunch of code uses ABI conversion facilities to convert the Windows calling convention to the System V calling convention. Notably, rdi and rsi are callee saved on Windows and need to be preserved. Therefore, in StubGenerator::setup_arg_regs(), which wraps Windows-to-SysV conversions, rsi is saved in r10. For rsi to be retained properly as expected by the Windows ABI, it is assumed that nobody touches r10 until a subsequent call to restore_arg_regs(). But r10 is the most commonly clobbered register in HotSpot. I see a bunch of workarounds involving counter updates that have been moved out of such Windows-to-SysV regions for this very reason. However, I think there are more issues. An example pathology that I have high confidence is a real problem (among possibly others that I am less confident about) is that for oop arraycopy stubs, the G1 barriers call the VM for slow path code. In the unlikely scenario that libjvm.so is mapped in far away (which is the case in this crash), such slow path calls will clobber r10, which will consequently clobber rsi on Windows, which was meant to be callee saved, which is now not respected. Similarly, the card table base value is loaded clobbering r10 if it is larger than a simm32 (which, as it is shifted down, I presume is also highly unlikely to happen). I really don't think we can rely on r10 not being clobbered in this region.
Perhaps a safer mechanism could be used to preserve Windows callee saved registers. In conclusion, it is to my knowledge very rare that libjvm.so is mapped in far away, and its effect on the Windows calling convention in particular has at least one known bug in the code generation. That bug could lead to any native code that preserves a caller saved register in rsi across a call, assuming the calling convention on Windows is honored, getting the preserved value clobbered with random pointers into libjvm.so in this highly unusual situation. Because the far case is rarely triggered, and there are now known, already existing problems with it, I suspect we have an r10 problem at hand, as Kim could verify such far cases are being triggered. For the record, I ran tier1-3 with -XX:+ForceUnreachable on Windows, and it came back green, suggesting that even though ABI violations are deterministically being made, that on its own is not enough to deterministically crash the VM, which perhaps explains the rare nature of this crash.
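The failure mode described in this comment can be sketched as a small C++ simulation. This is only a model under stated assumptions, not the stub generator: ToyRegs and the *_sim functions are hypothetical stand-ins for the register file, setup_arg_regs(), restore_arg_regs() and a reference to a VM symbol, and the target address is an arbitrary placeholder.

```cpp
#include <cassert>
#include <cstdint>

// Toy register file holding only the two registers discussed here.
struct ToyRegs {
    std::uint64_t rsi = 0;
    std::uint64_t r10 = 0;  // rscratch1
};

// Stand-in for setup_arg_regs(): stash the callee saved rsi in r10.
inline void setup_arg_regs_sim(ToyRegs& r) { r.r10 = r.rsi; }

// Stand-in for restore_arg_regs(): recover rsi from r10.
inline void restore_arg_regs_sim(ToyRegs& r) { r.rsi = r.r10; }

// Stand-in for a slow path call into the VM. If the symbol is not
// reachable, its address is first materialized in r10 (assumption
// taken from the analysis above).
inline void call_vm_symbol_sim(ToyRegs& r, std::uint64_t target,
                               bool reachable) {
    if (!reachable) {
        r.r10 = target;
    }
}

// Placeholder address standing in for some symbol in the VM library.
constexpr std::uint64_t kFarTarget = 0x00007ffe00001000ULL;

// Returns the rsi value the Windows caller observes after the stub.
inline std::uint64_t run_stub_sim(std::uint64_t caller_rsi, bool far) {
    ToyRegs r;
    r.rsi = caller_rsi;
    setup_arg_regs_sim(r);
    r.rsi = 0;  // stub reuses rsi as a SysV argument register
    call_vm_symbol_sim(r, kFarTarget, !far);
    restore_arg_regs_sim(r);
    return r.rsi;
}

// Usage: near mapping preserves rsi, far mapping leaks a VM pointer.
inline void demo_r10_clobber() {
    assert(run_stub_sim(0xABCD, false) == 0xABCD);
    assert(run_stub_sim(0xABCD, true) == kFarTarget);
}
```

In the near case the caller gets its rsi back; in the far case it gets a pointer into the VM library instead, which matches the "random pointers into libjvm.so" symptom described above. The model also shows why a green tier1-3 run is plausible: nothing crashes unless the caller actually depends on the value it kept in rsi.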
25-07-2018
The crash appears to be in the call to HeapAccess<...>::oop_load_at() in Unsafe_GetObjectVolatile, with p == 0x01, offset == 0x194. p is the result of the immediately preceding JNIHandles::resolve(). The call to resolve was inlined, with the "argument" passed in rdi, which came from r8. r8 is the third argument (per calling convention on Windows), which is the obj (base) argument for Unsafe_GetObjectVolatile. So the code looks okay here. The value passed to resolve is 0xfff09348 (in rdi, and still in r8 too). At that memory location we do indeed have 0x01. But where did that value in r8 come from? I don't see anything in the caller to set r8. And I think it ought to be a stack pointer, as this should be a local "stack allocated" jobject, and that value doesn't look like a stack pointer. That suggests someone may have stomped on r8.
24-07-2018
I've not had any luck reproducing this either. The one compilation event is for the "tabAt" function that called Unsafe.getObjectAcquire, but the stack trace suggests the call into Unsafe hasn't been intrinsified. [~coleenp] suggested looking at JDK-8202377. The timing of that change with respect to this failure looked suspicious, since it was pushed 28-05-18, and this failure was reported 2018-05-20. However, I think we can rule it out for several reasons, not least of which is that it shouldn't be present in the version under test. The failure was reported as having occurred in a "same binaries" run for jdk-11+14, which is changeset 50155:3595bd343b65. JDK-8202377 is changeset 50180:ffa644980dff. (JDK-8202377 is also almost exclusively C2-related, while the only compilation event came from a C1 compiler thread. And I think we're not dealing with compiled code here anyway.)
01-07-2018
The only interesting part of the stack trace is the intrinsic call to:
    J 3 jdk.internal.misc.Unsafe.getObjectVolatile(Ljava/lang/Object;J)Ljava/lang/Object; java.base (0 bytes) @ 0x000000a162150212 [0x000000a1621501c0+0x0000000000000052]
No GCs have occurred and the only compilation event is for
    Compilation events (1 events):
    Event: 0.102 Thread 0x000000a170429800 1 3 java.util.concurrent.ConcurrentHashMap::tabAt (22 bytes)
There were a lot of changes to the intrinsic for getObjectVolatile in this linked bug https://bugs.openjdk.java.net/browse/JDK-8202377. It should probably be looked at by GC. [~eosterlund] I can't reproduce this and see no connection to any runtime code.
25-06-2018
The crash did not happen in an nmethod but at pc=0x00007ffe62edeb16 in jvm.dll after calling through the native wrapper @0x000000a162150212. Here's the code that is being executed:
       int3
       int3
       int3
       rex push rbx
       sub rsp,0x20
    -> mov eax,DWORD PTR [rcx+rdx*1]
       mov r9,rcx
       test eax,eax
       jne 0x2e
       xor ebx,ebx
       jmp 0x41
Looks like we are somewhere in native Unsafe_GetObjectVolatile trying to access offset rdx (0x0000000000000194) in Object rcx (0x0000000000000001), which fails because the oop is invalid (the code was heavily refactored by JDK-8189871, maybe something got broken). I don't think this is a JIT issue. Moving back to runtime.
23-05-2018
It looks like the crash happened from an nmethod.
21-05-2018