JDK-8227369 : pd_disjoint_words_atomic() needs to be atomic
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 14
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2019-07-07
  • Updated: 2022-03-07
  • Resolved: 2022-03-01
JDK 19
19 b12  Fixed
Description
This issue came up during the code review for the following fix:

     JDK-8227338 templateInterpreter.cpp: copy_table() needs to be safer
     https://bugs.openjdk.java.net/browse/JDK-8227338

On 7/6/19 6:06 PM, David Holmes wrote:
> Hi Dan,
>
>> http://cr.openjdk.java.net/~dcubed/8227338-webrev/0_for_jdk14/
>
> So the original code uses a loop to copy, while the new code calls
> Copy::disjoint_words_atomic, but the implementation of that on x64
> is just a loop same as the original AFAICS:
>
> static void pd_disjoint_words_atomic(const HeapWord* from, HeapWord* to, size_t count) {
> #ifdef AMD64
>   switch (count) {
>   case 8:  to[7] = from[7];
>   case 7:  to[6] = from[6];
>   case 6:  to[5] = from[5];
>   case 5:  to[4] = from[4];
>   case 4:  to[3] = from[3];
>   case 3:  to[2] = from[2];
>   case 2:  to[1] = from[1];
>   case 1:  to[0] = from[0];
>   case 0:  break;
>   default:
>     while (count-- > 0) {
>       *to++ = *from++;
>     }
>     break;
>   }
> #else
>
> David
> ----- 

Erik replied with:

On 7/7/19 4:48 AM, Erik Osterlund wrote:
> Yeah that switch statement code and yet another plain non-volatile load/store loop looks like complete nonsense unfortunately. It should at least use Atomic::load/store.
>
> Fortunately, on x86_64, I believe it will in practice yield word atomic copying anyway by chance. But it should be fixed anyway. *sigh*
>
> The real danger is SPARC though and its BIS instructions. I don't have the code in front of me, but I really hope not to see that switch statement and non-volatile loop in that pd_disjoint_words_atomic() function.
>
> And I agree that the atomic copying API should be used when we need atomic copying. And if it turns out the implementation of that API is not atomic, it should be fixed in that atomic copying API.
>
> So I think this change looks good. But it looks like we are not done yet. :c
>
> Thanks,
> /Erik

So we need to take a closer look at the various implementations
of pd_disjoint_words_atomic() to make sure they are intentionally
atomic and not accidentally atomic.
Comments
Changeset: 44d599aa Author: David Holmes <dholmes@openjdk.org> Date: 2022-03-01 12:07:21 +0000 URL: https://git.openjdk.java.net/jdk/commit/44d599aad3994816997a61d9e36265dcefa52965
01-03-2022

A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk/pull/7567 Date: 2022-02-22 05:45:12 +0000
23-02-2022

Using this code as an example of where pd_disjoint_words_atomic() eventually gets called:

    void TemplateInterpreter::notice_safepoints() {
      if (!_notice_safepoints) {
        log_debug(...);
        // switch to safepoint dispatch table
        _notice_safepoints = true;
        copy_table((address*)&_safept_table, (address*)&_active_table,
                   sizeof(_active_table) / sizeof(address));
      } else {
        log_debug(...);
      }
    }

Where:

    static inline void copy_table(address* from, address* to, int size) {
      // Copy non-overlapping tables.
      if (SafepointSynchronize::is_at_safepoint()) {
        // Nothing is using the table at a safepoint so skip atomic word copy.
        Copy::disjoint_words((HeapWord*)from, (HeapWord*)to, (size_t)size);
      } else {
        // Use atomic word copy when not at a safepoint for safety.
        Copy::disjoint_words_atomic((HeapWord*)from, (HeapWord*)to, (size_t)size);
      }
    }

If we then look at the disassembly in gdb (where copy_table has been inlined) we see:

    (gdb) disassemble 'TemplateInterpreter::notice_safepoints()'
    Dump of assembler code for function TemplateInterpreter::notice_safepoints():
       0x00007ffff783d970 <+0>:    push   %rbp
       0x00007ffff783d971 <+1>:    mov    %rsp,%rbp
       0x00007ffff783d974 <+4>:    push   %rbx
       0x00007ffff783d975 <+5>:    sub    $0x8,%rsp
       0x00007ffff783d979 <+9>:    lea    0x4d9868(%rip),%rbx   # 0x7ffff7d171e8 <_ZN19AbstractInterpreter18_notice_safepointsE>
       0x00007ffff783d980 <+16>:   mov    0x4f9431(%rip),%rax   # 0x7ffff7d36db8 <_ZN16LogTagSetMappingILN6LogTag4typeE54ELS1_119ELS1_0ELS1_0ELS1_0ELS1_0EE7_tagsetE+56>
       0x00007ffff783d987 <+23>:   cmpb   $0x0,(%rbx)
       0x00007ffff783d98a <+26>:   jne    0x7ffff783d9d0 <TemplateInterpreter::notice_safepoints()+96>
       0x00007ffff783d98c <+28>:   test   %rax,%rax
       0x00007ffff783d98f <+31>:   je     0x7ffff783d99f <TemplateInterpreter::notice_safepoints()+47>
       0x00007ffff783d991 <+33>:   lea    0x1c74a0(%rip),%rdi   # 0x7ffff7a04e38
       0x00007ffff783d998 <+40>:   xor    %eax,%eax
       0x00007ffff783d99a <+42>:   callq  0x7ffff74e3a70 <LogImpl<(LogTag::type)54, (LogTag::type)119, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::write<(LogLevel::type)2>(char const*, ...)>
       0x00007ffff783d99f <+47>:   lea    0x50e976(%rip),%rax   # 0x7ffff7d4c31c <_ZN20SafepointSynchronize6_stateE>
       0x00007ffff783d9a6 <+54>:   mov    $0x5000,%edx
       0x00007ffff783d9ab <+59>:   movb   $0x1,(%rbx)
       0x00007ffff783d9ae <+62>:   lea    0x517beb(%rip),%rsi   # 0x7ffff7d555a0 <_ZN19TemplateInterpreter13_safept_tableE>
       0x00007ffff783d9b5 <+69>:   lea    0x521be4(%rip),%rdi   # 0x7ffff7d5f5a0 <_ZN19TemplateInterpreter13_active_tableE>
       0x00007ffff783d9bc <+76>:   mov    (%rax),%eax
       0x00007ffff783d9be <+78>:   callq  0x7ffff6d38d20 <memcpy@plt>
       0x00007ffff783d9c3 <+83>:   mov    -0x8(%rbp),%rbx
       0x00007ffff783d9c7 <+87>:   leaveq
       0x00007ffff783d9c8 <+88>:   retq
       0x00007ffff783d9c9 <+89>:   nopl   0x0(%rax)
       0x00007ffff783d9d0 <+96>:   test   %rax,%rax
       0x00007ffff783d9d3 <+99>:   je     0x7ffff783d9c3 <TemplateInterpreter::notice_safepoints()+83>
       0x00007ffff783d9d5 <+101>:  mov    -0x8(%rbp),%rbx
       0x00007ffff783d9d9 <+105>:  lea    0x1c7480(%rip),%rdi   # 0x7ffff7a04e60
       0x00007ffff783d9e0 <+112>:  xor    %eax,%eax
       0x00007ffff783d9e2 <+114>:  leaveq
       0x00007ffff783d9e3 <+115>:  jmpq   0x7ffff74e3a70 <LogImpl<(LogTag::type)54, (LogTag::type)119, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::write<(LogLevel::type)2>(char const*, ...)>

Here it turns out that the is_at_safepoint() check has been elided, because the compiler determined that both the regular and "atomic" copy loops can be replaced by the call to memcpy, just as was suspected!
With the new code we have:

    (gdb) disassemble 'TemplateInterpreter::notice_safepoints()'
    Dump of assembler code for function TemplateInterpreter::notice_safepoints():
       0x00007ffff6ea18f0 <+0>:    push   %rbp
       0x00007ffff6ea18f1 <+1>:    mov    %rsp,%rbp
       0x00007ffff6ea18f4 <+4>:    push   %rbx
       0x00007ffff6ea18f5 <+5>:    sub    $0x8,%rsp
       0x00007ffff6ea18f9 <+9>:    lea    0x4d98e8(%rip),%rbx   # 0x7ffff737b1e8 <_ZN19AbstractInterpreter18_notice_safepointsE>
       0x00007ffff6ea1900 <+16>:   mov    0x4f94b1(%rip),%rax   # 0x7ffff739adb8 <_ZN16LogTagSetMappingILN6LogTag4typeE54ELS1_119ELS1_0ELS1_0ELS1_0ELS1_0EE7_tagsetE+56>
       0x00007ffff6ea1907 <+23>:   cmpb   $0x0,(%rbx)
       0x00007ffff6ea190a <+26>:   jne    0x7ffff6ea1968 <TemplateInterpreter::notice_safepoints()+120>
       0x00007ffff6ea190c <+28>:   test   %rax,%rax
       0x00007ffff6ea190f <+31>:   je     0x7ffff6ea191f <TemplateInterpreter::notice_safepoints()+47>
       0x00007ffff6ea1911 <+33>:   lea    0x1c7508(%rip),%rdi   # 0x7ffff7068e20
       0x00007ffff6ea1918 <+40>:   xor    %eax,%eax
       0x00007ffff6ea191a <+42>:   callq  0x7ffff6b47990 <LogImpl<(LogTag::type)54, (LogTag::type)119, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::write<(LogLevel::type)2>(char const*, ...)>
       0x00007ffff6ea191f <+47>:   lea    0x50e9f6(%rip),%rax   # 0x7ffff73b031c <_ZN20SafepointSynchronize6_stateE>
       0x00007ffff6ea1926 <+54>:   lea    0x521c73(%rip),%rdi   # 0x7ffff73c35a0 <_ZN19TemplateInterpreter13_active_tableE>
       0x00007ffff6ea192d <+61>:   movb   $0x1,(%rbx)
       0x00007ffff6ea1930 <+64>:   lea    0x517c69(%rip),%rsi   # 0x7ffff73b95a0 <_ZN19TemplateInterpreter13_safept_tableE>
       0x00007ffff6ea1937 <+71>:   lea    0x5000(%rdi),%rdx
       0x00007ffff6ea193e <+78>:   mov    (%rax),%eax
       0x00007ffff6ea1940 <+80>:   cmp    $0x2,%eax              <= SAFEPOINT CHECK HERE
       0x00007ffff6ea1943 <+83>:   je     0x7ffff6ea1980 <TemplateInterpreter::notice_safepoints()+144>
       0x00007ffff6ea1945 <+85>:   mov    %rsi,%rax
       0x00007ffff6ea1948 <+88>:   add    $0x8,%rsi
       0x00007ffff6ea194c <+92>:   mov    (%rax),%rcx
       0x00007ffff6ea194f <+95>:   mov    %rdi,%rax
       0x00007ffff6ea1952 <+98>:   add    $0x8,%rdi
       0x00007ffff6ea1956 <+102>:  mov    %rcx,(%rax)
       0x00007ffff6ea1959 <+105>:  cmp    %rdx,%rdi
       0x00007ffff6ea195c <+108>:  jne    0x7ffff6ea1945 <TemplateInterpreter::notice_safepoints()+85>
       0x00007ffff6ea195e <+110>:  mov    -0x8(%rbp),%rbx
       0x00007ffff6ea1962 <+114>:  leaveq
       0x00007ffff6ea1963 <+115>:  retq
       0x00007ffff6ea1964 <+116>:  nopl   0x0(%rax)
       0x00007ffff6ea1968 <+120>:  test   %rax,%rax
       0x00007ffff6ea196b <+123>:  je     0x7ffff6ea195e <TemplateInterpreter::notice_safepoints()+110>
       0x00007ffff6ea196d <+125>:  mov    -0x8(%rbp),%rbx
       0x00007ffff6ea1971 <+129>:  lea    0x1c74d0(%rip),%rdi   # 0x7ffff7068e48
       0x00007ffff6ea1978 <+136>:  xor    %eax,%eax
       0x00007ffff6ea197a <+138>:  leaveq
       0x00007ffff6ea197b <+139>:  jmpq   0x7ffff6b47990 <LogImpl<(LogTag::type)54, (LogTag::type)119, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::write<(LogLevel::type)2>(char const*, ...)>
       0x00007ffff6ea1980 <+144>:  mov    $0x5000,%edx
       0x00007ffff6ea1985 <+149>:  callq  0x7ffff639cd20 <memcpy@plt>
       0x00007ffff6ea198a <+154>:  mov    -0x8(%rbp),%rbx
       0x00007ffff6ea198e <+158>:  leaveq
       0x00007ffff6ea198f <+159>:  retq

Now the safepoint check is restored, causing a jump to the memcpy version when at a safepoint and otherwise using the mov/add copy loop.
23-02-2022

[~dholmes] I would just keep the current structure, but make sure that the loads and stores in the switch statement that we want to be atomic, use Atomic::load/store.
21-02-2022

Yes, exactly like that.
21-02-2022

You mean:

    switch (count) {
    case 8:  Atomic::store(&to[7], Atomic::load(&from[7]));
    ...
    case 1:  Atomic::store(&to[0], Atomic::load(&from[0]));
    case 0:  break;
    default:
      while (count-- > 0) {
        Atomic::store(to, Atomic::load(from));
        to++;
        from++;
      }
      break;
    }

?
21-02-2022

Let me restate so that it's not lost in the noise ... the issue here is not that the code we use is itself not atomic (copying aligned words is atomic), but that the compiler (in its infinite wisdom) can replace our atomic code with a non-atomic block-move style of instruction. How exactly we prevent that is unclear. It is also unclear when the compiler might choose to make such a replacement. The compiler believes it has every right to do this because C/C++ says nothing about atomicity of accesses, unless using specific atomic types/APIs. This is of course completely impractical in the VM as every basic Java field access has atomicity requirements. We rely on using aligned variables, appropriately generated machine instructions (basic loads and stores) and the architectural guarantees of the hardware. Perhaps the only solution is to replace our C/C++ loops with inline-assembly (assuming the compiler won't try to replace those).
19-07-2019

ILW = MLH = P4
11-07-2019

Kim made the following comments in the JDK-8227338 review:

On 7/8/19 7:00 PM, Kim Barrett wrote:
>> On Jul 7, 2019, at 8:08 PM, David Holmes <david.holmes@oracle.com> wrote:
>>
>> On 7/07/2019 6:48 pm, Erik Osterlund wrote:
>>> The real danger is SPARC though and its BIS instructions. I don’t have the code in front of me, but I really hope not to see that switch statement and non-volatile loop in that pd_disjoint_words_atomic() function.
>>
>> sparc uses the same loop.
>>
>> Let's face it, almost nobody expects the compiler to do these kinds of transformations. :(
>
> See JDK-8131330 and JDK-8142368, where we saw exactly this sort of transformation from a fill-loop
> to memset (which may use BIS, and indeed empirically does in some cases). The loops in question
> seem trivially convertible to memcpy/memmove.
>
> Also see JDK-8142349.
>
>>> And I agree that the atomic copying API should be used when we need atomic copying. And if it turns out the implementation of that API is not atomic, it should be fixed in that atomic copying API.
>>
>> I agree to some extent, but we assume atomic load/stores of words all over the place - and rightly so. The issue here is that we need to hide the loop inside an API that we can somehow prevent the C++ compiler from screwing up. It's hardly intuitive or obvious when this is needed, e.g. if I simply copy three adjacent words without a loop, could the compiler convert that to a block move that is non-atomic?
>>
>>> So I think this change looks good. But it looks like we are not done yet. :c
>>
>> I agree that changing the current code to use the atomic copy API to convey intent is fine.
>
> I’ve been reserving Atomic::load/store for cases where the location “ought” to be declared std::atomic<T> if
> we were using C++11 atomics (or alternatively some homebrew equivalent). Not all places where we do
> stuff “atomically” are appropriate for that though (consider card tables, being arrays of bytes, where using an
> atomic<T> type might impose alignment constraints that are undesirable here). I *think* just using volatile
> here would likely be sufficient, e.g. we should have
>
> Copy::disjoint_words_atomic(const HeapWord* from, volatile HeapWord* to, size_t count)

On 7/8/19 7:15 PM, Kim Barrett wrote:
>> On Jul 8, 2019, at 7:00 PM, Kim Barrett <kim.barrett@oracle.com> wrote:
>> Copy::disjoint_words_atomic(const HeapWord* from, volatile HeapWord* to, size_t count)
>
> Or maybe
>
> Copy::disjoint_words_atomic(const volatile HeapWord* from, volatile HeapWord* to, size_t count)

I'm copying these comments here since they are more appropriate for this bug...
09-07-2019