JDK-8258825 : strange crashes with applications/jcstress on AMD EPYC
  • Type: Bug
  • Component: hotspot
  • Sub-Component: test
  • Affected Version: 16,17,19,20,21,22
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: linux
  • CPU: x86_64
  • Submitted: 2020-12-22
  • Updated: 2023-09-28
Description
The following test failed in the JDK16 CI:

applications/jcstress/seqcst.java

Here's a snippet from the log file:

----- [OK] Unlocking diagnostic VM options
Burning up to figure out the exact CPU count....... done!

----- [OK] Trimming down the default VM heap size to 1/8-th of max RAM
----- [OK] Trimming down the number of compiler threads
----- [OK] Trimming down the number of parallel GC threads
----- [OK] Trimming down the number of concurrent GC threads
----- [OK] Trimming down the number of G1 concurrent refinement GC threads
----- [FAILED] Testing @Contended works on all results and infra objects
Java HotSpot(TM) 64-Bit Server VM warning: Option MaxRAMFraction was deprecated in version 10.0 and will likely be removed in a future release.
Java HotSpot(TM) 64-Bit Server VM warning: Option MinRAMFraction was deprecated in version 10.0 and will likely be removed in a future release.
Exception in thread "main" java.lang.IllegalStateException: /opt/mach5/mesos/work_dir/slaves/983c483a-6907-44e0-ad29-98c7183575e2-S154452/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/d8d1d317-a8a1-472d-8f96-de05b7474bbc/runs/8509417c-55e7-42e2-ba5a-f64cc65beb01/testoutput/test-support/jtreg_open_test_hotspot_jtreg_jcstress_part1/classes/0/applications/jcstress/seqcst/d/applications/jcstress/JcstressRunner
	at org.openjdk.jcstress.util.Reflections.getClasses(Reflections.java:66)
	at org.openjdk.jcstress.vm.ContendedTestMain.main(ContendedTestMain.java:43)
Caused by: java.lang.ClassNotFoundException: /opt/mach5/mesos/work_dir/slaves/983c483a-6907-44e0-ad29-98c7183575e2-S154452/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/d8d1d317-a8a1-472d-8f96-de05b7474bbc/runs/8509417c-55e7-42e2-ba5a-f64cc65beb01/testoutput/test-support/jtreg_open_test_hotspot_jtreg_jcstress_part1/classes/0/applications/jcstress/seqcst/d/applications/jcstress/JcstressRunner
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:466)
	at org.openjdk.jcstress.util.Reflections.getClasses(Reflections.java:64)
	... 1 more

This test ran for almost three hours, but it's not clear when
the above exceptions were thrown. This failure might be due
to a test bug in applications/jcstress/seqcst.java or in the harness,
so I'm starting this bug off in hotspot/test.
Comments
Recently we were able to get crashes on a bare metal host, so that makes it harder to blame virtualization. Just having a bad PC doesn't always explain the crash: we get SEGV crashes on instructions that don't access memory, and we even get crashes on instructions like PAUSE. It's possible that the crash happened somewhere else and the PC given to the signal handler is wrong for some reason.

The crashes on the bare-metal host were using big heaps, so there were no GC events, which I believe means no nmethod code-heap space could have been reused. That makes it hard to explain why the crash information doesn't match what is in memory. I'm wondering if the processor could be seeing old data from an earlier invocation of the JVM. The test uses all processors and forks a lot of JVMs, all running the same or similar code.

The test seems to be using onSpinWait, because the hot method at the top of the stack has a few loops with PAUSE instructions at the top. The PC reported for most of the crashes is within 20 bytes after a PAUSE instruction.
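For illustration, here is a minimal, self-contained sketch (not taken from the failing tests) of the kind of busy-wait loop built on Thread.onSpinWait(); on x86-64 the JIT typically lowers this hint to a PAUSE instruction, matching the loops seen in the hot method. All names in the sketch are hypothetical.

// Hypothetical illustration of a spin-wait loop (not from jcstress).
// On x86-64 the JIT typically compiles Thread.onSpinWait() down to a
// PAUSE instruction, which is why hot spin loops show PAUSE near the
// top of the generated code.
public class SpinWaitExample {
    private volatile boolean ready;          // flag published by another thread

    void awaitReady() {
        while (!ready) {
            Thread.onSpinWait();             // busy-wait hint to the CPU/JIT
        }
    }

    void publish() {
        ready = true;                        // lets the spinning thread exit
    }

    public static void main(String[] args) throws InterruptedException {
        SpinWaitExample ex = new SpinWaitExample();
        Thread waiter = new Thread(ex::awaitReady);
        waiter.start();
        Thread.sleep(10);                    // let the waiter spin for a bit
        ex.publish();
        waiter.join();
    }
}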
02-03-2023

Here are some general thoughts. I will be happy to correct this if informed of any misunderstandings. The bug comments suggest that an interrupt handler could be getting a bad PC from the virtualization software. (I assume that's what KVM means.) Has that been ruled out yet? If not, it's the likeliest cause. (…Said the non-KVM engineer.)

Either it's a bad address or a bad read from a good address, right? (Or two bad things on top of each other, maybe, but that's rare…) If it's a bad address, it could be coming from HotSpot. (Unlikely but possible; we have some technical debt here.) Or else it is coming from some other system component. Probably not hardware, probably not the OS, which is why I'm looking at the virtualizer. It reproduces only with KVM in the mix, right?

Does there seem to be a pattern in the bad addresses being reported? If there were a virtualization bug, how might it produce such a pattern? If the addresses are consistently in the code cache, that suggests that some part of the virtualizer is getting stressed out by our activities in the code cache. Which is likely, if the virtualizer contains a little JIT to lower emulated instructions. After all, how would you like to maintain the consistency of a JIT whose inputs are unpredictably varying instruction blocks?

Long term, we are working with various HW designers to clarify or adjust their specs to allow reasonable and useful non-self-patching. (For a reasonable and useful definition of reasonable, that is useful to HotSpot!) I'm going to assume we are on that road, and that if we have found HW that is acting unreasonably, it is a bug even if it is within some existing spec that could be adjusted. Let's assume, for now, that HotSpot is doing reasonable patches.

Under all those assumptions, if it is a bad read, then we have a hardware bug. It may be that the fix for this would be to have HotSpot work harder to avoid the hardware bug, which is not preferable but tolerable. If it is such a bug, what kind of bug might it be? Is it *often* the case that the failing instructions are ones which we are patching one-by-one? It does not look like that; they are just in the code cache. That's why you are looking at code installation as a root cause, since that *patches all the instructions*, not just the special ones that we patch on the fly.

I think it is super-unlikely that code installation, all by itself, will cause a HW bug ("bug" in the special sense here). That's because operating systems do code installation all day long when they map and unmap shared libraries. It is not unique to us. What's special here is not that we load code into memory, but that we change it later on, and without the help of munmap (TLB shootdown, etc.).

Going back to the hypothesis of a bad address, in the context of a signal handler, it could be that the virtualization software has a seldom-used, out-of-date cache that maps physical signal addresses to virtual (emulated) addresses. Suppose our HotSpot code raised a signal, populated that table, and then went on with business as usual. Suppose much time goes by but the table is not cleared. Suppose somebody looks at that table again for a different virtual code chunk, now using the same physical address. Suppose that an emulated signal is raised, and wrongly pointed at the stale address. That could cause this bug. It's just a for-instance, of course.
02-03-2023

[~dlong] I think it makes sense to assign this issue to you since you are investigating it.
01-03-2023

The latest crash is interesting. See the attached hs_err_pid3615915.log. The values at the crash site seem consistent. We apparently crashed here, because %RBP was 0x2:

0x7f8ef5208e88: mov 0x8(%rbp),%r8d

This code is in block B5, right after the OSR entry point, so the last call should have been to OSR_migration_end(). But the return PC at [RSP - 8] is a call site in block B34, and there doesn't seem to be a path back to B5 from B34.
05-10-2022

I still got a crash after changing Assembler::pause() to replace PAUSE with NOP. However, the crash was not as mysterious as earlier crashes. The PC that crashed points into the middle of an instruction, but unlike before, decoding at that address gives an instruction that would actually explain the crash. So we could have two bugs here:
1) Some bug that causes us to jump into the middle of an instruction
2) A bug that sometimes causes the wrong PC to be reported by the signal handler
If using NOP instead of PAUSE gets rid of problem [2], then it might help us find the cause of problem [1]. Also, even though the JIT compilers are no longer using PAUSE, if that instruction can cause problem [2], then perhaps other PAUSE instructions in C++ code, libraries, or the kernel are causing problem [1].
01-10-2022

The PAUSE instruction is special on a virtual guest (I believe OCI uses KVM). It can cause the guest execution to suspend, returning control to the host. See the "Virtual CPU timeslice sharing" section here: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/6.2_release_notes/virtualization On AMD this is called Pause Filter.
30-09-2022

One thing I keep noticing in a lot of these crashes is the spin-loop PAUSE instruction. I'm going to try replacing it with NOP to see what happens. Lately I have been able to reproduce a crash about 2% of the time.
30-09-2022

The latest crash shows we got a SI_KERNEL trap:

siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x0000000000000000

This is after JDK-8294003, so we may be getting more accurate trap info now that we don't try to handle SI_KERNEL SEGV traps in the signal handler.
27-09-2022

According to the core dump from the latest crash, we got a SEGV on this instruction:

0x7f53d90e9d93: lea 0x1(%r13),%r13

which is curious because it doesn't access memory. Then we tried to handle the SEGV as if it was an implicit null check, and, not finding the expected exception handler, we crashed while trying to print the nmethod code. Apparently a relocation contained a bad metadata pointer.
26-07-2022

I can't find an hs_err file in the latest crash. The core dump says the current PC is here:

0x7f9700b0955f: mov %ebp,%edx

but that seems to be in the middle of an instruction:

0x7f9700b09556: lock addl $0x0,-0x40(%rsp)
0x7f9700b0955c: jmp 0x7f9700b09561
0x7f9700b0955e: mov %rbp,%r10
0x7f9700b09561: mov 0x38(%r9),%r11
0x7f9700b09565: test %r11,0x28(%r15)
06-07-2022

According to hs_err_pid2424685.log, we crashed in the middle of a branch instruction.
07-06-2022

All of the seqcst.java failures I can find only happen on AMD machines. It seems strange that none of the failures are on Intel hardware.
03-06-2022

The stdout/stderr files don't have anything that suggests the crash or VM failure; the only things that caught my eye are:
1. "Failed to delete stdout log:" / "Failed to delete stderr log:" messages in the org.openjdk.jcstress.tests.seqcst.sync.S1__S1_L2__S2_S1_Test.html file, yet the mentioned stdout and stderr files are nowhere to be found
2. `traps: jcstress-worker[9244] trap int3 ip:7fdb19b157fe sp:7fdb143c0540 error:0 in libjvm.so[7fdb182d4000+1f5c000]` in dmesg
13-01-2021

The ClassNotFoundExceptions aren't problems; that's just how jcstress checks whether @Contended works. The reason the test failed is below:

org.openjdk.jcstress.tests.seqcst.sync.S1__S1_L2__S2_S1_Test [-Djava.io.tmpdir=/workdir/testoutput/test-support/jtreg_open_test_hotspot_jtreg_jcstress_part1/scratch/0, -Djava.io.tmpdir=/workdir/testoutput/test-support/jtreg_open_test_hotspot_jtreg_jcstress_part1/scratch/0, -XX:MaxRAMPercentage=6, -Djava.io.tmpdir=/workdir/testoutput/test-support/jtreg_open_test_hotspot_jtreg_jcstress_part1/tmp, -XX:+CreateCoredumpOnCrash, -XX:+UseParallelGC, -XX:+UseNUMA] had failed with the VM error.
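For context, here is a minimal, hypothetical sketch (not jcstress's actual Reflections/ContendedTestMain code) of how an infra self-check like this could scan a classes directory and probe each entry with Class.forName, treating ClassNotFoundException on non-loadable entries as expected rather than fatal. Every name below is illustrative.

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical illustration only: walk a classes directory, try to load each
// .class entry by name, and treat ClassNotFoundException as a benign skip.
// This is why ClassNotFoundException in the log is not by itself the problem.
public class ContendedCheckSketch {

    static List<Class<?>> getClasses(Path classesRoot) throws IOException {
        List<Path> classFiles;
        try (Stream<Path> files = Files.walk(classesRoot)) {
            classFiles = files.filter(f -> f.toString().endsWith(".class")).toList();
        }
        List<Class<?>> result = new ArrayList<>();
        String sep = FileSystems.getDefault().getSeparator();
        for (Path p : classFiles) {
            // Turn "a/b/C.class" into "a.b.C"
            String rel = classesRoot.relativize(p).toString();
            String name = rel.substring(0, rel.length() - ".class".length()).replace(sep, ".");
            try {
                result.add(Class.forName(name));
            } catch (ClassNotFoundException | NoClassDefFoundError e) {
                // Expected for entries that are not loadable top-level classes; skip.
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        System.out.println("Loadable classes found: " + getClasses(root).size());
    }
}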
13-01-2021