JDK-8320151 : some javax/sound JCK tests fail with Invalid descriptor with graal on linux-aarch64
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: repo-galahad
  • Priority: P2
  • Status: Closed
  • Resolution: Duplicate
  • CPU: aarch64
  • Submitted: 2023-11-15
  • Updated: 2024-01-30
  • Resolved: 2024-01-30
Other
repo-galahad: Resolved
Related Reports
Duplicate :  
Relates :  
Relates :  
Relates :  
Description
Currently this has been seen with:
api/javax_sound/midi/MidiChannel/Pressure.html
api/javax_sound/midi/MidiChannel/Solo.html
api/javax_sound/midi/MidiChannel/control.html
api/javax_sound/midi/MidiDevice/get.html
api/javax_sound/midi/MidiDevice/recvTransm.html
api/javax_sound/midi/MidiSystem/get.html
api/javax_sound/midi/Receiver/Receiver.html
api/javax_sound/midi/Sequencer/Sync.html
api/javax_sound/midi/Sequencer/Tempo.html
api/javax_sound/midi/Soundbank/Resource.html
api/javax_sound/midi/Soundbank/Soundbank.html
api/javax_sound/midi/SoundbankResource/getName.html
api/javax_sound/midi/Synthesizer/load.html
api/javax_sound/sampled/spi/MixerProvider/MixerProviderTests.html

Comments
Sounds good. I'm closing this as duplicate of JDK-8320892 then.
30-01-2024

Sorry I wasn't paying close enough attention to bug ids and didn't see that it was split off into a new issue. I think closing it as a duplicate seems like the best solution.
29-01-2024

Understood. What I meant is that the issue was originally filed to track the Graal failures. So, assuming that we don't want to put in a general fix at this point, I'd suggest to either close this as duplicate of JDK-8320892 or as "External".
29-01-2024

It's not actually specific to Graal. The corruption of the floating point control word is persistent and affects all compilers. We're just the only compiler to notice. Our integer intrinsics are no longer affected by this problem but if you ran the right FP tests after loading this library you could notice it with the interpreter or any compiler. There is a PR to add flags to fix and verify the FP control word on aarch64 in the same way it's done for XMM but it wouldn't change anything by default. It would just provide some flags to detect and fixup the corruption. So this issue is completely unrelated to Graal at this point. Presumably we want to go ahead with the PR [~aph] created.
26-01-2024

Since this issue is specific to failures with Graal, I'm moving it to repo-galahad.
26-01-2024

We have removed the use of floating point compare in our integer intrinsics so they execute correctly even if the control word is broken.
12-01-2024

JDK-8320892: AArch64: Restore FPU control state after JNI https://github.com/openjdk/jdk/pull/16851
29-11-2023

Wrapping ALSA calls is not such a terrible idea, but it's not straightforward. These are JNI libraries, and they're not part of HotSpot. At present, HotSpot doesn't know when it's calling ALSA. It might be that ALSA itself has `dlopen`ed some library compiled with -ffast-math. If so, this bug is probably not AArch64-specific, even though we haven't seen it triggered elsewhere. I can't find anything in ALSA itself that fiddles with the AArch64 flags, so it's probably in some library ALSA is using. We haven't yet found the root cause.
29-11-2023

Sorry to jump in, but since we know ALSA calls might change the FPCR (and presumably haven't seen other offenders), wouldn't a pragmatic compromise be to always save and restore the FPCR (only) around those calls (i.e. in libjsound), so that we can leave RestoreFPCR/MXCSROnJNICalls off by default and still address this issue?
29-11-2023

I have no reason to believe this problem is any more likely on AArch64 than x86. I don't know the history of RestoreMXCSROnJNICalls because it dates from before OpenJDK, but I imagine it's set to false because of the overhead. Given that some native calls are to very short functions, and particularly given that we're adding a new high-performance foreign function API, any additional overhead is unwelcome. There's no more reason for RestoreFPCROnJNICalls to be enabled by default than RestoreMXCSROnJNICalls. Given that Oracle is running the JCK with assertions enabled for its own internal reasons, and this is an OS problem with a broken sound library, it's not clear to me why Oracle can't run the JCK with RestoreFPCROnJNICalls enabled until the broken library is fixed.
28-11-2023

Having dealt with these problems in the pre SSE days on x86, silent fixup with optional warning is kind of the best case solution. RestoreMXCSROnJNICalls is currently false so it's not actually doing any fixup or warning by default since I think everyone is pretty clean with regards to MXCSR. So maybe there should be RestoreFPCROnJNICalls which defaults to true on aarch64 and maybe there's a day when it could be set to false.
27-11-2023

I suppose we could silently fix things up and continue. In a release build that's surely the right thing to do. In a debug build, I'm not so sure. x86 has two -XX flags, RestoreMXCSROnJNICalls and CheckJNICalls. If CheckJNICalls is on, FP control register changes are warned about, and if RestoreMXCSROnJNICalls is on FP control register corruption gets fixed. So we could do likewise on AArch64: silently fix things and make the assertion depend on CheckJNICalls. That leaves every other caller to ALSA on this OS release at risk. That doesn't feel great to me, but I guess it fixes our problem.
27-11-2023
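
The two-flag behaviour described in the comment above (one flag warns about a changed control register, the other repairs it) can be modelled in a few lines of C. This is a simplified model for discussion, not HotSpot code: the struct and function names are hypothetical, and the control word is represented as a plain integer rather than read from hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two independent switches. */
typedef struct {
    bool check_jni_calls;      /* models CheckJNICalls: warn on a change   */
    bool restore_on_jni_calls; /* models RestoreMXCSROnJNICalls: fix it up */
} fp_policy;

typedef struct {
    uint64_t control_word; /* value in effect after the return path */
    bool warned;           /* whether a warning would have been issued */
} fp_outcome;

/* Decide what happens on return from a native call, given the control
   word the VM expects and the one it observes. */
static fp_outcome on_native_return(fp_policy policy,
                                   uint64_t expected, uint64_t observed) {
    fp_outcome out = { observed, false };
    if (observed == expected) {
        return out;                  /* nothing changed, nothing to do */
    }
    if (policy.check_jni_calls) {
        out.warned = true;           /* report the corruption */
    }
    if (policy.restore_on_jni_calls) {
        out.control_word = expected; /* silently repair it */
    }
    return out;
}
```

With both switches off, which matches the x86 defaults mentioned in the thread, the model leaves a corrupted control word in place and reports nothing.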

Thanks for the explanation about _thread_in_native, you're exactly right. Checking the thread state is clearly the wrong thing to do in this case: the correct thing is to see if the signal was from some code that we generated, i.e. if it was in a code buffer.
27-11-2023

Yes, it's an imperfect world, but it's not the job of HotSpot to police broken external libraries. Correcting it on return and asserting in debug builds is a fine solution as far as I'm concerned. But that will of course cause all of the tests in this issue to fail, and something will need to be done to address that.
27-11-2023

This is not an all-or-nothing situation. The call stub is really a different problem: when we're launched from native by the invocation interface we set things up how we need them to be. It's on us to respect the caller's FP environment and restore it the way it was at Java entry. If any library code, anywhere, is messing with the floating-point control register such that the system is no longer IEEE compliant, then every caller is at risk. That's a bug in the library. This isn't just a Java thing: if we were to accept that libraries just do this, then every library call in every program would have to check, just in case some random library messed things up. It's not really different from a library you called corrupting your stack. It's arguably worse, because the library is corrupting the user's running environment in a way that silently produces incorrect results. But we live in an imperfect world, so I propose to do this: check the FPCR at return, fix it up, and assert on debug builds.
27-11-2023

Just a thought: the bug might depend on what sound hardware is in use, or some other sound environment feature.
27-11-2023

If it causes incorrect results in our computations then it sure seems like our bug. Why wouldn't we reset the control word on every return from native? It's not exactly expensive and we're explicitly setting one in the call stub. If we can trust that it already has the right value then we shouldn't need JDK-8319973. It just seems like this is an all-or-nothing situation. Either we always set it to a sane value at various boundaries or we never set it. At a minimum, a fastdebug build should assert that it's properly set. That sure would have saved us a whole lot of time. I have no idea how to get that information. Maybe someone else does.
27-11-2023

Can you tell me what 'rpm -qi alsa-lib' says on a failing system?
27-11-2023

It looks like the bug is somewhere in here:
    int getMidiDeviceCount(snd_rawmidi_stream_t direction) {
        int deviceCount;
        TRACE0("> getMidiDeviceCount()\n");
        initAlsaSupport();
        deviceCount = iterateRawmidiDevices(direction, NULL, NULL);
        TRACE0("< getMidiDeviceCount()\n");
        return deviceCount;
    }
We certainly could check FPCR on every native call, but it's not really our bug.
27-11-2023

It's occurring in the context of the generated native wrapper and, I think, we're still _thread_in_native, so it skips the magic in the signal handler.
27-11-2023

So I just tried stop() on AArch64, and I got
    # A fatal error has been detected by the Java Runtime Environment:
    #
    # Internal Error (0xe0000000), pid=2485211, tid=2485212
    # stop: corrupted control word detected
    #
    # JRE version: OpenJDK Runtime Environment (22.0) (slowdebug build 22-internal-adhoc.aph.theRealAph-jdk)
    ...
just as it should be. Maybe there's something wrong with the installed signal handler.
27-11-2023

I applied the attached fpcr.patch to jdk master and ran these same failing tests with mach5 against that build and got a bunch of failures. The failures all look like SIGILL since aarch64 doesn't have a real stop routine:
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    # SIGILL (0x4) at pc=0x0000ffff8cc6f884, pid=666054, tid=666103
    #
    # JRE version: Java(TM) SE Runtime Environment (22.0) (fastdebug build 22-internal-2023-11-27-0353158.tom.rodriguez.jdk-jdk)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 22-internal-2023-11-27-0353158.tom.rodriguez.jdk-jdk, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
    # Problematic frame:
    # j com.sun.media.sound.DirectAudioDeviceProvider.nGetNumDevices()I+0 java.desktop@22-internal
    #
The actual assembly at the fault is:
    0xffff9cc6f878: mrs x8, fpcr
    0xffff9cc6f87c: and x8, x8, #0x1000000
    0xffff9cc6f880: cbz x8, #0xffff9cc6f890
 => 0xffff9cc6f884: dcps1 #0xdeae
27-11-2023
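
The check in the assembly quoted above masks the FPCR value with 0x1000000, which is bit 24, the flush-to-zero (FZ) control bit, and traps if it is set. The same test can be written in C as below. This is a sketch for readers following the thread, not the contents of fpcr.patch; the macro and function names are hypothetical, and the FPCR value is passed in as a plain integer instead of being read with mrs.

```c
#include <stdbool.h>
#include <stdint.h>

/* FPCR.FZ (flush-to-zero) is bit 24, the mask used by the quoted check. */
#define FPCR_FZ_MASK UINT64_C(0x1000000)

/* Mirrors "and x8, x8, #0x1000000; cbz x8, ...": true means the
   control word is corrupted in the way this issue describes. */
static bool fpcr_has_flush_to_zero(uint64_t fpcr) {
    return (fpcr & FPCR_FZ_MASK) != 0;
}

/* The corresponding fix-up: clear FZ and leave every other bit alone. */
static uint64_t fpcr_clear_flush_to_zero(uint64_t fpcr) {
    return fpcr & ~FPCR_FZ_MASK;
}
```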

I just ran some tests based on b25 and it seems much less frequent but it still occurs. It had about a 75% failure rate on an OL9 machine with b23 and I've only gotten a single failure in about 50 runs on the b25 build. So it's better but I don't think it's fully correct. Adding checks of the control word in generated code is really the only way to know that it's fully correct.
26-11-2023

If I understand it correctly, JDK-8319927 is specifically about the case of the library .init causing the control word to change, which seems like a subset of the possible general problem, but from the description it might have been the source of our problem. We were testing on b23 and we've just started b25 so I'll test if the problem is still reproducible. Shouldn't JDK-8319973 also fix the generated native wrappers? Those don't go through the call_stub.
25-11-2023

Is this issue resolved by JDK-8319973? Note we also have JDK-8319927 to try and show which library is causing the problem.
24-11-2023

After lots of debugging it appears that some code on those OL9 machines is setting the flush to zero flag in the floating point control register. This is breaking our usage of the FCMP instruction because it treats any bit pattern that looks like a denorm as 0. According to the aarch64 manual, setting this control bit violates IEEE 754 compatibility which I assume is a problem for HotSpot as a whole. I notice that arm 32 has code for AlwaysRestoreFPU that tries to correct the control word on return from native methods but aarch64 has no such logic. I would think the native wrapper should unconditionally correct the control word on return or at least verify it in debug mode. Adding some code to verify the FPCR on return from native methods should make these tests fail in master. Anyway, it doesn't seem like this is a Graal bug but I'm not sure where it should go.
23-11-2023
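
The failure mode described in the comment above can be illustrated in C. An integer reinterpreted as a float has the bit pattern of a denormal whenever its exponent field is zero and its fraction is not, and a floating-point compare running with flush-to-zero enabled treats such an operand as 0. The sketch below is an illustration only, not Graal or HotSpot code, and all function names are hypothetical. It classifies a bit pattern using integer operations alone and includes a small probe that reports whether a denormal still compares as nonzero in the current floating-point environment.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit integer as an IEEE 754 single-precision value. */
static float float_from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Integer-only classification: exponent field all zero, fraction nonzero.
   The result does not depend on the floating-point control word. */
static bool is_subnormal_bits(uint32_t bits) {
    return (bits & 0x7F800000u) == 0 && (bits & 0x007FFFFFu) != 0;
}

/* Probe: under IEEE 754 semantics the smallest denormal is not equal to
   zero. volatile keeps the compiler from folding the compare away, so the
   answer reflects the control word in effect at run time. */
static bool subnormals_compare_nonzero(void) {
    volatile float tiny = float_from_bits(1u);
    return tiny != 0.0f;
}
```

In a default environment the probe returns true. A false result would indicate that denormal operands are being treated as zero, which is the condition the thread attributes to the flush-to-zero flag.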

ILW = JCK test fails with IllegalArgumentException: Invalid descriptor, javax/sound tests with Graal, no workaround but disable compilation of affected method = HMM = P2
15-11-2023