JDK-8004124 : Handle and/or warn about SI_KERNEL

Type: Bug
Component: hotspot
Sub-Component: runtime
Affected Version: hs25

Priority: P3
Status: Closed
Resolution: Fixed
OS: linux

Submitted: 2012-11-28
Updated: 2021-07-21
Resolved: 2013-06-21

Versions (Unresolved/Resolved/Fixed)

JDK 8: 8 (Fixed)
Other: hs25 (Fixed)

Related Reports

Duplicate :	JDK-8007019 - Crash: guarantee(cb->is_adapter_blob() || cb->is_method_handles_adapter_blob()) failed: exception happened outside interpreter, nmethods and vtable stubs (1)
Duplicate :	JDK-8186278 - JVM crash with guarantee(cb->is_adapter_blob() || cb->is_method_handles_adapter_blob()) failed: exception happened outside interpreter, nmethods and vtable stubs
Relates :	JDK-8023825 - fatal error: An irrecoverable SI_KERNEL SIGSEGV has occurred due to unstable signal handling in this distribution
Relates :	JDK-8166250 - fails on running java or even java -version
Relates :	JDK-7078412 - SIGSEGV in compareAndSwapLong
Relates :	JDK-8015837 - Nashorn crashes with tiered on x86 when running v8 benchmark
Relates :	JDK-8271006 - SpringBoot application crash at libc.so.6 __libc_malloc

Description

Running javac from specjvm98 on 32-bit Linux (carrs) with the product VM intermittently crashes with si_code = 128, which is SI_KERNEL. The code that the crash points to is perfectly valid. I don't know why we get these signals (other products get them too), but we should warn about and/or ignore them so that we don't confuse this with a real bug.

From digging in the kernel, I think this is the source of the signal. I don't know why things have gone wrong with the Linux signal handler.

http://www.takatan.net/lxr/source/kernel/signal.c#L1250
void
force_sig(int sig, struct task_struct *p)
{
        force_sig_info(sig, SEND_SIG_PRIV, p);
}

/*
 * When things go south during signal handling, we
 * will force a SIGSEGV. And if the signal that caused
 * the problem was already a SIGSEGV, we'll want to
 * make sure we don't even try to deliver the signal..
 */
int
force_sigsegv(int sig, struct task_struct *p)
{
        if (sig == SIGSEGV) {
                unsigned long flags;
                spin_lock_irqsave(&p->sighand->siglock, flags);
                p->sighand->action[sig - 1].sa.sa_handler = SIG_DFL;
                spin_unlock_irqrestore(&p->sighand->siglock, flags);
        }
        force_sig(SIGSEGV, p);
        return 0;
}
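
The check proposed in the description could be sketched roughly as follows. This is a minimal illustration in C, not the actual HotSpot change: the helper name `si_kernel_warning` is hypothetical, and the message text is borrowed from the title of the related report JDK-8023825. The only facts it relies on are that Linux reports these crashes as SIGSEGV with si_code = 128 (SI_KERNEL).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

#ifndef SI_KERNEL
#define SI_KERNEL 0x80 /* 128 on Linux: the signal was sent by the kernel itself */
#endif

/*
 * Hypothetical helper (not a HotSpot function): returns a warning message
 * when a SIGSEGV arrives with si_code == SI_KERNEL, or NULL for any other
 * signal so the caller can continue with its normal handling.
 */
const char *si_kernel_warning(int sig, const siginfo_t *info)
{
    assert(info != NULL);
    if (sig == SIGSEGV && info->si_code == SI_KERNEL) {
        return "An irrecoverable SI_KERNEL SIGSEGV has occurred due to "
               "unstable signal handling in this distribution";
    }
    return NULL;
}
```

A handler installed with SA_SIGINFO could call this helper first and report the message before falling through to its usual crash reporting.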

Comments

See JDK-8015837 for David Simms' analysis of the OS bug that causes the SI_KERNEL crash.
29-08-2013
This may still be a JVM issue. I found this interesting discussion: http://unix.stackexchange.com/questions/71240/sigaction7-semantics-of-siginfo-ts-si-code-member in particular: "A segmentation violation that occurs as a result of userspace process accessing virtual memory above the TASK_SIZE limit will cause a segmentation violation with an si_code of SI_KERNEL"
24-07-2013
Yes, this version of Ubuntu displayed this error. It was not Xen or OVM in the case I saw but then I couldn't reproduce it to see the new error that I put in. It's nice to see that the error was caught correctly. SQE should upgrade this machine.
16-07-2013
The Xen issue has not been connected directly to this SI_KERNEL issue. As per Coleen's comments this was seen mostly on bare metal. Hence the error message refers to "unstable signal handling" in the Linux distribution, not to any bug in a hypervisor.
09-07-2013
Do we expect to find "unstable signal handling" in this distribution: OS:squeeze/sid uname:Linux 2.6.38-13-generic #53-Ubuntu SMP Mon Nov 28 19:23:39 UTC 2011 i686 libc:glibc 2.13 NPTL 2.13 ??
08-07-2013
According to the comments above, the bug is not in the Linux distribution but in the virtualization layer below (Xen or OVM).
08-07-2013
Yes, absolutely. When I could reproduce this it was on some non-trapping instruction 100% of the time (once a nop). I think it's the only explanation for 8014049.
20-06-2013
It may still be relevant. As far as I know, OVM is based on Xen. Have we ever seen an unexplainable SIGSEGV/SI_KERNEL on a non-trapping instruction on a bare-metal machine?
20-06-2013
You're right it says xen ... I thought it was a clue.
20-06-2013
So is that a linux bug or a Xen bug?
20-06-2013
I just found this in my mailbox from konrad.wilk@oracle.com:

I am on vacation today, but there was this one that went in the Linux tree recently:

commit a349e23d1cf746f8bdc603dcc61fae9ee4a695f6
Author: David Vrabel <david.vrabel@citrix.com>
Date: Fri Oct 19 17:29:07 2012 +0100

xen/x86: don't corrupt %eip when returning from a signal handler

In 32 bit guests, if a userspace process has %eax == -ERESTARTSYS (-512) or -ERESTARTNOINTR (-513) when it is interrupted by an event /and/ the process has a pending signal then %eip (and %eax) are corrupted when returning to the main process after handling the signal. The application may then crash with SIGSEGV or a SIGILL or it may have subtly incorrect behaviour (depending on what instruction it returned to).

This occurs because handle_signal() is incorrectly thinking that there is a system call that needs to be restarted so it adjusts %eip and %eax to re-execute the system call instruction (even though user space had not done a system call).

If %eax == -514 (-ERESTARTNOHAND (-514) or -ERESTART_RESTARTBLOCK (-516)) then handle_signal() only corrupted %eax (by setting it to -EINTR). This may cause the application to crash or have incorrect behaviour.
20-06-2013
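
The restart logic described in the commit message above can be modelled with a small, simplified sketch. This is not the kernel's code: the struct `model_regs`, the function `model_handle_restart`, and the `MODEL_*` constants are hypothetical names for illustration, and the assumption is that the kernel treats a non-negative `orig_ax` as "interrupted inside a system call". The numeric restart codes are the ones quoted in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Kernel-internal restart codes quoted in the commit message, plus EINTR. */
enum {
    MODEL_EINTR = 4,
    MODEL_ERESTARTSYS = 512,
    MODEL_ERESTARTNOINTR = 513,
    MODEL_ERESTARTNOHAND = 514,
    MODEL_ERESTART_RESTARTBLOCK = 516
};

/* Hypothetical, reduced register set: only what the restart logic touches. */
struct model_regs {
    long ax;          /* %eax: syscall return value, or arbitrary user data */
    long orig_ax;     /* syscall number, or negative when not in a syscall */
    unsigned long ip; /* %eip */
};

/*
 * Simplified model of the syscall-restart step run before a signal handler
 * is invoked. If orig_ax wrongly looks like a syscall number (the situation
 * the commit fixes), a user value in ax that happens to equal a restart
 * code causes ax and possibly ip to be rewritten.
 */
void model_handle_restart(struct model_regs *regs, bool sa_restart)
{
    assert(regs != NULL);
    if (regs->orig_ax < 0)
        return; /* not in a system call: leave the registers alone */

    switch (regs->ax) {
    case -MODEL_ERESTART_RESTARTBLOCK:
    case -MODEL_ERESTARTNOHAND:
        regs->ax = -MODEL_EINTR; /* only %eax is changed */
        break;
    case -MODEL_ERESTARTSYS:
        if (!sa_restart) {
            regs->ax = -MODEL_EINTR;
            break;
        }
        /* fall through */
    case -MODEL_ERESTARTNOINTR:
        regs->ax = regs->orig_ax; /* reload the syscall number */
        regs->ip -= 2;            /* back up over the 2-byte syscall instruction */
        break;
    default:
        break;
    }
}
```

In this model, a process that was never in a system call but is presented with `orig_ax >= 0` and `ax == -513` resumes two bytes before the interrupted instruction, which is one way to end up faulting on an instruction that cannot trap.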
I can't seem to reproduce this anymore on the machine where it was easy to reproduce. The OS there hasn't been upgraded.
14-06-2013
The problem with the non-canonical address is that x64 will simply raise a general protection fault and there is no way to find the address. I suspect that the reason for not giving the address is that the CPU doesn't actually need to represent the upper bits physically, since they are all either 1 or 0. The only way I can see us detecting this is disassembling the instruction at %rip and simulating it to determine whether it would yield a non-canonical address trap. I have no explanation for trapping on a nop though. There may be other causes of si_code=128 than non-canonical addresses, perhaps other types of general protection faults raised by the CPU. I've read somewhere that an MCE (machine check exception) can cause this as well, but that would indicate a physical hardware problem. Are you able to look in the kernel right after the SI_KERNEL? There may be some hints there if there was a machine check error.
14-06-2013
When I was digging through the jvm98 javac failures, the instruction we were trapping on was a nop. Do you think there is something in the signal handler that can see that we tried to read/write a non-canonical address? We really need a warning though. I think I'm chasing another one, which I can't reproduce.
14-06-2013
I believe that I've discovered what causes si_code=128. While digging through JDK-8005684 I noticed that the instruction we were trapping on was trying to write to an interesting address: 0x9090c3c95b08c483. It turns out that this address is not in what the Intel Software Developer's Manual calls "canonical form". That is, the most significant bits (which are currently unused) must be either all 1 or all 0. In current incarnations of x86_64, bits 48-63 are unused and must all have the same value as bit 47. Trying to read at such an address will trigger a #GP, and the kernel will raise a SIGSEGV without knowing the faulting address.

On Linux this will cause a SIGSEGV with si_code=128 (SI_KERNEL) and si_addr=0.
On Solaris this will cause a SIGSEGV with si_code=1 (SEGV_MAPERR) and si_addr=[%rip of faulting instruction].

On neither system will we actually find out the real faulting address, and I believe the only thing we can do is to let the VM crash. However, there seem to be problems with recovering even enough to run the VMError.report_and_die function, since many of these crashes look like they're recursive.
24-01-2013
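
The canonical-form rule from the comment above can be checked with a few lines of C. This is a minimal sketch that assumes 48-bit virtual addresses (bits 48-63 must be copies of bit 47); the function names `is_canonical_48` and `to_canonical_48` are made up for illustration and do not come from HotSpot or the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if addr is canonical under the assumed 48-bit scheme,
 * i.e. bits 47..63 (17 bits) are either all 0 or all 1.
 */
bool is_canonical_48(uint64_t addr)
{
    uint64_t upper = addr >> 47; /* keeps bits 47..63 */
    return upper == 0 || upper == 0x1FFFF;
}

/* Sign-extends bit 47 into bits 48..63, giving the canonical form of addr. */
uint64_t to_canonical_48(uint64_t addr)
{
    uint64_t low = addr & 0x0000FFFFFFFFFFFFULL; /* bits 0..47 */
    uint64_t result = (low & (1ULL << 47))
                          ? (low | 0xFFFF000000000000ULL)
                          : low;
    assert(is_canonical_48(result));
    return result;
}
```

With these definitions, `is_canonical_48(0x9090c3c95b08c483ULL)` is false, which is consistent with that address raising a general protection fault rather than an ordinary page fault.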