JDK-8009302 : Mac OS X: JVM crash on infinite recursion on Appkit Thread
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 7,8
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: os_x
  • Submitted: 2013-03-01
  • Updated: 2013-07-18
  • Resolved: 2013-06-06
Fix Versions:
  JDK 7: 7u40 (Fixed)
  JDK 8: 8 (Fixed)
  Other: hs24 (Fixed)
Description
The JVM crashes in case of infinite recursion on the AppKit thread. The crash log is attached.

A simple test is to run the following program with the -XstartOnFirstThread option.

public class TestInfiniteRecursion {

    public static void main(String[] args) {
       main(args);
    }
}

The bug is always reproducible. The operating system version is Mac OS X 10.7.5. Reproducible with JDK 7 and JDK 8; not reproducible with Apple JDK 6.

Please note that the crash occurs in any case of infinite recursion on the AppKit thread, not only when the main method is invoked infinitely.
Comments
Attaching XNU (Mac) specific tests that check XNU behavior with respect to signals on main vs non-main threads during recursion. Christian Tornqvist provided a JTREG wrapper test.
05-06-2013

JCK vm via UTE passes. There are 6440 tests that pass and 1162 that fail on both the original JCK7 run and the run with my fix, so the results are identical.
03-06-2013

JCK language via UTE passes. The first time I ran it, it returned one error, but I was using the computer for other tasks while the tests were running; when I reran it twice more without using the computer for anything else, it passed all the tests without any more problems.
03-06-2013

Filed a bug with Apple, "Inconsistent and problematic catch_mach_exception_raise in ux_exception.c" (id: 14047928), to address the XNU implementation here.

Summary: Inconsistent and problematic catch_mach_exception_raise in ux_exception.c

#1 The code that maps SIGBUS into SIGSEGV only works for the main thread. Secondary threads get sent the original SIGBUS signal, resulting in inconsistent behavior depending on which thread the code is running on - quite confusing (i.e. proc_findthread seems to return stack info for the process, not the thread?)

#2 The mapping of SIGBUS into SIGSEGV might be well intentioned, but it ends up "hijacking" user code. Consider user code that sets up its own protected guards inside the stack address space to catch recursions on its own. If a protected memory access happens inside the stack address space (outside the guard pages), the mapping from SIGBUS into SIGSEGV still happens!

#3 The mapping of SIGBUS into SIGSEGV uses a hardcoded MAXSSIZ value for the stack size - this will not work for arbitrary threads, which are free to use their own stack size.

#4 If the faulting address is caught inside the stack address space, the code should still attempt to deliver the signal, and not insist on an alternate stack. Only if the faulting address is outside the stack address space, but inside the guard pages, should an alternate stack be required.

#5 The code mapping SIGBUS into SIGSEGV refers to "stack space" as "stack address space" + "guard pages", which is confusing on cursory examination.

#6 The code uses the "sp" variable to mean "faulting address". "sp" (stack pointer) is an inept and confusing choice for the variable - it really should be named "faulting_address" or something similar.
Steps to Reproduce: Run simple recursion code on the main and a non-main thread with a BSD signal catcher defined (with an alternate stack). Observe that on the main thread the signal delivered will be SIGSEGV, but on non-main threads it will be SIGBUS. Additionally, accessing protected pages clearly inside the stack address space will still be delivered as SIGSEGV.
Expected Results: The same behavior regardless of which thread the recursion occurs on.
03-06-2013

The fix passes vm.quick.testlist, vm.signal.testlist, nsk.stack.testlist and JPRT. Trying to figure out how to run JCK on Mac.
28-05-2013

The fix passes vm.quick.testlist and nsk.stack.testlist (using the jdk8 code base). Working on figuring out how to submit a JPRT job as a remote employee, then JCK.
21-05-2013

Here is an "hg diff" of the workaround:

> hg diff src/os/bsd/vm/os_bsd.cpp
diff -r 5d395eb2626f src/os/bsd/vm/os_bsd.cpp
--- a/src/os/bsd/vm/os_bsd.cpp  Thu Feb 28 10:42:09 2013 -0800
+++ b/src/os/bsd/vm/os_bsd.cpp  Fri May 17 08:27:34 2013 -0500
@@ -3036,6 +3036,20 @@
     sigAct.sa_sigaction = signalHandler;
     sigAct.sa_flags = SA_SIGINFO|SA_RESTART;
   }
+
+#if __APPLE__
+  // Needed for main thread as XNU (Mac OS X kernel) will only deliver SIGSEGV
+  // (which starts as SIGBUS) on main thread with faulting address inside "stack+guard pages"
+  // if the signal handler declares it will handle it on alternate stack
+  // Notice we only declare we will handle it on alt stack, but we are not
+  // actually going to use real alt stack - this is just a workaround
+  // Please see ux_exception.c, method catch_mach_exception_raise for details
+  // link http://www.opensource.apple.com/source/xnu/xnu-2050.18.24/bsd/uxkern/ux_exception.c
+  if (sig == SIGSEGV) {
+    sigAct.sa_flags |= SA_ONSTACK;
+  }
+#endif
+
   // Save flags, which are set by ours
   assert(sig > 0 && sig < MAXSIGNUM, "vm signal out of expected range");
   sigflags[sig] = sigAct.sa_flags;
17-05-2013

I agree that the XNU code here is quite poor, starting with the comments (e.g. "sp" when it means the fault address, and "the entire stack space" meaning stack + guard pages) and ending with it hijacking the user signal and converting SIGBUS into SIGSEGV even when the fault is inside the stack address space (outside the guard pages) but in a user-established protected area. The XNU code assumes much and gets it quite wrong. I will most likely file a bug against this XNU code with Apple.
17-05-2013

To detect stack overflow the Apple code should only be checking for a fault in or beyond the guard pages. By checking within the stack it negates any additional library/application guard mechanism that might be put in place - as with the VM. Further, because it doesn't detect beyond the stack, it will only detect true overflows that still fit within the guard region. So by any measure this code is broken: it is neither necessary in its entirety, nor sufficient, to deal with stack overflow.

Great find on the workaround! Yes, based on this code:

    /*
     * If the thread/process is not ready to handle
     * SIGSEGV on an alternate stack, force-deliver
     * SIGSEGV with a SIG_DFL handler.
     */
    mask = sigmask(ux_signal);
    ps = p->p_sigacts;
    if ((p->p_sigignore & mask) ||
        (ut->uu_sigwait & mask) ||
        (ut->uu_sigmask & mask) ||
        (ps->ps_sigact[SIGSEGV] == SIG_IGN) ||
        (! (ps->ps_sigonstack & mask))) {
            p->p_sigignore &= ~mask;
            p->p_sigcatch &= ~mask;
            ps->ps_sigact[SIGSEGV] = SIG_DFL;
            ut->uu_sigwait &= ~mask;
            ut->uu_sigmask &= ~mask;
    }

it is sufficient that the signal action was marked as SA_ONSTACK to bypass the problem.
16-05-2013

It still seems to me that if checking for an overflow it should be checking if the fault lies outside the current stack not inside it. The main vs non-main thread is a different bug. Either way we don't have a choice but to use an alt_stack mechanism - something I still have reservations about.
16-05-2013

Again, the discussed XNU code defines the stack as including the guard pages, so when it checks whether the fault address is inside the stack it is in fact checking whether it is inside the "stack" OR the "stack guard pages" - ideally it should just be checking whether it is inside the guard pages, but that is how Apple chose to code it, and even though it might be confusing, that is not the problem here.

To give a direct example from the JavaVM test case in this issue: we set up our yellow guard pages inside the stack address space (i.e. above the glibc guard pages), so when SIGBUS happens the faulting address is inside the stack address space (i.e. outside the glibc guard pages) BUT inside our own guard pages (i.e. the yellow pages), so the XNU code referenced in this issue triggers and passes the "inside stack address" test. In the case of some native app without its own custom guard pages, the SIGBUS will happen outside the stack address space but inside the glibc guard pages, and since XNU counts those guard pages as being inside the stack address space, it again passes the "inside stack address" test.

The only problems here are: 1) this particular SIGSEGV will be delivered only if an alt stack is used, and 2) if Apple ever chooses to fix the "main vs non-main" issue, all JavaVM threads will start receiving different signals (i.e. SIGSEGV, not SIGBUS) in cases of "stack overflow" - or more specifically, "an attempted access inside protected pages".

Also, from my testing so far, a viable workaround here might be to register the SIGSEGV signal to be delivered on an alt stack without actually creating one, which seems enough to trick XNU into delivering that elusive SIGSEGV.
16-05-2013

The kernel asks for stack size and address that includes the guard pages. It then checks whether the address is inside that range. In that respect it is correct in my opinion - we are inside that code when we get SIGBUS, so we basically check here whether we're inside guard pages. The bug here is that it only works for main thread.
15-05-2013

This looks like a bug in the kernel exception code:

    /*
     * Stack overflow should result in a SIGSEGV signal
     * on the alternate stack.
     * but we have one or more guard pages after the
     * stack top, so we would get a KERN_PROTECTION_FAILURE
     * exception instead of KERN_INVALID_ADDRESS, resulting in
     * a SIGBUS signal.
     * Detect that situation and select the correct signal.
     */
    if (code[0] == KERN_PROTECTION_FAILURE && ux_signal == SIGBUS) {
        user_addr_t sp, stack_min, stack_max;
        int mask;
        struct sigacts *ps;

        sp = code[1];
        stack_max = p->user_stack;
        stack_min = p->user_stack - MAXSSIZ;
        if (sp >= stack_min && sp < stack_max) {
            /*
             * This is indeed a stack overflow. Deliver a
             * SIGSEGV signal.
             */
            ux_signal = SIGSEGV;

It states it is looking for stack overflow but then processes this path when the sp is _inside_ the stack, not outside it!
15-05-2013

Here is what is happening: XNU (the Mac OS X Mach-based kernel) uses a Mach exception handler to catch low-level machine exceptions, which it then converts into BSD signals (as long as the client does not install its own Mach exception handler, which is the case with JavaVM). XNU receives the EXC_BAD_ACCESS / KERN_PROTECTION_FAILURE Mach exception, which it normally maps into SIGBUS. However, in http://www.opensource.apple.com/source/xnu/xnu-2050.18.24/bsd/uxkern/ux_exception.c, method catch_mach_exception_raise, the XNU kernel looks at the faulting address, and if it lies inside the stack (i.e. stack + guard pages) then it converts it into SIGSEGV and will forward it only if the signal handler uses an alt stack. Without an alt stack it overrides the BSD signal handler to SIG_DFL, which results in a native crash caught by the Mac OS X native crash reporter, as seen in this case.

The reason it only happens for the main (primordial, AppKit) thread is that the XNU kernel looks up the task's stack address and size, which only works for the main thread and fails for all the other threads, even though the XNU kernel's intent here is (probably) to map all SIGBUS signals that happen inside a thread stack into SIGSEGV. So we are in effect getting lucky that the XNU kernel logic only works for the main thread, which is not normally used by JavaVM. If Apple ever fixes this issue, then we will start receiving SIGSEGV, not SIGBUS, for all the threads, and we will need alt stacks for all thread handlers.

The fix here is to use an alt stack for the BSD signal handler, which is basically what I already suggested back on 2013-04-09. I will clean that code up and provide a real webrev shortly.
14-05-2013

Here is the memory layout from a sample run:

    // Low memory addresses
    //
    // 0x7fff5bc00000 (glibc guard page)
    //
    // 0x7fff5f400000 (thread->_stack_base - thread->_stack_size)
    // 0x7fff5f401000 (thread->stack_red_zone_base())
    //
    // <------- 0x7fff5f402fe0 FAULTING ADDRESS CRASH (0x20 bytes inside yellow zone)
    //
    // 0x7fff5f403000 (thread->stack_yellow_zone_base())
    //
    // <------- 0x7fff5f418fe0 $RSP
    //
    // 0x7fff5fc00000 (thread->_stack_base)
    //
    // High memory addresses

Supporting info:

JAVA output from the test case (i.e. TestInfiniteRecursion6):

    >>>>>>>>>>>>>>> Java_TestInfiniteRecursion6_start on thread 0x103301000 [0]
    >>>>>>>>>>>>>>> attaching Appkit thread 0x7fff70e13180 [1]
    >>>>>>>>>>>>>>> stack info: bottom=0x7fff5f400000, size=0x800000 [8388608], top=0x7fff5fc00000 for thread 0x7fff70e13180
    >>>>>>>>>>>>>>> about to enter Java infinite recursion loop on 0x7fff70e13180 [1]

GDB:

    Program received signal SIGSEGV, Segmentation fault.
    (gdb) print thread->_stack_base
    $1 = (address) 0x7fff5fc00000 "????\a"
    (gdb) print thread->_stack_size
    $2 = 8388608
    (gdb) print thread->stack_yellow_zone_base()
    $3 = (address) 0x7fff5f403000 ""
    (gdb) print thread->stack_red_zone_base()
    $4 = (address) 0x7fff5f401000 <Address 0x7fff5f401000 out of bounds>
    (gdb) print thread->in_stack_yellow_zone(0x7fff5f402fe0)
    $5 = true
    (gdb) info mach-region 0x7fff5f402fe0
    Region from 0x7fff5f400000 to 0x7fff5f403000 (---, max rwx; copy, private, not-reserved)
    (gdb) info mach-regions
    ... from 0x7fff5bc00000 to 0x7fff5f403000 (---, max rwx; copy, private, not-reserved) (2 sub-regions)
    ... from 0x7fff5f403000 to 0x7fff5fc00000 (rw-, max rwx; copy, private, not-reserved) (3 sub-regions)

KERNEL (XNU modified file ux_exception.c, method catch_mach_exception_raise, link http://www.opensource.apple.com/source/xnu/xnu-2050.18.24/bsd/uxkern/ux_exception.c):

    May 13 18:07:43 gerards-MacBook-Pro kernel[0]: catch_mach_exception_raise
    May 13 18:07:43 gerards-MacBook-Pro kernel[0]: . ux_exception returns ux_signal=10
    May 13 18:07:43 gerards-MacBook-Pro kernel[0]: . code[0] == KERN_PROTECTION_FAILURE && ux_signal == SIGBUS
    May 13 18:07:43 gerards-MacBook-Pro kernel[0]: . sp: 0x7fff5f402fe0
    May 13 18:07:43 gerards-MacBook-Pro kernel[0]: . stack_max: 0x7fff5fc00000
    May 13 18:07:43 gerards-MacBook-Pro kernel[0]: . stack_min: 0x7fff5bc00000
    May 13 18:07:43 gerards-MacBook-Pro kernel[0]: . sp >= stack_min && sp < stack_max
    May 13 18:07:43 gerards-MacBook-Pro kernel[0]: . (p->p_sigignore & mask): 0
    May 13 18:07:43 gerards-MacBook-Pro kernel[0]: . (ut->uu_sigwait & mask): 0
    May 13 18:07:43 gerards-MacBook-Pro kernel[0]: . (ut->uu_sigmask & mask): 0
    May 13 18:07:43 gerards-MacBook-Pro kernel[0]: . (ps->ps_sigact[SIGSEGV] == SIG_IGN): 0
    May 13 18:07:43 gerards-MacBook-Pro kernel[0]: . (! (ps->ps_sigonstack & mask)): 1
    May 13 18:07:43 gerards-MacBook-Pro kernel[0]: . ((p->p_sigignore & mask) || (ut->uu_sigwait & mask) || (ut->uu_sigmask & mask) || (ps->ps_sigact[SIGSEGV] == SIG_IGN) || (! (ps->ps_sigonstack & mask)))
    May 13 18:07:43 gerards-MacBook-Pro kernel[0]: . threadsignal
    May 13 18:07:43 gerards-MacBook-Pro kernel[0]: . threadsignal: signal_setast
    May 13 18:07:43 gerards-MacBook-Pro kernel[0]: . result: 0

The "sp" here is the faulting address (i.e. 0x7fff5f402fe0).
14-05-2013

So I am still not seeing anything that indicates we have looked at the stack layout and the guard page locations and the faulting address to actually see what is going wrong here.
09-04-2013

After learning about signals, stacks and threads I found one possible solution that involves setting up an alternate stack (sigaltstack) and asking signals to use it (SA_ONSTACK). This would require us to modify 2 files:

- os_bsd.cpp, method os::Bsd::set_signal_handler needs:

    #if __APPLE__
      // needed by AppKit thread
      if (sig == SIGSEGV) {
        sigAct.sa_flags |= SA_ONSTACK;
      }
    #endif

- jni.cpp, method attach_current_thread (though it should probably live somewhere in thread.cpp instead) needs:

    #if __APPLE__
      // create alternate stack for SIGSEGV
      stack_t sigstack = {0};
      sigstack.ss_flags = 0;
      sigstack.ss_size = SIGSTKSZ;
      sigstack.ss_sp = valloc(sigstack.ss_size); // page aligned memory
      if (sigstack.ss_sp != NULL) {
        if (sigaltstack(&sigstack, NULL) == -1) {
          fprintf(stderr, "sigaltstack err\n");
        }
      } else {
        fprintf(stderr, "valloc err\n");
      }
    #endif

This would require us to slightly increase memory per app, as the alternate stack requires memory (SIGSTKSZ, i.e. 128K), though that value could possibly be lowered. Another advantage here is that if recursion happens even somewhere in native code, it will also be caught and a proper Java stack dump produced, as opposed to the Mac OS X CrashReporter popping up. The initial feedback from the Runtime team, however, is that we used to use an alternate stack, but we took it out because it was not robust - needs discussion.

Another solution is to use XNU signals (just like http://www.gnu.org/software/libsigsegv/) to catch the low-level signal and make sure our handler gets it (what about stack corruption if the thread that gets it was the one that had its stack corrupted? It seems we must somehow have a clean stack we know we can use).

This issue has a solution, but it needs more discussion.
09-04-2013

Attaching updated test cases, prototype code.
09-04-2013

Another point of info. Entering an infinite recursive loop in native code (after attaching the thread to the VM) has different results.

On a secondary user thread we get:

    >>>>>>>>>>>>>>> attaching native thread 0x19c89e000 [0]
    JavaThread::remove_stack_guard_pages for 0x10a50c000 [0]
    os::unguard_memory for 0x10a50c000 [0]
    os::bsd_mprotect for 0x10a50c000 [0]
    JavaThread::create_stack_guard_pages for 0x19c89e000 [0]
    os::guard_memory for 0x19c89e000 [0]
    os::bsd_mprotect for 0x19c89e000 [0]
    JavaThread::create_stack_guard_pages for 0x10a50c000 [0]
    os::guard_memory for 0x10a50c000 [0]
    os::bsd_mprotect for 0x10a50c000 [0]
    os::set_native_priority for 0x10a50c000 [0]
    os::set_native_priority for 0x19c89e000 [0]
    Illegal instruction: 4

On the Appkit thread we get:

    >>>>>>>>>>>>>>> attaching Appkit thread 0x7c64e180 [1]
    JavaThread::create_stack_guard_pages for 0x1072fa000 [0]
    JavaThread::create_stack_guard_pages for 0x7fff7c64e180 [1]
    os::guard_memory for 0x1072fa000 [0]
    os::guard_memory for 0x7fff7c64e180 [1]
    os::bsd_mprotect for 0x1072fa000 [0]
    os::bsd_mprotect for 0x7fff7c64e180 [1]
    os::set_native_priority for 0x1072fa000 [0]
    os::set_native_priority for 0x7fff7c64e180 [1]
    Segmentation fault: 11

Don't know what that means yet.
13-03-2013

SEGV is a process-directed signal when sent asynchronously like that, so the first thread that has it unblocked will process it.
13-03-2013

Instead of infinite recursion I tried signaling the process with SIGSEGV to see whether the signal would be caught, and it is caught in both cases.

The "working" case (SIGSEGV called on a user thread):

    >>>>>>>>>>>>>>> attaching native thread 0x19b1c3000 [0]
    os::unguard_memory for 0x108e35000 [0]
    os::bsd_mprotect for 0x108e35000 [0]
    JavaThread::create_stack_guard_pages for 0x19b1c3000 [0]
    os::guard_memory for 0x19b1c3000 [0]
    os::bsd_mprotect for 0x19b1c3000 [0]
    JavaThread::create_stack_guard_pages for 0x108e35000 [0]
    os::guard_memory for 0x108e35000 [0]
    os::bsd_mprotect for 0x108e35000 [0]
    os::set_native_priority for 0x108e35000 [0]
    os::set_native_priority for 0x19b1c3000 [0]
    >>>>>>>>>>>>>>> kill(0, SIGSEGV)
    JavaThread::remove_stack_guard_pages for 0x19b1c3000 [0]
    os::unguard_memory for 0x19b1c3000 [0]
    os::signalHandler for 0x7fff7c64e180 [1]
    os::bsd_mprotect for 0x19b1c3000 [0]
    sig: 11
    JVM_handle_bsd_signal for 0x7fff7c64e180 [1]
    os::get_preinstalled_handler for 0x7fff7c64e180 [1]
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x00007fff90d55686, pid=9550, tid=1799
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0) (build 1.8.0-internal-gerard_2013_03_08_15_51-b00)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b20 mixed mode bsd-amd64 compressed oops)
    # Problematic frame:
    # C  [libsystem_kernel.dylib+0x10686]  mach_msg_trap+0xa

The "failing" case (SIGSEGV called on the Appkit thread):

    >>>>>>>>>>>>>>> attaching Appkit thread 0x7c64e180 [1]
    os::unguard_memory for 0x107763000 [0]
    os::bsd_mprotect for 0x107763000 [0]
    JavaThread::create_stack_guard_pages for 0x7fff7c64e180 [1]
    os::guard_memory for 0x7fff7c64e180 [1]
    os::bsd_mprotect for 0x7fff7c64e180 [1]
    JavaThread::create_stack_guard_pages for 0x107763000 [0]
    os::guard_memory for 0x107763000 [0]
    os::bsd_mprotect for 0x107763000 [0]
    os::set_native_priority for 0x107763000 [0]
    os::set_native_priority for 0x7fff7c64e180 [1]
    >>>>>>>>>>>>>>> kill(0, SIGSEGV)
    os::signalHandler for 0x7fff7c64e180 [1]
    sig: 11
    JVM_handle_bsd_signal for 0x7fff7c64e180 [1]
    os::get_preinstalled_handler for 0x7fff7c64e180 [1]
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x00007fff90d56d46, pid=9555, tid=1799
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0) (build 1.8.0-internal-gerard_2013_03_08_15_51-b00)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b20 mixed mode bsd-amd64 compressed oops)
    # Problematic frame:
    # C  [libsystem_kernel.dylib+0x11d46]  __kill+0xa

What I don't understand is why in both cases the signal is actually caught on the Appkit thread, not the thread it was signaled from ([1] in the spewage means the Appkit thread). Also, the "Problematic frame" is different in these cases. Still, it does look like the signal is being caught in both cases, so the signal handler seems to be installed and called.
13-03-2013

Of course what we really need to see is where the faulting address is, and which of our guard regions, if any, it falls into - and whether they are actually guarded as expected. We recently had a problem on another platform where we crashed instead of getting a stack overflow. This was caused by a call into verification code during the signal handling. The problem was that the compiler for that platform generated a call frame that was larger than two pages - hence we skipped over the shadow region and crashed. This is just to demonstrate that there can be a lot of subtle interactions in having the guard pages and shadow pages work exactly as expected.
13-03-2013

Notice that in the good case what you have is:

    >>>>>>>>>>>>>>> attached native thread 0x194de6000
    signalHandler for 0x194de6000

This shows our signal handler being called to handle what turns out to be the stack overflow. The additional protect/guard/unguard is all part of signal handling. In the bad case this is missing - our signal handler is not called. Perhaps we need to check whether "install_signal_handlers" is actually doing the same thing in both cases.
13-03-2013

Attaching the updated test cases for this issue.
12-03-2013

I added printouts to various os_bsd methods to see what subset of those is called for the threads.

In case of a native thread (test5) we get:

    > java8 TestInfiniteRecursion5
    install_signal_handlers for 0x102adb000
    set_signal_handler for 0x102adb000
    set_signal_handler for 0x102adb000
    set_signal_handler for 0x102adb000
    set_signal_handler for 0x102adb000
    set_signal_handler for 0x102adb000
    set_signal_handler for 0x102adb000
    guard_memory for 0x102adb000
    bsd_mprotect for 0x102adb000
    protect_memory for 0x102adb000
    bsd_mprotect for 0x102adb000
    set_native_priority for 0x1944f8000
    set_native_priority for 0x102adb000
    set_native_priority for 0x102adb000
    guard_memory for 0x1945fb000
    bsd_mprotect for 0x1945fb000
    set_native_priority for 0x102adb000
    guard_memory for 0x1946fe000
    bsd_mprotect for 0x1946fe000
    guard_memory for 0x194943000
    bsd_mprotect for 0x194943000
    set_native_priority for 0x194943000
    set_native_priority for 0x102adb000
    guard_memory for 0x194a46000
    bsd_mprotect for 0x194a46000
    set_native_priority for 0x102adb000
    guard_memory for 0x194b49000
    bsd_mprotect for 0x194b49000
    guard_memory for 0x194c4c000
    bsd_mprotect for 0x194c4c000
    set_native_priority for 0x102adb000
    unguard_memory for 0x102adb000
    guard_memory for 0x194de6000
    bsd_mprotect for 0x102adb000
    bsd_mprotect for 0x194de6000
    guard_memory for 0x102adb000
    bsd_mprotect for 0x102adb000
    set_native_priority for 0x102adb000
    set_native_priority for 0x194de6000
    >>>>>>>>>>>>>>> attached native thread 0x194de6000
    signalHandler for 0x194de6000
    unguard_memory for 0x194de6000
    bsd_mprotect for 0x194de6000
    guard_memory for 0x194de6000
    bsd_mprotect for 0x194de6000
    Exception in thread "Thread-0" java.lang.StackOverflowError
        at TestInfiniteRecursion5.recursion(TestInfiniteRecursion5.java:10)
        ...
        at TestInfiniteRecursion5.recursion(TestInfiniteRecursion5.java:10)
    unguard_memory for 0x194de6000
    bsd_mprotect for 0x194de6000
    unguard_memory for 0x102adb000
    bsd_mprotect for 0x102adb000
    guard_memory for 0x1944f8000
    bsd_mprotect for 0x1944f8000

Summary: notice that for the user thread (0x194de6000), which we use to run the recursion, we set up the signal handler and call mprotect.
12-03-2013

Here is the corresponding output for test6, which uses the Appkit thread to run the recursion test:

    > java8 TestInfiniteRecursion6
    install_signal_handlers for 0x1010ea000
    set_signal_handler for 0x1010ea000
    set_signal_handler for 0x1010ea000
    set_signal_handler for 0x1010ea000
    set_signal_handler for 0x1010ea000
    set_signal_handler for 0x1010ea000
    set_signal_handler for 0x1010ea000
    guard_memory for 0x1010ea000
    bsd_mprotect for 0x1010ea000
    protect_memory for 0x1010ea000
    bsd_mprotect for 0x1010ea000
    set_native_priority for 0x192b08000
    set_native_priority for 0x1010ea000
    set_native_priority for 0x1010ea000
    guard_memory for 0x192c0b000
    bsd_mprotect for 0x192c0b000
    set_native_priority for 0x1010ea000
    guard_memory for 0x192d0e000
    bsd_mprotect for 0x192d0e000
    guard_memory for 0x193058000
    bsd_mprotect for 0x193058000
    set_native_priority for 0x193058000
    set_native_priority for 0x1010ea000
    guard_memory for 0x19315b000
    bsd_mprotect for 0x19315b000
    set_native_priority for 0x1010ea000
    guard_memory for 0x19325e000
    bsd_mprotect for 0x19325e000
    guard_memory for 0x193361000
    bsd_mprotect for 0x193361000
    set_native_priority for 0x1010ea000
    unguard_memory for 0x1010ea000
    bsd_mprotect for 0x1010ea000
    guard_memory for 0x7fff78cdd180
    guard_memory for 0x1010ea000
    bsd_mprotect for 0x7fff78cdd180
    bsd_mprotect for 0x1010ea000
    set_native_priority for 0x1010ea000
    set_native_priority for 0x7fff78cdd180
    >>>>>>>>>>>>>>> attached Appkit thread 0x78cdd180
    Segmentation fault: 11

Summary: notice that we don't do anything at all for the Appkit thread (0x78cdd180) after it gets attached to the VM, and it ends up crashing with a native signal.
12-03-2013

Any/all threads that attach to the VM have guard pages set up. I would suggest examining the pmap information to see exactly how the stack for the main thread is being set up compared to how we think it should be set up. Given the lack of an hs_err log, it would seem that the VM's native signal handler is not being invoked. Is there a chance that on OSX this is generating a different signal? Can we extract from a core dump the information we would expect to find in the hs_err log?
11-03-2013

Attaching a folder with test cases for this issue. TestInfiniteRecursion5 creates a native thread, which it then attaches to the JVM and calls into the Java recursion method, which works correctly. TestInfiniteRecursion6 calls back into the Cocoa main thread, which calls into the Java recursion method, which fails with a native crash on JDK7/8, but throws a Java exception on JDK6 (most of the time). So the problem is indeed with the Cocoa main thread, as pointed out by Anthony; -XstartOnFirstThread is just a shortcut and not necessary for this issue. On the Mac we need to treat the 1st thread that Java is started on (the Cocoa main thread) specially and set up thread stack guard pages, regardless of the fact that the Java launcher normally only uses that 1st thread to spawn off its Java main thread. On other platforms that 1st thread has no special significance apparently (?), but on Mac OS X it is the most important thread, which the user can get back on and can try to call back into Java from. So regardless of whether -XstartOnFirstThread is used or not, on the Mac we always need to set up the stack guard pages. Now I need to figure out how to do that.
11-03-2013

I'd like to point out that the -XstartOnFirstThread case is not the only one when this bug can be reproduced. You can replicate the issue by causing a stack-overflow on the event thread in a JavaFX application, too. And JavaFX does not use/require the -XstartOnFirstThread option at all. The crux of the bug is how the C-main() thread is handled by JVM on Mac regardless of whether the -XstartOnFirstThread is specified or not. This command-line option is only used to simplify the test case provided in the Description of the bug.
11-03-2013

Your test above, using attach(), is invalid. The thread is already attached to the VM, as it is executing Java code. The call to AttachCurrentThread is a no-op in that case. You would need to have your native method create a new native thread and have that attach to the VM and then invoke recursion(0). But then that would simply duplicate what already happens if you don't use -XstartOnFirstThread.
08-03-2013

pthread_attr_setguardsize just changes the size of the stack guards. mprotect causes an access to the 1st page to result in an exception. Need to verify that this is indeed how we protect the stack and whether we use it in this case.
08-03-2013

On "Linux" we call "pthread_attr_setguardsize" in os::create_thread to guard stack, but not on "bsd" (ie. Mac OS X). Need to figure out how we protect the stack on Mac, then why it's not done in case of -XstartOnFirstThread
08-03-2013

Manually attaching the native thread to the VM (which I hoped would fix up the thread like any other Java thread) did not work. The test case:

public class TestInfiniteRecursion4 {
    static native void attach();
    static void recursion(int x) { recursion(++x); }
    public static void main(String[] args) {
        System.loadLibrary("test");
        attach();
        recursion(0);
    }
}

where the native "attach" method is:

JNIEXPORT void JNICALL Java_TestInfiniteRecursion4_attach(JNIEnv *env, jclass class) {
    (*jVM)->AttachCurrentThread(jVM, (void **)&env, NULL);
}

still fails with a native thread crash, not a Java exception.
08-03-2013

To make sure it all really is about how the "main thread" is handled with -XstartOnFirstThread I created this test case:

public class TestInfiniteRecursion3 {
    public static void main(String[] args) {
        java.lang.Runnable r = new java.lang.Runnable() {
            void recursion(int x) { recursion(x++); }
            public void run() { recursion(0); }
        };
        java.lang.Thread t = new java.lang.Thread(r);
        t.start();
    }
}

which creates another thread for the recursion test, and passes (as expected) even when started with -XstartOnFirstThread. The conclusion would be that with -XstartOnFirstThread the "main thread" must indeed be handled differently than other threads (i.e. Java stack guard pages not set up, as David Holmes suggests?)
08-03-2013

To make sure it's nothing related to the "main" method itself I rewrote the test case as:

public class TestInfiniteRecursion2 {
    static void recursion(int x) { recursion(x++); }
    public static void main(String[] args) {
        recursion(0);
    }
}

which fails, as expected, the same way the original test does.
08-03-2013

I can't see anything obvious in the setup of the thread that creates the VM. There doesn't seem to be any special handling of the "main thread". This will need further in-depth investigation by someone with access to a Mac.
06-03-2013

To the person fixing this, please also test this on other platforms.
05-03-2013

The main thread is where C-main() is invoked on the Mac. Usually the Java launcher code parks this thread, creates a new one, and calls Java-main() on that new thread. However, with -XstartOnFirstThread specified, the Java launcher will call Java-main() on the C-main() thread itself. There are reasons for this behavior on the Mac (GUI-related). Anyway, since Java-main() is Java code, obviously, I assume the thread is attached to the JVM. See JVMInit() in src/macosx/bin/java_md_macosx.c. A crash log is attached to this bug.
05-03-2013

Sometimes on the Mac these files are created in the Java working directory when the app is crashing; however, I don't see the one for this particular case.
05-03-2013

I'm not sure if there are hs_err logs on Mac. Petr, did you see one? The AppKit thread and the main thread are the same entity. It's the C-main() thread. The name AppKit comes from the fact that all GUI code should always run on this thread, and lots of GUI APIs on the Mac belong to the AppKit framework.
05-03-2013

I could not find any created hs_err log. It looks like it is not created. Java prints "Segmentation fault" and that's it. The Appkit thread is the same as the C-main() thread, as I understand. It is shown in a crash log as "Dispatch queue: com.apple.main-thread". Additionally, no crash happens for the Apple JDK6 with the same test case. It shows the StackOverflowError as expected.
05-03-2013

But to clarify, what is the AppKit thread in relation to the main thread?
05-03-2013

Thanks for the info. If this is the original main thread of the process then it may be that there are no Java stack guard pages being allocated. The crash log is not a hs_err log. Was no hs_err log created?
05-03-2013

Is there an hs_err log file created at the time of the crash?
04-03-2013

Is the AppKit thread a Java thread? Stack-overflow detection, resulting in a StackOverflowError, only occurs for Java threads. Where is this AppKit thread created? Is it attached to the VM? We need more info on what this AppKit thread is.
04-03-2013

To get a StackOverflowError as we would get on any other thread. Not a native crash.
04-03-2013

What is the expected behavior for this infinite recursion test case?
04-03-2013

Petr is correct, the -XstartOnFirstThread is used here to simplify the test case. This must be a JVM bug, so I'm assigning the issue to the Hotspot team for evaluation.
04-03-2013

The -XstartOnFirstThread was provided here as an example of a simple test case to reproduce an issue. The problem occurs in JavaFX because it runs on the Appkit thread, and could affect AWT as a lot of Java code is run on the Appkit thread there too.
02-03-2013

-XstartOnFirstThread is a hack for environments such as SWT. I don't think it is supported for other usage.
02-03-2013