Bug ID: JDK-8237727 Mac: after we handle a crash, Apple's crash reporter is left with incorrect state

JDK-8237727 : Mac: after we handle a crash, Apple's crash reporter is left with incorrect state

Type: Enhancement
Component: hotspot
Sub-Component: runtime
Affected Version: 15

Priority: P4
Status: Closed
Resolution: Won't Fix
OS: os_x

Submitted: 2020-01-22
Updated: 2023-01-04
Resolved: 2023-01-04

Related Reports

Blocks :	JDK-8250637 - UseOSErrorReporting times out (on Mac and Linux)
Blocks :	JDK-8245833 - crash_with_sigfpe uses pthread_kill(SIGFPE) on macOS
Relates :	JDK-8237777 - "Dumping core ..." is shown despite claiming that "# No core dump will be written."
Relates :	JDK-8245833 - crash_with_sigfpe uses pthread_kill(SIGFPE) on macOS
Relates :	JDK-8274102 - Update VM crash reporting on macOS
Relates :	JDK-8250782 - UseOSErrorReporting causes stack space exception (Win)
Relates :	JDK-8220786 - Create new switch to redirect error reporting output to stdout or stderr
Relates :	JDK-8250637 - UseOSErrorReporting times out (on Mac and Linux)

Description

It would be nice if after a crash, a user could use both our own hs_err_pid log and Apple's crash reporter log. As it is, Apple crash report logs incorrect state of our own process handler, not the original crash:

Our hs_err_pid reports correctly:

V  [libjvm.dylib+0xc39db1]  VMError::controlled_crash(int)+0x3cf
V  [libjvm.dylib+0x6038fc]  JNI_CreateJavaVM+0x568
C  [libjli.dylib+0x3fc7]  JavaMain+0x10b
C  [libjli.dylib+0x6b5d]  ThreadJavaMain+0x9
C  [libsystem_pthread.dylib+0x5e65]  _pthread_start+0x94
C  [libsystem_pthread.dylib+0x183b]  thread_start+0xf

But Apple's crash reporter logs incorrectly:

Thread 2 Crashed:
0   libsystem_kernel.dylib        	0x00007fff70d067fa __pthread_kill + 10
1   libsystem_pthread.dylib       	0x00007fff70dc3bc1 pthread_kill + 432
2   libsystem_c.dylib             	0x00007fff70c8da1c abort + 120
3   libjvm.dylib                  	0x0000000109ad5737 os::abort(bool, void*, void const*) (.cold.1) + 125
4   libjvm.dylib                  	0x000000010975b611 os::abort(bool, void*, void const*) + 33 (os_bsd.cpp:1070)
5   libjvm.dylib                  	0x0000000109a3b0cd VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long) + 3057 (vmError.cpp:1624)
6   libjvm.dylib                  	0x0000000109a3a4b8 VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*, char const*, ...) + 148 (vmError.cpp:1291)
7   libjvm.dylib                  	0x0000000109a3b1a1 VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*) + 33 (vmError.cpp:1298)
8   libjvm.dylib                  	0x000000010975fda1 JVM_handle_bsd_signal + 787 (os_bsd_x86.cpp:791)
9   libjvm.dylib                  	0x000000010975d0cc signalHandler(int, __siginfo*, void*) + 89 (os_bsd.cpp:2631)
10  libsystem_platform.dylib      	0x00007fff70db842d _sigtramp + 29
11  ???                           	0x0000000000000080 0 + 128
12  libjvm.dylib                  	0x00000001094038fc JNI_CreateJavaVM + 1384 (jni.cpp:3935)
13  libjli.dylib                  	0x0000000107cabfc7 JavaMain + 267 (java.c:417)
14  libjli.dylib                  	0x0000000107caeb5d ThreadJavaMain + 9 (java_md_macosx.m:716)
15  libsystem_pthread.dylib       	0x00007fff70dc3e65 _pthread_start + 148
16  libsystem_pthread.dylib       	0x00007fff70dbf83b thread_start + 15

I.e. it reports our signal handler, which is a pity because otherwise it provides a wide range of info, which our own handler doesn't, such as back trace of all native threads (and line numbers to files instead of offsets, and list of binary images sorted by their addresses, and user friendly OS X version), which is super useful for tracking multithreaded issues.

Also, it would be nice if our hs_err_pid log could include a link to Apple's own crash log.

Just noticed: if "-XX:-CreateCoredumpOnCrash" was used, then we would exit the process via exit() API, so Apple's CrashReporter would see that as clean exit and wouldn't generate its own crash log at all.

Comments

Runtime Triage: This is not on our current list of priorities. We will consider this feature if we receive additional customer requirements.
04-01-2023
Comments from Thomas coming from his code review for JDK-8250637: - Your patch does not work for asserts, which makes kind of sense. UseOSErrorReporting will just cause the VM to continue after an assert. Easy to reproduce with java -XX:ErrorHandlerTest=1 -XX:+UseOSErrorReporting - we now just loop. - We should not do this for any signal other than synchronous error signals (SIGSEGV, SIGILL, SIGFPE, SIGBUS) where we can be sure that we re-crash right at the return pc. - Also, these signals should not have been raised with kill()/raise()/etc. In other words, they should be real faults. - We should not remove signal handlers for all signals. That may expose us to more problems. If the intent is to retrigger our fault, we only should install the default signal handler for that one fault.
04-11-2020
Decided to take out the change to test/hotspot/jtreg/runtime/ErrorHandling/CreateCoredumpOnCrash.java I made for this fix - it's better suited for core dumping related issue like JDK-8237777
07-02-2020
I don't know why "apple crash reporter cannot resolve the frame symbol name for "controlled_crash" in the before case, but can resolve it in the after case", but it's not something we can fix ourselves directly, other than by this proposed fix. The "intermittent situation which resolved itself in the time it took to write the error report" is an interesting point, which I haven't even considered yet, but even if that happens, then the hs_err_pid log still gets created as usual, so we don't loose any information. I will investigate "ErrorLogTimeout" mechanism though - thank you for pointing it out. I suppose the point here is that custom error handling is tricky at best (and strongly cautioned against by Apple tech support quoted), so this is our best effort and if we can help improve it by making sure that the native crash reporter kicks in and kicks in at the right state, then we're better off. I agree that this should be investigated on the other platforms - I simply noticed it on Mac first and it's my primary development platform, so that's where I started.
05-02-2020
Oh now I see it, in your first comment. The idea is neat. Sorry for not reading your comments more thoroughly. I see one problem though: If the crash has been spurious - caused by a intermittent situation which resolved itself in the time it took to write the error report - think eg. access to a temporarily protected or unmapped page which got unprotected or remapped again - the VM will now swallow the crash and live on. At least for a while, until the crash timeout kicks in (see ErrorLogTimeout) and kills the process. We also have this issue on other platforms, where a core dump would point to os::abort at the end of error handling. But I never thought this to be a real problem since error handling does not unwind the stack, just adds new stack frames. So, you could still go up the stack and search for the entry to VMError::report_and_die() and look for the real problem there. As you can in your report too, see your "before" crash report, frames 12 and 11. The real question is why the apple crash reporter cannot resolve the frame symbol name for "controlled_crash" in the before case, but can resolve it in the after case. The before case contains ??? for the real crashing frame. I think the idea is neat but may in rare cases cause the VM to live on in situations where it should be stopped, potentially causing further damage. It certainly improves readability since the crash report does not contain the error frames, but the crash report is not more precise than before since error handling ran and potentially a lot of state changed since the crash point. I also think if we want this, we should do this unconditionally on all Posix platforms.
05-02-2020
I have already implemented this and it is working (se the attached crash logs, i.e. "before" and "after" together with our hs_err_pid log) My proposal was stated in the bug, i.e. allow the process to re-crash (after having handled it first ourselves via hs_err_pid log and then de-registering our crash handler), which allows the native system to capture the crash site correctly. Dumping the core file works correctly with my fix (tested using Mach5 tier1,2,3,4,5,6,7) This issue only addresses Mac platform as stated in the bug. I have no experience with Windows at all, though I will be happy to look at Linux next (in a different issue).
05-02-2020
@gziemski: I dont see how you could solve this, what is your proposal? I think this problem is not Mac specific. We need to decide between System crash reporting and a good hs err file on all platforms. We have the choice between: A) writing the hs-err file first thing right after the crash, having access to a state which is as close as possible to the state causing the crash; and then writing a core dump or whatever the platform equivalent, which means state changed already. This is the default. B1) Non-windows: writing the core dump first, and since the only way to do this on Unix is calling abort(), doing nothing at all after that, so no hs-err file. B2) Windows: dumping the MiniDump first, and then continuing writing the hs-err file which is of reduced use since the world moved on and the VM state may be a lot different now The choice between A and B is SuppressFatalErrorMessage. Do you see a different way?
05-02-2020
Thank you Dan!
29-01-2020
vm/jni/FatalError/ferr001/ferr00101m1/ferr00101m1.html is a known failure: https://bugs.openjdk.java.net/browse/JCK-7313305 It is believed to be caused by the creation of a core file, i.e., it takes too long to create the core file...
29-01-2020
tier1,2,3,4,5,7 now pass, but there is an issue in tier6: vm/jni/FatalError/ferr001/ferr00101m1/ferr00101m1.html FATAL ERROR in native method: My Fatal Error at javasoft.sqe.tests.vm.jni.ferr001.ferr00101m1.ferr00101m1.ferr00101m1nm(Native Method) at javasoft.sqe.tests.vm.jni.ferr001.ferr00101m1.ferr00101m1.check(ferr00101m1.java:21) at javasoft.sqe.tests.vm.jni.ferr001.ferr00101m1.ferr00101m1.testChecks(ferr00101m1.java:38) at javasoft.sqe.tests.vm.jni.ferr001.ferr00101m1.ferr00101m1.run(ferr00101m1.java:93) at javasoft.sqe.tests.vm.jni.ferr001.ferr00101m1.ferr00101m1.main(ferr00101m1.java:98) Current thread is 7171 Dumping core ... result: Error. Program `/scratch/mesos/jib-master/install/2020-01-28-2055165.gerard.ziemski.jdk15/macosx-x64-debug.jdk/jdk-15/fastdebug/bin/java' interrupted! (timed out?)
29-01-2020
runtime/ErrorHandling/ErrorFileRedirectTest.java fixed. It kept creating core file, even though -XX:-CreateCoredumpOnCrash was used. Fixed by manipulating ulimit via API. Probably fixes the JDK-8237777 as well (on Mac)
29-01-2020
Mach5 test tier1-5 came back with one failure: runtime/ErrorHandling/ErrorFileRedirectTest.java test result: Error. Agent error: java.lang.Exception: Agent 3 timed out with a timeout of 480 seconds; check console log for any additional details Linked to https://bugs.openjdk.java.net/browse/JDK-8220786
28-01-2020
Here is an output of crashed thread from Apple crash reporter with my hacked fix: Thread 2 Crashed: 0 libjvm.dylib.......................0x000000010ed72ac9 VMError::controlled_crash(int) + 975 (vmError.cpp:1840) 1 libjvm.dylib.......................0x000000010eca2140 JNI_CreateJavaVM + 1384 (jni.cpp:3935) 2 libjli.dylib..........................0x000000010d3ebfc7 JavaMain + 267 (java.c:417) 3 libjli.dylib..........................0x000000010d3eeb5d ThreadJavaMain + 9 (java_md_macosx.m:716) 4 libsystem_pthread.dylib...0x00007fff70dc3e65 _pthread_start + 148 5 libsystem_pthread.dylib...0x00007fff70dbf83b thread_start + 15 compared to ours from hs_err_pid: V [libjvm.dylib+0xc39db1] VMError::controlled_crash(int)+0x3cf V [libjvm.dylib+0x6038fc] JNI_CreateJavaVM+0x568 C [libjli.dylib+0x3fc7] JavaMain+0x10b C [libjli.dylib+0x6b5d] ThreadJavaMain+0x9 C [libsystem_pthread.dylib+0x5e65] _pthread_start+0x94 C [libsystem_pthread.dylib+0x183b] thread_start+0xf Notice that Apple's crash reporter log shows files and line numbers, which ours does not.
22-01-2020
Our abort implementation: void os::abort(bool dump_core, void* siginfo, const void* context) { os::shutdown(); if (dump_core) { #ifndef PRODUCT fdStream out(defaultStream::output_fd()); out.print_raw("Current thread is "); char buf[16]; jio_snprintf(buf, sizeof(buf), UINTX_FORMAT, os::current_thread_id()); out.print_raw_cr(buf); out.print_raw_cr("Dumping core ..."); #endif ::abort(); // dump core } ::exit(1); } And quote from man pages: DESCRIPTION The abort() function causes abnormal program termination to occur, unless the signal SIGABRT is being caught and the signal handler does not return. Any open streams are flushed and closed. IMPLEMENTATION NOTES The abort() function is thread-safe. It is unknown if it is async-cancel-safe. The abort() function causes a report to be generated by Crash Reporter. If you wish to terminate without generating a crash report, use exit(3) instead.
22-01-2020
From https://forums.developer.apple.com/thread/113742 : "Finally, there’s the question of how to exit from your signal handler. You must not call exit. There’s two problems with doing that: - exit is not async signal safe. In fact, exit can run arbitrary code via handlers registered with atexit. If you want to exit the process, call _exit. - Exiting the process is a bad idea anyway, because it will either prevent the Apple crash reporter from running or cause it to log incorrect state (the state of your signal handler rather than the state of the crashed thread). A better solution is to unregister your signal handler (set it to SIG_DFL) and then return. This will cause the crashed process to continue execution, crash again, and generate a crash report via the Apple crash reporter."
22-01-2020