JDK-8193560 : Thread-SMR output during VMError reporting can be racy
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 10
  • Priority: P4
  • Status: Resolved
  • Resolution: Duplicate
  • OS: generic
  • CPU: generic
  • Submitted: 2017-12-14
  • Updated: 2021-02-11
  • Resolved: 2021-02-11
Related Reports
Duplicate :  
Relates :  
Description
We output info about Thread-SMR during an error report.
Normally Thread-SMR output is generated while the Threads_lock
is held so the info is stable. However, during error reporting
grabbing the Threads_lock is deadlock prone so we skip it. Of
course, this means that we're racy during error reporting.

Robbin reported this via email:

Hi Dan,

I sometimes see this, it seem like Threads::print_on_error uses a freed list. I think a thread removes itself exactly the same time as we get a signal and start doing error reporting.

/Robbin

==8527== Thread 79:
==8527== Invalid read of size 4
==8527==    at 0x52071BB: ThreadsList::length() const (threadSMR.hpp:173)
==8527==    by 0x6424C9E: ThreadsSMRSupport::print_info_elements_on(outputStream*, ThreadsList*) (threadSMR.cpp:1059)
==8527==    by 0x64248E2: ThreadsSMRSupport::print_info_on(outputStream*) (threadSMR.cpp:1002)
==8527==    by 0x641AEEA: Threads::print_on_error(outputStream*, Thread*, char*, int) (thread.cpp:4685)
==8527==    by 0x64AA164: VMError::report(outputStream*, bool) (vmError.cpp:807)
==8527==    by 0x64ABB35: VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long) (vmError.cpp:1394)
==8527==    by 0x64AB4DC: VMError::report_and_die(Thread*, char const*, int, char const*, char const*, __va_list_tag*) (vmError.cpp:1239)
==8527==    by 0x59F6B37: report_vm_error(char const*, int, char const*, char const*, ...) (debug.cpp:214)
==8527==    by 0x5B3F866: Thread::stack_base() const (thread.hpp:615)
==8527==    by 0x64A9838: VMError::report(outputStream*, bool) (vmError.cpp:672)
==8527==    by 0x64ABB35: VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long) (vmError.cpp:1394)
==8527==    by 0x64AB4DC: VMError::report_and_die(Thread*, char const*, int, char const*, char const*, __va_list_tag*) (vmError.cpp:1239)
==8527==  Address 0x3ed84da8 is 40 bytes inside a block of size 80 free'd
==8527==    at 0x4C2FD18: free (vg_replace_malloc.c:530)
==8527==    by 0x61F5DC3: os::free(void*) (os.cpp:777)
==8527==    by 0x5206D42: FreeHeap(void*) (allocation.inline.hpp:89)
==8527==    by 0x620F85A: CHeapObj<(MemoryType)2>::operator delete(void*) (allocation.inline.hpp:142)
==8527==    by 0x6423CA9: ThreadsSMRSupport::free_list(ThreadsList*) (threadSMR.cpp:732)
==8527==    by 0x6424452: ThreadsSMRSupport::remove_thread(JavaThread*) (threadSMR.cpp:893)
==8527==    by 0x6419BBC: Threads::remove(JavaThread*) (thread.cpp:4391)
==8527==    by 0x52152A2: SingleTestThread<void (*)(Thread*)>::postrun() (test_concurrentHashtable.cpp:61)
==8527==    by 0x5214E0E: SingleTestThread<void (*)(Thread*)>::run() (test_concurrentHashtable.cpp:57)
==8527==    by 0x61FEF79: thread_native_entry(Thread*) (os_linux.cpp:707)
==8527==    by 0x8C6536C: start_thread (in /usr/lib64/libpthread-2.25.so)
==8527==    by 0x8799B9E: clone (in /usr/lib64/libc-2.25.so)
==8527==  Block was alloc'd at
==8527==    at 0x4C2EB6B: malloc (vg_replace_malloc.c:299)
==8527==    by 0x61F58AC: os::malloc(unsigned long, MemoryType, NativeCallStack const&) (os.cpp:676)
==8527==    by 0x5206C83: AllocateHeap(unsigned long, MemoryType, NativeCallStack const&, AllocFailStrategy::AllocFailEnum) (allocation.inline.hpp:58)
==8527==    by 0x620F9FB: CHeapObj<(MemoryType)2>::operator new(unsigned long, NativeCallStack const&) (allocation.inline.hpp:95)
==8527==    by 0x620F751: CHeapObj<(MemoryType)2>::operator new(unsigned long) (allocation.inline.hpp:103)
==8527==    by 0x6422CB4: ThreadsList::add_thread(ThreadsList*, JavaThread*) (threadSMR.cpp:394)
==8527==    by 0x642398E: ThreadsSMRSupport::add_thread(JavaThread*) (threadSMR.cpp:655)
==8527==    by 0x6419AD1: Threads::add(JavaThread*, bool) (thread.cpp:4373)
==8527==    by 0x5215259: SingleTestThread<void (*)(Thread*)>::prerun() (test_concurrentHashtable.cpp:50)
==8527==    by 0x5214DEE: SingleTestThread<void (*)(Thread*)>::run() (test_concurrentHashtable.cpp:55)
==8527==    by 0x61FEF79: thread_native_entry(Thread*) (os_linux.cpp:707)
==8527==    by 0x8C6536C: start_thread (in /usr/lib64/libpthread-2.25.so) 
Comments
Closing as a duplicate of JDK-8231627.
11-02-2021

[~rehn] - Thanks for the confirmation.
11-02-2021

[~dcubed] Yes, I think closing a dup is the right thing here.
10-02-2021

The failure mode reported by [~rehn] is a crash here: ==8527== at 0x52071BB: ThreadsList::length() const (threadSMR.hpp:173) ==8527== by 0x6424C9E: ThreadsSMRSupport::print_info_elements_on(outputStream*, ThreadsList*) (threadSMR.cpp:1059) ==8527== by 0x64248E2: ThreadsSMRSupport::print_info_on(outputStream*) (threadSMR.cpp:1002) ==8527== by 0x641AEEA: Threads::print_on_error(outputStream*, Thread*, char*, int) (thread.cpp:4685) ==8527== by 0x64AA164: VMError::report(outputStream*, bool) (vmError.cpp:807) so we're in a VMError::report() and run into problems when we call ThreadsSMRSupport::print_info_on() which calls ThreadsSMRSupport::print_info_elements_on() on a ThreadsList that has been freed. I believe this issue was fixed via the following bug fix: JDK-8231627 ThreadsListHandleInErrorHandlingTest.java fails in printing all threads Here's the PR description for the fix: > A small robustness fix in ThreadsSMRSupport::print_info_on() to reduce the > likelihood of crashes during error reporting. Uses Threads_lock->try_lock() > for safety and restricts some reporting to when the Threads_lock has been > acquired (depends on JDK-8256383). Uses a ThreadsListHandle to make > the current ThreadsList safe for reporting (depends on JDK-8258284). Also > detects when the system ThreadsList (_java_thread_list) has changed and > will warn that some of the reported info may now be stale. The specific changes in JDK-8231627 that I believe address this bug: - Uses a ThreadsListHandle to make the current ThreadsList safe for reporting (depends on JDK-8258284). - The first print_info_elements_on() call using saved_threads_list is now made using a ThreadsList that is protected by a ThreadsListHandle. - Uses Threads_lock->try_lock() for safety and restricts some reporting to when the Threads_lock has been acquired (depends on JDK-8256383). - The second print_info_elements_on() call using _to_delete_list is now made only if the calling Thread owns the Threads_lock. Since the Threads_lock is owned, the _to_delete_list cannot be modified. So I think that this issue is resolved and should be closed as a duplicate of: JDK-8231627 ThreadsListHandleInErrorHandlingTest.java fails in printing all threads [~rehn] - Please let me know if you concur with my analysis here.
08-02-2021

Hmmmm... what about recognizing VMError::is_error_reported() in the ThreadsList freeing code and not freeing a ThreadsList in that case? We leave it on the to-be-freed list... so it's not "leaked", but it's not deleted either. Of course, that's still racy... it would be possible for the ThreadsList deleting thread to see '!VMError::is_error_reported()' and do the 'free' just as the crash reporting thread sees VMError::is_error_reported() and skips the Threads_lock grab...
14-12-2017

So this function: open/src/hotspot/share/runtime/thread.cpp: // Threads::print_on_error() is called by fatal error handler. It's possible // that VM is not at safepoint and/or current thread is inside signal handler. // Don't print stack trace, as the stack may not be walkable. Don't allocate // memory (even in resource area), it might deadlock the error handler. void Threads::print_on_error(outputStream* st, Thread* current, char* buf, int buflen) { ThreadsSMRSupport::print_info_on(st); st->cr(); which we clearly recognize as dangerous... and it results in a call to: open/src/hotspot/share/runtime/threadSMR.cpp: // Print Threads class SMR info. void ThreadsSMRSupport::print_info_on(outputStream* st) { // Only grab the Threads_lock if we don't already own it // and if we are not reporting an error. MutexLockerEx ml((Threads_lock->owned_by_self() || VMError::is_error_reported()) ? NULL : Threads_lock); st->print_cr("Threads class SMR info:"); where we skip grabbing the Threads_lock when we are reporting errors (because of the deadlock risk) and we call this function: open/src/hotspot/share/runtime/threadSMR.cpp: print_info_elements_on(st, _java_thread_list); Here's we're accessing the system JavaThread ThreadsList and we aren't using a ThreadsListHandle (because a nested ThreadsList would results in a Threads_lock grab...). So if we're reporting an error, then we're running the risk of accessing the '_java_thread_list' field without a Threads_lock. We might have just started our print_info_elements_on() call on a _java_thread_list value that was just replaced and freed. I know we ran into this during the last couple of months of testing, but we didn't have any good solutions. We don't want to always grab the Threads_lock because that can lead to deadlocking during error reporting... No good solutions are coming to mind... error reporting is such a dangerous place to be...
14-12-2017