JDK-8218975 : Bug in macOSX kernel's pthread support
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 13
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • OS: os_x
  • Submitted: 2019-02-14
  • Updated: 2019-08-08
  • Resolved: 2019-03-19
JDK 13: 13 b13 (Fixed)
Description
gc/stress/TestStressIHOPMultiThread.java exits with

test result: Failed. Unexpected exit from test [exit code: 134]

Potentially an issue with JDK-8217330 (the build includes the patches for JDK-8218680).

OSX only so far. No hs_err file; only a core file is available.
Comments
The timeout scenario is just a different manifestation of the same kernel issue. Trace:
- ThreadA grabs the mutex (which has the stale entry in the kernel).
- ThreadB, ThreadC and ThreadD try to grab the mutex and end up spinning in __psynch_mutexwait().
- ThreadA goes to pthread_cond_wait(), which fails to release the mutex and just returns. Upon return to the userspace side of the pthread library, _pthread_cond_updateval(cond, 0, updateval) gets executed, which does not affect the mutex state.
- ThreadA tries to regain the lock by calling pthread_mutex_lock() but ends up in the same __psynch_mutexwait().

Replacing the sleep(1) in phase 2 of the reproducer (stale_entry_test.c) with the following makes the test hang:

clock_gettime(CLOCK_REALTIME, &abst);
abst.tv_sec += 1;
pthread_cond_timedwait(ctx->worker_condvar, ctx->worker_mutex, &abst);
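For illustration, a minimal self-contained sketch of that modified phase 2. The context structure layout and the surrounding lock/unlock are assumptions for readability; only the three-line timedwait sequence above is taken from the actual reproducer.

#include <pthread.h>
#include <time.h>

/* Hypothetical stand-in for the reproducer's context; the member names are
   taken from the snippet above, the layout is assumed. */
struct test_ctx {
    pthread_mutex_t *worker_mutex;
    pthread_cond_t  *worker_condvar;
};

/* Phase 2 with sleep(1) replaced by a 1-second timed wait on the condvar.
   pthread_cond_timedwait() requires worker_mutex to be held; on the affected
   kernels this variant makes the test hang instead of crash. */
static void phase2(struct test_ctx *ctx) {
    struct timespec abst;
    clock_gettime(CLOCK_REALTIME, &abst);
    abst.tv_sec += 1;

    pthread_mutex_lock(ctx->worker_mutex);
    pthread_cond_timedwait(ctx->worker_condvar, ctx->worker_mutex, &abst);
    pthread_mutex_unlock(ctx->worker_mutex);
}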
15-03-2019

This turned out to be a bug in the macOSX kernel's pthread support (which may be fixed in OSX 10.14 (Mojave)). When using a condvar, an associated entry may be added to a kernel table and left there after the last use of the condvar. If we then destroy and delete the condvar and later happen to allocate and init a mutex at the same address (this really does happen), that stale condvar entry in the kernel table confuses operations on the mutex, causing crashes and hangs.

The OWSTTaskTerminator is subject to this problem because a new HotSpot Monitor object is allocated for each terminator. Terminators are allocated for various parallel tasks during garbage collection, so there are quite a few Monitor objects (and their contained PlatformMonitors) being created and deleted, and hence many opportunities to get the kernel into the bad state and then later reuse a previous condvar address for a mutex. That was enough for the stress test (gc/stress/TestStressIHOPMultiThread.java) to hit the problem occasionally.

This problem would not have shown up before JDK-8210832, which was made shortly before the first sighting of the stress test failure. Before that change we re-used park events in the implementation of HotSpot Mutex/Monitor, which meant we never had a mutex allocated at the same address as one of our former condvars.
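As a rough illustration of that reuse pattern (this is not the attached stale_entry_test.c and is heavily simplified; actually triggering the stale kernel entry requires contended waits across multiple threads, as in the real reproducer):

#include <pthread.h>
#include <stdlib.h>

int main(void) {
    /* Force the mutex to later occupy exactly the address the condvar had,
       mimicking HotSpot deleting a Monitor and later allocating a new one
       at the same malloc'ed address. */
    union {
        pthread_cond_t  cond;
        pthread_mutex_t mutex;
    } *slot = malloc(sizeof(*slot));

    /* 1. Use a condvar at this address, then destroy it.  On affected
       kernels an entry keyed by this address may remain in a kernel table. */
    pthread_cond_init(&slot->cond, NULL);
    /* ... contended pthread_cond_wait()/pthread_cond_signal() traffic ... */
    pthread_cond_destroy(&slot->cond);

    /* 2. Re-initialize the same storage as a mutex.  Operations on this
       mutex are what get confused by the stale condvar entry. */
    pthread_mutex_init(&slot->mutex, NULL);
    pthread_mutex_lock(&slot->mutex);
    pthread_mutex_unlock(&slot->mutex);
    pthread_mutex_destroy(&slot->mutex);

    free(slot);
    return 0;
}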
13-03-2019

Added apple_bugreport.txt with the description of the bug sent to Apple, and stale_entry_test.c to reproduce the failure outside the JVM.
13-03-2019

[~dholmes] The worker threads in the workgang are using the monitor. They've all completed their task and quiesced before the task (and the contained monitor) gets deleted. Good thought, and it could also be similar to https://sourceware.org/bugzilla/show_bug.cgi?id=13690, but that doesn't look like where we are.
28-02-2019

Which threads can access this monitor? My concern would be it being deleted whilst a previous thread that used it is still returning from the unlock path.
27-02-2019

The TaskTerminator in use is a member of the G1ParTask, i.e. the AbstractGangTask, and has the same lifecycle. The G1ParTask is created in the VM thread; it is destroyed when control returns to it. While the cause may be memory corruption, the issue does not seem to be the TaskTerminator/Monitor lifecycle.
25-02-2019

The error message printed by the pthread library when exiting is "__psynch_mutexdrop failed with error 22". Error 22 is EINVAL, so either the values passed to __psynch_mutexdrop are correct and the problem is in the pthread implementation, or the values are incorrect, which might indicate some memory corruption. I was able to build the pthread library from source and to reproduce the failure after a couple of hundred attempts. Added logging in several places in the library and am currently waiting for the test to fail.
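A trivial check of the errno mapping (illustrative only, not part of the investigation):

#include <errno.h>
#include <stdio.h>
#include <string.h>

/* On macOS, errno 22 is EINVAL ("Invalid argument"), the value reported in
   the "__psynch_mutexdrop failed with error 22" abort message. */
int main(void) {
    printf("errno 22 -> %s (EINVAL == %d)\n", strerror(22), EINVAL);
    return 0;
}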
25-02-2019

What is the lifecycle of the TaskTerminator that contains the Monitor? Some kind of memory corruption would seem most likely given the symptoms.
25-02-2019

The issue seems independent of the OS version (fails on 10.10 and 10.13) and occurs on many different machines.
20-02-2019

Using PTHREAD_MUTEX_ERRORCHECK for mutex initialization does not make a difference.
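For reference, switching the mutex to PTHREAD_MUTEX_ERRORCHECK at the pthread level looks roughly like this (a sketch only; the exact attribute handling in os::PlatformMonitor may differ):

#include <pthread.h>

/* Sketch: initialize a mutex with the ERRORCHECK type so that unlocking a
   mutex not owned by the caller returns EPERM instead of being undefined
   behaviour.  As noted above, this made no difference for this failure. */
static int init_errorcheck_mutex(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    int err = pthread_mutexattr_init(&attr);
    if (err != 0) return err;
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    err = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return err;
}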
19-02-2019

Investigated the timeouts a bit; after looking at a few of them, they basically always look the same: one thread is wait()'ing on the OWSTTaskTerminator::_blocker lock at owstTaskTerminator.cpp:77, executing _blocker->wait(true, WorkStealingSleepMillis); // 1ms, while the others are blocking on that same lock using one of the MutexLockerEx'es, i.e. at owstTaskTerminator.cpp:170 or :45. Typically it is one thread at 170 and two at 45.
19-02-2019

Looking at the JVM arguments used in the failing test runs, I tried changing G1HeapRegionSize, since it seems to be the option that matters most for reproduction. I ran the test with the JVM arguments used in the first run, changing only G1HeapRegionSize. With values of 1m or 2m the test fails about 2 in 300 runs; with 4m about 1 in 300. With 8m I ran the test about 600 times and did not see a failure.
19-02-2019

Can reproduce with JDK-8210832 (1 in 900) but not without it (0 in 900), with exactly that change in or out. Note that this one failure was a timeout, not exit code 134, so it may be something different.
16-02-2019

Reproduction frequency is roughly 1 in 200 runs, OSX only (sample size 1000 runs; I stopped testing other platforms at some point).
15-02-2019

Without JDK-8217330 (but with JDK-8210832; tag jdk-13+7) the test seems to time out at around the same frequency, and sometimes gets the exit code 134 crash. The core files still need to be investigated. Neither jdk-13+6 (without JDK-8210832) nor jdk-13+7 runs with the old task terminator show timeouts or exit code 134 crashes. So either there is a problem with the OWST task terminator that has been uncovered by JDK-8210832, or JDK-8210832 itself is somehow responsible (although it still needs to be confirmed that JDK-8210832 is the cause, since this has only been narrowed down to repository tags).
15-02-2019

The second failure shows a different situation with the same result. The SemaphoreGangTaskDispatcher:

(lldb) print *this
(SemaphoreGangTaskDispatcher) $2 = {
  _task = 0x000000012ea8f568
  _started = 2
  _not_finished = 4
  _start_semaphore = 0x00007fe4abc09a70
  _end_semaphore = 0x00007fe4abc09a80
}

Two threads are already offering termination, one of them getting the same _pthread_abort (#5), the other waiting (#21):

* thread #5, stop reason = signal SIGSTOP
    frame #0: 0x00007fff8c3d2286 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff8a86c9f9 libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fff8db129a3 libsystem_c.dylib`abort + 129
    frame #3: 0x00007fff8a86de1f libsystem_pthread.dylib`__pthread_abort + 49
    frame #4: 0x00007fff8a86ded3 libsystem_pthread.dylib`__pthread_abort_reason + 180
    frame #5: 0x00007fff8a868aa1 libsystem_pthread.dylib`pthread_mutex_unlock + 246
    frame #6: 0x0000000102be125f libjvm.dylib`OWSTTaskTerminator::offer_termination(this=0x00007fe4abfd77a0, terminator=0x0000000000000000) at owstTaskTerminator.cpp:60 [opt]
    frame #7: 0x00000001028e2a1f libjvm.dylib`G1ParEvacuateFollowersClosure::do_void() [inlined] ParallelTaskTerminator::offer_termination(this=<unavailable>) at taskqueue.hpp:488 [opt]
    frame #8: 0x00000001028e2a17 libjvm.dylib`G1ParEvacuateFollowersClosure::do_void() [inlined] G1ParEvacuateFollowersClosure::offer_termination() at g1CollectedHeap.cpp:3233 [opt]
    frame #9: 0x00000001028e29ca libjvm.dylib`G1ParEvacuateFollowersClosure::do_void(this=0x000000012b900d18) at g1CollectedHeap.cpp:3248 [opt]
    ...

  thread #21, stop reason = signal SIGSTOP
    frame #0: 0x00007fff8c3d2166 libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff8a868696 libsystem_pthread.dylib`_pthread_mutex_lock + 480
    frame #2: 0x0000000102bb063f libjvm.dylib`Monitor::lock_without_safepoint_check() [inlined] Monitor::lock_without_safepoint_check(this=0x00007fe4abfdac90, self=<unavailable>) at mutex.cpp:98 [opt]
    frame #3: 0x0000000102bb0636 libjvm.dylib`Monitor::lock_without_safepoint_check(this=0x00007fe4abfdac90) at mutex.cpp:104 [opt]
    frame #4: 0x0000000102be11e9 libjvm.dylib`OWSTTaskTerminator::offer_termination(this=0x00007fe4abfd77a0, terminator=0x0000000000000000) at owstTaskTerminator.cpp:45 [opt]
    frame #5: 0x00000001028e2a1f libjvm.dylib`G1ParEvacuateFollowersClosure::do_void() [inlined] ParallelTaskTerminator::offer_termination(this=<unavailable>) at taskqueue.hpp:488 [opt]
    frame #6: 0x00000001028e2a17 libjvm.dylib`G1ParEvacuateFollowersClosure::do_void() [inlined] G1ParEvacuateFollowersClosure::offer_termination() at g1CollectedHeap.cpp:3233 [opt]
    frame #7: 0x00000001028e29ca libjvm.dylib`G1ParEvacuateFollowersClosure::do_void(this=0x000000012f73ed18) at g1CollectedHeap.cpp:3248 [opt]
14-02-2019

Digging into thread #10 (VM thread) shows:

(lldb) frame select 4
frame #4: 0x000000010f107269 libjvm.dylib`WorkGang::run_task(this=0x00007ffcd370d280, task=0x000000013be2d470, num_workers=4) at workgroup.cpp:284 [opt]
(lldb) frame variable
(WorkGang *) this = 0x00007ffcd370d280
(G1ParTask *) task = 0x000000013be2d470
(uint) num_workers = 4
(uint) old_num_workers = 4
(lldb) print *this
(WorkGang) $2 = {
  AbstractWorkGang = {
    _workers = 0x00007ffcd370d3f0
    _total_workers = 4
    _active_workers = 4
    _created_workers = 4
    _name = 0x000000010f178fe9 "GC Thread"
    _are_GC_task_threads = true
    _are_ConcurrentGC_threads = false
  }
  _dispatcher = 0x00007ffcd370d2f0
}
(lldb) print *_dispatcher
(GangTaskDispatcher) $3 = {}
(lldb) print _dispatcher
(SemaphoreGangTaskDispatcher *) $4 = 0x00007ffcd370d2f0
(lldb) print *(SemaphoreGangTaskDispatcher *) $4
(SemaphoreGangTaskDispatcher) $5 = {
  _task = 0x000000013be2d470
  _started = 4
  _not_finished = 3
  _start_semaphore = 0x00007ffcd370d350
  _end_semaphore = 0x00007ffcd370d3a0
}

So thread #5, which is also a worker thread (thread #7 is a marking thread), already seems to have finished.

Looking at the TaskTerminator instance (via thread #24):

(lldb) frame select 7
frame #7: 0x000000010ee956e1 libjvm.dylib`OWSTTaskTerminator::offer_termination(this=0x00007ffcd35bbdf0, terminator=0x0000000000000000) at owstTaskTerminator.cpp:80 [opt]
(lldb) frame variable
(OWSTTaskTerminator *) this = 0x00007ffcd35bbdf0
(TerminatorTerminator *) terminator = 0x0000000000000000
(GangWorker *) the_thread = 0x00007ffcd5032800
(size_t) tasks = <variable not available>
(lldb) print *this
(OWSTTaskTerminator) $12 = {
  ParallelTaskTerminator = {
    _n_threads = 4
    _queue_set = 0x00007ffcd370ac78
    _pad_buf0 = { [0] = '?' [1] = '?' [2] = '?' ... }
    _offered_termination = 4
    _pad_buf1 = { [0] = '?' [1] = '?' [2] = '?' [3] = '?' [4] = '?' [5] = '?' ... }
  }
  _blocker = 0x00007ffcd35bbf50
  _spin_master = 0x00007ffcd4027800
}

This seems okay (_offered_termination == 4).
14-02-2019

Linked JDK-8204947 (Shenandoah Task Terminator) and JDK-8210832 (Remove sneaky locking in class Monitor): the former mainly because it's "new", and the latter because it significantly rewrote monitors in the VM.
14-02-2019

In the meantime, could not reproduce in 500+ runs (across all platforms).
14-02-2019

Initial investigation of the failures: the failure occurs at the end(?) of the actual evacuation (G1ParTask). One worker thread gets aborted in the pthread_mutex_unlock call when offering termination (in the two cases it's either line 60 or line 80 of owstTaskTerminator.cpp). Interestingly, while the AbstractGangTask is always called with num_workers == 4, only some of those worker threads are offering termination; all other threads are waiting for a task from the task dispatcher. E.g. for the failure in the '376 build:

Thread #24 gets a __pthread_abort when trying to offer termination:

* thread #24, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff8d675286 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff946c99f9 libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fff8a82c9a3 libsystem_c.dylib`abort + 129
    frame #3: 0x00007fff946cae1f libsystem_pthread.dylib`__pthread_abort + 49
    frame #4: 0x00007fff946caed3 libsystem_pthread.dylib`__pthread_abort_reason + 180
    frame #5: 0x00007fff946c5aa1 libsystem_pthread.dylib`pthread_mutex_unlock + 246
    frame #6: 0x000000010ee8a0cf libjvm.dylib`os::PlatformMonitor::unlock(this=<unavailable>) at os_posix.cpp:2227 [opt]
    frame #7: 0x000000010ee956e1 libjvm.dylib`OWSTTaskTerminator::offer_termination(this=0x00007ffcd35bbdf0, terminator=0x0000000000000000) at owstTaskTerminator.cpp:80 [opt]
    frame #8: 0x000000010e98ac3c libjvm.dylib`G1ParEvacuateFollowersClosure::do_void() [inlined] ParallelTaskTerminator::offer_termination(this=<unavailable>) at taskqueue.hpp:488 [opt]
    frame #9: 0x000000010e98ac34 libjvm.dylib`G1ParEvacuateFollowersClosure::do_void() [inlined] G1ParEvacuateFollowersClosure::offer_termination() at g1CollectedHeap.cpp:3227 [opt]
    frame #10: 0x000000010e98ac10 libjvm.dylib`G1ParEvacuateFollowersClosure::do_void(this=0x000000013cb81d58) at g1CollectedHeap.cpp:3242 [opt]
    frame #11: 0x000000010e991ee7 libjvm.dylib`G1ParTask::work(this=0x000000013be2d470, worker_id=0) at g1CollectedHeap.cpp:3297 [opt]

Some other threads are also offering termination:

  thread #25, stop reason = signal SIGSTOP
    frame #0: 0x00007fff8d675166 libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff946c5696 libsystem_pthread.dylib`_pthread_mutex_lock + 480
    frame #2: 0x000000010ee8a095 libjvm.dylib`os::PlatformMonitor::lock(this=<unavailable>) at os_posix.cpp:2222 [opt]
    frame #3: 0x000000010ee3644a libjvm.dylib`Monitor::lock_without_safepoint_check(this=0x00007ffcd35bbf50, self=0x00007ffcd4027800) at mutex.cpp:98 [opt]
    frame #4: 0x000000010ee95a70 libjvm.dylib`OWSTTaskTerminator::do_spin_master_work(TerminatorTerminator*) [inlined] MonitorLockerEx::MonitorLockerEx(this=0x000000010f4168a0, monitor=0x00007ffcd35bbf50, no_safepoint_check=true) at mutexLocker.hpp:252 [opt]
    frame #5: 0x000000010ee95a5f libjvm.dylib`OWSTTaskTerminator::do_spin_master_work(TerminatorTerminator*) [inlined] MonitorLockerEx::MonitorLockerEx(this=0x000000010f4168a0, monitor=0x00007ffcd35bbf50, no_safepoint_check=true) at mutexLocker.hpp:253 [opt]
    frame #6: 0x000000010ee95a5f libjvm.dylib`OWSTTaskTerminator::do_spin_master_work(this=<unavailable>, terminator=0x0000000000000000) at owstTaskTerminator.cpp:170 [opt]
    frame #7: 0x000000010ee9566d libjvm.dylib`OWSTTaskTerminator::offer_termination(this=0x00007ffcd35bbdf0, terminator=0x0000000000000000) at owstTaskTerminator.cpp:62 [opt]
    frame #8: 0x000000010e98ac3c libjvm.dylib`G1ParEvacuateFollowersClosure::do_void() [inlined] ParallelTaskTerminator::offer_termination(this=<unavailable>) at taskqueue.hpp:488 [opt]
    frame #9: 0x000000010e98ac34 libjvm.dylib`G1ParEvacuateFollowersClosure::do_void() [inlined] G1ParEvacuateFollowersClosure::offer_termination() at g1CollectedHeap.cpp:3227 [opt]
    frame #10: 0x000000010e98ac10 libjvm.dylib`G1ParEvacuateFollowersClosure::do_void(this=0x000000013cc84d58) at g1CollectedHeap.cpp:3242 [opt]
    frame #11: 0x000000010e991ee7 libjvm.dylib`G1ParTask::work(this=0x000000013be2d470, worker_id=2) at g1CollectedHeap.cpp:3297 [opt]
    frame #12: 0x000000010f107516 libjvm.dylib`GangWorker::run_task(this=0x00007ffcd4027800, data=WorkData @ 0x00007fafeab55120) at workgroup.cpp:335 [opt]
    [...]

  thread #26, stop reason = signal SIGSTOP
    frame #0: 0x00007fff8d675166 libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff946c5696 libsystem_pthread.dylib`_pthread_mutex_lock + 480
    frame #2: 0x00007fff946c85ee libsystem_pthread.dylib`_pthread_cond_wait + 835
    frame #3: 0x000000010ee8a1ae libjvm.dylib`os::PlatformMonitor::wait(this=<unavailable>, millis=100000000000) at os_posix.cpp:2251 [opt]
    frame #4: 0x000000010ee369d5 libjvm.dylib`Monitor::wait(this=0x00007ffcd35bbf50, no_safepoint_check=<unavailable>, timeout=1, as_suspend_equivalent=<unavailable>) at mutex.cpp:178 [opt]
    frame #5: 0x000000010ee9563f libjvm.dylib`OWSTTaskTerminator::offer_termination(this=0x00007ffcd35bbdf0, terminator=0x0000000000000000) at owstTaskTerminator.cpp:77 [opt]
    frame #6: 0x000000010e98ac3c libjvm.dylib`G1ParEvacuateFollowersClosure::do_void() [inlined] ParallelTaskTerminator::offer_termination(this=<unavailable>) at taskqueue.hpp:488 [opt]
    frame #7: 0x000000010e98ac34 libjvm.dylib`G1ParEvacuateFollowersClosure::do_void() [inlined] G1ParEvacuateFollowersClosure::offer_termination() at g1CollectedHeap.cpp:3227 [opt]
    frame #8: 0x000000010e98ac10 libjvm.dylib`G1ParEvacuateFollowersClosure::do_void(this=0x000000013cd87d58) at g1CollectedHeap.cpp:3242 [opt]
    frame #9: 0x000000010e991ee7 libjvm.dylib`G1ParTask::work(this=0x000000013be2d470, worker_id=3) at g1CollectedHeap.cpp:3297 [opt]
    frame #10: 0x000000010f107516 libjvm.dylib`GangWorker::run_task(this=0x00007ffcd402a800, data=WorkData @ 0x00007fafeab613d0) at workgroup.cpp:335 [opt]
    [...]

Other gang worker threads are still waiting for work:

  thread #5, stop reason = signal SIGSTOP
    frame #0: 0x00007fff8d67051a libsystem_kernel.dylib`semaphore_wait_trap + 10
    frame #1: 0x000000010ef482cb libjvm.dylib`OSXSemaphore::wait(this=0x00007ffcd370d358) at semaphore_bsd.cpp:63 [opt]
    frame #2: 0x000000010f107dde libjvm.dylib`SemaphoreGangTaskDispatcher::worker_wait_for_task() [inlined] Semaphore::wait(this=<unavailable>) at semaphore.hpp:56 [opt]
    frame #3: 0x000000010f107dd5 libjvm.dylib`SemaphoreGangTaskDispatcher::worker_wait_for_task(this=0x00007ffcd370d2f0) at workgroup.cpp:158 [opt]
    frame #4: 0x000000010f1075ce libjvm.dylib`GangWorker::loop() [inlined] GangWorker::wait_for_task() at workgroup.cpp:324 [opt]
    frame #5: 0x000000010f1075bd libjvm.dylib`GangWorker::loop(this=0x00007ffcd500d000) at workgroup.cpp:343 [opt]
    [...]

  thread #7, stop reason = signal SIGSTOP
    frame #0: 0x00007fff8d67051a libsystem_kernel.dylib`semaphore_wait_trap + 10
    frame #1: 0x000000010ef482cb libjvm.dylib`OSXSemaphore::wait(this=0x00007ffcd350d058) at semaphore_bsd.cpp:63 [opt]
    frame #2: 0x000000010f107dde libjvm.dylib`SemaphoreGangTaskDispatcher::worker_wait_for_task() [inlined] Semaphore::wait(this=<unavailable>) at semaphore.hpp:56 [opt]
    frame #3: 0x000000010f107dd5 libjvm.dylib`SemaphoreGangTaskDispatcher::worker_wait_for_task(this=0x00007ffcd350c680) at workgroup.cpp:158 [opt]
    frame #4: 0x000000010f1075ce libjvm.dylib`GangWorker::loop() [inlined] GangWorker::wait_for_task() at workgroup.cpp:324 [opt]
    frame #5: 0x000000010f1075bd libjvm.dylib`GangWorker::loop(this=0x00007ffcd4007800) at workgroup.cpp:343 [opt]
    [...]

However, the work gang dispatcher looks as if it had already spun up enough workers and is waiting for them to finish:

  thread #10, stop reason = signal SIGSTOP
    frame #0: 0x00007fff8d67051a libsystem_kernel.dylib`semaphore_wait_trap + 10
    frame #1: 0x000000010ef482cb libjvm.dylib`OSXSemaphore::wait(this=0x00007ffcd370d3a8) at semaphore_bsd.cpp:63 [opt]
    frame #2: 0x000000010f107d68 libjvm.dylib`SemaphoreGangTaskDispatcher::coordinator_execute_on_workers(AbstractGangTask*, unsigned int) [inlined] Semaphore::wait(this=<unavailable>) at semaphore.hpp:56 [opt]
    frame #3: 0x000000010f107d5f libjvm.dylib`SemaphoreGangTaskDispatcher::coordinator_execute_on_workers(this=0x00007ffcd370d2f0, task=<unavailable>, num_workers=<unavailable>) at workgroup.cpp:147 [opt]
    frame #4: 0x000000010f107269 libjvm.dylib`WorkGang::run_task(this=0x00007ffcd370d280, task=0x000000013be2d470, num_workers=4) at workgroup.cpp:284 [opt]
    frame #5: 0x000000010e989fcf libjvm.dylib`G1CollectedHeap::evacuate_collection_set(this=0x00007ffcd5000020, per_thread_states=0x000000013be2d7f0) at g1CollectedHeap.cpp:3771 [opt]
    [...]

(see attached backtrace-all.txt)

Notice that threads 24/25/26 have worker ids 0/2/3 assigned, respectively. The thread for worker id 1 is absent (still?). (See also the attached threads-24-25-26.txt for more contents of the stack frames of these threads.)
14-02-2019