JDK-8156708 : Intermittent hangs/timeouts on SPARC in G1 since 9-b99
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 9
  • Priority: P2
  • Status: Closed
  • Resolution: Duplicate
  • OS: solaris
  • CPU: sparc
  • Submitted: 2016-05-10
  • Updated: 2016-05-20
  • Resolved: 2016-05-13
Related Reports
Duplicate :  
Relates :  
Description
We've intermittently seen long timeouts on various startup benchmarks when running with G1 on SPARC since 9-b99.

From the pstack output of a core dump taken from a java -version process hanging on SPARC with a more recent binary:

ffffffff683b51e8 JVM_MonitorWait (100a529c8, 0, ea60, 100a52000, ffffffff68ede750, 100a529d0) + 2c8
 ffffffff57c102d8 * java/lang/Object.wait(J)V+-8672
 ffffffff57c1027c * java/lang/Object.wait(J)V+-2760
 ffffffff57c08b88 * java/lang/ref/ReferenceQueue.remove(J)Ljava/lang/ref/Reference;+59
 ffffffff57c088f8 * jdk/internal/ref/CleanerImpl.run()V+65
 ffffffff57c08bd4 * java/lang/Thread.run()V+11
 ffffffff57c08b88 * jdk/internal/misc/InnocuousThread.run()V+20
 ffffffff57c0036c * StubRoutines (1)

Seems something might be up with the cleaner introduced in JDK-8138696 (and further improved in JDK-8146038), which now features an innocuous and tenacious thread that doesn't like being interrupted:

            try {
                // Wait for a Ref, with a timeout to avoid getting hung
                // due to a race with clear/clean
                Cleanable ref = (Cleanable) queue.remove(60 * 1000L);
                if (ref != null) {
                    ref.clean();
                }
            } catch (Throwable e) {
                // ignore exceptions from the cleanup action
                // (including interruption of cleanup thread)
            }

Trivial reproducer (on a SPARC T4-1 this is very likely to hang for ~60s after around 10-20 executions):

for i in {1..100}; do java -version; done
Comments
A number of things came together to cause this problem, and I'm now seeing a more complete picture.

First, we added the TLS destructor: http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/469835cd5494 https://bugs.openjdk.java.net/browse/JDK-8033696 This was only added to Linux and OSX (in the BSD code) and the potential infinite loop when threads never detached did not manifest. But the FreeBSD folk found that after the four destructor iterations a warning was printed, which is how the underlying problem I later introduced became visible.

Second, when I made the compiler-based thread-local changes I factored out the pthread_set/getspecific TLS library code and applied it to all POSIX platforms - so adding AIX and Solaris to the mix. But I never thought to check whether those platforms bounded the potential infinite loop.

Third, I elided the explicit clearing of TLS when I made the TLS changes because I was assuming either the thread destructor would handle it, or else that it didn't matter as the threads involved (mainly GC threads) only terminated when the VM was shutting down anyway (and racing with shutdown at that). But it was exactly those GC threads that caused the problem - they don't have their thread destructor called and they seem to get terminated as a group during VM shutdown.

Hence we now had a group of terminating threads that didn't detach from the VM (i.e. they didn't clear TLS) and so they hit the looping TLS destructor behaviour. It seems that on Solaris the looping is not bounded, so we would see the VM appear to hang for quite some time before the thread terminating the VM actually got a chance to blow away the process. I'm not completely clear where that thread got "stuck", however. So this code will cause problems for any badly behaved native apps that fail to detach threads from the VM - they will go into an infinite loop, fail to terminate and cause a termination bottleneck on this TSD-related lock.

I'm not even sure now that this TLS destructor trick works as originally intended ... it seems to rely on an implementation detail of pthreads on Linux/BSD (which explicitly deallocates/clears TLS variables during thread termination - Solaris does not) and it seems to rely on the key for "current thread" being processed first - which is true for our launcher, but not necessarily true in general.

Update: the order of TLS destructor invocations does not matter. If the detaching destructor runs first then ~Thread clears the thread-current TLS and so its destructor is not run.
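For illustration, here is a minimal C++ sketch of the "restore on destruction" TLS trick described above and of the explicit clearing that was elided and later reinstated. This is not HotSpot source; the key and function names are made up for the example.

#include <pthread.h>

static pthread_key_t current_thread_key;   // hypothetical key holding the Thread*

// TSD destructor: re-install the value so that application TSD destructors
// running later during thread exit can still see the "current thread".
// Because the slot is non-NULL again, the TSD machinery schedules another
// destructor pass - bounded by PTHREAD_DESTRUCTOR_ITERATIONS on Linux/BSD,
// but not bounded by the Solaris tsd_exit loop quoted later in this report.
static void restore_current_thread(void* value) {
  pthread_setspecific(current_thread_key, value);
}

static void init_current_thread_key() {
  pthread_key_create(&current_thread_key, restore_current_thread);
}

static void attach_thread(void* thread) {
  pthread_setspecific(current_thread_key, thread);
}

// What a properly detaching thread does before it terminates; with the slot
// already NULL, thread exit never invokes the destructor for this key.
static void detach_thread() {
  pthread_setspecific(current_thread_key, nullptr);
}

A thread that terminates without doing the equivalent of detach_thread() is exactly the case described above: the destructor keeps re-installing the value, and on Solaris the recheck loop in tsd_exit keeps finding it.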
20-05-2016

Thanks for getting to the bottom of this, David!
13-05-2016

I have determined what is happening. The problem relates only to the use of library-based TLS: pthread_set/getspecific. The code for this can be seen here: http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libc/port/threads/tsd.c

In JDK-8132510 we made a number of changes related to the Thread::current() implementation. The primary change moved all platforms to use compiler-based thread-local storage - that has nothing to do with the current issue. We still needed to use library-based TLS in addition to compiler-based TLS because we need to access the current thread in a signal handler and the compiler-based TLS is not async-signal-safe (see comments in 8132510). As part of 8132510 we factored out the library-based TLS for all POSIX platforms and so Solaris started to use that code; and we (erroneously) stopped clearing current-thread (via pthread_setspecific(key, NULL)) explicitly for terminating GC threads - mistakenly assuming that either it would happen when the thread was deleted (but it doesn't get deleted), or else it didn't matter as the thread was gone anyway.

The POSIX TLS code, which we adopted from Linux, includes the use of a TSD destructor - which is needed to guard against a problem with application TSD destructors being used to detach the current thread as it terminates. This combination of a non-null destructor and a non-null current-thread value means that we hit this code in tsd_exit:

395         lmutex_lock(&tsdm->tsdm_lock);
396
397         do {
398                 recheck = 0;
399
400                 for (key = 1; key < TSD_NFAST &&
401                     key < tsdm->tsdm_nused; key++) {
402                         if ((func = tsdm->tsdm_destro[key]) != NULL &&
403                             func != TSD_UNALLOCATED &&
404                             (val = self->ul_ftsd[key]) != NULL) {
405                                 self->ul_ftsd[key] = NULL;
406                                 lmutex_unlock(&tsdm->tsdm_lock);
407                                 (*func)(val);
408                                 lmutex_lock(&tsdm->tsdm_lock);
409                                 recheck = 1;
410                         }
411                 }

So we always grab the lock first, but then we release it, execute the destructor and re-acquire the lock. This is what causes the bottleneck.

In JDK-8154715 we reinstated the explicit clearing of thread-current, so the "if" condition above is no longer true - no destructor gets executed for a NULL value - so we don't do the unlock/re-lock pair and we avoid the lock contention. This is likely not restricted to SPARC, but was simply more evident on the large SPARC systems.
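To make the contention pattern concrete, here is a rough C++ sketch of what each terminating thread does in the excerpt above (illustrative only - not illumos or HotSpot code, and the names are stand-ins). Every exiting thread whose key still holds a non-NULL value takes the same process-wide lock, drops it around the destructor call, re-takes it and rechecks, which is what serializes a large group of GC threads terminating together.

#include <mutex>

struct tsd_metadata {                         // stand-in for tsdm above
  std::mutex lock;                            // the contended process-wide mutex
  void (*destructor[64])(void*);              // per-key destructors
};

void tsd_exit_sketch(tsd_metadata& tsdm, void** values, int nkeys) {
  std::unique_lock<std::mutex> guard(tsdm.lock);   // lmutex_lock
  bool recheck;
  do {
    recheck = false;
    for (int key = 1; key < nkeys; key++) {
      void* val = values[key];
      if (tsdm.destructor[key] != nullptr && val != nullptr) {
        values[key] = nullptr;
        guard.unlock();                       // released around the destructor,
        tsdm.destructor[key](val);            // which may re-install the value,
        guard.lock();                         // then re-acquired: the bottleneck
        recheck = true;
      }
    }
  } while (recheck);
  // With thread-current already cleared (as reinstated by JDK-8154715), val is
  // NULL, the branch is never taken, and the lock is only held briefly instead
  // of being cycled around every destructor call.
}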
13-05-2016

b78  - no delays, but performance is around 50% of the most recent builds (43 GC threads)
b79  - ditto
b98  - ditto
b99  - hangs observed, otherwise performance the same (still 43 GC threads)
b118 - hangs observed, performance more than 2x better than the earlier builds (only 25 GC threads)

By decreasing the number of GC threads we can make the delay go away. There is no doubt in my mind that the tsd_exit path is causing a thread-termination bottleneck, and that it is the number of GC threads that affects this, as they seem to get told to terminate during VM shutdown.

I suspect this is fixed by 8154715 because we have reinstated the explicit clearing of TLS by the GC threads, which was lost as part of the 8132510 changes in b99. My theory is that when the TLS variable has been explicitly nulled, the _thrp_exit code either doesn't need to call tsd_exit at all, or else tsd_exit does not need to perform the locking that forms the bottleneck.
13-05-2016

The pstack output seems to show a number of threads piling up on an external mutex in the threading library as they terminate.

----------------- lwp# 2 / thread# 2 --------------------
 ffffffff7eae2538 lwp_park (0, 0, 0)
 ffffffff7eadb0d4 lmutex_lock (0, 1, 0, ffffffff7ec873cc, ffffffff7ec87580, ffffffff7c700240) + d4
 ffffffff7eae1fd8 tsd_exit (ffffffff7c700240, ff000000, ffffffff7ec84cc0, 28c0, ffffffff7ea99250, ffffffff7cd00400) + 74
 ffffffff7eade30c _thrp_exit (2800, 2800, ffffffff7ec7e000, ffffffff7ec84cc0, ffffffff7c700240, 0) + 120
 ffffffff7eae66ac _t_cancel (ffffffff7c8ff751, ffffffff79c083d0, ffffffff7c700240, ffffffff7c700240, 3, ffffffff7ec7e000) + 1bc
 ffffffff7eade680 _thrp_exit_common (0, 0, 19fa5c, ffffffff7ec7e000, 0, ffffffff7c700240) + e8
 ffffffff7eae24c4 _lwp_start (0, 0, 0, 0, 0, 0) + 8

There is nothing to indicate that this lock may be held by any active thread - all active thread stacks seem perfectly healthy. It is hard to see how this could be caused by the jdk.internal.ref.CleanerImpl changes, nor how it could be fixed by the changes in 8154715. The fact that this is only seen on SPARC also makes it less likely that this is in any way related to code in the VM/JDK itself. The G1 connection seems more likely related to the number of threads involved than to anything GC-specific.
13-05-2016

The fact that this showed up in b99 and involves tsd_exit suggests that JDK-8132510 may be to blame - though the main switch to compiler-based TLS on Solaris occurred earlier, under JDK-8130212, in b79.
12-05-2016

I would be very suspicious about closing this as a duplicate without understanding the issue. The fix for 8154715 should have no practical impact on anything observable in the VM - the only observable part of this issue was the warning issued on BSD. The only way I can see 8154715 impacting things is if the destruction of the VMThread on the VM exit path (which has now been removed) was being delayed due to interactions with other threads somehow. But really you need to see where all the threads are executing during these hangs to confirm what the problem is. It may be that we have just shifted a timing window.
11-05-2016

I ran the reproducer 10+ times with a build containing the fix for JDK-8154715 and cannot reproduce any intermittent one-minute hangs. Closing this as a duplicate.
11-05-2016

Investigating if this might be related to (and possibly already fixed by) JDK-8154715
11-05-2016