JDK-8152196 : SuspendibleThreadSet::yield scales poorly
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 9
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2016-03-18
  • Updated: 2016-04-21
  • Resolved: 2016-03-25
Fixed in: JDK 9 b115
Related Reports
Blocks :  
Relates :  
Description
When a requesting thread calls SuspendibleThreadSet::synchronize, it sets the request flag (_suspend_all) and then waits on the STS_lock until all the threads in the set have either left the set or called yield() (determined by comparing _nthreads_stopped to _nthreads).

When a suspendible thread calls yield() while there is an active request, it updates the yield counter (_nthreads_stopped), calls notify_all on the STS_lock (to wake up the requesting thread so it can recheck and either rewait or proceed), and then waits on the STS_lock until there is no longer an active request.

The problem with this is that the suspendible thread's notification of the STS_lock, intended to alert the requesting thread of a state change, also wakes up any previously yielded suspendible threads. These wakeups are not necessary to the semantics of the operation; they are pure overhead. In the worst case, the number of excess wakeups before the requestor can proceed scales as the square of the number of suspendible threads that yield for the request: with N yielding threads, the k-th thread to yield needlessly wakes the k-1 threads that yielded before it, for roughly N^2/2 excess wakeups in total.

[In practice the number of excess wakeups usually seems to be lower (though it can still be significant), because some of the threads that have not yet yielded are often competing for the STS_lock with those that have been woken up from their waits, and so may perform their notification before the awakened threads have resumed waiting.]

Having a suspendible thread call leave() rather than yield() has a similar effect: it wakes up any already yielded threads and the requestor, which is unnecessary unless that leaving thread was the last non-yielded thread in the set.
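To make the contention pattern concrete, here is a minimal model of the monitor-based protocol described above. It uses standard C++ primitives rather than HotSpot's Monitor class; the field names mirror those mentioned in the description, but the structure is simplified (join/leave maintenance of _nthreads is omitted) and this is not the actual HotSpot code.

```cpp
#include <condition_variable>
#include <mutex>

// Simplified model of the current, monitor-based SuspendibleThreadSet protocol.
class SuspendibleSetModel {
  std::mutex              _lock;              // stands in for STS_lock
  std::condition_variable _cv;                // single wait set shared by everyone
  bool                    _suspend_all = false;
  int                     _nthreads = 0;      // maintained by join()/leave(), omitted here
  int                     _nthreads_stopped = 0;

 public:
  // Requesting thread: raise the flag, wait until every set member has yielded.
  void synchronize() {
    std::unique_lock<std::mutex> ml(_lock);
    _suspend_all = true;
    while (_nthreads_stopped < _nthreads) {
      _cv.wait(ml);
    }
  }

  // Suspendible thread: if a request is active, record the yield, notify, and
  // wait for the request to end.
  void yield() {
    std::unique_lock<std::mutex> ml(_lock);
    if (!_suspend_all) return;
    _nthreads_stopped++;
    _cv.notify_all();                         // intended only for the requestor...
    while (_suspend_all) {
      _cv.wait(ml);                           // ...but already-yielded threads wake here too
    }
    _nthreads_stopped--;
  }

  // Requesting thread: end the request and release the yielded threads.
  void desynchronize() {
    std::unique_lock<std::mutex> ml(_lock);
    _suspend_all = false;
    _cv.notify_all();                         // this broadcast is the necessary one
  }
};
```

Every notify_all() in yield() is meant for the single requestor, but because all parties share one wait set it also wakes every thread already parked in the `while (_suspend_all)` loop, which is the source of the excess wakeups.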

Comments
Note that the table in the 18-03-2016 comment below might understate the benefit of this change. Using the monitor for wakeup, the penultimate yielding thread's notification ensures that all of the threads except the final yielding thread are contending for the lock. The final yielding thread could be delayed from entering the critical section by competition with the unnecessarily awakened threads. Such delays are not included in the table. With the semaphore-based version the final yielding thread may be delayed from entering the critical section by earlier arrivals, but it won't be competing with unnecessarily awakened threads.
25-03-2016

Results of running:
- specjbb2015 for warmup + 2 minutes PRESET (so 10 minutes total)
- with instrumented STS to report latency for waking up the requestor
- with 18 concurrent refinement threads added to the mix as part of fixing JDK-8151670
- on my dev machine (Linux, Xeon 2.6GHz, 24 threads)

I collected the time from notifying the requestor until it exited its wait. Note that not all safepoints involve a requestor wait; there might not be any threads in the suspendible set because they aren't active at that point for other reasons.

Collected for:
- semaphore: new implementation using a semaphore to wake up the requestor
- monitor: monitor-based, with yielding concurrent refinement threads
- mon_norefine: baseline, monitor-based, refinement threads not using yield

Although the monitor results were with the existing implementation, these tests should be no worse than what would arise from the improved version that only notifies the monitor once. These tests don't attempt to measure the additional impact of the other unnecessary wakeups.

semaphore: 250 safepoint waits collected
monitor: 211 safepoint waits collected
mon_norefine: 212 safepoint waits collected

All times in seconds.

|         | semaphore   | monitor     | monitor/sem | mon_norefine | mon_norefine/sem |
|---------|-------------|-------------|-------------|--------------|------------------|
| average | 0.000014090 | 0.000245393 | 17.41573968 | 0.000056720  | 4.02549548       |
| median  | 0.000009778 | 0.000020184 | 2.064225813 | 0.000010546  | 1.078543669      |
| stddev  | 0.000021254 | 0.000712454 | 33.5208877  | 0.000338555  | 15.92899677      |
| max     | 0.000262114 | 0.006117748 | 23.34002762 | 0.0045283    | 17.27607072      |
| min     | 0.000003283 | 0.000001537 | 0.468169357 | 0.000001956  | 0.595796528      |

Although the semaphore-based method has a worse minimum time [1], overall it's a pretty clear winner, especially as the number of threads increases. The monitor results get significantly worse with the addition of 18 concurrent refinement threads to the potentially yielding set, and will get increasingly bad as the number of threads (marking and refinement) increases. The difference in the maximum wakeup times is particularly of note.

[1] The best-case wakeup of a thread waiting on a semaphore is worse than the best-case wakeup of a thread waiting on a monitor, apparently by about a factor of two.
18-03-2016

This scaling effect could be mitigated by only having leave() and yield() notify when the requestor will be allowed to proceed, i.e. only one notification per request. However, this would still wake up all the previously yielded threads, which would then all compete for the lock with the requestor, possibly delaying the requestor from proceeding. And those yielding threads would find the request still in effect and go back to waiting, making their wakeup a complete waste of effort. (Unless the requestor very quickly completes its task and clears the request before some of those suspendible threads have regained the lock and resumed waiting. I've seen this happen with a "remark" then "cleanup" sequence, for example.)

A better solution would be to separate the notification of the requestor that it can proceed from the notification of the suspendible threads that the suspension request has ended. We can do that by using a semaphore for the notification of the requestor, though the analysis is a little complicated. (It would be much easier with a mutex and two condition variables, but we don't have those in HotSpot. I think it might be possible with two HotSpot Monitors, but it would be at least as complicated to analyze and uglier to code than the semaphore-based solution.)
18-03-2016
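For illustration, here is a minimal sketch of the semaphore-based separation described in the comment above, again using standard C++ (C++20) primitives in place of HotSpot's Monitor and Semaphore classes. The names (e.g. _synchronize_wakeup) and structure are illustrative assumptions, not the change that was actually committed.

```cpp
#include <condition_variable>
#include <mutex>
#include <semaphore>

// Sketch: requestor wakeup via a dedicated semaphore; yielded threads keep
// waiting on the monitor and are only notified when the request ends.
class SuspendibleSetWithSemaphore {
  std::mutex              _lock;                     // stands in for STS_lock
  std::condition_variable _cv;                       // suspendible threads wait here
  std::binary_semaphore   _synchronize_wakeup{0};    // requestor waits here
  bool                    _suspend_all = false;
  int                     _nthreads = 0;             // join()/leave() omitted
  int                     _nthreads_stopped = 0;

 public:
  void synchronize() {
    {
      std::unique_lock<std::mutex> ml(_lock);
      _suspend_all = true;
      if (_nthreads_stopped == _nthreads) return;    // nothing to wait for
    }
    _synchronize_wakeup.acquire();                   // posted exactly once, by the last yielder
  }

  void yield() {
    std::unique_lock<std::mutex> ml(_lock);
    if (!_suspend_all) return;
    _nthreads_stopped++;
    if (_nthreads_stopped == _nthreads) {
      _synchronize_wakeup.release();                 // only the final yielder signals the requestor
    }
    while (_suspend_all) {
      _cv.wait(ml);                                  // other yielded threads sleep undisturbed
    }
    _nthreads_stopped--;
  }

  void desynchronize() {
    std::unique_lock<std::mutex> ml(_lock);
    _suspend_all = false;
    _cv.notify_all();                                // release all yielded threads
  }
};
```

Because the semaphore retains the signal, the requestor cannot miss the wakeup even if the last yielder posts it before the requestor reaches acquire(), and the already-yielded threads stay parked on the monitor until desynchronize() ends the request, which is the separation of concerns the comment argues for.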