Bug ID: JDK-8046418 High system time for PLAB allocations in highly threaded applications

Versions (Unresolved/Resolved/Fixed)

Other: tbd (Unresolved)

In some applications with many threads, we sometimes observe high system time during GC. Investigation showed that PLAB allocation can be at least one cause of this.

If many threads try to refill their PLABs at approximately the same time, each of them takes the heap lock, which causes heavy contention.
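A toy model of the refill pattern described above, to make the contention point concrete. It is not HotSpot code: the class, the lock, and all names and parameters are hypothetical stand-ins, and it only counts how often the shared lock is taken, it does not measure system time.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Toy model only: none of these names come from HotSpot.
final class PlabRefillSketch {
    // Stand-in for the single heap lock that every refill must take.
    private final ReentrantLock heapLock = new ReentrantLock();
    private final AtomicLong lockedRefills = new AtomicLong();
    private long heapTop = 0; // guarded by heapLock

    // Slow path: carve a new buffer out of the shared heap under the lock.
    private long refill(long plabWords) {
        heapLock.lock();
        try {
            long start = heapTop;
            heapTop += plabWords;
            lockedRefills.incrementAndGet();
            return start;
        } finally {
            heapLock.unlock();
        }
    }

    // Each worker bump-allocates in a private buffer and refills when it runs out.
    // Returns the total number of times the shared lock was taken.
    long run(int threads, long wordsPerThread, long objectWords, long plabWords) {
        CountDownLatch go = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    go.await(); // release all workers at once, as at GC start
                } catch (InterruptedException e) {
                    return;
                }
                long remaining = 0;
                for (long copied = 0; copied < wordsPerThread; copied += objectWords) {
                    if (remaining < objectWords) {
                        refill(plabWords);
                        remaining = plabWords;
                    }
                    remaining -= objectWords; // fast path: no lock needed
                }
            });
            workers[i].start();
        }
        go.countDown();
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return lockedRefills.get();
    }

    public static void main(String[] args) {
        System.out.println("small PLABs: " + new PlabRefillSketch().run(8, 8192, 4, 1024));
        System.out.println("large PLABs: " + new PlabRefillSketch().run(8, 8192, 4, 4096));
    }
}
```

Because all workers start with an empty buffer of the same size and copy at a similar rate, their refills tend to line up in time, which is the situation in which a single lock hurts most.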

The attached file shows the high pause time spikes this contention causes on a microbenchmark.

Find ways to fix this.

Following are the test cases for the given figure:

[0] -XX:-ResizePLAB
[1] -XX:+ResizePLAB
[2] -XX:-ResizePLAB -XX:OldPLABSize=2097152 -XX:YoungPLABSize=2097152
[3] -XX:-ResizePLAB -XX:OldPLABSize=0 -XX:YoungPLABSize=0

For this test, [2] gives the best results. However, the PLAB size cannot be made too large, because that wastes too much space.
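A back-of-the-envelope estimate of why very large PLABs are costly. This is only a sketch under stated assumptions: the PLAB sizes are taken to be in heap words of 8 bytes, the worst case is taken to be one nearly empty young PLAB and one nearly empty old PLAB left over per GC thread, and the thread count and the "small" sizes are hypothetical. It also ignores any clamping of the PLAB size that the collector may apply.

```java
// Back-of-the-envelope bound; the formula and all names are illustrative assumptions.
final class PlabWasteEstimate {
    // Assumed worst case: every GC thread leaves one almost-empty young PLAB
    // and one almost-empty old PLAB behind at the end of the pause.
    static long worstCaseWasteBytes(int gcThreads, long youngPlabWords,
                                    long oldPlabWords, int bytesPerWord) {
        return (long) gcThreads * (youngPlabWords + oldPlabWords) * bytesPerWord;
    }

    public static void main(String[] args) {
        int gcThreads = 32;   // hypothetical thread count
        int bytesPerWord = 8; // assumes a 64-bit VM and sizes given in words

        // Hypothetical small PLAB sizes, for comparison only.
        long small = worstCaseWasteBytes(gcThreads, 4096, 1024, bytesPerWord);
        // Sizes from test case [2]: OldPLABSize = YoungPLABSize = 2097152.
        long large = worstCaseWasteBytes(gcThreads, 2097152, 2097152, bytesPerWord);

        System.out.println("small PLABs, worst-case waste in bytes: " + small);
        System.out.println("large PLABs, worst-case waste in bytes: " + large);
    }
}
```

Under these assumptions the bound grows linearly with both the thread count and the PLAB size, so the setting that removes most refills is also the one that can strand the most memory.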

This may be an effect of NUMA: depending on the run, the average pause time is either X or Y ms in a very reproducible manner, e.g. on pjbb2005. When that benchmark is limited to a single NUMA node, the average pause time is very reproducible and almost always exactly the same. (This is a possible explanation for the observation that the average pause time is either 80 or 100 ms, not for the original problem.)
01-09-2015
Just to be clear: there is still high system time in these cases (taking 10 to sometimes 20 ms), but this is typically negligible compared to the user time spent. I cannot observe spiky behavior.
15-12-2014
Cannot reproduce the high system time here. However, I found that, at least for this benchmark, the average GC pause time may vary between runs by 10-20% (say, 100 ms vs. 80 ms). This variation seems to be higher with more threads. Cursory evaluation shows that there is a problem with the allocate_slow() path, which takes up to 10 ms in total per GC for some threads. It does not explain the whole difference.
15-12-2014
Potential solutions:
- size the first PLABs differently for the threads so that they get to refill at different points in time
- another option is to preallocate/use multiple allocation regions that can be handed out in reserve (potentially in different sets) so that the threads do not need to take the (same) heap allocation lock
- a heap region allocation path during GC that does not use the heap lock
10-06-2014
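A minimal sketch of the first idea from the potential solutions above (sizing the first PLABs differently per thread). The spreading rule, between half and the full base size, and all names are assumptions made for illustration; the report does not specify how the sizes should be chosen.

```java
// Illustrative only: the report does not prescribe this spreading rule.
final class StaggeredPlabSizes {
    // Spread first-PLAB sizes evenly over [baseWords / 2, baseWords), so that
    // threads copying at a similar rate exhaust their first buffer at
    // different points in time instead of all at once.
    static long[] firstPlabWords(int threads, long baseWords) {
        long[] sizes = new long[threads];
        long half = baseWords / 2;
        for (int i = 0; i < threads; i++) {
            sizes[i] = half + (i * half) / threads;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Hypothetical values: 4 GC threads, base PLAB size of 4096 words.
        for (long words : firstPlabWords(4, 4096)) {
            System.out.println(words);
        }
    }
}
```

Only the first buffer is staggered here; later refills would use the regular size, so the offsets between threads carry over as long as the threads keep copying at similar rates.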