JDK-8131668 : Contention on allocating new TLABs constrains throughput on G1
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 9
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2015-07-16
  • Updated: 2019-07-16
Description
Some benchmarks (Stress BPM, possibly SPECjbb*) indicate that there is a significant amount of contention on the freelist_lock when retrieving new mutator allocation regions, particularly when the TLAB size is already quite large and the number of threads is very high.

Look at possibilities to avoid this overhead.
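For illustration, a minimal standalone C++ model of the pattern in question (not HotSpot code; the lock, sizes and thread count below are made up): every mutator thread bump-allocates from an equally sized TLAB and takes a single shared lock whenever it needs a refill, so with many threads the refills serialize on that one lock.

// Simplified model of TLAB refill contention; all names and sizes are invented.
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

static std::mutex freelist_lock;                    // stand-in for the shared region/free-list lock
static std::atomic<std::size_t> heap_left{512u << 20}; // 512 MB "heap" to hand out
constexpr std::size_t kTlabSize = 256u << 10;       // every thread uses the same TLAB size

// Slow path: grab the next TLAB-sized chunk under the shared lock.
static bool refill_tlab() {
  std::lock_guard<std::mutex> guard(freelist_lock);
  if (heap_left.load(std::memory_order_relaxed) < kTlabSize) return false;
  heap_left.fetch_sub(kTlabSize, std::memory_order_relaxed);
  return true;
}

static void mutator(std::size_t alloc_size, std::atomic<std::size_t>* refills) {
  std::size_t tlab_left = 0;
  for (;;) {
    if (tlab_left < alloc_size) {                   // TLAB exhausted
      if (!refill_tlab()) return;                   // "heap" exhausted: stop
      tlab_left = kTlabSize;
      refills->fetch_add(1, std::memory_order_relaxed);
    }
    tlab_left -= alloc_size;                        // fast path: no lock taken
  }
}

int main() {
  std::atomic<std::size_t> refills{0};
  std::vector<std::thread> threads;
  for (int i = 0; i < 32; i++) threads.emplace_back(mutator, std::size_t{64}, &refills);
  for (auto& t : threads) t.join();
  std::printf("TLAB refills taken under freelist_lock: %zu\n", refills.load());
  return 0;
}

The fast path never touches the lock; only the refill does, which is why the effect only becomes visible once the thread count and allocation rate are high enough.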
Comments
I (and [~brutisso]) have been instrumenting the time it takes to enter this lock in G1CollectedHeap::attempt_allocation_slow(). I have been running SPECjbb2005 and SPL4. Almost all lock entries take less than 0.01 ms. However, we have a few occurrences where it takes multiple milliseconds. Almost all of these are right after a young collection. It typically looks like this:

... 884M->573M(1024M), 0,0230163 secs]
0x7fdf28658000, 23,726
0x7fdf28659000, 23,713
0x7fdf28643800, 0,001

(The first value is the thread pointer, the second is the waiting time in ms.) That is, these two threads have been waiting to allocate over a young collection, which is to be expected. Sometimes there are more than two threads:

... 883M->560M(1024M), 0,0357457 secs]
0x7fdf28676800, 36,350
0x7fdf28658000, 36,429
0x7fdf28665000, 36,387
0x7fdf28660000, 0,001

So I'm not sure that this lock is very contended. Could it be that the performance increase with randomized TLAB sizes is not due to less contention on the lock, but to some other effect of the different TLAB sizes?
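For reference, a minimal sketch of this kind of timing instrumentation (not the actual patch; it uses a plain std::mutex as a stand-in for the VM lock, and the names and the 0.01 ms threshold are hypothetical):

#include <chrono>
#include <cstdio>
#include <mutex>

// Measure how long the caller waits to enter the lock and report the
// interesting cases, matching the "thread pointer, wait in ms" output above.
static void timed_lock_entry(std::mutex& m, const void* thread_id) {
  auto start = std::chrono::steady_clock::now();
  m.lock();                                        // the measured lock entry
  double waited_ms = std::chrono::duration<double, std::milli>(
      std::chrono::steady_clock::now() - start).count();
  if (waited_ms > 0.01) {                          // only log noticeable waits
    std::fprintf(stderr, "%p, %.3f\n", thread_id, waited_ms);
  }
}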
20-08-2015

On the ArraysSort benchmark (see JDK-8062128), randomizing TLAB sizes gives a 2-3% throughput improvement.
12-08-2015

Another option could be multiple allocation regions/lists from which regions are retrieved, each with its own lock.
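A rough sketch of that idea, purely illustrative and not a proposed patch: split the single free list into a few stripes, each guarded by its own lock, and let a refilling thread start at a home stripe derived from, e.g., its thread id, stealing from the other stripes only if its own is empty. All names below are made up.

#include <cstddef>
#include <mutex>

constexpr int kStripes = 8;

struct StripedRegionLists {
  struct Stripe {
    std::mutex lock;
    std::size_t free_regions = 64;                 // placeholder for a real region list
  } stripes[kStripes];

  // Try the caller's home stripe first, then steal from the others, so two
  // threads with different hashes normally never touch the same mutex.
  bool claim_region(std::size_t thread_hash) {
    for (int i = 0; i < kStripes; i++) {
      Stripe& s = stripes[(thread_hash + i) % kStripes];
      std::lock_guard<std::mutex> guard(s.lock);
      if (s.free_regions > 0) {
        s.free_regions--;
        return true;
      }
    }
    return false;                                  // all stripes empty: fall back to GC, etc.
  }
};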
04-08-2015

There are some options here:
- randomize TLAB sizes to a certain degree on highly threaded machines
- remove the locking in this path
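A sketch of the first option, randomizing TLAB sizes (illustrative only; the function name and the +/-12.5% jitter range are made up): add a small per-refill jitter to the desired TLAB size so that threads with identical allocation rates stop arriving at the shared refill path at exactly the same moment.

#include <cstddef>
#include <cstdint>
#include <random>

// Jitter the desired TLAB size by up to one eighth in either direction.
static std::size_t randomized_tlab_size(std::size_t desired_size) {
  thread_local std::mt19937_64 rng{std::random_device{}()};
  std::uniform_int_distribution<std::int64_t> jitter(
      -static_cast<std::int64_t>(desired_size / 8),
       static_cast<std::int64_t>(desired_size / 8));
  return desired_size + jitter(rng);
}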
04-08-2015

Another benchmark that performs badly because of this, particularly at small heap sizes, is Doug Lea's SPL4 test from JSR 166 (attached). One particular aspect is that it performs worse with 1M regions than with 32M regions. The problem seems to be that since all TLABs are the same size, and all threads get to refill their TLAB at the same time, there is huge contention on the freelist_lock (or whatever lock is used here).
03-08-2015