JDK-8034948 : Back out JDK-6976350 since it does not fix any issue
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 8u20,9
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2014-02-14
  • Updated: 2014-07-29
  • Resolved: 2014-02-24
Fixed in: 8u20, 9 b04
JDK-6976350 changed the PLAB allocation to use two buffers per thread instead of one with the intent of decreasing PLAB waste.
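For readers unfamiliar with the mechanism, here is a minimal sketch, hypothetical and not HotSpot's actual code, of per-thread copying through one or more bump-pointer PLABs, as in the two-buffer scheme of JDK-6976350 (names and sizes are assumptions for illustration):

```python
class Plab:
    """Hypothetical sketch of a promotion-local allocation buffer (PLAB)."""

    def __init__(self, size):
        self.size = size   # capacity in words
        self.top = 0       # bump pointer

    def allocate(self, words):
        """Bump-allocate `words` in this buffer; return True on success."""
        if self.top + words <= self.size:
            self.top += words
            return True
        return False

    def unused(self):
        """Space thrown away if the buffer is retired right now."""
        return self.size - self.top


def copy_object(words, buffers):
    """Try each of the thread's PLABs in turn (JDK-6976350 used two)."""
    return any(b.allocate(words) for b in buffers)
```

The trade-off the comments below discuss: a second buffer can absorb an object that does not fit the first, but at the end of GC *both* buffers must be retired, so each thread can now leave up to twice the unused space behind.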

Recently, as part of JDK-8030849, we analyzed some high-thread-count, fragmentation-inducing applications (many different object sizes in use) and saw a few of them behave worse with that patch than without it.

In particular, specjbb2013 and CRM Fuse with many threads (a few hundred for specjbb) run significantly faster without that change.

Attached are a few figures showing the difference between the two versions in more detail on some DaCapo benchmarks.

backout_dacapo_overall.png shows total PLAB buffer use for all DaCapo benchmarks (on the x-axis; t04 means 4 GC threads, bX means X buffers per thread were used). Results with one and two buffers are shown interleaved. The y-axis shows the number of bytes used by PLAB allocation, divided into allocation of objects (blue), waste at the end of the PLAB buffers (red), and space left unused because of unfilled buffers at the end of GC.

The figure shows essentially no difference across applications; sometimes one version or the other is slightly ahead in space used.

backout-eclipse|tradesoap-increasing-threads.png and backout-eclipse/tradesoap-increased-threads-gc.png show more detail for the eclipse and tradesoap benchmarks respectively.

The first figure (backout-eclipse|tradesoap-increasing-threads.png) shows the amount of PLAB allocation for eclipse when increasing the number of parallel GC threads from 1 to 32. The left half of the figure shows the behavior when using a single buffer, the right half when using two buffers.

There is *no* real difference at all.

backout-eclipse/tradesoap-increased-threads-gc.png shows the number of GCs and the total allocated bytes with an increasing number of threads. Again, the left half uses one buffer (original code), the right half two buffers.

Again, no significant difference.

Since the change has no effect for these applications (and we have numbers for larger applications showing a slight disadvantage of using two buffers), and since it did not implement the idea suggested in JDK-6976350, the code only complicates the codebase.
Random trivia: in the above mentioned CRM Fuse runs, backout of JDK-6976350 increases average eden size from around 120M to 130M.

Fixed figures - in the previous figures, "allocated" already included the wasted and unused amounts.

Just to make clear what kind of impact PLAB retirement has (as shown in the figures): in eclipse and tradesoap, around 30% of the space in the promoted areas is pure waste; one could say 30% of the (old gen) heap is used for nothing. Logs from CRM Fuse indicate averages of 40%, with peaks as high as 70%, at non-trivial absolute amounts of promotion and waste. The problem is that with more threads, each individual thread does less promotion. With two buffers per thread it then often happens that at the end of GC the second buffer is almost empty, containing just a few objects, leading to excessive waste of space. This particularly hits applications that promote objects with a relatively large range of object sizes (and a more than usual amount of them).
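The scaling argument above can be made concrete with some purely illustrative numbers (not measurements from these runs): with a fixed total promotion volume, per-thread promotion shrinks as the thread count grows, so the fixed cost of retiring an extra partially filled buffer per thread becomes a larger fraction of the promoted space.

```python
def retirement_waste_fraction(total_promoted, threads, buffers_per_thread,
                              plab_size, leftover_fraction=0.5):
    """Fraction of promoted space lost to retiring partially filled PLABs.

    Assumes each thread ends the GC with `buffers_per_thread` buffers that
    are, on average, `leftover_fraction` empty -- illustrative assumptions,
    not HotSpot's model.  Sizes are in words.
    """
    waste = threads * buffers_per_thread * plab_size * leftover_fraction
    return waste / (total_promoted + waste)
```

Under these assumptions, doubling the buffers per thread roughly doubles the retirement waste, and the waste fraction grows with the number of GC threads, which matches the observation that the regression only shows up on high-thread-count machines.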

After backing out JDK-6976350 on a 32-thread x64 machine: +3.4% end-to-end performance on specjbb2013. (I never really looked at the impact of this change in isolation, because the next step was simply disabling all PLAB resizing, which yielded even more gain - more in a future CR.) There is a strong suspicion that on a SPARC M6 with 768 threads this change is the root cause of a "not being able to reach 200k injection rate" problem on specjbb2013 in some recent testing with the latest jdk8 (>480 GC threads); some early JDK 8 builds had no problem reaching that high IR.

CRM Fuse improves ART from 0.38 to 0.36 (compared to a jdk8 b129 without that change), with a 2% decrease in young-only GCs, a 7% decrease in mixed GCs, and 17% fewer concurrent remarks (at 18 GC threads). There is no immediately perceptible difference on machines with maybe <= 8 threads; also, the programs chosen for the figures (eclipse, tradesoap) are simply too small to show any change at higher thread counts. The individual GC pauses seem to be slightly longer, but this is easily explained: the change simply allows a significantly larger young gen within the same pause time goal. The correct way to handle this situation is to decrease the pause time goal (or the max young gen size), not to artificially limit GC and application performance.

Let me try to explain in other words why this change has absolutely no benefits (and actually tends to be detrimental): JDK-6976350 was about decreasing fragmentation, i.e. wasted space at the end of the PLABs. Large objects that did not fit the current PLAB should be handled by the second one. However, if you look at the diagrams I provided, there is no problem with waste at the end of the PLABs (look for the red bars - yes, they are so small you cannot see them). The problem is/was that at the end of GC, large parts of the PLABs are unused and need to be retired (the yellow bars).
So *adding* another PLAB buffer only increases the number of PLAB buffers that may contain just a few objects and need to be retired later. The existing method to decrease inner fragmentation, i.e. direct allocation in the region when there would be too much inner fragmentation, works extremely well; since in a regular Java application the majority of objects are very small compared to any PLAB size, there is no performance issue.

Also, more unused space in PLABs decreases the space available for the young gen in the heap (i.e. eden size), increasing the number of objects that survive because they now have less time to die (smaller eden!). This in turn increases the amount of space wasted by retiring buffers, because we do more copying. Since this decreases throughput, the application also allocates less, so at some point an equilibrium is reached.

The real problem we have with PLAB sizing is not inner fragmentation (nor the outer fragmentation that occurs when we cannot fit the PLAB into a region - the regions are typically large enough anyway), but too large and too many PLABs. This change tries to address the "too many" aspect, getting that number back at least to the values before JDK-6976350. The only literature I know of that mentions a benefit from multiple per-thread buffers is Blackburn's Immix paper, but Immix does not have the capability of direct allocation into the regions. Thinking about possible fixes, none of the ones I came up with involve multiple statically sized buffers per thread.
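The direct-allocation fallback referred to above can be sketched as a simple placement decision. This is a simplification for illustration, not HotSpot's actual code; the names and the threshold value are assumptions:

```python
PLAB_SIZE = 1024                       # words; illustrative value
REFILL_WASTE_LIMIT = PLAB_SIZE // 64   # hypothetical retirement threshold

def placement(plab_free, obj_size):
    """Decide where a copied object goes, given the current PLAB's free space.

    - Fits in the PLAB: bump-allocate there.
    - Doesn't fit, but the PLAB still has lots of usable space: don't throw
      the PLAB away for one awkward object -- allocate it directly in the
      region instead (this is what keeps the red "waste" bars tiny).
    - Doesn't fit and little space is left: retire the PLAB and refill.
    """
    if obj_size <= plab_free:
        return "plab"
    if plab_free > REFILL_WASTE_LIMIT:
        return "direct"
    return "new-plab"
```

Because most objects are far smaller than any PLAB, the "direct" path is rare, which is why a single buffer per thread already handles large objects well without the retirement waste a second buffer adds.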

What was the decrease in performance with specjbb2013? Was there an observed regression in GC performance?

Key for the x-axis labels: <benchmark>.t<number-of-threads>.b<number-of-buffers-per-thread>.resize.waste