JDK-8027547 : G1 should not push already forwarded objects into the object task queue but special case them
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: hs25, 8
  • Priority: P2
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2013-10-30
  • Updated: 2021-07-29
Description
Measurements on specjbb2005 (see JDK-8020306) show that G1's object copy times are much larger than parallel gc's.

One important problem is that the number of items pushed onto the object copy task queue is much larger in G1 than in parallel gc. Investigation showed that parallel gc has a special fast path for handling references to already forwarded objects: it does not push them onto the work queue, but fixes them up and does any remaining processing (remembered set update) inline.

E.g.

template <class T>
inline void PSPromotionManager::claim_or_forward_internal_depth(T* p) {
  if (p != NULL) { // XXX: error if p != NULL here
    oop o = oopDesc::load_decode_heap_oop_not_null(p);
    if (o->is_forwarded()) {      // <--  "fast path" here
      o = o->forwardee();
      // Card mark: the remembered set work is done inline instead of
      // being deferred via the queue.
      if (PSScavenge::is_obj_in_young(o)) {
        PSScavenge::card_table()->inline_write_ref_field_gc(p, o);
      }
      oopDesc::encode_store_heap_oop_not_null(p, o);
    } else {
      push_depth(p); // slow path: push onto the work queue
    }
  }
}

Depending on the fraction of already forwarded objects, this saves a lot of work, as task queue management is relatively expensive: pushing and popping the element, possible loss of locality, additional memory use in the overflow queue, more (slow, possibly contended) stealing work, etc.
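For G1 the analogous fast path would go into the scan closure. A minimal sketch of the idea follows; the helpers in_cset(), update_rs() and push_on_queue() are illustrative stand-ins for whatever the actual G1 closure uses, not the current G1 code:

// Hypothetical sketch: forwarded-object fast path for a G1 scan closure.
// in_cset(), update_rs() and push_on_queue() are placeholders.
template <class T>
inline void scan_with_forwarded_fast_path(T* p) {
  T heap_oop = oopDesc::load_heap_oop(p);
  if (oopDesc::is_null(heap_oop)) {
    return;
  }
  oop obj = oopDesc::decode_heap_oop_not_null(heap_oop);
  if (in_cset(obj)) {
    if (obj->is_forwarded()) {
      // Fast path: some thread has already copied this object. Fix the
      // reference up in place and do the remembered set work inline,
      // instead of paying for a queue push/pop (and a possible steal).
      oop forwardee = obj->forwardee();
      oopDesc::encode_store_heap_oop_not_null(p, forwardee);
      update_rs(p, forwardee);
    } else {
      // Slow path: the object still needs copying; defer via the queue.
      push_on_queue(p);
    }
  } else {
    update_rs(p, obj);
  }
}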

Comments
JDK-6672778 is highly relevant to this. Before that change, the prefetch before push was apparently mostly wasted and was hurting performance. There is also some discussion here: https://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2019-October/027366.html
29-07-2021

As discussed in JDK-8246718, the already-forwarded-object check turned out not to be so good for ParallelGC after all, and has been removed. I'm leaving this RFE open for now as a placeholder for further investigation in this area, since there's already some useful information here. I think that for it to be beneficial for G1, G1 might need to encounter such objects at a significantly higher rate, and I don't see a reason for that to happen. The only place I can think of where ParallelGC and G1 could differ is remembered set processing, if G1 were to process some cards repeatedly, but I don't think there should be much (if any) difference there either. If the cost of copying were much higher for G1, that could also tip the balance.

Having the prefetch before push does seem to be somewhat beneficial compared to no prefetch at all. But having the prefetch on the push side of the queue seems pretty wasteful, particularly early in the cycle: as we scan an object, we'll prefetch all the objects it references, then start working on one of them and scan/prefetch/push all the objects it references, and so on. So by the time we get back to the objects referenced by the first object, we may well have purged them from the cache, wasting the prefetch. I've done some experiments with putting the prefetch on the pop side, but so far haven't managed to do any better than not regressing.
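To illustrate the two placements being compared (a sketch only: the Queue type and its pop/peek operations are simplified placeholders, not the actual HotSpot task queue API; Prefetch::read is HotSpot's prefetch helper):

// Prefetch on the push side: by the time this task is popped again, the
// prefetched cache line may long since have been evicted.
void push_side_prefetch(Queue* q, oop obj) {
  Prefetch::read((void*)obj, 0);      // possibly wasted work
  q->push(obj);
}

// Prefetch on the pop side: prefetch the task we will process next while
// still working on the current one, so that the prefetch distance matches
// the actual processing order.
void pop_side_prefetch(Queue* q) {
  oop obj;
  while (q->pop_local(obj)) {
    oop next;
    if (q->peek_next(next)) {         // hypothetical lookahead operation
      Prefetch::read((void*)next, 0);
    }
    process(obj);                     // placeholder for the scan/copy work
  }
}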
14-06-2020

This is more of an investigation into what can be done here.
28-10-2015

Very good point re: "prefetching becoming less effective towards the end of the GC". I'm not sure what to do about that, though. Maybe only prefetch if the task queue is over a certain length (less chance of other threads stealing, given that they steal from the bottom)? See the sketch below.
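A sketch of that suggestion; size() exists on the HotSpot task queues, while the Queue type and the threshold are made up for illustration:

// Only prefetch when the local queue is long enough that the just-pushed
// task is unlikely to be stolen before we pop it ourselves (thieves steal
// from the bottom of the queue).
const size_t PrefetchQueueThreshold = 16;   // hypothetical tuning knob

void push_with_conditional_prefetch(Queue* q, oop obj) {
  if (q->size() > PrefetchQueueThreshold) {
    Prefetch::read((void*)obj, 0);
  }
  q->push(obj);
}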
14-02-2014

I think the situation is as follows: typically, towards the end of GC the frequency of forwarded references increases. There are then two classes of tasks in the queue: one that is fast to process (already forwarded references) and one that is much slower (normal processing, i.e. a call to copy_to_survivor_space()). The queues contain mainly already forwarded references, so a stolen task is typically a reference that just needs fixing up. The threads are simply able to attempt stealing more often as they wait for a single thread to complete a "real" reference (of which there are probably just a few now and then at this point). Also, after completing an already forwarded reference, the thread has to go steal again right away, as forwarded references do not create new references to process. Another issue: with that high an amount of stealing (it does not seem bad in this particular example; I might dig out the specjvm2008.compiler logs), the prefetching may be useless, as the reference we think this thread is going to process soon is in many cases stolen.
30-01-2014

Hey Thomas,

Thanks for the detailed reply. I was not saying that it's not worth trying this! I just pointed out that the higher cache miss rate might cancel out any benefits the change might have (at least in some cases). I wonder why stealing goes up when there are more operations on the queues. Maybe, when some threads terminate earlier than others, there are simply more items available for stealing?

Tony
30-01-2014

I saw the comment in the G1ParScanClosure::do_oop_work_nv (or so) method. The problem is that we are not just pushing/popping an object from a queue; managing the queues takes time too. Then there's the synchronization overhead of other threads trying to access and steal from the work queues. Given the task queue stats above, stealing seems to grow disproportionately with the number of operations on the queues. Also, there is quite a lot of work to do (a few branches at least) from popping a reference from the queue until we are back at the processing code - G1 does not exactly have the most lightweight closures - even if everything is in the caches. I wrote elsewhere, I think, that according to the tool G1 is also decode bound (too-long code paths all around).

From measurements with the update RS phase changes (no CR yet), the code actually executed to do the forwarding and then update the remembered set entry is typically really small (the rset filters work well). As the first comment indicates, although G1 only does 13% more pushes and pops, according to VTune it spends more than twice the time there (4 vs. 1.8) (specjbb2005; this is repeatable, but as you know, when working with these tools this might be a sampling problem).

Iirc most applications also have a higher share of forwarded references to process than specjbb, so the saving should generally be larger than there (I've gotten better-than-expected improvement in specjbb scores in a few quick runs with the change though; probably something wrong with the test). There are other optimizations already in the bug system that work better with specjbb than this one. Sure, the change needs to be measured well. :)
30-01-2014

Thomas,

There's a very interesting trade-off here. On one hand, checking whether the object is forwarded or not might cause an immediate cache miss (and you have an extra conditional in your fast path). On the other hand, pushing onto / popping from the task queue might not be too bad cache-performance-wise (the top of the task queue will be very hot and probably already in the cache). I'm very interested to see whether the extra check has an impact on GC times and/or the cache miss rate (it'd be worth doing a couple of runs with your favorite profiler to check the latter).

Tony
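For context on the potential cache miss: the forwarding check amounts to reading the mark word of the referenced object, i.e. touching its (possibly cold) header cache line before any copying decision. Roughly, from the HotSpot sources (oop.inline.hpp):

// The forwarded check is a mark word read: it has to touch the header
// cache line of the referenced object.
inline bool oopDesc::is_forwarded() const {
  return mark()->is_marked();
}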
30-01-2014

Some statistics about the share of already forwarded references among those G1 pushes onto the task queue:

  specjbb2005:           8.96%
  specjbb2013:           8.96%
  specjvm2008.compiler: 44.6% (!)
  specjvm2008.compress: 31.2%
  specjvm2008.crypto:   30.9%
  specjvm2008.derby:    37.1%
  specjvm2008.mpeg:      9.59%
  specjvm2008.scimark:  13.0%
  specjvm2008.serial:   29.5%
  specjvm2008.startup:  32.2%
  specjvm2008.sunflow:  20.5%
  specjvm2008.xml:      17.4%
  specjvm98:            17.9%
24-01-2014

Here's the output of the task queue stats for parallel gc on specjbb2005:

thr      qpush       qpop     qpop-s   qattempt     qsteal      opush       omax
--- ---------- ---------- ---------- ---------- ---------- ---------- ----------
  0     795448     795303         66         28         27          0          0
  1     678503     678391        104         75         74          0          0
  2     612772     612615        110         52         51          0          0
  3     556778     556735        242        306        305          0          0
  4          0          0          0          0          0          0          0

    --------masked-------     arrays      array
thr       push      steal    chunked     chunks
--- ---------- ---------- ---------- ----------
  0       1409         12          4       1409
  1        673          3          1        673
  2        676         11          2        676
  3        740         15         20        740
  4          0          0          0          0

and here's a similar one for G1:

GC Task Stats
thr      qpush       qpop     qpop-s   qattempt     qsteal      opush       omax
--- ---------- ---------- ---------- ---------- ---------- ---------- ----------
  0     744163     743894        375        236        233          0          0
  1     774933     774545        313        202        200          0          0
  2     850281     850104        275        311        309          0          0
  3     632904     632758        778        240        238          0          0
tot    3002281    3001301       1741        989        980          0          0

E.g. in total PS scavenge pushes 2.64M references, while G1 pushes 3.0M. Adding some code that prints the number of already forwarded references pushed (in G1ParScanThreadState::push_on_queue) shows that these amount to about 360k, i.e. almost exactly the difference (accounting for slight differences in the point in time at which the measurement was taken):

thread 0 forwarded: 124482
thread 1 forwarded:  75292
thread 2 forwarded:  76951
thread 3 forwarded:  86251

The values also show the difference in the number of slow pops, steal attempts and steals. VTune shows a (total) ratio of cycles spent in the worker task queue code of around 1.8 (parallel) vs. 4 (G1).
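For reference, summing the columns above confirms this:

PS scavenge qpush: 795448 + 678503 + 612772 + 556778 = 2643501  (~2.64M)
G1 qpush (tot):                                        3002281  (~3.0M)
difference:                      3002281 - 2643501   =  358780  (~360k)
forwarded pushes:  124482 + 75292 + 76951 + 86251    =  362976  (~360k)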
30-10-2013