JDK-6938732 : Ergonomify (parallel) reference processing
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: hs18,7
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • OS: generic
  • CPU: generic
  • Submitted: 2010-03-26
  • Updated: 2019-08-29
Description
Useful when you want good reference processing performance out of the box on a variety of platforms and application workloads, and across changing workloads or application phases.

Comments
Created JDK-8155257 (ParNew/CMS: Clean up promoted object tracking) for the promoted object tracking clean-up.
27-04-2016

I think the first task is to clean up when the parallel worker PromotionInfo lists are enabled to track objects (i.e., only during the main ParNew parallel phase and not during ref processing, as it's not needed there; see the earlier comment) and to make sure the lists are torn down in parallel and not serially (it turns out this is a simple one-line patch). This is somewhat orthogonal to the ergo ref processing changes, but it will make them easier to implement for ParNew/CMS. I'll create a separate CR for it.
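To make the intent concrete, here is a minimal sketch of what the parallel tear-down could look like, assuming HotSpot's usual work-gang pattern; PromoTrackingTearDownTask and promotion_info_for() are hypothetical names for illustration, not the actual identifiers:

    // Hypothetical sketch only: the per-worker PromotionInfo lists are
    // torn down by the workers themselves instead of one serial loop.
    // The task/helper names below are illustrative, not real HotSpot code.
    class PromoTrackingTearDownTask : public AbstractGangTask {
    public:
      PromoTrackingTearDownTask() : AbstractGangTask("Promo tear-down") {}
      void work(uint worker_id) {
        // Each worker restores preserved mark words and empties its own
        // promotion list; no cross-worker synchronization is needed.
        promotion_info_for(worker_id)->stopTrackingPromotions();
      }
    };

    // Instead of a serial loop over all per-worker lists:
    PromoTrackingTearDownTask task;
    workers()->run_task(&task);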
26-04-2016

I implemented a quick prototype that switches between serial and parallel ref processing per reference type (apart from JNI Weak Refs; those are always done serially even when parallel ref processing is enabled). There are two somewhat orthogonal things that needed to be implemented for this: a) the policy on when to do serial or parallel ref processing, and b) the code changes that allow both the serial and parallel ref processing code to be called during the same GC (you'd think this would be straightforward, but it's not, at least not for ParNew/CMS; see below). For the policy I just implemented a very simple heuristic: if the ref count is over a certain limit, do it in parallel; otherwise, do it serially. For a test tailored to that simple policy I'm getting encouraging results: the ergonomic policy outperforms both the totally serial and totally parallel policies in terms of ref processing overhead (so this is work worth pursuing IMHO). I see more benefit from the ergonomic policy when there's a non-trivial number of GC threads (>8). That is not surprising: I assume the more GC threads there are, the more overhead there is when starting the parallel workers.
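For reference, the heuristic in the prototype amounts to something like the following sketch; the threshold name and value are illustrative, not an actual flag (ParallelRefProcEnabled is the existing flag):

    // Hypothetical sketch of the prototype's policy: per reference type,
    // process in parallel only above a discovered-reference threshold.
    // ErgoRefProcThreshold is a made-up name/value, not a HotSpot flag.
    static bool should_process_type_in_parallel(size_t discovered_count) {
      const size_t ErgoRefProcThreshold = 1000; // illustrative value
      return ParallelRefProcEnabled && discovered_count > ErgoRefProcThreshold;
    }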
I only tested the changes with ParNew/CMS, where changing between serial and parallel ref processing during the same GC was a bit tricky. The reason is that the serial and parallel phases use different mechanisms for tracking and iterating over copied objects (even when the main GC phase is done in parallel), and these mechanisms somewhat conflict with each other. I'll describe the differences next (along with some additional complexities) so we have them written down somewhere.

ParNew Serial Ref Processing: It relies on the "save marks" mechanism to discover objects that were just copied so that it can iterate over them. For the young gen, this is straightforward: checkpoint top, copy some objects, then iterate between the old and current version of top. This doesn't work for the CMS old gen, as there's no contiguous allocation. Instead, a PromotionInfo object maintains a list of objects promoted to the old gen (objects are chained through their mark word, it seems, and non-prototypical mark words are preserved; hmmm... where have I heard that one before?). Objects are added to that list as they are promoted, then an iteration walks over the list and removes objects from it as they are visited (re-instating their mark word when necessary).

ParNew Parallel Ref Processing: It doesn't rely on the "save marks" mechanism; instead, the parallel workers keep track of objects to visit using their taskqueues. There's an additional fly in the ointment: each parallel worker also has a PromotionInfo object with which it tracks all the objects it promotes during a GC. This seems a bit counter-intuitive, given that the promotion list is not needed for a parallel worker to keep track of which objects it has copied (it has its taskqueue for that). There is, however, a subtle reason for this. When promoted objects are added to a PromotionInfo list they are also marked specially so that they can be differentiated from objects that were not promoted during the GC. This allows the card table scanning code to skip objects that were just promoted and happened to end up on a dirty card.

So, the PromotionInfo lists are populated during the main ParNew parallel phase. But how are they torn down, given that they are not iterated over? Serially, after the main ParNew parallel phase is done (it has to be: they can only be torn down after all card table scanning is done), and the list tear-down time is attributed to... *drum roll* reference processing! For serial ref processing it is done before ref processing starts, and for parallel ref processing it is done before processing the JNI Weak references (remember: JNI Weak processing is always done serially and after all other reference types have been processed).

So, if we want to interleave serial and parallel ref processing phases, what do we need to watch out for?

Serial Ref Processing -> Parallel Ref Processing: Serial ref processing calls save_marks() often (it relies on it; see above), which unfortunately also tells the parallel worker PromotionInfo objects to start tracking promoted objects. So, any objects that are promoted during a subsequent parallel ref processing phase will be added to PromotionInfo lists. Given that we don't scan the card table during ref processing, I think this is unnecessary and just creates more work later when those lists need to be torn down.

Parallel Ref Processing -> Serial Ref Processing: Parallel workers use PLABs on the survivor space to copy objects. Currently, the serial phase only calls save_marks() at the end of a copying phase, so at the start the saved mark could be wrong for the survivor space (and iterating from it might include half-full PLABs and objects that were copied by the previous parallel phase). Thankfully, this is a simple fix: also call save_marks() at the start of each serial ref processing phase (see the sketch at the end of this comment).

What complicates things further is, of course, how heavily abstracted the code is (GC-specific executors for parallel ref processing, ParNew accessing the CMS old gen through the Generation abstraction, etc.). So even simple changes need some effort to work out how to implement them. Sigh. I should also point out: I only looked at how reference processing is implemented in ParNew/CMS. Hopefully it will be simpler in other GCs... he said! :-)
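As promised, a minimal sketch of the Parallel -> Serial fix described above; the wrapper name is hypothetical, but save_marks() is the existing checkpointing call:

    // Hypothetical sketch: re-checkpoint the spaces at the *start* of each
    // serial ref processing phase, not just at the end of a copying phase,
    // so the saved mark cannot lag behind objects (and half-full PLABs)
    // copied by a preceding parallel phase. Wrapper name is illustrative.
    void begin_serial_ref_proc_phase(Generation* young_gen) {
      young_gen->save_marks(); // existing mechanism, called earlier than today
      // ... proceed with the serial "iterate from saved mark" loop ...
    }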
26-04-2016

Firing up threads that do nothing (or very little) yet potentially take a lot of time is a common (known) issue. It can mostly be attributed to lock contention and scheduling. There are other CRs that try to fix this issue for various phases in G1 (e.g. JDK-8153745), but many others can be found, and naturally the other collectors are just as affected when doing parallel work in their respective phases.
13-04-2016

A few thoughts on this:

- First, I think it will be helpful to decide separately for each Reference type whether to do the ref processing serially or in parallel. If an app is using, say, very few SoftReferences but a lot of WeakReferences, we should do the right thing for each Reference type. This could actually result in achieving shorter ref processing times than if it's done entirely serially or entirely in parallel.

- The obvious way to make the decision is based on the number of References that need to be processed. However, the amount of work needed per Reference could vary wildly for Reference types (all but WeakReferences) that require, or might require, referents and their transitive closure to be marked. I.e., even if only a small number of References need to be processed, one of them might require a large percentage of the heap to be processed. We could try to run some experiments on "typical" workloads to work out the "typical" relative cost of processing one Reference for each Reference type and use that as a weight in the decision logic. But numbers from "typical" workloads will most likely be incorrect for many apps.

- One strange behavior we're seeing right now is that, in some cases, parallel SoftReference processing takes 1-2ms even when there are 0 SoftReferences to process. (As if it's possible to dislike SoftReferences even more!) So, we should at least have a short-cut to bail out early when the Reference count of a particular type is 0, because for SoftReferences we could fire up the parallel workers 3 times for basically nothing (a sketch of such a short-cut follows below).
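A minimal sketch of that short-cut, assuming the per-queue discovered lists can be summed up front; the names here are illustrative rather than the actual ReferenceProcessor code:

    // Hypothetical sketch: count discovered references of this type first
    // and skip firing up the work gang when there is nothing to do.
    size_t total = 0;
    for (uint i = 0; i < num_queues; i++) {
      total += discovered_lists[i].length(); // per-queue discovered refs
    }
    if (total == 0) {
      return; // bail out: no refs of this type, no workers started
    }
    // ... otherwise choose serial or parallel processing as usual ...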
13-04-2016

Assigned to you. We can always change the fix version depending on when it is pushed later.
13-04-2016

This change will be of interest to us (Twitter). Parallel ref processing seems to have helped a couple of our services. However, the majority of our services do not use References / finalizers heavily and enabling parallel ref processing for them causes young GCs to be slower. The difference is typically only a few ms. However young GC times are often quite short, therefore the difference as a percentage is non-trivial. We'd rather not have to decide per service whether to enable parallel ref processing or not, so deciding ergonomically will be very helpful. If no-one else is working on this, I'm happy to take it.
13-04-2016

The task is to provide the capability to dynamically turn it on, some policy to turn it on, and some way of telling that it has been turned on/off. There is no need to try to find the best number of threads in this CR; just a binary decision. The policy could (possibly) be based on the relative time it takes to do reference processing, and some upper bound on that time. Potentially also some heuristic that tries to incrementally increase/decrease the number of threads to find the best value.
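As a strawman, the binary decision could look something like this sketch, assuming we track reference processing time against total pause time; all names and threshold values are made up:

    // Hypothetical sketch of a binary ergonomic decision based on the
    // relative cost of reference processing. Names/values are made up.
    bool enable_parallel_ref_proc(double ref_proc_ms, double pause_ms) {
      const double RefProcFractionLimit = 0.10; // made-up: 10% of the pause
      const double RefProcUpperBoundMs  = 5.0;  // made-up absolute bound
      bool enable = ref_proc_ms > pause_ms * RefProcFractionLimit ||
                    ref_proc_ms > RefProcUpperBoundMs;
      // Log the decision here, so users can tell it has been turned on/off.
      return enable;
    }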
16-04-2015