JDK-8226731 : Remove StoreLoad in G1 post barrier
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 14
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2019-06-25
  • Updated: 2020-09-16
Sub Tasks
JDK-8236485
Description
In 8u20(?) we had to introduce a StoreLoad memory barrier instruction into the G1 post write barrier to make sure that the refinement thread sees the reference store from the mutator (JDK-8014555).

This made the G1 post write barrier pretty large, due to the mitigations added to limit the performance impact, i.e.

given

x.a = q

and

p = @x.a

the barrier currently looks as follows:

if (p and q in same region or q NULL) -> exit
if (card(p) == Young) -> exit (*)
StoreLoad; (*)
if (card(p) == Old) -> exit
card(p) = Dirty;
enqueue(card(p))

(The lines marked with (*) were added in JDK-8014555.)
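For illustration, here is the same barrier as a hedged C++ sketch; every helper name below (region_index, card_for, the card values, enqueue, storeload_fence) is made up for illustration, not an actual HotSpot identifier:

#include <cstdint>

// Hypothetical stand-ins for HotSpot internals; names are illustrative.
enum CardValue : uint8_t { g1_young_card, g1_old_card, g1_dirty_card };
uintptr_t region_index(const void* addr);      // heap region containing addr
volatile uint8_t* card_for(const void* addr);  // card table entry for addr
void enqueue(volatile uint8_t* card);          // push onto refinement queue
void storeload_fence();                        // full StoreLoad barrier

void g1_post_write_barrier(const void* p, const void* q) {
  if (q == nullptr || region_index(p) == region_index(q)) return;
  volatile uint8_t* card = card_for(p);
  if (*card == g1_young_card) return;  // (*) added in JDK-8014555
  storeload_fence();                   // (*) added in JDK-8014555
  if (*card == g1_old_card) return;
  *card = g1_dirty_card;
  enqueue(card);
}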

However, the StoreLoad in the write barrier (and the mitigation) is not necessary: in May 2015 [~eosterlund] proposed a technique [1] that uses regular global synchronization instead, in a discussion about another memory barrier missing in the CMS write barrier.

The same technique could be applied to G1, and there is even a prototype for this [2]. This code uses a system-wide memory synchronization call (e.g. sys_membarrier() or an equivalent on Linux).
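For reference, a minimal sketch of such a call on Linux via the membarrier(2) system call (MEMBARRIER_CMD_GLOBAL was named MEMBARRIER_CMD_SHARED in older kernels; availability depends on the kernel version):

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

// Force a full memory barrier on all running threads in the system;
// returns true on success. There is no glibc wrapper, so the raw
// syscall is used.
static bool global_memory_barrier() {
  return syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL, 0) == 0;
}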

A less invasive way to implement this is to add an epoch counter to the refinement buffers: if a buffer's epoch counter is less than the minimum of the epoch counters attached to the threads (each indicating the "time" at which that thread last performed memory synchronization), that buffer can be safely refined.
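A minimal sketch of that refinement check, with entirely made-up type and field names (RefinementBuffer, JavaThreadInfo, sync_epoch are assumptions, not HotSpot identifiers):

#include <algorithm>
#include <cstdint>
#include <vector>

// All names are illustrative, not actual HotSpot identifiers.
struct JavaThreadInfo {
  uint64_t sync_epoch;  // global epoch at this thread's last memory sync
};

struct RefinementBuffer {
  uint64_t epoch;       // global epoch stamped onto the buffer
  // ... card entries ...
};

// A buffer is safe to refine once every thread has synchronized after
// the buffer was stamped (quiescent threads could be skipped, see below).
bool can_refine(const RefinementBuffer& buf,
                const std::vector<JavaThreadInfo>& threads) {
  uint64_t min_epoch = UINT64_MAX;
  for (const JavaThreadInfo& t : threads) {
    min_epoch = std::min(min_epoch, t.sync_epoch);
  }
  return buf.epoch < min_epoch;
}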

These updates to the epoch counters can be piggy-backed on existing VM transitions/handshakes, which already perform a memory barrier (there used to be a global UseMembar flag controlling this, but it has been removed, with the behavior defaulted to true).

If a thread does not seem to make progress, force such an epoch update via a handshake.

However, according to [~eosterlund], using existing thread transitions already gets you very far without needing to force any threads (also because of existing forced background activity like biased locking revocation or other global handshakes).

Quiescent threads (blocked, in native) can be ignored for this calculation - they are not going to refine anything anyway, and at the next VM transition they will automatically update to the latest epoch.

(sys_membarrier may still be used in "urgent" cases)

With this technique you can reduce the barrier to

if (p and q in same region or q NULL) -> exit
if (card(p) == Old) -> exit
card(p) = Dirty;
enqueue(card(p))

This is known to improve performance quite a bit.

[1] http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2015-May/013264.html
[2] http://cr.openjdk.java.net/~eosterlund/g1_experiments/lazy_sync/webrev.06/ (Note that this webrev also contains changes for JDK-8087198)
Comments
There is JDK-8220049 planned for 14. Let's assume for now that TLH is always available; one could always use sys_membarrier or an equivalent in these cases, or at worst override -XX:-ThreadLocalHandshakes. Do not implement TLH yourselves. :)
02-08-2019

As we will remove -XX:-ThreadLocalHandshakes soon, we don't need to support that case.
02-08-2019

I prototyped that patch back in 2015. Back then we did not have thread-local handshakes, so I implemented it using the asynchronous asymmetric Dekker synchronization mechanism (global synchronizer). That is actually how thread-local handshakes (TLH) started. Today we have a TLH implementation using indirect loads in HotSpot. While that is not quite as good as the conditional branch for our purposes, as the slow path is slower, I would still stick to the default global handshake mechanism for sure. Just shoot no-op handshakes (they now always run cpuid on the target thread, which is all the sync we need). Hope this helps...
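For illustration, such a no-op handshake could look roughly like this when sketched against HotSpot's HandshakeClosure/Handshake::execute machinery (the exact API has changed between releases, so treat this as a sketch only, not standalone-compilable code):

// Sketch against HotSpot internals; signatures vary by release.
class NoopSyncClosure : public HandshakeClosure {
public:
  NoopSyncClosure() : HandshakeClosure("NoopSync") {}
  // The handshake itself already forces the target thread through a
  // serializing operation, so the body can stay empty.
  void do_thread(Thread* thread) override {}
};

void synchronize_with_all_java_threads() {
  NoopSyncClosure cl;
  Handshake::execute(&cl);  // handshakes every JavaThread
}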
29-06-2019

Thanks for the explanation! It makes a lot of sense to me now. I think it is reasonable to first implement the epoch maintenance and handshake protocol, and probably only add a test, guarded by a develop/diagnostic VM flag, that checks that the handshake protocol works and makes progress. If this sounds good, I'll create several sub-task CRs for this CR. Another question: I see [2] added a ThreadLocalSafepoints flag for yieldpoint support. Is it necessary to add yieldpoints to implement the handshake? It would be less invasive if the handshake could be built on top of the existing polling-page safepoint mechanism.
29-06-2019

Going to answer the last question first, because it is the easier one: "Also what is the plan for [2] and JDK-8087198? Rebasing and stripping down [2] will likely implement this CR, and JDK-8087198 also seems helpful to this CR?"

This CR and JDK-8087198 kind of complement each other. I.e. you need to break up enqueuing the buffers into the global set of buffers and processing the buffers, since in between you need to make sure that memory synchronization is proper. The main issue is that the store to the field must be visible to the refinement thread before it processes the card contents, and the card value update must be visible before refinement threads scan the card. It is fine (and currently may already happen) to do repeated scanning of the card.

JDK-8087198 is maybe more about optimizing the processing of the buffer contents by breaking up enqueuing the buffers and processing too :) and additionally distributing and optimizing the buffer processing (e.g. binning, sorting, batching, ...). So you are right, one could implement JDK-8087198 before this. I.e. when enqueuing a new buffer into the overall buffer set, do preprocessing and all the other tasks described there, without changes to the barrier and synchronization. Your view of the situation - that JDK-8087198 is about breaking up enqueuing and processing, and that this change just adds the synchronization in between - is fine as well.

You could even argue that you want to do the break-up of enqueuing and processing of buffers separately (that would make these parts even easier to review) - but I would separate out that change only after you have done either of the other two, to minimize shuffling around stuff. I.e. as a preparation patch for either. Feel free to choose the order and steps as you see fit and think is easiest to understand, but I would really recommend separating the buffer processing improvements from the memory synchronization changes.
28-06-2019

I see [2] contains something similar to the per-thread and global epoch (JavaThread::_serialized_memory_version and GlobalSynchronizer::_global_serialized_memory_version). I have a question about the per-refinement-buffer epoch approach. I assume the write post-barrier needs to update the buffer's epoch to the current thread's (T) epoch:

T.enqueue(card_addr):
  T.buf.epoch = T.epoch
  T.buf.add(card_addr)

Or maybe only when it retires the buffer to the global queue set. Then when a concurrent refinement thread processes a buffer:

if (buf.epoch < MIN(t.epoch for each Java thread t)) {
  refine(buf);
}

What if T1 enqueues a card and updates the buffer's epoch to T1.epoch, and then later T2 tries to dirty the same card, but skips it because it is already dirty? Should a refinement thread then wait for each Java thread's epoch to exceed the value of T2.epoch, instead of T1.epoch? Or perhaps the per-refinement-buffer epoch is maintained in a different way?

Also, what is the plan for [2] and JDK-8087198? Rebasing and stripping down [2] will likely implement this CR, and JDK-8087198 also seems helpful to this CR.
28-06-2019

"I see [2] contains something similar to the per-thread and global epoch (JavaThread::_serialized_memory_version and GlobalSynchronizer::_global_serialized_memory_version). I have a question for the per-refinement-buffer epoch approach. " Yes, the prototype implements the same thing. The barrier does not need to update the thread's epoch, this is done every time the mutator thread passed a synchronization point, i.e. executed a StoreLoad. This is to let the refinement threads know that that thread has seen a particular memory state, i.e. has definitely executed the StoreLoad that it would otherwise execute in the barrier. As for the other question, here is a short answer: The current global epoch needs to be attached to the buffer, but only after you executed the StoreLoad in the refinement threads. This is not necessarily the original buffer the mutator refined. The patch [2] had BufferedRefineCardTableEntryClosure caching buffers (work) until it is sure that a StoreLoad has executed in the mutators (the epoch of all mutators is >= that epoch) after it executed its StoreLoad. Here is an overview of what needs to be done, applied to current mechanisms (i.e. the prototype implemented its own handshaking and had other optimizations and all that JDK-8087198 stuff that just fit in). Setup: - every thread has an epoch indicating whether it has seen a particular memory state. We frequently(*) update this with the value of a global (increasing) counter; we can reset that global counter at GC to avoid overflows (but since it is a 64 bit value, we probably do not care about overflow too much; even on 32 bit machines this should not matter, and in the worst case executes a safepoint operation that resets all epoch values properly). Mutator code: The write barrier pushes new cards onto its local buffers, and eventually calls into runtime to hand over the full buffer to the refinement queue as before. No StoreLoad in the write barrier (as the refinement thread simply waits until it knows that all threads must have executed one). Refinement threads: Using the existing machinery, as it gets a new buffer to process, the refinement thread: 1) cleans the cards for all buffer entries and does some optional pre-filtering (like in the BufferedRefineCardTableEntryClosure::pre_sync() method using clean_card, i.e. first parts of current G1RemSet::refine_card_concurrently()) into a separate buffer (if you do pre-filtering) 2) issue a StoreLoad (refinement thread's synchronization point) 3) assign the current epoch in some way to this (intermediate) buffer 4) [potentially do some stuff to pass more time like the sorting in the example code] 5) sync with all threads' global counters: 6a) if all thread's global counters are already >= epoch, then goto 7) - the refinement thread knows now that all mutator threads have executed the StoreLoad otherwise: 6b) issue a Handshake 6b1) in the Handshake, first check whether the global counters of threads are already >= epoch (due to a previous handshake) 6b2) the handshake operation, i.e. the code that the thread executes, it updates its own thread global counters and do Storeload, i.e. does the same as if it passed one of these synchronization points 7) process the cards in the buffer in whatever way desired This is a very basic (if slow) scheme that should work. The prototype [2] does not always issue a handshake if the condition in 6a fails since it is expensive. 
Only if repeated attempts to get in that situation (caching the buffers along the way) fail, the refinement thread performs the handshake. And if that fails over and over again, it does the big hammer synchronization (e.g. sys_membarrier equivalent) to know that after that all threads memory must have been synchronized. There are some exceptional cases to think through (e.g. mutator needs to process its buffer itself), but that's the basic scheme.
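Compressed into code, steps 1) through 7) could look roughly like this (all names - clean_and_prefilter, storeload_fence, global_epoch, all_threads_synced_to, run_noop_handshake, process_cards - are illustrative, not the prototype's identifiers):

#include <cstdint>

// Illustrative stand-ins for the steps described above.
struct Buffer;
Buffer* clean_and_prefilter(Buffer* buf);    // step 1: clean cards, pre-filter
void storeload_fence();                      // full StoreLoad barrier
uint64_t global_epoch();                     // current global epoch counter
bool all_threads_synced_to(uint64_t epoch);  // every thread's epoch >= epoch?
void run_noop_handshake();                   // handshake updating thread epochs
void process_cards(Buffer* buf);             // step 7: the actual refinement

void refine(Buffer* buf) {
  Buffer* work = clean_and_prefilter(buf);  // 1)
  storeload_fence();                        // 2) refiner's synchronization point
  uint64_t epoch = global_epoch();          // 3) stamp the (intermediate) buffer
  // 4) optionally do other useful work here (sorting, batching, ...)
  if (!all_threads_synced_to(epoch)) {      // 5) + 6a)
    run_noop_handshake();                   // 6b) force remaining threads to sync
  }
  process_cards(work);                      // 7)
}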
28-06-2019

I'm glad to see interest in this resurfacing. At the time I wrote the referenced patch, sys_membarrier wasn't a thing in the kernel yet, and neither were global handshakes. Nowadays this should work better than ever, as we always have handshakes to fall back on.

The lazy synchronization also implied that concurrent refinement waits for buffers to fill up more, to amortize away the cost of the lazy synchronization. That extra delay in processing the buffers also caused cards to stick around in the dirty state in the card table for longer, making mutators exit the barrier early.

Before I forget about it, as I mentioned to [~tschatzl] before, the NULL check and the region check can also be combined into a single check using ANDN. So given a new value q stored to a field at address p, instead of:

if (q != NULL) {
  if (((q ^ p) >> RegionLog) != 0) {
    ...
  }
}

...you can do:

if (((q &~ p) >> RegionLog) != 0) {
  ...
}

where the &~ folds to a single ANDN instruction. On x86_64, the two checks become 3 instructions including the branch (andn tmp, q, p; sarq tmp, RegionLog; jne slower_path). Is there an enhancement filed for this too? If not, I can file one before I forget about this.
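As a self-contained sketch of that combined check (the RegionLog value is illustrative, not G1's actual region shift):

#include <cstdint>

const int RegionLog = 21;  // log2 of the region size; value is illustrative

// Mirrors the combined filter from the comment above: q == NULL and
// "p and q in the same region" both make the expression zero; with
// BMI1, q & ~p compiles to a single ANDN instruction on x86_64.
bool needs_post_barrier(uintptr_t p, uintptr_t q) {
  return ((q & ~p) >> RegionLog) != 0;
}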
25-06-2019