JDK-8204331 : AArch64: fix CAS not embedded in normal graph error.
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 11
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: linux
  • CPU: aarch64
  • Submitted: 2018-06-05
  • Updated: 2018-07-06
  • Resolved: 2018-06-22
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 11 JDK 12
11 b20Fixed 12Fixed
Related Reports
Relates :  
Description
The current master of JDK fails when with CAS assertion error like:

  assert(mbar != __null) failed: CAS not embedded in normal graph!

in many jtreg test case, e.g.: compiler/loopopts/superword/SumRed_Int.java

This is caused by JDK-8202377
Comments
At present, volatile accesses are lowered to ordinary memory accesses + memory fences extremely early, during bytecode parsing. This is true of both C2 and Graal. There are essentially two ways of fixing this: 1. Don't lower volatile access until code generation. Let the back ends expand them. This is problematic on machines which require fences because some fence-removal optimizations would break. 2. Annotate load and store nodes with information about volatility. Annotate fence nodes with information about their associated loads and stores. Back ends which have ldar/stlr instructions could then emit nothing for the volatile-associated fence nodes. We could even make the annotations AArch64-specific, so only that target used them at all. This would require optimizations to preserve such annotations throughout all transformations. I prefer 2 because it requires essentially no changes to non-AArch64 targets.
22-06-2018

It turns out the re-ordering of the graph generation may lead to an extra IfNode being added to the graph (for the test of the boolean value returned by the CompareAndStoreNode). This also implies the possibility of an extra BotAliasIdx Phi merging the memory flow through the Region joining paths from the associated IfFalseNode/IfTrueNode. So, to fix this problem the card_mark_to_trailing and trailing_to_card_mark predicates need to allow for up to 4 Phis in the memory flow before giving up on finding a trailing MergeMem and MemBarVolatile that mark the end of a CAS graph.
21-06-2018

Andrew, Thanks for your review and sharing! Your proposal about fixing CAS graph is simpler and makes sense!
15-06-2018

Hi Zhongwei, Thanks for attempting the patch but I think it is not the correct way to solve this. The change in the memory flow actually now makes the CAS graph look very like the StoreN/P graph. So, the best fix is to change the graph check predicates to 1) remove the special case handling for CAS 2) modify the case handling for Store so it works graphs of the same shape but allows for either a StoreN node or a CAS + SCMemProj pair That change simplifies the logic of the leading_to_normal and normal_to_leading functions. What is better is that it requires no significant change to the logic of the card_mark_to_trailing and trailing_to_card_mark functions. Thank you for pointing out the change of order for the trailing MemBarCPUOrder and MemBarVolatile. I had noticed that this was being done in ~C2AccessFence() but had not noted that the order of the barriers had been changed. This rewrite also affects the ordering of membars for unsafe volatile reads, reversing the order of the trailing MemBarCPUOrder and MemBarCPUAcquire. The reordering seems to be completely unnecessary as far as correct graph management is concerned. It looks like it has been done simply to make the code easier to write/follow. I could fix C2AccessFence to restore the original behaviour but I think the simpler code is probably a good thing. So, I have updated the test predicates -- also the comments in aarch64.ad -- to take account of this new ordering. I am still investigating an intermittent problem with trailing_to_card_mark. Normally there are only 3 Phis in the Bottom memory chain. However, in some cases I am occasionally seeing more than 3 -- even though I am compiling the same CAS.
15-06-2018

And I also notice the normal graph for StoreN/P changed after https://bugs.openjdk.java.net/browse/JDK-8202377 before: MemBarRelease MemBarCPUOrder {leading} | \ . . . | StoreN/P[mo_release] . . . | / MergeMem | MemBarVolatile {card mark or trailing} MemBarCPUOrder after: MemBarRelease MemBarCPUOrder {leading} | \ . . . | StoreN/P[mo_release] . . . | / MergeMem | MemBarCPUOrder MemBarVolatile This change leads to some missing eliding in a case like: static AtomicLong al = new AtomicLong(); for (int i = 0; i < k; i++) { al.set(i); } I found this case accidentally. As discussed above, I also feel some test is needed.
14-06-2018

Hi, Andrew, I was trying to revert the insertion order of post-write barrier and SCMemProj in this patch: http://mail.openjdk.java.net/pipermail/hotspot-dev/2018-June/032806.html, but it seems better to be done in ad file. Yesterday, I also did a patch which tried to add more checkings to normal_to_leading and leading_to_normal: http://cr.openjdk.java.net/~zyao/8204331/webrev.01/ Since I'm not quite familiar with this part of the code, could you help take a look? Or if it is totally wrong, please ignore it. Thanks!
14-06-2018

We have always known this trick is vulnerable to changes in the shape of the ideal graph - most especially when the GC code changes. What is surprising is not that it has happened so much as how infrequently it has happened (this is only the second time now since it first went into our JDK8 releases). The problem is that there does not really appear to be any other way to achieve the desired goal of replacing memory barriers with ldar/stlr instructions. Yes, testing this would be a really good idea. However, I also struggled to find a way to implement a test (by contrast, with an equivalent trick that I have prototyped fro Graal/AArch64 I can hijack the Graal compiler back end and test it thoroughly). A debug build-only switch which generated log output form the graph check algorithms is something I never thought of. It might well be useful to allow some basic unit tests to work. I'll see if I can come up with something based on that idea.
13-06-2018

Great tracking that down. I'm sure a great deal of work went into that pattern matching code, but it seems vulnerable to changes in shared code like this. I've been trying to think of a way to verify that the optimizations are firing in the expected cases (gc barriers (across several GCs), volatiles, unsafe code, var handles, etc). But I haven't come up with any ideas other than running a series of test code through the system with logging in the optimizations, and searching the logs for the expected messages. Do you think testing this is a good idea? And do you have any better ideas on how to do it?
13-06-2018

This is happening because of a change to the order of inlining of unsafe loadstore operations following JDK-8202377 which has changed the shape of the memory flow subgraph for CAS nodes., Previously a GC pre-barrier, unsafe loadstore node and GC post barrier were added to the graph and then an SCMemProj was appended to the loadstore node in order to ensure the loadstore was not optimised away. This meant that the SCMemProj mem feed went direct into the trailing acquire membar (well, via a merge of it and the post barrier memory flow), making it simple to directly traverse the graph from the CAS to locate trailing membar. JDK-8202377 delegates responsibility for building the graph via the GC interface to the GC barrier set C2 assembler. In consequence the SCMemProj is appended as part of constructing the loadstore, placing it earlier in the memory flow. It feeds its memory output into the GC post barrier graph. In consequence the test predicates for the code generator find a merge feeding the volatile card membar rather than the trailing acquire membar. This change is actually fairly innocuous as it means that the CAS graph now looks like a volatile store graph -- where the StoreN/P node feeds its memory output into the GC post barrier graph. So, only a small amount of tweaking of the graph check predicates will be needed to correctly detect a CAS graph.
13-06-2018