Bug ID: JDK-8200450 Root cause analysis for JDK-8200366

Type: Bug
Component: hotspot
Sub-Component: compiler
Affected Version: 11

Priority: P4
Status: Resolved
Resolution: Fixed
OS: generic
CPU: generic

Submitted: 2018-03-29
Updated: 2018-04-26
Resolved: 2018-04-19

JDK 11
11 b11 Fixed

After JDK-8198691 (CodeHeap State Analytics), tests showed intermittent, very infrequent, hard-to-reproduce SIGSEGVs in CodeHeapState::print_names(). These are documented in JDK-8200366. The fix for this bug is preliminary because a reproducer test and the root cause could not be found due to time constraints.

I pushed the changes but am still waiting for the "robot" to update the bug. http://hg.openjdk.java.net/jdk/jdk/rev/f909f09569ca
19-04-2018
Testing passed clean.
19-04-2018
Webrev: http://cr.openjdk.java.net/~lucy/webrevs/8200450.00/
17-04-2018
During the past days, I have been working on reproducing and understanding what happens when the CodeHeap State Analytics function print_names() fails (runs into a SIGSEGV). Here is what I found out.

The CodeHeap is a living thing by nature; that's nothing new at all. To be absolutely sure no other thread is altering its contents, the CodeCache_lock must be held. This is not desirable for print_names() because printing the names of all CodeBlobs of a large CodeHeap may take quite a while, and new nmethod compilations would be stalled in the meantime. Without holding that lock, there are two actions by other threads that may cause print_names() to stumble: adding a new CodeBlob (most often caused by registration of a freshly compiled nmethod) and nmethod sweeping. Note that the assumption that nmethod sweeping poses a risk is theoretical. I never observed a situation that could actually be interpreted as a side effect of nmethod sweeping.

CodeBlob construction does not include any kind of synchronization besides that inherent to acquiring/releasing the CodeCache_lock. It is therefore hard to reliably detect the "fully initialized" state, and it is possible to see inconsistent (not fully initialized) CodeBlob instances. The same is true for the CodeBlob subclasses, class CompiledMethod and class nmethod in our case.

Uninitialized pointer and size fields have been observed in CodeBlob instances. These inconsistent states can be detected by plausibility checking various instance fields. For example, it is a consistency requirement that (this + header_size == relocation_begin) holds true. With such checks implemented, no further SIGSEGV has been observed in that context.

Much less frequently, a non-NULL but uninitialized (invalid) pointer in a CompiledMethod instance has caused SIGSEGVs as well. I could not find any simple plausibility checks similar to those for CodeBlobs. To mitigate the risk, a new function is_readable_pointer(p) has been introduced. With the help of SafeFetch32, it checks for read access at the given address. No further SIGSEGV has been observed since then.

So what are the options for CodeHeap State Analytics going forward?

* Disable/remove the print_names() function altogether. This is the most rigorous approach. If there is no code executed, there can't be a failure.
* Disable/remove the print_names() function from the jcmd interface, but leave it in for -Xlog:codecache={Debug|Trace}. The -Xlog: output is triggered in two situations: during VM shutdown and/or the first time the VM runs into a "code cache full" condition and the JIT compiler has been disabled. In both cases, I expect the new CodeBlob construction rate to be low (shutdown) or even zero (code cache full).
* Introduce a new (volatile) state flag (at least) in class CodeBlob and class Method to indicate "class initialization complete". Only with this flag set is any further inspection of the data structure safely possible. The flag only protects against looking at inconsistent states during instance construction; nmethod sweeper activity could still interfere.
* Introduce a new lock (or use CodeHeapStateAnalytics_lock) to prevent concurrency between the nmethod sweeper and print_names(). Blocking the sweeper for an extended time span is probably more tolerable than blocking all compiler threads.
* Accept the code as it is now (with the new safeguarding checks). Though not zero, the remaining risk is much lower than it was originally. I could not reproduce any kind of anomaly so far.

An RFR will be sent out very soon with a reference to this comment.
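The plausibility-check idea described in this comment can be illustrated with a minimal, standalone C++ sketch. It is not the HotSpot code: DemoBlob, is_plausible() and place_blob() are hypothetical names invented for this illustration, and the field layout is an assumption. The sketch only shows the kind of cross-field check mentioned above (this + header_size == relocation_begin), plus a few bounds checks against an assumed heap segment. It does not address concurrent writers, and it does not probe whether memory is readable, which is what the comment says is_readable_pointer(p) does via SafeFetch32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical stand-in for a code blob header. This is NOT HotSpot's CodeBlob;
// the fields are assumptions chosen to mirror the check described in the text.
struct DemoBlob {
    std::uint32_t size;                     // total blob size in bytes (header + payload)
    std::uint32_t header_size;              // size of the header part in bytes
    const unsigned char* relocation_begin;  // expected to start right after the header
};

// Returns true only if the blob lies inside [seg_low, seg_high) and its fields
// are mutually consistent. A blob that is still being constructed would
// typically fail one of these checks because its fields hold stale values.
inline bool is_plausible(const DemoBlob* blob,
                         const unsigned char* seg_low,
                         const unsigned char* seg_high) {
    const auto self = reinterpret_cast<std::uintptr_t>(blob);
    const auto low  = reinterpret_cast<std::uintptr_t>(seg_low);
    const auto high = reinterpret_cast<std::uintptr_t>(seg_high);

    if (blob == nullptr || self < low || self >= high) return false;
    if (high - self < sizeof(DemoBlob)) return false;         // header must fit in the segment
    if (blob->header_size < sizeof(DemoBlob)) return false;   // header cannot be smaller than the struct
    if (blob->size < blob->header_size) return false;         // blob cannot be smaller than its header
    if (blob->size > high - self) return false;               // blob must end inside the segment

    // Consistency requirement quoted in the text: this + header_size == relocation_begin.
    return reinterpret_cast<std::uintptr_t>(blob->relocation_begin) == self + blob->header_size;
}

// Demo helper (hypothetical): constructs a consistent DemoBlob at the start of `mem`.
inline DemoBlob* place_blob(unsigned char* mem, std::uint32_t total_size) {
    assert(reinterpret_cast<std::uintptr_t>(mem) % alignof(DemoBlob) == 0);
    const std::uint32_t hdr = static_cast<std::uint32_t>(sizeof(DemoBlob));
    return new (mem) DemoBlob{total_size, hdr, mem + hdr};
}
```

In this sketch a caller would run is_plausible() on each candidate blob and skip any blob that fails, instead of dereferencing its fields. As the comment itself notes, such checks lower the risk rather than eliminate it.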
17-04-2018
ILW = possible SIGSEGV in CodeHeapState::print_names(); intermittent, rare; none = MLH = P4
02-04-2018