JDK-8219586 : CodeHeap State Analytics processes dead nmethods
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 11.0.12, 13, 14
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2019-02-22
  • Updated: 2024-11-22
  • Resolved: 2020-09-24
Fixed in:
  • JDK 11: 11.0.12
  • JDK 16: b18
Description
The CodeHeap State Analytics framework walks over all code blobs, including dead code blobs, and then uses CompiledMethod::nmethod_access_is_safe(nmethod* nm) to figure out if an nmethod "is safe to access" or not.

There are a number of problems with this function.
1) The member function is declared on CompiledMethod, but it takes an nmethod* parameter and its name is prefixed with "nmethod". This makes no sense.
2) Reading method->signature() is not safe unless the method is guaranteed to be live, because it pointer-chases through the constMethod, which could be garbage memory.
3) The is_zombie() member function is virtual, so calling it on something that is "not safe to access" may crash due to the vtable lookup.
4) Its use of SafeFetch may produce false negatives.

We should stop walking freed memory and reporting things on it. It is a criminal act to do so. Instead, we should take the CodeCache_lock and always walk over the live code blobs. If live code blobs sometimes have uninitialized fields, we should make sure to initialize those fields. I don't want functions that try to determine if freed memory is still okay to access or not. It is bound to become a pain to maintain, and I have already run into problems with this code.
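As a rough illustration of the fix direction suggested here (iterate only live code blobs under the CodeCache_lock), consider the sketch below. The iterator and filter names are taken from the discussion further down in the comments; the exact MutexLocker flavor and constructor signatures vary across JDK versions, so treat this as an illustration rather than the actual patch.

  // Sketch, not the actual fix: report only live, not-unloading compiled
  // methods while holding the CodeCache_lock, instead of parsing freed memory.
  {
    MutexLocker mu(CodeCache_lock, Mutex::_no_safepoint_check_flag);
    CompiledMethodIterator iter(CompiledMethodIterator::only_alive_and_not_unloading);
    while (iter.next()) {
      CompiledMethod* cm = iter.method();
      Method* m = cm->method();
      if (m != NULL) {
        // cm is live here, so its Method* and metadata can be read safely.
        ResourceMark rm;
        tty->print_cr("%s: %d bytes", m->name_and_sig_as_C_string(), cm->size());
      }
    }
  }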
Comments
Fix request (11u): The change puts an end to discussions on the "accessing dead storage" topic, at least to the best of Erik Österlund's and my knowledge. CodeHeap State Analytics has been included in HotSpot since JDK 11. It would be beneficial to have the latest fixes and improvements available in the LTS release. The link to the RFR discussion: https://mail.openjdk.java.net/pipermail/jdk-updates-dev/2021-April/005735.html
14-04-2021

Changeset: 4440bda3 Author: Lutz Schmidt <lucy@openjdk.org> Date: 2020-09-24 07:48:48 +0000 URL: https://git.openjdk.java.net/jdk/commit/4440bda3
24-09-2020

I would like to get the discussion started again, and I would like to bring it to an end that everybody involved can live with. I will summarize the previous comments in compact form (at least I'll try), and I will explain what I have done to address the concerns.

To start with, CodeHeap State Analytics was created to gain detailed insight into the inner workings of the code heap. At that level, it is just about used blocks and free blocks. The contents of used blocks are not visible to, or relevant for, CodeHeap management. Free blocks cover the CodeHeap space that is not currently allocated to used blocks. Using CodeBlob or nmethod iterators doesn't do the trick: these would only cover parts of the entire CodeHeap and would not reveal details like fragmentation.

When working on issues related to the CodeHeap, it is clearly helpful to know a bit more about the contents of used blocks; there is nothing to be known about the contents of free blocks. For that reason, CodeHeap State Analytics "looks into" used blocks and extracts information deemed helpful. Which information to extract was guided by the analysis requirements of real problems in real customer systems.

When accessing shared data structures, the code must be robust with respect to concurrent modifications. This robustness can be achieved by holding a suitable lock. In our case, holding the CodeCache_lock protects against concurrent modifications of the CodeHeap structures. This lock is not sufficient, however, when we inspect the contents of a used CodeHeap block. To protect against concurrent modifications of the block contents, we need to hold the Compile_lock in addition. As long as we hold these locks, we can be sure we see a consistent state of the CodeHeap blocks and of the CodeHeap block contents. Please correct me if this assumption lacks yet another lock to be held.

Starting from here, I have rearranged CodeHeap State Analytics processing. The CodeHeap structure itself, as well as method data in individual CodeHeap blocks, is accessed only during the "aggregate" step. Both locks (mentioned above) are held continuously during aggregation. All information retrieved is temporarily held in local data structures for later printing. Because aggregation walks all CodeHeap blocks, it may encounter (and peek into) block contents that are in a transitional (but, at that moment, frozen and well-defined) state. I have put a lot of thought into accessing only fields which are valid for the given state. Should my thoughts be incorrect or incomplete, I would like to learn what I missed so I can improve the code.

The changes I suggest are mostly concentrated in two files. src/hotspot/share/compiler/compileBroker.cpp was adapted to handle the CodeCache_lock and Compile_lock simultaneously. Printing a list of names for all blocks is limited to the case where the locks are held continuously; that removes the potential risk of accessing/interpreting current data with outdated knowledge. src/hotspot/share/code/codeHeapState.cpp prints a TOP-100 list of the largest used blocks. All information required for that list, including the blob/nmethod name, is now collected during the aggregate step. During printing (in print_usedSpace), CodeHeap data is not accessed anymore. The other print_* functions relied on collected data from the beginning, so no action was needed there.

The webrev with the suggested modifications can be found here: https://cr.openjdk.java.net/~lucy/webrevs/8219586.01/
16-07-2020
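A minimal sketch of the two-lock scheme described in the comment above: both locks are held for the entire aggregation step, everything needed is copied into local structures, and printing afterwards never touches the CodeHeap again. The lock acquisition order, the MutexLocker flags and the helper names shown here are assumptions; the real adaptation lives in compileBroker.cpp and codeHeapState.cpp.

  // Sketch only: aggregate under Compile_lock + CodeCache_lock, print from copies.
  class NameCollector : public CodeBlobClosure {
   public:
    GrowableArray<const char*> _names;
    void do_code_blob(CodeBlob* cb) {
      _names.append(os::strdup(cb->name()));   // copy now; cb may be freed once the locks are released
    }
  };

  void aggregate_then_print(outputStream* out) {
    ResourceMark rm;
    NameCollector collector;
    {
      MutexLocker cl(Compile_lock);                                    // guards block contents
      MutexLocker cc(CodeCache_lock, Mutex::_no_safepoint_check_flag); // guards CodeHeap structure
      CodeCache::blobs_do(&collector);   // "aggregate" step: all CodeHeap access happens here
    }
    for (int i = 0; i < collector._names.length(); i++) {              // "print" step: copies only
      out->print_cr("%s", collector._names.at(i));
    }
  }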

ad 1) The problem I'm talking about is not buggy mutations in the code heap without locks. What I am talking about is references in the code heap that are completely frozen and not changing, but uninitialized, accidentally (by chance) pointing at memory that is concurrently mutated and possibly unmapped by other threads. So even though all threads respect the code cache lock and whatever other locks you think you need to protect the state of the blobs in the code cache, unless you know that the *uninitialized* data this code blob references (e.g. the pointer chain to the constant pool) is all initialized and not accidentally pointing at memory that is being mutated or unmapped, you can crash. The sequence:
1) run through pointers pointing at arbitrary freed memory and see that we don't crash, then
2) access said pointers again without protection, hoping they yield the same values even though they go through freed memory that we have no idea who else is accessing
...is fundamentally flawed.

ad 3) The iterators only report CodeBlob instances that have valid vtable entries, and they are used under the CodeCache_lock, which makes them safe.

ad 4) As long as you are pointer-chasing through *uninitialized* arbitrary pointer chains (which you have in is_alive() nmethods as well, as long as you refuse to fix the initialization code), you will, as I explained already, come across pointer chains that are mutated and concurrently being unmapped, because you don't know what uninitialized pointers point at. Sprinkling in more code to try and figure out whether the arbitrary memory will be concurrently mutated or unmapped is not a solution. The only real solution I see is to initialize data properly, and to only access data that isn't freed.
23-04-2019

ad 1) Yes, there could theoretically be (unknown) mutators who disregard the CodeCache_lock and Compile_lock. If that is the working assumption, we can stop the discussion; everything will break then, eventually. Let's assume the contents of the code heap are frozen during analysis. Then we have alive blobs, and a few blobs which are in a transition state (dead, in construction, unloaded). Any such state is real and valid at this specific moment. Those blobs in transition will soon go away, being subsumed into free heap blocks. ANY blob can change state the moment after analysis is complete; in other words, we are taking a snapshot here. There is value in knowing how many blobs in transition there are and how much space they occupy. If the numbers are excessive, there could be something wrong with the sweeper threads. None of this is fiction or fantasy: CodeHeap State Analytics has been guided by real issues in real (production) systems of real customers. I agree, the initialization sequence should be such that if you see an instance of an object, all fields of that object are properly initialized. Maybe I didn't state that clearly enough before. But I do not agree that analyzing and fixing the initialization sequence is in the scope of this topic; there should rather be a separate bug.

ad 3) Yes, is_alive() is virtual. If that property makes it unsafe/unreliable, the same is true for iter(CompiledMethodIterator::only_alive_and_not_unloading). If I understood the code correctly, this iterator makes use of is_alive() as well. Somehow, you must find out about the state of a blob.

ad 4) My dictionary lookup for "racy" told me "lively, entertaining, and typically sexually titillating". So I tried to map this explanation to dry code. Together with the wording "stupid sentinel value", I interpreted "racy" as "funny, unprofessional". All that pointer chasing and usability checking is done only when is_alive() returns true. Even then we have to expect pointer chains to be mutated? How can the VM then do more than maybe print the version string?
23-04-2019

ad 1) Why is it an intention of CodeHeap State Analytics to provide information about dead code blobs? What is the use case for that? That memory might be overwritten with new live code heap memory the instant after. Isn't it better to report only the data that is alive, so that users can actually trust that data (no spurious nonsense values)? And I insist that sorting out the order of initialization in ciEnv::register_method() is definitely preferable to trying to parse whether fields have been initialized yet.

ad 2-3) Okay, good. Note though that is_alive() is also a virtual call, requiring the vtable entry in memory to not have been overwritten. Since this is intended to be used on freed memory, we don't have a contract that there will be valid vtable entries around. I really don't think we should be trying to parse the safety of accessing freed memory...

ad 4) Well, one race is that you first probe the pointer-chasing path with safe fetch to see that you can read the memory without crashing, then assume that this same path can be chased through using normal loads, and then perform the loads again without safe fetch to read the actual memory. Since you don't know what memory this pointed to, whether it is being concurrently mutated, or whether it is concurrently unmapped, it is possible that when actually reading the values after the "safety checks", you will crash. It is inherently racy and unsafe.
23-04-2019
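The racy pattern criticized in ad 4) above, reduced to a sketch. The helper name is hypothetical and this is not the actual codeHeapState.cpp code; the point is the gap between the readability probes and the later plain loads over the same, possibly freed, pointer chain.

  // Illustrative sketch (hypothetical helper) of the flawed probe-then-read pattern.
  static void peek_dead_nmethod(nmethod* nm) {
    Method* m = nm->method();                        // nm may be dead; m may point into freed memory
    if (os::is_readable_pointer(m) &&
        os::is_readable_pointer(m->constMethod())) { // probes "succeed" at this instant...
      Symbol* sig = m->signature();                  // ...but this re-walks the same chain with plain
                                                     // loads; if the freed memory is reused or unmapped
                                                     // in between, it can still crash.
    }
  }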

ad 1) Walking only live code blobs doesn't do the trick. It is the intention of CodeHeap State Analytics to provide information about all contents, be they associated with live, free, or some kind of transient blocks. Solving the issue of values that are uninitialized at the time of examination would probably require a rework of the ciEnv::register_method() sequence of events. In the proposed change, the Compile_lock is acquired in addition; that should help.

ad 2) CompiledMethod::nmethod_access_is_safe() was moved and renamed to nmethod::access_is_safe(). The tests have been reordered to respect the pointer chaining. I would like to keep it: it is not only used in CodeHeap State Analytics, but also to resolve JDK-8209588. I am not sure whether grabbing the Compile_lock is feasible / would help there.

ad 3) The is_zombie() call is gone (replaced by is_alive()).

ad 4) I don't see the racy nature of os::is_readable_pointer(), nor do I see the chance of false negatives. SafeFetch is designed to return the data found at the given address, or a substitute value if the memory is not readable. If you call SafeFetch twice with distinct substitute values, you can reliably decide whether the memory is readable.

There is a webrev with the changes I would suggest at http://cr.openjdk.java.net/~lucy/webrevs/8219586.00/ It has not yet been made public via an RFR; I wanted to collect the feedback of the watchers first.
23-04-2019
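The double-probe idea from ad 4) above, sketched with a hypothetical helper (HotSpot's own os::is_readable_pointer uses a similar SafeFetch-based scheme, but the exact implementation differs by version):

  // Sketch: probe the same address twice with two distinct substitute values.
  // Unreadable memory returns the substitute from both probes; readable memory
  // holds one fixed value and therefore cannot match both substitutes.
  static bool looks_readable(const void* p) {        // hypothetical helper
    int* adr = (int*) align_down((intptr_t)p, sizeof(int));
    const int err1 = 1, err2 = 2;                    // two distinct substitute values
    return SafeFetch32(adr, err1) != err1 ||
           SafeFetch32(adr, err2) != err2;
  }

Even with such a probe, the objection raised in the replies above remains: it only shows that the memory was readable at that instant, not that later plain loads will see the same data.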

1) I would prefer to remove it, and use iterators that walk only live code blobs instead. As for the issue of having uninitialized values... my advice is to simply initialize them, rather than trying to parse the memory to figure out whether it looks uninitialized or not. Maintaining a function that parses memory that has been freed and tries to figure out whether it is dead or not is bound to be a maintenance nightmare going forward. In particular, some state is initialized in ciEnv::register_method() outside the CodeCache_lock and some is initialized inside of nmethod::new_nmethod() with the CodeCache_lock held. Either make the heap iteration also grab the Compile_lock (which protects ciEnv::register_method()), or move the state you need to safely access from ciEnv::register_method() inside of the CodeCache_lock, and then use the CompiledMethodIterator iter(CompiledMethodIterator::only_alive_and_not_unloading) iterator. Whenever you find uninitialized data, make it initialized.

2) I'm talking about the fact that nmethod_access_is_safe itself calls method->signature() before checking whether it is safe to do so, which will blow up if either the constMethod or the constant pool (constants()) of the constMethod is not a valid address. First you call method->signature() (which pointer-chases the constMethod() and its constants()), and only after that do you check that the method, its constants() and its signature() are "readable pointers". So the order is obviously wrong here, and you also have to check that the constMethod is a readable pointer before you check whether constants() is a readable pointer, as it is indirected through the constMethod.

3) As for vtable accesses, you do check is_zombie(), which is a vtable call. The only reason it works today is that there is a NULL check for nm->method(), and that is cleared when nmethods are made zombie or are made unloaded (both comprising !is_alive(), but you "check" for only one of them... if that check is ever evaluated, that is). If the null check was not there (in fact I want to remove it as part of cleaning up the nmethod life cycle), then this code starts crashing, because we start actually running these virtual calls on nonsense memory. So I can't do that. Also note that Truffle JVMCI nmethods don't have an nm->_method, so they will always be considered not safe to access with this API, which seems a bit odd.

4) I mean that os::is_readable_pointer is a bit racy and can say the memory isn't readable even though it is, due to using SafeFetch, which has some stupid sentinel value you check for instead of a boolean, so that you by design can never be sure whether it was actually readable or not. Yay. Less of an issue, but annoying.
26-03-2019
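The check ordering called out in 2) above, as a sketch: each link must be verified before it is dereferenced to reach the next one. The helper name is hypothetical, and the surrounding discussion argues that this whole "parse freed memory" approach should ultimately be replaced by iterating only live nmethods.

  // Illustrative only: validate the chain nmethod -> Method -> ConstMethod ->
  // ConstantPool strictly in order, never dereferencing an unchecked link.
  static bool chain_looks_accessible(nmethod* nm) {        // hypothetical helper
    if (!os::is_readable_pointer(nm))              return false;
    Method* m = nm->method();
    if (!os::is_readable_pointer(m))               return false;
    ConstMethod* cm = m->constMethod();            // only read after m was checked
    if (!os::is_readable_pointer(cm))              return false;
    if (!os::is_readable_pointer(m->constants()))  return false;  // constants() reads through cm
    // Only now would it be defensible to look at m->signature() and friends.
    return true;
  }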

I finally found the time to thoroughly investigate the issues brought up. Let me make some general remarks first. Accessing freed memory and interpreting its contents as if it were still bound to a particular data structure is an error. Every effort has been and will be made to avoid exactly that. It was found - by bad experience - that an (nmethod/Method/CodeBlob) object may become visible before its fields are all initialized. This is a problem, and the CodeCache_lock does not help. The associated cleanup task would for sure be non-trivial.

ad 1) Agreed. nmethod_access_is_safe() should be a member function of class nmethod.

ad 2) The method signature is read by two means: a) directly calling method->signature() and b) calling method->name_and_sig_as_C_string(), which in turn calls signature(). In both cases, nmethod_access_is_safe() is used to protect against accesses to stale memory. In addition to the obvious checks in nmethod_access_is_safe(), it is made sure (by preceding program logic) that the CodeCache_lock is held. Plus, nm->is_in_use() or nm->is_not_entrant() must be true to call name_and_sig_as_C_string().

ad 3) Yes, there was a problem with illegal vtable accesses, see JDK-8216314. The issue is fixed.

ad 4) I am not sure what is meant by the term "false negatives". Is it the indication of an unsafe access while the access would actually be safe? I am not sure when that would happen. But if so, fewer details are printed than possible - not a tragedy in my opinion.

Further comments: nmethod_access_is_safe() was moved out of the private scope of CodeHeapState to help solve JDK-8209588. That bug, which is completely unrelated to CodeHeap State Analytics, describes a problem when iterating the CodeCache with the help of an NMethodIterator (method CodeCache::nmethods_do(void f(nmethod* nm))). It turned out the code suffered from the same inconsistencies nmethod_access_is_safe() is designed to detect.

nmethod->is_in_use() returns (_state <= in_use). This may have been correct before the related enum was extended with the value (CompiledMethod::not_installed = -1). As of now, an nmethod is reported "in_use" even though it is not installed yet. To my understanding, this is not correct. Comments welcome.
05-03-2019
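The state-ordering issue raised in the last paragraph above, reduced to a sketch. Only the two states named in the comment are shown; the real enum in CompiledMethod has more values.

  // With not_installed == -1 sitting below in_use, a "<=" test also accepts
  // nmethods that have not been installed yet.
  enum State { not_installed = -1, in_use = 0 /* , ... further states ... */ };

  static bool is_in_use_as_written(int _state) { return _state <= in_use; } // also true for not_installed
  static bool is_in_use_strict(int _state)     { return _state == in_use; } // excludes not_installed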

Similar issues were addressed/fixed by JDK-8216314. ILW = Potential access to stale memory, with diagnostic/non-default feature, no workaround (but can disable diagnostic option) = HLM = P3
25-02-2019

No worries, there is no rush.
22-02-2019

I will have a look into this, but it may take a few days until I get back with comments. Hope that's ok.
22-02-2019