JDK-8249192 : MonitorInfo stores raw oops across safepoints
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 11,15,16
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2020-07-10
  • Updated: 2024-01-29
  • Resolved: 2020-07-23
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 11 JDK 15 JDK 16
11.0.10-oracleFixed 15.0.2Fixed 16 b08Fixed
Related Reports
Blocks :  
Duplicate :  
Duplicate :  
Duplicate :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Description
java/lang/StackWalker/LocalsAndOperands.java crashes when run with -XX:+UseJVMCICompiler with g1 and parallel, with and without class unloading.

Stack trace is typically like this:

V  [libjvm.dylib+0x6ea177]  G1ParScanThreadState::copy_to_survivor_space(G1HeapRegionAttr, oop, markWord)+0x67
V  [libjvm.dylib+0x68585c]  void G1ParScanThreadState::do_oop_evac<unsigned int>(unsigned int*)+0x1dc
V  [libjvm.dylib+0x6e755b]  G1ParScanThreadState::trim_queue_to_threshold(unsigned int)+0x15b
V  [libjvm.dylib+0x6e9dca]  G1ParScanThreadState::trim_queue()+0x1a
V  [libjvm.dylib+0x680bbb]  G1ParEvacuateFollowersClosure::do_void()+0x6b
V  [libjvm.dylib+0x689ab4]  G1EvacuateRegionsBaseTask::evacuate_live_objects(G1ParScanThreadState*, unsigned int, G1GCPhaseTimes::GCParPhases, G1GCPhaseTimes::GCParPhases)+0x94
V  [libjvm.dylib+0x68984f]  G1EvacuateRegionsBaseTask::work(unsigned int)+0x14f
V  [libjvm.dylib+0x1105f80]  GangWorker::run_task(WorkData)+0x60
V  [libjvm.dylib+0x110603c]  GangWorker::loop()+0x2c

Investigation showed that the broken oop has a invalid/broken Klass; it is always part of (first and only element of) an objArrayOop, and the test exercises code to generate and retrieve java thread stack frames

Attached patch adds an oop_iterate() call (in check_obj()) on all oops passed to user data structures when filling in stack frames; that crashes as follows, indicating some race with MonitorInfo::_owner.

V  [libjvm.so+0x15a8ed8]  check_obj(oop)+0x88
V  [libjvm.so+0x15aa5db]  LiveFrameStream::monitors_to_object_array(GrowableArray<MonitorInfo*>*, Thread*)+0xfb
V  [libjvm.so+0x15aa916]  LiveFrameStream::fill_live_stackframe(Handle, methodHandle const&, Thread*)+0xe6
V  [libjvm.so+0x15aad5f]  LiveFrameStream::fill_frame(int, objArrayHandle, methodHandle const&, Thread*)+0x12f
V  [libjvm.so+0x15ab8a1]  StackWalk::fill_in_frames(long, BaseFrameStream&, int, int, objArrayHandle, int&, Thread*)+0x361
V  [libjvm.so+0x15ac575]  StackWalk::fetchFirstBatch(BaseFrameStream&, Handle, long, int, int, int, objArrayHandle, Thread*)+0x2b5
V  [libjvm.so+0x15aca20]  StackWalk::walk(Handle, long, int, int, int, objArrayHandle, Thread*)+0x120
V  [libjvm.so+0xe8d923]  JVM_CallStackWalk+0x1d3

[The attached patch also adds some additional code to verify eden space (where the original crash always occurred) by calling oop_iterate() on all objects in it - it always crashes with above or quite similar stack trace *before* GC operation starts - i.e. the bad oop is created outside of gc. This verification code is needed because VerifyBeforeGC otherwise does not check Klasses.]

[~dlong] mentioned that this crash occurs in jdk 11/15/16; I only ever tried 15 and 16.

The crash also looks very much the recent crashes caused by integration of JDK-8244603 (later backed out in JDK-8248650).

Reproduction with that test and -XX:+UseJVMCICompiler is ~1/100 on jdk16 (also without the patch).

Kindly asking for investigation by the runtime team.
Comments
Fix Request (15u) The same rationale as for 11u. Patch does not apply to 15u cleanly, because of the minor differences. Additionally, there is already the 15.0.2 backport recorded (JDK-8251485 -- 15u maintainers, please reopen?), but that is a process mistake. 15u RFR (dholmes verified it is the same as in 15u CPU repo): https://mail.openjdk.java.net/pipermail/jdk-updates-dev/2020-September/003788.html. Should be backported together with JDK-8251118.
15-09-2020

Fix Request (11u) This handles the corner case that may corrupt heap. Patch does not apply cleanly to 11u, because of the minor differences. 11u RFR [acked by goetz]: https://mail.openjdk.java.net/pipermail/jdk-updates-dev/2020-September/003732.html. Backport passes tier{1,2,3} with and without Shenandoah (which seems to expose some of these). Should be backported together with JDK-8251118.
02-09-2020

URL: https://hg.openjdk.java.net/jdk/jdk/rev/61f9028f360d User: tschatzl Date: 2020-07-23 19:11:16 +0000
23-07-2020

While I like the locality of [~dcubed] change, the other values that are returned during the live frame stack walk are also array types that require Handles: // Fill LiveStackFrameInfo with locals, monitors, and expressions void LiveFrameStream::fill_live_stackframe(Handle stackFrame, const methodHandle& method, TRAPS) { fill_stackframe(stackFrame, method, CHECK); if (_jvf != NULL) { StackValueCollection* locals = _jvf->locals(); StackValueCollection* expressions = _jvf->expressions(); GrowableArray<MonitorInfo*>* monitors = _jvf->monitors(); ... where: class StackValue : public ResourceObj { private: BasicType _type; intptr_t _integer_value; // Blank java stack slot value Handle _handle_value; // Java stack slot value interpreted as a Handle So making MonitorInfo hold a Handle is consistent with the other values that are retrieved during the stack walk. One question is whether making MonitorInfo create Handles is surprising in the other contexts where it is called or if it is called in contexts where Handles are not allowed. The callers in vframeArray and vframe also create locals() and expressions() which create StackValues which have Handles. The only case that is not obvious is deoptimization but deoptimization has: // It is actually ok to allocate handles in a leaf method. It causes no safepoints, // but makes the entry a little slower. There is however a little dance we have to // do in debug mode to get around the NoHandleMark code in the JRT_LEAF macro ResetNoHandleMark rnhm; // No-op in release/product versions And this code creates other Handles. I'm going to change my opinion that Thomas's fix is more consistent with surrounding code and I prefer this solution now.
23-07-2020

Fix request rejected After having discussed the relative merits of making this change in JDK 15 with [~tschatzl] we agree that it's not worth the risk at this point: * JDK 15 is already in RDP2 * The issue has been there since JDK 9 and there's no reason to believe it's been made more likely recently * The fix is non-trivial (though the significant testing and careful reviews of course balance that) With all of that in mind it does not seem urgent to get this fixed for JDK 15.
22-07-2020

Fix request: Reason: This change fixes crashes due to oops not being handled correctly across a gc safepoint during walking the Java stacks when using the StackWalking API introduced in JEP 259: Stack-Walking API. Oops are stored internally in the VM, but not published to the GC so that it can update it. Hence the remaining code will operate on wrong oop locations. It affects all garbage collectors. Likelihood of the crash is in the range 1/100 for the given reproducer (this is using graal compiler, but this is a matter of GC timing, not the graal compiler); in jdk16, we had to back out a change that modified heap sizing for G1 due to many crashes caused by this, but otherwise only very rarely occurs. There is a suspicion that this is the reason for many crashes of commonly used IDEs (intellij and variants) across all garbage collectors as they may use the StackWalking API more commonly than other applications. Risk: This change modifies an internal data structure that is not only used in the stack-walking API but also in deoptimization, biased locking and JVMTI, adding an indirection so that the GC can modify these oops as needed. Given the test coverage and the reviews, the risk should be small enough to take this in at this time. Test coverage: tier1-8 runs on jdk15 and jdk16, thousands of successful runs of the reproduction test case without apparent issues. See links to CI result in the comments. Reviewers: [~coleenp], [~dholmes], [~sspitsyn] Webrev: http://cr.openjdk.java.net/~tschatzl/8249192/webrev.jdk15.2/ Review thread: http://mail.openjdk.java.net/pipermail/hotspot-dev/2020-July/042441.html
22-07-2020

[~dcubed] yes, that's the fix for that problem essentially. Uploaded a new version of the verification changes as "check-oops-3".
22-07-2020

[~tschatzl] - Does this: > The two failures in G1ParScanThreadState::verify_task() are in my > reproductions a too zealous verification: the verification not only > verifies the current object, but also the referenced objects. mean that you have a newer version of your check-oops-fixed patch? Update: I took a stab at modifying the check-oops-fixed patch: $ hg diff src/hotspot/share/gc/g1/g1ParScanThreadState.cpp diff -r 2a4f736c3571 src/hotspot/share/gc/g1/g1ParScanThreadState.cpp --- a/src/hotspot/share/gc/g1/g1ParScanThreadState.cpp Tue Jul 21 20:15:02 2020 -0400 +++ b/src/hotspot/share/gc/g1/g1ParScanThreadState.cpp Tue Jul 21 20:34:09 2020 -0400 @@ -152,8 +152,10 @@ "task=" PTR_FORMAT " p=" PTR_FORMAT, p2i(task), p2i(p)); assert(oopDesc::is_oop(p), "must be"); +#if 0 IsObjClosure cl; p->oop_iterate(&cl); +#endif } void G1ParScanThreadState::verify_task(oop* task) const { @@ -163,8 +165,10 @@ "task=" PTR_FORMAT " p=" PTR_FORMAT, p2i(task), p2i(p)); assert(oopDesc::is_oop(p), "must be"); +#if 0 IsObjClosure cl; p->oop_iterate(&cl); +#endif } void G1ParScanThreadState::verify_task(PartialArrayScanTask task) const { And I'm going to start a stress run that version of check-oops-fixed in place along with my simpler version of the fix.
22-07-2020

I made JDK-8248013 a duplicate and copied over tags, labels, fix version, affects version.
20-07-2020

Based on new data, and following the discussion within RT Triage team, raising priority to P2 ILW = HLH = P2
20-07-2020

Out for review: http://cr.openjdk.java.net/~tschatzl/8249192/webrev/
20-07-2020

The two failures in G1ParScanThreadState::verify_task() are in my reproductions a too zealous verification: the verification not only verifies the current object, but also the referenced objects. It can happen that the referenced objects are not yet copied, and so fail. The fix I have passes hundreds of runs with java/lang/StackWalker/LocalsAndOperands.java, and a full tier8 run with JDK-8249676 which previously crashed a lot.
20-07-2020

I'm done doing repro runs for the moment. Here's the final data: $ grep FAIL doit_LocalsAndOperands_fastdebug.save.01/do_test_loop_forever.doit_LocalsAndOperands_fastdebug.log Run #9...FAIL (status=262) Run #19...FAIL (status=262) Run #73...FAIL (status=262) Run #418...FAIL (status=262) Run #555...FAIL (status=262) Run #680...FAIL (status=262) Run #750...FAIL (status=262) Run #909...FAIL (status=262) Run #969...FAIL (status=262) Run #1015...FAIL (status=262) Run #1020...FAIL (status=262) Run #1216...FAIL (status=262) $ tail -2 doit_LocalsAndOperands_fastdebug.save.01/do_test_loop_forever.doit_LocalsAndOperands_fastdebug.log Run #1252...PASS 1253 runs done; pass_count=1241; fail_count=12 So 12 failures in about 1200 runs or 1/100 which is what Thomas observed. 2 of the failures are crashes in G1ParScanThreadState::verify_task() like the one mentioned above (#73 and #1020). 10 of the failures are in LiveFrameStream which is what Thomas and I were expecting.
16-07-2020

Issue introduced in JDK-8140450: Implement Stack-Walking API in jdk9. With a preliminary patch for this particular issue (which certainly is not perfect), there are still crashes similar to the ones [~dcubed] reported as above in run #73. Seems to appear only with G1 and class unloading. The other crashes with other collectors tested (Parallel) are gone.
15-07-2020

That's exactly where Thomas is at in his analysis of this issue... :-)
15-07-2020

I was looking around for raw oops, and this caught my eye: 269 objArrayHandle LiveFrameStream::monitors_to_object_array(GrowableArray<MonitorInfo*>* monitors, TRAPS) { 270 int length = monitors->length(); 271 objArrayOop array_oop = oopFactory::new_objArray(SystemDictionary::Object_klass(), 272 length, CHECK_(objArrayHandle())); 273 objArrayHandle array_h(THREAD, array_oop); 274 for (int i = 0; i < length; i++) { 275 MonitorInfo* monitor = monitors->at(i); 276 array_h->obj_at_put(i, monitor->owner()); 277 } 278 return array_h; 279 } It looks like MonitorInfo contains raw oops, so what happens if new_objArray() causes a safepoint? Putting a NoSafepointVerifier around uses of MonitorInfo might catch more problems.
15-07-2020

With Thomas' debug patch in place, I typically see failures in the LiveFrameStream which is expected. I do have one failure elsewhere: # Internal Error (/work/shared/bug_hunt/monitors/8249192_for_jdk15/open/src/hotspot/share/gc/g1/g1ParScanThreadState.cpp:153), pid=8009, tid=8035 # assert(oopDesc::is_oop(p)) failed: must be # # JRE version: Java(TM) SE Runtime Environment (15.0) (fastdebug build 15-internal+0-2020-07-10-1727186.dcubed...) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 15-internal+0-2020-07-10-1727186.dcubed..., compiled mode, sharing, tiered, jvmci, jvmci compiler, compressed oops, g1 gc, linux-amd64) Here's the crashing thread's stack: --------------- T H R E A D --------------- Current thread (0x00007f0588005120): GCTaskThread "GC Thread#5" [stack: 0x00007f05904f1000,0x00007f05905f1000] [id=8035] Stack: [0x00007f05904f1000,0x00007f05905f1000], sp=0x00007f05905efa70, free space=1018k Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0xb6a84c] G1ParScanThreadState::verify_task(unsigned int*) const+0x1cc V [libjvm.so+0xb6c133] G1ParScanThreadState::verify_task(ScannerTask) const+0xa3 V [libjvm.so+0xb6d43b] G1ParScanThreadState::trim_queue()+0x36b V [libjvm.so+0xae6baa] G1ParEvacuateFollowersClosure::do_void()+0x32a V [libjvm.so+0xaf5e57] G1EvacuateRegionsBaseTask::evacuate_live_objects(G1ParScanThreadState*, unsigned int, G1GCPhaseTimes::GCParPhases, G1GCPhaseTimes::GCParPhases)+0x97 V [libjvm.so+0xaea84a] G1EvacuateRegionsBaseTask::work(unsigned int)+0x9a V [libjvm.so+0x18187a4] GangWorker::run_task(WorkData)+0x84 V [libjvm.so+0x18188ce] GangWorker::loop()+0x2e V [libjvm.so+0x16da570] Thread::call_run()+0x100 V [libjvm.so+0x13db1f6] thread_native_entry(Thread*)+0x116 So far I have four failures: $ grep FAIL do_test_loop_forever.doit_LocalsAndOperands_fastdebug.log Run #9...FAIL (status=262) Run #19...FAIL (status=262) Run #73...FAIL (status=262) Run #418...FAIL (status=262) The one in Run #73 is the one that is different from the others...
14-07-2020

ILW = HLM = P3
14-07-2020

The patch does apply on 15 here, but I found that I forgot to add a late #include change (#include iterator.hpp should be #include iterator.inline.hpp) to compile it in the CI. Uploaded as check-oops-fixed.
13-07-2020

Thanks for looking [~tschatzl]! My current run is #929 and so far I've seen just the one failure. On Monday, I'll stop the current run and apply your debug patch and start a new run. My repo is baselined on the jdk-15+31 snapshot so I'm hoping your patch will apply without much noise.
12-07-2020

I think it is the same failure. The stack trace for the failure is the same as the others until and including G1ParScanThreadState::copy_to_survivor_space(). The rest (the part about allocating a new PLAB, and that allocation being too large) is new, but can be explained by a broken klass for that object. I.e. there can be no objects in eden or survivor space that are humongous-sized (and we haven't seen an issue with that forever), which that stack trace implies and asserts. In all the crashes I looked at the crashes where when trying to iterate that object after copying to copy over the objects reachable by it because that object's klass was apparently much more broken.
11-07-2020

First failure popped up on my Linux-X64 machine in run #334. # Internal Error (/work/shared/bug_hunt/monitors/8249192_for_jdk15/open/src/hotspot/share/gc/g1/g1Allocator.cpp:242), pid=18653, tid=31922 # assert(!_g1h->is_humongous(desired_word_size)) failed: we should not be seeing humongous-size allocations in this path # # JRE version: Java(TM) SE Runtime Environment (15.0) (fastdebug build 15-internal+0-2020-07-10-1727186.dcubed...) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 15-internal+0-2020-07-10-1727186.dcubed..., compiled mode, sharing, tiered, jvmci, jvmci compiler, compressed oops, g1 gc, linux-amd64) Here's the crashing thread's stack: --------------- T H R E A D --------------- Current thread (0x00007f411000a330): GCTaskThread "GC Thread#9" [stack: 0x00007f4118fcd000,0x00007f41190cd000] [id=31922] Stack: [0x00007f4118fcd000,0x00007f41190cd000], sp=0x00007f41190cb8b0, free space=1018k Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0xaa79b4] G1Allocator::survivor_attempt_allocation(unsigned long, unsigned long, unsigned long*, unsigned int)+0xc4 V [libjvm.so+0xaa7fd2] G1Allocator::par_allocate_during_gc(G1HeapRegionAttr, unsigned long, unsigned int)+0x72 V [libjvm.so+0xaa8561] G1PLABAllocator::allocate_direct_or_new_plab(G1HeapRegionAttr, unsigned long, bool*, unsigned int)+0xf1 V [libjvm.so+0xb67df8] G1ParScanThreadState::copy_to_survivor_space(G1HeapRegionAttr, oop, markWord)+0x7c8 V [libjvm.so+0xaec733] void G1ParScanThreadState::do_oop_evac<unsigned int>(unsigned int*)+0xc3 V [libjvm.so+0xb68835] G1ParScanThreadState::trim_queue()+0x465 V [libjvm.so+0xae59c7] G1ParEvacuateFollowersClosure::do_void()+0x67 V [libjvm.so+0xaf2de7] G1EvacuateRegionsBaseTask::evacuate_live_objects(G1ParScanThreadState*, unsigned int, G1GCPhaseTimes::GCParPhases, G1GCPhaseTimes::GCParPhases)+0x97 V [libjvm.so+0xae920a] G1EvacuateRegionsBaseTask::work(unsigned int)+0x9a V [libjvm.so+0x180d554] GangWorker::run_task(WorkData)+0x84 V [libjvm.so+0x180d67e] GangWorker::loop()+0x2e V [libjvm.so+0x16cf320] Thread::call_run()+0x100 V [libjvm.so+0x13d33a6] thread_native_entry(Thread*)+0x116 [~tschatzl] - this failure mode doesn't sound like the bug you and Dean are hunting. Please let me know if it is...
11-07-2020

Please don't close this bug yet.
11-07-2020

Original crash report for jdk11 is JDK-8248013 which I did not know existed. Maybe close this as duplicate...
11-07-2020

I've modified a local copy of the test run the problematic test configuration and not the other two configs that the test has: $ hg diff diff -r a32f58c6b8be test/jdk/java/lang/StackWalker/LocalsAndOperands.java --- a/test/jdk/java/lang/StackWalker/LocalsAndOperands.java Wed Jul 08 11:28:06 2020 -0700 +++ b/test/jdk/java/lang/StackWalker/LocalsAndOperands.java Fri Jul 10 16:37:00 2020 -0400 @@ -26,16 +26,7 @@ * @bug 8020968 8147039 8156073 * @summary Tests for locals and operands * @modules java.base/java.lang:open - * @run testng/othervm -Xint -DtestUnused=true LocalsAndOperands - * @run testng/othervm -Xcomp LocalsAndOperands - */ - -/* - * @test - * @bug 8020968 8147039 8156073 - * @modules java.base/java.lang:open - * @requires !vm.graal.enabled - * @run testng/othervm -Xcomp -XX:-TieredCompilation LocalsAndOperands + * @run testng/othervm -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler -Xlog:gc=debug -Xcomp -showversion LocalsAndOperands */ import org.testng.annotations.*; I've added -showversion also and I was expecting to see an output line like what one of Thomas' crashes shows: # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 16-internal+0-2020-07-10-1150396.thomas.schatzl.jdk, compiled mode, sharing, tiered, jvmci, jvmci compiler, compressed oops, g1 gc, linux-amd64) instead I'm seeing this: Java HotSpot(TM) 64-Bit Server VM (fastdebug build 15-internal+0-2020-07-10-1727186.dcubed..., compiled mode, sharing) Granted that '-showversion' is processed very early so maybe not all the options/configs are setup yet. I am seeing output from -Xlog:gc=debug in my test runs so that part works. Update: I added '-XX:+PrintVMOptions' and verified that UseJVMCICompiler == true. My repo is baselined on the jdk-15+31 snapshot and I'm running the test in a loop with fastdebug bits to see if I can repro the failure on my Linux-X64 server. I have jdk-15+31 stress testing running on that machine right now (typically runs from Thu -> Sun in my lab) so the machine is pretty heavily loaded.
10-07-2020

The patch assumes that any oop passed around or filled in somewhere should be valid (i.e. iterable) at the time it is handed over, particularly something like MonitorInfo::_owner that seems to be initialized elsewhere; if that is not the case, that part of the patch is obviously not applicable.
10-07-2020

Same stack trace/crash also in several other issues like JDK-8241804 and JDK-8239148.
10-07-2020