JDK-8255598 : [PPC64] assert(Universe::heap()->is_in(result)) failed: object not in heap
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 16
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • CPU: ppc
  • Submitted: 2020-10-29
  • Updated: 2020-11-16
  • Resolved: 2020-11-10
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 16
16 b24Fixed
Related Reports
Relates :  
Relates :  
Description
Seen on ppcle with serialgc during gc/metaspace/TestMetaspacePerfCounters, but I am not sure this has anything to do with Metaspace. Possibly related to JDK-8237363. Happened today the first time.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/sapmnt/sapjvm_work/openjdk/nb/linuxppc64le/jdk/src/hotspot/share/oops/compressedOops.inline.hpp:57), pid=19611, tid=19645
#  assert(Universe::heap()->is_in(result)) failed: object not in heap 0x00000000e0010648
#
# JRE version: OpenJDK Runtime Environment (16.0.0.1) (fastdebug build 16.0.0.1-internal+0-adhoc.openjdk.jdk)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 16.0.0.1-internal+0-adhoc.openjdk.jdk, mixed mode, sharing, tiered, compressed oops, serial gc, linux-ppc64le)
# Problematic frame:
# V  [libjvm.so+0x14b9d50]  NativeMovConstReg::data() const+0x3b0
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %e" (or dumping to /priv/jvmtests/output_openjdk16_stage_dbgU_linuxppc64le/jtreg_hotspot_tier1_work/JTwork/gc/metaspace/TestMetaspacePerfCounters_id0/core.19611)
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#


Current thread (0x00007fffa81c85e0):  VMThread "VM Thread" [stack: 0x00007fffaea00000,0x00007fffaec00000] [id=19645]

Stack: [0x00007fffaea00000,0x00007fffaec00000],  sp=0x00007fffaebfd100,  free space=2036k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x14b9d50]  NativeMovConstReg::data() const+0x3b0
V  [libjvm.so+0x170b730]  Relocation::pd_set_data_value(unsigned char*, long, bool)+0x1f0
V  [libjvm.so+0x1706410]  oop_Relocation::verify_oop_relocation()+0x1f0
V  [libjvm.so+0x93a3a8]  CompiledMethod::verify_oop_relocations()+0x228
V  [libjvm.so+0x14d43d8]  nmethod::oops_do_marking_epilogue()+0xf8
V  [libjvm.so+0x1810a18]  StrongRootsScope::~StrongRootsScope()+0x48
V  [libjvm.so+0xcf1d68]  GenMarkSweep::mark_sweep_phase3()+0x238
V  [libjvm.so+0xcf2498]  GenMarkSweep::invoke_at_safepoint(ReferenceProcessor*, bool)+0x218
V  [libjvm.so+0x18f0858]  TenuredGeneration::collect(bool, bool, unsigned long, bool)+0x128
V  [libjvm.so+0xceba14]  GenCollectedHeap::collect_generation(Generation*, bool, unsigned long, bool, bool, bool, bool)+0x3a4
V  [libjvm.so+0xcedc50]  GenCollectedHeap::do_collection(bool, bool, unsigned long, bool, GenCollectedHeap::GenerationType)+0x8b0
V  [libjvm.so+0xcef5a4]  GenCollectedHeap::do_full_collection(bool, GenCollectedHeap::GenerationType)+0x44
V  [libjvm.so+0xcd228c]  VM_GenCollectFull::doit()+0xec
V  [libjvm.so+0x1a44b44]  VM_Operation::evaluate()+0x254
V  [libjvm.so+0x1a6fc6c]  VMThread::evaluate_operation(VM_Operation*)+0x18c
V  [libjvm.so+0x1a70910]  VMThread::inner_execute(VM_Operation*)+0x240
V  [libjvm.so+0x1a70d00]  VMThread::loop()+0x120
V  [libjvm.so+0x1a70eb8]  VMThread::run()+0x118
V  [libjvm.so+0x190823c]  Thread::call_run()+0x14c
V  [libjvm.so+0x1585bac]  thread_native_entry(Thread*)+0x17c
C  [libpthread.so.0+0x85e4]  start_thread+0x114





Comments
Changeset: 9d07259f Author: Martin Doerr <mdoerr@openjdk.org> Date: 2020-11-10 11:48:23 +0000 URL: https://github.com/openjdk/jdk/commit/9d07259f
10-11-2020

The fix below should also do the job for this issue. But that means encoding/decoding continues to have non-obvious limitations. Other people may run into such issues, too, sooner or later. diff --git a/src/hotspot/cpu/ppc/nativeInst_ppc.cpp b/src/hotspot/cpu/ppc/nativeInst_ppc.cpp index fbe956322a6..dd3a3756491 100644 --- a/src/hotspot/cpu/ppc/nativeInst_ppc.cpp +++ b/src/hotspot/cpu/ppc/nativeInst_ppc.cpp @@ -197,7 +197,8 @@ intptr_t NativeMovConstReg::data() const { CodeBlob* cb = CodeCache::find_blob_unsafe(addr); if (MacroAssembler::is_set_narrow_oop(addr, cb->content_begin())) { narrowOop no = MacroAssembler::get_narrow_oop(addr, cb->content_begin()); - return cast_from_oop<intptr_t>(CompressedOops::decode(no)); + if (no == narrowOop::null) return 0; + return cast_from_oop<intptr_t>(CompressedOops::decode_raw(no)); } else { assert(MacroAssembler::is_load_const_from_method_toc_at(addr), "must be load_const_from_pool");
02-11-2020

I dislike that suggestion because it significantly weakens the check and I prefer to have that check as strong as possible. It is very helpful when implementing new functionality. So while it is true that decoding an oop does not mean you will necessarily access that memory location, it is highly likely as the only issue is currently in this code. Hand-waving the responsibility to the caller is just a bad idea imo, as it will be a maintenance nightmare - these are ~80 locations afaics when searching the shared code for CompressedOops::en/decode only, and think of new locations. I would prefer to special-case this use than all the others. Maybe the suggested de/encode_raw() methods work for this case?
02-11-2020

Thanks, Thomas! My preferred solution is to disentangle CompressedOops which contains the encode/decode functionality and the heap management. I think it should be the caller's responsibility to check if there's a real object at the location if such an assertion is needed. And CompressedOops functions should better only rely on CompressedOops::is_in. Would something like the following patch be ok for you? diff --git a/src/hotspot/share/oops/compressedOops.hpp b/src/hotspot/share/oops/compressedOops.hpp index c12f6bd197b..7aa336e23e0 100644 --- a/src/hotspot/share/oops/compressedOops.hpp +++ b/src/hotspot/share/oops/compressedOops.hpp @@ -98,6 +98,7 @@ public: static address ptrs_base() { return _narrow_oop._base; } static bool is_in(void* addr); + static bool is_in_or_null(void* addr) { return addr == NULL || is_in(addr); } static bool is_in(MemRegion mr); static Mode mode(); diff --git a/src/hotspot/share/oops/compressedOops.inline.hpp b/src/hotspot/share/oops/compressedOops.inline.hpp index 920e87703b4..a2a497663b4 100644 --- a/src/hotspot/share/oops/compressedOops.inline.hpp +++ b/src/hotspot/share/oops/compressedOops.inline.hpp @@ -54,7 +54,7 @@ inline oop CompressedOops::decode_not_null(narrowOop v) { assert(!is_null(v), "narrow oop value can never be zero"); oop result = decode_raw(v); assert(is_object_aligned(result), "address not aligned: " INTPTR_FORMAT, p2i((void*) result)); - assert(Universe::heap()->is_in(result), "object not in heap " PTR_FORMAT, p2i((void*) result)); + assert(is_in(result), "object not in heap " PTR_FORMAT, p2i((void*) result)); return result; } @@ -78,12 +78,12 @@ inline narrowOop CompressedOops::encode(oop v) { } inline oop CompressedOops::decode_not_null(oop v) { - assert(Universe::heap()->is_in(v), "object not in heap " PTR_FORMAT, p2i((void*) v)); + assert(is_in(v), "object not in heap " PTR_FORMAT, p2i((void*) v)); return v; } inline oop CompressedOops::decode(oop v) { - assert(Universe::heap()->is_in_or_null(v), "object not in heap " PTR_FORMAT, p2i((void*) v)); + assert(is_in_or_null(v), "object not in heap " PTR_FORMAT, p2i((void*) v)); return v; }
02-11-2020

I think you and Stefan were right: top() of old gen is updated only very very late in the method so the CollectedHeap::is_in() can't be used. Here's output of the same message at the very end of GenMarkSweep::invoke_at_safepoint(): at end def new generation total 154240K, used 0K [0x000000060b400000, 0x0000000615b50000, 0x00000006b22a0000) eden space 137152K, 0% used [0x000000060b400000, 0x000000060b400000, 0x00000006139f0000) from space 17088K, 0% used [0x00000006139f0000, 0x00000006139f0000, 0x0000000614aa0000) to space 17088K, 0% used [0x0000000614aa0000, 0x0000000614aa0000, 0x0000000615b50000) tenured generation total 342720K, used 1395K [0x00000006b22a0000, 0x00000006c7150000, 0x0000000800000000) the space 342720K, 0% used [0x00000006b22a0000, 0x00000006b23fce10, 0x00000006b23fd000, 0x00000006c7150000) Notice that at that time top != bottom.
30-10-2020

Thank you, Thomas! The reason why it only shows up on PPC is that only the PPC code uses CompressedOops::decode in Relocation::pd_set_data_value (indirectly). Maybe the issue could get observed on other platforms, too, when inserting a usage of "is_in". I had thought the pointer pointed to the new object and the old gen end pointer was updated later, but I have to take a closer look when I find more time. We didn't see any crashes in the product build, so there's no hint for any real corruption.
30-10-2020

Thank you, Stefan! I have comments about your points: 1) Yeah, CompressedOops::is_in sounds like a good candidate. I'd appreciate to use that! 2) We already had to reduce our nightly tests with fastdebug builds. I'd prefer running more tests with less slow checks which burn too much CPU time. 3) I understand the point that we may usually want to check if we're pointing to a really valid object. But would a typical developper using CompressedOops::decode expect hitting the assertion sporadically when using it in code which is executed during GC?
30-10-2020

I looked at the issue yesterday evening too but Stefan's comment seems insufficient to understand why there is a pointer outside of the allocated heap at that point (from the hs_err file) or at least why using is_in() would be wrong here. I do think this value is correct since this is the first GC (which is a full gc) as nothing has been allocated into old gen yet. Particularly for serial gc adjusted pointers must always be < top() as it's a textbook sliding compactor with a single thread. Even if there were a young gc before the old gc which moves stuff into old gen, top() of old gen should be correct (!= bottom()) after that young gc (but the behavior looks normal, see below) Here's a printout of spaces on x86 linux (attached the patch): before phase 1 def new generation total 154240K, used 21945K [0x00007fa037a00000, 0x00007fa042150000, 0x00007fa0de8a0000) eden space 137152K, 16% used [0x00007fa037a00000, 0x00007fa038f6e460, 0x00007fa03fff0000) from space 17088K, 0% used [0x00007fa03fff0000, 0x00007fa03fff0000, 0x00007fa0410a0000) to space 17088K, 0% used [0x00007fa0410a0000, 0x00007fa0410a0000, 0x00007fa042150000) tenured generation total 342720K, used 0K [0x00007fa0de8a0000, 0x00007fa0f3750000, 0x00007fa22c600000) the space 342720K, 0% used [0x00007fa0de8a0000, 0x00007fa0de8a0000, 0x00007fa0de8a0200, 0x00007fa0f3750000) Metaspace used 8746K, committed 8832K, reserved 16384K before phase 2 def new generation total 154240K, used 21945K [0x00007fa037a00000, 0x00007fa042150000, 0x00007fa0de8a0000) eden space 137152K, 16% used [0x00007fa037a00000, 0x00007fa038f6e460, 0x00007fa03fff0000) from space 17088K, 0% used [0x00007fa03fff0000, 0x00007fa03fff0000, 0x00007fa0410a0000) to space 17088K, 0% used [0x00007fa0410a0000, 0x00007fa0410a0000, 0x00007fa042150000) tenured generation total 342720K, used 0K [0x00007fa0de8a0000, 0x00007fa0f3750000, 0x00007fa22c600000) the space 342720K, 0% used [0x00007fa0de8a0000, 0x00007fa0de8a0000, 0x00007fa0de8a0200, 0x00007fa0f3750000) Metaspace used 8746K, committed 8832K, reserved 16384K before phase 3 def new generation total 154240K, used 21945K [0x00007fa037a00000, 0x00007fa042150000, 0x00007fa0de8a0000) eden space 137152K, 16% used [0x00007fa037a00000, 0x00007fa038f6e460, 0x00007fa03fff0000) from space 17088K, 0% used [0x00007fa03fff0000, 0x00007fa03fff0000, 0x00007fa0410a0000) to space 17088K, 0% used [0x00007fa0410a0000, 0x00007fa0410a0000, 0x00007fa042150000) tenured generation total 342720K, used 0K [0x00007fa0de8a0000, 0x00007fa0f3750000, 0x00007fa22c600000) the space 342720K, 0% used [0x00007fa0de8a0000, 0x00007fa0de8a0000, 0x00007fa0dea88e00, 0x00007fa0f3750000) Metaspace used 8746K, committed 8832K, reserved 16384K before phase 4 def new generation total 154240K, used 21945K [0x00007fa037a00000, 0x00007fa042150000, 0x00007fa0de8a0000) eden space 137152K, 16% used [0x00007fa037a00000, 0x00007fa038f6e460, 0x00007fa03fff0000) from space 17088K, 0% used [0x00007fa03fff0000, 0x00007fa03fff0000, 0x00007fa0410a0000) to space 17088K, 0% used [0x00007fa0410a0000, 0x00007fa0410a0000, 0x00007fa042150000) tenured generation total 342720K, used 0K [0x00007fa0de8a0000, 0x00007fa0f3750000, 0x00007fa22c600000) the space 342720K, 0% used [0x00007fa0de8a0000, 0x00007fa0de8a0000, 0x00007fa0dea88e00, 0x00007fa0f3750000) So top == bottom is normal in that test. Assuming that is_in() is too strict, it does not explain why it only fails on ppc and where that bad pointer value comes from - that relocation entry must have pointed to some object when it had been generated, but there shouldn't have been any object there ever. As for the performance of is_in(), all implementations are fast enough for debug code (and is_in_reserved() in Serial GC is not much faster anyway).
30-10-2020

1) There's no CollectedHeap::is_in_reserved in later JDKs, so that won't work. However, you could use the new CompressedOops::is_in function. 2) IMHO, I don't think that we should be that afraid of this warning. It's meant to warn you that this shouldn't be used in performance sensitive *product* code. Like the hot loop of a GC. To start to worry about this check is a bit premature, when we do so much other expensive operations in fastdebug. (Famous last words) 3) I agree that having is_in in CompressedOops has been shown to not be a 100% perfect match. On the other hand, if you call CompressedOops::decode and get something that really isn't pointing to a valid object, then I think it's good that we encode that into the code base somehow (CompressedOops::decode_raw/decode_unsafe?). I don't think that we should consider it "biting us", but rather showing us where our assumptions might be wrong. Though, if I'm the only one with this view, then I'm also fine with getting rid of this check.
30-10-2020

Thanks a lot, folks, for looking at this. Stefan has already figured it out. It's not the type punning (which I don't really like btw.). There are periods of time (during GC) at which we can't use "heap()->is_in" because the pointer gets switched between old and new location, but "is_in" is not yet accurate. Could be, that the only place where this currently hits us is in PPC code. So we could fix it in PPC code by using CompressedOops::decode_raw, but this doesn't sound like my preferred solution. This could bite us elsewhere again in the future. So I think it would be better to assert something like is_in_reserved instead. The comment before "virtual bool is_in(const void* p) const = 0;" suggests not to use it in performance critical code. Slowing down the fastdebug build is not a good thing IMHO. This also speaks for avoiding "is_in" usages in compressedOops.inline.hpp. What do you think?
30-10-2020

The ppc-specific type punning between compressed oops and compressed klass pointers (JDK-8253860) seems like a good candidate for causing this. With is_in checking having been added to CompressedOops, that punning seems even more questionable.
30-10-2020

I see that decode and decode_raw has different semantics w.r.t. 0, so be careful if you change this code to use decode_raw: inline oop CompressedOops::decode_raw(narrowOop v) { return (oop)(void*)((uintptr_t)base() + ((uintptr_t)v << shift())); } inline oop CompressedOops::decode(narrowOop v) { return is_null(v) ? (oop)NULL : decode_not_null(v); }
29-10-2020

FWIW, I grep:ed for CompressedOops::decode, and the instance in nativeInst_ppc.cpp is the only platform specific code that calls it.
29-10-2020

Some more info from the hs_err file: Heap address: 0x00000000d0000000, size: 768 MB, Compressed Oops mode: 32-bit CDS archive(s) mapped at: [0x0000000800000000-0x0000000800c10000-0x0000000800c10000), size 12648448, SharedBaseAddress: 0x0000000800000000, ArchiveRelocationMode: 0. Compressed class space mapped at: 0x0000000801000000-0x0000000841000000, reserved size: 1073741824 Narrow klass base: 0x0000000800000000, Narrow klass shift: 3, Narrow klass range: 0x100000000 So, the compressed oops used are without shift and without base, while compressed klasses uses both shift and base. This makes it unlikely that the above is caused by a mix-up between oops and klasses. --- The broken pointer is 0x00000000e0010648, which is above the old gen top (0x00000000e0000000) tenured generation total 524288K, used 0K [0x00000000e0000000, 0x0000000100000000, 0x0000000100000000) the space 524288K, 0% used [0x00000000e0000000, 0x00000000e0000000 (top), 0x00000000e0156e00, 0x0000000100000000) However, we're crashing in mark sweep phase 3, which adjust all pointers to point to their new location, the top pointer is reset in mark sweep phase 4. Maybe the oop relocation verification code is the only thing that decodes narrow oops after the adjust phase. I wonder why this is only crashing on ppc.
29-10-2020

For most GCs, its quite suspicious to have heap pointers pointing into the reserved space and not within valid heap areas. I wouldn't change this to decode_raw, without further investigation. Have you ruled out that the code isn't reading a klass/metadata relocation? There's code in this vincinity that performs type punning of compressed klasses into narrowOops: narrowOop no = (type() == relocInfo::oop_type) ? CompressedOops::encode((oop)x) : // Type punning compressed klass pointer as narrowOop. CompressedOops::narrow_oop_cast(CompressedKlassPointers::encode((Klass*)x)); And other code that says that it deals with both compressed oops and klasses, which seem suspicious: intptr_t NativeMovConstReg::data() const { ... if (MacroAssembler::is_set_narrow_oop(addr, cb->content_begin())) { narrowOop no = MacroAssembler::get_narrow_oop(addr, cb->content_begin()); return cast_from_oop<intptr_t>(CompressedOops::decode(no)); --- // Get compressed oop or klass constant. narrowOop MacroAssembler::get_narrow_oop(address a, address bound) { --- Can you reproduce this even if compressed klass pointers are turned off with -XX:-UseCompressedClassPointers ?
29-10-2020

Maybe we should use CompressedOops::decode_raw in NativeMovConstReg::data() on PPC64.
29-10-2020

Right, the assertion was added with JDK-8237363. Doesn't look like a Metaspace problem. Seems like it's not safe to assert this in some use cases. We may point to a valid address within the reserved Java heap space, but "heap()->is_in" can return false. In this case: bool GenCollectedHeap::is_in(const void* p) const { return _young_gen->is_in(p) || _old_gen->is_in(p); }
29-10-2020