JDK-8200175 : Assertion from virtualMemoryTracker.hpp
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 11
  • Priority: P3
  • Status: Closed
  • Resolution: Cannot Reproduce
  • Submitted: 2018-03-23
  • Updated: 2019-06-20
  • Resolved: 2018-06-25
JDK 11: Resolved
Description
Seen on solaris-sparcv9 in test Tonga/metaspace/stressHierarchy/stressHierarchy010/TestDescription.java

hierarchyType = INTERFACES
using TriggerUnloadingByFillingMetaspace
attemptsLimit = 50
unloadingPause = 1000
pausesLimit = 5
treeDepth=70, minLevelSize=10, maxLevelSize=100, hierarchyType=INTERFACES, triggerUnloadingHelper.getClass().getName()=vm.share.gc.TriggerUnloadingByFillingMetaspace
Create tree
Load classes and instantiate objects
Generating took 20 sec
cleanupLevel=69
Time expired. TimeoutWatchdog is calling TimeoutHandler.handleTimeout.
Shutting down vm because of time expired.
# To suppress the following error report, specify this argument
# after -XX: or in .hotspotrc:  SuppressErrorAt=/virtualMemoryTracker.hpp:62
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/scratch/opt/mach5/mesos/work_dir/dbd38275-44a2-4351-b12e-2562e50b439c/workspace/open/src/hotspot/share/services/virtualMemoryTracker.hpp:62), pid=21562, tid=91
#  assert(_committed >= sz) failed: Negative amount
#
# JRE version: Java(TM) SE Runtime Environment (11.0) (fastdebug build 11-internal+0-2018-03-22-1910029.coleen.phillimore.11clean)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 11-internal+0-2018-03-22-1910029.coleen.phillimore.11clean, mixed mode, tiered, compressed oops, g1 gc, solaris-sparc)
# Core dump will be written. Default location: /scratch/opt/mach5/mesos/work_dir/8e977dd0-e713-492d-8283-0f9589f94349/testoutput/jtreg/JTwork/scratch/7/core or core.21562
#
Unsupported internal testing APIs have been used.

I'm having trouble attaching the hs_err_pid file; the stack trace looks like this:

Stack: [0x0007fffebd800000,0x0007fffebe200000],  sp=0x0007fffebe1fbcc0,  free space=10223k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x1dee230]  void VMError::report_and_die(int,const char*,const char*,void*,Thread*,unsigned char*,void*,void*,const char*,int,unsigned long)+0x940
V  [libjvm.so+0x1ded87c]  void VMError::report_and_die(Thread*,const char*,int,const char*,const char*,void*)+0x3c
V  [libjvm.so+0xd6e678]  void report_vm_error(const char*,int,const char*,const char*,...)+0x78
V  [libjvm.so+0x1dddaa0]  void ReservedMemoryRegion::set_flag(MemoryType)+0x170
V  [libjvm.so+0x1dde7ec]  void VirtualMemoryTracker::set_reserved_region_type(unsigned char*,MemoryType)+0x26c
V  [libjvm.so+0x1875890]  VirtualSpaceNode::VirtualSpaceNode #Nvariant 1(bool,unsigned long)+0x320
V  [libjvm.so+0x187aea8]  bool VirtualSpaceList::create_new_virtual_space(unsigned long)+0x178
V  [libjvm.so+0x187b4d8]  bool VirtualSpaceList::expand_by(unsigned long,unsigned long)+0x2e8
V  [libjvm.so+0x187b678]  Metachunk*VirtualSpaceList::get_new_chunk(unsigned long,unsigned long)+0xa8
V  [libjvm.so+0x18820c8]  Metachunk*SpaceManager::get_new_chunk(unsigned long)+0x78
V  [libjvm.so+0x1880f2c]  MetaWord*SpaceManager::grow_and_allocate(unsigned long)+0x17c
V  [libjvm.so+0x188224c]  MetaWord*SpaceManager::allocate_work(unsigned long)+0x5c
V  [libjvm.so+0x18821ac]  MetaWord*SpaceManager::allocate(unsigned long)+0x7c
V  [libjvm.so+0x1885ee4]  MetaWord*Metaspace::allocate(ClassLoaderData*,unsigned long,MetaspaceObj::Type,Thread*)+0x144
V  [libjvm.so+0x8847c8]  void*MetaspaceObj::operator new(unsigned long,ClassLoaderData*,unsigned long,MetaspaceObj::Type,Thread*)+0x18

I will attach the test.


Comments
We have not seen this failure in a while, and this test runs all the time. It's also possible that JDK-8201321 fixed this. Closing as Cannot Reproduce (CNR).
25-06-2018

I was not able to reproduce on Linux 64, and I don't have any Solaris machines. If there are no new cases, I would like to close it as not reproducible.
25-06-2018

make run-test TEST=open/test/hotspot/jtreg/vmTestbase/metaspace/stressHierarchy/stressHierarchy012/TestDescription.java JTREG="RETAIN=all;VM_OPTIONS=-XX:NativeMemoryTracking=detail"

works for me. Or you can truncate the TEST path at vmTestbase and it'll run them all, but they might take a while. We haven't seen this crash since April 30th.
12-06-2018

Any instructions on how to run vmTestbase tests?
12-06-2018

The tests are now in open source. They are in test/hotspot/jtreg/vmTestbase/metaspace/stressHierarchy. I'll look around and see if I can see any recent failures.
12-06-2018

Eliminated a scenario in case 2.
10-04-2018

I tried this patch: http://cr.openjdk.java.net/~zgu/8200175/webrev.00/; it passed submit-hs. It looks like, under normal circumstances, (2) has special cases. (I am not sure thread->record_stack_base_and_size() is needed in the attaching listener; will run some tests.) (3) is dead code. But none of them can explain the assertion failure. Do the failing tests do something special, e.g. have some threads exit without proper cleanup, as mentioned in JDK-8199067? Anyway, I will wait until the tests become available to take a further look.
06-04-2018

No, submit-hs only runs tier1 tests, and these are only run in PIT AFAICS.
05-04-2018

Does the submit-hs repo run those tests?
05-04-2018

And, yes, I agree that (3) seems to be dead code.
29-03-2018

But if we're inadvertently over-mmapping memory (except CDS, which does this on some platforms), then don't we have bigger problems than NMT, given what we're writing into this memory?
29-03-2018

I can provoke a failure with the example above. With:

diff --git a/src/hotspot/share/runtime/init.cpp b/src/hotspot/share/runtime/init.cpp
--- a/src/hotspot/share/runtime/init.cpp
+++ b/src/hotspot/share/runtime/init.cpp
@@ -166,6 +166,11 @@
     CommandLineFlags::printFlags(tty, false, PrintFlagsRanges);
   }
 
+  char* addr = os::reserve_memory(4096, NULL, 4096, mtGC);
+  os::commit_memory(addr, 4096, false);
+  char* other_addr = os::reserve_memory(4096, addr, 4096, mtClass);
+  guarantee(addr == other_addr, "Failed");
+
   return JNI_OK;
 }

I get:

# Internal Error (/home/stefank/hg/z/open/src/hotspot/share/services/virtualMemoryTracker.cpp:281), pid=30347, tid=30348
# assert((flag() == mtNone || flag() == f)) failed: Overwrite memory type
29-03-2018

Yes, it was the assumption, unfortunately.
29-03-2018

I have the feeling the memory tracker just does not expect os::reserve_memory calls covering the same region. However, with mmap, overlapping mappings are possible, and the os:: APIs do not prevent this. That would be fine for SysV shm or Windows virtual memory, but mmap is more fluid.
29-03-2018

You are right about case (2). I didn't realize that set_flag had global side effects.

Looking at set_flag:

  void ReservedMemoryRegion::set_flag(MEMFLAGS f) {
    assert((flag() == mtNone || flag() == f), "Overwrite memory type");
    if (flag() != f) {
      VirtualMemorySummary::move_reserved_memory(flag(), f, size());
      VirtualMemorySummary::move_committed_memory(flag(), f, committed_size());
      _flag = f;
    }
  }

This means that we temporarily have more committed memory than reserved memory. I think that works today, but it seems fragile. It would be nice if _reserved >= _committed were an invariant throughout the code.

Thinking more about (2), there might still be a bug here if we have the following scenario (pseudo code):

Reserve memory with mtFlag1:
  addr = os::reserve_memory(...);
  record_virtual_memory_reserve(addr, ..., mtFlag1);

Commit memory:
  os::commit_memory(addr, ...);

Reserve the same memory area with mtFlag2:
  addr = os::reserve_memory(addr, ...);
  record_virtual_memory_reserve(addr, ..., mtFlag2);

The last call to record_virtual_memory_reserve will boil down to (2) above and call set_flag. This will now transfer the committed memory to mtFlag2, even though the committed region was lost when the second os::reserve_memory was called.
29-03-2018

It has only been seen on Solaris, and we don't have anything reproducible. Once we open the rest of the tests, it'll be in the open and may reproduce. I was hoping case (3) could be investigated in the meantime. Or I can move it to JDK 12.
29-03-2018

Is this a Solaris-only bug? Are there any test cases I can use to reproduce on available platforms (Linux or Windows)? BTW, I won't be able to get to this for a while. I need to sort out our debug symbols issues.
29-03-2018

I can't seem to reproduce this locally anymore.
29-03-2018

> I took a look at VirtualMemoryTracker::add_reserved_region:
>
>   ReservedMemoryRegion rgn(base_addr, size, stack, flag);
>   ReservedMemoryRegion* reserved_rgn = _reserved_regions->find(rgn);
>
>   if (reserved_rgn == NULL) {
>     VirtualMemorySummary::record_reserved_memory(size, flag);
>     return _reserved_regions->add(rgn) != NULL;
>   } else {
>     if (reserved_rgn->same_region(base_addr, size)) {
>       reserved_rgn->set_call_stack(stack);
>       reserved_rgn->set_flag(flag);
>       return true;
>     } else if (reserved_rgn->adjacent_to(base_addr, size)) {
>       VirtualMemorySummary::record_reserved_memory(size, flag);
>       reserved_rgn->expand_region(base_addr, size);
>       reserved_rgn->set_call_stack(stack);
>       return true;
>     } else {
>
> If we dissect this a bit:
>
> 1)
>   if (reserved_rgn == NULL) {
>     VirtualMemorySummary::record_reserved_memory(size, flag);
>     return _reserved_regions->add(rgn) != NULL;
>   }
>
> Seems correct. If we didn't have a region for this, just increment the size and add the region.
>
> 2)
>   if (reserved_rgn->same_region(base_addr, size)) {
>     reserved_rgn->set_call_stack(stack);
>     reserved_rgn->set_flag(flag);
>     return true;
>   }
>
> Seems wrong. If we overlap another region, shouldn't we call record_released_memory with the old flag, and record_reserved_memory with the new flag?

I think this looks ok. same_region() is really identity, not overlapping, and set_flag() first does deaccounting with the old flag, then reaccounting with the new one. (Apart from the question: reserving the same address space from a different caller with different memflags - does that even happen, and if yes, is it legal?)

> 3)
>   if (reserved_rgn->adjacent_to(base_addr, size)) {
>     VirtualMemorySummary::record_reserved_memory(size, flag);
>     reserved_rgn->expand_region(base_addr, size);
>     reserved_rgn->set_call_stack(stack);
>     return true;
>   }
>
> Seems wrong. We don't check if the two regions have the same flag, but we expand the old region, which keeps the old flag.

I am not sure we ever even hit this: reserved_rgn is found using _reserved_regions->find(rgn), which returns any region overlapping the given region. But adjacent_to() is a neighbor check, not an overlap check, and overlapping regions are not neighbors. Even if it were hit, though, I agree this looks wrong: VirtualMemorySummary::record_reserved_memory(size, flag) records reserved[flag] += size, while expand_region() and set_call_stack() do not change the counters. So, if the flags of the new and the pre-existing region are not identical, the reserved counters for both flags end up wrong by +/- size.

> This might not be the cause of this bug, but this part should be unit tested and fixed.

Yes.

-----

Since my time is limited, I am no NMT expert, and - above all - I cannot even recreate this issue, I'd like someone else to take this over. I am not sure the new metaspace chunk allocator is even involved here, so I'll unassign myself from this issue.
29-03-2018

Note that JDK-8196405 didn't add merging of reserved regions.

I took a look at VirtualMemoryTracker::add_reserved_region:

  ReservedMemoryRegion rgn(base_addr, size, stack, flag);
  ReservedMemoryRegion* reserved_rgn = _reserved_regions->find(rgn);

  if (reserved_rgn == NULL) {
    VirtualMemorySummary::record_reserved_memory(size, flag);
    return _reserved_regions->add(rgn) != NULL;
  } else {
    if (reserved_rgn->same_region(base_addr, size)) {
      reserved_rgn->set_call_stack(stack);
      reserved_rgn->set_flag(flag);
      return true;
    } else if (reserved_rgn->adjacent_to(base_addr, size)) {
      VirtualMemorySummary::record_reserved_memory(size, flag);
      reserved_rgn->expand_region(base_addr, size);
      reserved_rgn->set_call_stack(stack);
      return true;
    } else {

If we dissect this a bit:

1)
  if (reserved_rgn == NULL) {
    VirtualMemorySummary::record_reserved_memory(size, flag);
    return _reserved_regions->add(rgn) != NULL;
  }

Seems correct. If we didn't have a region for this, just increment the size and add the region.

2)
  if (reserved_rgn->same_region(base_addr, size)) {
    reserved_rgn->set_call_stack(stack);
    reserved_rgn->set_flag(flag);
    return true;
  }

Seems wrong. If we overlap another region, shouldn't we call record_released_memory with the old flag, and record_reserved_memory with the new flag?

3)
  if (reserved_rgn->adjacent_to(base_addr, size)) {
    VirtualMemorySummary::record_reserved_memory(size, flag);
    reserved_rgn->expand_region(base_addr, size);
    reserved_rgn->set_call_stack(stack);
    return true;
  }

Seems wrong. We don't check if the two regions have the same flag, but we expand the old region, which keeps the old flag.

This might not be the cause of this bug, but this part should be unit tested and fixed.
27-03-2018

I am not yet fully convinced this is caused by JDK-8198423. Looking at the source, I think I understand this:

- VirtualSpaceNode gets created:
  - creates a ReservedSpace
  - reserves the memory range with os::reserve_memory (presumably; no indication of large pages), which tracks that reservation with mtNone. This reservation should not have committed parts.
- VirtualSpaceNode tries to re-book this memory to mtClass: MemTracker::record_virtual_memory_type((address)_rs.base(), mtClass);
  - calls VirtualMemoryTracker::set_reserved_region_type
  - finds a pre-existing region with mtNone
  - calls ReservedMemoryRegion::set_flag, which:
    - re-books the reserved counter from mtNone to mtClass (VirtualMemorySummary::move_reserved_memory(flag(), f, size());), which works
    - attempts to re-book the committed counter from mtNone to mtClass. Note that for this region, that committed counter should be zero, since we did not yet commit memory for this VirtualSpaceNode. So we call VirtualMemorySummary::move_committed_memory(flag(), f, committed_size());

But the pre-existing committed size seems to be higher than what is recorded in the VirtualMemory object. The pre-existing committed size comes from ReservedMemoryRegion::committed_size(). The ReservedMemoryRegion for this node is found by "ReservedMemoryRegion* reserved_rgn = _reserved_regions->find(rgn);", which uses ReservedMemoryRegion::compare(), which does some sort of overlap check. So, are we really looking at the committed space size within this VirtualSpaceNode - which should be 0 - or something else?

I am speculating here: maybe the mtNone region for this newly created VirtualSpaceNode has been merged with a neighboring mtNone area which has already committed memory? But mtNone should not have anything committed, or? We do not seem to see mtNone-tagged virtual space in the hs_err file. This could also explain why we only see this on sparc (do we?), since placement of memory regions via mmap is platform dependent.

So, I wonder whether this could have been caused by "JDK-8196405: [REDO] NMT: add_committed_regions doesn't merge succeeding regions"? And since my patch JDK-8198423 probably changed the allocation pattern, maybe it shook this pre-existing bug loose? At least this would be worth checking out. I am leaving this on my name for now, but if anyone feels better positioned to look at this, please take over.
27-03-2018

Attached hs_err_pid file. Still working on a reproducible case outside this harness (the hs_err file is not attaching!). I can reproduce this, just not outside of all the infrastructure.
26-03-2018

I'll upload a test later today with more information (sorry ran out of time last week).
26-03-2018

Is this sparc only? Is there a way for me to reproduce those tests? Can I see the hs_err file?
26-03-2018

The change leading to these failures seems to have been between 3/2/2018 and 3/15/2018. JDK-8198423 is in that time range. I've seen failures in all of stressHierarchy009, stressHierarchy010, stressHierarchy011, and stressHierarchy012.
24-03-2018