Bug ID: JDK-8007074 SIGSEGV at ParMarkBitMap::verify_clear()

JDK 7	JDK 8	Other
7u60 Fixed	8 Fixed	hs25 Fixed

The fix is risky because it affects default VM native memory allocation behavior on many modern Linux systems. Taking into account that this is not a security vulnerability, SQE votes to fix it in 7u60.
18-10-2013
While working on JDK-8013057, Dan managed to reproduce the problem that mmap with MAP_FIXED might lose the old reservation on Linux even without MAP_HUGETLB. So this problem is not limited to MAP_HUGETLB mappings. Engineers working on the Linux kernel have also verified that, unfortunately, this is the intended behavior. I'm working on a patch to see if we can mimic the behavior of UseSHM, which maps large pages upfront when memory is reserved.
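A minimal sketch of the "map large pages upfront when memory is reserved" idea, in plain C rather than HotSpot code. This is not the actual patch: the helper name reserve_prefer_huge and its got_huge out-parameter are made up for illustration. The point it tries to show is that neither mmap call uses MAP_FIXED, so a failed large-page attempt cannot tear down a mapping the process already owns.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper (not HotSpot code): reserve `size` bytes, preferring
 * huge pages. Sets *got_huge to 1 if the huge-page mapping succeeded and
 * to 0 if small pages were used. Returns NULL if both attempts fail. */
static void *reserve_prefer_huge(size_t size, int *got_huge) {
    assert(size > 0 && got_huge != NULL);

    /* No MAP_FIXED: the kernel picks the address, so a failure here
     * leaves every existing mapping untouched. */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) {
        *got_huge = 1;
        return p;
    }

    /* Fall back to small pages, again at a kernel-chosen address. */
    *got_huge = 0;
    p = mmap(NULL, size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

A caller would decide once, at reservation time, which page size it got and would never need to replace the reservation later. The size passed in is assumed to be a multiple of the huge page size.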
30-08-2013
The reason for the crash is a bug in the Linux version of os::pd_commit_memory(char* addr, size_t size, size_t alignment_hint, bool exec).
The code to set up the bitmaps does the following:
1) Reserve the memory for the bitmaps
2) Try to mmap large pages starting at the reserved address start
3) If (2) fails, try to mmap normal pages at the same address.
The problem is that if (2) fails we lose our previous reservation of the memory. This opens up a short window when other mmaps/mallocs might allocate from the same memory area, causing all kinds of crashes.
(2) and (3) are done in:
    bool os::pd_commit_memory(char* addr, size_t size, size_t alignment_hint,
                              bool exec) {
      if (UseHugeTLBFS && alignment_hint > (size_t)vm_page_size()) {
        int prot = exec ? PROT_READ|PROT_WRITE|PROT_EXEC : PROT_READ|PROT_WRITE;
        uintptr_t res =
          (uintptr_t) ::mmap(addr, size, prot,                        // <--- (2)
                             MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_HUGETLB,
                             -1, 0);
        if (res != (uintptr_t) MAP_FAILED) {
          if (UseNUMAInterleaving) {
            numa_make_global(addr, size);
          }
          return true;
        }
        // Fall through and try to use small pages
      }
      if (commit_memory(addr, size, exec)) {                          // <--- (3)
        realign_memory(addr, size, alignment_hint);
        return true;
      }
      return false;
    }
Note that it's not only the bitmap code that is affected by this bug.
30-08-2013
URL: http://hg.openjdk.java.net/hsx/hsx25/hotspot/rev/4c84d351cca9 User: amurillo Date: 2013-08-30 10:57:24 +0000
30-08-2013
URL: http://hg.openjdk.java.net/hsx/hotspot-rt/hotspot/rev/4c84d351cca9 User: stefank Date: 2013-08-26 22:53:48 +0000
26-08-2013
Confirmed with FMW and Coherence that they are ok with deferring the issue from 7u40.
10-07-2013
SQE is OK with deferring
10-07-2013
This can be reproduced by doing:
1) Set up a few large pages
    echo 809 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
2) Repeatedly run a small test that provokes a Full GC
    while true; do java -server -Xmixed -esa -ea -XX:+UseParallelGC -Xmx32g -XX:+PrintGC -XX:+ShowMessageBoxOnError HelloSystemGC ; done
    class HelloSystemGC {
      public static void main(String [] args) {
        System.gc();
      }
    }
Note that on this machine the default heap size gets initialized to 32GB, which gives 1GB for the bitmaps.
03-07-2013
The changes are relatively large and affect important functionality in the VM. New options are introduced and default behaviour is affected. This requires at least 2 weeks of testing. We are not able to test this fix during ATR. The test plan includes execution of all VM tests, including stress and Bigapps, on specific hosts with specific configurations. Additional test development could also be required. To test this feature we need to free some of our resources, which means dropping some of our activities for 2-3 weeks:
1) bugfix verification of P3 in 7u40
2) ATR analysis (PIT and Promotion) for VM/GC in JDK 8
3) test bugfixing
03-07-2013
Yes. I've added a comment to JDK-8017481.
25-06-2013
Is it possible that JDK-8017481 is a duplicate of this issue?
24-06-2013
I don't have a reliable reproducer. I can sometimes reproduce this with the command line I gave above.
16-05-2013
To me, the Linux quote is even more clear. Our ReservedSpace construct has created a 'MAP_NORESERVE' mapping. Our VirtualSpace construct tries to replace the 'MAP_NORESERVE' mapping with a 'MAP_FIXED' mapping. It seems to me that this falls right into these words: {If the memory region specified by addr and len overlaps pages of any existing mapping(s), then the overlapped part of the existing mapping(s) will be discarded. If the specified address cannot be used, mmap() will fail.} Our 'MAP_NORESERVE' mapping is discarded when we try to setup the 'MAP_FIXED' mapping. Notice I didn't say anything about the 'MAP_FIXED' mapping failing here... yet. And then our 'MAP_FIXED' mapping attempt fails so now we have no mapping. Neither the old mapping nor the new mapping. However, our ReservedSpace construct still thinks it has the space 'reserved', i.e., mapped as 'MAP_NORESERVE'. I'm working on a possible fix, but I'm having trouble reproducing the crash. Do you have a (somewhat) reliable reproducer?
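A small C sketch of the successful half of that quoted MAP_FIXED sentence: a second mmap with MAP_FIXED over an existing anonymous mapping discards the old pages, so data written through the old mapping is gone. The function name value_after_fixed_remap is made up for illustration, and the failure case discussed in this comment is deliberately not covered here because it depends on the kernel and on huge page availability.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Writes 42 into a fresh anonymous page, maps a new anonymous page over the
 * same address with MAP_FIXED, and returns the first byte of the new page.
 * Returns -1 if either mmap fails. */
static int value_after_fixed_remap(void) {
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    size_t page = (size_t)ps;

    unsigned char *old = mmap(NULL, page, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (old == MAP_FAILED) return -1;
    old[0] = 42;

    /* Same address, MAP_FIXED: the existing mapping is discarded and
     * replaced by a new zero-filled anonymous page. */
    unsigned char *fresh = mmap(old, page, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (fresh == MAP_FAILED) return -1;

    int value = fresh[0];   /* 0, not 42: the old contents are gone */
    munmap(fresh, page);
    return value;
}
```

The old mapping is replaced as part of the mmap call itself; there is no separate step in which user code could save or restore it.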
16-05-2013
Thanks for the pointers to bug JDK-6843484. I don't find the mmap man page text you are referring to, on my Linux mmap man pages. I see that in JDK-6843484 it says: {'man mmap' says: "If mmap() fails for reasons other than [EBADF], [EINVAL], or [ENOTSUP], some of the mappings in the address range starting at addr and continuing for len bytes may have been unmapped.".} And that in JDK-8013057 there's a comment saying: {Solaris mmap manpage states: If mmap() fails for reasons other than EBADF, EINVAL or ENOTSUP, some of the mappings in the address range starting at addr and continuing for len bytes may have been unmapped.} However, in my man pages this is the section for MAP_FIXED: {MAP_FIXED Don't interpret addr as a hint: place the mapping at exactly that address. addr must be a multiple of the page size. If the memory region specified by addr and len overlaps pages of any existing mapping(s), then the overlapped part of the existing mapping(s) will be discarded. If the specified address cannot be used, mmap() will fail. Because requiring a fixed address for a mapping is less portable, the use of this option is discouraged.} It's not clear to me that the mapping should be removed if we get ENOMEM.
16-05-2013
This bug is related to the following two bugs:
JDK-6843484 os::commit_memory() failures are not handled properly on linux
https://jbs.oracle.com/bugs/browse/JDK-6843484
JDK-8013057 assert(_needs_gc || SafepointSynchronize::is_at_safepoint()) failed: only read at safepoint
https://jbs.oracle.com/bugs/browse/JDK-8013057
For what it's worth, the Linux mmap() man page clearly states that for specific mmap() failures, the previous mapping may be lost (see JDK-6843484).
15-05-2013
Bug logged against the Linux kernel: https://bugzilla.kernel.org/show_bug.cgi?id=57951
10-05-2013
Attached a simple C reproducer. Example output from the reproducer (with 200 2MB pages set up as described above):
    $ ./a.out
    Reserved at: 0x7f7cbee98000-0x7f7cd8298000
    Printing /proc/self/maps:
    00400000-00401000 r-xp 00000000 08:11 20565214 /home/stefank/hg/hsx-gc/src/share/vm/gc_implementation/a.out
    00600000-00601000 r--p 00000000 08:11 20565214 /home/stefank/hg/hsx-gc/src/share/vm/gc_implementation/a.out
    00601000-00602000 rw-p 00001000 08:11 20565214 /home/stefank/hg/hsx-gc/src/share/vm/gc_implementation/a.out
    7f7cbee98000-7f7cd8298000 rw-p 00000000 00:00 0
    7f7cd8298000-7f7cd844d000 r-xp 00000000 08:06 2089694 /lib/x86_64-linux-gnu/libc-2.15.so
    7f7cd844d000-7f7cd864c000 ---p 001b5000 08:06 2089694 /lib/x86_64-linux-gnu/libc-2.15.so
    7f7cd864c000-7f7cd8650000 r--p 001b4000 08:06 2089694 /lib/x86_64-linux-gnu/libc-2.15.so
    7f7cd8650000-7f7cd8652000 rw-p 001b8000 08:06 2089694 /lib/x86_64-linux-gnu/libc-2.15.so
    7f7cd8652000-7f7cd8657000 rw-p 00000000 00:00 0
    7f7cd8657000-7f7cd8679000 r-xp 00000000 08:06 2089723 /lib/x86_64-linux-gnu/ld-2.15.so
    7f7cd884f000-7f7cd8852000 rw-p 00000000 00:00 0
    7f7cd8876000-7f7cd8879000 rw-p 00000000 00:00 0
    7f7cd8879000-7f7cd887a000 r--p 00022000 08:06 2089723 /lib/x86_64-linux-gnu/ld-2.15.so
    7f7cd887a000-7f7cd887c000 rw-p 00023000 08:06 2089723 /lib/x86_64-linux-gnu/ld-2.15.so
    7fffae46e000-7fffae48f000 rw-p 00000000 00:00 0 [stack]
    7fffae594000-7fffae595000 r-xp 00000000 00:00 0 [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
    Tried mmap with MAP_HUGETLB at: 0x7f7cbf000000-0x7f7cd8200000
    Printing /proc/self/maps:
    00400000-00401000 r-xp 00000000 08:11 20565214 /home/stefank/hg/hsx-gc/src/share/vm/gc_implementation/a.out
    00600000-00601000 r--p 00000000 08:11 20565214 /home/stefank/hg/hsx-gc/src/share/vm/gc_implementation/a.out
    00601000-00602000 rw-p 00001000 08:11 20565214 /home/stefank/hg/hsx-gc/src/share/vm/gc_implementation/a.out
    7f7cbee98000-7f7cbf000000 rw-p 00000000 00:00 0
    7f7cd8200000-7f7cd8298000 rw-p 00000000 00:00 0
    7f7cd8298000-7f7cd844d000 r-xp 00000000 08:06 2089694 /lib/x86_64-linux-gnu/libc-2.15.so
    7f7cd844d000-7f7cd864c000 ---p 001b5000 08:06 2089694 /lib/x86_64-linux-gnu/libc-2.15.so
    7f7cd864c000-7f7cd8650000 r--p 001b4000 08:06 2089694 /lib/x86_64-linux-gnu/libc-2.15.so
    7f7cd8650000-7f7cd8652000 rw-p 001b8000 08:06 2089694 /lib/x86_64-linux-gnu/libc-2.15.so
    7f7cd8652000-7f7cd8657000 rw-p 00000000 00:00 0
    7f7cd8657000-7f7cd8679000 r-xp 00000000 08:06 2089723 /lib/x86_64-linux-gnu/ld-2.15.so
    7f7cd884f000-7f7cd8852000 rw-p 00000000 00:00 0
    7f7cd8876000-7f7cd8879000 rw-p 00000000 00:00 0
    7f7cd8879000-7f7cd887a000 r--p 00022000 08:06 2089723 /lib/x86_64-linux-gnu/ld-2.15.so
    7f7cd887a000-7f7cd887c000 rw-p 00023000 08:06 2089723 /lib/x86_64-linux-gnu/ld-2.15.so
    7fffae46e000-7fffae48f000 rw-p 00000000 00:00 0 [stack]
    7fffae594000-7fffae595000 r-xp 00000000 00:00 0 [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
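The attached reproducer itself is not shown in this entry. The following is only a rough sketch of a program in the same spirit, not the attachment: it reserves a range, tries a MAP_FIXED|MAP_HUGETLB mapping inside it, and reports whether the reservation survived. The names range_is_mapped, try_hugetlb_commit and the outcome enum are made up for illustration, a 2 MB huge page size is assumed, and msync() returning ENOMEM is used as the "range contains unmapped pages" check instead of printing /proc/self/maps.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HUGE_SZ ((size_t)2 << 20)   /* assumption: 2 MB huge pages */

enum outcome { HUGE_COMMITTED, RESERVATION_INTACT, RESERVATION_LOST, SETUP_FAILED };

/* Returns 1 if every page in [addr, addr+len) is mapped, 0 otherwise.
 * msync() fails with ENOMEM when the range contains unmapped pages.
 * addr must be page aligned. */
static int range_is_mapped(void *addr, size_t len) {
    return !(msync(addr, len, MS_ASYNC) == -1 && errno == ENOMEM);
}

/* Reserve reserve_len bytes, then try to map commit_len bytes of huge pages
 * at the first HUGE_SZ-aligned address inside the reservation.
 * Requires reserve_len >= commit_len + HUGE_SZ so the aligned range fits. */
static enum outcome try_hugetlb_commit(size_t reserve_len, size_t commit_len) {
    assert(commit_len > 0 && reserve_len >= commit_len + HUGE_SZ);

    /* Step 1: the reservation (no access, no swap accounting). */
    char *base = mmap(NULL, reserve_len, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) return SETUP_FAILED;

    uintptr_t mask = (uintptr_t)HUGE_SZ - 1;
    char *aligned = (char *)(((uintptr_t)base + mask) & ~mask);

    /* Step 2: the fallible huge-page mapping on top of the reservation. */
    void *res = mmap(aligned, commit_len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_HUGETLB,
                     -1, 0);

    enum outcome result;
    if (res != MAP_FAILED) {
        result = HUGE_COMMITTED;
    } else {
        /* The interesting case: did the failed call punch a hole? */
        result = range_is_mapped(aligned, commit_len) ? RESERVATION_INTACT
                                                      : RESERVATION_LOST;
    }

    /* Single-threaded sketch: nothing else can have mapped into a hole. */
    munmap(base, reserve_len);
    return result;
}
```

Which outcome is returned depends on the kernel version and on how many huge pages are configured, so no particular result is claimed here. RESERVATION_LOST corresponds to the gap visible in the second /proc/self/maps listing.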
15-04-2013
In core.20537 there's unmapped memory in the middle of the bit maps. I'm trying to reproduce this on spbef19, but so far without success. Maybe the machine needs to be running other tests while running the reproducer.
22-03-2013
Also found this crash on product build:
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    # SIGSEGV (0xb) at pc=0x00007faa12b5db58, pid=20537, tid=140368391636736
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0-b80) (build 1.8.0-ea-b80)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b21 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    # V [libjvm.so+0x820b58] ParMarkBitMap::mark_obj(HeapWord*, unsigned long)+0xb8
    #
    # Core dump written. Default location: /tmp/mlvm-oome/ResultDir/lotsOfCallSites/core or core.20537
    #
    # If you would like to submit a bug report, please visit:
    # http://bugreport.sun.com/bugreport/crash.jsp
    #
See attachment: hs_err_pid20537.log
Core file is available: /net/vmsqe.ru.oracle.com/export/home/ppunegov/crashes/JDK-8007074/core.20537
21-03-2013
Was able to reproduce assert:
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    # Internal Error (/HUDSON/workspace/2-build-linux-amd64/jdk8/3622/hotspot/src/share/vm/gc_implementation/parallelScavenge/parMarkBitMap.cpp:260), pid=2956, tid=140031773636352
    # assert(*p == 0) failed: bitmap not clear
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0-b80) (build 1.8.0-ea-fastdebug-b80)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b21-fastdebug mixed mode linux-amd64 compressed oops)
    # Core dump written. Default location: /tmp/mlvm-oome/ResultDir/lotsOfCallSites/core or core.2956
    #
    # If you would like to submit a bug report, please visit:
    # http://bugreport.sun.com/bugreport/crash.jsp
    #
    --------------- T H R E A D ---------------
    Current thread (0x00007f5c2c0e2800): VMThread [stack: 0x00007f5bb00ff000,0x00007f5bb0200000] [id=3057]
    Stack: [0x00007f5bb00ff000,0x00007f5bb0200000], sp=0x00007f5bb01fdfd0, free space=1019k
    Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
    V [libjvm.so+0xd97582] VMError::report_and_die()+0x302
    V [libjvm.so+0x67b274] report_vm_error(char const*, int, char const*, char const*)+0x84
    V [libjvm.so+0xb89375] ParMarkBitMap::verify_clear() const+0x55
    V [libjvm.so+0xbedf46] PSParallelCompact::pre_compact(PreGCValues*)+0x196
    V [libjvm.so+0xbf72a8] PSParallelCompact::invoke_no_policy(bool)+0x148
    V [libjvm.so+0xbf872b] PSParallelCompact::invoke(bool)+0xfb
    V [libjvm.so+0x5d097f] CollectedHeap::collect_as_vm_thread(GCCause::Cause)+0x1df
    V [libjvm.so+0xd98b09] VM_CollectForMetadataAllocation::doit()+0x209
    V [libjvm.so+0xdc0919] VM_Operation::evaluate()+0x89
    V [libjvm.so+0xdbddd1] VMThread::evaluate_operation(VM_Operation*)+0xb1
    V [libjvm.so+0xdbf040] VMThread::loop()+0x660
    V [libjvm.so+0xdbf270] VMThread::run()+0xb0
    V [libjvm.so+0xb6e818] java_start(Thread*)+0x108
    VM_Operation (0x00007f5c339c7100): CollectForMetadataAllocation, mode: safepoint, requested by thread 0x00007f5c2c00c000
See attachment "hs_err_pid2956.log"
Core file is available here: /net/vmsqe.ru.oracle.com/export/home/ppunegov/crashes/JDK-8007074/core.2956
21-03-2013
Reopen since this happened again.
20-03-2013
The 2013.03.15 RT_Baseline nightly ran into a failure related to this bug.
    nsk/jdi/VirtualMachine/instanceCounts/instancecounts003
This test failed the following assertion:
    # Internal Error (/opt/jprt/T/P1/214605.cphillim/s/src/share/vm/gc_implementation/parallelScavenge/parMarkBitMap.cpp:260), pid=28318, tid=140461972346624
    # assert(*p == 0) failed: bitmap not clear
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0-b81) (build 1.8.0-ea-fastdebug-b81)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b23-internal-201303152146.cphillim.il-8007725-fastdebug mixed mode linux-amd64 compressed oops)
    # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
    #
    # If you would like to submit a bug report, please visit:
    # http://bugreport.sun.com/bugreport/crash.jsp
Test run URL: http://aurora.ru.oracle.com/functional/faces/RunDetails.xhtml?names=186461.JAVASE.NIGHTLY.VM.RT_Baseline.2013-03-15-36&show-limit=2000&filter=
Host: spbef17, Intel Xeon 3068 MHz, 24 cores, 142G, Linux / Oracle Linux 6.2, x86_64
Options: -server -Xmixed -XX:DefaultMaxRAMFraction=8 -XX:+CreateMinidumpOnCrash -XX:ReservedCodeCacheSize=256M
There is an existing bug that describes this failure mode:
    JDK-8003121 Internal error in parMarkBitMap
    https://jbs.oracle.com/bugs/browse/JDK-8003121
However, 8003121 is closed as a duplicate of the following bug:
    JDK-8007074 SIGSEGV at ParMarkBitMap::verify_clear()
    https://jbs.oracle.com/bugs/browse/JDK-8007074
However, 8007074 is closed as "cannot reproduce". I will add this entry to 8007074.
18-03-2013
We have run the different tests that failed in this and the duplicated bug over and over, in total more than 120,000 times. We also tried restarting the PIT using gtee, in the same way as the tests were originally run, on the same hardware and using the same JDK, in a loop over 100 times. No bitmap-related failures. A bunch of tests failed in the same way on the same machine in that PIT. Hardware problems are not ruled out. Maybe we are not handling a failed malloc properly. Or maybe javac wasn't using the same JDK as was tested and didn't have the fix for JDK-8003121.
12-02-2013
Yes, my bad. The crash actually happened not during the test run but during javac:
    /export/local/common/jdk/baseline/linux-amd64/bin/javac -d /export/local/160741.JAVASE.PIT.VM.linux-amd64_vm__server_mixed_cvm.testlist.runTests/results/ResultDir/Test4338756.java /export/local/common/testbase/7/vm/vm//src/cvm/j2me_reg/cdc_foundation/cvm/4338756/Test4338756.java
29-01-2013
There is no way G1 can cause a crash in ParMarkBitMap::verify_clear(). ParMarkBitMap is a parallel GC structure. I believe the addition of the "used vm_opts" comment in the description is bogus.
29-01-2013
There seems to be C heap allocated memory inside the bit maps. From the core file:
    (gdb) p p
    (const size_t *) 0x7f9e2c000008
    (gdb) x /4xg p-1
    0x7f9e2c000000: 0x00007f9e2c000020 0x0000000000000000
    0x7f9e2c000010: 0x0000000000289000 0x0000000000289000
    p /x PSParallelCompact::_mark_bitmap->_virtual_space
    {
      <CHeapObj<1280u>> = {
        <AllocatedObj> = {
          _vptr.AllocatedObj = 0x7f9e4481a7f0
        }, <No data fields>},
      members of PSVirtualSpace:
      _alignment = 0x200000,
      _reserved_low_addr = 0x7f9df8000000,
      _reserved_high_addr = 0x7f9e34000000,
      _committed_low_addr = 0x7f9df8000000,
      _committed_high_addr = 0x7f9e34000000,
      _special = 0x0
    }
From hs_err:
    Event: 0.777 loading class 0x00007f9e2c04b5b8   <--- Inside the bit map
    (gdb) p (char*) ((Symbol*)0x7f9e2c04b5b8)->_body
    0x7f9e2c04b5da "sun/security/action/GetBooleanAction\253\253\253\253\253\253\253\253\253\253\260ՁD\236\177"
This looks a lot like the recently fixed JDK-8003121: Jvm crashed during coherence exabus (tmb) testing. However, the fix for that has already been pushed to HS24.
29-01-2013

Duplicate :	JDK-8003121 - Internal error in parMarkBitMap
Duplicate :	JDK-8016666 - G1:UseSHM crashes the JVM if G1HeapRegionSize > os::large_page_size()
Duplicate :	JDK-8009602 - Should be graceful/informative exit rather than crash
Relates :	JDK-8024396 - VM crashing with assert(!UseLargePages || UseParallelOldGC || use_large_pages) failed: Wrong alignment to use large pages
Relates :	JDK-8017629 - G1: UseSHM in combination with a G1HeapRegionSize > os::large_page_size() falls back to use small pages
Relates :	JDK-8012015 - Use PROT_NONE when reserving memory
Relates :	JDK-6843484 - os::commit_memory() failures are not handled properly on linux