JDK-8007074 : SIGSEGV at ParMarkBitMap::verify_clear()

Details
Type:
Bug
Submit Date:
2013-01-29
Status:
Closed
Updated Date:
2014-06-23
Project Name:
JDK
Resolved Date:
2013-08-26
Component:
hotspot
OS:
linux
Sub-Component:
gc
CPU:
Priority:
P2
Resolution:
Fixed
Affected Versions:
hs24,hs25
Fixed Versions:
hs25 (b48)

Related Reports
Backport:
Duplicate:
Duplicate:
Relates:
Relates:
Relates:
Relates:

Sub Tasks

Description
During PIT of hs24-b31 for jdk7u14-b12, the tests cvm/j2me_reg/cdc_foundation/cvm/4903751/Test4903751.sh and cvm/j2me_reg/cdc_foundation/cvm/4338756/Test4338756.java crash:

;; Using jvm: "/export/local/common/jdk/baseline/linux-amd64/jre/lib/amd64/server/libjvm.so"
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f2fd9d7cdb9, pid=28082, tid=139843163752192
#
# JRE version: Java(TM) SE Runtime Environment (7.0_14-b10) (build 1.7.0_14-ea-fastdebug-b10)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.0-b31-internal-201301241932.amurillo.hs24-b31-snapshot-fastdebug mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0xb88db9]  ParMarkBitMap::verify_clear() const+0x29
#
# Core dump written. Default location: /export/local/160741.JAVASE.PIT.VM.linux-amd64_vm__server_mixed_cvm.testlist.runTests/results/ResultDir/Test4338756.java/core or core.28082
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#

                                    

Comments
The fix is risky because it affects the default VM native memory allocation behavior on many modern Linux systems. Taking into account that this is not a security vulnerability, SQE votes to fix it in 7u60.
                                     
2013-10-18
URL:   http://hg.openjdk.java.net/hsx/hsx25/hotspot/rev/4c84d351cca9
User:  amurillo
Date:  2013-08-30 10:57:24 +0000

                                     
2013-08-30
URL:   http://hg.openjdk.java.net/hsx/hotspot-rt/hotspot/rev/4c84d351cca9
User:  stefank
Date:  2013-08-26 22:53:48 +0000

                                     
2013-08-26
SQE is OK with deferring  
                                     
2013-07-10
Confirmed with FMW and Coherence that they are ok with deferring the issue from 7u40. 
                                     
2013-07-10
The changes are relatively large and affect important functionality in the VM. New options are introduced and the default behaviour is affected.
This requires at least 2 weeks of testing. We are not able to test this fix during ATR.
The test plan includes execution of all VM tests, including stress and Bigapps, on specific hosts with specific configurations.
Additional test development could also be required.
To test this feature we need to free some of our resources. This means that we would have to drop some of our activities for 2-3 weeks:
1) bugfix verification of P3 in 7u40
2) ATR analysis (PIT and Promotion) for VM/GC in JDK 8
3) test bugfixing
                                     
2013-07-03
Yes. I've added a comment to JDK-8017481.
                                     
2013-06-25
Is it possible that JDK-8017481 is a duplicate of this issue?
                                     
2013-06-24
While working on JDK-8013057, Dan managed to reproduce the problem that mmap with MAP_FIXED might lose the old reservation on Linux even without MAP_HUGETLB. So this problem is not limited to MAP_HUGETLB mappings.

Engineers working with the Linux kernel have also verified that, unfortunately, this is the intended behavior.

I'm working on a patch to see if we can mimic the behavior of UseSHM which maps large pages upfront when memory is reserved.
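
The idea of mapping large pages upfront can be sketched in plain C. This is only an illustration under stated assumptions, not the HotSpot patch: `struct region`, `reserve_upfront()` and `release_region()` are hypothetical names, and the fallback `MAP_HUGETLB` value is the generic Linux/x86 constant, used only if the headers do not define it. Because the memory is mapped in one step and no earlier reservation sits at a fixed address, a failed huge-page attempt has nothing to discard.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000 /* assumption: generic Linux/x86 value */
#endif

/* Hypothetical result type for this sketch; not a HotSpot structure. */
struct region {
    char  *addr; /* start of the mapping, or NULL on failure */
    size_t size; /* length in bytes */
    int    huge; /* 1 if mapped with MAP_HUGETLB, 0 if small pages */
};

/*
 * Map the memory in a single step, trying huge pages first.
 * No MAP_FIXED is used, so neither attempt can replace or discard
 * an existing mapping; the kernel picks a free address range.
 * size should be a multiple of the huge page size.
 */
static struct region reserve_upfront(size_t size)
{
    struct region r = { NULL, size, 0 };
    int prot = PROT_READ | PROT_WRITE;

    void *p = mmap(NULL, size, prot,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) {
        r.addr = p;
        r.huge = 1;
        return r;
    }

    /* Huge pages unavailable: fall back to normal pages. */
    p = mmap(NULL, size, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p != MAP_FAILED)
        r.addr = p;
    return r;
}

/* Unmap a region returned by reserve_upfront(); returns 0 on success. */
static int release_region(struct region r)
{
    return r.addr != NULL ? munmap(r.addr, r.size) : -1;
}
```

The trade-off in this sketch is that the choice between large and small pages is made once, at reservation time, instead of at each later commit.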
                                     
2013-06-11
To me, the Linux quote is even more clear. Our ReservedSpace construct has
created a 'MAP_NORESERVE' mapping. Our VirtualSpace construct tries to
replace the 'MAP_NORESERVE' mapping with a 'MAP_FIXED' mapping.

It seems to me that this falls right into these words:

     {If the memory region specified by addr and len overlaps pages of any
     existing mapping(s), then the overlapped part of the existing mapping(s)
     will be discarded. If the specified address cannot be used, mmap() will fail.}

Our 'MAP_NORESERVE' mapping is discarded when we try to set up the
'MAP_FIXED' mapping. Notice I didn't say anything about the 'MAP_FIXED'
mapping failing here... yet. And then our 'MAP_FIXED' mapping attempt
fails so now we have no mapping. Neither the old mapping nor the new
mapping. However, our ReservedSpace construct still thinks it has the
space 'reserved', i.e., mapped as 'MAP_NORESERVE'.

I'm working on a possible fix, but I'm having trouble reproducing the crash.
Do you have a (somewhat) reliable reproducer?
                                     
2013-05-16
Thanks for the pointers to bug JDK-6843484.

I can't find the mmap man page text you are referring to in my Linux mmap man pages.

I see that in JDK-6843484 it says:
{'man mmap' says: "If mmap() fails for reasons other than [EBADF], [EINVAL], or
[ENOTSUP], some of the mappings in the address range starting at addr and
continuing for len bytes may have been unmapped.".}

And that in JDK-8013057 there's a comment saying:
{Solaris mmap manpage states:

     If mmap() fails for reasons other than EBADF, EINVAL or
     ENOTSUP, some of the mappings in the address range starting
     at addr and continuing for len bytes may have been unmapped.}

However, in my man pages this is the section for MAP_FIXED:
{MAP_FIXED
  Don't  interpret addr as a hint: place the mapping at exactly that address. 
  addr must be a multiple of the page size.  If the memory region specified by
  addr and len overlaps pages of any existing mapping(s), then the overlapped
  part of the existing mapping(s) will be discarded.  If the specified address
  cannot be used, mmap() will fail.  Because requiring a  fixed  address for a
  mapping is less portable, the use of this option is discouraged.}

It's not clear to me that the mapping should be removed if we get ENOMEM.

                                     
2013-05-16
I don't have a reliable reproducer. I can sometimes reproduce this with the command line I gave above.
                                     
2013-05-16
This bug is related to the following two bugs:

    JDK-6843484 os::commit_memory() failures are not handled properly on linux
    https://jbs.oracle.com/bugs/browse/JDK-6843484 

    JDK-8013057 assert(_needs_gc || SafepointSynchronize::is_at_safepoint()) failed:
                           only read at safepoint 
    https://jbs.oracle.com/bugs/browse/JDK-8013057

For what it's worth, the Linux mmap() man page clearly states that for specific
mmap() failures, the previous mapping may be lost (see JDK-6843484).

                                     
2013-05-15
Bug logged against the Linux kernel:
https://bugzilla.kernel.org/show_bug.cgi?id=57951

                                     
2013-05-10
Attached a simple C reproducer.

Example output from the reproducer (with 200 2MB pages set up as described above):
$ ./a.out
Reserved at: 0x7f7cbee98000-0x7f7cd8298000
Printing /proc/self/maps:
00400000-00401000 r-xp 00000000 08:11 20565214                           /home/stefank/hg/hsx-gc/src/share/vm/gc_implementation/a.out
00600000-00601000 r--p 00000000 08:11 20565214                           /home/stefank/hg/hsx-gc/src/share/vm/gc_implementation/a.out
00601000-00602000 rw-p 00001000 08:11 20565214                           /home/stefank/hg/hsx-gc/src/share/vm/gc_implementation/a.out
7f7cbee98000-7f7cd8298000 rw-p 00000000 00:00 0 
7f7cd8298000-7f7cd844d000 r-xp 00000000 08:06 2089694                    /lib/x86_64-linux-gnu/libc-2.15.so
7f7cd844d000-7f7cd864c000 ---p 001b5000 08:06 2089694                    /lib/x86_64-linux-gnu/libc-2.15.so
7f7cd864c000-7f7cd8650000 r--p 001b4000 08:06 2089694                    /lib/x86_64-linux-gnu/libc-2.15.so
7f7cd8650000-7f7cd8652000 rw-p 001b8000 08:06 2089694                    /lib/x86_64-linux-gnu/libc-2.15.so
7f7cd8652000-7f7cd8657000 rw-p 00000000 00:00 0 
7f7cd8657000-7f7cd8679000 r-xp 00000000 08:06 2089723                    /lib/x86_64-linux-gnu/ld-2.15.so
7f7cd884f000-7f7cd8852000 rw-p 00000000 00:00 0 
7f7cd8876000-7f7cd8879000 rw-p 00000000 00:00 0 
7f7cd8879000-7f7cd887a000 r--p 00022000 08:06 2089723                    /lib/x86_64-linux-gnu/ld-2.15.so
7f7cd887a000-7f7cd887c000 rw-p 00023000 08:06 2089723                    /lib/x86_64-linux-gnu/ld-2.15.so
7fffae46e000-7fffae48f000 rw-p 00000000 00:00 0                          [stack]
7fffae594000-7fffae595000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Tried mmap with MAP_HUGETLB at: 0x7f7cbf000000-0x7f7cd8200000
Printing /proc/self/maps:
00400000-00401000 r-xp 00000000 08:11 20565214                           /home/stefank/hg/hsx-gc/src/share/vm/gc_implementation/a.out
00600000-00601000 r--p 00000000 08:11 20565214                           /home/stefank/hg/hsx-gc/src/share/vm/gc_implementation/a.out
00601000-00602000 rw-p 00001000 08:11 20565214                           /home/stefank/hg/hsx-gc/src/share/vm/gc_implementation/a.out
7f7cbee98000-7f7cbf000000 rw-p 00000000 00:00 0 
7f7cd8200000-7f7cd8298000 rw-p 00000000 00:00 0 
7f7cd8298000-7f7cd844d000 r-xp 00000000 08:06 2089694                    /lib/x86_64-linux-gnu/libc-2.15.so
7f7cd844d000-7f7cd864c000 ---p 001b5000 08:06 2089694                    /lib/x86_64-linux-gnu/libc-2.15.so
7f7cd864c000-7f7cd8650000 r--p 001b4000 08:06 2089694                    /lib/x86_64-linux-gnu/libc-2.15.so
7f7cd8650000-7f7cd8652000 rw-p 001b8000 08:06 2089694                    /lib/x86_64-linux-gnu/libc-2.15.so
7f7cd8652000-7f7cd8657000 rw-p 00000000 00:00 0 
7f7cd8657000-7f7cd8679000 r-xp 00000000 08:06 2089723                    /lib/x86_64-linux-gnu/ld-2.15.so
7f7cd884f000-7f7cd8852000 rw-p 00000000 00:00 0 
7f7cd8876000-7f7cd8879000 rw-p 00000000 00:00 0 
7f7cd8879000-7f7cd887a000 r--p 00022000 08:06 2089723                    /lib/x86_64-linux-gnu/ld-2.15.so
7f7cd887a000-7f7cd887c000 rw-p 00023000 08:06 2089723                    /lib/x86_64-linux-gnu/ld-2.15.so
7fffae46e000-7fffae48f000 rw-p 00000000 00:00 0                          [stack]
7fffae594000-7fffae595000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
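
The attached reproducer is not included in this report. A minimal sketch of what such a probe might look like is given here, assuming 2 MB huge pages. `probe_fixed_hugetlb()`, `page_is_mapped()` and `enum probe_result` are hypothetical names, and instead of printing /proc/self/maps the sketch uses mincore() to ask whether the start of the target range is still mapped after a failed attempt.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000 /* assumption: generic Linux/x86 value */
#endif

#define HUGE_PAGE_SIZE ((size_t)2 * 1024 * 1024) /* assumption: 2 MB */

enum probe_result {
    PROBE_SETUP_FAILED,     /* could not create the initial reservation */
    PROBE_HUGE_MAPPED,      /* the huge-page mmap succeeded */
    PROBE_RESERVATION_KEPT, /* mmap failed, old mapping still present */
    PROBE_RESERVATION_LOST  /* mmap failed and the old mapping is gone */
};

/* Returns 1 if the page containing addr is part of some mapping.
 * mincore() fails with ENOMEM for unmapped pages; any failure is
 * treated as "not mapped" here. */
static int page_is_mapped(const void *addr)
{
    uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t page = (uintptr_t)addr & ~(page_size - 1);
    unsigned char vec;
    return mincore((void *)page, (size_t)page_size, &vec) == 0;
}

static enum probe_result probe_fixed_hugetlb(size_t huge_pages)
{
    size_t len = huge_pages * HUGE_PAGE_SIZE;
    size_t reserved_len = len + HUGE_PAGE_SIZE; /* slack for alignment */

    /* Step 1: reserve address space without committing memory. */
    char *base = mmap(NULL, reserved_len, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return PROBE_SETUP_FAILED;

    /* Round up to the next huge-page boundary inside the reservation. */
    uintptr_t aligned = ((uintptr_t)base + HUGE_PAGE_SIZE - 1)
                        & ~(uintptr_t)(HUGE_PAGE_SIZE - 1);
    char *target = (char *)aligned;

    /* Step 2: try to replace part of the reservation with huge pages. */
    void *res = mmap(target, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS | MAP_HUGETLB,
                     -1, 0);
    int saved_errno = errno;

    enum probe_result result;
    if (res != MAP_FAILED) {
        result = PROBE_HUGE_MAPPED;
        munmap(target, len); /* release the huge pages first */
    } else {
        fprintf(stderr, "huge-page mmap failed, errno=%d\n", saved_errno);
        result = page_is_mapped(target) ? PROBE_RESERVATION_KEPT
                                        : PROBE_RESERVATION_LOST;
    }

    /* munmap() accepts ranges with holes, so this removes whatever
     * is left of the reservation in every outcome. */
    munmap(base, reserved_len);
    return result;
}
```

Calling `probe_fixed_hugetlb(8)` on a machine with fewer than 8 free huge pages exercises the failure path. Which of the two failure outcomes is reported depends on the kernel, so the sketch makes no claim about the result on any particular system.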

                                     
2013-04-15
This can be reproduced by doing:

1) Set up a few large pages
echo 809 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

2) Repeatedly run a small test that provokes a Full GC
while true; do java -server -Xmixed -esa -ea -XX:+UseParallelGC -Xmx32g -XX:+PrintGC -XX:+ShowMessageBoxOnError HelloSystemGC ; done

class HelloSystemGC {
  public static void main(String [] args) {
    System.gc();
  }
}

Note that on this machine the default heap size gets initialized to 32GB which gives 1GB for the bitmaps.
                                     
2013-03-25
The reason for the crash is a bug in the Linux version of os::pd_commit_memory(char* addr, size_t size, size_t alignment_hint, bool exec).

The code that sets up the bitmaps does the following:
1) Reserve the memory for the bitmaps
2) Try to mmap large pages starting at the reserved address start
3) If (2) fails, try to mmap normal pages at the same address.

The problem is that if (2) fails we lose our previous reservation of the memory. This opens up a short window in which other mmaps/mallocs might allocate from the same memory area, causing all kinds of crashes.

(2) and (3) are done in:
bool os::pd_commit_memory(char* addr, size_t size, size_t alignment_hint, bool exec) {
  if (UseHugeTLBFS && alignment_hint > (size_t)vm_page_size()) {
    int prot = exec ? PROT_READ|PROT_WRITE|PROT_EXEC : PROT_READ|PROT_WRITE;
    uintptr_t res =
      (uintptr_t) ::mmap(addr, size, prot,                             // <--- (2)
                         MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_HUGETLB,
                         -1, 0);
    if (res != (uintptr_t) MAP_FAILED) {
      if (UseNUMAInterleaving) {
        numa_make_global(addr, size);
      }
      return true;
    }
    // Fall through and try to use small pages
  }

  if (commit_memory(addr, size, exec)) {           // <--- (3)
    realign_memory(addr, size, alignment_hint);
    return true;
  }
  return false;
}

Note that it's not only the bitmap code that is affected by this bug.
                                     
2013-03-25
In core.20537 there's unmapped memory in the middle of the bit maps.

I'm trying to reproduce this on spbef19, but so far without success. Maybe the machine needs to be running other tests while running the reproducer.
                                     
2013-03-22
Was able to reproduce assert:
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/HUDSON/workspace/2-build-linux-amd64/jdk8/3622/hotspot/src/share/vm/gc_implementation/parallelScavenge/parMarkBitMap.cpp:260), pid=2956, tid=140031773636352
#  assert(*p == 0) failed: bitmap not clear
#
# JRE version: Java(TM) SE Runtime Environment (8.0-b80) (build 1.8.0-ea-fastdebug-b80)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b21-fastdebug mixed mode linux-amd64 compressed oops)
# Core dump written. Default location: /tmp/mlvm-oome/ResultDir/lotsOfCallSites/core or core.2956
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#

---------------  T H R E A D  ---------------

Current thread (0x00007f5c2c0e2800):  VMThread [stack: 0x00007f5bb00ff000,0x00007f5bb0200000] [id=3057]

Stack: [0x00007f5bb00ff000,0x00007f5bb0200000],  sp=0x00007f5bb01fdfd0,  free space=1019k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xd97582]  VMError::report_and_die()+0x302
V  [libjvm.so+0x67b274]  report_vm_error(char const*, int, char const*, char const*)+0x84
V  [libjvm.so+0xb89375]  ParMarkBitMap::verify_clear() const+0x55
V  [libjvm.so+0xbedf46]  PSParallelCompact::pre_compact(PreGCValues*)+0x196
V  [libjvm.so+0xbf72a8]  PSParallelCompact::invoke_no_policy(bool)+0x148
V  [libjvm.so+0xbf872b]  PSParallelCompact::invoke(bool)+0xfb
V  [libjvm.so+0x5d097f]  CollectedHeap::collect_as_vm_thread(GCCause::Cause)+0x1df
V  [libjvm.so+0xd98b09]  VM_CollectForMetadataAllocation::doit()+0x209
V  [libjvm.so+0xdc0919]  VM_Operation::evaluate()+0x89
V  [libjvm.so+0xdbddd1]  VMThread::evaluate_operation(VM_Operation*)+0xb1
V  [libjvm.so+0xdbf040]  VMThread::loop()+0x660
V  [libjvm.so+0xdbf270]  VMThread::run()+0xb0
V  [libjvm.so+0xb6e818]  java_start(Thread*)+0x108

VM_Operation (0x00007f5c339c7100): CollectForMetadataAllocation, mode: safepoint, requested by thread 0x00007f5c2c00c000

See attachment "hs_err_pid2956.log"

Core file is available here:  /net/vmsqe.ru.oracle.com/export/home/ppunegov/crashes/JDK-8007074/core.2956
                                     
2013-03-21
Also found this crash on product build:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007faa12b5db58, pid=20537, tid=140368391636736
#
# JRE version: Java(TM) SE Runtime Environment (8.0-b80) (build 1.8.0-ea-b80)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b21 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x820b58]  ParMarkBitMap::mark_obj(HeapWord*, unsigned long)+0xb8
#
# Core dump written. Default location: /tmp/mlvm-oome/ResultDir/lotsOfCallSites/core or core.20537
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#

See attachment: hs_err_pid20537.log
Core file is available: /net/vmsqe.ru.oracle.com/export/home/ppunegov/crashes/JDK-8007074/core.20537
                                     
2013-03-21
Reopen since this happened again.
                                     
2013-03-20
The 2013.03.15 RT_Baseline nightly ran into a failure related to this bug.

nsk/jdi/VirtualMachine/instanceCounts/instancecounts003

    This test failed the following assertion:

    #  Internal Error (/opt/jprt/T/P1/214605.cphillim/s/src/share/vm/gc_implementation/parallelScavenge/parMarkBitMap.cpp:260), pid=28318, tid=140461972346624
    #  assert(*p == 0) failed: bitmap not clear
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0-b81) (build 1.8.0-ea-fastdebug-b81)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b23-internal-201303152146.cphillim.il-8007725-fastdebug mixed mode linux-amd64 compressed oops)
    # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
    #
    # If you would like to submit a bug report, please visit:
    #   http://bugreport.sun.com/bugreport/crash.jsp

    Test run URL: http://aurora.ru.oracle.com/functional/faces/RunDetails.xhtml?names=186461.JAVASE.NIGHTLY.VM.RT_Baseline.2013-03-15-36&show-limit=2000&filter=

    Host:    spbef17, Intel Xeon 3068 MHz, 24 cores, 142G, Linux / Oracle Linux 6.2, x86_64
    Options: -server -Xmixed -XX:DefaultMaxRAMFraction=8 -XX:+CreateMinidumpOnCrash -XX:ReservedCodeCacheSize=256M

    There is an existing bug that describes this failure mode:

    JDK-8003121 Internal error in parMarkBitMap
    https://jbs.oracle.com/bugs/browse/JDK-8003121

    However, 8003121 is closed as a duplicate of the following bug:

    JDK-8007074 SIGSEGV at ParMarkBitMap::verify_clear()
    https://jbs.oracle.com/bugs/browse/JDK-8007074

    However, 8007074 is closed as "cannot reproduce". I will add this entry to
    8007074.
                                     
2013-03-18
We have run the different tests that failed in this bug and in the duplicate bug over and over, more than 120,000 times in total. We also restarted the PIT using gtee, in the same way as the tests were originally run, on the same hardware and using the same JDK, in a loop more than 100 times. No bitmap-related failures occurred.

A number of tests failed in the same way on the same machine during that PIT. Hardware problems are not ruled out. Maybe we are not handling a failed malloc properly. Or maybe javac wasn't using the same JDK as the one under test and didn't have the fix for JDK-8003121.
                                     
2013-02-12
There is no way G1 can cause a crash in ParMarkBitMap::verify_clear(). ParMarkBitMap is a parallel GC structure. I believe the addition of the "used vm_opts" comment in the description is bogus.
                                     
2013-01-29
There seems to be C-heap-allocated memory inside the bit maps.

From the core file:
(gdb) p p
 (const size_t *) 0x7f9e2c000008

(gdb) x /4xg p-1
 0x7f9e2c000000:	0x00007f9e2c000020	0x0000000000000000
 0x7f9e2c000010:	0x0000000000289000	0x0000000000289000

p /x *PSParallelCompact::_mark_bitmap->_virtual_space 
 {
  <CHeapObj<1280u>> = {
    <AllocatedObj> = {
      _vptr.AllocatedObj = 0x7f9e4481a7f0
    }, <No data fields>}, 
  members of PSVirtualSpace: 
  _alignment = 0x200000, 
  _reserved_low_addr = 0x7f9df8000000, 
  _reserved_high_addr = 0x7f9e34000000, 
  _committed_low_addr = 0x7f9df8000000, 
  _committed_high_addr = 0x7f9e34000000, 
  _special = 0x0
}

From hs_err:
 Event: 0.777 loading class 0x00007f9e2c04b5b8  <--- Inside the bit map

(gdb) p (char*) ((Symbol*)0x7f9e2c04b5b8)->_body
 0x7f9e2c04b5da "sun/security/action/GetBooleanAction\253\253\253\253\253\253\253\253\253\253\260??D\236\177"


This looks a lot like the recently fixed:
 JDK-8003121: Jvm crashed during coherence exabus (tmb) testing

However, the fix for that has already been pushed to HS24.
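
The claim that these addresses lie inside the bit map can be checked with simple range arithmetic on the values from the gdb session above. The snippet below is only a worked check: `in_range()` and the constant names are made up for this illustration, and the numbers are copied from the gdb output.

```c
#include <stdint.h>

/* Returns 1 if addr lies in the half-open range [low, high). */
static int in_range(uint64_t addr, uint64_t low, uint64_t high)
{
    return addr >= low && addr < high;
}

/* Values copied from the gdb output (hypothetical constant names). */
static const uint64_t committed_low  = UINT64_C(0x7f9df8000000); /* _committed_low_addr  */
static const uint64_t committed_high = UINT64_C(0x7f9e34000000); /* _committed_high_addr */
static const uint64_t verify_clear_p = UINT64_C(0x7f9e2c000008); /* (gdb) p p            */
static const uint64_t symbol_addr    = UINT64_C(0x7f9e2c04b5b8); /* class from hs_err    */
```

Both `in_range(verify_clear_p, committed_low, committed_high)` and `in_range(symbol_addr, committed_low, committed_high)` evaluate to 1. The committed range itself spans 0x3c000000 bytes, i.e. 960 MB.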
                                     
2013-01-29
Yes, my bad.
The crash actually happened not during the test run but during javac:

/export/local/common/jdk/baseline/linux-amd64/bin/javac -d /export/local/160741.JAVASE.PIT.VM.linux-amd64_vm__server_mixed_cvm.testlist.runTests/results/ResultDir/Test4338756.java /export/local/common/testbase/7/vm/vm//src/cvm/j2me_reg/cdc_foundation/cvm/4338756/Test4338756.java
                                     
2013-01-29


