Bug ID: JDK-6635560 segv in reference processor on t1000

Details
Type: Bug
Submit Date: 2007-11-29
Status: Closed
Updated Date: 2012-02-01
Project Name: JDK
Resolved Date: 2010-01-13
Component: hotspot
OS: generic,solaris_10
Sub-Component: gc
CPU: sparc,generic
Priority: P3
Resolution: Fixed
Affected Versions: 7
Fixed Versions: hs11 (b09)

Description
See Comments.

                                    

Comments
WORK AROUND

Increase the size of the perm gen, e.g., use -XX:MaxPermSize=128m.
Whether this works because it makes the perm gen bigger or whether
it works because it changes the page size that maps the perm gen
is unknown.
                                     
2007-11-29
EVALUATION

Something very strange is going on in the process when it crashes.  A native library from the testbase, libloadClass.so, is mapped over a very large region of memory, right over the top of many of the VM data structures (java heap, card table, etc.).  The file is ~136K in size, but the mappings attributed to it total more than 110 MB.

See attached hs_err file (contains heap addresses) as well as the similarly-named pmap file taken from the core.

libloadClass.so is being read over nfs; there are numerous error messages in the
/var/adm/messages file indicating that the nfs server where it lives stops responding and then comes back.  None of the message timestamps exactly correspond to the crashes I've seen, but they are reasonably close (15-30min).  We cannot yet put the blame outside the JVM, but the mappings in the process are bizarre enough that nfs problems should be ruled out.  Need to move all the test binaries and data local to one machine and then see whether the problem is still reproducible.
                                     
2007-12-01
EVALUATION

The crash reproduces the same way using both a local jdk and a local vm testbase:
        jdk: /export/local/common/jdk/6u4p/b01/solaris-sparcv9
vm testbase: /export/local/common/testbase/6/vm/bin

To reproduce do:
1. ssh vm-t1000-03 -l gtee

2. cd /net/sqenfs-1.sfbay/export1/comp/vm/execution/results/JDK_PERFORMANCE/PROMOTION/VM/6u4p/b20/2007-11-24_1/vm/64BITSOLSPARC5.10/server/mixed/vm-64BITSOLSPARC5.10_server_mixed_vm.parallel_class_loading.testlist2007-11-24-16-31-30/ResultDir/inner-simple_vm-t1000-03_katya

3. /net/sqenfs-1.sfbay/export1/comp/vm/bin/reproduce_bug.sh rerun_jdk.sh.local

See:
 hs_err_pid29783.log, core_pid29783

 hs_err_pid6008.log, core_pid6008 - fastdebug bits
                                     
2007-12-01
EVALUATION

The libloadClass.so mappings reported by pmap appear to be a limitation (or bug) in pmap; they can be ignored.

I ran the test case with

     truss -f -s\!all -tmmap,munmap java ...

and saw the following output just before the crash:

11508/27:   mmap(0xFFFFFFFF6D000000, 4194304, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFFFFFFFF6D000000
 [PSYoungGen: 22688K->1088K(35520K)] 23816K->2496K(60096K), 0.0355506 secs] [Times: user=0.19 sys=0.11, real=0.04 secs]
11508/1457: mmap(0xFFFFFFFF54000000, 4194304, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFFFFFFFF54000000
11508/1457: mmap(0xFFFFFFFF7A920000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFFFFFFFF7A920000
11508/1457: mmap(0xFFFFFFFF7AB20000, 65536, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFFFFFFFF7AB20000
11508/29:       Incurred fault #6, FLTBOUNDS  %pc = 0xFFFFFFFF76023BAC
11508/29:         siginfo: SIGSEGV SEGV_MAPERR addr=0xFFFFFFFF7AB2A13D
==============================================================================
Unexpected Error
------------------------------------------------------------------------------
SIGSEGV (0xb) at pc=0xffffffff76023bac, pid=11508, tid=29

Do you want to debug the problem?

To debug, run 'dbx - 11508'; then switch to thread 29
Enter 'yes' to launch dbx automatically (PATH must include dbx)
Otherwise, press RETURN to abort...
==============================================================================

The full output is attached as truss-mmap.out.

The mmap of 64K bytes at 0xFFFFFFFF7AB20000 succeeds, so 0xFFFFFFFF7AB20000 - 0xFFFFFFFF7AB30000 should be mapped with read, write, and execute permissions.  This is followed by a SEGV accessing addr 0xFFFFFFFF7AB2A13D, which is covered by the preceding mmap.

Three of the four mmap calls shown correspond to growing the vm data structures during a GC:  the first is the young gen, the second is the perm gen, the third is unknown, and the fourth is the card table covering the newly-added part of the perm gen.  The mmap and SEGV occur on different LWPs; the mmap is normally done by the VM thread and the SEGV generated by an application thread (could not use pstack or dbx on the process to verify this because it was run under truss).  In previous instances, the SEGV occurred when dirtying a card.

If the ordering of the truss output for the mmap and SEGV can be relied upon, it looks like a kernel bug.
                                     
2007-12-01
EVALUATION

The strange thing is that
1. the bug is easily reproduced starting with jdk7b22
2. it is not reproduced with jdk7b21
  (more than 1000 iterations passed without problems)
3. it is reproduced only on T1000 machines
   (note: the latest Solaris patches have been installed)
                                     
2007-12-03
EVALUATION

Running truss with timestamps shows the mmap and the SEGV occurring simultaneously (at least to the precision reported by truss).  The problem is due to the use of large pages for the card table.  The card table assumes that the generations in the heap are multiples of the region covered by a page of the card table, an assumption broken by the changes for bug 6588638 ("improve support for large pages").

This causes the card table to map/unmap memory for cards across generation boundaries.  In this particular case, the card table for the last part of the perm gen is being mapped.  The mapped page extends onto the cards that cover the first part of the old gen.  Because the perm gen can be expanded without taking a safepoint, the mmap call occurs while the mutators are still running.  Another thread is writing a card mark for an object in the old gen and is touching an address in the range covered by the mmap, during the mmap.  It seems as if the mmap first tears down any existing mapping before installing new ones, and during this window any thread that touches the memory will get a SEGV.
                                     
2007-12-03
WORK AROUND

Reliable workarounds are

(1) fix the size of the heap (including the perm gen) to prevent growing/shrinking:

    java -Xms<heap_size> -Xmx<heap_size> \
        -XX:PermSize=<perm_size> -XX:MaxPermSize=<perm_size> ...

(2) disable the use of large pages:

    java -XX:-UseLargePages ...
                                     
2007-12-03
SUGGESTED FIX

Use the default page size for the card table.

--- old/src/share/vm/memory/cardTableModRefBS.cpp	Tue Dec  4 22:15:14 2007
+++ new/src/share/vm/memory/cardTableModRefBS.cpp	Tue Dec  4 22:15:14 2007
@@ -51,7 +51,7 @@
   _whole_heap(whole_heap),
   _guard_index(cards_required(whole_heap.word_size()) - 1),
   _last_valid_index(_guard_index - 1),
-  _page_size(os::page_size_for_region(_guard_index + 1, _guard_index + 1, 1)),
+  _page_size(os::vm_page_size()),
   _byte_map_size(compute_byte_map_size())
 {
   _kind = BarrierSet::CardTableModRef;
                                     
2007-12-04
EVALUATION

A vm built with the suggested fix has passed 1000+ iterations of the test.  Prior to the fix it would crash within ~70 iterations.
                                     
2007-12-05


