JDK-6635560 : segv in reference processor on t1000
A vm built with the suggested fix has passed 1000+ iterations of the test. Prior to the fix it would crash within ~70 iterations.
Use the default page size for the card table.
--- old/src/share/vm/memory/cardTableModRefBS.cpp Tue Dec 4 22:15:14 2007
+++ new/src/share/vm/memory/cardTableModRefBS.cpp Tue Dec 4 22:15:14 2007
@@ -51,7 +51,7 @@
_guard_index(cards_required(whole_heap.word_size()) - 1),
_last_valid_index(_guard_index - 1),
- _page_size(os::page_size_for_region(_guard_index + 1, _guard_index + 1, 1)),
_kind = BarrierSet::CardTableModRef;
Reliable workarounds are:
(1) fix the size of the heap (including the perm gen) to prevent growing/shrinking:
java -Xms<heap_size> -Xmx<heap_size> \
-XX:PermSize=<perm_size> -XX:MaxPermSize=<perm_size> ...
(2) disable the use of large pages:
java -XX:-UseLargePages ...
Running truss with timestamps shows the mmap and the SEGV occurring simultaneously (at least to the precision reported by truss). The problem is due to the use of large pages for the card table. The card table assumes that the generations in the heap are multiples of the region covered by a page of the card table, an assumption broken by the changes for bug 6588638: improve support for large pages.
This causes the card table to map/unmap memory for cards across generation boundaries. In this particular case, the card table for the last part of the perm gen is being mapped. The mapped page extends onto the cards that cover the first part of the old gen. Because the perm gen can be expanded without taking a safepoint, the mmap call occurs while the mutators are still running. Another thread is writing a card mark for an object in the old gen and is touching an address in the range covered by the mmap, during the mmap. It seems as if the mmap first tears down any existing mapping before installing new ones, and during this window any thread that touches the memory will get a SEGV.
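The alignment assumption described above can be illustrated with a small sketch. It is not code from the VM: the function names are hypothetical, and it assumes that each card-table byte covers 512 bytes of heap, that the card table starts on a card-page boundary, and that card 0 corresponds to heap offset 0.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: each card-table byte covers 512 bytes of heap (a card shift of 9).
constexpr std::uint64_t kCardShift = 9;

// Heap bytes whose cards live in a single card-table page of the given size.
constexpr std::uint64_t heap_bytes_per_card_page(std::uint64_t card_page_size) {
    return card_page_size << kCardShift;
}

// Returns true when a generation boundary, given as a byte offset from the
// start of the heap, falls in the middle of a card-table page. In that case,
// mapping the cards on one side of the boundary also remaps cards that belong
// to the generation on the other side.
// Assumption: the card table starts on a card-page boundary and card 0
// corresponds to heap offset 0.
bool boundary_splits_card_page(std::uint64_t boundary_offset,
                               std::uint64_t card_page_size) {
    assert(card_page_size != 0);
    return boundary_offset % heap_bytes_per_card_page(card_page_size) != 0;
}
```

With these illustrative values, a generation boundary at a 4 MB heap offset is aligned when the card table uses 8 KB pages (one page covers 4 MB of heap), but falls inside a card-table page when the card table uses 64 KB pages (one page covers 32 MB of heap).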
The strange thing is that the bug
1. is easily reproduced starting from jdk7b22
2. is not reproduced with jdk7b21
(more than 1000 iterations passed without problems)
3. is reproduced only on T1000 machines
(note, latest Solaris patches have been installed)
The libloadClass.so mappings reported by pmap appear to be a limitation (or bug) in pmap; they can be ignored.
I ran the test case with
truss -f -s\!all -tmmap,munmap java ...
and saw the following output just before the crash:
11508/27: mmap(0xFFFFFFFF6D000000, 4194304, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFFFFFFFF6D000000
[PSYoungGen: 22688K->1088K(35520K)] 23816K->2496K(60096K), 0.0355506 secs] [Times: user=0.19 sys=0.11, real=0.04 secs]
11508/1457: mmap(0xFFFFFFFF54000000, 4194304, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFFFFFFFF54000000
11508/1457: mmap(0xFFFFFFFF7A920000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFFFFFFFF7A920000
11508/1457: mmap(0xFFFFFFFF7AB20000, 65536, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFFFFFFFF7AB20000
11508/29: Incurred fault #6, FLTBOUNDS %pc = 0xFFFFFFFF76023BAC
11508/29: siginfo: SIGSEGV SEGV_MAPERR addr=0xFFFFFFFF7AB2A13D
SIGSEGV (0xb) at pc=0xffffffff76023bac, pid=11508, tid=29
Do you want to debug the problem?
To debug, run 'dbx - 11508'; then switch to thread 29
Enter 'yes' to launch dbx automatically (PATH must include dbx)
Otherwise, press RETURN to abort...
The full output is attached as truss-mmap.out.
The mmap of 64K bytes at 0xFFFFFFFF7AB20000 succeeds, so 0xFFFFFFFF7AB20000-0xFFFFFFFF7AB30000 should be mapped with read, write and exec permissions. This is followed by a SEGV accessing addr 0xFFFFFFFF7AB2A13D, which is covered by the preceding mmap.
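The claim that the faulting address is covered by the preceding mmap can be checked with simple range arithmetic. The sketch below uses hypothetical helper names and only the addresses and length shown in the truss output.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when addr lies inside the half-open range [base, base + length).
bool mapping_covers(std::uint64_t base, std::uint64_t length, std::uint64_t addr) {
    return addr >= base && addr - base < length;
}

// Byte offset of addr from the start of the mapping; addr must be covered.
std::uint64_t offset_in_mapping(std::uint64_t base, std::uint64_t length,
                                std::uint64_t addr) {
    assert(mapping_covers(base, length, addr));
    return addr - base;
}
```

For the values in the truss output, mapping_covers(0xFFFFFFFF7AB20000, 65536, 0xFFFFFFFF7AB2A13D) holds, and the fault address is 0xA13D bytes into the 64K mapping.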
Three of the four mmap calls shown correspond to growing the vm data structures during a GC: the first is the young gen, the second is the perm gen, the third is unknown and the fourth is the card table covering the newly-added part of the perm gen. The mmap and SEGV occur on different LWPs; the mmap is normally done by the VM thread and the SEGV generated by an application thread (could not use pstack or dbx on the process to verify this because it was run under truss). In previous instances, the SEGV occurred when dirtying a card.
If the ordering of the truss output for the mmap and SEGV can be relied upon, this looks like a kernel bug.
The crash is reproduced the same way using both local jdk and local vm testbase:
vm testbase: /export/local/common/testbase/6/vm/bin
To reproduce do:
1. ssh vm-t1000-03 -l gtee
2. cd /net/sqenfs-1.sfbay/export1/comp/vm/execution/results/JDK_PERFORMANCE/PROMOTION/VM/6u4p/b20/2007-11-24_1/vm/64BITSOLSPARC5.10/server/mixed/vm-64BITSOLSPARC5.10_server_mixed_vm.parallel_class_loading.testlist2007-11-24-16-31-30/ResultDir/inner-simple_vm-t1000-03_katya
3. /net/sqenfs-1.sfbay/export1/comp/vm/bin/reproduce_bug.sh rerun_jdk.sh.local
hs_err_pid6008.log, core_pid6008 - fastdebug bits
Something very strange is going on in the process when it crashes. A native library from the testbase, libloadClass.so, is mapped over a very large region of memory, including right over the top of many of the VM data structures (java heap, card table, etc.). The file is ~136K in size, but the mappings attributed to it are more than 110 MB.
See attached hs_err file (contains heap addresses) as well as the similarly-named pmap file taken from the core.
libloadClass.so is being read over nfs; there are numerous error messages in the /var/adm/messages file indicating that the nfs server where it lives stops responding and then comes back. None of the message timestamps exactly correspond to the crashes I've seen, but they are reasonably close (15-30min). Cannot put the blame outside the JVM yet, but the mappings in the process are bizarre enough that nfs problems should be ruled out. Need to get all the test binaries and data moved local to one machine and then see if the problem is still reproducible.
Increase the size of the perm gen, e.g., use -XX:MaxPermSize=128m.
Whether this works because it makes the perm gen bigger or whether
it works because it changes the page size that maps the perm gen