JDK-8196485 : FromCardCache default card index can cause crashes
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 8,9,10
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: x86_64
  • Submitted: 2018-01-30
  • Updated: 2018-08-16
  • Resolved: 2018-03-26
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 11
11 b10Fixed
Related Reports
Relates :  
Relates :  
Relates :  
Description
FULL PRODUCT VERSION :
JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode linux-amd64 )

FULL OS VERSION :
Linux foo 3.10.0-327.28.3.el7.x86_64 #1 SMP Fri Aug 12 13:21:05 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux


A DESCRIPTION OF THE PROBLEM :
I am creating a new bug report as I cannot update the old one (https://bugs.java.com/view_bug.do?bug_id=8172756) with new information.

The current theory is that the crashes stem from card cache returning invalid results because of the overflow here: http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/69aec2ca5d90/src/share/vm/gc_implementation/g1/heapRegionRemSet.cpp#l436

This can at least give from_card of -1 which is also used to mean that cache is empty (bonus: t also seems heaps of size > 2^40 should overlap in that cache.)

In the case of overflow to -1, we can then end up with a situation where some region end up being reclaimed even when there are live pointers to objects in it. When those pointers are then later on followed in a marking phase, the pointers points into a (middle of) some new object and calling oopDesc::size() on that "object" will follow some trash pointers and cause crash.


THE PROBLEM WAS REPRODUCIBLE WITH -Xint FLAG: Did not try

THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Yes

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
We've been unable to reproduce the crash with a simple program, but we're able to trigger the assert

    assert(contains_reference(from), err_msg("We just found " PTR_FORMAT " in the FromCardCache", from));

with this small program, fastdebug JVM and specific command line parameters (listed in the code comments) https://gist.github.com/jmiettinen/62506fd75855edb346d88cc95697bf1e

Currently, we're able to reproduce the actual failurearound once a week by running production software with a forced load pattern. We've run a patched OpenJDK JVM now without failures for a few week. The patch we tried was changing values in the card table from int to long to avoid overflow (https://gist.github.com/jmiettinen/1e1e8a0c71c2d1ae664335f1e875f227).

ERROR MESSAGES/STACK TRACES THAT OCCUR :
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f24c84669f4, pid=289114, tid=0x00007c1152831700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode linux-amd64 )
# Problematic frame:
# V  [libjvm.so+0x4cb9f4]  oopDesc::size()+0x44
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

---------------  T H R E A D  ---------------

Current thread (0x00007f24c0420800):  ConcurrentGCThread [stack: 0x00007c1152731000,0x00007c1152832000] [id=289624]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x00000c9a00000dd4

Registers:
RAX=0x00007f24c8f55cf9, RBX=0x00007f24c0477dd0, RCX=0x0000000000000001, RDX=0x0000000000000000
RSP=0x00007c1152830950, RBP=0x00007c1152830950, RSI=0x0000000000000000, RDI=0x00000c9a00000dcc
R8 =0x00007cf618c10000, R9 =0x0000000000000001, R10=0x00007d3f80262f08, R11=0x00007f24c0474850
R12=0x00007cf618c10000, R13=0x00007be38f10eea0, R14=0x00007f24c03ff070, R15=0x00007f24c0474850
RIP=0x00007f24c84669f4, EFLAGS=0x0000000000010246, CSGSFS=0x0000000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00007c1152830950)
0x00007c1152830950:   00007c11528309b0 00007f24c85f4360
0x00007c1152830960:   00007c11528309b0 00007f23fc005010
0x00007c1152830970:   00007d3f80262e98 000000000000000e
0x00007c1152830980:   00007e31e25a9bf0 00007e31e25bc010
0x00007c1152830990:   00007e31e27b3a18 00007c1152830aa0
0x00007c11528309a0:   0000000000000001 00007f24c8f55cf9
0x00007c11528309b0:   00007c11528309f0 00007f24c88a6ee8
0x00007c11528309c0:   0000000300040001 00007e31e25b39f8
0x00007c11528309d0:   00007f24c0477dd0 00007c1152830af0
0x00007c11528309e0:   0000000101840000 0000004061000000
0x00007c11528309f0:   00007c1152830a20 00007f24c8469d34
0x00007c1152830a00:   000000000023acb3 0000004060cb673f
0x00007c1152830a10:   00007f24c03ff0f8 00007f24c0477dd0
0x00007c1152830a20:   00007c1152830b60 00007f24c846341e
0x00007c1152830a30:   3fe0000000000000 0000000000000000
0x00007c1152830a40:   00007f24c9777c53 00007be3b0287ea0
0x00007c1152830a50:   00007f24c0477ee0 00007c1152830af0
0x00007c1152830a60:   0100000000000000 4024000000000000
0x00007c1152830a70:   00007f24c0420801 000000000023fd02
0x00007c1152830a80:   3fefbb9dc59efb46 3fefbb9dc59efb46
0x00007c1152830a90:   00007f24c03ff110 00007e31e4000000
0x00007c1152830aa0:   00007f24c8ef6ad0 000000000100e400
0x00007c1152830ab0:   00007be32f3b4dd0 00007f24c8f006a0
0x00007c1152830ac0:   0000000000000b00 00007c1152830aa0
0x00007c1152830ad0:   00007f24c03ff070 00007f24c002a990
0x00007c1152830ae0:   00007f24c0477dd0 0000000000000000
0x00007c1152830af0:   00007f24c8ef6a50 00007f24c03ff0f8
0x00007c1152830b00:   00007f24c03ff070 00007f24c0477dd0
0x00007c1152830b10:   00007f24c8ef6ab0 00007f24c0477dd0
0x00007c1152830b20:   00007f24c002a990 0000000000000000
0x00007c1152830b30:   0000000000000000 0000000000000000
0x00007c1152830b40:   00007f24c0477dd0 00007f1db84d6a60 

Instructions: (pc=0x00007f24c84669f4)
0x00007f24c84669d4:   d1 48 d3 e7 48 03 38 8b 4f 08 83 f9 00 7e 19 f6
0x00007f24c84669e4:   c1 01 75 18 89 c8 c9 c1 f8 03 c3 90 48 8b 7f 08
0x00007f24c84669f4:   8b 4f 08 83 f9 00 7f e7 75 1a 66 90 48 8b 07 4c
0x00007f24c8466a04:   89 c6 4c 8b 98 e8 00 00 00 c9 41 ff e3 0f 1f 80 

Register to memory mapping:

RAX=0x00007f24c8f55cf9: <offset 0xfbacf9> in /usr/lib/jdk1.8.0_144/jre/lib/amd64/server/libjvm.so at 0x00007f24c7f9b000
RBX=0x00007f24c0477dd0 is an unknown value
RCX=0x0000000000000001 is an unknown value
RDX=0x0000000000000000 is an unknown value
RSP=0x00007c1152830950 is an unknown value
RBP=0x00007c1152830950 is an unknown value
RSI=0x0000000000000000 is an unknown value
RDI=0x00000c9a00000dcc is an unknown value
R8 =0x00007cf618c10000 is pointing into object: 0x00007cf618800018
[J 
 - klass: {type array long}
 - length: 1048576
R9 =0x0000000000000001 is an unknown value
R10=0x00007d3f80262f08 is an oop
[J 
 - klass: {type array long}
 - length: 12
R11=0x00007f24c0474850 is an unknown value
R12=0x00007cf618c10000 is pointing into object: 0x00007cf618800018
[J 
 - klass: {type array long}
 - length: 1048576
R13=0x00007be38f10eea0 is an unknown value
R14=0x00007f24c03ff070 is an unknown value
R15=0x00007f24c0474850 is an unknown value


Stack: [0x00007c1152731000,0x00007c1152832000],  sp=0x00007c1152830950,  free space=1022k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x4cb9f4]  oopDesc::size()+0x44
V  [libjvm.so+0x659360]  CMTask::deal_with_reference(oopDesc*)+0x180
V  [libjvm.so+0x90bee8]  ObjArrayKlass::oop_oop_iterate_nv(oopDesc*, G1CMOopClosure*)+0x148
V  [libjvm.so+0x4ced34]  CMBitMapClosure::do_bit(unsigned long)+0x74
V  [libjvm.so+0x4c841e]  CMTask::do_marking_step(double, bool, bool)+0xa9e
V  [libjvm.so+0x4cf673]  CMConcurrentMarkingTask::work(unsigned int)+0x293
V  [libjvm.so+0xaeccff]  GangWorker::loop()+0xcf
V  [libjvm.so+0x92a338]  java_start(Thread*)+0x108


<SAME AS IN THE LINKED TICKET>

REPRODUCIBILITY :
This bug can be reproduced occasionally.

---------- BEGIN SOURCE ----------
// Running the test:
// javac FromCardCacheTest.java
// /opt/jdk1.8.0_144_fastdebug/bin/java -classpath . -Xms20M -Xmx20M -XX:HeapBaseMinAddress=2199011721216 -XX:+UseG1GC -verbose:gc FromCardCacheTest

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.reflect.Field;

import sun.misc.Unsafe;

public class FromCardCacheTest {

    private static int nArrays = 24;
    private static int arraySize = 131068; // object array, a bit less than half of a 1M region
    private static int byteArraySize = 524270;  // byte array, a bit less than half of a 1M region
    private static final long heapBase = 2199011721216L;  // less than 2TB

    public static void main(String[] args) {
        while (runTest()) {
            System.gc();
            try {
                Thread.sleep(2000);
            } catch(Exception e) {}
        }
    }

    public static boolean runTest() {
        GarbageCollectorMXBean oldGc = null;
        GarbageCollectorMXBean youngGc = null;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (gc.getName().equals("G1 Old Generation")) oldGc = gc;
            if (gc.getName().equals("G1 Young Generation")) youngGc = gc;
        }
        if (oldGc == null) {
            System.out.println("G1 oldGc = null");
            return false;
        }
        if (youngGc == null) {
            System.out.println("G1 youngGc = null");
            return false;
        }
        long oldGcCount = oldGc.getCollectionCount();

        System.out.println("Running test");
        Object[][] arrays = new Object[nArrays][];
        for (int i = 0; i < nArrays; i++) {
            arrays[i] = new Object[arraySize];
        }

        Object[][] arrayWithCardMinus1 = new Object[1][];
        arrayWithCardMinus1[0] = findArray(arrays, true, true);
        Object[] otherArray = findArray(arrays, false, false);

        if (arrayWithCardMinus1[0] == null) {
            System.out.println("Array with card -1 not found. Trying again.");
            return true;
        } else {
            System.out.println("Found an array with card -1.");
        }

        if (otherArray == null) {
            System.out.println("otherArray = null.");
            return true;
        }

        System.out.println("Modifying the last card in the array...");
        final int modifyEntries = 10;

        final long youngGcCount = youngGc.getCollectionCount();
        for (int i = 0; i < modifyEntries; i++) {
           if (oldGc.getCollectionCount() > oldGcCount) {
               System.out.println("Old gc done. Trying again.");
               return true;
           }

           final int arrayIndex = arraySize - i - 1;
           arrayWithCardMinus1[0][arrayIndex] = new Object[]{new byte[byteArraySize]};

           final int card = getCard32Bit(getAddress(arrayWithCardMinus1[0]) + 4 * arrayIndex);
           System.out.println(arrayIndex + " " + String.format("0x%016x", getAddress(arrayWithCardMinus1[0]) + 16 + 4 * arrayIndex) + " set to " +
             String.format("0x%016x", getAddress(arrayWithCardMinus1[0][arrayIndex])) + ", card: " + card);

           if (card != -1) {
               arrayWithCardMinus1[0] = findArray(arrays, true, true);
               if (arrayWithCardMinus1[0] == null) {
                   System.out.println("Array with card -1 not found. Trying again.");
                   return true;
               }
           }

           try {
                Thread.sleep(100);
           } catch(Exception e) {}
        }


        System.out.println("Producting garbage to run gc.");

        final long t1 = System.currentTimeMillis();
        oldGcCount = oldGc.getCollectionCount();
        while (System.currentTimeMillis() - t1 < 60000) {

           if (oldGc.getCollectionCount() > oldGcCount) {
               System.out.println("Old gc done. Trying again.");
               return true;
           }

           otherArray[0] = new byte[30000];
           otherArray[0] = null;

           try {
                Thread.sleep(100);
           } catch(Exception e) {}
        }

        System.out.println("The crash didn't reproduce. Trying again.");
        return true;
    }

    private static Object[] findArray(Object[][] arrays, boolean withCardMinus1, boolean printAddresses) {
        for (int i = 0; i < arrays.length; i++) {
            if (arrays[i] == null) continue;
            final long startAddress = getAddress(arrays[i]);
            final long lastCardAddress = startAddress + 4 * arrays[i].length;
            final int card = getCard32Bit(lastCardAddress);
            if (printAddresses) {
                System.out.println("array " + i + ", startAddress: " + String.format("0x%016x", startAddress) +
                    ", lastCardAddress: " + String.format("0x%016x", lastCardAddress) + ", card: " + card);
            }

            if ((withCardMinus1 && card == -1) || (!withCardMinus1 && card != -1)) {
                Object[] foundArray = arrays[i];
                arrays[i] = null;
                return foundArray;
            }
        }
        return null;
    }


    // Unsafe is only used for reading object addresses.
    private static final Unsafe UNSAFE;
    static {
        try {
            Field e = Unsafe.class.getDeclaredField("theUnsafe");
            e.setAccessible(true);
            UNSAFE = (Unsafe)e.get((Object)null);
        } catch (IllegalAccessException | NoSuchFieldException var1) {
            throw new Error(var1);
        }
    }
    static Object[] addrArray = new Object[]{null};
    static final long baseOffset = UNSAFE.arrayBaseOffset(Object[].class);
    static final int pageSize = UNSAFE.pageSize();
    public static long getAddress(Object o) {
        addrArray[0] = o;
        final long address = (UNSAFE.getInt(addrArray, baseOffset) << 3) + heapBase - pageSize + 1024*1024L;
        addrArray[0] = null;
        return address;
    }

    public static int getCard32Bit(final long address) {
        final int card = (int)(address >> 9);
        return card;
    }
}
---------- END SOURCE ----------


Comments
Out for review: http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2018-March/021379.html . Thanks for reporting.
08-03-2018

The provided patch can't be used as-is, because it uses the type "long" in shared code. Windows 64bit is LLP64, e.g. long is a 32bit type.
02-02-2018

Code remains same for 8, 9 and 10. hence affacted versions are updated. "sun.misc.Unsafe" api's are been used and only for reading and it should not be a concern. As the program takes a long to run and i am waiting for the crash to occur, meanwhile i confirm the observations from submitter is valid == "The current theory is that the crashes stem from card cache returning invalid results because of the overflow here: http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/69aec2ca5d90/src/share/vm/gc_implementation/g1/heapRegionRemSet.cpp#l436 This can at least give from_card of -1 which is also used to mean that cache is empty (bonus: t also seems heaps of size > 2^40 should overlap in that cache.) In the case of overflow to -1, we can then end up with a situation where some region end up being reclaimed even when there are live pointers to objects in it. When those pointers are then later on followed in a marking phase, the pointers points into a (middle of) some new object and calling oopDesc::size() on that "object" will follow some trash pointers and cause crash. " ==
31-01-2018