Bug ID: JDK-8144483 One long Safepoint pause directly after each GC log rotation

Type: Bug
Component: hotspot
Sub-Component: runtime
Affected Version: 7u80,8u20

Priority: P3
Status: Resolved
Resolution: Fixed
OS: solaris_10
CPU: sparc

Submitted: 2015-11-11
Updated: 2023-09-04
Resolved: 2016-01-08

JDK 7	JDK 8
7u101Fixed	8u101Fixed

FULL PRODUCT VERSION :
java version "1.8.0_51"
Java(TM) SE Runtime Environment (build 1.8.0_51-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.51-b03, mixed mode)

Applies to 1.7.0_80 and above, 1.8.0_20 and above, 1.9.0.

FULL OS VERSION :
SunOS 5.10 Generic_142900-07 sun4u sparc SUNW,Sun-Fire-V240

Due to analysis bug applies to any Solaris version.

A DESCRIPTION OF THE PROBLEM :
There's a bug in the Solaris specific HotSpot code which exhibits itself as extremely long pauses directly after GC log rotation.

The bug can be seen in 1.7.0 patchlevel 80 and above, 1.8.0 patchlevel 20 and above and 1.9.0. It was noticed using -XX:PrintGCApplicationStoppedTime.

During GC log rotation the function os::print_memory_info() in src/os/solaris/vm/os_solaris.cpp is called, which itself calls check_addr0().

Now check_addr0() reads the file /proc/self/pmap, but does this entry by entry. Productive Java processes often have 1000 and more map entries (e.g. each thread stack is one entry, often each jar etc.

Reading one entry takes about 35-40ms in a zone on a T4 system. Reading e.g. 1000 entries takes 35-40 seconds during which the JVM is stopped. We experienced even stop times up until 2400 seconds in extreme cases.

Note that this doesn't have to do with GC. It only happens directly after GC log file rotation.

The bug is reading the file entry by entry instead of using pread() to read it in one go. Look for instance at the source code of the Solaris command "pmap" which uses pread() to read the file at once.

It might also be questioned, whether this map file check must happen during a safepoint or could also be done outside, only synchronizing in case a hit was found and a warning actually needs to be written to the log file.

The bug was shadowed by another bug until and including 1.7.0_76 and 1.8.0_11: the close() that closes the map file after reading the entries was erroneously put inside the read loop instead of after the read loop. Thus these versions only read the first entry, then closed the file and erred out when trying to read the next entry. So reading was quick.

The change that removed the shadowing bug was:

Java 8: http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/diff/7d28f4e15b61/src/os/solaris/vm/os_solaris.cpp

Java 7: http://hg.openjdk.java.net/jdk7u/jdk7u/hotspot/diff/e50eb3195734/src/os/solaris/vm/os_solaris.cpp

Example code for reading using pread() (going back to OpenSolaris):

https://github.com/joyent/illumos-joyent/blob/master/usr/src/cmd/ptools/pmap/pmap.c

Look e.g. at function xmapping_iter() and there look for pread().

See also the discussion thread starting at http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2015-November/002351.html.

The analysis was done using source code review and realtime checks with pstack and truss. It would be nice if you could instruct me how to get this fixed. It is a bit specific because it only happens on Solaris and when GC log rotation is active, but Solaris should be an important platform for Oracle and GC log rotation will be used more and more.

I will also attach a simple reproduction code as a tarball, that contains a README and a copy of the bug description (if I find out how to attach).


THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Yes

REGRESSION.  Last worked in version 7u76

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Start any Java process, that creates many entries in /proc/self/map and uses GC log rotation. Add -XX:+PrintGCStoppedTimeto track safepoint pauses. Trigger enough activity such that the GC log rotates. Watch a long safepoint pause directly after log rotation.

To create a Java process with many entries in /proc/self/map, it can e.g. have many threads and/or open many Jar files with a classloader (both common scenarios on the server).

The problem should be easily noticable with about 500-1000 entries in /proc/self/map.

I will attach a self-contained reproduction that implements this recipe (if attachments are allowed).

EXPECTED VERSUS ACTUAL BEHAVIOR :
Expected: Short safepoint pauses as long as no GC is involved.

Actual: One long safepoint pause directly after each GC log rotation. The reason is *not* the GC duration. Pause gets longer as more entries in /proc/self/map exist.
ERROR MESSAGES/STACK TRACES THAT OCCUR :
Example:

2015-11-10T22:32:43.347+0100: 27.582: Total time for which application threads were stopped: 24.1739366 seconds, Stopping threads took: 0.0002938 seconds

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
import java.io.File;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.jar.JarFile;

public class GcRotationPause {

private static final int allocCount = 100000;

/* Create some threads to generate thread stack maps.
 * Threads will be started and sleep for the given time.
 * No CPU activity is needed.
 */
public static void startThreads(int num, final long duration) {
    int i;
    for (i = 0; i < num; i++) {
        new Thread(new Runnable() {
            public void run() {
                try {
                    Thread.currentThread().sleep(duration);
                } catch (InterruptedException ex) {
                    System.err.println(ex);
                }
            }
        }).start();
    }
}

/* Create class loaders that search a jar file
 * and look up a non-existing resource, so that
 * the jar file gets mapped into memory
 */
public static URLClassLoader[] openJars(int num, String jarName) {
    int i;
    URLClassLoader[] result = new URLClassLoader[num];
    for (i = 0; i < num; i++) {
        try {
            String file = i + "-" + jarName;
            result[i] = new URLClassLoader(new URL[]{(new File(file)).toURI().toURL()});
            result[i].findResource("does.not.exist");
        } catch (MalformedURLException ex) {
            System.err.println(ex);
        }
    }
    return result;
}

/* Loop for some time and do some allocation
 * to generate GC log activity and rotation.
 * The hashCode() and println() is just some
 * fuzzy code to prevent removal of the code
 * by the JIT
 */
public static void loopAllocation(long duration) {
    long until = System.currentTimeMillis() + duration;
    long now;
    int i;
    while ((now = System.currentTimeMillis()) < until) {
        for (i = 0; i < allocCount; i++) {
            Object o = new Object();
            if (o.hashCode() == 0) {
                System.out.println("Prevent JIT code removal");
            }
        }
    }
}

public static void main(String[] args) {
    int threadCount = Integer.parseInt(args[0]);
    long duration = Long.parseLong(args[1]);
    int jarCount = Integer.parseInt(args[2]);
    String jarName = args[3];
    System.out.println("Starting " + threadCount + " threads, each keeping them for " + duration + " ms.");
    startThreads(threadCount, duration);
    System.out.println("Opening " + jarCount + " jars based on " + jarName + ".");
    URLClassLoader[] cls = openJars(jarCount, jarName);
    loopAllocation(duration);
    System.out.println("Done.");
}
}

and use the following script to start it:

#!/bin/sh

# CONFIG STARTS HERE

# Choose your path to Java
JAVA_HOME=/usr/local/jdk1.7.0

# Number of threads to craete
THREADS=400

# Number of Jar files to map
JARS=400

# Time in milliseconds that the reproducer tuns
DURATION=600000

# JVM start options (sizing, GC log and log rotation)
FLAGS="
-server
-d64
-Xmx64m -Xms64m
-XX:NewSize=2m
-XX:MaxNewSize=2m
-XX:SurvivorRatio=6
-Xloggc:gc.log
-XX:+UseGCLogFileRotation
-XX:GCLogFileSize=51200
-XX:NumberOfGCLogFiles=100
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
"

# CONFIG ENDS HERE

# Our test class name - do not change
CLASS=GcRotationPause

PATH=$JAVA_HOME/bin:$PATH
export PATH

echo "Cleaning up old class and jar files..."
rm -rf work

echo "Creating work directory..."
mkdir -p work
cd work

echo "Compiling $CLASS.java..."
javac -d . ../$CLASS.java

echo "Creating $CLASS.jar"
jar cf $CLASS.jar ${CLASS}*.class

echo "Cloning jar file $JARS times"
n=$JARS
while [ $n -gt 0 ]
do
    cp $CLASS.jar ${n}-$CLASS.jar
    n=`expr $n - 1`
done

echo "Running $CLASS in background."
echo "Watch the GC log files in the work directory and attach with truss to follow reading /proc/self/map."
java $FLAGS $CLASS $THREADS $DURATION $JARS $CLASS.jar &

---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Do not use GC log rotation.
Do not use Solaris.
Both not really helpful. No other workaround.

Test Result: ########## OS: Oracle Solaris 10 u11ga (Sun Sparc) JDK: **** 7u76 b38 : Pass 7u80 b15 : Fail 8u0 b132 : Pass 8u20 b31 : Fail 8u51 b16 : Fail 8u72ea b12 : Fail 9-ea b94 : Pass ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 7u76,8 ****** gc.log.1 ####### Total time for which application threads were stopped: 0.0088600 seconds gc.log.99 ######## Total time for which application threads were stopped: 0.0083722 seconds ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 7u80,8u20,8u51,8u72ea: **************************** gc.log.1 ####### 2015-12-14T10:04:11.934+0000: 14.072: Total time for which application threads were stopped: 12.1704150 seconds, Stopping threads took: 0.0000977 seconds gc.log.45 ######## 2015-12-14T10:13:49.975+0000: 592.089: Total time for which application threads were stopped: 11.6838587 seconds, Stopping threads took: 0.0002991 seconds ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 9ea **** gc.log.1 ####### Total time for which application threads were stopped: 0.0081028 seconds gc.log.99 ######## Total time for which application threads were stopped: 0.0095763 seconds
27-09-2016
According to submitter, Please find the tarball attached. It contains the problem description and a separate README for how to use it. Remember that the problem only happens on Solaris, so you need to try reproduction on that platform (more precisely I only tried Solaris Sparc).
26-09-2016
Requested the bug submitter to provide us the following information for our quick analysis. - The tarball which is mentioned in the reported file - The problem description - README file (reproducible steps)
26-09-2016
URL: http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/rev/b40d68340feb User: robm Date: 2016-01-15 15:34:02 +0000
15-01-2016
URL: http://hg.openjdk.java.net/jdk8u/jdk8u-dev/hotspot/rev/b40d68340feb User: kevinw Date: 2016-01-08 16:21:45 +0000
08-01-2016
JDK9 uses unified logging mechanism as default. Using unified logging this problem is not reproducible , therefore I have added 9-na label to it.
07-01-2016
Also see e-mail thread at http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2015-November/015369.html
18-11-2015