JDK-8004170 : G1: Verbose GC output is not getting flushed to log file using JDK 8
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 8
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: solaris_11
  • CPU: sparc
  • Submitted: 2012-11-29
  • Updated: 2013-09-18
  • Resolved: 2012-12-04
JDK 7: 7u40 (Fixed)
JDK 8: 8 (Fixed)
Other: hs24 (Fixed)
Description
I am using G1 GC with JDK 8 to do Coherence analysis with very large heaps and am seeing an odd problem. The verbose GC logs contain fewer GC entries than the GCs I can see happening in JVisualVM. When I close the running JVMs, the additional entries get flushed to the log file. This looks like a bug to me, and it hinders analysis since I do not want to have to shut down JVMs just to collect complete GC data.

Here's an example from a recent run, where *no* data was in the log file at the end of the run:

mwm@nit1-ld3:~$ ls -lrt coh1213/logs
total 102
-rw-r--r-- 1 mwm mwm 0 Nov 26 08:58 nit1-ld3.1.gc
-rw-r--r-- 1 mwm mwm 228 Nov 26 09:00 nit1-ld3.1.out
-rw-r--r-- 1 mwm mwm 49880 Nov 26 09:40 nit1-ld3.1.log
mwm@nit1-ld3:~$ jdk/bin/jps
50359 Jstatd
50357 DefaultCacheServer
50715 Jps
mwm@nit1-ld3:~$ pkill -f java
mwm@nit1-ld3:~$ jdk/bin/jps
50717 Jps
mwm@nit1-ld3:~$ ls -lrt coh1213/logs
total 336
-rw-r--r-- 1 mwm mwm 228 Nov 26 09:00 nit1-ld3.1.out
-rw-r--r-- 1 mwm mwm 50018 Nov 26 09:41 nit1-ld3.1.log
-rw-r--r-- 1 mwm mwm 119401 Nov 26 09:41 nit1-ld3.1.gc

All of the log files are in an attachment to the bug, explained at the bottom of the description section. Previous runs had some, but not all, of the output in the GC log file when the measurement phase completed.

Here is how one of the JVMs was started:

/export/home/mwm/jdk/bin/java -server -d64 -Xms192g -Xmx192g -XX:NewSize=10g -XX:MaxGCPauseMillis=800 -XX:InitiatingHeapOccupancyPercent=75
-XX:+AggressiveOpts -XX:+UseCompressedOops -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseLargePages -XX:+AlwaysPreTouch
-XX:-TieredCompilation -XX:-UseBiasedLocking -XX:+UnlockExperimentalVMOptions -XX:G1LogLevel=finest
-Dtangosol.coherence.cacheconfig=examples-cache-config.xml -Dtangosol.coherence.override=tangosol-coherence-override.xml -Dtangosol.coherence.mode=prod
-Dtangosol.coherence.distributed.localstorage=true -Dtangosol.coherence.distributed.transport.reliable=sdmb -Djava.net.preferIPv4Stack=true
-Dtangosol.coherence.grid.preprocess=true -Dtangosol.coherence.distributed.busOptimizedBackup=false -Dtangosol.coherence.distributed.asyncbackup=false
-Dtangosol.coherence.localhost=nit1-ld2-ib -Dtangosol.coherence.log=/export/home/mwm/coh1213/logs/nit1-ld2.1.log -server
-Xloggc:/export/home/mwm/coh1213/logs/nit1-ld2.1.gc -showversion
-cp /export/home/mwm/config:/export/home/mwm/coh1213/lib/coherence.jar:/export/home/mwm/faban/benchmarks/CoherenceBench/lib/CoherenceBench.jar
-Dtangosol.coherence.role=server com.tangosol.net.DefaultCacheServer > /export/home/mwm/coh1213/logs/nit1-ld2.1.out 2>&1

and here are some additional details about the test / measurement scenario:

I am running CoherenceBench using 2 x 192 GB heaps (one each on 2 T4 1
socket LDOMs with 256 GB RAM). I am using JDK8 u0 build 63:

         Java HotSpot(TM) 64-Bit Server VM warning: Max heap size too large for Compressed Oops
         java version "1.8.0-ea"
         Java(TM) SE Runtime Environment (build 1.8.0-ea-b63)
         Java HotSpot(TM) 64-Bit Server VM (build 25.0-b07, mixed mode)

Here are some of the benchmark details:

     CoherenceBench using "complex" objects, ~8KB each in size
     80 / 20 % mix of gets / puts
     2 x 192 GB heaps (1 server on each of 2 LDOMs)
     The cluster is preloaded with 6M entries --> 6M * 8KB == ~48 GB of primary data + ~48 GB of backup data
         * so each heap holds ~48 GB of cache data and is ~25% occupied
     Run duration: 180 seconds "warmup", 300 seconds "steady state", 60 seconds "cooldown"

The preload phase (6M entries for about 50 GB data per JVM) of the workload takes just under 4 minutes, followed by 9 minutes of load being applied (as noted above).

JVisualVM on my laptop is running JDK 8 build 64, so it should be as compatible as possible.

Here are the contents of the zip file attached to this bug. The *.gchisto files are the output of Monica's GC post-processing script. run56E.gcpause.eden.parsed is my post-processing of the "Eden" lines in the GC log, suitable for putting into a spreadsheet to plot heap growth per GC. *.out is the showversion output. The *.log files are the Coherence log output --- useful for correlating times with the PrintGCDateStamps output.

Archive: cb.run55W.logs.zip
   Length       Date   Time    Name
--------- ---------- ----- ----
   116030 11-26-2012 15:42 nit1-ld2.1.gc
      968 11-26-2012 15:43 nit1-ld2.1.gchisto
      973 11-26-2012 15:43 run56E.gcpause.eden.parsed
      650 11-26-2012 16:00 nit1-ld3.screen.shot
   119401 11-26-2012 15:42 nit1-ld3.1.gc
      994 11-26-2012 15:43 nit1-ld3.1.gchisto
      228 11-26-2012 15:18 nit1-ld2.1.out
      228 11-26-2012 15:18 nit1-ld3.1.out
    48919 11-26-2012 16:03 nit1-ld2.1.log
    50018 11-26-2012 16:03 nit1-ld3.1.log
--------- -------
   338409 10 files

Comments
I'm able to verify the problem with the GC output from GCOld being sent to a file:

cairnapple{jcuthber}:263> ls -l
total 69
-rw-rw-r-- 1 jcuthber staff  6890 Apr 23  2010 GCOld.class
-rw------- 1 jcuthber staff 11722 Apr 23  2010 GCOld.java
-rw------- 1 jcuthber staff 11727 Apr 23  2010 GCOld.java.2
-rw-rw-r-- 1 jcuthber staff   249 Apr 23  2010 TreeNode.class
-rw-rw-r-- 1 jcuthber staff     0 Nov 30 11:35 hotspot.log
-rw-rw-r-- 1 jcuthber staff     0 Nov 30 11:35 johnc_gc_01.log
-rwxrwxr-x 1 jcuthber staff   504 Nov 30 11:35 run.sh*
cairnapple{jcuthber}:264> ls -l
total 69
-rw-rw-r-- 1 jcuthber staff  6890 Apr 23  2010 GCOld.class
-rw------- 1 jcuthber staff 11722 Apr 23  2010 GCOld.java
-rw------- 1 jcuthber staff 11727 Apr 23  2010 GCOld.java.2
-rw-rw-r-- 1 jcuthber staff   249 Apr 23  2010 TreeNode.class
-rw-rw-r-- 1 jcuthber staff     0 Nov 30 11:35 hotspot.log
-rw-rw-r-- 1 jcuthber staff     0 Nov 30 11:35 johnc_gc_01.log
-rwxrwxr-x 1 jcuthber staff   504 Nov 30 11:35 run.sh*
cairnapple{jcuthber}:265> date
Fri Nov 30 11:36:20 PST 2012
cairnapple{jcuthber}:266> date
Fri Nov 30 11:36:22 PST 2012
cairnapple{jcuthber}:267> ls -l
total 98
-rw-rw-r-- 1 jcuthber staff  6890 Apr 23  2010 GCOld.class
-rw------- 1 jcuthber staff 11722 Apr 23  2010 GCOld.java
-rw------- 1 jcuthber staff 11727 Apr 23  2010 GCOld.java.2
-rw-rw-r-- 1 jcuthber staff   249 Apr 23  2010 TreeNode.class
-rw-rw-r-- 1 jcuthber staff     0 Nov 30 11:35 hotspot.log
-rw-rw-r-- 1 jcuthber staff 17394 Nov 30 11:37 johnc_gc_01.log
-rwxrwxr-x 1 jcuthber staff   504 Nov 30 11:35 run.sh*

Basically, the log file was created at 11:35 but had not been updated until sometime before 11:37. And according to the output that was flushed, the first GC occurred at around 2s into the run:

2.084: [GC pause (G1 Evacuation Pause) (young), 1.4360847 secs]
   [Parallel Time: 1434.3 ms, GC Workers: 4]
      [GC Worker Start (ms): Min: 2084.9, Avg: 2084.9, Max: 2084.9, Diff: 0.0]
      [Ext Root Scanning (ms): Min: 0.5, Avg: 1.0, Max: 1.4, Diff: 0.8, Sum: 3.8]
      [Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
         [Processed Buffers: Min: 0, Avg: 0.0, Max: 0, Diff: 0, Sum: 0]
      [Scan RS (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 0.3]
      [Object Copy (ms): Min: 1432.6, Avg: 1433.0, Max: 1433.6, Diff: 1.0, Sum: 5732.1]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [GC Worker Other (ms): Min: 0.1, Avg: 0.1, Max: 0.1, Diff: 0.0, Sum: 0.3]
      [GC Worker Total (ms): Min: 1434.1, Avg: 1434.1, Max: 1434.2, Diff: 0.1, Sum: 5736.5]
      [GC Worker End (ms): Min: 3519.0, Avg: 3519.0, Max: 3519.1, Diff: 0.1]
   [Code Root Fixup: 0.0 ms]
   [Clear CT: 0.2 ms]
   [Other: 1.6 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.4 ms]
      [Ref Enq: 0.0 ms]
      [Free CSet: 0.4 ms]
   [Eden: 102.0M(102.0M)->0.0B(89.0M) Survivors: 0.0B->13.0M Heap: 102.0M(512.0M)->102.0M(512.0M)]
 [Times: user=5.47 sys=0.21, real=1.44 secs]
4.766: [GC pause (G1 Evacuation Pause) (young), 1.4125973 secs]

The proposed fix seems to work.
30-11-2012

The flow of control is:

{
   ...
   TraceCPUTime tcpu(...);
   ...
   log_gc_header();
   ...
   {
      // Internals of GC
      log_gc_footer(..)
   }
}

I believe there are a few problems here...

1. The call to log_gc_footer should be outside the inner scope.
2. log_gc_footer should flush the log file.
3. The destructor for TraceCPUTime (if newlines are enabled) should flush the stream (possibly only if newlines are to be printed).

Why haven't we seen this? Normally when we run a test we are sending the GC output to stdout (or stderr), and there are enough other things writing to those streams to flush the GC output. When we are writing to a user-specified log file *that is separate from the other streams*, we are responsible for flushing, and we don't seem to be doing that.
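[Editorial note] The buffering behaviour described here is easy to reproduce outside HotSpot. Below is a minimal, self-contained C++ sketch (not HotSpot source; the file name and record text are made up) showing that records written through a buffered stream do not appear in the log file until the stream is flushed or closed, and that an explicit flush after each record, as suggested in point 2 above, makes them visible immediately:

// Standalone demonstration (not HotSpot code): buffered log output is not
// visible in the file until the stream is flushed or the process exits.
#include <cstdio>
#include <unistd.h>   // sleep()

int main() {
  std::FILE* gclog = std::fopen("demo_gc.log", "w");   // stand-in for -Xloggc:<file>
  if (gclog == NULL) return 1;

  for (int i = 0; i < 10; i++) {
    // Stand-in for one GC record.
    std::fprintf(gclog, "%d.000: [GC pause (young), 0.0100000 secs]\n", i);

    // Without this flush, "ls -l demo_gc.log" in another shell shows 0 bytes
    // until the process terminates; with it, each record appears immediately.
    std::fflush(gclog);

    sleep(1);
  }

  std::fclose(gclog);
  return 0;
}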
30-11-2012

Possible theory: Previously we used the TraceTime scoped object to print out the GC times. When the TraceTime object went out of scope and its destructor was called, the log file would be flushed:

TraceTime::~TraceTime() {
  if (_active) {
    _t.stop();
    if (_accum != NULL) _accum->add(_t);
    if (_verbose) {
      if (_print_cr) {
        _logfile->print_cr(", %3.7f secs]", _t.seconds());
      } else {
        _logfile->print(", %3.7f secs]", _t.seconds());
      }
      _logfile->flush();
    }
  }
}

This was recently changed and we now explicitly write the times to the log file in G1CollectedHeap::log_gc_footer():

void G1CollectedHeap::log_gc_footer(double pause_time_sec) {
  if (!G1Log::fine()) {
    return;
  }
  if (G1Log::finer()) {
    if (evacuation_failed()) {
      gclog_or_tty->print(" (to-space exhausted)");
    }
    gclog_or_tty->print_cr(", %3.7f secs]", pause_time_sec);
    g1_policy()->phase_times()->note_gc_end();
    g1_policy()->phase_times()->print(pause_time_sec);
    g1_policy()->print_detailed_heap_transition();
  } else {
    if (evacuation_failed()) {
      gclog_or_tty->print("--");
    }
    g1_policy()->print_heap_transition();
    gclog_or_tty->print_cr(", %3.7f secs]", pause_time_sec);
  }
}

Interestingly, if G1LogLevel=finer is set then we also instantiate the CPU timer, but this does not seem to flush the logs:

TraceCPUTime tcpu(G1Log::finer(), true, gclog_or_tty);

TraceCPUTime::~TraceCPUTime() {
  if (_active) {
    bool valid = false;
    if (!_error) {
      double real_secs;     // walk clock time
      double system_secs;   // system time
      double user_secs;     // user time for all threads

      double real_time, user_time, system_time;
      valid = os::getTimesSecs(&real_time, &user_time, &system_time);
      if (valid) {
        user_secs = user_time - _starting_user_time;
        system_secs = system_time - _starting_system_time;
        real_secs = real_time - _starting_real_time;
        _logfile->print(" [Times: user=%3.2f sys=%3.2f, real=%3.2f secs] ",
                        user_secs, system_secs, real_secs);
      } else {
        _logfile->print("[Invalid result in TraceCPUTime]");
      }
    } else {
      _logfile->print("[Error in TraceCPUTime]");
    }
    if (_print_cr) {
      _logfile->print_cr("");
    }
  }
}

So it doesn't look like the file would be flushed with this flag either.
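[Editorial note] To make the theory above concrete, here is a small, self-contained C++ sketch (illustrative only, not the actual HotSpot classes): a scoped tracer that flushes in its destructor, as the old TraceTime did, makes each record visible as soon as its scope closes, while one that only prints, as TraceCPUTime and log_gc_footer do now, leaves the record sitting in the stream buffer until something else flushes it:

// Illustrative sketch, not the actual HotSpot classes.
#include <cstdio>

class FlushingTracer {              // analogous to the old TraceTime behaviour
  std::FILE* _logfile;
 public:
  explicit FlushingTracer(std::FILE* f) : _logfile(f) {}
  ~FlushingTracer() {
    std::fprintf(_logfile, ", 0.1234567 secs]\n");
    std::fflush(_logfile);          // record reaches the file when the scope ends
  }
};

class NonFlushingTracer {           // analogous to TraceCPUTime / log_gc_footer today
  std::FILE* _logfile;
 public:
  explicit NonFlushingTracer(std::FILE* f) : _logfile(f) {}
  ~NonFlushingTracer() {
    std::fprintf(_logfile, " [Times: user=0.10 sys=0.01, real=0.12 secs]\n");
    // No flush: the record stays buffered until something else flushes the stream.
  }
};

int main() {
  std::FILE* gclog = std::fopen("demo_gc.log", "w");
  if (gclog == NULL) return 1;
  {
    FlushingTracer t(gclog);        // this record is on disk once the block closes
    std::fprintf(gclog, "2.084: [GC pause (young)");
  }
  {
    NonFlushingTracer t(gclog);     // this record may not appear until fclose/exit
    std::fprintf(gclog, "4.766: [GC pause (young)");
  }
  std::fclose(gclog);
  return 0;
}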
30-11-2012