JDK-8005792 : Long GC Pause Times When Opening Large Flight Recording in JMC
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 7u40
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2013-01-07
  • Updated: 2019-02-11
Description
Opening a large (~150MB unzipped) flight recording in JMC causes inordinately long pause times when using the standard GC (Parallel Scavenge), with maximum pause times around 6-7 seconds during a 40-second timespan.

In total this means that the time it takes to open that flight recording is around 40 seconds with parallel scavenge. Using CMS the equivalent time is around 14 seconds.
Comments
webrev of a patch that saves the Klass of the object that crosses a region boundary instead of the partial size of the object extending into the region, and calculates the partial size from the Klass. This would allow (not done in this patch) removal of the pass over regions after the compaction that does the delayed updates for regions with objects crossing the region boundary.
28-09-2016

I don't think this is a bug but rather poor performance of ParallelOldGC with some inputs. Further investigation will be needed, and probably an enhancement based on that investigation.
20-07-2016

I tried loading the jfr file into JMC with SerialGC, ParallelGC/baseline and ParallelGC/live_cache. The full GC's are below. Not much improvement.

log.server.serial:[2.668s][info ][gc] GC(0) Pause Full (Metadata GC Threshold) 196M->16M(3891M) (2.596s, 2.668s) 71.560ms
log.server.serial:[3.905s][info ][gc] GC(1) Pause Full (Metadata GC Threshold) 177M->28M(3891M) (3.830s, 3.905s) 75.217ms
log.server.serial:[15.732s][info ][gc] GC(2) Pause Full (System.gc()) 204M->41M(3891M) (15.602s, 15.732s) 130.257ms
log.server:[2.663s][info ][gc] GC(1) Pause Full (Metadata GC Threshold) 16M->16M(3755M) (2.599s, 2.663s) 63.922ms
log.server:[3.943s][info ][gc] GC(3) Pause Full (Metadata GC Threshold) 29M->23M(3755M) (3.895s, 3.943s) 48.332ms
log.server:[13.843s][info ][gc] GC(5) Pause Full (System.gc()) 37M->27M(3755M) (13.596s, 13.843s) 247.401ms
log.c2_rt_live_cache:[2.716s][info ][gc] GC(1) Pause Full (Metadata GC Threshold) 16M->16M(3755M) (2.663s, 2.716s) 53.486ms
log.c2_rt_live_cache:[4.088s][info ][gc] GC(3) Pause Full (Metadata GC Threshold) 29M->23M(3755M) (4.035s, 4.088s) 53.156ms
log.c2_rt_live_cache:[15.022s][info ][gc] GC(5) Pause Full (System.gc()) 37M->27M(3755M) (14.800s, 15.022s) 221.677ms

I used -Xmx4g -Xms4g -Xmn1g -XX:SurvivorRatio=3 -XX:InitialSurvivorRatio=3 -XX:-UseAdaptiveSizePolicy -Xlog:gc=debug
27-01-2016

I think the improvement in 8146987 helps this problem and, if it does, will be a good indication of what the problem is.
22-01-2016

Wrote a test that loads the heap with live objects (referenced from an array) and does System.gc()'s, lets some objects die and does more System.gc()'s, then lets them all die and does more System.gc()'s, and repeats. The objective is to do full GC's when the density of live objects is high, and to compare ParallelGC and SerialGC pause times.
22-01-2016

Possibilities regarding why ParallelGC is slower: The marking between ParallelGC and MSC is very similar, except that ParallelGC adds the size of each found object to the sum of live data for its region. The add is done with a CAS and so is expected to be slower, but I found it surprising that it is this much slower. That work could be done in a separate, parallel phase, but that would mean another touch of all live objects; MSC does that work serially during phase 2. The ParallelGC compact-and-update phase has the serial delayed update, which could be done in parallel. The delayed update might be avoided if a pointer to the Klass of the partial object is saved instead of the size of the partial object extending into the region; that size would then have to be calculated, and doing so may be faster or slower. The ParallelGC compact-and-update phase also calculates the new location of an object every time that object is encountered during the phase, whereas MSC has the new location in the object's forwarding pointer. Repeatedly calculating the location of an object may be the source of the ParallelGC slowness.
12-11-2013

I wrote a small benchmark that loads mostly long lived objects and ran SerialGC and ParallelGC with 1 thread and printed out the phase times for each. The benchmark does System.gc() after loading the objects. These are the numbers for the first 2 System.gc()'s.

SerialGC:
2.487: [Full GC (System.gc())
2.487: [phase 1, 0.5491560 secs]
3.036: [phase 2, 0.2051150 secs]
3.242: [phase 3, 0.4565430 secs]
3.698: [phase 4, 0.0645980 secs]
3.767: [Full GC (System.gc())
3.767: [phase 1, 0.5495550 secs]
4.317: [phase 2, 0.2051800 secs]
4.522: [phase 3, 0.4535720 secs]
4.975: [phase 4, 0.0000120 secs]

ParallelGC with 1 thread:
2.449: [pre compact, 0.0000220 secs]
2.449: [Full GC (System.gc())
2.449: [marking phase, 2.0911130 secs]
2.449: [par mark, 2.0891420 secs]
4.538: [reference processing, 0.0000390 secs]
4.538: [class unloading, 0.0018430 secs]
4.540: [summary phase, 0.0000310 secs]
4.540: [adjust roots, 0.0010390 secs]
4.541: [compaction phase, 4.4939690 secs]
4.541: [drain task setup, 0.0000190 secs]
4.541: [dense prefix task setup, 0.0000060 secs]
4.542: [steal task setup, 0.0000060 secs]
4.542: [par compact, 2.7032380 secs]
7.245: [deferred updates, 1.7900900 secs]
9.035: [post compact, 0.0096480 secs]
9.048: [pre compact, 0.0000190 secs]
9.048: [Full GC (System.gc())
9.048: [marking phase, 2.0778220 secs]
9.048: [par mark, 2.0759630 secs]
11.124: [reference processing, 0.0000370 secs]
11.124: [class unloading, 0.0017540 secs]
11.126: [summary phase, 0.0000220 secs]
11.126: [adjust roots, 0.0003250 secs]
11.126: [compaction phase, 0.6602890 secs]
11.127: [drain task setup, 0.0000120 secs]
11.127: [dense prefix task setup, 0.0000070 secs]
11.127: [steal task setup, 0.0000060 secs]
11.127: [par compact, 0.6597360 secs]
11.786: [deferred updates, 0.0000150 secs]
11.786: [post compact, 0.0086050 secs]

Of note is that the SerialGC marking phase (phase 1) takes 0.5491560 seconds while the ParallelGC marking phase takes 2.0911130 seconds.
The SerialGC adjust pointers and compaction phases together take (0.4565430 + 0.0645980) seconds, while the UseParallelGC compaction phase (which does both pointer update and compaction) takes 4.4939690 seconds.
11-11-2013

John Coomes refreshed the BlockData fix for work on a regression reported by PSR. Since we figured that this Mission Control bug may have the same root cause, I tried his patch on MC and the attached JFR file. I ran the tests on my Mac and used this command line for OS X:

java -XX:+UseParallelGC -XX:+PrintGCDetails -XX:+PrintGCCause -Xmx2200m -Xms2200m -Xmn500m -Xdock:icon=../Resources/Eclipse.icns -XstartOnFirstThread -Dorg.eclipse.swt.internal.carbon.smallFonts -jar mc.jar -ws cocoa

With these settings there is just one full GC while opening the JFR recording. Here are three runs for each build:

ParallelGC with BlockData fix:
[Full GC (Ergonomics) [PSYoungGen: 63968K->0K(448000K)] [ParOldGen: 1211277K->1062213K(1740800K)] 1275245K->1062213K(2188800K) [PSPermGen: 38971K->38960K(78336K)], 7.8566330 secs] [Times: user=27.99 sys=0.10, real=7.85 secs]
[Full GC (Ergonomics) [PSYoungGen: 63968K->0K(448000K)] [ParOldGen: 1216520K->1061398K(1740800K)] 1280488K->1061398K(2188800K) [PSPermGen: 38978K->38968K(78336K)], 8.4267350 secs] [Times: user=30.25 sys=0.10, real=8.43 secs]
[Full GC (Ergonomics) [PSYoungGen: 63968K->0K(448000K)] [ParOldGen: 1212596K->1062219K(1740800K)] 1276564K->1062219K(2188800K) [PSPermGen: 38975K->38964K(78336K)], 8.3140840 secs] [Times: user=29.27 sys=0.10, real=8.31 secs]

ParallelGC without BlockData fix:
[Full GC (Ergonomics) [PSYoungGen: 63968K->0K(448000K)] [ParOldGen: 1215426K->1061290K(1740800K)] 1279394K->1061290K(2188800K) [PSPermGen: 38969K->38958K(78208K)], 8.4507600 secs] [Times: user=28.16 sys=0.10, real=8.45 secs]
[Full GC (Ergonomics) [PSYoungGen: 63968K->0K(448000K)] [ParOldGen: 1212605K->1062239K(1740800K)] 1276573K->1062239K(2188800K) [PSPermGen: 38976K->38965K(78208K)], 8.1841620 secs] [Times: user=28.48 sys=0.10, real=8.19 secs]
[Full GC (Ergonomics) [PSYoungGen: 63968K->0K(448000K)] [ParOldGen: 1216677K->1061368K(1740800K)] 1280645K->1061368K(2188800K) [PSPermGen: 38977K->38966K(78208K)], 9.0421660 secs] [Times: user=31.20 sys=0.11, real=9.04 secs]

As you can see the pause times are comparable, so the BlockData fix does not seem to help all the way here. For comparison I also ran with SerialGC:

[Full GC (Permanent Generation Full)[Tenured: 1259714K->469941K(1740800K), 2.0594960 secs] 1612953K->469941K(2201600K), [Perm : 43007K->42994K(43008K)], 2.0595560 secs] [Times: user=2.05 sys=0.00, real=2.06 secs]
[Full GC (Permanent Generation Full)[Tenured: 1270368K->469942K(1740800K), 2.0543930 secs] 1614439K->469942K(2201600K), [Perm : 43008K->42995K(43008K)], 2.0544600 secs] [Times: user=2.10 sys=0.01, real=2.05 secs]
[Full GC (Permanent Generation Full)[Tenured: 1279288K->470013K(1740800K), 2.0594670 secs] 1630427K->470013K(2201600K), [Perm : 43007K->42994K(43008K)], 2.0595400 secs] [Times: user=2.06 sys=0.00, real=2.06 secs]

SerialOld collections seem to be four times faster than ParallelOld. I also ran with CMS and G1. For both of these, all the pauses that the concurrent cycles introduce are shorter than the young GC times. So both of these seem to be good choices for MC. I have not done a comparison of the young GC times for CMS and G1. I guess that is what will make the difference between them.
23-04-2013

I guess it's not important now, but the problem with running JMC on Windows was that you were using the Solaris version of JMC. From the log: !MESSAGE Bundle update@plugins/org.eclipse.swt.gtk.solaris.x86_3.8.1.v3834e.jar was not resolved.
21-03-2013

Fixing this will require an enhancement as described in the comments. This is not a regression. Will request a deferral to the next 7 update.
20-03-2013

There used to be a solution to this problem that involved a side data structure (BlockData, part of RegionData) that divided a region into blocks and saved information for each block that sped up the calculation of live data to the left of an object (sliding compaction being from right to left). The BlockData took up more space, but a larger region size could then be used, so it was a win (halving the number of regions outweighed the cost of the BlockData). This solution worked, but a summary phase had to be added for the BlockData, and that summary phase took a non-negligible amount of time (even when done in parallel) and was serial time, so it was judged not to be worth doing. At the time there was no benchmark / application indicating that the chosen region size was too large, so this looked like a solution for a non-problem that cost performance to boot. The BlockData implementation was removed. I tried to patch it back into a current hsx25, but it was apparently broken at the time it was removed (a large block of code was under #if 0). If this is implemented again, the calculation of the BlockData (formerly done in a summarize phase) should be done lazily during the MoveAndUpdate phase. Fixing this will take a few weeks: I would estimate 4 weeks to code, debug, and performance test; less if the previous implementation is not too badly broken.
20-03-2013

I thought that the problem was lots of live data in the regions being compacted and implemented a change to artificially boost the size of the dense prefix to see if that helped. If this is a transitory problem at startup (loading lots of long lived data that gets promoted and densely populates the old gen) maybe it would help to fully compact the old gen gradually. This did help with the length of the full GC pauses but more full GC's were needed to fully collect the old gen (not unexpected).
19-03-2013

The code that calculates the new location and slides the object was fixed with regard to the scaling with the region size and the calls to live_words_in_range(), but adjust_pointers() is still susceptible to the scaling problem. Cutting the region size in half (512 words to 256 words) approximately cut the time in live_words_in_range() in half.
18-03-2013

The UseParallelOldGC collector divides the heap into regions and does collection by region. For an object AA in a region, the destination of AA after the collection depends on the size of the live objects ahead of AA in its region. live_words_in_range() is used to find the size of the live objects ahead of AA. There was originally an assumption (for good performance) that there were not many live objects in a region. Investigating whether that assumption is still true.
15-03-2013

On windows $ ../jdk1.7.0_14/bin/java -jar mc.jar An error has occurred. See the log file C:\cygwin\home\jmasamit\jfr\configuration\1363360661906.log. Attaching that log file.
15-03-2013

log file from attempt on windows XP
15-03-2013

I'll report more precisely what happened on my windows laptop when I try it again (don't have it handy right now).
14-03-2013

Can someone suggest a machine I can use to debug this? I've tried several without any success. I'm using:

Java(TM) SE Runtime Environment (build 1.7.0_14-ea-b16)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b35, mixed mode)

I download the jmc-product-<platform>.zip to a directory and unzip it. I download the jfr-profile.jfr to that directory. I run:

/export/jmasa/jdk1.7.0_14/bin/java -showversion -XX:+UseParallelGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xmx3g -Xms3g -Xmn1g -jar mc.jar

On intelsdv03 (linux x64) I see

Xlib: extension "RANDR" missing on display "localhost:10.0".

and only the splash screen appears and nothing else happens. On intelsdv01 (solaris x86) I see

Begin task: Starting reading events
Reading chunk
Parsing events
18.964: [GC [PSYoungGen: 786432K->45054K(917504K)] 786432K->196611K(3014656K), 0.3374758 secs] [Times: user=2.32 sys=0.32, real=0.34 secs]
20.476: [GC [PSYoungGen: 831486K->131064K(917504K)] 983043K->917001K(3014656K), 3.5172083 secs] [Times: user=21.38 sys=4.15, real=3.52 secs]
Reading chunk
Parsing events
Reading chunk
Parsing events
Task finished!

and a message in the JMC window that says "Error initializing component" and a stack trace that starts with

org.eclipse.swt.SWTException: Unable to load graphics library [Cairo is required] (java.lang.NoClassDefFoundError: Could not initialize class org.eclipse.swt.internal.cairo.Cairo)
at org.eclipse.swt.SWT.error(SWT.java:4361)

On solaris-sparcv9 (jano-app) I see

!SESSION 2013-03-14 08:20:57.454 -----------------------------------------------
eclipse.buildId=unknown
java.version=1.7.0_14-ea
java.vendor=Oracle Corporation
BootLoader constants: OS=solaris, ARCH=sparcv9, WS=gtk, NL=en
Command-line arguments: -ws gtk

!ENTRY org.eclipse.osgi 4 0 2013-03-14 08:21:15.813
!MESSAGE An error occurred while automatically activating bundle org.eclipse.ui.workbench (196).
!STACK 0
org.osgi.framework.BundleException: The activator org.eclipse.ui.internal.WorkbenchPlugin for bundle org.eclipse.ui.workbench is invalid
...

I tried on Windows but couldn't get it to work either.
14-03-2013

This version worked for me: Java(TM) SE Runtime Environment (build 1.7.0_14-ea-b15) Java HotSpot(TM) Server VM (build 24.0-b34, mixed mode) Make sure to open the file by: File -> Open File ... You can't open the file by doing: java <VM flags> -jar mc.jar jfr-profile.jfr
14-03-2013

What version of 7 should I use?
14-03-2013

The logs can easily be obtained by starting JMC: java <VM flags> -jar mc.jar and loading the attached JFR file.
07-03-2013

Were GC logs with -XX:+PrintGCDetails collected for these runs (UseParallelGC and CMS)?
06-03-2013

Just a note that we are now running with CMS by default if JMC is launched using the launcher in JDK/bin.
13-02-2013

On my machine I get these Full GC times:

-XX:+UseSerialGC -Xmx4g -Xms4g -Xmn1g: ~3 seconds
-XX:+UseParallelGC -Xmx4g -Xms4g -Xmn1g: ~15 seconds

On another machine the relationship was 3 vs 28 seconds for the Full GC. I did a "perf record" run and found that we spent 50% of the time in ParMarkBitMap::live_words_in_range.
08-01-2013

Attached the recording referred to in the bug description.
07-01-2013