JDK-6413516 : 4 unusual very long ParNew GC
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 1.4.2
  • Priority: P4
  • Status: Closed
  • Resolution: Duplicate
  • OS: solaris_9
  • CPU: sparc
  • Submitted: 2006-04-14
  • Updated: 2010-04-02
  • Resolved: 2006-10-10
Description
FULL PRODUCT VERSION :
"1.4.2_09" Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_09-b05) Java HotSpot(TM) 64-Bit Server VM (build 1.4.2_09-b05, mixed mode)

FULL OS VERSION :
SunOS mnbas-prop-01 5.9 Generic_117171-05 sun4u sparc SUNW,Netra-T12

EXTRA RELEVANT SYSTEM CONFIGURATION :
64GB RAM

A DESCRIPTION OF THE PROBLEM :
On 24/03/2006, together with a customer, we ran a stress test of our server application.
About 100 clients were connected during this test.
We experienced 4 unusually long ParNew GCs:
5360.712/14:35:50 : duration 427s (7min7s)
6229.767/14:50:20 : duration 426s (7min8s)
6658.591/14:57:29 : duration 453s (7min33s)
7112.431/15:05:03 : duration 479s (7min59s)
The last 3 GCs occurred consecutively, with almost no application time in between.
The server had been running for only 1h30 when these long GCs appeared.
We had to restart our application, because it appeared "frozen" for 22min40s.
According to the GC logs, only 13GB (out of 32GB) of the heap were used.

Here are the JVM options :
java -d64 -XX:+DisableExplicitGC -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:-CMSParallelRemarkEnabled -XX:PermSize=64m -XX:CMSInitiatingOccupancyFraction=30 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:xxx

--- GC logs extracts
5360.712: [GC  {Heap before GC invocations=3514:
Heap
 par new generation   total 53184K, used 53004K [0xfffffff6b2400000, 0xfffffff6b5800000, 0xfffffff6b5800000)
  eden space 53120K,  99% used [0xfffffff6b2400000, 0xfffffff6b57c3338, 0xfffffff6b57e0000)
  from space 64K,   0% used [0xfffffff6b57e0000, 0xfffffff6b57e0000, 0xfffffff6b57f0000)
  to   space 64K,   0% used [0xfffffff6b57f0000, 0xfffffff6b57f0000, 0xfffffff6b5800000)
 concurrent mark-sweep generation total 33738880K, used 13533825K [0xfffffff6b5800000, 0xfffffffec0c20000, 0xffffffff32400000)
 concurrent-mark-sweep perm gen total 65536K, used 23016K [0xffffffff32400000, 0xffffffff36400000, 0xffffffff36400000)
5360.713: [ParNew: 53004K->0K(53184K), 427.4508010 secs] 13586830K->13537652K(33792064K) Heap after GC invocations=3515:
Heap
 par new generation   total 53184K, used 0K [0xfffffff6b2400000, 0xfffffff6b5800000, 0xfffffff6b5800000)
  eden space 53120K,   0% used [0xfffffff6b2400000, 0xfffffff6b2400000, 0xfffffff6b57e0000)
  from space 64K,   0% used [0xfffffff6b57f0000, 0xfffffff6b57f0000, 0xfffffff6b5800000)
  to   space 64K,   0% used [0xfffffff6b57e0000, 0xfffffff6b57e0000, 0xfffffff6b57f0000)
 concurrent mark-sweep generation total 33738880K, used 13537652K [0xfffffff6b5800000, 0xfffffffec0c20000, 0xffffffff32400000)
 concurrent-mark-sweep perm gen total 65536K, used 23016K [0xffffffff32400000, 0xffffffff36400000, 0xffffffff36400000)
} , 427.4515902 secs]
Total time for which application threads were stopped: 427.4592638 seconds
---
6229.767: [GC  {Heap before GC invocations=3516:
Heap
 par new generation   total 53184K, used 53120K [0xfffffff6b2400000, 0xfffffff6b5800000, 0xfffffff6b5800000)
  eden space 53120K, 100% used [0xfffffff6b2400000, 0xfffffff6b57e0000, 0xfffffff6b57e0000)
  from space 64K,   0% used [0xfffffff6b57e0000, 0xfffffff6b57e0000, 0xfffffff6b57f0000)
  to   space 64K,   0% used [0xfffffff6b57f0000, 0xfffffff6b57f0000, 0xfffffff6b5800000)
 concurrent mark-sweep generation total 33738880K, used 13547239K [0xfffffff6b5800000, 0xfffffffec0c20000, 0xffffffff32400000)
 concurrent-mark-sweep perm gen total 65536K, used 23021K [0xffffffff32400000, 0xffffffff36400000, 0xffffffff36400000)
6229.767: [ParNew: 53120K->0K(53184K), 426.6925385 secs] 13600359K->13556730K(33792064K) Heap after GC invocations=3517:
Heap
 par new generation   total 53184K, used 0K [0xfffffff6b2400000, 0xfffffff6b5800000, 0xfffffff6b5800000)
  eden space 53120K,   0% used [0xfffffff6b2400000, 0xfffffff6b2400000, 0xfffffff6b57e0000)
  from space 64K,   0% used [0xfffffff6b57f0000, 0xfffffff6b57f0000, 0xfffffff6b5800000)
  to   space 64K,   0% used [0xfffffff6b57e0000, 0xfffffff6b57e0000, 0xfffffff6b57f0000)
 concurrent mark-sweep generation total 33738880K, used 13556730K [0xfffffff6b5800000, 0xfffffffec0c20000, 0xffffffff32400000)
 concurrent-mark-sweep perm gen total 65536K, used 23021K [0xffffffff32400000, 0xffffffff36400000, 0xffffffff36400000)
} , 426.6934510 secs]
Total time for which application threads were stopped: 426.7247834 seconds
---
6658.591: [GC  {Heap before GC invocations=3517:
Heap
 par new generation   total 53184K, used 53120K [0xfffffff6b2400000, 0xfffffff6b5800000, 0xfffffff6b5800000)
  eden space 53120K, 100% used [0xfffffff6b2400000, 0xfffffff6b57e0000, 0xfffffff6b57e0000)
  from space 64K,   0% used [0xfffffff6b57f0000, 0xfffffff6b57f0000, 0xfffffff6b5800000)
  to   space 64K,   0% used [0xfffffff6b57e0000, 0xfffffff6b57e0000, 0xfffffff6b57f0000)
 concurrent mark-sweep generation total 33738880K, used 13556730K [0xfffffff6b5800000, 0xfffffffec0c20000, 0xffffffff32400000)
 concurrent-mark-sweep perm gen total 65536K, used 23025K [0xffffffff32400000, 0xffffffff36400000, 0xffffffff36400000)
6658.592: [ParNew: 53120K->0K(53184K), 453.2627707 secs] 13609850K->13564527K(33792064K) Heap after GC invocations=3518:
Heap
 par new generation   total 53184K, used 0K [0xfffffff6b2400000, 0xfffffff6b5800000, 0xfffffff6b5800000)
  eden space 53120K,   0% used [0xfffffff6b2400000, 0xfffffff6b2400000, 0xfffffff6b57e0000)
  from space 64K,   0% used [0xfffffff6b57e0000, 0xfffffff6b57e0000, 0xfffffff6b57f0000)
  to   space 64K,   0% used [0xfffffff6b57f0000, 0xfffffff6b57f0000, 0xfffffff6b5800000)
 concurrent mark-sweep generation total 33738880K, used 13564527K [0xfffffff6b5800000, 0xfffffffec0c20000, 0xffffffff32400000)
 concurrent-mark-sweep perm gen total 65536K, used 23025K [0xffffffff32400000, 0xffffffff36400000, 0xffffffff36400000)
} , 453.2636434 secs]
Total time for which application threads were stopped: 453.4103892 seconds
---
7112.431: [GC  {Heap before GC invocations=3518:
Heap
 par new generation   total 53184K, used 53120K [0xfffffff6b2400000, 0xfffffff6b5800000, 0xfffffff6b5800000)
  eden space 53120K, 100% used [0xfffffff6b2400000, 0xfffffff6b57e0000, 0xfffffff6b57e0000)
  from space 64K,   0% used [0xfffffff6b57e0000, 0xfffffff6b57e0000, 0xfffffff6b57f0000)
  to   space 64K,   0% used [0xfffffff6b57f0000, 0xfffffff6b57f0000, 0xfffffff6b5800000)
 concurrent mark-sweep generation total 33738880K, used 13564527K [0xfffffff6b5800000, 0xfffffffec0c20000, 0xffffffff32400000)
 concurrent-mark-sweep perm gen total 65536K, used 23025K [0xffffffff32400000, 0xffffffff36400000, 0xffffffff36400000)
7112.432: [ParNew: 53120K->0K(53184K), 479.2269278 secs] 13617647K->13573097K(33792064K) Heap after GC invocations=3519:
Heap
 par new generation   total 53184K, used 0K [0xfffffff6b2400000, 0xfffffff6b5800000, 0xfffffff6b5800000)
  eden space 53120K,   0% used [0xfffffff6b2400000, 0xfffffff6b2400000, 0xfffffff6b57e0000)
  from space 64K,   0% used [0xfffffff6b57f0000, 0xfffffff6b57f0000, 0xfffffff6b5800000)
  to   space 64K,   0% used [0xfffffff6b57e0000, 0xfffffff6b57e0000, 0xfffffff6b57f0000)
 concurrent mark-sweep generation total 33738880K, used 13573097K [0xfffffff6b5800000, 0xfffffffec0c20000, 0xffffffff32400000)
 concurrent-mark-sweep perm gen total 65536K, used 23025K [0xffffffff32400000, 0xffffffff36400000, 0xffffffff36400000)
} , 479.2278830 secs]
Total time for which application threads were stopped: 479.3643864 seconds
The full GC logs are available at this address :
http://www.mycom-int.fr/mdds_gc_20060324130625.log.zip



Here are some mails exchanged with a Sun engineer :
------- Forwarded message -------
  From: "Jon Masamitsu" <###@###.###>
  To: "David Tavoularis" <###@###.###>
  Subject: Re: GC Feedback
Date: Thu, 30 Mar 2006 17:49:46 +0300

David,

Do you have a testcase demonstrating this problem that you can send to us?

If not, just a few questions.

Does this problem occur every time you run your applications?

Is the problem occurring at about the same point in each run
(i.e., after about the same amount of run time, or at a point where the heap is about the same in terms of maximum size and amount being used)?

Does the application change behavior at a particular point (e.g., start allocating significantly more data or start allocating very large objects)?
You mention 4 long GCs.
I noted 5, and that once the long GCs started they continued (no short GCs after).
I'm looking for something that has changed that would lead to the long GCs.

How many cpus are on your test machine?

Is there anything else running on your test machine when you see this bad behavior?

If you have a second log that shows the problem, we might be able to get some information from it by comparing the two.

Jon

------- Forwarded message -------
  From: "Philippe Bonneau" <###@###.###>
  To: "David Tavoularis" <###@###.###>, ###@###.###
  Subject: Re: GC Feedback
Date: Thu, 30 Mar 2006 19:34:35 +0300

Hello Jon,

Just to explain why I will answer "no" to many of your questions, a few words on the context :
for several months we have been struggling to implement in our application a special mechanism that self-monitors the tenured occupancy of our server and automatically slows down its activity when the tenured generation fills up (in order to wait for the current CMS cycle to complete), so that we can guarantee that we will NEVER undergo a stop-the-world collection.
In the meantime, our customer had to survive with several daily 20-minute stop-the-world collections.
Now that we have at last installed our long-awaited "miracle solution" (which mostly works), they ran a stress-test session to validate it and experienced the problem described below with the New generation!
Even though we tried to explain that the cause is very different, from the end user's point of view all they see is the system freezing again, and some of them question whether we fixed anything at all.
They encountered the problem only once, but needless to say, in this context the customer is certainly not keen on helping us reproduce it. For information, they mobilized 100 people at the same time to conduct the test, so it is not something we can ask them to redo on demand.
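(For illustration only: a minimal sketch of such an occupancy-based throttle. The class name, threshold, and polling interval below are hypothetical and are not the submitter's actual implementation.)

public class OccupancyThrottle extends Thread {
    private volatile boolean throttled = false;

    public boolean isThrottled() { return throttled; }

    public void run() {
        Runtime rt = Runtime.getRuntime();
        while (true) {
            // Approximate heap occupancy; a real implementation would track
            // the tenured generation specifically rather than the whole heap.
            long used = rt.totalMemory() - rt.freeMemory();
            throttled = used > (long) (rt.totalMemory() * 0.7); // hypothetical 70% threshold
            try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
        }
    }
}
// Worker threads poll isThrottled() and defer new work while it returns true,
// giving the concurrent CMS cycle time to finish before the heap fills up.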

That said, please see my answers below; many thanks in advance for your help.

> Do you have a testcase demonstrating this problem that you can send to us?
No, in fact we were hoping that you could give us some leads to write one that would reproduce the symptoms.

> Does this problem occur every time you run your applications?
Happened only once, under high stress; but given the context we are very
reluctant to tell our customer not to worry and that the problem is very rare.

> Is the problem occurring at about the same point in each run (i.e., after about the same amount of run time or at a point where the heap is about the same in terms of maximum size and amount being used)?
Happened only once, but the system has successfully handled this amount of memory usage many times before.

> Does the application change behavior at a particular point (e.g., start allocating significantly more data or start allocating very large objects)?
This is quite possible and is our primary suspicion, but our logs do not report every single click of every single user, so we do not have any specific information.
The sequence of long GCs is easily explained, as the system is under high activity (as you can see above, New GCs are triggered several times per *second*).
Conversely, can you let us know if, for example, repeated allocation requests for huge objects could cause such symptoms?

> How many cpus are on your test machine?
It has 16 CPUs. This is unfortunately a live machine! The problem never occurred on a test server.

> Is there anything else running on your test machine when you see this bad behavior?
Yes, other processes belonging to our application. They are not very suspicious, however, because as mentioned above, the problem occurred during a stress test of the process concerned, so we would really be out of luck if the problem were caused by a side effect of another process (they usually run 24 hours a day without a glitch).

> If you have a second log that shows the problem, we might be able to get some information from it by comparing the two.
We don't.

------- Forwarded message -------
  From: "Jon Masamitsu" <###@###.###>
  To: "Philippe Bonneau" <###@###.###>
  Subject: Re: GC Feedback
Date: Thu, 30 Mar 2006 20:30:39 +0300

Philippe,

Usually when I see such a large change in the ParNew collection times I think of swapping. It's hard to positively identify swapping as a problem unless you're looking for it. Any chance the customer was monitoring the swapping activity on his system when the long GCs occurred? If you want to see the effect of swapping, you can run a test system with a heap that is larger than physical memory (with the size of the heap less than the amount of swap space, of course).
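(For reference: on Solaris, swap activity during the pauses could be checked with something like vmstat -S 5, whose si/so columns report swap-ins and swap-outs, or by watching the sr (page scan rate) column of plain vmstat; a sustained non-zero scan rate during the long collections would point to memory pressure rather than the GC work itself.)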

During the development cycle for JDK 6.0 we identified a problem with ParNew collections of very large objects that exhibited itself as long pauses. It is fixed in 6.0, and I believe the fix was also backported to 5.0. If you are running into this problem, using -XX:+UseConcMarkSweepGC with -XX:-UseParNewGC makes it go away.

You mention that the customer was occasionally seeing 20-minute stop-the-world collections when using CMS. Do you happen to know how long the tenured generation collections are if the parallel collector (-XX:+UseParallelGC) is used? I'm just wondering whether the 20-minute collection time is way out of bounds for a CMS stop-the-world collection.

Lastly, you guys sound like you definitely know how to tune CMS.
May I ask how you learned to use our collectors?

Jon


------- Forwarded message -------
  From: "Jon Masamitsu" <###@###.###>
  To: "Philippe Bonneau" <###@###.###>
  Subject: Re: GC Feedback
Date: Sat, 01 Apr 2006 04:51:35 +0300

David and Philippe,

I haven't figured out anything that would explain this problem.
I've sent your mail around to some other members of the group and
we'll see what we can come up with.

The performance problem with large objects can be seen with a program that allocates a couple of 120m objects that stay alive through the life of the program and then allocates smaller objects to force GCs. Once the large objects get promoted to the old gen, the ParNew collections slow down. You can see the effect without CMS (just add -XX:+UseParNewGC). There will be a big difference between the young generation collection times when compared with the default collector.
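(For illustration only, a minimal sketch of a test program along these lines; it is not the actual gctest2.Main referenced in the evaluation below, and the sizes and iteration counts are arbitrary.)

public class LargeObjectGcTest {
    public static void main(String[] args) {
        // Two large arrays (~120 MB each) that stay alive for the whole run
        // and eventually get promoted to the old generation.
        byte[] big1 = new byte[120 * 1024 * 1024];
        byte[] big2 = new byte[120 * 1024 * 1024];

        // Churn small, short-lived objects to force repeated minor collections.
        int sink = 0;
        for (int i = 0; i < 10000000; i++) {
            byte[] small = new byte[512];
            sink += small.length;
        }

        // Keep the large objects reachable until the end of the program.
        System.out.println(big1.length + big2.length + sink);
    }
}
// Run with -XX:+PrintGCDetails and the young generation collector of interest
// (e.g. -XX:+UseParNewGC) and compare minor pause times before and after the
// large objects are promoted.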

In the meantime, how did you decide on the size of the young generation? It seems a little small (52m) compared to the size of the old generation. At the end of the GC log it looked like the amount of live data in the old generation was about 10g. Is that the size of the steady-state working set? By default CMS uses small survivor spaces. Something like -XX:SurvivorRatio=6 and -XX:MaxTenuringThreshold=15 might help slow down promotions into the old generation.
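(For example, illustrative only: with these two flags added, the command line from the description above would become something like
java -d64 -XX:+DisableExplicitGC -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:-CMSParallelRemarkEnabled -XX:SurvivorRatio=6 -XX:MaxTenuringThreshold=15 -XX:PermSize=64m -XX:CMSInitiatingOccupancyFraction=30 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:xxx
with the rest of the options unchanged.)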

Jon

------- Forwarded message -------
  From: "Jon Masamitsu" <###@###.###>
  To: "Philippe Bonneau" <###@###.###>
  Subject: Re: GC Feedback
Date: Tue, 04 Apr 2006 01:57:14 +0300

Philippe and David,

We have not come up with any good explanation for the behavior you are seeing. I would suggest that you file a bug.

This may well come back to us. If it does, we'll assign it to an engineer who can put some dedicated time in on it. It's like nothing we've seen, though, so it will likely be hard to figure out.

Jon

THE PROBLEM WAS REPRODUCIBLE WITH -Xint FLAG: Did not try

THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Yes

EXPECTED VERSUS ACTUAL BEHAVIOR :
No long GCs when there is lots of available memory
REPRODUCIBILITY :
This bug can be reproduced rarely.

Comments
EVALUATION 6433335 is now fixed in 5u10 (expected to GA later this year) and in 7.0 b02, and will be fixed in 6.0u1. If you want a test binary before the binaries are officially available, please contact your Java support / service representative, check http://developers.sun.com/services, or contact us via the channel you previously used. Please refer to 6433335 for a possible workaround to this problem.
10-10-2006

EVALUATION Bug 6433335 is an example of a situation where a large object is not required but long pauses during the young generation collection can still occur. In that situation large free chunks are the cause of the long pauses. There is a fix in progress for 6433335. The current state of the fix eliminates the long pauses but adds a cost to all young generation collections. If the submitter would like to try the fix to see if it eliminates the very long pauses, a VM with the fix (in its current state, where there is additional overhead on the young generation collections) can be provided. Please contact us if interested.
12-07-2006

EVALUATION Attached is a test case that demonstrates the effects of bug 6298694, along with 3 logs:
log.cms_parnew - 1.4.2 with CMS and the ParNew young gen collector
log.cms_serial - 1.4.2 with CMS and the serial young gen collector
log.6 - 6.0 with CMS and the ParNew young gen collector
The testcase allocates small objects to fill the young gen and cause minor collections, allocates a large object (which lives to the end of execution), and then allocates small objects again. log.cms_parnew shows the dramatic slowdown in minor collections after the allocation of the large object; log.cms_serial shows the lack of a dramatic slowdown; log.6 shows a run with the fix for 6298694. The command line for the runs was: java -server -Xmx768m -Xmn128m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=16 -XX:+PrintGCDetails -showversion gctest2.Main
27-04-2006

EVALUATION Very likely a duplicate of CR 6298694: "bad performance with big object in heap". We request that the customer run a JVM with that fix. CR 6298694 is fixed in Mustang b74+ and in 5.0u8b01+. (See related in "Comments" field.)
18-04-2006