JDK-6459113 : CMS+ParNew: wildly different ParNew pause times depending on heap shape caused by allocation spread
Type:Bug
Component:hotspot
Sub-Component:gc
Affected Version:5.0u7,6
Priority:P3
Status:Resolved
Resolution:Fixed
OS:generic,solaris_9
CPU:generic,x86
Submitted:2006-08-10
Updated:2010-12-08
Resolved:2006-11-14
EVALUATION
This bug fix should be backported along with that for 6433335.
13-09-2006
SUGGESTED FIX
The following fix has been putback to 7.0 and should be backported to
6u1, 5u10 and as appropriate 1.4.2_XX.
Event: putback-to
Parent workspace: /net/jano.sfbay/export/disk05/hotspot/ws/main/gc_baseline
(jano.sfbay:/export/disk05/hotspot/ws/main/gc_baseline)
Child workspace: /net/prt-web.sfbay/prt-workspaces/20060912150719.ysr.bot/workspace
(prt-web:/net/prt-web.sfbay/prt-workspaces/20060912150719.ysr.bot/workspace)
User: ysr
Comment:
---------------------------------------------------------
Job ID: 20060912150719.ysr.bot
Original workspace: neeraja:/net/spot/scratch/ysr/bot
Submitter: ysr
Archived data: /net/prt-archiver.sfbay/data/archived_workspaces/main/gc_baseline/2006/20060912150719.ysr.bot/
Webrev: http://analemma.sfbay.sun.com/net/prt-archiver.sfbay/data/archived_workspaces/main/gc_baseline/2006/20060912150719.ysr.bot/workspace/webrevs/webrev-2006.09.12/index.html
Fixed 6433335: ParNewGC times spiking, eventually taking up 20 out of every 30 seconds
Fixed 6459113: CMS+ParNew: wildly different ParNew pause times depending on heap shape caused by allocation spread
Webrev: http://analemma.sfbay/net/spot/scratch/ysr/bot/webrev
(a) 6433335:
ParNew's chunking of the card-table makes use of block-start
calls for chunk surgery prior to assignment. These calls are
in addition to the usual ones where block-start calls are
used to start the scanning of the first card of a range of
dirty cards. The logarithmic BOT was expected to keep the cost
of such calls, well, logarithmic in the size of the block
being described, i.e. O(log n), where n is the size of the block.
Unfortunately, as implemented, the cost was not
asymptotically logarithmic, but rather logarithmic only up to
n = 16^3 = 4K cards = 2 MB, and linear, i.e. O(n), beyond
that point.
The problem in 6433335 was that under certain kinds of
heap shapes and object sizes and lifetimes, the CMS heap
could end up with large free blocks. The chunking of the
card-table would then run into O(n) cost of block-start
calls on these large free chunks.
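The capped back-skip behaviour can be illustrated with a toy model. This is a deliberately simplified sketch, not HotSpot's actual BOT code: the names (lookup_steps, kDirectLimit, kBase) and the encoding are illustrative assumptions. Entries below kDirectLimit hold a direct offset back to the block start; larger entries encode "skip back kBase^k cards". Capping k at 3 mimics the pre-fix table, whose lookups degrade to O(n) beyond 16^3 cards; allowing larger exponents keeps them logarithmic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy block-offset table with logarithmic back-skips (illustrative only).
static const std::size_t kDirectLimit = 64;  // entries below this: direct offset
static const std::size_t kBase = 16;

// Number of probes needed to find the start of a single block of n cards
// when asking for the block start from its last card.
std::size_t lookup_steps(std::size_t n, unsigned max_power) {
    // Build the table entries for one block covering cards [0, n).
    std::vector<std::size_t> entry(n, 0);
    for (std::size_t i = 1; i < n; ++i) {
        if (i < kDirectLimit) { entry[i] = i; continue; }
        // Largest k <= max_power with kBase^k <= i.
        unsigned k = 0;
        std::size_t span = 1;
        while (k < max_power && span * kBase <= i) { span *= kBase; ++k; }
        entry[i] = kDirectLimit + k;         // encodes a back-skip of kBase^k
    }
    // Probe backwards from the last card, counting hops to the block start.
    std::size_t steps = 0, i = n - 1;
    while (entry[i] != 0) {
        if (entry[i] < kDirectLimit) {
            i -= entry[i];                   // direct offset to block start
        } else {
            std::size_t span = 1;
            for (std::size_t k = entry[i] - kDirectLimit; k > 0; --k) span *= kBase;
            i -= span;                       // logarithmic back-skip
        }
        ++steps;
    }
    return steps;
}
```

With skips capped at 16^3, probing across a block of 2^20 cards takes hundreds of hops; with larger powers permitted, a few dozen.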
There are several orthogonal fixes to this problem, all of
which are additive. This putback implements the most significant
of them: making the BOT asymptotically logarithmic.
A consequence of that change, however, is that the "block-split"
operation can become costlier. The previous asymptotic cost of
a split was bounded by a constant; with the new BOT, for
moderate-size blocks, the cost of a split is O(p log s), where
p + s = n, n is the size of the block being split, p the size
of the prefix of the split and s the size of the suffix. The
actual cost, however, was dictated by a tight loop in which a
BOT entry is first read and tested and then possibly modified,
making for a large constant in the big-O. We modified the BOT
fixup operation so that it needs no BOT reads at all, just
writes. This won back any potential loss from the new
block-split cost, which is asymptotically O(p log s).
This solved the customer's performance problem with no regression
in performance elsewhere. (The main concern had been that the
increased asymptotic cost of splitting would increase the
variance in the ParNew times due to the large hit one might take
as a result of splitting a large block to replenish the
free lists. Also, the initial scavenges, as well as perm gen
allocation, would be adversely affected, causing an
increase in start-up time. Indeed, fixing the logarithmic BOT
problem did surface these issues, but the BOT fixup change won
back the loss in performance.)
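The write-only fixup idea can be sketched as follows. This is a toy model, not HotSpot's code: an entry here simply stores the distance back to its block's start, whereas real BOT entries are encoded logarithmically. The point that carries over is that the entries covering the split suffix can be computed and stored unconditionally, with no read-test-modify loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mark cards [start, start+len) as one block. Every entry is written
// directly from the loop index: pure writes, no BOT reads.
void mark_block(std::vector<std::size_t>& bot, std::size_t start, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i)
        bot[start + i] = i;                  // write-only fixup
}

// Split the block covering cards [0, n) into a prefix of p cards and a
// suffix of n - p cards; only the suffix's entries need fixing up.
void split_block(std::vector<std::size_t>& bot, std::size_t n, std::size_t p) {
    mark_block(bot, p, n - p);
}

// Recover the starting card of the block covering `card`.
std::size_t block_start(const std::vector<std::size_t>& bot, std::size_t card) {
    return card - bot[card];
}
```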
As we mentioned above in passing, there are further possible
improvements, including reducing the block-start calls for
ParNew chunking. These are deferred for later (under a different
bug id we will be opening for the purpose).
(b) 6459113:
The problem here was that when we expand the old generation
we were not taking care to coalesce the newly obtained space
with an existing free block that was previously co-terminal
with the heap. Depending on allocation patterns and lifetimes
this could cause avoidable fragmentation,
defeat CMS's coalescing heuristics that try to maintain a large
corpus of free storage in the upper part of the generation,
and, because of the presence of large free blocks lower in
the address space, increase block-start costs for ParNew's
card-table chunking. This also fixes 6446077, reported by
Shutterfly and independently by a Java forum user.
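The expansion fix can be sketched with a toy free-block dictionary. This is illustrative only, under assumed names (ToyOldGen, expand); it is not CMS's dictionary code. On expansion, the new space is merged into a free block that ends exactly at the old heap boundary, instead of being filed as a separate block.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Toy old generation with a map of free blocks (start -> size in words).
struct ToyOldGen {
    std::map<std::size_t, std::size_t> free_blocks;
    std::size_t end;                         // current co-terminal boundary

    explicit ToyOldGen(std::size_t e) : end(e) {}

    // Expand the generation by `delta` words.
    void expand(std::size_t delta) {
        std::size_t old_end = end;
        end += delta;
        // Look for a free block co-terminal with the old heap end.
        for (auto& fb : free_blocks) {
            if (fb.first + fb.second == old_end) {
                fb.second += delta;          // coalesce: grow the existing block
                return;
            }
        }
        free_blocks[old_end] = delta;        // no co-terminal block: file separately
    }
};
```

With the coalescing step, repeated expansions keep one large free block at the top of the generation rather than leaving a trail of fragments.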
Thanks to Mark Susko for interfacing w/customer (General Dynamics),
to Poonam Bajaj for help with 5uXX instrumentation for the customer
and for verifying the fix using the test case from Shutterfly, and to
Stephen "cool-as-a-cucumber" Fitch and Mark Susko for keeping a
hot escalation cool, allowing us the luxury of exploring all
avenues of solution including several false starts at a constant
time block split operation using a non-canonical BOT representation
which was not to be.
Reviewed by: Jon Masamitsu, Tony Printezis; Poonam Bajaj (partial)
Fix Verified: yes
Verification Testing:
. 6433335: testing by the customer
. 6459113: . test case from customer (Cinnober) attached to bug report
. also independently tested using Shutterfly's test case (by Poonam)
Other testing: +/- CMS +/- VerifyBlockOffsetArray
. PRT
. refworkload
Performance testing:
. refworkload (JBB with CMS+ParNew gives a nice 20% ParNew
improvement and a 5% bottom-line improvement on Niagara)
. Alacrity (no change)
Files:
update: src/share/vm/memory/blockOffsetTable.cpp
update: src/share/vm/memory/blockOffsetTable.hpp
update: src/share/vm/memory/compactibleFreeListSpace.cpp
update: src/share/vm/memory/compactibleFreeListSpace.hpp
Examined files: 3876
Contents Summary:
4 update
3872 no action (unchanged)
13-09-2006
EVALUATION
See suggested fix, comments and workaround. We should fix this in 6u1
and backport as appropriate since it can cause performance surprises.
15-08-2006
SUGGESTED FIX
In addition to the fix for 6433335, when the heap expands, the new
block should be coalesced with the "big guy" that we try to maintain
at the end of the heap, rather than adding it independently to
the big chunk dictionary as currently done. This will prevent the
"allocation spread" with the large free block intervening in the
middle.
15-08-2006
WORK AROUND
-XX:-UseParNewGC reduces the slowdown substantially (same as 6433335);
alternatively, set -Xmx and -Xms to the same value to avoid the
"allocation spread" that tickles this behaviour (at least with this
program).
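The two workarounds might look as follows on a launch command. The application name "MyApp" and the heap size "2g" are placeholders; only the flags named in this report are taken from it, and -XX:+UseConcMarkSweepGC is assumed as the way CMS is selected.

```shell
# Workaround 1: disable parallel young-generation collection.
java -XX:+UseConcMarkSweepGC -XX:-UseParNewGC MyApp

# Workaround 2: set -Xms and -Xmx equal so the heap never expands,
# avoiding the "allocation spread".
java -XX:+UseConcMarkSweepGC -Xms2g -Xmx2g MyApp
```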