Bug ID: JDK-7133038 G1: Some small profile based optimizations

Type: Enhancement
Component: hotspot
Sub-Component: gc
Affected Version: 7u4

Priority: P4
Status: Closed
Resolution: Fixed
OS: generic
CPU: generic

Submitted: 2012-01-24
Updated: 2013-09-18
Resolved: 2012-03-24

JDK 7	JDK 8	Other
7u4 Fixed	8 Fixed	hs23 Fixed

While looking over some collect/analyze profiles measuring data cache (DC) misses, branches, and branch mispredicts, some "high" metric items were identified in the following routines:

HeapRegion::oops_on_card_seq_iterate_careful()
* High DC misses when attempting to read the klass of the current object in both loops.
* High number of branch mispredicts in the body of the second loop.

instanceKlass::oop_oop_iterate_[*]_nv()
* High number of DC misses while iterating over and de-referencing the reference fields in an object.

G1BlockOffsetArray::forward_to_block_containing_addr_slow()
* High number of DC misses while dereferencing objects during BOT walking.

FilterOutOfRegionClosure::do_oop_nv()
* High number of branches and branch mispredicts.

G1ParCopyHelper::copy_to_survivor_space()
* High number of mispredicts when calculating the object size (coming from size_given_klass).

Proposed changes:

HeapRegion::oops_on_card_seq_iterate_careful()
* High DC misses when attempting to read the klass of the current object in both loops.
-> Add a prefetch of the next object after we obtain the size of the current
object. The second loop looks like the better candidate for such a prefetch. In the
first loop I don't think that there is enough of a code window between the prefetch
in iteration n and the use in iteration n+1.

* High number of branch mispredicts in the body of the second loop.
-> The body of the second loop is made up of a 3-way if-statement. The body of two of the
clauses is the same. If we make the conditional statement "less" branchy then we should
be able to reduce the number of mispredicts.

instanceKlass::oop_oop_iterate_[*]_nv()
* High number of DC misses while iterating over and de-referencing the oop maps associated with reference fields in an object.
-> Simple. Prefetch the next oop map entry.
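
A rough sketch of the shape of such a loop, assuming a hypothetical MapBlock type (offset and count in words) that only loosely stands in for an oop map entry. The names and layout are illustrative and are not taken from the changeset.

```cpp
#include <cstddef>
#include <cstdint>

// Prefetch hint; falls back to a no-op where the compiler builtin is missing.
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0)
#else
#define PREFETCH_READ(p) ((void)(p))
#endif

// Hypothetical map entry: a run of `count` reference fields starting
// `offset` words into the object.
struct MapBlock {
    std::size_t offset;
    std::size_t count;
};

// Applies `visit` to every reference field described by the map.
template <typename Visit>
void for_each_ref_field(const std::uintptr_t* obj, const MapBlock* map,
                        std::size_t n_blocks, Visit visit) {
    const MapBlock* const map_end = map + n_blocks;
    for (; map < map_end; ++map) {
        // Hint the following entry while the fields of this one are visited.
        PREFETCH_READ(map + 1);

        const std::uintptr_t* p = obj + map->offset;
        const std::uintptr_t* const p_end = p + map->count;
        for (; p < p_end; ++p) {
            visit(*p);
        }
    }
}
```

The inner loop over the fields provides the code window between the hint and the read of the next entry.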

G1BlockOffsetArray::forward_to_block_containing_addr_slow()
* High number of DC misses while dereferencing objects during BOT walking.
-> Adding prefetching to these loops is a little bit more tricky. We can't add a prefetch after we obtain the size of the current block - there is not enough of a code window between the prefetch and the subsequent use. Instead, if we use a fixed prefetch amount and issue the prefetch before reading the block size, then we might get enough of a code window.

FilterOutOfRegionClosure::do_oop_nv()
* High number of branches and branch mispredicts.
-> Most of these come from the concurrent refinement path and are a result of calling the virtual do_oop() routine in the closure(s) applied by the FilterOutOfRegionClosure. Using specialization so that the non-virtual _nv version of do_oop() of these closures is called should help.
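
The general technique can be sketched as follows. This is a simplified model, not the HotSpot closure hierarchy: ClosureBase, CountingClosure and FilterOutOfRangeT are hypothetical names, and references are modelled as plain integers with 0 as null.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical base class with the virtual entry point.
struct ClosureBase {
    virtual ~ClosureBase() = default;
    virtual void do_oop(std::uintptr_t* p) = 0;
};

// Example inner closure: counts the slots it is applied to.
struct CountingClosure final : ClosureBase {
    std::size_t count = 0;
    void do_oop_nv(std::uintptr_t*) { ++count; }                 // non-virtual
    void do_oop(std::uintptr_t* p) override { do_oop_nv(p); }    // virtual
};

// Filter specialized on the concrete inner closure type. The inner call is
// resolved at compile time, so it needs no vtable lookup and can be inlined.
template <typename Inner>
struct FilterOutOfRangeT {
    std::uintptr_t lo;
    std::uintptr_t hi;
    Inner* inner;

    void do_oop_nv(std::uintptr_t* p) {
        const std::uintptr_t v = *p;
        // Forward only non-null references pointing outside [lo, hi).
        if (v != 0 && (v < lo || v >= hi)) {
            inner->do_oop_nv(p);
        }
    }
};
```

A filter holding a ClosureBase* and calling inner->do_oop(p) would behave the same but pay for an indirect call on every forwarded reference.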

G1ParCopyHelper::copy_to_survivor_space()
* High number of mispredicts when calculating the object size (coming from size_given_klass).
-> It was thought that refactoring and flattening the if-statement in the routine might give some positive results. After performing such a refactoring and generating the assembly, I don't see any difference in the branches in the generated code.

EVALUATION http://hg.openjdk.java.net/lambda/lambda/hotspot/rev/b4ebad3520bb

22-03-2012

EVALUATION http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/b4ebad3520bb

27-01-2012

EVALUATION See description.

24-01-2012