While looking over some collect/analyze profiles measuring data cache (DC) misses, branches, and branch mispredicts, some "high" metric items were identified in the following routines:
HeapRegion::oops_on_card_seq_iterate_careful()
* High DC misses when attempting to read the klass of the current object in both loops.
* High number of branch mispredicts in the body of the second loop.
instanceKlass::oop_oop_iterate_[*]_nv()
* High number of DC misses while iterating over and de-referencing the reference fields in an object.
G1BlockOffsetArray::forward_to_block_containing_addr_slow()
* High number of DC misses while dereferencing objects during BOT walking.
FilterOutOfRegionClosure::do_oop_nv()
* High number of branches and branch mispredicts.
G1ParCopyHelper::copy_to_survivor_space()
* High number of mispredicts when calculating the object size (coming from size_given_klass).
Proposed changes:
HeapRegion::oops_on_card_seq_iterate_careful()
* High DC misses when attempting to read the klass of the current object in both loops.
-> Add a prefetch of the next object after we obtain the size of the current
object. Adding such a prefetch to the second loop looks like the better candidate. I don't
think that there is enough of a code window between the prefetch in iteration n and the use
in iteration n+1.
* High number of branch mispredicts in the body of the second loop.
-> The body of the second loop is made up of a 3-way if-statement. The bodies of two of the
clauses are the same. If we make the conditional statement less "branchy" then we should
be able to reduce the number of mispredicts.
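
A minimal sketch of both ideas, not the actual HeapRegion code. The heap layout (each object starts with a word holding its size in words), the names walk_with_prefetch, choose_three_way and choose_merged, the SKETCH_PREFETCH macro, and the two boolean conditions are all hypothetical placeholders. HotSpot itself would use its own Prefetch helpers rather than the GCC/Clang builtin assumed here.

```cpp
#include <cstddef>
#include <vector>

// Assumption: GCC/Clang builtin; compiles to a no-op elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_PREFETCH(p) __builtin_prefetch(p)
#else
#define SKETCH_PREFETCH(p) ((void)(p))
#endif

// Hypothetical layout: heap[cur] holds the size (in words) of the object
// starting at cur, so the next object starts at cur + size.
std::size_t walk_with_prefetch(const std::vector<std::size_t>& heap) {
  std::size_t visited = 0;
  std::size_t cur = 0;
  const std::size_t end = heap.size();
  while (cur < end) {
    const std::size_t size = heap[cur];   // the header read that misses
    if (size == 0) break;                 // guard against malformed input
    const std::size_t next = cur + size;
    if (next < end) {
      SKETCH_PREFETCH(&heap[next]);       // start loading the next header now
    }
    // ... per-object work on [cur, next) would go here and hide the latency ...
    ++visited;
    cur = next;
  }
  return visited;
}

enum Action { kWholeObject, kBoundedToRegion };

// 3-way form: two of the three clauses have the same body.
Action choose_three_way(bool cond_a, bool cond_b) {
  if (cond_a) {
    return kWholeObject;
  } else if (!cond_b) {
    return kWholeObject;
  } else {
    return kBoundedToRegion;
  }
}

// Merged form: one test. The non-short-circuit '|' evaluates both conditions
// unconditionally, which may let the compiler emit a single branch.
Action choose_merged(bool cond_a, bool cond_b) {
  const int same_body = static_cast<int>(cond_a) | static_cast<int>(!cond_b);
  return same_body != 0 ? kWholeObject : kBoundedToRegion;
}
```

Whether the merged form really produces fewer branches depends on the compiler and would need to be confirmed in the generated assembly.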
instanceKlass::oop_oop_iterate_[*]_nv()
* High number of DC misses while iterating over and de-referencing the oop maps associated with reference fields in an object.
-> Simple. Prefetch the next oop map entry.
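
As a rough illustration, assuming an oop map is an array of (offset, count) blocks describing runs of reference fields: the MapBlock struct, the function name, and the SKETCH_PREFETCH macro below are hypothetical, and summing the field values merely stands in for applying a closure to each field.

```cpp
#include <cstddef>

// Assumption: GCC/Clang builtin; compiles to a no-op elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_PREFETCH(p) __builtin_prefetch(p)
#else
#define SKETCH_PREFETCH(p) ((void)(p))
#endif

// Hypothetical oop map entry: a run of 'count' reference fields
// starting 'offset' words into the object.
struct MapBlock {
  std::size_t offset;
  std::size_t count;
};

long visit_reference_fields(const long* obj_words,
                            const MapBlock* map, std::size_t num_blocks) {
  long sum = 0;
  const MapBlock* const end_map = map + num_blocks;
  for (; map < end_map; ++map) {
    if (map + 1 < end_map) {
      SKETCH_PREFETCH(map + 1);           // fetch the next entry while we
    }                                     // process the current run of fields
    const long* p = obj_words + map->offset;
    const long* const p_end = p + map->count;
    for (; p < p_end; ++p) {
      sum += *p;                          // stand-in for closure->do_oop_nv(p)
    }
  }
  return sum;
}
```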
G1BlockOffsetArray::forward_to_block_containing_addr_slow()
* High number of DC misses while dereferencing objects during BOT walking.
-> Adding prefetching to these loops is a little trickier. We can't add a prefetch after we obtain the size of the current block because there is not enough of a code window between the prefetch and the subsequent use. Instead, if we use a fixed prefetch distance and issue the prefetch before reading the block size, we might get enough of a code window.
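
One way to picture the fixed-distance variant, under the same simplifying assumption that each block begins with a word holding its size in words. The function name, the SKETCH_PREFETCH macro, and the distance of 16 words are hypothetical; the right distance would have to be found by measurement.

```cpp
#include <cstddef>
#include <vector>

// Assumption: GCC/Clang builtin; compiles to a no-op elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_PREFETCH(p) __builtin_prefetch(p)
#else
#define SKETCH_PREFETCH(p) ((void)(p))
#endif

// Hypothetical tuning value, not taken from the G1 sources.
static const std::size_t kPrefetchDistanceWords = 16;

// Walks blocks forward from 'start' and returns the index of the block
// containing 'addr', or heap.size() if no such block is found.
std::size_t block_start_containing(const std::vector<std::size_t>& heap,
                                   std::size_t start, std::size_t addr) {
  std::size_t q = start;
  while (q < heap.size()) {
    // Issued BEFORE the size is read: the target does not depend on the
    // current block, so the prefetch gets a full iteration (or more) of lead.
    if (q + kPrefetchDistanceWords < heap.size()) {
      SKETCH_PREFETCH(&heap[q + kPrefetchDistanceWords]);
    }
    const std::size_t size = heap[q];     // block size read (the DC miss)
    if (size == 0) break;                 // guard against malformed input
    const std::size_t next = q + size;
    if (addr < next) {
      return q;                           // addr lies in [q, next)
    }
    q = next;
  }
  return heap.size();
}
```

The trade-off is that a fixed distance may prefetch a line that is not the start of any block, so some prefetches are wasted when blocks are large.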
FilterOutOfRegionClosure::do_oop_nv()
* High number of branches and branch mispredicts.
-> Most of these come from the concurrent refinement path and result from calling the virtual do_oop() routine of the closure(s) applied by the FilterOutOfRegionClosure. Using specialization, so that the non-virtual _nv version of do_oop() in these closures is called instead, should help.
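
A simplified sketch of the difference, assuming the filter forwards only references that point outside a region [lo, hi). All types here (OopClosure as modeled, CountingClosure, FilterVirtual, FilterSpecialized, apply_to_all) are hypothetical stand-ins, with "oops" modeled as plain ints; the real closures and the mechanism HotSpot uses for specialization differ.

```cpp
#include <cstddef>

// Hypothetical base closure with a virtual entry point.
struct OopClosure {
  virtual void do_oop(int* p) = 0;
  virtual ~OopClosure() {}
};

// Hypothetical concrete closure: counts the references it is applied to.
struct CountingClosure : public OopClosure {
  int count;
  CountingClosure() : count(0) {}
  void do_oop_nv(int* /*p*/) { ++count; }        // non-virtual fast path
  virtual void do_oop(int* p) { do_oop_nv(p); }  // virtual wrapper
};

// Filter holding a base-class pointer: every forwarded reference costs
// an indirect (virtual) call that the compiler usually cannot inline.
struct FilterVirtual {
  OopClosure* cl;
  int lo;
  int hi;
  void do_oop_nv(int* p) {
    if (*p < lo || *p >= hi) cl->do_oop(p);
  }
};

// Filter specialized on the concrete closure type: the call target is
// known at compile time, so do_oop_nv() can be called directly and inlined.
template <class ClosureType>
struct FilterSpecialized {
  ClosureType* cl;
  int lo;
  int hi;
  void do_oop_nv(int* p) {
    if (*p < lo || *p >= hi) cl->do_oop_nv(p);
  }
};

template <class Filter>
void apply_to_all(Filter& filter, int* refs, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    filter.do_oop_nv(&refs[i]);
  }
}
```

Both filters visit the same references; the specialized one simply removes the indirect call from the hot path.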
G1ParCopyHelper::copy_to_survivor_space()
* High number of mispredicts when calculating the object size (coming from size_given_klass).
-> It was thought that refactoring and flattening the if-statement in the routine might give some positive results. After performing such a refactoring and generating the assembly, I don't see any difference in the branches in the generated code.