While looking over some collect/analyze profiles measuring data cache (DC) misses, branches, and branch mispredicts, some "high" metric items were identified in the following routines:

HeapRegion::oops_on_card_seq_iterate_careful()
* High DC misses when attempting to read the klass of the current object in both loops.
* High number of branch mispredicts in the body of the second loop.

instanceKlass::oop_oop_iterate_[*]_nv()
* High number of DC misses while iterating over and de-referencing the reference fields in an object.

G1BlockOffsetArray::forward_to_block_containing_addr_slow()
* High number of DC misses while dereferencing objects during BOT walking.

FilterOutOfRegionClosure::do_oop_nv()
* High number of branches and branch mispredicts.

G1ParCopyHelper::copy_to_survivor_space()
* High number of mispredicts when calculating the object size (coming from size_given_klass).

Proposed changes:

HeapRegion::oops_on_card_seq_iterate_careful()
* High DC misses when attempting to read the klass of the current object in both loops.
  -> Add a prefetch of the next object after we obtain the size of the current object. The second loop looks like the better candidate for such a prefetch. I don't think that there is enough of a code window between the prefetch in iteration n and the use in iteration n+1.
* High number of branch mispredicts in the body of the second loop.
  -> The body of the second loop is made up of a 3-way if-statement, and the body of two of the clauses is the same. If we make the conditional statement "less" branchy then we should be able to reduce the mispredicts.

instanceKlass::oop_oop_iterate_[*]_nv()
* High number of DC misses while iterating over and de-referencing the oop maps associated with reference fields in an object.
  -> Simple. Prefetch the next oop map entry.

G1BlockOffsetArray::forward_to_block_containing_addr_slow()
* High number of DC misses while dereferencing objects during BOT walking.
  -> Adding prefetching to these loops is a little bit more tricky. We can't add a prefetch after we obtain the size of the current block because there is not enough of a code window between the prefetch and the subsequent use. If we instead use a fixed prefetch amount and issue the prefetch before reading the block size, then we might get enough of a code window.

FilterOutOfRegionClosure::do_oop_nv()
* High number of branches and branch mispredicts.
  -> Most of these come from the concurrent refinement pathway and are a result of calling the virtual do_oop() routine in the closure(s) applied by the FilterOutOfRegionClosure. Using specialization so that the non-virtual _nv version of do_oop() of these closures is called should help.

G1ParCopyHelper::copy_to_survivor_space()
* High number of mispredicts when calculating the object size (coming from size_given_klass).
  -> It was thought that refactoring and flattening the if-statement in the routine might have given some positive results. After performing such a refactoring and generating the assembly, I don't see any difference in the branches in the generated code.
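
The two prefetch placements proposed above can be sketched roughly as follows. This is only an illustration, not the HotSpot code: the toy block layout (one header word holding the size in words), the PREFETCH_READ macro, and the function names walk_prefetch_next, forward_to_block_containing and make_heap are all assumptions made for the example. HotSpot has its own prefetch wrappers; the sketch uses the GCC/Clang builtin __builtin_prefetch instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// __builtin_prefetch is a GCC/Clang builtin; on other compilers this is a no-op.
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(addr) __builtin_prefetch((addr), 0, 3)
#else
#define PREFETCH_READ(addr) ((void)0)
#endif

// Toy layout (an assumption, not the HotSpot object layout): every block
// starts with one header word holding the block size in words, header included.
typedef std::uintptr_t Word;

// Builds a contiguous toy "heap" from a list of block sizes.
std::vector<Word> make_heap(const std::vector<std::size_t>& sizes) {
  std::vector<Word> heap;
  for (std::size_t s : sizes) {
    assert(s > 0);
    heap.push_back(static_cast<Word>(s));
    heap.resize(heap.size() + (s - 1), 0);  // zero-filled payload words
  }
  return heap;
}

// Variant 1: prefetch the next header once the current size is known.
// This only pays off if process() does enough work to cover the miss.
template <typename F>
std::size_t walk_prefetch_next(const Word* cur, const Word* end, F process) {
  std::size_t visited = 0;
  while (cur < end) {
    std::size_t size = static_cast<std::size_t>(*cur);
    assert(size > 0);
    const Word* next = cur + size;
    if (next < end) PREFETCH_READ(next);
    process(cur, size);
    cur = next;
    ++visited;
  }
  return visited;
}

// Variant 2: tight walk with no work between iterations. The prefetch is
// issued a fixed distance ahead, before the current size is read.
// Returns the start of the block that contains addr.
const Word* forward_to_block_containing(const Word* cur, const Word* end,
                                        const Word* addr,
                                        std::size_t distance_words) {
  assert(cur <= addr && addr < end);
  for (;;) {
    if (static_cast<std::size_t>(end - cur) > distance_words) {
      PREFETCH_READ(cur + distance_words);
    }
    assert(*cur > 0);
    const Word* next = cur + *cur;
    if (next > addr) return cur;
    cur = next;
  }
}
```

The fixed distance in the second variant would need tuning against real profiles; a value that is too small gives no code window, and one that is too large may pull in lines that are evicted before use.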
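
The two branch-related proposals (making a 3-way if-statement with two identical clauses "less" branchy, and calling the non-virtual _nv routine through specialization) might look something like the sketch below. The predicates is_array and fully_inside, the Action enum, CountingClosure, and the FilterOutOfRegion template are hypothetical stand-ins, not the HotSpot classes; the filter is assumed to forward only non-null references that point outside the region. Whether the merged condition really produces fewer branches has to be confirmed in the generated assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

typedef std::uintptr_t Addr;  // toy "reference"; 0 stands for null

enum Action { kIterateAll, kIterateBounded };

// Original shape: three clauses, two of which share the same body.
Action classify_branchy(bool is_array, bool fully_inside) {
  if (!is_array) {
    return kIterateAll;
  } else if (fully_inside) {
    return kIterateAll;  // same body as the first clause
  } else {
    return kIterateBounded;
  }
}

// Merged shape: both predicates are evaluated and combined with a
// non-short-circuit OR, leaving a single test in the source.
Action classify_merged(bool is_array, bool fully_inside) {
  int take_all = static_cast<int>(!is_array) | static_cast<int>(fully_inside);
  return take_all != 0 ? kIterateAll : kIterateBounded;
}

// Hypothetical closure with both a virtual entry point and an _nv variant.
struct CountingClosure {
  std::size_t count;
  CountingClosure() : count(0) {}
  virtual ~CountingClosure() {}
  void do_oop_nv(Addr* p) { (void)p; ++count; }
  virtual void do_oop(Addr* p) { do_oop_nv(p); }
};

// Filter specialized on the concrete closure type, so the call to
// do_oop_nv() is bound at compile time instead of going through the vtable.
template <typename ClosureT>
class FilterOutOfRegion {
 public:
  FilterOutOfRegion(Addr bottom, Addr end, ClosureT* cl)
      : _bottom(bottom), _end(end), _cl(cl) {}

  void do_oop_nv(Addr* p) {
    Addr obj = *p;
    if (obj != 0 && (obj < _bottom || obj >= _end)) {
      _cl->do_oop_nv(p);  // statically bound, can be inlined
    }
  }

 private:
  Addr _bottom;
  Addr _end;
  ClosureT* _cl;
};
```

Instantiating the filter as FilterOutOfRegion<CountingClosure> removes the indirect call on the hot path, at the cost of one template instantiation per closure type.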