JDK-8059461 : Refactor IndexSet for better performance
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 8u20,9,10
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2014-09-30
  • Updated: 2019-01-28
If you run jdk9-dev on the Nashorn/Octane suite with -XX:+CITime enabled:

~/trunks/jdk9-dev/build/linux-x86_64-normal-server-release/images/j2sdk-image/bin/java -XX:+CITime -jar ~/trunks/jdk9-dev/build/linux-x86_64-normal-server-release/images/j2sdk-image/jre/lib/ext/nashorn.jar -Dnashorn.typeInfo.disabled=false --class-cache-size=0 --persistent-code-cache=false -scripting --log=time test/script/basic/run-octane.js -- --iterations 1 

Then you will see this profile:

A remarkable amount of time is spent in regalloc, under "Coalesce 2". If you dive in there with a profiler, you will realize we spend most of the time in update_ifg, modifying the IFG neighbors for the new (merged) live range. I see few easy algorithmic improvements to make here:

void PhaseConservativeCoalesce::update_ifg(uint lr1, uint lr2, IndexSet *n_lr1, IndexSet *n_lr2) {
  // ... (earlier parts of the method elided)
  uint neighbor;               // declarations from the surrounding code,
  LRG &lrg1 = lrgs(lr1);       // elided in the original snippet

  // Some neighbors of intermediate copies now interfere with the
  // combined live range.
  IndexSetIterator three(&_ulr);
  while ((neighbor = three.next()) != 0)
    if (_phc._ifg->neighbors(neighbor)->insert(lr1))
      lrgs(neighbor).inc_degree(lrg1.compute_degree(lrgs(neighbor)));
  // ...
}

But the native profiler shows we spend a lot of time walking the actual IndexSet containing the IFG.
Because of that, I tried to see how IndexSet reacts to a simple change in bit distribution:

  // The length of the preallocated top level block array
  enum { preallocated_block_list_size = 16 };

  // The lengths of the index bitfields
  enum { bit_index_length = 5,
         word_index_length = 3,
         block_index_length = 8 // not used
  };
Current IndexSet allocates 32 bits per word, 8 words per block, and pre-allocates 16 blocks.
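For reference, those field widths decompose an element index like this (a hypothetical sketch of the indexing arithmetic with illustrative names, not the actual HotSpot code): 2^5 = 32 bits per word and 2^3 = 8 words per block give 256 bits per block, with any higher bits selecting a block in the chain.

```cpp
#include <cstdint>

// Default IndexSet field widths quoted above.
const uint32_t bit_index_length  = 5;
const uint32_t word_index_length = 3;
const uint32_t bits_per_word   = 1u << bit_index_length;          // 32
const uint32_t words_per_block = 1u << word_index_length;         // 8
const uint32_t bits_per_block  = bits_per_word * words_per_block; // 256

struct Position { uint32_t block, word, bit; };

// Decompose an element index into (block, word, bit) coordinates.
Position locate(uint32_t element) {
  Position p;
  p.bit   = element & (bits_per_word - 1);                         // low 5 bits
  p.word  = (element >> bit_index_length) & (words_per_block - 1); // next 3 bits
  p.block = element >> (bit_index_length + word_index_length);     // the rest
  return p;
}
// Example: element 300 = 1 full 256-bit block + 44 bits,
// so it lands in block 1, word 1 (44/32), bit 12 (44%32).
```

Widening word_index_length coarsens the blocks: fewer, larger blocks mean shorter chains and better locality, which is exactly what the experiments below vary.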

Tuning the word count up (coarsening the blocks) yields a considerable improvement in compile time:

* Baseline:  32 bits per word, 8 words per block, 16 blocks pre-allocated:

  Total compilation time   : 508.008 s
    C2 Compile Time:      393.537 s
       Regalloc:            162.268 s
         Coalesce 2:           53.094 s
  Average compilation speed :    22518 bytes/s

* 32 bits per word, 16 words per block, 8 blocks pre-allocated:

  Total compilation time   : 488.732 s
    C2 Compile Time:      377.150 s
       Regalloc:            149.755 s
         Coalesce 2:           50.160 s
  Average compilation speed :    22568 bytes/s

* 32 bits per word, 32 words per block, 4 blocks pre-allocated:

  Total compilation time   : 461.385 s
    C2 Compile Time:      349.802 s
       Regalloc:            126.326 s
         Coalesce 2:           28.198 s
  Average compilation speed :    23928 bytes/s

* 32 bits per word, 64 words per block, 2 blocks pre-allocated:

  Total compilation time   : 468.099 s
    C2 Compile Time:      351.622 s
       Regalloc:            131.058 s
         Coalesce 2:           29.227 s
  Average compilation speed :    23669 bytes/s

* 32 bits per word, 128 words per block, 1 block pre-allocated:

  Total compilation time   : 464.916 s
    C2 Compile Time:      349.978 s
       Regalloc:            129.082 s
         Coalesce 2:           24.206 s
  Average compilation speed :    23423 bytes/s

Therefore, the block chain in IndexSet brings no real performance benefit, and allocating a single large block is
better for performance (better locality). With results like these, I think we would be better off gutting the IndexSet internals
and replacing them with BitMap. BitMap is dynamically resizable, so we would arguably have a similar footprint.
Moving to BitMap would additionally give 64 bits per word on 64-bit platforms, since BitMap allocates words as uintptr_t,
not uint32_t as IndexSet does.
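To make that suggestion concrete, here is a hedged sketch of what a flat, BitMap-style backing store could look like: a single dynamically resized array of uintptr_t words, so 64 bits per word on 64-bit platforms. The class and method names are illustrative, not HotSpot's actual BitMap API; note that insert() keeps the "returns true iff the bit was newly set" contract that update_ifg relies on above.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Illustrative flat bitmap with uintptr_t words, replacing the chained
// 32-bit blocks of IndexSet. Not HotSpot's BitMap; a sketch of the idea.
class FlatBitSet {
  static constexpr std::size_t bits_per_word = 8 * sizeof(uintptr_t);
  std::vector<uintptr_t> _words;
public:
  // Set the bit for 'element'; return true iff it was newly set
  // (the contract update_ifg depends on to bump neighbor degrees).
  bool insert(std::size_t element) {
    std::size_t w = element / bits_per_word;
    if (w >= _words.size()) _words.resize(w + 1, 0);  // dynamic resize
    uintptr_t mask = uintptr_t(1) << (element % bits_per_word);
    bool was_clear = (_words[w] & mask) == 0;
    _words[w] |= mask;
    return was_clear;
  }
  bool contains(std::size_t element) const {
    std::size_t w = element / bits_per_word;
    if (w >= _words.size()) return false;
    return (_words[w] >> (element % bits_per_word)) & 1;
  }
};
```

Iteration would simply scan the contiguous word array, which is where the locality win over pointer-chased blocks should come from.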
To add one data point: I ran the "32 words per block + 4 blocks pre-allocated" experiment through valgrind, and I can see that on the kind of C2 compilations we see in typical startup scenarios this actually increases the number of instructions retired in various methods of IndexSet and IndexSetIterator (IndexSetIterator::advance_and_next more than doubled in one profile).

I looked at this briefly last week, and it was not very conclusive. The previous data was gathered before we had live node renumbering (JDK-8129847), so IndexSet density might be different now. I still have to remeasure this with Nashorn on current jdk/jdk.

A review thread was started but never concluded: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2014-November/016149.html

[~shade], to me it seems that tuning the current IndexSet to "32 bits per word, 32 words per block, 4 blocks pre-allocated" achieved a respectable speed-up, close to the full rewrite in http://cr.openjdk.java.net/~shade/8059461/webrev.03/, yet it would not be encumbered with the same concerns that patch got caught up in. Would it be appropriate to revive this with the simpler tune-up first and then (possibly) revisit the BitMap approach as a follow-up?

More cleanups: http://cr.openjdk.java.net/~shade/8059461/webrev.03/

Looks really nice! Thanks, Aleksey, for investigating this.

Update to proof-of-concept: http://cr.openjdk.java.net/~shade/8059461/webrev.02/

Improves compilation speed even more drastically (baseline was properly remeasured as well): http://cr.openjdk.java.net/~shade/8059461/perf.webrev.02/

$ grep "compilation speed" baseline-*
baseline-1.log:  Average compilation speed :    23460 bytes/s
baseline-2.log:  Average compilation speed :    23012 bytes/s
baseline-3.log:  Average compilation speed :    22670 bytes/s
baseline-4.log:  Average compilation speed :    22320 bytes/s
baseline-5.log:  Average compilation speed :    22090 bytes/s

$ grep "compilation speed" patched-*
patched-1.log:   Average compilation speed :    24267 bytes/s
patched-2.log:   Average compilation speed :    25415 bytes/s
patched-3.log:   Average compilation speed :    24285 bytes/s
patched-4.log:   Average compilation speed :    24178 bytes/s
patched-5.log:   Average compilation speed :    24185 bytes/s

$ grep "Regalloc:" baseline-*
baseline-1.log:  Regalloc:  152.231 s
baseline-2.log:  Regalloc:  161.494 s
baseline-3.log:  Regalloc:  148.808 s
baseline-4.log:  Regalloc:  157.748 s
baseline-5.log:  Regalloc:  161.993 s

$ grep "Regalloc:" patched-*
patched-1.log:   Regalloc:  139.891 s
patched-2.log:   Regalloc:  124.626 s
patched-3.log:   Regalloc:  129.345 s
patched-4.log:   Regalloc:  133.072 s
patched-5.log:   Regalloc:  135.596 s

Proof-of-concept: http://cr.openjdk.java.net/~shade/8059461/webrev.01/

Improves compilation speed:

http://cr.openjdk.java.net/~shade/8059461/baseline-nashorn.log
  Average compilation speed :    21963 bytes/s

http://cr.openjdk.java.net/~shade/8059461/webrev-01-nashorn.log
  Average compilation speed :    22522 bytes/s

(N.B. Comparing the compiler times seems dubious for Nashorn, because the number of compiled methods is different. But since we are compiling lots of methods, the average compilation speed seems to be a good metric.)