JDK-8146801 : Allocating short arrays of non-constant size is slow
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 9
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • CPU: x86
  • Submitted: 2016-01-11
  • Updated: 2021-02-02
  • Resolved: 2016-03-04
JDK 9
9 b112: Fixed
Description
When allocating an array of statically known size, our current hot-path zeroing strategy seems to split the zeroing into individual stores when the size is small. This does not happen at all for arrays of non-constant size, which incurs a significant penalty when allocating small arrays.
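For illustration, the two allocation shapes being compared can be sketched as below. This is not the linked EmptyArrayBench source; the class name, method names, the byte[] element type, and the field value are assumptions made for the example.

```java
// Hypothetical sketch of the two allocation shapes; not the actual benchmark code.
final class ArrayAllocShapes {
    // Assumed: a plain instance field, so the JIT cannot treat the length as a constant.
    int size = 8;

    // Length is a compile-time constant: zeroing can be emitted as individual stores.
    byte[] allocConstant() {
        return new byte[8];
    }

    // Length is read from a field: zeroing goes through the generic path.
    byte[] allocField() {
        return new byte[size];
    }
}
```

Both methods return an array of the same length; only the way the compiler learns that length differs.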

Benchmark:
 http://cr.openjdk.java.net/~shade/8146801/EmptyArrayBench.java
 http://cr.openjdk.java.net/~shade/8146801/benchmarks.jar

Performance data:
  http://cr.openjdk.java.net/~shade/8146801/notes.txt

The crux of the issue seems to be the large setup cost of "rep stos" (see [1]). Note that Agner argues [2] that rep instructions are still future-proof, because they allow CPUs to select the appropriate implementation, so it seems we only need to cater for the setup cost here. In C2, we avoid "rep stos" on small arrays when the size is known statically. In C1, we always do the looped mov, which is, amusingly, faster than C2's attempt at "rep stos" on small arrays. It might be worthwhile to check the array size in the zeroing path and do the looped initialization for small sizes.
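The suggested size check can be sketched at the Java level as follows. This only illustrates the dispatch logic, not the HotSpot change itself, which would live in the compiler's generated code: the class name, the SMALL_WORDS cutoff, and the use of Arrays.fill as a stand-in for "rep stos" are all assumptions.

```java
import java.util.Arrays;

// Hypothetical sketch of "check the size, loop for small arrays"; not HotSpot code.
final class ZeroingSketch {
    // Assumed cutoff in 8-byte words; the issue does not state a value.
    static final int SMALL_WORDS = 8;

    // Zeroes the first `count` elements of `words`.
    static void zero(long[] words, int count) {
        if (count <= SMALL_WORDS) {
            // Small case: a plain loop of stores, avoiding any bulk-operation setup cost.
            for (int i = 0; i < count; i++) {
                words[i] = 0L;
            }
        } else {
            // Large case: a bulk fill, standing in here for "rep stos".
            Arrays.fill(words, 0, count, 0L);
        }
    }
}
```

The extra branch is the cost of this approach; the bet is that it is much cheaper than the bulk-operation setup it avoids on short arrays.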

C2 non-constant size = 8: 12.610 ± 0.193 ns/op
  http://cr.openjdk.java.net/~shade/8146801/c2-field-8.perfasm

C2 constant size = 8: 4.681 ± 0.135 ns/op
  http://cr.openjdk.java.net/~shade/8146801/c2-const-8.perfasm

C1 non-constant size = 8: 6.839 ± 0.103 ns/op
  http://cr.openjdk.java.net/~shade/8146801/c1-field-8.perfasm

C1 constant size = 8: 6.843 ± 0.079 ns/op
  http://cr.openjdk.java.net/~shade/8146801/c1-const-8.perfasm

[1] http://www.agner.org/optimize/optimizing_assembly.pdf , 17.9, "Moving blocks of data (All processors)"
[2] http://www.agner.org/optimize/optimizing_assembly.pdf , 17.9, "Moving data on future processors"
Comments
Candidate webrev that brings the field_* tests almost up to the performance of the const_* tests: http://cr.openjdk.java.net/~shade/8146801/webrev.02/
29-02-2016

RFR: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2016-February/021720.html
29-02-2016