JDK-8132375 : Investigate performance regressions on Sparc
Type: Sub-task
Component: hotspot
Sub-Component: compiler
Affected Version: 9
Priority: P2
Status: Resolved
Resolution: Fixed
Submitted: 2015-07-27
Updated: 2015-09-04
Resolved: 2015-09-04
We still have some significant performance regressions with Compact Strings on Sparc:
http://cr.openjdk.java.net/~huntch/string-density/reports/String-Density-SPARC-Microbenchmarks.pdf
We need to investigate possibilities to fix or mitigate these.
Comments
The most recent performance evaluation on Sparc looks good:
http://cr.openjdk.java.net/~thartmann/sd/sparc/benchmarks.pdf
Closing this as solved.
04-09-2015
I looked in detail at Charlie's Sparc performance evaluation [1] and created an evaluation similar to what Sandhya did for x86:
http://cr.openjdk.java.net/~thartmann/sd/sparc/benchmarks.pdf
http://cr.openjdk.java.net/~thartmann/sd/sparc/benchmarks.ods
I marked all quotients < 0.9 in red (regression) and all > 1.1 in green (improvement) and also added the x86 sheet as reference. Regressions that show up on both platforms are marked in orange.
Charlie spotted the following regressions:
1) String construction
On Sparc we don't have vector instructions as fast as those on x86 to hide the overhead of string compression/inflation. We therefore have to pay for the compression attempt while creating a string. This is especially costly if the string turns out not to be compressible, since we then have to throw away the array, allocate a new one and start copying again.
Assuming that a non-latin1 character usually shows up right at the beginning of the string, we could slightly improve things by looking at the first character before allocating the array and bailing out immediately if it is not compressible:
public static byte[] toBytes(char[] val, int off, int len) {
+ if (len > 0 && !canEncode(val[off])) {
+ return null;
+ }
byte[] ret = new byte[len];
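For reference, canEncode in the patch above is assumed to be the usual Latin-1 range check, roughly:

// Assumed helper: a character is Latin-1 encodable iff it fits into a single byte.
private static boolean canEncode(int cp) {
    return cp >>> 8 == 0;
}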
With the early bail-out in place, the case where the non-latin1 character is the first character of the string (cmp2_beg) improves:
## Baseline ##
Benchmark                  (size)  Mode  Cnt    Score    Error  Units
ConstructBench.cmp1             1  avgt   50   11.433 ±  0.260  ns/op
ConstructBench.cmp1            64  avgt   50   19.718 ±  0.893  ns/op
ConstructBench.cmp1          4096  avgt   50  993.784 ± 27.419  ns/op
ConstructBench.cmp2_beg         1  avgt   50   11.420 ±  0.147  ns/op
ConstructBench.cmp2_beg        64  avgt   50   20.364 ±  1.307  ns/op
ConstructBench.cmp2_beg      4096  avgt   50  977.194 ± 32.629  ns/op
ConstructBench.cmp2_end         1  avgt   50   11.443 ±  0.396  ns/op
ConstructBench.cmp2_end        64  avgt   50   20.219 ±  1.159  ns/op
ConstructBench.cmp2_end      4096  avgt   50  981.872 ± 28.179  ns/op
## String density ##
Benchmark                  (size)  Mode  Cnt     Score    Error  Units
ConstructBench.cmp1             1  avgt   50    12.363 ±  0.308  ns/op
ConstructBench.cmp1            64  avgt   50    19.056 ±  0.461  ns/op
ConstructBench.cmp1          4096  avgt   50   494.942 ± 11.556  ns/op
ConstructBench.cmp2_beg         1  avgt   50    16.717 ±  0.472  ns/op
ConstructBench.cmp2_beg        64  avgt   50    30.229 ±  3.006  ns/op
ConstructBench.cmp2_beg      4096  avgt   50  1051.894 ± 32.097  ns/op
ConstructBench.cmp2_end         1  avgt   50    16.701 ±  0.312  ns/op
ConstructBench.cmp2_end        64  avgt   50    29.484 ±  1.131  ns/op
ConstructBench.cmp2_end      4096  avgt   50  1462.369 ± 22.692  ns/op
## String density (patched) ##
Benchmark                  (size)  Mode  Cnt     Score    Error  Units
ConstructBench.cmp1             1  avgt   50    11.850 ±  0.232  ns/op
ConstructBench.cmp1            64  avgt   50    18.265 ±  0.743  ns/op
ConstructBench.cmp1          4096  avgt   50   513.200 ± 19.783  ns/op
ConstructBench.cmp2_beg         1  avgt   50    12.940 ±  0.258  ns/op
ConstructBench.cmp2_beg        64  avgt   50    21.744 ±  1.379  ns/op
ConstructBench.cmp2_beg      4096  avgt   50   973.557 ± 16.971  ns/op
ConstructBench.cmp2_end         1  avgt   50    12.941 ±  0.364  ns/op
ConstructBench.cmp2_end        64  avgt   50    29.077 ±  0.506  ns/op
ConstructBench.cmp2_end      4096  avgt   50  1462.440 ± 17.049  ns/op
Besides that, we also have quite a significant regression for string construction on x86.
2) String.toCharArray()
The reported 40% regression for a non-compressible string (cmp < 1) only shows up with a string size of 1 due to the overhead of coder selection. It's amortized for larger strings:
## Baseline ##
Benchmark         (cmp)       (seed)  (size)  Mode  Cnt     Score   Error  Units
ToCharArray.test    0    12345678900    4096  avgt   50  2767.987 ± 45.359  ns/op
ToCharArray.test    0.5  12345678900    4096  avgt   50  2762.768 ± 42.118  ns/op
ToCharArray.test    1    12345678900    4096  avgt   50  2761.386 ± 51.675  ns/op
## String density ##
Benchmark         (cmp)       (seed)  (size)  Mode  Cnt     Score    Error  Units
ToCharArray.test    0    12345678900    4096  avgt   50  2781.387 ±  48.589  ns/op
ToCharArray.test    0.5  12345678900    4096  avgt   50  2753.779 ±  39.920  ns/op
ToCharArray.test    1    12345678900    4096  avgt   50  3978.167 ± 102.646  ns/op
## String density (disabled) ##
Benchmark         (cmp)       (seed)  (size)  Mode  Cnt     Score  Error  Units
ToCharArray.test    0    12345678900    4096  avgt   50  2666.398 ± 8.994  ns/op
ToCharArray.test    0.5  12345678900    4096  avgt   50  2666.003 ± 8.041  ns/op
ToCharArray.test    1    12345678900    4096  avgt   50  2662.619 ± 3.251  ns/op
Of course, if the string is compressible (cmp = 1) there is a regression because we have to inflate it.
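For illustration, inflation is essentially a widening copy from the compressed byte[] back to a char[]; a minimal sketch of the scalar work involved (not the actual JDK code):

// Illustrative only: widen each Latin-1 byte back to a char.
static char[] inflate(byte[] value, int len) {
    char[] dst = new char[len];
    for (int i = 0; i < len; i++) {
        dst[i] = (char) (value[i] & 0xff);   // zero-extend the byte
    }
    return dst;
}

Without the fast vector instructions mentioned above, this loop is comparatively expensive on Sparc.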
3) StringBuilder.append()
I pushed some minor optimizations; I don't think there is much more we can do in the following cases:
- concat.ConcatCharBench.test_char2_cmp1
- concat.ConcatCharBench.test_cmp1_char2
- concat.ConcatStringsBench.test_cmp1_cmp2
- concat.ConcatStringsBench.test_cmp2_cmp1
The problem here is that we need to inflate cmp1 which is costly on Sparc.
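A minimal illustration of the pattern, assuming cmp1 is a fully Latin-1 string and cmp2 contains a non-latin1 character (the string values are placeholders): appending the UTF-16 operand forces the builder's compact representation to be inflated.

// Illustrative only: the second append cannot stay Latin-1 and triggers inflation.
String cmp1 = "latin1 only";
String cmp2 = "non-latin1 \u20AC";
String result = new StringBuilder().append(cmp1).append(cmp2).toString();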
We also have a regression for long concatenation:
- concat.ConcatLongBench.test_cmp2_long
- concat.ConcatLongBench.test_long_cmp2
C2 only optimizes string-int concatenation. The regression also shows up on x86. It is odd, though, that there is no regression for the cmp1_long case (this needs investigation).
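A minimal illustration of the two shapes (the string value is a placeholder): javac compiles both into StringBuilder append chains, but per the note above only the int-valued append is covered by C2's string concatenation optimization.

String cmp2 = "non-latin1 \u20AC";   // placeholder for the benchmark's cmp2 string
String withInt  = cmp2 + 42;         // String + int: covered by the optimization
String withLong = cmp2 + 42L;        // String + long: not covered, hence the regression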
4) String.indexOf(String str)
I was not able to reproduce any regressions with CompactStrings enabled. I executed 5 warmup iterations and 10 runs with default VM flags. Please note that we don't have an intrinsic for String.indexOf on Sparc.
5) String.compareTo()
The cross-coder intrinsic performs worse than the baseline. We have the same problem on x86. Maybe we can optimize it slightly.
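For context, the cross-coder case compares a compressed (Latin-1) string against a UTF-16 one, so every Latin-1 byte has to be widened on the fly. A rough scalar sketch of that work, with the UTF-16 side simplified to a char[] (not the actual JDK code):

// Illustrative scalar version of a Latin-1 vs. UTF-16 comparison.
static int compareLatin1ToUTF16(byte[] latin1, char[] utf16) {
    int lim = Math.min(latin1.length, utf16.length);
    for (int i = 0; i < lim; i++) {
        char c1 = (char) (latin1[i] & 0xff);   // widen the Latin-1 byte
        char c2 = utf16[i];
        if (c1 != c2) {
            return c1 - c2;
        }
    }
    return latin1.length - utf16.length;
}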
6) String.codePoint*
These are all executed with size 1 and only measure the coder selection overhead. We have the same regression on x86.
7) String.charAt(int index)
Slight regression due to coder selection. Same problem on x86.
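The coder selection overhead is the per-call branch on the string's coder; a simplified, self-contained illustration (field layout and names are illustrative, not the actual JDK code):

// Illustrative only: every access first has to check how the characters are stored.
static char charAt(byte[] value, byte coder, int index) {
    if (coder == 0) {                                   // LATIN1: one byte per char
        return (char) (value[index] & 0xff);
    } else {                                            // UTF16: two bytes per char
        return (char) ((value[index << 1] & 0xff)
                     | ((value[(index << 1) + 1] & 0xff) << 8));
    }
}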
8) Encoding benchmarks
It seems we still have some issues with the encoding benchmarks. The regressions with size 1 are again explained by coder selection overhead, but there are more regressions, especially if we disable CompactStrings!
### Summary ###
We have the following Sparc specific problems:
- Compression/inflation is expensive but we also have a regression on x86. Probably not much we can do here.
- We have a regression in String.indexOf() if CompactStrings is disabled.
- The String.compareTo cross-coder intrinsic performs badly.
- Do we need a StringCoder.hasNegatives() intrinsic on Sparc? (A sketch of the scalar loop it would replace is below.)
General problems:
- Why is string concatenation of long_cmp1 performing better than long_cmp2?
- Encoding benchmarks perform badly, and even worse if CompactStrings is disabled.
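Regarding the hasNegatives() question above: a rough sketch of the scalar loop such an intrinsic would replace (the body here is an assumption, not the actual implementation):

// Returns true if any byte in the range has its high bit set (i.e. is not plain ASCII).
static boolean hasNegatives(byte[] ba, int off, int len) {
    for (int i = off; i < off + len; i++) {
        if (ba[i] < 0) {
            return true;
        }
    }
    return false;
}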