Bug ID: JDK-8191328 Avoid unnecessary overhead in CRC32C

Type: Enhancement
Component: core-libs
Sub-Component: java.util.jar

Priority: P4
Status: New
Resolution: Unresolved
OS: generic
CPU: generic

Submitted: 2017-11-15
Updated: 2017-11-28

The pure Java implementation of java.util.zip.CRC32C branches on ByteOrder.nativeOrder() inside its main loop. In some circumstances this may cost ~30% of the time. E.g. I have seen it on x86 and aarch64 with

-XX:DisableIntrinsic=_updateBytesCRC32C -Xcomp

Those branches can be moved out of loops without bloating class code.

This may help when the HotSpot intrinsic is disabled or missing for the platform, or when running on some other VM.
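The idea of moving the byte-order check out of the loop can be sketched as follows. This is only an illustration under stated assumptions, not the code from java.util.zip.CRC32C or from the webrev: the class name Crc32cHoistSketch, the helpers step and tail, and the slicing-by-4 tables are hypothetical, and words are read through a ByteBuffer rather than Unsafe. Both variants compute the same CRC-32C; the second selects the loop once instead of testing ByteOrder.nativeOrder() on every word.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the JDK implementation or the proposed patch.
final class Crc32cHoistSketch {
    private static final int POLY = 0x82F63B78;        // reflected CRC-32C (Castagnoli) polynomial
    private static final int[][] T = new int[4][256];  // slicing-by-4 lookup tables

    static {
        for (int i = 0; i < 256; i++) {
            int c = i;
            for (int k = 0; k < 8; k++) {
                c = (c & 1) != 0 ? (c >>> 1) ^ POLY : c >>> 1;
            }
            T[0][i] = c;
        }
        for (int i = 0; i < 256; i++) {
            for (int j = 1; j < 4; j++) {
                int prev = T[j - 1][i];
                T[j][i] = (prev >>> 8) ^ T[0][prev & 0xFF];
            }
        }
    }

    // Folds one 32-bit value (crc xor little-endian data word) through the tables.
    private static int step(int x) {
        return T[3][x & 0xFF] ^ T[2][(x >>> 8) & 0xFF]
             ^ T[1][(x >>> 16) & 0xFF] ^ T[0][x >>> 24];
    }

    // Processes the remaining bytes one at a time.
    private static int tail(int crc, byte[] b, int off) {
        for (; off < b.length; off++) {
            crc = (crc >>> 8) ^ T[0][(crc ^ b[off]) & 0xFF];
        }
        return crc;
    }

    // Variant 1: the byte-order check is evaluated on every iteration.
    static int branchInLoop(byte[] b) {
        ByteBuffer buf = ByteBuffer.wrap(b).order(ByteOrder.nativeOrder());
        int crc = 0xFFFFFFFF;
        int off = 0;
        for (; off + 4 <= b.length; off += 4) {
            int w = buf.getInt(off);
            if (ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN) {
                w = Integer.reverseBytes(w);
            }
            crc = step(crc ^ w);
        }
        return ~tail(crc, b, off);
    }

    // Variant 2: the check is done once and each loop body is branch-free.
    static int branchHoisted(byte[] b) {
        ByteBuffer buf = ByteBuffer.wrap(b).order(ByteOrder.nativeOrder());
        int crc = 0xFFFFFFFF;
        int off = 0;
        if (ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN) {
            for (; off + 4 <= b.length; off += 4) {
                crc = step(crc ^ buf.getInt(off));
            }
        } else {
            for (; off + 4 <= b.length; off += 4) {
                crc = step(crc ^ Integer.reverseBytes(buf.getInt(off)));
            }
        }
        return ~tail(crc, b, off);
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        // The standard CRC-32C check value for "123456789" is e3069283.
        System.out.println(Integer.toHexString(branchInLoop(data)));
        System.out.println(Integer.toHexString(branchHoisted(data)));
    }
}
```

The hoisted variant duplicates only the short word loop, so the class grows by a few lines while the per-word branch disappears. How much that matters depends on whether the compiler already folds the check, which is what the -Xcomp and interpreter numbers in the comments measure.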

Fixed C1
http://cr.openjdk.java.net/~dchuyko/8191328/webrev.01/

Updated benchmark
http://cr.openjdk.java.net/~dchuyko/8191328/webrev.01/CRC32CAltBench.java

x86 results, JDK 10:

Tiered
before 375 ± 6 ns/op
after 334 ± 3 ns/op
11%

Tiered with Graal (JVMCI)
before 356 ± 7 ns/op
after 327 ± 6 ns/op
8%

Tiered with AOT compiled benchmark (non-tiered)
before 1308 ± 58 ns/op
after 1010 ± 8 ns/op
1.3x

Tiered with -XX:MaxInlineLevel=0
before 660 ± 4 ns/op
after 338 ± 3 ns/op
1.9x

C1
before 498 ± 4 ns/op
after 495 ± 4 ns/op
same

Interpreter
before 40844 ± 333 ns/op
after 24777 ± 624 ns/op
1.7x

28-11-2017

A benchmark that is easier to experiment with (no need to build jdk or to turn off intrinsics):
http://cr.openjdk.java.net/~dchuyko/8191328/CRC32CAltBench.java

Some more x86 results, JDK 9:

default tiered
before 380.957 ± 11.621 ns/op
after 350.838 ± 5.149 ns/op

-XX:MaxInlineLevel=0
before 656.791 ± 8.216 ns/op
after 340.999 ± 2.686 ns/op

-Xint
before 36113.441 ± 197.716 ns/op
after 26928.593 ± 133.309 ns/op

17-11-2017

http://cr.openjdk.java.net/~dchuyko/8191328/webrev.00/

nativeOrder() checks are now outside loops; xor and load operations for crc construction in the loop now have fewer data dependencies.

Consider the following benchmark, CRC32CBench.calcCRC32C:
http://cr.openjdk.java.net/~dchuyko/8189177/crc32c/CRC32CBench.java
running for size=512 with
-XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_updateBytesCRC32C -Xcomp

x86 Core i5
before: 1013.365 ± 6.539 ns/op
after: 360.292 ± 1.941 ns/op (2.8x)

aarch64 Cavium ThunderX2
before: 4853.402 ± 8.232 ns/op
after: 650.265 ± 1.541 ns/op (7.5x)

Also, instead of being slower, it is finally ~2x faster than Hadoop's pure Java implementation, which doesn't use Unsafe:
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/PureJavaCrc32C.java
in this performance test (with the same options):
https://github.com/dchuyko/hadoop/blob/HADOOP-15033/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/Crc32PerformanceTest.java

And it is of course still ~2-3x slower than intrinsic implementations.

15-11-2017