JDK-8157495 : SHA-3 Hash algorithm performance improvements (~12x speedup)
  • Type: Enhancement
  • Component: security-libs
  • Sub-Component: java.security
  • Affected Version: 9
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • CPU: generic
  • Submitted: 2016-05-21
  • Updated: 2017-05-19
  • Resolved: 2016-06-10
JDK 9
9 b123  Fixed
Description
Performance of the SHA3 (SUN provider) hash algorithms was checked.

As a result we got poor performance: it is ~50% slower than the alternative BouncyCastle Java SHA3 implementation.
Moreover, the SHA3 implementation has shown very high memory consumption. On average the allocation rate is ~400 bytes per byte of hashed message for SHA3-512 and ~200 bytes per byte of hashed message for SHA3-256.
So, calculating a digest over 1 MB of original data we got:
Allocation rate (SHA3-512) = 380 MB/op,
Allocation rate (SHA3-256) = 202 MB/op
(where "op" means a single digest invocation).

Here are the performance results:

Benchmark                  (algorithm)  (dataSize)    Score   Error  Units
MessageDigestBench.digest  SHA3-224     1048576      11.537 ± 0.059  ops/s
MessageDigestBench.digest  SHA3-256     1048576      11.088 ± 0.085  ops/s
MessageDigestBench.digest  SHA3-384     1048576       8.459 ± 0.079  ops/s
MessageDigestBench.digest  SHA3-512     1048576       5.851 ± 0.059  ops/s
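(For context: the MessageDigestBench source is not included in this report. The following is only a minimal sketch, under the assumption that it is an ordinary JMH throughput benchmark over java.security.MessageDigest, parameterized by algorithm and data size.)

    import java.security.MessageDigest;
    import java.util.Random;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    // Hypothetical reconstruction for illustration only, not the actual benchmark.
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    @State(Scope.Thread)
    public class MessageDigestBench {

        @Param({"SHA3-224", "SHA3-256", "SHA3-384", "SHA3-512"})
        private String algorithm;

        @Param({"1048576"})
        private int dataSize;

        private byte[] data;
        private MessageDigest md;

        @Setup
        public void setup() throws Exception {
            data = new byte[dataSize];
            new Random(42).nextBytes(data);          // fixed seed: same input every run
            md = MessageDigest.getInstance(algorithm, "SUN");
        }

        @Benchmark
        public byte[] digest() {
            return md.digest(data);                  // one op = one full digest of dataSize bytes
        }
    }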

The suggested patch practically eliminates the memory consumption: the allocation rate is ~871 bytes per digest invocation and does not depend on the input data size.
SHA3 performance was improved by ~6x:

Benchmark                  (algorithm)  (dataSize)    Score   Error  Units
MessageDigestBench.digest  SHA3-224     1048576      62.418 ± 0.319  ops/s
MessageDigestBench.digest  SHA3-256     1048576      58.933 ± 0.575  ops/s
MessageDigestBench.digest  SHA3-384     1048576      44.358 ± 4.114  ops/s
MessageDigestBench.digest  SHA3-512     1048576      30.029 ± 3.492  ops/s

Comments
Performance results after the last modifications (the same platform as above):

Benchmark                  (algorithm)  (dataSize)  (provider)  Mode   Cnt    Score   Error  Units
MessageDigestBench.digest  SHA3-224     1048576                 thrpt   40  115.893 ± 1.681  ops/s
MessageDigestBench.digest  SHA3-256     1048576                 thrpt   40  109.004 ± 1.782  ops/s
MessageDigestBench.digest  SHA3-384     1048576                 thrpt   40   81.439 ± 2.039  ops/s
MessageDigestBench.digest  SHA3-512     1048576                 thrpt   40   59.334 ± 1.268  ops/s
27-05-2016

Yet another performance improvement was found:

10. Merge functions smPi & smRho: perform the rotation together with the permutation. Speedup: ~100% (2x).

Thus, the overall speedup is ~12x. A new webrev file is attached.
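A minimal sketch of the idea behind point 10 (not the code from the webrev; the chain/rot tables below are hypothetical): when the state is a flat long[25] and the smPi permutation is a single cycle, each lane can be rotated by its smRho offset at the moment it is moved, so the two passes over the state collapse into one.

    // Illustration only: "chain" is assumed to list the flat state indices of
    // pi's permutation cycle (lane (0,0) is a fixed point and is omitted), and
    // rot[i] is assumed to be the rho rotation of the lane that moves into
    // position chain[i]. Neither table is taken from the actual patch.
    static void rhoPi(long[] a, int[] chain, int[] rot) {
        long saved = a[chain[0]];                    // first cycle element, overwritten below
        for (int i = 0; i < chain.length - 1; i++) {
            a[chain[i]] = Long.rotateLeft(a[chain[i + 1]], rot[i]);
        }
        a[chain[chain.length - 1]] = Long.rotateLeft(saved, rot[chain.length - 1]);
    }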
27-05-2016

Here is a detailed explanation of the performance modifications. Incremental performance improvements and memory consumption reductions are shown relative to the previous point. Memory consumption was measured for 1 MB of input data and the SHA3-512 algorithm.

1. Minor code cleanup:
   - replace circularShiftLeft with Long.rotateLeft
   - replace v^0xF...F with ~v
2. In-place "state" transformations (don't allocate additional memory).
3. SHA3::smTheta - total unroll of the loops over local arrays, elimination of the local arrays (changed to locals).
   Speedup: ~4%. Allocation rate: reduced from 380 MB to 340 MB.
4. SHA3::smRho - in-place operation (don't allocate a new "lanes" array).
   Speedup: ~12%. Allocation rate: reduced from 340 MB to 228 MB.
5. SHA3::smPi - in-place "lanes" permutation. In order to achieve the in-place permutation, it was checked how the data move. As a result an allocation-free chain of data moves was discovered, but it requires total loop unrolling, like:

   long tmp = a[2][0];
   a[2][0] = a[0][1]; a[0][1] = a[1][1]; a[1][1] = a[1][4]; a[1][4] = a[4][2];
   a[4][2] = a[2][4]; a[2][4] = a[4][0]; a[4][0] = a[0][2]; a[0][2] = a[2][2];
   a[2][2] = a[2][3]; a[2][3] = a[3][4]; a[3][4] = a[4][3]; a[4][3] = a[3][0];
   a[3][0] = a[0][4]; a[0][4] = a[4][4]; a[4][4] = a[4][1]; a[4][1] = a[1][3];
   a[1][3] = a[3][1]; a[3][1] = a[1][0]; a[1][0] = a[0][3]; a[0][3] = a[3][3];
   a[3][3] = a[3][2]; a[3][2] = a[2][1]; a[2][1] = a[1][2];
   a[1][2] = tmp;

   Speedup: ~20%. Allocation rate: reduced from 228 MB to 116 MB.
6. SHA3::smChi - in-place operation, done by inner loop unrolling and use of local variables (see the sketch after this list).
   Speedup: ~60%. Allocation rate: reduced from 116 MB to 4 MB.
7. SHA3 - make the "lanes" arrays cached (no more "lanes" allocations).
   Speedup: ~3%. Allocation rate: reduced from 4 MB to 980 bytes (no more dependency on the input data size).
8. SHA3::smRho - precalculate the shift values for each "lanes" cell.
   Speedup: ~80%.
9. Flatten the two-dimensional "lanes" array (was long[5][5]) into a one-dimensional one (long[25]).
   Speedup: ~40%. Allocation rate: reduced from 980 bytes to 871 bytes.
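For illustration, a minimal sketch of what points 6 and 9 amount to (not the code from the webrev): the chi step applied in place on a flattened long[25] state, assuming lane (x, y) is stored at index 5*y + x. Two saved locals per row of five lanes replace a scratch array; the actual patch additionally unrolls this loop completely.

    // Sketch only. chi: a'[x] = a[x] ^ (~a[x+1] & a[x+2]), indices mod 5 within
    // each row of five lanes; saving the first two lanes of a row is enough to
    // run the step in place.
    static void chi(long[] a) {
        for (int y = 0; y < 25; y += 5) {
            long x0 = a[y];          // original lane x=0, still needed for x=3 and x=4
            long x1 = a[y + 1];      // original lane x=1, still needed for x=4
            a[y]     ^= ~x1       & a[y + 2];
            a[y + 1] ^= ~a[y + 2] & a[y + 3];
            a[y + 2] ^= ~a[y + 3] & a[y + 4];
            a[y + 3] ^= ~a[y + 4] & x0;
            a[y + 4] ^= ~x0       & x1;
        }
    }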
21-05-2016

Webrev is available at http://cr.openjdk.java.net/~skuksenko/crypto/8157495/
21-05-2016