JDK-8209862 : CipherCore performance improvement
  • Type: Bug
  • Component: security-libs
  • Sub-Component: javax.crypto
  • Affected Version: 12
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2018-08-22
  • Updated: 2019-03-19
  • Resolved: 2018-10-15
Fix versions:
  • JDK 11: 11.0.2 (Fixed)
  • JDK 12: 12 b16 (Fixed)
  • JDK 8: 8u201 (Fixed)
  • Other: openjdk7u (Fixed)
Description
Please consider the following performance improvement for CipherCore:
http://cr.openjdk.java.net/~skuksenko/crypto/8209862/

Preface.
JDK-8207775 (https://bugs.openjdk.java.net/browse/JDK-8207775) added required data zeroing. That caused a massive performance regression:
Regressions caused by JDK-8207775
 (Legend: <algorithm> <keyLength>/<dataSize> <regression Lin64>/<regression Win64>)
AESBench.decrypt
AES/CBC/NoPadding        128/1024     -17.4% / -3.9%
AES/CBC/NoPadding        128/16384     -3.8% / -4.3%
AES/CBC/PKCS5Padding     128/16384     -8.2% / -6.0%
AES/ECB/NoPadding        128/1024      -7.3% / -7.6%
AES/ECB/PKCS5Padding     128/16384        0  / -8.6%

AESGSMBench.decrypt
AES/GCM/NoPadding        128/1024      -4.4% / -3.9%

AESBench.encrypt
AES/CBC/PKCS5Padding     128/16384        0  / -2.60%

DESedeBench.decrypt
DESede/CBC/NoPadding     168/16384        0  / -7.20%
DESede/CBC/PKCS5Padding  168/16384        0  / -3.70%

DESedeBench.encrypt
DESede/ECB/NoPadding     168/16384        0  / -7.30%
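
For readers unfamiliar with the legend, a row such as AES/CBC/NoPadding 128/1024 under AESBench.decrypt presumably means decrypting 1024 bytes with a 128-bit key. The following is a minimal sketch of that operation through the public javax.crypto.Cipher API, not the actual benchmark code. The class name DoFinalPathsSketch, the method roundTrip, and the all-zero key and IV are assumptions made for illustration; which doFinal overload the benchmarks call is not stated in this report, so both are shown.

```java
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper, not part of the JDK or of the benchmark suite.
final class DoFinalPathsSketch {
    // Returns { plaintext, decrypted via 3-arg doFinal, decrypted via 5-arg doFinal }.
    static byte[][] roundTrip(int keyBits, int dataSize) {
        try {
            // All-zero key and IV: acceptable for a demo only, never for real data.
            SecretKeySpec key = new SecretKeySpec(new byte[keyBits / 8], "AES");
            IvParameterSpec iv = new IvParameterSpec(new byte[16]);
            byte[] plain = new byte[dataSize]; // must be a multiple of 16 for NoPadding
            Arrays.fill(plain, (byte) 7);

            Cipher enc = Cipher.getInstance("AES/CBC/NoPadding");
            enc.init(Cipher.ENCRYPT_MODE, key, iv);
            byte[] ct = enc.doFinal(plain, 0, plain.length);

            Cipher dec = Cipher.getInstance("AES/CBC/NoPadding");

            // Path 1: the cipher allocates the output array itself.
            dec.init(Cipher.DECRYPT_MODE, key, iv);
            byte[] viaAllocating = dec.doFinal(ct, 0, ct.length);

            // Path 2: the caller provides the output array.
            dec.init(Cipher.DECRYPT_MODE, key, iv);
            byte[] out = new byte[dec.getOutputSize(ct.length)];
            int n = dec.doFinal(ct, 0, ct.length, out, 0);

            return new byte[][] { plain, viaAllocating, Arrays.copyOf(out, n) };
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[][] r = roundTrip(128, 1024);
        System.out.println(Arrays.equals(r[0], r[1]) && Arrays.equals(r[0], r[2]));
    }
}
```

Both paths produce the same plaintext; they differ only in who owns the output array.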

In general, the negative performance effect caused by zeroing can't be avoided. But in some cases, CipherCore can be optimized.
Here is the list of speedups achieved by the suggested patch:
Performance improvements by suggested modification
(Legend: <algorithm> <keyLength>/<dataSize> <speedup Lin64>/<speedup Win64>)
AESBench.decrypt
AES/CBC/NoPadding        128/1024     68.10% / 40.20%
AES/CBC/NoPadding        128/16384    52.20% / 79.10%
AES/CBC/PKCS5Padding     128/16384    38.70% / 72.60%
AES/ECB/NoPadding        128/1024     29.40% / 23.90%
AES/ECB/NoPadding        128/16384    11.60% / 33.50%
AES/ECB/PKCS5Padding     128/16384    15.30% / 38.30%

AESGSMBench.decrypt
AES/GCM/NoPadding        128/1024      7.10% / 7.10%
AES/GCM/NoPadding        128/16384     9.20% / 2.10%
AES/GCM/PKCS5Padding     128/16384     9.00% / 0

AESBench.encrypt
AES/CBC/PKCS5Padding     128/16384     2.50% / 0
AES/ECB/NoPadding        128/1024         0  / 10.50%

DESedeBench.decrypt
DESede/CBC/PKCS5Padding  168/16384        0  / 3.40%
DESede/ECB/NoPadding     168/16384     4.00% / 4.40%
DESede/ECB/PKCS5Padding  168/16384        0  / 5.00%

DESedeBench.encrypt
DESede/ECB/NoPadding     168/16384     6.50% / 0
DESede/CBC/PKCS5Padding  168/16384     3.90% / 4.10%

This not only recovers almost all of the regression caused by the additional zeroing, but also gives additional performance benefits.

The idea of the modification:
- CipherCore contains 2 methods:
  doFinal(byte[], int, int)
  doFinal(byte[], int, int, byte[], int )
  The first method allocates the output array internally and invokes the second doFinal.
- At the same time, the second doFinal method contains a lot of checks and additional actions needed to work properly with a user-provided output array. All these actions can be avoided if the output array was allocated internally.
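
The delegation described above can be sketched roughly as follows. This is a simplified illustration, not the CipherCore source: the class name DelegatingDoFinalSketch, the XOR-based transform() stand-in for the real cipher, and the exact checks shown are all assumptions chosen to keep the example short.

```java
import java.util.Arrays;

// Hypothetical illustration of the delegation pattern; not the real CipherCore.
final class DelegatingDoFinalSketch {
    // Stand-in for the actual cipher transformation (assumption: XOR with a constant).
    static void transform(byte[] in, int inOff, int len, byte[] out, int outOff) {
        for (int i = 0; i < len; i++) {
            out[outOff + i] = (byte) (in[inOff + i] ^ 0x5A);
        }
    }

    // General entry point: the caller supplies the output array, so the method
    // must validate it and protect it from partially written results.
    static int doFinal(byte[] in, int inOff, int len, byte[] out, int outOff) {
        if (out.length - outOff < len) {
            throw new IllegalArgumentException("Output buffer too short");
        }
        // Work in a temporary buffer (covers overlapping in/out arrays, for example).
        byte[] tmp = new byte[len];
        transform(in, inOff, len, tmp, 0);
        System.arraycopy(tmp, 0, out, outOff, len); // extra copy
        Arrays.fill(tmp, (byte) 0);                 // extra zeroing of sensitive data
        return len;
    }

    // Convenience entry point: allocates the output itself, then delegates,
    // so it pays for checks, a copy and zeroing it does not actually need.
    static byte[] doFinal(byte[] in, int inOff, int len) {
        byte[] out = new byte[len];
        doFinal(in, inOff, len, out, 0);
        return out;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4};
        System.out.println(Arrays.toString(doFinal(data, 0, data.length)));
    }
}
```

In this sketch the three-argument method knows that its output array is fresh, large enough and not shared with the input, yet it still goes through the general path.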

What was done:
- Some parts of the code (those which can't be eliminated by knowing the output array details) were extracted from doFinal(byte[], int, int, byte[], int) into separate methods (checkReinit(), prepareInputBuffer(), checkOutputCapacity()).
- doFinal(byte[], int, int, byte[], int) was manually inlined into doFinal(byte[], int, int).
- Massive manual constant propagation and dead code elimination were applied (I have to note that the HotSpot JIT is unable to perform all such optimizations because it doesn't have enough information).

The key performance factor here is not the elimination of some checks, but the fact that we can avoid unnecessary data copying and the corresponding zeroing.
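
A minimal sketch of the resulting shape, under the same simplifying assumptions as any toy model of CipherCore: the class InlinedDoFinalSketch, the XOR transform() and the extraBytesTouched counter are hypothetical and exist only to make the saved copy and zeroing visible. The helper name checkOutputCapacity() is taken from the description above, but its body here is invented.

```java
import java.util.Arrays;

// Hypothetical illustration of the inlined fast path; not the actual patch.
final class InlinedDoFinalSketch {
    // Counts bytes copied or zeroed via temporary buffers (illustration only).
    static long extraBytesTouched = 0;

    // Stand-in for the actual cipher transformation (assumption: XOR with a constant).
    static void transform(byte[] in, int inOff, int len, byte[] out, int outOff) {
        for (int i = 0; i < len; i++) {
            out[outOff + i] = (byte) (in[inOff + i] ^ 0x5A);
        }
    }

    // Extracted check, still needed when the caller owns the output array.
    static void checkOutputCapacity(byte[] out, int outOff, int needed) {
        if (out.length - outOff < needed) {
            throw new IllegalArgumentException("Output buffer too short");
        }
    }

    // General path: temporary buffer, copy, then zeroing.
    static int doFinal(byte[] in, int inOff, int len, byte[] out, int outOff) {
        checkOutputCapacity(out, outOff, len);
        byte[] tmp = new byte[len];
        transform(in, inOff, len, tmp, 0);
        System.arraycopy(tmp, 0, out, outOff, len);
        Arrays.fill(tmp, (byte) 0);
        extraBytesTouched += 2L * len; // one copy plus one zeroing pass
        return len;
    }

    // Inlined path: the output array is freshly allocated and correctly sized,
    // so the capacity check, the temporary buffer, the copy and the zeroing
    // are all dead code and have been removed by hand.
    static byte[] doFinal(byte[] in, int inOff, int len) {
        byte[] out = new byte[len];
        transform(in, inOff, len, out, 0);
        return out;
    }
}
```

Calling the three-argument doFinal leaves extraBytesTouched unchanged, while the five-argument variant adds twice the input length, which mirrors where the description says the savings come from.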

Comments
Changing to type 'bug'. Given the performance regression, I think it better suits the matter.
19-10-2018

Fix Request: Performance edits to address regressions that came in from JDK-8207775 (a fix pending 11.0.x integration). Security jtreg and TCK testing performed.
15-10-2018

Thanks for the suggested patch. I'll need to take a closer look.
04-09-2018