JDK-7107611 : sun.security.pkcs11.SessionManager is scalability blocker
  • Type: Enhancement
  • Component: security-libs
  • Sub-Component: java.security
  • Affected Version: 7
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2011-11-02
  • Updated: 2015-05-07
  • Resolved: 2014-03-19
JDK 7        JDK 8        JDK 9         Other
7u91 Fixed   8u40 Fixed   9 b08 Fixed   na Resolved
sun.security.pkcs11.SessionManager synchronizes access to its Session pools, which has become a huge scalability blocker on T4.
A patch is suggested.
The attached webrev.zip contains the suggested patch, in which the Session pools are implemented with concurrent collections.

As a result we got (on T4-4):
SPECjvm2008:crypto.rsa - 7x boost,
SPECjvm2008:crypto.signverify - 2.5x boost,
SPECjbb2012 (encrypted transport) - 2.7x boost.

The changeset pushed is not the same as the patch provided. There was discussion during code review, and other minor stylistic changes were made to the original patch. The main changes were to not have a session pool only for P11Cipher and to use ConcurrentLinkedDeque instead of ConcurrentLinkedQueue. The major performance gains came from replacing the lock around the non-concurrent list.
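To illustrate the idea of replacing a locked, non-concurrent list with a concurrent collection, here is a minimal sketch of a lock-free object pool built on ConcurrentLinkedDeque. It is not the pushed changeset: the class name DequePool, the maxIdle cap, and the use of a Supplier factory are all assumptions made for illustration.

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical pool sketch; NOT the actual SessionManager implementation. */
final class DequePool<T> {
    private final ConcurrentLinkedDeque<T> idle = new ConcurrentLinkedDeque<>();
    // Separate counter because ConcurrentLinkedDeque.size() walks the whole deque.
    private final AtomicInteger idleCount = new AtomicInteger();
    private final Supplier<T> factory;
    private final int maxIdle;

    DequePool(Supplier<T> factory, int maxIdle) {
        this.factory = factory;
        this.maxIdle = maxIdle;
    }

    /** Reuses the most recently released object, or creates a new one. No lock is taken. */
    T acquire() {
        T pooled = idle.pollLast();
        if (pooled != null) {
            idleCount.decrementAndGet();
            return pooled;
        }
        return factory.get();
    }

    /** Returns true if the object was pooled, false if the pool was full and it was dropped. */
    boolean release(T obj) {
        if (idleCount.incrementAndGet() <= maxIdle) {
            idle.offerLast(obj);
            return true;
        }
        idleCount.decrementAndGet();
        return false;
    }
}
```

Threads calling acquire() and release() never block one another here, whereas a synchronized LinkedList serializes every caller on one monitor. A real session pool would also have to close dropped sessions and handle token removal, which this sketch ignores.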

Below is the AES performance before and after the changes, using the provided test with 64-byte data sizes.

Threads   Before (ops/m)   After (ops/m)
2         37422428         50136654
4         60937179         88389334
6         29093646         111576546
8         17251277         89187233
10        16154306         66404341
24        17456554         62942152
48        18082039         57753738
64        17291983         55762932
128       16613593         54586763

The dip in performance after 8 threads appears to be an NSS issue. The submitter filed bugs against NSS with the fixes:
https://bugzilla.mozilla.org/show_bug.cgi?id=731128
https://bugzilla.mozilla.org/show_bug.cgi?id=731126
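The speedup at each thread count follows directly from the before and after figures above. A small sketch that computes the ratios (the class and method names are hypothetical; the numbers are copied from this comment):

```java
/** Computes after/before throughput ratios from the AES figures in this comment. */
final class Speedup {
    static final int[] THREADS = {2, 4, 6, 8, 10, 24, 48, 64, 128};
    static final long[] BEFORE = {37422428L, 60937179L, 29093646L, 17251277L,
                                  16154306L, 17456554L, 18082039L, 17291983L, 16613593L};
    static final long[] AFTER  = {50136654L, 88389334L, 111576546L, 89187233L,
                                  66404341L, 62942152L, 57753738L, 55762932L, 54586763L};

    /** Ratio of after-change to before-change throughput for row i. */
    static double ratio(int i) {
        return (double) AFTER[i] / BEFORE[i];
    }

    public static void main(String[] args) {
        for (int i = 0; i < THREADS.length; i++) {
            System.out.printf("%3d threads: %.2fx%n", THREADS[i], ratio(i));
        }
    }
}
```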

Intel provided changes that improve SessionManager and symmetric cipher performance. On Linux on Ivy Bridge they help AES scale up to 5x on larger multi-core systems, with data sizes as small as 64 bytes. The SessionManager changes alone should reduce session contention for RSA, but because of JDK-7092821 that performance gain may not be realized. I saw no performance change for RSA.

There is no timeframe for when JDK-7092821 will be addressed. Performance results for this bug cannot be tested until that bug is fixed. Additionally, this bug does not take into account the ucrypto provider on Solaris, which provides much better AES performance than PKCS11 ever can. This was originally a P4 until I moved it to P3 to work on it; I'm moving it back to P4. I am also changing this to an Enhancement because there is no failure here, just performance; JDK-7092821 is an Enhancement too.

According to Sergey, 7092821 needs to be addressed before the performance gains from these changes show.

What options were provided to the SPEC tests to reproduce this? I ran crypto.rsa with the default settings and saw no significant difference between the patched and unpatched JDKs. Thanks.

Was there a discussion about this bug that is not listed in the comments? If so, can it be attached to the bug, please? It's sad that I spent so much time on the microbenchmark that does AES when it now sounds like the issue is related more to RSA and sign/verify.

Increasing priority back to P3. This issue merits P3 based on ILW guidelines, as it's a significant performance improvement across several crypto-focused workloads.

> I have not run any of the SPEC tests. The numbers provided for the SPEC tests seem unreal and if the performance was that great,
> I would have expected to see the microbenchmark perform nearly as well, but I didn't.

Don't spend time on microbenchmarks when you are able to run SPECjvm2008:crypto.rsa. That would be the proof, because SPEC benchmarks correlate with real-world performance. I also have to note that the microbenchmark was created by a third party, and the original issue should be considered without it.

I have run the microbenchmark test provided and have found little significant performance increase running on a T4 using enc_alg. It was more noticeable for 16-byte operations with 1000 threads, but that was all. At 128 bytes, the advantage was negligible.

When I looked at the source of the test, I made some modifications. First, the "score" seemed arbitrary; for example, there was a divide by 6 that I do not see a reason for. I changed it to ops/sec, which is what we are trying to measure. Second, the input data was the same for every doFinal() call, and repeated operations on the same data tend to be sped up by the system cache, which taints the results.

Looking at the profiles of the runs, it was true that the ConcurrentHashMap showed up less as a hot function than the current LinkedList. However, the function that matters, yf_aes128_cbc_encrypt, was nearly identical in its timing. The change mostly shifted the lock to another part of the system. Below is 1000 threads running 512-byte input.

ConcurrentHashMap
171.650  171.650  <Total>
 32.513   32.513  __lwp_park
 16.091   24.887  mutex_lock_impl
  8.796    8.796  mutex_trylock_adaptive
  5.744   13.610  _malloc_unlocked
  4.963    6.725  mutex_unlock
  4.373    4.373  yf_aes128_cbc_encrypt

LinkedList
174.062  174.062  <Total>
 31.592   31.592  __lwp_park
 14.980   23.617  mutex_lock_impl
  8.636    8.636  mutex_trylock_adaptive
  7.735   10.087  sun.security.pkcs11.Token.releaseSession(sun.security.pkcs11.Session)
  6.144    6.144  java.util.Arrays.copyOfRange(char[], int, int)
  5.514   12.809  _malloc_unlocked
  4.533    4.653  sun.security.pkcs11.Session.id()
  4.413    6.084  mutex_unlock
  4.243    4.243  yf_aes128_cbc_encrypt

I have not run any of the SPEC tests. The numbers provided for the SPEC tests seem unreal, and if the performance was that great, I would have expected to see the microbenchmark perform nearly as well, but I didn't. I saw a 20% performance increase at 16 bytes and nearly none at 128 bytes. At 16 bytes all the time is spent in the JVM and nearly none doing encryption. At this moment this looks more like a performance increase for a SPEC benchmark than for the real world. I may be wrong; this is just my opinion based on the data I've seen in the tests and read in the bug.
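The two benchmark modifications described in this comment (reporting ops/sec and varying the input on every doFinal() call) can be sketched as follows. This is not the attached test: the class name AesThroughput and its parameters are hypothetical, and it uses whatever provider the JDK selects by default for AES/CBC/NoPadding rather than a configured PKCS11 provider, so it shows only the shape of the measurement.

```java
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical multi-threaded AES throughput sketch reporting operations per second. */
final class AesThroughput {
    static double opsPerSecond(int threads, int dataSize, long durationMillis) throws Exception {
        if (dataSize <= 0 || dataSize % 16 != 0) {
            throw new IllegalArgumentException("dataSize must be a positive multiple of 16");
        }
        byte[] keyBytes = new byte[16];
        byte[] iv = new byte[16];
        SecureRandom random = new SecureRandom();
        random.nextBytes(keyBytes);
        random.nextBytes(iv);
        SecretKeySpec key = new SecretKeySpec(keyBytes, "AES");
        IvParameterSpec ivSpec = new IvParameterSpec(iv);

        AtomicBoolean stop = new AtomicBoolean(false);
        CountDownLatch ready = new CountDownLatch(threads);
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService exec = Executors.newFixedThreadPool(threads);
        List<Future<Long>> results = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            results.add(exec.submit(() -> {
                Cipher cipher = Cipher.getInstance("AES/CBC/NoPadding");
                cipher.init(Cipher.ENCRYPT_MODE, key, ivSpec);
                byte[] buf = new byte[dataSize];
                ThreadLocalRandom.current().nextBytes(buf);
                ready.countDown();
                start.await();
                long ops = 0;
                while (!stop.get()) {
                    // Encrypt in place: each ciphertext becomes the next input,
                    // so no two doFinal() calls see the same data.
                    cipher.doFinal(buf, 0, buf.length, buf, 0);
                    ops++;
                }
                return ops;
            }));
        }
        ready.await();                      // all ciphers initialized before timing starts
        long t0 = System.nanoTime();
        start.countDown();
        Thread.sleep(durationMillis);
        stop.set(true);
        long total = 0;
        for (Future<Long> f : results) {
            total += f.get();               // also propagates any worker exception
        }
        long elapsedNanos = System.nanoTime() - t0;
        exec.shutdown();
        return total / (elapsedNanos / 1e9);
    }
}
```

For example, AesThroughput.opsPerSecond(8, 64, 5000) would run eight threads on 64-byte buffers for about five seconds. A serious measurement would also need warm-up iterations and repeated runs.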