JDK-8015237 : Parallelize string table scanning during strong root processing
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: hs23,hs24,hs25
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2013-05-22
  • Updated: 2014-04-08
  • Resolved: 2013-06-25
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 7 JDK 8 Other
7u40Fixed 8Fixed hs24Fixed
Related Reports
Relates :  
Relates :  
Description
During strong root processing of the GC, the intern string table is considered a single root and the task to scan the entire table is claimed and performed by a single thread.

With some work loads we have seen the scanning of the string table to be the main dominator of the GC and the other worker threads wait in the termination protocol until the thread that successfully claims the string table completes the scan.

A better approach is to have multiple GC worker threads scan the table in parallel - similar to how we use multiple GC worker threads to scan the Java threads' stacks.

The string table is a regular hash table with each bucket being the head of a singly linked list. A simple approach would be to have each participating worker claim a chunk of buckets and scan the strings in those buckets. This still has the potential of having unbalanced scan times if the lengths of the bucket chains are massively unbalanced. Fortunately this does not seem to be a problem for the hash functions and we see a more or less uniform distribution. 
Comments
I just noticed that this is a P4 - you do not need Release Team approval for P4 or P5 bugs. Please remove the critical-request label so it no longer shows up on the critical request list.
20-06-2013

Any of the existing GC tests with CMS/ParNew or G1 with ParallelGCThreads > 1 should test the code.
20-06-2013

is it possible to create regression test here?
20-06-2013

I will approve this - provided that 1) there's SQE-OK and 2) the release team reviews and doesn't have any concerns. Please get SQE to review and mark with SQE-OK if they agree to this fix
20-06-2013

7u40-critical-request justification: The changes for this CR are needed in 7u40 (hs24) to help eliminate (or reduce) performance issues seen when running FMW with G1.
19-06-2013

JBS created a backport for the push of these changes into hs25: JDK-8016915 - Parallelize string table scanning during strong root processing I've modified this CR to remove the multiple fix versions and will use this CR # to push the changes to hs24.
19-06-2013

Exactly as in description. Ran some performance numbers with Kitchensink to gauge the synchronization overhead: 0-threads 1-thread 1-thread-cas 1-thread-chunked Min: 0.200 0.700 0.300 0.300 Max: 6.900 12.200 8.800 11.200 Avg: 0.658 1.637 0.785 0.794 The 0-threads column is the serial code (i.e. a single thread performing the scan without synchronization); the 1-thread column uses a single thread claim one bucket at a time (using an atomic add instruction). The 1-thread-cas column uses a a single thread which claims 20 buckets at a time using a CAS operation. The 1-thread-chunked uses a single thread to claim 20 buckets using an atomic add. The overheads are very similar but the code for the atomic add case is much simpler.
22-05-2013