JDK-8222518 : Remove unnecessary caching of Parker object in java.lang.Thread
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 13
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2019-04-16
  • Updated: 2021-02-08
  • Resolved: 2019-04-26
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 11 JDK 13
11.0.11Fixed 13 b19Fixed
Related Reports
Relates :  
Relates :  
Relates :  
Description
The original implementation of Unsafe.unpark simply extracted the JavaThread reference from the java.lang.Thread oop and if non-null extracted the Parker instance from it and invoked unpark. This was racy however as the native JavaThread could terminate at any time and deallocate the Parker.

That logic was fixed by JDK-6271298 which used of combination of type-stable-memory "event" objects for the Parker, along with use of the Threads_lock to obtain the initial reference to the Parker (from a JavaThread guaranteed to be alive), together with caching the native Parker pointer in a field of java.lang.Thread. Even though the native thread may have terminated the Parker was still valid (even if associated with a different thread) and the unpark at worst was a spurious wakeup for that other thread.

When JDK-8167108 introduced Thread Safe-Memory-Reclaimation (SMR) the logic was updated to always use the safe mechanism - we grab a ThreadsListHandle then check the cached field, else lookup the native thread to see if it is alive and locate the Parker instance that way.

With SMR the caching of the Parker pointer no longer serves any purpose - we no longer have a lock-free use-the-cache path versus a lock-using populate-the-cache path. With SMR we've already"paid" for the ability to ensure the native thread can't terminate regardless of whether we lookup the field from the java.lang.Thread or the JavaThread. So we can simplify the code and save a little footprint by removing the cache from java.lang.Thread:

   /*
     * JVM-private state that persists after native thread termination.
     */
    private long nativeParkEventPointer;

and the supporting code from unsafe.cpp and javaClass.*pp in the JVM.
Comments
(11u). I used numactl to confine the java process to a single socket for one run. Variance is much reduced, all but ABQ_F improve with the patch, and ABQ_F is less than one percent worse, which is likely noise. Baseline LBQ 228.671 �� 11.136 ns/op ABQ_NF 435.315 �� 42.927 ns/op ABQ_F 6736.697 �� 125.986 ns/op PBQ 457.800 �� 25.765 ns/op Patched LBQ 225.580 �� 10.837 ns/op ABQ_NF 415.754 �� 16.758 ns/op ABQ_F 6790.031 �� 109.690 ns/op PBQ 444.416 �� 33.785 ns/op baseline patched % improvement patched over baseline LBQ 228.671 225.58 + 1.35 ABQ_NF 435.315 415.754 + 4.49 ABQ_F 6736.697 6790.031 - 0.79 PBQ 457.8 444.416 + 2.92
13-01-2021

Fix Request (11u). Review thread: https://mail.openjdk.java.net/pipermail/jdk-updates-dev/2021-January/004622.html See attachments for before/after performance comparison of ProducerConsumer, run on an m4.10xlarge EC2 instance, 2 runs each. This is a dual socket system with 10 physical cores per socket. The runs were were not confined to a single socket. Raw data is somewhat unstable, as David pointed out. Baseline runs: LBQ 531.917 �� 37.158 ns/op ABQ_NF 291.319 �� 23.175 ns/op ABQ_F 8814.959 �� 1300.188 ns/op PBQ 404.465 �� 22.287 ns/op LBQ 505.446 �� 26.927 ns/op ABQ_NF 292.392 �� 35.910 ns/op ABQ_F 8771.988 �� 1384.790 ns/op PBQ 412.276 �� 25.928 ns/op With patch: LBQ 513.046 �� 34.460 ns/op ABQ_NF 300.562 �� 27.080 ns/op ABQ_F 7973.633 �� 1277.665 ns/op PBQ 438.429 �� 47.704 ns/op LBQ 516.090 �� 49.003 ns/op ABQ_NF 286.943 �� 23.436 ns/op ABQ_F 7622.689 �� 1075.665 ns/op PBQ 423.748 �� 44.500 ns/op Averages: baseline patched % improvement patched over baseline LBQ 518.682 514.568 + 0.79 ABQ_NF 291.856 293.753 - 0.65 ABQ_F 8793.474 7798.161 + 11.32 PBQ 408.375 431.089 - 5.56 There is no difference on LBQ and ABQ_NF, a gain on ABQ_F (which David notes is a proxy for the cost of unparking), and a loss on PBQ comparable to the loss on LBQ for the JDK 13 patch. The baseline/patches differences for both ABQ_F and PBQ are within their extreme variances. On balance, little net effect, with perhaps an advantage to the patched version due to a reduction in the cost of unparking.
13-01-2021

I used the JMH org.openjdk.bench.java.util.concurrent.ProducerConsumer.test to investigate the performance issues here. Some raw figures are attached. As usual benchmarking conditions were not favourable and there is a lot of variance and overall I find the results and conclusions unsatisfactory. Here's a summary of the results for JDK 10 b36 (Thread-SMR added) versus b35: LBQ: -1.08% ABQ_NF: -0.42% ABQ_F: 3.2% PBQ: -3.02% The results refer to the difference in ns/op for tests on different Blocking Queue implementations. The F/NF refer to fair and non-fair variants. ABQ_F is the fair ArrayBlockingQueue. It is by far the slowest queue to use under stress as the "fair" aspect results in a lot of context switching as threads get processed strictly in order. It is the best measure of the cost of unparking and we can see that we took a 3% performance hit with the application of Thread_SMR. But the other tests show a performance gain. However if we look at the results (in the attachment) for JDK Baseline (JDK 13) we can see that there has been some further slowdown in that test. But significantly there is no significant difference between the baseline cost and the cost after applying either the removal of the cache, or the restoration of the fast-path of the cache (outside the use of Thread-SMR). While the variance makes it hard to compare I think it safe to say that restoring the cache fast-path does not yield any significant improvement and removing the cache completely does not show any significant performance hit: LBQ: -3.96% ABQ_NF: 2.29% ABQ_F: 1.87% PBQ: 1.60% and could show a gain on LBQ. In conclusion I see no evidence to not go ahead and remove the cache as proposed.
24-04-2019

I'm currently investigating whether the loss of the "lock-free" fast-path when SMR was introduced actually had a performance impact. If it did then I'll change this issue to restore that fast-path. Trying to find suitable benchmarks to run.
16-04-2019