JDK-8132510 : Replace ThreadLocalStorage with compiler/language-based thread-local variables
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 9
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2015-07-29
  • Updated: 2022-03-01
  • Resolved: 2015-12-04
JDK 9
9 b99: Fixed
Description
In various parts of the runtime and in compiler-generated code we need to get a reference to the (VM-level) Thread* of the currently executing thread. This is what Thread::current() returns. For performance reasons we also have a fast path on 64-bit where the Thread* is stashed away in a register (g7 on sparc, r15 on x64).

So Thread::current() is actually a slow-path mechanism and it delegates to ThreadLocalStorage::thread().

On some systems ThreadLocalStorage::thread utilizes a caching mechanism to try and speed up access to the current thread. Otherwise it calls into yet another "slow" path which uses the available platform thread-specific-storage APIs.

Compiled code also has a slow-path get_thread() method, which in some cases uses assembly code to invoke the same platform thread-specific-storage APIs; on sparc it simply calls ThreadLocalStorage::thread().

8130212 had to fix a problem with the caching mechanism on Solaris, and in doing so highlighted that this old ThreadLocalStorage code was put in place to deal with inadequacies of the system-provided thread-specific-storage API. In fact, on Solaris we even bypass the public API (thr_getspecific/thr_setspecific) when we can and implement our own version using lower-level APIs available in the T1/T2 threading libraries!

As of mid-2015 things have changed considerably, and we have reliable and performant support for thread-local variables at the C++ language level. So the current thread can be maintained simply with a thread-local variable (current Solaris version):

 // Declaration of thread-local variable
 static __thread Thread * _thr_current;

 inline Thread* ThreadLocalStorage::thread()  {
   return _thr_current;
 }

 inline void ThreadLocalStorage::set_thread(Thread* thread) {
   _thr_current = thread;
 } 

But we can go further: by using language-level thread-locals we can completely remove the ThreadLocalStorage class, define things directly in the Thread class itself, and remove the notion of get_thread_slow.
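
A minimal sketch of what defining the thread-local directly in the Thread class could look like. This is an illustration under assumptions, not the HotSpot implementation: the class below is a stripped-down stand-in, the method names initialize_thread_current and clear_thread_current and the helper current_seen_by_new_thread are hypothetical, and standard C++11 thread_local stands in for the compiler-specific __thread.

```cpp
#include <cassert>
#include <thread>

// Hypothetical, stripped-down stand-in for the VM-level Thread class.
class Thread {
 public:
  // Reads the language-level thread-local directly; no separate slow path.
  static Thread* current() { return _thr_current; }

  // Hypothetical names: record this object as the calling thread's Thread*.
  void initialize_thread_current() { _thr_current = this; }
  // Forget the calling thread's Thread* (e.g. when it detaches).
  static void clear_thread_current() { _thr_current = nullptr; }

 private:
  // One pointer per OS thread, managed by the compiler and runtime.
  static thread_local Thread* _thr_current;
};

thread_local Thread* Thread::_thr_current = nullptr;

// Illustration helper: returns what a brand-new thread sees before it has
// registered itself. Each thread has its own copy, so this is nullptr.
inline Thread* current_seen_by_new_thread() {
  Thread* seen = reinterpret_cast<Thread*>(1);  // sentinel, overwritten below
  std::thread worker([&seen] { seen = Thread::current(); });
  worker.join();
  return seen;
}
```

A thread that calls initialize_thread_current() on its own Thread object sees that object from Thread::current(), while other threads are unaffected.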
Comments
URL: http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/f7dc8eebc3f5 User: lana Date: 2015-12-23 23:04:20 +0000
23-12-2015

URL: http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/f7dc8eebc3f5 User: dholmes Date: 2015-12-04 10:33:44 +0000
04-12-2015

On Linux the performance change is neutral: some small wins, some small losses, and no differences considered statistically significant.
26-11-2015

linux-x32 > bash refworkload/compare -r logs.ref-x32 logs.latest-x32
==============================================================================
logs.ref-x32:
  Benchmark         Samples         Mean      Stdev  Geomean Weight
  reference_server       10     26494.20     539.81
    jetstream            10       210.25      25.52     0.10
    scimark              10      1899.76       7.32     0.15
    specjbb2000          10    685753.98    6070.64     0.15
    specjbb2005          10    430672.44    8281.21     0.25
    specjvm98            10       938.23      13.06     0.20
    volano25             10    298126.10   29757.85     0.15
==============================================================================
logs.latest-x32:
  Benchmark         Samples         Mean      Stdev    %Diff       P  Significant
  reference_server       10     26600.76     646.53     0.40   0.694  *
    jetstream            10       234.25      18.82    11.41   0.029  *
    scimark              10      1898.55       9.08    -0.06   0.746  *
    specjbb2000          10    689120.85   10964.32     0.49   0.410  *
    specjbb2005          10    421504.77   20534.79    -2.13   0.215  *
    specjvm98            10       936.24       9.91    -0.21   0.706  *
    volano25             10    294500.00   27476.49    -1.22   0.780  *
==============================================================================
* - Not Significant: A non-zero %Diff for the mean could be noise. If the %Diff is 0, an actual difference may still exist. In either case, more samples would be needed to detect an actual difference in sample means.
Alpha for this run: 0.010
26-11-2015
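
For reading the comparison output, the reported %Diff values are consistent with a plain relative change of the new mean against the reference mean, and a row is flagged "Not Significant" when its P value is not below the run's alpha. The sketch below encodes that reading. It is an assumption inferred from the printed numbers, not taken from the refworkload tool itself, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Relative change of the new mean against the reference mean, in percent.
// Assumed formula; it reproduces e.g. the jetstream x32 row:
// (234.25 - 210.25) / 210.25 * 100 is approximately 11.41.
inline double percent_diff(double ref_mean, double new_mean) {
  return (new_mean - ref_mean) / ref_mean * 100.0;
}

// Assumed decision rule: a difference counts as significant only when the
// P value is strictly below alpha (0.010 in the runs shown).
inline bool is_significant(double p_value, double alpha) {
  return p_value < alpha;
}
```

Under this reading, the 11.41% jetstream difference on x32 (P = 0.029) is still reported as not significant, because 0.029 is above the alpha of 0.010.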

Not having much success getting any performance results. refworkload results for Linux x64:

/scratch/daholme/8132510 > bash ./refworkload/compare -r logs.ref-x64 logs.latest-x64/
==============================================================================
logs.ref-x64:
  Benchmark         Samples         Mean      Stdev  Geomean Weight
  reference_server       10     39188.35     553.92
    jetstream            10       296.78      28.80     0.10
    scimark              10      1907.22      35.78     0.15
    specjbb2000          10    862139.65    6421.06     0.15
    specjbb2005          10    484463.18   17609.05     0.25
    specjvm98            10      1035.55      18.38     0.15
    volano25             10    282982.20   19272.56     0.20
==============================================================================
logs.latest-x64/:
  Benchmark         Samples         Mean      Stdev    %Diff       P  Significant
  reference_server       10     38906.31     924.55    -0.72   0.421  *
    jetstream            10       296.19      20.20    -0.20   0.958  *
    scimark              10      1919.84      28.85     0.66   0.397  *
    specjbb2000          10    860822.94    5371.86    -0.15   0.625  *
    specjbb2005          10    487284.80   21241.05     0.58   0.750  *
    specjvm98            10      1031.49      17.00    -0.39   0.614  *
    volano25             10    271116.00   25825.44    -4.19   0.261  *
==============================================================================
* - Not Significant: A non-zero %Diff for the mean could be noise. If the %Diff is 0, an actual difference may still exist. In either case, more samples would be needed to detect an actual difference in sample means.
Alpha for this run: 0.010
26-11-2015

The code underpinning __thread use is not async-signal-safe, which is not really a surprise as pthread_get/setspecific are not designated async-signal-safe either.

The problem, in glibc, is that first access of a TLS variable can trigger allocation [1]. This contrasts with using pthread_getspecific which is benign and so effectively async-signal-safe. So if a thread is executing in malloc and it takes a signal, and the signal handler tries to use TLS (it shouldn't but it does and has gotten away with it with pthread_getspecific), then we can crash or get a deadlock.

In the context of the VM the problem only exists for threads that existed before the JVM was loaded. All threads allocated after that will have space for all the TLS variables allocated directly. So the problem scenario is:

- external process with existing threads loads the JVM
- existing thread is executing critical library function eg malloc, when it takes a process-directed signal (any signal really but a synchronous signal during malloc is already a problem)
- JVM signal handler runs and accesses _thr_current which triggers dynamic TLS allocation => deadlock (most likely)

As discussed here:

http://mail.openjdk.java.net/pipermail/hotspot-dev/2015-November/020508.html

Google hit this problem and worked-around it in their own custom launcher. We can't do that. So I've reinstated a very basic ThreadLocalStorage class which will only need two implementations: a POSIX one, and a Windows one. This class is always initialized and ThreadLocalStorage::thread() is used from the signal handlers (as today).

For platforms that don't have __thread support they can define USE_LIBRARY_BASED_TLS_ONLY at build time to only use the ThreadLocalStorage implementation.

[1] https://sourceware.org/glibc/wiki/TLSandSignals
16-11-2015
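
A minimal sketch of how a POSIX flavour of such a basic ThreadLocalStorage class might sit alongside the language-level thread-local. This is an illustration under assumptions, not the committed code: the Thread struct is a stand-in, and init, current_thread, set_current_thread and current_thread_from_signal_handler are hypothetical names. Only the pthread calls and the USE_LIBRARY_BASED_TLS_ONLY macro name come from POSIX and the comment above respectively.

```cpp
#include <cassert>
#include <pthread.h>

struct Thread { int id; };  // hypothetical stand-in for the VM-level Thread

// Library-based TLS built on a single pthread key.
class ThreadLocalStorage {
 public:
  // Must run once before any thread calls thread() or set_thread().
  static void init() {
    if (!_initialized) {
      int rc = pthread_key_create(&_key, nullptr);  // no destructor needed
      assert(rc == 0);
      (void)rc;
      _initialized = true;
    }
  }
  static Thread* thread() {
    return static_cast<Thread*>(pthread_getspecific(_key));
  }
  static void set_thread(Thread* t) {
    int rc = pthread_setspecific(_key, t);
    assert(rc == 0);
    (void)rc;
  }

 private:
  static pthread_key_t _key;
  static bool _initialized;
};
pthread_key_t ThreadLocalStorage::_key;
bool ThreadLocalStorage::_initialized = false;

#ifndef USE_LIBRARY_BASED_TLS_ONLY
static thread_local Thread* _thr_current = nullptr;  // fast path
#endif

// Always record the thread in the library-based slot; also record it in the
// language-level thread-local when the platform supports one.
inline void set_current_thread(Thread* t) {
  ThreadLocalStorage::set_thread(t);
#ifndef USE_LIBRARY_BASED_TLS_ONLY
  _thr_current = t;
#endif
}

// Normal code path: use the language-level thread-local when available.
inline Thread* current_thread() {
#ifdef USE_LIBRARY_BASED_TLS_ONLY
  return ThreadLocalStorage::thread();
#else
  return _thr_current;
#endif
}

// Signal handlers avoid the language-level thread-local, whose first access
// may allocate, and go through pthread_getspecific instead.
inline Thread* current_thread_from_signal_handler() {
  return ThreadLocalStorage::thread();
}
```

The design choice sketched here is that both slots are written together, so the signal-handler path and the normal path always agree on the current thread.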

Ran testsuite vm.runtime successfully on all core platforms (RBT)
29-10-2015

Thomas Stuefe reports an issue with __thread on linux:

On Mon, Aug 3, 2015 at 10:22 PM, David Holmes <david.holmes@oracle.com> wrote:
> On 4/08/2015 1:38 AM, Thomas Stüfe wrote:
>> we added compiler-level TLS (__thread) for Linux a while ago to do exactly what you are doing, but then had to remove it again. Bugs in the glibc caused memory leaks - basically it looked like glibc was not cleaning TLS related structures. Leak was small but it added up over time.
>>
>> I know you implemented this for Solaris. Just thought I give you a warning, maybe this is something to keep in mind.
>
> Thanks for the heads-up! Linux et al are next on the list. I'll put together a simple thread creation test and see if the memory use changes over time.

Sounds good. Unfortunately this was a gcc/glibc bug and therefore you may not see the bug on every linux system. I think this may have been the bug:

https://sourceware.org/bugzilla/show_bug.cgi?id=12650

Update: I could not observe any memory leak relating to use of __thread variables.
27-10-2015
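
A simple thread creation test of the kind mentioned above could be sketched as follows. This is not the test that was actually run. It assumes Linux (it reads the resident set size from /proc/self/statm), uses standard thread_local in place of __thread, and the names tls_probe, resident_pages and churn_threads are hypothetical. It also does not necessarily exercise the code path of the glibc bug referenced above.

```cpp
#include <cassert>
#include <fstream>
#include <thread>

// Thread-local touched by every short-lived thread.
static thread_local int tls_probe = 0;

// Resident set size in pages, read from /proc/self/statm (Linux-specific).
// Returns -1 if the file cannot be read.
inline long resident_pages() {
  std::ifstream statm("/proc/self/statm");
  long size = 0, resident = 0;
  if (!(statm >> size >> resident)) return -1;
  return resident;
}

// Creates and joins `count` threads one at a time; each writes and reads its
// own copy of the thread-local. Returns how many threads did so.
inline long churn_threads(int count) {
  long touched = 0;
  for (int i = 0; i < count; ++i) {
    std::thread t([&touched] {
      tls_probe = 1;
      touched += tls_probe;  // safe: only one worker runs at a time
    });
    t.join();
  }
  return touched;
}
```

Calling resident_pages() before and after several large churn_threads() rounds, and checking whether the value keeps growing from round to round, gives a rough indication of whether thread-local storage is being leaked per thread.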