JDK-8176442 : [aix] assert(_thr_current == 0L) failed: Thread::current already initialized
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 9,10
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: aix
  • Submitted: 2017-03-09
  • Updated: 2017-04-06
  • Resolved: 2017-03-14
Fixed in:
  • JDK 10: 10 (Fixed)
  • JDK 9: 9 b163 (Fixed)
Description
On AIX, we see sporadic asserts when running the jtreg tests:

-----------

#  Internal Error (/priv/d031900/openjdk/jdk9-hs/source/hotspot/src/share/vm/runtime/thread.cpp:295), pid=1073374, tid=8739
#  assert(_thr_current == 0L) failed: Thread::current already initialized

---------------  T H R E A D  ---------------

Current thread (0xbabababababababa):
[error occurred during error reporting (printing current thread), id 0xe0000000]

------------

A newborn thread (usually the AttachListener) wants to initialize Thread::current(). Since JDK-8132510, Thread::current() is implemented with compiler-level TLS ("__thread"). Before that, it was implemented using pthread library TLS ("pthread_getspecific" etc.). So the code wants to initialize its instance of _thr_current (a __thread variable) but finds that it is not NULL. __thread variables should by default be initialized to 0 by the C-Runtime.
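To make the two TLS flavors concrete, here is a minimal sketch in C. It is not the HotSpot code: all names (thr_current_compiler, thr_current_key, the init_* and current_* functions) are hypothetical and only illustrate the difference between a __thread variable and a pthread key. The assert in init_compiler_tls() mirrors the one that fails in this report.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Variant A: compiler-level TLS. The C runtime is expected to give every
 * new thread its own zero-initialized instance of this variable.
 * (Hypothetical name, stands in for _thr_current.) */
static __thread void* thr_current_compiler = NULL;

/* Variant B: pthread library TLS. The per-thread slot is addressed via a key. */
static pthread_key_t  thr_current_key;
static pthread_once_t thr_current_key_once = PTHREAD_ONCE_INIT;

static void create_key(void) {
    int rc = pthread_key_create(&thr_current_key, NULL);
    assert(rc == 0);
    (void)rc;
}

void* current_compiler_tls(void) {
    return thr_current_compiler;
}

void init_compiler_tls(void* self) {
    /* Same kind of check as the failing assert: the slot must still be 0. */
    assert(thr_current_compiler == NULL);
    thr_current_compiler = self;
}

void* current_pthread_tls(void) {
    pthread_once(&thr_current_key_once, create_key);
    /* POSIX: a key's value is NULL in every thread until it is set. */
    return pthread_getspecific(thr_current_key);
}

void init_pthread_tls(void* self) {
    pthread_once(&thr_current_key_once, create_key);
    assert(pthread_getspecific(thr_current_key) == NULL);
    int rc = pthread_setspecific(thr_current_key, self);
    assert(rc == 0);
    (void)rc;
}
```

In variant A the initial value of the slot depends entirely on the compiler and C-Runtime. In variant B it is the pthread library that guarantees the initial NULL.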

In this case the __thread variable is filled with a "0xbababa..." pattern, which after analysis turned out to be the zap value we use in os::free() to mark freed memory before returning it to the C-Runtime.

The memory backing the __thread variables lives in the process data segment, as does the C-heap memory, so an overwrite scenario is possible. In fact, __thread variable locations and malloc() locations are closely interleaved. From the address patterns, it looks like the C-Runtime just mallocs the backing memory for TLS instances as it goes along, for each newborn thread. It does not look like the C-Runtime pre-allocates memory for the TLS instances. All this is guesswork though: AIX is closed source, so there is no way to examine the implementation.

There is a theoretical possibility that this is our fault, that the VM stomps over C-Runtime internal memory. However, after analyzing the issue I think that this is unlikely. It is more likely that the error is with the OS/C-Runtime. Here is why:

1) When examining the order of malloc/free calls, one can observe the malloc call which allocates the memory range that spans the location of the future bad TLS variable instance. It is malloced (via os::malloc()), then freed again (via os::free()). Nothing untoward happens, the VM is well behaved. It zaps the memory and hands it back to the C-Runtime. This zap value later shows up as the content of the newborn thread's TLS variable.

2) I never see *existing* TLS variables overwritten, only *new* TLS variables for newborn threads having the wrong initialization value. If we really were stomping around, we would with some probability have hit existing TLS variables too, or other vital memory, and would see more diverse errors.

3) Error only happens on AIX 5.3, observed on two machines. No error seen on AIX 6.1 and AIX 7.2.

4) os::malloc/os::free establish and check guards (GuardedMemory), so simple cases of overwrites or double frees should be caught.
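The guard-and-zap behavior referred to in points 1) and 4) can be sketched roughly as follows. This is only an illustration of the technique, not GuardedMemory itself: the layout, the guard value 0xAB, and the function names are assumptions. Only the 0xBA zap pattern is taken from this report.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_BYTE 0xAB  /* hypothetical canary value */
#define ZAP_BYTE   0xBA  /* pattern observed in the bad __thread variable */
#define GUARD_SIZE 16

/* Block layout: [header][head guard][user area][tail guard].
 * The header is padded to 16 bytes to keep the user area aligned. */
typedef union { size_t user_size; unsigned char pad[16]; } header_t;

void* guarded_malloc(size_t size) {
    unsigned char* raw = malloc(sizeof(header_t) + 2 * GUARD_SIZE + size);
    if (raw == NULL) return NULL;
    ((header_t*)raw)->user_size = size;
    unsigned char* user = raw + sizeof(header_t) + GUARD_SIZE;
    memset(user - GUARD_SIZE, GUARD_BYTE, GUARD_SIZE);  /* head guard */
    memset(user + size, GUARD_BYTE, GUARD_SIZE);        /* tail guard */
    return user;
}

static int guard_intact(const unsigned char* g) {
    for (size_t i = 0; i < GUARD_SIZE; i++) {
        if (g[i] != GUARD_BYTE) return 0;
    }
    return 1;
}

/* Returns 1 if both guards around the user area are untouched. */
int guards_ok(const void* p) {
    assert(p != NULL);
    const unsigned char* user = p;
    const header_t* h = (const header_t*)(user - GUARD_SIZE - sizeof(header_t));
    return guard_intact(user - GUARD_SIZE) && guard_intact(user + h->user_size);
}

/* Returns 1 on success, 0 if a guard was damaged (a real VM would abort). */
int guarded_free(void* p) {
    if (p == NULL) return 1;
    if (!guards_ok(p)) return 0;
    unsigned char* user = p;
    header_t* h = (header_t*)(user - GUARD_SIZE - sizeof(header_t));
    /* Zap guards and user area. A second guarded_free() on the same block
     * would then typically fail the guard check. Note: an optimizing compiler
     * may elide a memset that directly precedes free(). */
    memset(user - GUARD_SIZE, ZAP_BYTE, h->user_size + 2 * GUARD_SIZE);
    free(h);
    return 1;
}
```

With a scheme like this, a write past the end of the user area or a plain double free is detected at free time, while the zap pattern stays in the memory that is handed back to the C-Runtime.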

I could still conceive of a highly far-fetched scenario (see comments) where we could be guilty of stomping over C-Runtime memory, but I find it unlikely. More likely, the OS/C-Runtime did not correctly initialize the __thread TLS variable for this thread.

I attempted to write a simple C reproduction case, but so far without success. We will contact IBM support and check for known bugs.
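For illustration, a reproduction attempt along these lines could look like the sketch below. This is not the program that was actually tried. The names, the 32k block size (borrowed from the allocation mentioned in the comments), and the round count are assumptions. The idea is to fill freed heap blocks with the zap pattern and then check whether a newborn thread sees a non-zero __thread variable.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Must read as 0 in every newborn thread. (Hypothetical name.) */
static __thread void* tls_slot;

static void* check_tls(void* arg) {
    int* bad = arg;
    if (tls_slot != NULL) *bad = 1;  /* runtime failed to zero the slot */
    tls_slot = &tls_slot;            /* dirty the slot before the thread exits */
    return NULL;
}

/* Allocate, fill with the zap pattern, and free a number of blocks.
 * Note: a compiler may elide a memset that directly precedes free(). */
static void churn_heap(size_t block_size, int blocks) {
    for (int i = 0; i < blocks; i++) {
        void* p = malloc(block_size);
        if (p != NULL) {
            memset(p, 0xBA, block_size);
            free(p);
        }
    }
}

/* Returns the number of threads that saw a non-zero TLS slot,
 * or -1 if a thread could not be created. */
int count_bad_tls(int rounds) {
    int bad_total = 0;
    for (int r = 0; r < rounds; r++) {
        churn_heap(32 * 1024, 8);
        pthread_t t;
        int bad = 0;
        if (pthread_create(&t, NULL, check_tls, &bad) != 0) return -1;
        int rc = pthread_join(t, NULL);
        assert(rc == 0);
        (void)rc;
        bad_total += bad;
    }
    return bad_total;
}
```

On a correctly behaving C-Runtime, count_bad_tls() should return 0. A non-zero result would point at the TLS initialization rather than at the VM.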

I propose to switch off compiler-based TLS and go back to pthread library TLS. (That should be simple: David preserved both code paths and added a compiler switch when he did the original changes for JDK-8132510.) There are no real advantages to compiler-level TLS, and pthread-level TLS worked for us on AIX for many years without problems. I would also like to reduce dependencies on the C-Runtime and compiler on AIX.



Comments
Re-targeting to JDK 9. This fix only affects AIX and is needed on that platform for 9. This needs to be pushed before RDP2.
14-03-2017

That does sound like the OS/runtime is not properly initializing the TLS region.
10-03-2017

Further tests show this error only happens on AIX 5.3 (tested on two machines, tech levels 5300-11 and 5300-09). TLS should be supported from 5300-05 on. No error on AIX 6.1 and AIX 7.2.
10-03-2017

Hi David, sure, I did all this already. The zap value is confirmed to be the value GuardedMemory sets the user portion of memory to on release (when I changed it, the value in the __thread variable changed too). The "overwritten" section is an allocation of 32k, in the midst of many other allocations. What happens is:

  ...
  os::malloc
  os::malloc
  os::malloc
  os::malloc 32k, range includes the area of the future bad __thread location X
  os::free this range. Area gets zapped and returned to CRT. So far we did nothing wrong.
  ...
  ...
  __thread gets created for a newborn thread at this location X, but it is not 0 but contains the zap value.

Notes:
- As you see from the above, this is not an error where the wrong range gets zapped. The zapped range was the allocated range, all is correct, and the __thread variable is later placed into this range by the CRT.
- Until now I never saw an *existing* - already initialized - __thread value getting overwritten (one would get crashes when accessing Thread functions via Thread::current() while Thread::current() returns 0xbababa..). I only ever see a problem with a newly to be initialized __thread variable. If we were indiscriminately overwriting CRT memory, I would expect more diverse errors.
- It could still be badly behaving VM code, assuming: 1) the memory backing the __thread variables is pre-allocated and initialized at time A by the CRT, 2) at time B we stomp over this preallocated memory and zap it, 3) at time C a newborn thread wants a new __thread variable and gets a slot in the preallocated-but-overwritten memory. Barring any oversights on my part, the only way this could happen would be a double free situation. But this would be a highly intricate situation involving both raw ::free and os::free, because GuardedMemory also guards against simple double free errors (involving just os::free). This all sounds a bit far-fetched to me.

Currently the alternative explanation - that there is an error in the C-Runtime leading to a __thread variable not being initialized - seems more likely. I am usually hesitant to accuse the OS or the hardware, but we did have problems with AIX __thread handling in the past. I will examine the problem further.
10-03-2017

I expect TLS is part of thread-stack and/or comes from C-Heap. But the JVM should not be writing arbitrarily through the C-heap, it should only write to C-heap that has been allocated to it. I wonder if the zapping code may sometimes have an off-by-one-page error? Can you examine the zapped memory area to see where it extends? I wonder if you could easily use different zap values to try and determine which code is responsible?
10-03-2017

On AIX, TLS variables and C heap (and pthread stacks) are all located in the same data segment. C heap allocations interleave closely with __thread variable locations. There is not much I can do about that. At least this is what I see on AIX 5.3; maybe it is solved differently in newer releases. As for allocation/initialization, sure, it is handled by the CRT.
10-03-2017

The compiler-based TLS variables should not be anywhere that the JVM memory allocation routines will touch! Further, the allocation and initialization of the TLS variables in a new thread should all be handled by the C-runtime libraries.
09-03-2017