JDK-8330470 : TLAB initialization may cause div by zero
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 17.0.10
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • OS: linux_redhat_8.0
  • CPU: x86_64
  • Submitted: 2024-04-17
  • Updated: 2024-11-26
Description
This seems to be related to, but slightly different from, the case fixed in JDK-8308766.

We see a crash with SIGFPE at ThreadLocalAllocBuffer::initial_desired_size()

It is reproducible only on a specific set of machines and is not visible anywhere else with the same application. Not sure what is necessary to reproduce it elsewhere.


One possible candidate for a SIGFPE in the code is 

 init_sz  = (Universe::heap()->tlab_capacity(thread()) / HeapWordSize) /
                      (nof_threads * target_refills());

at https://github.com/openjdk/jdk17u/blob/jdk-17.0.10-ga/src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp#L280

HeapWordSize seems to be a constant, but maybe either nof_threads or target_refills() can be zero in some cases?
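
To make the suspected failure mode concrete, here is a minimal standalone sketch of the same arithmetic with a guard on the divisor. It is not HotSpot code: the names initial_tlab_words, kHeapWordSize and self_check are hypothetical, and the 8-byte word size is an assumption for a 64-bit VM. On x86_64 an integer division by zero raises a hardware trap that Linux reports as SIGFPE, which would match the crash above if the product of the two factors were zero.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical stand-in for HeapWordSize; assumes a 64-bit VM.
constexpr std::size_t kHeapWordSize = 8;

// Mirrors the shape of the expression quoted above, but refuses to divide
// when the divisor is zero instead of trapping.
// All parameter names are illustrative, not HotSpot APIs.
std::optional<std::size_t> initial_tlab_words(std::size_t tlab_capacity_bytes,
                                              std::size_t nof_threads,
                                              std::size_t target_refills) {
  const std::size_t divisor = nof_threads * target_refills;
  if (divisor == 0) {
    // The unguarded division would fault here.
    return std::nullopt;
  }
  return (tlab_capacity_bytes / kHeapWordSize) / divisor;
}

// Usage sketch: exercises the normal path and the zero-divisor path.
inline void self_check() {
  // 1 MiB / 8 = 131072 words; 131072 / (4 * 50) = 655 (integer division).
  assert(initial_tlab_words(1024 * 1024, 4, 50).value() == 655);
  // A zero target_refills makes the divisor zero regardless of nof_threads.
  assert(!initial_tlab_words(1024 * 1024, 4, 0).has_value());
}
```

The sketch only illustrates that a single zero factor is enough to make the whole divisor zero; whether that can actually happen inside the VM is the open question of this report.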

bits from hs_err_pid:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGFPE (0x8) at pc=0x00007efeed9b5b9c, pid=3048299, tid=3050463
#
# JRE version:  (17.0.10+7) (build )
# Java VM: OpenJDK 64-Bit Server VM (17.0.10+7, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xe68b9c]  ThreadLocalAllocBuffer::initial_desired_size()+0x10c


Stack: [0x00007efd9c21e000,0x00007efd9ca1e000],  sp=0x00007efd9ca1cd20,  free space=8187k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xe68b9c]  ThreadLocalAllocBuffer::initial_desired_size()+0x10c
V  [libjvm.so+0xe68be4]  ThreadLocalAllocBuffer::initialize()+0x24
V  [libjvm.so+0x8bfec4]  attach_current_thread.part.0+0x94
V  [libjvm.so+0x8c023d]  jni_AttachCurrentThread+0x6d
C  0x00007efd9cb5b701
C  0x00007efd9cb5ba4e


Potential workarounds:
* Disable TLAB with -XX:-UseTLAB - may have a large performance impact
* Configure an initial TLAB size via the JVM parameter -XX:TLABSize=... to try to avoid the code branch that crashes (https://github.com/openjdk/jdk17u/blob/jdk-17.0.10-ga/src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp#L273), e.g. -XX:TLABSize=2k (must be between 1k and 512k). It seems the JDK uses this only as the "initial" size and resizes properly afterwards, see https://answers.ycrash.io/question/what-is-jvm-startup-parameter--xxtlabsize?q=833

"Chatty" logging for TLAB sizing can be enabled via -Xlog:tlab*=debug,tlab*=trace:file=gc.log:time:filecount=7,filesize=8M

Comments
I suspect this relates to JDK-8308341 which is fixed in 21. It was not backported due to the slight adjustment needed to the JNI specification. I don't know if that is a blocker for backporting or not. CORRECTION: there was no JNI spec adjustment. No idea why I thought there was.
24-10-2024

The code for jni_AttachCurrentThread() is as follows:
  jint JNICALL jni_AttachCurrentThread(JavaVM *vm, void **penv, void *_args) {
    HOTSPOT_JNI_ATTACHCURRENTTHREAD_ENTRY(vm, penv, _args);
    if (vm_created == NOT_CREATED) {
      // Not sure how we could possibly get here.
      HOTSPOT_JNI_ATTACHCURRENTTHREAD_RETURN((uint32_t) JNI_ERR);
      return JNI_ERR;
    }
    jint ret = attach_current_thread(vm, penv, _args, false);
    HOTSPOT_JNI_ATTACHCURRENTTHREAD_RETURN(ret);
    return ret;
  }
However:
* vm_created can be IN_PROGRESS while the VM is still being initialized, i.e. that check can pass while the VM is not fully initialized
* vm_created is read without load_acquire, so the change to vm_created may be visible to that thread long before other initialization
18-04-2024
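
The two observations in the comment above can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the HotSpot implementation and not the change made for JDK-8308341: VmModel, begin_init, finish_init and try_attach are hypothetical names, target_refills stands in for whatever state TLAB initialization depends on, and kOk/kErr stand in for JNI_OK/JNI_ERR. The idea shown is that an attach is refused until the state is COMPLETE, and that the state is published with release semantics and read with acquire semantics so that earlier initialization writes are visible to the attaching thread.

```cpp
#include <atomic>
#include <cassert>

// Assumed to mirror the three creation states named in the comment above.
enum VmState : int { NOT_CREATED = 0, IN_PROGRESS = 1, COMPLETE = 2 };

constexpr int kOk = 0;    // stands in for JNI_OK
constexpr int kErr = -1;  // stands in for JNI_ERR

// Hypothetical model of a VM whose attach path depends on initialized state.
struct VmModel {
  std::atomic<int> vm_created{NOT_CREATED};
  unsigned target_refills = 0;  // written once during initialization

  void begin_init() {
    vm_created.store(IN_PROGRESS, std::memory_order_relaxed);
  }

  void finish_init(unsigned refills) {
    target_refills = refills;
    // Release: everything written above becomes visible to any thread
    // that later observes COMPLETE with an acquire load.
    vm_created.store(COMPLETE, std::memory_order_release);
  }

  int try_attach() const {
    // Checking for COMPLETE (not merely != NOT_CREATED) also rejects
    // IN_PROGRESS, which is the window described in the comment.
    if (vm_created.load(std::memory_order_acquire) != COMPLETE) {
      return kErr;
    }
    // Safe to read: ordered after finish_init() by the acquire load.
    assert(target_refills != 0);
    return kOk;
  }
};
```

In this model a thread that calls try_attach() during initialization gets an error instead of proceeding with a zero target_refills. Whether rejecting the attach or blocking until initialization completes is the right behavior for the real VM is a design decision this sketch does not settle.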

Fwiw, the relevant TLAB initialization code did not change in later JDKs.
17-04-2024

How do you run the VM? The stack trace shows jni_AttachCurrentThread, which indicates use of the invocation API. I.e. maybe your application hosts the VM via create_VM() and then attaches threads? Knowing that would help clarify the circumstances, as I can't find a code path where either nof_threads or target_refills() can be zero when the VM is started normally. However, if threads are attached externally, there could be a race between the attach and the initialization of the VM, in particular the initialization of the variable that target_refills() uses. nof_threads always contains something > 0.
17-04-2024

The process is a standard deployment of Elasticsearch 7.17.5. It looks like a normal main() (see e.g. https://github.com/elastic/elasticsearch/blob/v7.17.5/server/src/main/java/org/elasticsearch/bootstrap/Elasticsearch.java). However, there is a Dynatrace Agent injected; could this interfere here?
17-04-2024