JDK-8345970 : pthread_getcpuclockid related crashes in shenandoah tests
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 24,25
  • Priority: P2
  • Status: Resolved
  • Resolution: Fixed
  • OS: linux,linux_alpine
  • CPU: x86_64,aarch64
  • Submitted: 2024-12-11
  • Updated: 2025-01-06
  • Resolved: 2024-12-13
JDK 24: Fixed in 24
JDK 25: Fixed in 25 b03
Related Reports
Duplicate :  
Relates :  
Relates :  
Description
We have recently seen a few similar crashes, mostly in Shenandoah jtreg tests on jdk24. So far these crashes have been seen on Alpine Linux.
Since 30 Nov 2024 we have observed the crashes in the following hs jtreg tests:
gc/shenandoah/TestEvilSyncBug.java#generational   3 times
gc/shenandoah/oom/TestClassLoaderLeak.java   2 times

Maybe Shenandoah now calls pthread_getcpuclockid more often, and sometimes calls it on "bad" (already terminated?) threads.
It looks like Alpine is more sensitive to this, and pthread_getcpuclockid crashes on such threads?
Could we add a small check or assert for 'good' threads?
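
For illustration, here is a minimal sketch of what a guarded CPU-time query might look like. It is not the HotSpot code; the helper names thread_cpu_time_ns and sample_live_worker are hypothetical. The key assumption is that the caller only passes a thread ID that is still within its lifetime (created and not yet joined), because a return-code check alone cannot protect against a stale ID on every libc.

```cpp
#include <pthread.h>
#include <time.h>
#include <atomic>
#include <cstdint>

// Hypothetical helper: CPU time consumed by `thread` in nanoseconds, or -1 on error.
// Precondition (assumption): `thread` has not been joined and is not a detached
// thread that may already have exited. Passing a stale ID is undefined behavior,
// so the error checks below only cover failures the libc chooses to report.
std::int64_t thread_cpu_time_ns(pthread_t thread) {
  clockid_t clock_id;
  if (pthread_getcpuclockid(thread, &clock_id) != 0) {
    return -1;  // e.g. ESRCH, if the implementation detects a dead thread
  }
  struct timespec ts;
  if (clock_gettime(clock_id, &ts) != 0) {
    return -1;  // e.g. EINVAL, reported instead of asserting
  }
  return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Worker that spins until the flag passed via `arg` is set.
static void* spin_until_stopped(void* arg) {
  auto* stop = static_cast<std::atomic<bool>*>(arg);
  while (!stop->load(std::memory_order_acquire)) {
  }
  return nullptr;
}

// Samples a worker while it is guaranteed to be alive, then joins it.
std::int64_t sample_live_worker() {
  std::atomic<bool> stop{false};
  pthread_t worker;
  if (pthread_create(&worker, nullptr, spin_until_stopped, &stop) != 0) {
    return -1;
  }
  // The worker is joinable and not yet joined, so its ID is still valid here.
  std::int64_t ns = thread_cpu_time_ns(worker);
  stop.store(true, std::memory_order_release);
  pthread_join(worker, nullptr);
  // After the join, `worker` must no longer be passed to thread_cpu_time_ns.
  return ns;
}
```

The sample is taken strictly between pthread_create and pthread_join, which is the window in which the thread ID is guaranteed to be usable.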

Example :
# SIGSEGV (0xb) at pc=0x00007fd79548e234, pid=24021, tid=24114
#
# JRE version: OpenJDK Runtime Environment (24.0) (fastdebug build 24-internal-adhoc.jenkinsi.jdk)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 24-internal-adhoc.jenkinsi.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-amd64)
# Problematic frame:
# C [ld-musl-x86_64.so.1+0x56234] pthread_getcpuclockid+0x0

Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [ld-musl-x86_64.so.1+0x56234] pthread_getcpuclockid+0x0
V [libjvm.so+0x1889bf4] ThreadTimeAccumulator::do_thread(Thread*)+0x14 (shenandoahMmuTracker.cpp:51)
V [libjvm.so+0x18890c0] ShenandoahMmuTracker::fetch_cpu_times(double&, double&)+0x50 (shenandoahMmuTracker.cpp:76)
V [libjvm.so+0x18895ce] ShenandoahMmuTracker::record_young(unsigned long)+0x6e (shenandoahMmuTracker.cpp:100)
V [libjvm.so+0x17db715] ShenandoahGenerationalControlThread::service_concurrent_cycle(ShenandoahHeap*, ShenandoahGeneration*, GCCause::Cause&, bool)+0x1e5 (shenandoahGenerationalControlThread.cpp:618)
V [libjvm.so+0x17dc0c8] ShenandoahGenerationalControlThread::service_concurrent_normal_cycle(ShenandoahGenerationalHeap*, ShenandoahGenerationType, GCCause::Cause)+0x128 (shenandoahGenerationalControlThread.cpp:581)
V [libjvm.so+0x17dcde2] ShenandoahGenerationalControlThread::run_service()+0x642 (shenandoahGenerationalControlThread.cpp:229)
V [libjvm.so+0xabca5b] ConcurrentGCThread::run()+0x1b (concurrentGCThread.cpp:48)
V [libjvm.so+0x1a8b2d6] Thread::call_run()+0xb6 (thread.cpp:232)
V [libjvm.so+0x15aa58a] thread_native_entry(Thread*)+0x17a (os_linux.cpp:849)

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x00007fd7792b8b68
Comments
A pull request was submitted for review. Branch: jdk24 URL: https://git.openjdk.org/jdk/pull/22933 Date: 2025-01-06 18:03:20 +0000
06-01-2025

Yes, looks like the same crash. I shall backport the PR to jdk24.
06-01-2025

I've also noticed this assert (however in jdk24), again on Linux Alpine:
test gc/shenandoah/TestAllocObjectArrays.java#generational
# Internal Error (/priv/jenkins/client-home/workspace/openjdk-24u-linux_alpine_x86_64-dbg/jdk/src/hotspot/os/linux/os_linux.cpp:4323), pid=26906, tid=26918
# assert(status == 0) failed: clock_gettime error: Invalid argument
#
# JRE version: OpenJDK Runtime Environment (24.0.1) (fastdebug build 24.0.1-internal-adhoc.jenkinsi.jdk)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 24.0.1-internal-adhoc.jenkinsi.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0x15a2cbe] os::Linux::fast_thread_cpu_time(int) [clone .part.0]+0xe

V [libjvm.so+0x15a2cbe] os::Linux::fast_thread_cpu_time(int) [clone .part.0]+0xe (os_linux.cpp:4323)
V [libjvm.so+0x15a80b0] os::current_thread_cpu_time(bool)+0x0 (os_linux.cpp:5060)
V [libjvm.so+0x188a084] ThreadTimeAccumulator::do_thread(Thread*)+0x14 (shenandoahMmuTracker.cpp:51)
V [libjvm.so+0x1889550] ShenandoahMmuTracker::fetch_cpu_times(double&, double&)+0x50 (shenandoahMmuTracker.cpp:76)
V [libjvm.so+0x1889a5e] ShenandoahMmuTracker::record_young(unsigned long)+0x6e (shenandoahMmuTracker.cpp:100)
V [libjvm.so+0x17db7d5] ShenandoahGenerationalControlThread::service_concurrent_cycle(ShenandoahHeap*, ShenandoahGeneration*, GCCause::Cause&, bool)+0x1e5 (shenandoahGenerationalControlThread.cpp:618)
V [libjvm.so+0x17dc188] ShenandoahGenerationalControlThread::service_concurrent_normal_cycle(ShenandoahGenerationalHeap*, ShenandoahGenerationType, GCCause::Cause)+0x128 (shenandoahGenerationalControlThread.cpp:581)
V [libjvm.so+0x17dcea2] ShenandoahGenerationalControlThread::run_service()+0x642 (shenandoahGenerationalControlThread.cpp:229)
V [libjvm.so+0xabdceb] ConcurrentGCThread::run()+0x1b (concurrentGCThread.cpp:48)
V [libjvm.so+0x1a8bc56] Thread::call_run()+0xb6 (thread.cpp:232)
V [libjvm.so+0x15a99fa] thread_native_entry(Thread*)+0x17a (os_linux.cpp:860)

Is this somehow related to your fix? (but if so we need a backport to jdk24 I guess)
30-12-2024

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/shenandoah-jdk21u/pull/145 Date: 2024-12-13 18:26:36 +0000
13-12-2024

Changeset: 2ce53e88 Branch: master Author: William Kemper <wkemper@openjdk.org> Date: 2024-12-13 17:41:26 +0000 URL: https://git.openjdk.org/jdk/commit/2ce53e88481659734bc5424c643c5e31c116bc5d
13-12-2024

I see Ramki is already on top of this in the PR. As he says, the real fix is to manage the thread lifecycles correctly and stop processing threads that could have terminated.
12-12-2024
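
As a rough illustration of the lifecycle point in the comment above, the sketch below samples CPU time only for threads that are currently registered, with registration and sampling serialized by one mutex. This is a generic pattern under stated assumptions, not the change made in the PR; LiveThreadRegistry, registered_worker and run_registry_demo are hypothetical names, and error handling for pthread_create is omitted for brevity.

```cpp
#include <pthread.h>
#include <sched.h>
#include <time.h>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical registry: a thread is sampled only while it is registered.
class LiveThreadRegistry {
 public:
  void add(pthread_t t) {
    std::lock_guard<std::mutex> guard(mu_);
    threads_.push_back(t);
  }

  // Called by the thread itself before it returns from its entry function.
  void remove(pthread_t t) {
    std::lock_guard<std::mutex> guard(mu_);
    for (std::size_t i = 0; i < threads_.size(); ++i) {
      if (pthread_equal(threads_[i], t)) {
        threads_.erase(threads_.begin() + i);
        return;
      }
    }
  }

  std::size_t size() {
    std::lock_guard<std::mutex> guard(mu_);
    return threads_.size();
  }

  // Sums CPU time over registered threads. The lock is held for the whole
  // walk, so no registered thread can unregister (and then exit) mid-sample.
  std::int64_t total_cpu_ns() {
    std::lock_guard<std::mutex> guard(mu_);
    std::int64_t total = 0;
    for (pthread_t t : threads_) {
      clockid_t cid;
      struct timespec ts;
      if (pthread_getcpuclockid(t, &cid) == 0 && clock_gettime(cid, &ts) == 0) {
        total += static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
      }
    }
    return total;
  }

 private:
  std::mutex mu_;
  std::vector<pthread_t> threads_;
};

struct WorkerArgs {
  LiveThreadRegistry* registry;
  std::atomic<bool>* stop;
};

static void* registered_worker(void* p) {
  auto* args = static_cast<WorkerArgs*>(p);
  args->registry->add(pthread_self());
  while (!args->stop->load(std::memory_order_acquire)) {
    sched_yield();
  }
  args->registry->remove(pthread_self());  // unregister before terminating
  return nullptr;
}

// Runs two workers, samples them while registered, and returns how many
// threads are still registered after all workers have been joined.
std::size_t run_registry_demo(std::int64_t* sampled_ns) {
  LiveThreadRegistry registry;
  std::atomic<bool> stop{false};
  WorkerArgs args{&registry, &stop};
  pthread_t workers[2];
  for (pthread_t& w : workers) {
    pthread_create(&w, nullptr, registered_worker, &args);
  }
  while (registry.size() < 2) {
    sched_yield();  // wait until both workers have registered
  }
  *sampled_ns = registry.total_cpu_ns();
  stop.store(true, std::memory_order_release);
  for (pthread_t& w : workers) {
    pthread_join(w, nullptr);
  }
  return registry.size();
}
```

Because a worker removes itself under the same lock the sampler holds, the sampler never sees an ID whose thread has already gone past its unregister step.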

Could you verify JDK-8345501 after this bug is fixed?
12-12-2024

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/22693 Date: 2024-12-11 22:32:00 +0000
11-12-2024

Btw, setting the issue to Linux Alpine only might be sufficient for the current crashes, but the documentation at https://pubs.opengroup.org/onlinepubs/9799919799/functions/pthread_getcpuclockid.html says: "If an implementation detects use of a thread ID after the end of its lifetime, it is recommended that the function should fail and report an [ESRCH] error." So it seems to be only a recommendation that the function handles 'bad' thread IDs gracefully. (Hopefully we are on the safe side at least with non-ancient glibc versions.)
11-12-2024

It seems we call pthread_getcpuclockid with a 'bad' / invalid thread ID, and this might crash on Alpine according to the discussion in https://bugs.openjdk.org/browse/JDK-8240187 : https://www.openwall.com/lists/musl/2020/02/10/6
11-12-2024