JDK-8208159 : runtime/jni/terminatedThread/TestTerminatedThread.java fails intermittently on Linux-X64
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 11,12
  • Priority: P4
  • Status: Closed
  • Resolution: Duplicate
  • OS: linux_ubuntu
  • CPU: x86_64
  • Submitted: 2018-07-24
  • Updated: 2021-11-09
  • Resolved: 2018-10-12
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 12
12Resolved
Related Reports
Duplicate :  
Relates :  
Description
The following test failed due to an exception on Linux X64
in the slowdebug config using jdk11+23 bits: 

runtime/jni/terminatedThread/TestTerminatedThread.java

The test only failed in 1 of 3 'slowdebug' bits runs so I'm
tagging this bug as intermittent. It did not fail at all in the
'fastdebug' or 'release' bits runs.

Here is a snippet from the log file: 

----------System.out:(11/312)----------
Working with thread: Thread[Thread-1,5,main],  in state: RUNNABLE
Calling suspend ...
Calling resume ...
Calling getStackTrace ...
[]
Calling setName("NewName") ...
Calling interrupt ...
Calling stop ...
Calling getThreadCpuTime ...
Native thread is running and attaching as daemon ...
Native thread terminating
----------System.err:(13/924)----------
java.lang.RuntimeException: Invalid ThreadCpuTime returned = 107789556 expected = -1
        at TestTerminatedThread.main(TestTerminatedThread.java:88)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:115)
        at java.base/java.lang.Thread.run(Thread.java:834)

JavaTest Message: Test threw exception: java.lang.RuntimeException: Invalid ThreadCpuTime returned = 107789556 expected = -1
JavaTest Message: shutting down test

STATUS:Failed.`main' threw exception: java.lang.RuntimeException: Invalid ThreadCpuTime returned = 107789556 expected = -1
----------rerun:(29/3557)*----------
Comments
Here are the logs for my jdk-12+15 Linux-X64 sightings: $ unzip -l jdk-12+15_linux.8208159.zip Archive: jdk-12+15_linux.8208159.zip Length Date Time Name --------- ---------- ----- ---- 12653 2018-10-12 20:56 jdk-12+15_2/TestTerminatedThread.jtr.release --------- ------- 12653 1 file
14-10-2018

Here are the logs for my jdk-12+14 Linux-X64 sightings: $ unzip -l jdk-12+14_linux.8208159.zip Archive: jdk-12+14_linux.8208159.zip Length Date Time Name --------- ---------- ----- ---- 12732 2018-10-05 19:37 jdk-12+14_2/TestTerminatedThread.jtr.slowdebug --------- ------- 12732 1 file
13-10-2018

The check for -1 will no longer be a failure and is being handled by JDK-8211931, so closing this as a duplicate.
12-10-2018

[~dcubed] Sorry Dan I misunderstood your test change and to be honest I don't get what it is trying to show - other than recycling is a rare event. So I'm back to assuming this is thread-id recycling. This should only happen when there is a lot of concurrent thread/process creation on the machine and has not been observed during any regular testing runs. So what to do? I don't want to relax the test to account for a result that only occurs with stress testing. Should I mark it @intermittent with a comment about stress testing? Thanks.
02-10-2018

> That suggests to me that the problem is not thread-id recycling, > but that there is a delay in the original thread id being marked as invalid. [~dholmes] - Sorry, I don't understand why you no longer think this is related to thread-id recycling. Please take a closer look at my changes to the test. The retry loop that I added to the test includes the point of thread creation. So I'm not retrying with the thread whose thread-id was recycled. I'm retrying with a completely new thread and that new thread did not have it's thread-id recycled after it terminated. + for (int i = 0; i < N_RETRIES; i++) { + Thread t = createTerminatedThread();
01-10-2018

Here are the logs for my jdk-12+13 Linux-X64 sightings: $ unzip -l jdk-12+13_linux.8208159.zip Archive: jdk-12+13_linux.8208159.zip Length Date Time Name --------- ---------- ----- ---- 12734 09-27-2018 22:47 jdk-12+13_1/TestTerminatedThread.jtr.fastdebug 12738 09-27-2018 20:53 jdk-12+13_1/TestTerminatedThread.jtr.slowdebug --------- ------- 25472 2 files
01-10-2018

Thanks Dan! That suggests to me that the problem is not thread-id recycling, but that there is a delay in the original thread id being marked as invalid. But that is inconsistent with the results of the pthread_kill debugging logic which showed the id was invalid before we continued with the rest of the test. This is very odd and I'm not at all sure what to do about it, if anything.
23-09-2018

Here are the logs for my jdk-12+12 Linux-X64 sightings: $ unzip -l jdk-12+12_linux.8208159.zip Archive: jdk-12+12_linux.8208159.zip Length Date Time Name --------- ---------- ----- ---- 13196 2018-09-21 01:26 jdk-12+12_1/TestTerminatedThread.jtr.release --------- ------- 13196 1 file Update: Here's a snippet of the log that shows the retry and the temporary exception throw: ----------System.out:(36/1064)---------- Working with thread: Thread[Thread-1,5,main], in state: RUNNABLE Calling suspend ... Calling resume ... Calling getStackTrace ... [] Calling setName("NewName") ... Calling interrupt ... Calling stop ... Calling getThreadCpuTime ... Invalid ThreadCpuTime returned = 9918157 expected = -1 Retrying the test case. Working with thread: Thread[Thread-2,5,main], in state: RUNNABLE Calling suspend ... Calling resume ... Calling getStackTrace ... [] Calling setName("NewName") ... Calling interrupt ... Calling stop ... Calling getThreadCpuTime ... Okay: getThreadCpuTime() reported -1 as expected Calling getThreadUserTime ... Okay: getThreadUserTime() reported -1 as expected Calling getThreadInfo ... "NewName" daemon prio=5 Id=26 RUNNABLE (in native) Calling getThreadInfo with stack ... "NewName" daemon prio=5 Id=26 RUNNABLE (in native) Test had a total of 1 failure(s) that were handled by a retry. Native thread is running and attaching as daemon ... Native thread terminating Native thread is running and attaching as daemon ... Native thread terminating ----------System.err:(13/898)---------- java.lang.RuntimeException: Failing the test to check the retry log output. at TestTerminatedThread.main(TestTerminatedThread.java:134) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127) at java.base/java.lang.Thread.run(Thread.java:835) JavaTest Message: Test threw exception: java.lang.RuntimeException: Failing the test to check the retry log output. JavaTest Message: shutting down test STATUS:Failed.`main' threw exception: java.lang.RuntimeException: Failing the test to check the retry log output. ----------rerun:(29/3532)*----------
23-09-2018

I updated the test to do a retry when this specific error is encountered back on 09.10 and I've been using the modified test in my Thread-SMR stress test runs and in my nightly jdk/jdk test runs. I've attached the modified test as diffs.8208159.txt.00. The test includes a temporary exception throw for when the retry code has been used. My jdk-12+12 Thread-SMR stress test run has had an instance of this test failure and only one retry was needed to make the test pass. I'll attach the log file when all three rounds of Thread-SMR stress testing are finished.
21-09-2018

IIRC in the actual pthreads implementation on LInux, thread ids and process ids all come from the same pool. > if the underlying native thread happens to get recycled, then some of those "things" no longer return the expected values. Right. The initial fix and test was just trying to account for the ESRCH error that might arise. But if the id is recycled then we have a different set of failure modes. We'd have to keep a map of live (JavaThread, pthread_t) pairs and when a new thread attaches check that we don't already appear to have a record of that pthread_t. But that is a lot of overhead to try and detect a very unlikely and obscure error. Further, if we do detect it what can we do about it? It's the pre-existing thread that is in error and we can't make it do anything as it no longer has a physical execution context. At best print a warning in the unrelated thread that detects the problem.
09-09-2018

Here are the logs for my jdk-12+10 Linux-X64 sightings: $ unzip -l jdk-12+10_linux.8208159.zip Archive: jdk-12+10_linux.8208159.zip Length Date Time Name --------- ---------- ----- ---- 12557 2018-09-08 02:51 jdk-12+10_2/TestTerminatedThread.jtr.slowdebug --------- ------- 12557 1 file
09-09-2018

Here's the relevant section from the pthreads(7) man page on my Ubuntu 16.04 machine: Thread IDs Each of the threads in a process has a unique thread identifier (stored in the type pthread_t). This identifier is returned to the caller of pthread_create(3), and a thread can obtain its own thread identifier using pthread_self(3). Thread IDs are guaranteed to be unique only within a process. (In all pthreads functions that accept a thread ID as an argument, that ID by definition refers to a thread in the same process as the caller.) The system may reuse a thread ID after a terminated thread has been joined, or a detached thread has terminated. POSIX says: "If an appli‐ cation attempts to use a thread ID whose lifetime has ended, the behav‐ ior is undefined." So I think (but have not proven) that we're in the territory of that last paragraph. The test has run the native code, called pthread_join() and returned to the caller. In the Java part of the test, we still have a valid java.lang.Thread object and we're doing various things with it... if the underlying native thread happens to get recycled, then some of those "things" no longer return the expected values. I'm still trying to figure out a way to prove this situation (to myself at least)... I happened to get a sighting yesterday, but I didn't have all my revised debug code in place. I did an overnight run with three parallel test runs (1 each - release, fastdebug, slowdebug) and didn't get a single hit with the revised debug code... sigh...
07-09-2018

Thanks Dan! Note that the recycled thread id need not be within the same Java process - so no guarantee your Guarantee would fail.
06-09-2018

The debug code that I'm playing with: - Adds a "jint (JNICALL *SkipDetachCurrentThread)(JavaVM *vm)" function so that the test can tell the VM that it is skipping the "normal" call to JNI DetachCurrentThread(). This debug function sets a flag on the JavaThread: void set_skipped_jni_detach() { _skipped_jni_detach = true; } - Adds debug code that uses new JavaThread accessor: bool skipped_jni_detach() { return _skipped_jni_detach; } to detect when a non -1 value is being returned by jmm_GetThreadCpuTime() for a thread that has skipped the JNI DetachCurrentThread() call. If we detect this situation, then we search the current ThreadsList for another JavaThread with the same java_thread->osthread()->thread_id() - If we have found use of a duplicate thread_id() then we fail a guarantee. With the above code in place, I kicked of the usual JavaTest runs done by the Thread-SMR Stress kit. The TestTerminatedThread.java test did not fail in 'release', 'fastdebug' or 'slowdebug' runs. So I launched 5 doit_loop runs each for 'release', 'fastdebug' and 'slowdebug' config and let them run over night. So 15 executions of TestTerminatedThread.java running in parallel... Did not see a single failure: $ grep -v PASS doit_loop_8208159.[1-9]*/doit_loop_8208159.log | sort -t. -n -k 2 doit_loop_8208159.1/doit_loop_8208159.log:Run #100736... doit_loop_8208159.2/doit_loop_8208159.log:Run #100654... doit_loop_8208159.3/doit_loop_8208159.log:Run #100673... doit_loop_8208159.4/doit_loop_8208159.log:Run #101048... doit_loop_8208159.5/doit_loop_8208159.log:Run #100628... doit_loop_8208159.6/doit_loop_8208159.log:Run #58740... doit_loop_8208159.7/doit_loop_8208159.log:Run #58242... doit_loop_8208159.8/doit_loop_8208159.log:Run #58694... doit_loop_8208159.9/doit_loop_8208159.log:Run #58281... doit_loop_8208159.10/doit_loop_8208159.log:Run #58230... doit_loop_8208159.11/doit_loop_8208159.log:Run #26158... doit_loop_8208159.12/doit_loop_8208159.log:Run #26470... doit_loop_8208159.13/doit_loop_8208159.log:Run #26326... doit_loop_8208159.14/doit_loop_8208159.log:Run #26372... doit_loop_8208159.15/doit_loop_8208159.log:Run #26141... which is all very disappointing. Update: I keep forgetting that the M&M layer has a few different ways to get CPU time. I had added debug code to jmm_GetThreadCpuTime(), but jmm_GetThreadCpuTimeWithKind() is the function used by this particular failing test... sigh... updating the debug package...
06-09-2018

I wrote some code on Friday to examine the possibility of a recycled thread id, but I didn't get a chance to run the code until Tuesday due to the jdk-11+28 Thread-SMR stress test run I was doing. I haven't gotten a hit with the new code in place yet so no new info...
05-09-2018

So with 16 parallel test jobs I wonder if the original thread id has been recycled by the time we try to use it? If that is what is happening then there is no way to determine that. To account for this the test would have to allow for unexpected success - which somewhat weakens the test. [~dcubed] Is it worth fixing the test to account for this corner case?
05-09-2018

Here are the logs for my jdk11+28 Linux-X64 sightings: $ unzip -l jdk-11+28_linux.8208159.zip Archive: jdk-11+28_linux.8208159.zip Length Date Time Name --------- ---------- ----- ---- 12607 2018-09-01 01:32 jdk-11+28_1/TestTerminatedThread.jtr.fastdebug --------- ------- 12607 1 file
04-09-2018

A failure showed up in my jdk/jdk11_baseline auto-build-and-test setup: $ unzip -l test_failures.2018-08-30-002627.zip Archive: test_failures.2018-08-30-002627.zip Length Date Time Name --------- ---------- ----- ---- 12364 2018-08-30 00:50 test_failures.2018-08-30-002627/TestTerminatedThread.jtr.fastdebug --------- ------- 12364 1 file So that marks my first JDK11 failure outside of my Thread-SMR testing rig. The test part of my auto-build-and-test setup also runs with 16 parallel tests (like the Thread-SMR testing rig).
31-08-2018

A failure showed up in my jdk/jdk_baseline auto-build-and-test setup: $ unzip -l test_failures.2018-08-22-232846.zip Archive: test_failures.2018-08-22-232846.zip Length Date Time Name --------- ---------- ----- ---- 12288 2018-08-22 23:53 test_failures.2018-08-22-232846/TestTerminatedThread.jtr.fastdebug --------- ------- 12288 1 file So that marks my first JDK12 sighting.
31-08-2018

Many thanks Dan! I'm stumped.
17-08-2018

No sightings in my jdk-11+26 Thread-SMR testing last week. So far there is one sighting in this week's jdk-11+27 Thread-SMR testing. Here are the logs for my jdk11+27 Linux-X64 sightings: $ unzip -l jdk-11+27_linux.8208159.zip Archive: jdk-11+27_linux.8208159.zip Length Date Time Name --------- ---------- ----- ---- 12521 2018-08-17 05:38 jdk-11+27_1/TestTerminatedThread.jtr.release --------- ------- 12521 1 file Here's a snippet of the log file with David's log output: ----------System.out:(12/354)---------- Working with thread: Thread[Thread-1,5,main], in state: RUNNABLE Calling suspend ... Calling resume ... Calling getStackTrace ... [] Calling setName("NewName") ... Calling interrupt ... Calling stop ... Calling getThreadCpuTime ... Native thread is running and attaching as daemon ... Native thread terminating Native thread was gone after 0 iterations ----------System.err:(13/918)---------- java.lang.RuntimeException: Invalid ThreadCpuTime returned = 5442674 expected = -1 at TestTerminatedThread.main(TestTerminatedThread.java:88) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:115) at java.base/java.lang.Thread.run(Thread.java:834) JavaTest Message: Test threw exception: java.lang.RuntimeException: Invalid ThreadCpuTime returned = 5442674 expected = -1 JavaTest Message: shutting down test STATUS:Failed.`main' threw exception: java.lang.RuntimeException: Invalid ThreadCpuTime returned = 5442674 expected = -1 ----------rerun:(29/3529)*---------- There is no sign of this message in the log file: warning("FastCPU clocks are not being used!");
17-08-2018

I've added the above patch to my Linux repo for my jdk-11+26 Thread-SMR stress testing runs this weekend.
10-08-2018

The only thing I can think of (that doesn't involve bizarre OS behaviour) is if somehow "fast clocks" are not being used. diff -r 554bb4e2d10d src/hotspot/os/linux/os_linux.cpp --- a/src/hotspot/os/linux/os_linux.cpp +++ b/src/hotspot/os/linux/os_linux.cpp @@ -1298,6 +1298,8 @@ _supports_fast_thread_cpu_time = true; _pthread_getcpuclockid = pthread_getcpuclockid_func; } + if (!_supports_fast_thread_cpu_time) + warning("FastCPU clocks are not being used!"); }
10-08-2018

Please let me know if you have a new experiment to run. jdk-11+26 should be tagged sometime today and I'm planning to kick off another Thread-SMR stress test run sometime today.
09-08-2018

Thanks Dan! This is very interesting - in a bad way :). The thread has terminated; pthread_kill non-zero return confirms that, yet we can still get a CPU clock for the terminated thread! Makes no sense.
07-08-2018

Here are the logs for my jdk11+25 Linux-X64 sightings: $ unzip -l jdk-11+25_linux.8208159.zip Archive: jdk-11+25_linux.8208159.zip Length Date Time Name --------- ---------- ----- ---- 12535 2018-08-03 04:44 jdk-11+25_1/TestTerminatedThread.jtr.release --------- ------- 12535 1 file Here's a snippet of the log file with David's log output: ----------System.out:(12/354)---------- Working with thread: Thread[Thread-1,5,main], in state: RUNNABLE Calling suspend ... Calling resume ... Calling getStackTrace ... [] Calling setName("NewName") ... Calling interrupt ... Calling stop ... Calling getThreadCpuTime ... Native thread is running and attaching as daemon ... Native thread terminating Native thread was gone after 0 iterations ----------System.err:(13/921)---------- java.lang.RuntimeException: Invalid ThreadCpuTime returned = 15962372 expected = -1 at TestTerminatedThread.main(TestTerminatedThread.java:88) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:115) at java.base/java.lang.Thread.run(Thread.java:834) JavaTest Message: Test threw exception: java.lang.RuntimeException: Invalid ThreadCpuTime returned = 15962372 expected = -1 JavaTest Message: shutting down test STATUS:Failed.`main' threw exception: java.lang.RuntimeException: Invalid ThreadCpuTime returned = 15962372 expected = -1 ----------rerun:(29/3532)*---------- I think this is the interesting line: > Native thread was gone after 0 iterations
06-08-2018

I've added the change to my Linux Thread-SMR stress repo and will include it in my next run which should be jdk-11+36 on Thursday of this week. Will keep you posted.
31-07-2018

[~dcubed] Dan, can you apply the following patch and reproduce the failure and see what it produces please: diff -r 2ce72467c4e8 test/hotspot/jtreg/runtime/jni/terminatedThread/libterminatedThread.c --- a/test/hotspot/jtreg/runtime/jni/terminatedThread/libterminatedThread.c +++ b/test/hotspot/jtreg/runtime/jni/terminatedThread/libterminatedThread.c @@ -25,6 +25,7 @@ #include <stdlib.h> #include <pthread.h> +#include <signal.h> #include <string.h> #include "jni.h" @@ -92,5 +93,9 @@ exit(1); } + while (pthread_kill(thread, 0) == 0) { + res++; + } + printf("Native thread was gone after %d iterations\n", res); return nativeThread; }
31-07-2018

The failure suggests that the OS still recognizes the pthread_t of the terminated thread and has not yet marked it dead or recycled it. I can not yet reproduce this - it may well be OS dependent. Seems strange to be intermittent though.
31-07-2018

Here are the logs for my jdk11+24 linux sightings: $ unzip -l jdk-11+24_linux.8208159.zip Archive: jdk-11+24_linux.8208159.zip Length Date Time Name --------- ---------- ----- ---- 12570 2018-07-27 01:21 jdk-11+24_1/TestTerminatedThread.jtr.slowdebug --------- ------- 12570 1 file
30-07-2018

Here are the logs for my jdk11+23 linux sightings: $ unzip -l jdk-11+23_linux.8208159.zip Archive: jdk-11+23_linux.8208159.zip Length Date Time Name --------- ---------- ----- ---- 12577 2018-07-21 01:49 jdk-11+23_1/TestTerminatedThread.jtr.slowdebug --------- ------- 12577 1 file
24-07-2018