JDK-8334167 : Test java/lang/instrument/NativeMethodPrefixApp.java timed out
  • Type: Bug
  • Component: core-svc
  • Sub-Component: java.lang.instrument
  • Affected Version: 23,24
  • Priority: P4
  • Status: In Progress
  • Resolution: Unresolved
  • Submitted: 2024-06-12
  • Updated: 2024-07-12
Related Reports
Relates :  
Description
Nothing to go on in the logs - no obvious error messages

test result: Error. Program `/opt/mach5/mesos/work_dir/jib-master/install/jdk-24+2-97/linux-aarch64-debug.jdk/jdk-24/fastdebug/bin/java' timed out (timeout set to 960000ms, elapsed time including timeout handling was 1043740ms).
Comments
A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/20154 Date: 2024-07-12 09:22:54 +0000
12-07-2024

> Or is unpark() throwing an Error practically unlikely?

OOME is possible, in which case the implementation retries indefinitely. SOE is possible but not really feasible to reliably recover from. If an agent instruments code in this area then all manner of breakage is possible, as is executing arbitrary code in the transform method when the CFLH/transform is called in the context of critical code. We've eliminated the loading of supporting classes as much as possible in this area. Note that the issues under discussion here are not limited to virtual threads; they apply to much of the core platform. JVMTI agents are less prone to some of the issues.
10-07-2024

> Still not quite right. -Duse.JTREG_TEST_THREAD_FACTORY=Virtual is just an internal (Oracle) way to document that a test task is being run with the make option JTREG_TEST_THREAD_FACTORY=Virtual. Think of it as a comment. Noted
09-07-2024

> So it appears that when -Duse.JTREG_TEST_THREAD_FACTORY=Virtual is set, like you note, it ends up being a way to launch "jtreg" command with the "-testThreadFactory=Virtual" option (and another related option) through "make".

Still not quite right. -Duse.JTREG_TEST_THREAD_FACTORY=Virtual is just an internal (Oracle) way to document that a test task is being run with the make option JTREG_TEST_THREAD_FACTORY=Virtual. Think of it as a comment. It actually does nothing and the test task would be run the same without it. Using JTREG_TEST_THREAD_FACTORY=Virtual with "make test" triggers the jtreg use of the Virtual.java class, and as you point out, this is where the test.thread.factory property is set to "Virtual".

> It's odd that we have 3 different system properties for the same thing.

Not 3 properties, but there are 3 mechanisms in play here, and they are not all different ways of triggering the use of virtual threads. They are parts of a single mechanism. It starts with (1) specifying JTREG_TEST_THREAD_FACTORY=Virtual on the "make test" command line, which triggers (2) running jtreg with -testThreadFactory=Virtual and using the Virtual.java class, and (3) Virtual.java sets the test.thread.factory property to "Virtual". And there is more after this (ProcessTools noting the test.thread.factory property and passing it on to spawned test processes, and also nsk/share/MainWrapper doing something similar for debuggee processes). And there is still more. When launching a test's main thread, jtreg uses Virtual.newThread() to create it as a virtual thread. This happens as a result of the "test thread factory" abstraction triggered by running jtreg with -testThreadFactory=Virtual. jtreg uses reflection to load the specified testThreadFactory class and then uses newThread() on that class to create the test's main thread. So running jtreg with -testThreadFactory=Virtual means that Virtual.newThread() is used to create the main test thread.
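For readers who haven't looked at it, the test thread factory that jtreg loads is just a java.util.concurrent.ThreadFactory implementation. A minimal sketch of what such a class looks like (this is only an approximation of test/jtreg_test_thread_factory/src/share/classes/Virtual.java, not a verbatim copy):

    import java.util.concurrent.ThreadFactory;

    public class Virtual implements ThreadFactory {
        static {
            // Advertise the mode to test libraries such as ProcessTools.
            try {
                System.setProperty("test.thread.factory", "Virtual");
            } catch (Throwable t) {
                // ignore
            }
        }

        @Override
        public Thread newThread(Runnable task) {
            // jtreg loads this class by reflection and calls newThread()
            // to create the test's main thread as a virtual thread.
            return Thread.ofVirtual().unstarted(task);
        }
    }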
09-07-2024

Hello Chris,

> -Duse.JTREG_TEST_THREAD_FACTORY=Virtual actually does nothing. What actually triggers the use of virtual threads as you describe above is passing JTREG_TEST_THREAD_FACTORY=Virtual on the "make test" command line.
> ...
> Using JTREG_TEST_THREAD_FACTORY=Virtual will trigger JTREG to set up test.thread.factory=Virtual, which is what some tests look at to determine if they should run in Virtual thread mode.

You are right of course. I went back and looked at the ProcessTools library code and that too, like you note, uses/checks for the presence of the "test.thread.factory" system property (and not "use.JTREG_TEST_THREAD_FACTORY") to decide whether to use a virtual thread to launch the main method:

    ...
    String testThreadFactoryName = System.getProperty("test.thread.factory");
    if (testThreadFactoryName != null) {
        args.addAll(addTestThreadFactoryArgs(testThreadFactoryName, Arrays.asList(command)));
    ...

So it appears that when -Duse.JTREG_TEST_THREAD_FACTORY=Virtual is set, like you note, it ends up being a way to launch the "jtreg" command with the "-testThreadFactory=Virtual" option (and another related option) through "make". It turns out that the jtreg framework, when it notices the "-testThreadFactory" option, launches the JVM running these tests with the system property "-Djtreg.test.thread.factory=Virtual" and also passes along the system property "-Duse.JTREG_TEST_THREAD_FACTORY=Virtual". Notice there that jtreg uses its own specific system property "jtreg.test.thread.factory" (yes, yet another variant). So when the test's JVM is launched, it's launched as:

    java -Djtreg.test.thread.factory=Virtual -Djtreg.test.thread.factory.path=.../jtregTestThreadFactory.jar ... -Duse.JTREG_TEST_THREAD_FACTORY=Virtual ... com.sun.javatest.regtest.agent.MainWrapper ....

(the above command snippet is taken from the rerun command of the test we have been investigating)

Notice that nowhere in this launch command is the "test.thread.factory" system property set. It turns out this "test.thread.factory" system property is only set dynamically at runtime within the "Virtual" thread factory class https://github.com/openjdk/jdk/blob/master/test/jtreg_test_thread_factory/src/share/classes/Virtual.java#L28:

    static {
        // This property is used by ProcessTools and some tests
        try {
            System.setProperty("test.thread.factory", "Virtual");
        } catch (Throwable t) {
            // might be thrown by security manager
        }
    }

It's odd that we have 3 different system properties for the same thing. Thank you for correcting me there, I wouldn't even have noticed this detail, had you not mentioned it.
09-07-2024

Still haven't read through it all, but I noted you said:

> For the rest of this analysis we won't get into the class' bytecode transformation, because it plays no role in this issue. Neither does the use of Instrumentation.setNativeMethodPrefix(...) calls. What's necessary is the presence of ClassFileTransformer(s) with merely a System.out.println(...) call in their transform() method, which then return null (implying no transformation to the class).

And then later:

> java.lang.ClassCircularityError: jdk/internal/event/Event

Is this the same issue described in JDK-8164165 (I see now that David already referenced this CR), and also recently brought up in JDK-8335619: Add an @apiNote to j.l.i.ClassFileTransformer to warn about recursive class loading and ClassCircularityErrors, for which there is a PR out?
08-07-2024

I haven't read through all of this yet, but just wanted to make one correction to something you wrote early on:

> The use of -Duse.JTREG_TEST_THREAD_FACTORY=Virtual is indeed crucial and in fact is necessary and plays a crucial role in reproducing this issue. This system property is recognized by both jtreg framework as well as the previously noted JDK test library's jdk.test.lib.process.ProcessTools.executeTestJava(...). What this system property instructs the ProcessTools is that when a java application is launched, the java application's main class' main() method must be invoked through a virtual thread (instead of the usual platform thread).

This isn't quite how it works. -Duse.JTREG_TEST_THREAD_FACTORY=Virtual actually does nothing. What actually triggers the use of virtual threads as you describe above is passing JTREG_TEST_THREAD_FACTORY=Virtual on the "make test" command line. However, doing this gives no indication in the mach5 "JVM args" that we are doing this type of virtual threads testing, so mach5 will add -Duse.JTREG_TEST_THREAD_FACTORY=Virtual to "JVM args" purely for the sake of the user looking at "JVM args". Using JTREG_TEST_THREAD_FACTORY=Virtual will trigger JTREG to set up test.thread.factory=Virtual, which is what some tests look at to determine if they should run in Virtual thread mode.
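For reference, running a test in this mode from the JDK build therefore looks something like the following (the test selector is only illustrative):

    make test JTREG_TEST_THREAD_FACTORY=Virtual TEST="test/jdk/java/lang/instrument/NativeMethodPrefixApp.java"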
08-07-2024

I think that the main result here is that j.u.c doesn't expect (and so doesn't correctly deal with) an exception thrown from LockSupport.park. That may require addressing. The point about j.u.c locks loading more classes in a way that could interfere with transformation in some way, or the fact that transformation could lead to System.out.println calling itself recursively, are just a fact of life for class transformation: it is sensitive to the implementation details of the code it transforms. If a problem doesn't happen here, it would happen elsewhere. Having said that, because println is commonly used in debugging any kind of code, perhaps it should guard against a recursive invocation with a ScopedValue.
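To illustrate the idea (a minimal sketch only, with assumed names; this is not the actual java.io.PrintStream code, and whether re-entrant calls should be dropped, buffered or handled differently is an open question):

    // Illustrative reentrancy guard of the kind suggested above.
    final class GuardedOut {
        private static final ScopedValue<Boolean> IN_PRINTLN = ScopedValue.newInstance();

        static void println(String s) {
            if (IN_PRINTLN.isBound()) {
                return;   // println re-entered on this thread (e.g. from a transformer): skip
            }
            ScopedValue.where(IN_PRINTLN, Boolean.TRUE)
                       .run(() -> System.out.println(s));   // the normal locked write path
        }
    }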
08-07-2024

A couple of things to add. PrintStream (and other java.io classes) using ReentrantLock is temporary. The default setting for jdk.io.useMonitors has been flipped in the loom repo so that the java.io classes are using monitors again. This means fewer places where instrumentation may be inserted and probably fewer classes loaded, so less chance for a ClassFileTransformer to run arbitrary code at critical times (in this case, calling the transformer means calling System.out while in "critical code"). In main line, parking while pinned attempts to record a JFR event in parkOnCarrierThread. This is changed in the loom repo so that the JFR event is recorded in the VM. The class VirtualThreadPinnedEvent is removed. This removes the class loading when the first virtual thread is pinned, although there is binding of a native method that could cause other code to execute. There are a few other robustness changes in the loom repo but they are more in the OOME scenario, which isn't the case here.
08-07-2024

> However, in this case, when it failed to park on the carrier thread, the VirtualThread state after being moved from PARKING to RUNNING, stayed at RUNNING. Whether or not that needs changes is something that might need to be looked into.

I don't think this should be changed. If loading VirtualThreadPinnedEvent fails then the non-OOME error will be thrown while in the RUNNING state (it would be very broken to throw in the PARKING state). For main line then we could preload VirtualThreadPinnedEvent, a temporary thing until the changes in the loom repo get to main line.
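Purely as an illustration of that kind of preloading (an assumption about how it could be done, not an actual patch), the point would be to force the event class to be loaded early so that the first pinned park does not have to load it lazily:

    // Illustrative only: eagerly load the JFR event class during startup so that
    // parkOnCarrierThread() never triggers this class load (and hence never calls
    // into a ClassFileTransformer) at a critical time.
    final class PinnedEventPreloader {
        static void preload() {
            try {
                Class.forName("jdk.internal.event.VirtualThreadPinnedEvent");
            } catch (ClassNotFoundException e) {
                // nothing to preload
            }
        }
    }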
08-07-2024

As things stand, AQS does retry on OOME but may need some robustness changes to deal with other linkage errors. Back in JDK 5/6, there were several reports of deadlocks and ClassCircularityError due to agents instrumenting core classes and/or running arbitrary code in the ClassFileTransformer. For virtual threads then it's very risky to instrument the parking code or to execute code in the ClassFileTransformer.transform method as it might consume the parking permit and/or cause recursive parking. As noted, VirtualThreadPinnedEvent has been removed in the loom repo so this at least removes the transform of that class. Ron's suggestion to prevent the PrintStreams for System.out/err from nesting might help a bit but there is still a lot of scope to shoot oneself in the foot by instrumenting core classes or having a transform method run arbitrary code when called to transform classes in java.base.
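As a hedged illustration of the defensive style this implies for agent authors (nothing the JDK enforces; the class name and the exact filter are illustrative), a transformer can simply stay out of the way while bootstrap and java.base classes are being defined, and avoid doing any I/O or class-loading work on those paths:

    import java.lang.instrument.ClassFileTransformer;
    import java.security.ProtectionDomain;

    class CautiousTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(Module module, ClassLoader loader, String className,
                                Class<?> classBeingRedefined, ProtectionDomain pd,
                                byte[] classfileBuffer) {
            // Skip bootstrap classes (loader == null) and JDK internals: even a println
            // here can trigger class loading while core library locks are held.
            if (loader == null || className == null
                    || className.startsWith("java/") || className.startsWith("jdk/")) {
                return null;   // no transformation
            }
            // ... transform application classes only ...
            return null;
        }
    }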
08-07-2024

Hello Ron,

> I think that the main result here is that j.u.c doesn't expect (and so doesn't correctly deal with) an exception thrown from LockSupport.park.

That's correct. Additionally [~rpressler], we might also have to take a look at the state transitions in the VirtualThread class. The code comment in that class has this:

    RUNNING -> PARKING   // Thread parking with LockSupport.park
    PARKING -> PARKED    // cont.yield successful, parked indefinitely
    PARKING -> PINNED    // cont.yield failed, parked indefinitely on carrier
    PARKED -> UNPARKED   // unparked, may be scheduled to continue
    PINNED -> RUNNING    // unparked, continue execution on same carrier
    UNPARKED -> RUNNING  // continue execution after park

However, in this case, when it failed to park on the carrier thread, the VirtualThread state after being moved from PARKING to RUNNING, stayed at RUNNING. Whether or not that needs changes is something that might need to be looked into.
08-07-2024

Our goal now is to try and understand what happened to the virtual thread after it was added in the internal queue of AbstractQueuedSynchronizer while waiting to acquire the ReentrantLock. This specific part took me several days to narrow down - my lack of knowledge of the virtual thread implementation, plus the fact that System.out/System.err messages couldn't be used to introduce debug logs, plus the fact that it sometimes takes 1000 odd runs to reproduce this. Also because trying to capture any kind of data at the "right moment" would end up either not reproducing the issue or introducing some other problem (sometimes a JVM crash) due to the changes to the core layer as well as ClassFileTransformer(s) being in play. I had to keep aside this investigation for a day just to devise a way in my mind to try and capture relevant data at the right moment for this virtual thread. I managed to come up with something (tricky but it did its job) and I managed to capture relevant details at the right moments of that virtual thread. Each virtual thread instance maintains a "state" and it goes through a state transition from being "NEW" to "TERMINATED". My goal was to capture this state transition sequence for our virtual thread, plus any contextual detail of why and where that state transition was happening. I also wanted to track the park() and unpark() of the virtual thread to see if something is going wrong there. When the reproducer hangs, the data that this experiment collected showed the following (interesting) state transition sequence for the virtual thread that we were interested in (the one which had disappeared):

- started off with the STARTED state
- transitioned to the RUNNING state
- then transitioned to the PARKING state
- then transitioned back to the RUNNING state
- then transitioned to the TERMINATED state

The first STARTED and RUNNING states are understandable. A virtual thread is transitioned to the PARKING state when some code calls LockSupport.park() for that thread. So the state transition to PARKING matches our code flow in AbstractQueuedSynchronizer which first creates a Node instance (for the virtual thread) and adds it to the queue and then calls LockSupport.park() to park the (virtual) thread. What's surprising is that the virtual thread transitioned to the RUNNING state after the PARKING state. Looking at the code in java.lang.VirtualThread.park(), the implementation first sets the state to PARKING and then invokes yieldContinuation() to yield the continuation corresponding to this virtual thread. The yieldContinuation() method returns a boolean indicating whether or not the continuation was yielded. A continuation may not yield if the thread is pinned - specifically because it's in a critical section, has a native frame, or is holding a monitor. When yieldContinuation() returns false, implying the thread didn't yield, then the implementation of VirtualThread.park() first switches the state back to RUNNING and then switches the state to either TIMED_PINNED or PINNED (in the context of this test, the state would be PINNED) and then parks the carrier thread of this virtual thread (thus suspending the carrier thread). We do notice that the state transitioned to RUNNING after PARKING, which implies the continuation didn't yield. But the state never transitioned to PINNED nor was the carrier thread parked (the thread dump doesn't show any carrier thread from the ForkJoinPool to be around). So for whatever reason, the parking of the virtual thread itself failed. This is a surprise. 
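To make the flow just described easier to follow, here is a toy model of the parking path (this is NOT the real java.lang.VirtualThread source; the states and method names are approximations used only to illustrate the sequence of events):

    // Toy model of the parking path described above.
    enum VtState { RUNNING, PARKING, PARKED, PINNED }

    final class ParkFlowSketch {
        VtState state = VtState.RUNNING;

        void park() {
            state = VtState.PARKING;
            boolean yielded = yieldContinuation();   // try to unmount from the carrier thread
            if (!yielded) {
                state = VtState.RUNNING;             // could not yield: the thread is pinned
                parkOnCarrierThread();               // expected to record the JFR "pinned" event,
                                                     // move to PINNED and park the carrier thread
            }
        }

        boolean yieldContinuation() { return false; }            // pinned in the scenario discussed here
        void parkOnCarrierThread() { state = VtState.PINNED; }   // this is the step that goes wrong
    }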
Do also note that the virtual thread later transitioned to TERMINATED from RUNNING. That does match our observation that the virtual thread has disappeared. We will come back to the TERMINATED state a bit later, but for now, what's interesting is that the parking of the virtual thread itself has failed and gone wrong for some reason. After the call to park(), the virtual thread hasn't parked and now has its state as RUNNING. The implications of this is that our previous theory that the virtual thread was successfully parked by the AbstractQueuedSynchronizer while waiting for a lock and might have failed during unpark no longer holds. So we go back to the code in AbstractQueuedSynchronizer's acquire and the crucial detail there is that in its implementation it first creates the Node corresponding to the (about to wait) (virtual) thread and adds it to the queue with status = 1 and waiter set to the (virtual) thread. Only after that it proceeds to call LockSupport.park() on that (virtual) thread to make it wait. So we now have to understand what happens if LockSupport.park() fails and the virtual thread isn't actually parked. Do remember that the AbstractQueuedSynchronizer's queue state that we previously saw was of the form: (next) node N1 = [status = 0, waiter = VirtualThread named "main"], node N2 = [status = 1, waiter = "old-m-a-i-n" platform thread] which means that after the node for the virtual thread was added into the queue, the platform thread during its unlock did reset the status to 0 for this node and in its signalNext() did indeed invoke LockSupport.unpark() on the virtual thread. It's a different matter what it means to call unpark() if the corresponding virtual thread wasn't previously successfully parked. So our next goal is to understand what caused the LockSupport.park() on the virtual thread to not succeed and did that call just return back or did it fail with some exception and where/how did that call exit out from. Also what happened when the platform thread later invoked a LockSupport.unpark() on the (non-parked) virtual thread, when it found that thread in the queue of the AbstractQueuedSynchronizer. I ran a few more additional runs this time with a goal of capturing some stacktrace from VirtualThread.park() and some details from the implementation of VirtualThread.unpark(), in addition to the state capturing that I had previously. The stacktrace capturing turned out trickier than I expected. For virtual threads, some stackframes are ommitted, for example from the output of Thread.dumpStack(). The first few captured stacktrace raised more questions than the answers I was looking for because of these missing frames (and of course lack of prior knowledge of how virtual threads are implemented). After going through the internal code, I figured out that I had to use the "-XX:+ShowCarrierFrames" and "-XX:+ShowHiddenFrames" JVM options to capture all relevant frames that I was interested in. That was a big help and helped understand what's going on. Before going to the stacktrace from VirtualThread.park(), let's first see what VirtualThread.unpark() did when it was invoked on the virtual thread which was supposed to be parked but wasn't (and instead was in RUNNING state due to the yield not succeeding). The implementation in VirtualThread.unpark() is almost a no-op when the virtual thread isn't in a state that it expects for a parked virtual thread to be in. RUNNING state is one such state where VirtualThread.unpark is almost a no-op and almost does nothing and returns back. 
The data I collected from this method proved that. Note that I keep saying "almost" because VirtualThread.unpark() does do a certain thing even when the state is RUNNING, but that isn't relevant in the context of this issue. The fact that VirtualThread.unpark() is a no-op gets us closer to understanding this hang. What this means is that when the platform thread, during the unlock() on the lock, noticed the virtual thread in the AbstractQueuedSynchronizer's queue of waiters, it first set the status of the Node N1 to 0 on the corresponding node and then signalled the virtual thread through LockSupport.unpark(). LockSupport.unpark() then internally called VirtualThread.unpark() which ended up being a no-op for that virtual thread. That effectively meant that this virtual thread which was expected to be responsible for removing that node N1 from the queue never did anything to remove that node. This matches with what we see in the queue state of AbstractQueuedSynchronizer. It now finally comes down to understanding what happened after AbstractQueuedSynchronizer.acquire() added a node N1 for this virtual thread in the queue and the subsequent call to LockSupport.park() didn't go as planned. The stacktrace that I collected for that call, reveals the issue (ignore the line numbers in some of those classes, they won't match mainline because of some additional code I had to add): java.lang.ClassCircularityError: jdk/internal/event/Event at java.base/java.lang.VirtualThread.parkOnCarrierThread(VirtualThread.java:701) at java.base/java.lang.VirtualThread.park(VirtualThread.java:637) at java.base/java.lang.System$2.parkVirtualThread(System.java:2735) at java.base/jdk.internal.misc.VirtualThreads.park(VirtualThreads.java:54) at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:219) at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754) at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990) at java.base/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:154) at java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:323) at java.base/jdk.internal.misc.InternalLock.lock(InternalLock.java:74) at java.base/java.io.PrintStream.writeln(PrintStream.java:826) at java.base/java.io.PrintStream.println(PrintStream.java:1170) at Main$Agent$NoTransformationTransformer.transform(Main.java:78) at java.instrument/java.lang.instrument.ClassFileTransformer.transform(ClassFileTransformer.java:242) at java.instrument/sun.instrument.TransformerManager.transform(TransformerManager.java:188) at java.instrument/sun.instrument.InstrumentationImpl.transform(InstrumentationImpl.java:610) at java.base/java.lang.VirtualThread.run(VirtualThread.java:337) at java.base/java.lang.VirtualThread$VThreadContinuation$1.run(VirtualThread.java:223) at java.base/jdk.internal.vm.Continuation.enter0(Continuation.java:324) at java.base/jdk.internal.vm.Continuation.enter(Continuation.java:316) at java.base/jdk.internal.vm.Continuation.enterSpecial(Native Method) at java.base/jdk.internal.vm.Continuation.run(Continuation.java:252) at java.base/java.lang.VirtualThread.runContinuation(VirtualThread.java:259) at java.base/java.lang.VirtualThread$$Lambda/0x000000fe0104c1c0.run(Unknown Source) at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.compute(ForkJoinTask.java:1726) at 
java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.compute(ForkJoinTask.java:1717) at java.base/java.util.concurrent.ForkJoinTask$InterruptibleTask.exec(ForkJoinTask.java:1641) at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:507) at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1458) at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2034) at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:189) Notice that the virtual thread started to run and reached java.lang.VirtualThread.run(VirtualThread.java:337). That line in my modified version of code corresponds to: if (VirtualThreadStartEvent.isTurnedOn()) { which triggered the classload of jdk.internal.event.VirtualThreadStartEvent. Since the reproducer (and the actual test) have been configured with ClassFileTransformer(s) (through the javaagent) it triggered our code in the test/reproducer specific transform() method at Main$Agent$NoTransformationTransformer.transform(Main.java:78) which as previously noted calls System.out.println(...). So that call then leads to the AbstractQueuedSynchronizer.acquire where we first add the Node to the queue and then attempt to park using LockSupport.park(). The stacktrace shows that part: at java.base/java.lang.VirtualThread.park(VirtualThread.java:637) at java.base/java.lang.System$2.parkVirtualThread(System.java:2735) at java.base/jdk.internal.misc.VirtualThreads.park(VirtualThreads.java:54) at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:219) at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754) As noted previously, the VirtualThread.park() first attempted to yield the continuation but the continuation didn't yield. Captured data shows that the reason why it couldn't yield and yieldContinuation() returned false was because it considered the virtual thread as pinned due to the call stack having a NATIVE frame. That matches with the captured stacktrace which shows Continuation.enterSpecial(Native Method) as one of the call frame. Since yield didn't succeeed, the VirtualThread.park() implementation then switched the state to RUNNING and attempted to park the carrier thread. These stack frames in that stacktrace match that (ignore the line numbers): at java.base/java.lang.VirtualThread.parkOnCarrierThread(VirtualThread.java:701) at java.base/java.lang.VirtualThread.park(VirtualThread.java:637) Turns out when trying to park on carrier thread, the virtual thread runs into java.lang.ClassCircularityError for jdk/internal/event/Event class: java.lang.ClassCircularityError: jdk/internal/event/Event at java.base/java.lang.VirtualThread.parkOnCarrierThread(VirtualThread.java:701) at java.base/java.lang.VirtualThread.park(VirtualThread.java:637) Line 701 of VirtualThread.parkOnCarrierThread (in my modified version) corresponds to: event = new VirtualThreadPinnedEvent(); so while trying to park on the carrier thread after being unable to yield the continuation, the virtual thread triggered the classload of jdk.internal.event.VirtualThreadPinnedEvent. Remember that this classload of VirtualThreadPinnedEvent happened in the same call sequence as the classload and transform of VirtualThreadStartEvent, from within the ClassFileTransformer's transform() method. Both these classes extend the jdk.internal.event.Event class against which the ClassCircularityError is being reported. 
This effectively meant that the VirtualThread.park() and thus LockSupport.park() failed exceptionally when it was called from AbstractQueuedSynchronizer.acquire() after it had internally added this as a waiting Node to its queue. The AbstractQueuedSynchronizer.acquire() has no catch blocks around this parking code and that exception effectively propagated all the way out of the call to System.out.println(...) that was invoked from our ClassFileTransformer's transform() method. But where did that exception disappear to? We neither see it in the logs, nor does the application fail or exit. We will come back to this later. The important detail now is that the ReentrantLock used by System.out has become corrupt because the queue state maintained in AbstractQueuedSynchronizer is invalid, since after adding the waiting Node to the queue, the park() of the thread failed exceptionally and the code flow exited the acquire() method. That effectively means that the removal of the node, which was expected to happen when the thread was woken up within the acquire() method, will never run. That explains the captured state of the queue and the hang.

So that brings us to the conclusion of what's causing these hangs - a platform thread and a virtual thread compete for a ReentrantLock concurrently. The platform thread acquires the lock. A Node representing the virtual thread gets added into a waiting queue maintained by the ReentrantLock, through AbstractQueuedSynchronizer. After adding to the queue, LockSupport.park() is called on the virtual thread, which then results in the virtual thread yield not succeeding because of a native frame on the call stack. That then results in an attempt to park the carrier thread of the virtual thread, which fails exceptionally. The exceptionally failing call to LockSupport.park() causes corruption of the queue in AbstractQueuedSynchronizer. The platform thread which had won the race to acquire the ReentrantLock subsequently calls unlock(), and that changes the queue state of the virtual thread's Node and the platform thread then attempts to unpark the virtual thread, but that ends up being a no-op because the virtual thread was never parked successfully in the first place. The platform thread goes about its business and a subsequent call to System.out.println(...) leads it back to lock acquisition, but due to the queue corruption in AbstractQueuedSynchronizer, it never acquires the lock and hangs forever.

So what happened to the virtual thread which switched to the RUNNING state and later to TERMINATED? And also what happened to the java.lang.ClassCircularityError: jdk/internal/event/Event exception that was thrown? These test timeouts never reported that exception. To answer the latter question about the exception that isn't reported, let's focus on these call frames in that above stacktrace:

    at java.base/java.io.PrintStream.println(PrintStream.java:1170)
    at Main$Agent$NoTransformationTransformer.transform(Main.java:78)
    at java.instrument/java.lang.instrument.ClassFileTransformer.transform(ClassFileTransformer.java:242)
    at java.instrument/sun.instrument.TransformerManager.transform(TransformerManager.java:188)
    at java.instrument/sun.instrument.InstrumentationImpl.transform(InstrumentationImpl.java:610)
    at java.base/java.lang.VirtualThread.run(VirtualThread.java:337)

So the classload of VirtualThreadStartEvent has triggered our transformer code which issues a System.out.println. 
Our transformer code is called from sun.instrument.TransformerManager.transform(TransformerManager.java:188). Very interestingly, that line in sun.instrument.TransformerManager.transform() does this:

    try {
        transformedBytes = transformer.transform(module,
                                                 loader,
                                                 classname,
                                                 classBeingRedefined,
                                                 protectionDomain,
                                                 bufferToUse);
    } catch (Throwable t) {
        // don't let any one transformer mess it up for the others.
        // This is where we need to put some logging. What should go here? FIXME
    }

So that code in sun.instrument.TransformerManager calls the reproducer/test's ClassFileTransformer.transform() and swallows any error/exception thrown from that call. In this case it ate up our java.lang.ClassCircularityError and continued as if no exception was thrown. It even has a FIXME in that code (that code comment is from the actual code). This exception being eaten up then meant that the code in java.lang.VirtualThread.run(VirtualThread.java:337), which triggered the classload of VirtualThreadStartEvent, returned from the classload and continued with the next lines of code. At this point, remember the state of the virtual thread is RUNNING (but that's just due to accidental and corrupt reasons). The next lines of code in the java.lang.VirtualThread.run() implementation involve invoking the actual task that is assigned to the virtual thread, which in this case is running the test application's main() method. It does that correctly: the main() method gets invoked on the virtual thread and then completes normally - remember the main method of the test and the reproducer merely runs some arbitrary code and checks if any exception was thrown. Since the java.lang.ClassCircularityError that was thrown got eaten up in sun.instrument.TransformerManager, the main() method never sees it and thus completes normally. That then transitions the virtual thread's state from RUNNING to TERMINATED. That matches the captured state transitions of this virtual thread. That also means the virtual thread exits normally and thus disappears (even its carrier thread after a while) from the thread dumps, thus explaining their absence.

We now understand what's causing these timeouts. The next question is how do we fix the test case. Fixing the test I think is relatively easy, but I wouldn't want to get to that yet. What this test has exposed is several areas that might need inputs from people familiar with those areas, to understand if anything needs to be fixed/changed/addressed in those parts as well. Here are a few questions I have:

- java.util.concurrent.locks.LockSupport.park(), is it allowed/expected to throw any kind of exception/errors? The API specification doesn't note any, neither for platform threads nor for virtual threads.
- java.util.concurrent.locks.LockSupport.unpark(), is it allowed/expected to throw any exception/error? In the context of this issue, it isn't throwing any exception, but since we are dealing with this area, it would be good to understand the semantics.
- Should AbstractQueuedSynchronizer do something to prevent this situation which leads to queue corruption, or at least identify this queue corruption if it happens? I don't think it can do anything here given what's happened, but it would be good to hear from Viktor [~vklang], Alan [~alanb] and others. Maybe the clarifications on the semantics of LockSupport.park()/unpark() might help decide if anything should be done. 
- There are some places in VirtualThread.run(), parkOnCarrierThread() and such which have a defensive mechanism to prevent JFR events from interfering with the method's actual implementation - mechanisms like catching OutOfMemoryError when constructing the JFR event. Should that be broadened to address issues like this one, where a ClassCircularityError (of course, not specifically adding a catch clause for this error) when loading a JFR event class causes the VirtualThread's state transition to go wrong? Note that the code comments of state transition in VirtualThread seem to indicate that a transition from PARKING to RUNNING (and eventually staying on RUNNING instead of moving to PINNED) isn't expected to happen, yet it did happen here.
- Should any error/exception be allowed to propagate from the call to parkOnCarrierThread(), without the VirtualThread doing any cleanup or any terminal action?
- Should we address the FIXME in sun.instrument.TransformerManager.transform where the exception is eaten up?
08-07-2024

What we have to narrow down now is why the ReentrantLock acquisition hangs forever. To do that, we'll have to understand the internal implementation of the ReentrantLock. I am including those details here because that internal implementation too is playing a role in what's going on here. I would like to note that I don't have deep knowledge of the code in java.util.concurrent.locks.ReentrantLock and its related classes. It's only during the investigation of this issue that I went through that code, and what I note next is based on that brief understanding of the code. Others familiar with this code can correct me if I misstate any of the following details.

ReentrantLock provides a lock() and unlock() API to acquire and release the lock. The implementation heavily relies on the implementation in the java.util.concurrent.locks.AbstractQueuedSynchronizer class. Lock acquisition through the lock() call ends up in the acquire() method of this AbstractQueuedSynchronizer class. The AbstractQueuedSynchronizer internally maintains an ordered queue with a head and a tail and elements in between. Each element in that ordered queue is represented as a "Node" and each Node keeps track of the "waiter", which corresponds to a java.lang.Thread instance that is waiting to acquire the lock. Each node additionally also has a "status" field which, when set to "1", implies that the waiter Thread tracked by that node has to be "signalled" when the lock has been released by some other thread which was previously owning it. The AbstractQueuedSynchronizer first checks if the lock has been handed out to some other caller. If not, the lock is immediately granted to the caller. If the lock is already being held by some other caller, then the acquire() method first creates a Node instance and adds it to the queue. This Node instance will have its status set to "1" and "waiter" set to the current thread which called the lock() method (and wasn't granted the lock). After the Node has been added to the queue, the caller thread is then made to wait, for the lock to be available, by suspending it through a call to java.util.concurrent.locks.LockSupport.park(). In the context of this issue, it's a non-timed park() which means it will park() the thread until there's an unpark() called for that thread (either before park() was called or after it). So if there's no corresponding LockSupport.unpark() for this specific thread then it will be parked forever.

When a thread which was granted the lock calls unlock(), like in the finally block of the System.out.println(...) implementation, the implementation in AbstractQueuedSynchronizer internally updates its state and also checks the queue to see if there is a next Node which has status=1, i.e. if there's a waiting caller who needs to be "signalled". This it does in the internal "signalNext()" implementation, which checks for the presence of the next Node and, if "status" is set to "1", then it first resets the status to "0" and then calls LockSupport.unpark() on the "waiter" Thread that was tracked by that next Node. It resets the status to "0" to imply that the waiter Thread has been signalled to wake up and shouldn't be signalled anymore. It's important to note that, in this entire flow of unlock(), the next Node isn't removed from the queue - it's merely signalled through the waiter Thread. 
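Schematically, the waiting side of what is described above looks like the following toy model (this is NOT the real AbstractQueuedSynchronizer source; the field and method names are simplified to match the description in this comment):

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.locks.LockSupport;

    final class AcquireSketch {
        static final class Node {
            volatile Thread waiter;
            volatile int status;        // 1 = "signal me on release", 0 = already signalled
        }

        final ConcurrentLinkedQueue<Node> queue = new ConcurrentLinkedQueue<>();

        void waitForLock() {
            Node node = new Node();
            node.waiter = Thread.currentThread();
            node.status = 1;
            queue.add(node);            // the node is now visible to the unlocking thread
            LockSupport.park();         // if this throws, nothing below ever runs ...
            queue.remove(node);         // ... and the node is never removed from the queue
        }
    }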
Upon signalling any next Node through LockSupport.unpark(), the unlock() method returns to the caller and due to the signalling, the other thread will be unparked and it will start concurrently executing code from where it was suspended. Remember that the waiter Thread was suspended in acquire() method of the AbstractQueuedSynchronizer (while waiting for the lock). The implementation of AbstractQueuedSynchronizer is such that it is the responsibility of the unparked thread to remove the Node that corresponds to itself from the queue. The queue state is critical and if the state in the queue goes wrong then it can lead to hangs like the one we are investigating. It took several efforts, some not straightforward ones, to be able to capture the state of the queue at the right moments. Usually it would be trivial to add System.out/System.err.println(...) at interesting places within the JDK code to get these relevant details, but we can't use that here because we are running into an issue with System.out/System.err.println(...) related code and sprinkling them here would just tamper the reproducer. Keeping aside how the queue state was captured, it became clear from the captured data that when this hang happens, the queue will look something like this (in that order): (next) node N1 = [status = 0, waiter = VirtualThread named "main"], node N2 = [status = 1, waiter = "old-m-a-i-n" platform thread] Additionally, the ReentrantLock maintains a "owner" which corresponds to the Thread that has been currently granted the lock. The state of that "owner" when this hang happens showed that "owner" was null, implying no one was owning the lock when the platform thread was hung waiting for the lock. Focusing on the queue state above, what it shows is that the virtual thread is the first one in the queue waiting for the lock acquisition and the platform thread is waiting after the virtual thread for the same lock. The crucial detail here is that the "status" field of the node N1 which corresponds to the virtual thread which is supposed to be granted the lock next, is set to 0. That's a bad sign and that implies that the virtual thread isn't going to be signalled and thus won't be unparked when the lock becomes available. Looking at the code in AbstractQueuedSynchronizer, the only time we can have a Node with status = 0 in the queue is when (as noted previously), during a unlock() call in some other thread, the status of N1 was reset to 0 and then LockSupport.unpark() was called on the virtual thread. That would then have (or should have) woken up the VirtualThread which should then have removed this N1 Node from the queue. But the queue state shows that N1 is still around in the queue and from what we have seen in the thread dump, there's no virtual thread in the thread dump. So it must mean that the virtual thread was probably unparked but for whatever reason it didn't remove its corresponding node N1 from the queue. Looking at the code, that shouldn't have happened. The other important detail of that captured queue state is that, in that ordered queue, after the virtual thread, the next thread waiting in line for the lock is the platform thread. This is understandable because as we saw in the thread dump, the platform thread is indeed parked waiting for the lock. Furthermore, remember that there are only 2 threads in this test which are concurrently accessing the internal ReentrantLock of the System.out instance. 
Given the above queue state, that would then mean that the lock was previously successfully acquired by the platform thread and the virtual thread was made to wait (node N1). After that, the platform thread also would have called unlock(), which would have set status = 0 on the virtual thread's node N1 and invoked LockSupport.unpark() on the virtual thread. And at a later moment the platform thread again called System.out and thus requested to acquire the lock and this time was made to wait and added to the queue as N2, since for whatever reason the unparked virtual thread didn't remove its node N1 from the queue. So that essentially translates to this sequence of events:

- The platform thread calls System.out.println(...) from the test specific ClassFileTransformer's transform() method. It successfully gets the internal lock and is then in the process of writing out the message.
- Concurrently the virtual thread too calls System.out.println(...) from the ClassFileTransformer's transform() method and is put into the wait queue to acquire the lock.
- The platform thread completes writing the message and the System.out.println(...) call internally releases the lock, which then sets status = 0 on the virtual thread's N1 node in the queue and unparks the virtual thread.
- The virtual thread should then have removed its node from the queue, acquired the lock, printed the message and released the lock. For whatever reason, it appears that the virtual thread didn't remove its Node from the queue, maybe didn't even print the message, and just vanished.
- The platform thread concurrently calls another System.out.println(...) from the test specific ClassFileTransformer's transform() method and this time is put into the queue as node N2, after node N1.

The platform thread will continue to remain as node N2 in the queue waiting for the lock forever, because the queue state has gone corrupt since the (virtual) thread which was supposed to remove node N1 upon being signalled is nowhere around. That theory is based on the queue state and the "owner" state and matches the thread dump. So given these details, the next focus will be on understanding what happened to the virtual thread, why it didn't remove the node from the queue, and whether something went wrong during its unpark. I'll add those details as a separate comment next.
08-07-2024

This past week I've gone through all these linked failures and then come up with a trivial reproducer which reproduces this issue locally for me on a macOS M1. It takes several runs (sometimes 1000 odd) to run into the same hang that is seen in these test timeouts, but it does reproduce. I have now narrowed down the issue and understand what's going on. There are multiple things at play in this hang (which eventually causes the timeout):

- Virtual threads internal implementation
- Internal implementation in System.out/System.err
- java.util.concurrent.locks.ReentrantLock internal implementation
- Implementation and semantics of java.util.concurrent.locks.LockSupport
- Usage and implementation of java.lang.instrument.Instrumentation

A combination of a sequence of events in the code paths of all these involved constructs leads to the hang. The root cause (which I'll explain as I go along) is now understandable, but several questions arise on the expectations of some of these APIs and constructs. Before explaining the cause of these hangs, I'll provide a brief overview of what the test does and where it currently hangs.

test/jdk/java/lang/instrument/NativeMethodPrefixApp.java is a test that has been around for a long time (more than a decade). Its goal is to verify the semantics of the java.lang.instrument.Instrumentation.setNativeMethodPrefix(ClassFileTransformer transformer, String prefix) public API. This API was introduced in Java 1.6 more than a decade back as part of https://bugs.openjdk.org/browse/JDK-6263317. At a high level, the Instrumentation.setNativeMethodPrefix() method can be used by application code to attach a prefix to "native" method names of a class. For example, if the class "Foo" has a native method "bar" then the application code (through a javaagent) can call Instrumentation.setNativeMethodPrefix() to set a prefix "blah_" for native methods. The application code, through a java.lang.instrument.ClassFileTransformer, may then add an implementation for a "blah_bar" method on the "Foo" class and that method's implementation can do any application specific logic before calling the native "bar" method. Essentially this allows applications to instrument (native) methods which don't have bytecode associated with them. JDK-6263317 and the CCC linked to that issue have a very good explanation of what this feature is all about and how it is expected to work, so I won't go into that part.

Coming back to the NativeMethodPrefixApp test itself. It has a javaagent which sets up 3 ClassFileTransformer(s), each of which is configured with a unique native method prefix through Instrumentation.setNativeMethodPrefix(...). The ClassFileTransformer implementation in that test then uses the ClassFile API to generate the necessary bytecode for native methods of every class that is transformed. Until recently, the ClassFileTransformer implementation in that test used to use ASM to generate the bytecode. Additionally, for the sake of test execution debuggability, the implementation in the test specific ClassFileTransformer invokes "System.out.println(...)" each time its transform() method gets called (remember this part, we will come back to this later). The test then, in its main() method, has some code which loads arbitrary classes to exercise the ClassFileTransformer and then finally verifies that there are no errors in this code execution. So the test itself is pretty straightforward. 
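To illustrate the API being verified here, a minimal sketch of an agent using it (this is NOT the test's actual agent; the transformer is only a placeholder, and a real agent jar would also need the Can-Set-Native-Method-Prefix attribute in its manifest):

    import java.lang.instrument.ClassFileTransformer;
    import java.lang.instrument.Instrumentation;

    public class ExampleAgent {
        public static void premain(String agentArgs, Instrumentation inst) {
            // In a real agent this transformer would generate a "blah_bar" wrapper
            // method for a native method "bar"; here it is just a placeholder.
            ClassFileTransformer transformer = new ClassFileTransformer() { };
            inst.addTransformer(transformer, /* canRetransform */ true);
            // Ask the VM to also try the "blah_" prefix when resolving native methods
            // of classes transformed by this transformer.
            inst.setNativeMethodPrefix(transformer, "blah_");
        }
    }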
For the rest of this analysis we won't get into the class' bytecode transformation, because it plays no role in this issue. Neither does the use of Instrumentation.setNativeMethodPrefix(...) calls. What's necessary is the presence of ClassFileTransformer(s) with merely a System.out.println(...) call in their transform() method, which then return null (implying no transformation to the class). The local reproducer I have just registers these ClassFileTransformer(s), 3 of them (just to match the test case), and does something like this in the transform() method:

    private static class NoTransformationTransformer implements ClassFileTransformer {

        @Override
        public byte[] transform(final ClassLoader loader, final String className,
                                final Class<?> classBeingRedefined,
                                final ProtectionDomain protectionDomain,
                                final byte[] classfileBuffer) {
            System.out.println(this + " classname " + className);
            return null; // we don't do any transformations
        }
    }

    public static void premain(final String agentArgs, final Instrumentation instrumentation) throws Exception {
        ...
        // register class transformers which don't do any transformation
        instrumentation.addTransformer(new NoTransformationTransformer(), true);
        instrumentation.addTransformer(new NoTransformationTransformer(), false);
        instrumentation.addTransformer(new NoTransformationTransformer(), true);
        ...
    }

As I noted, this test has been around for more than a decade and has been stable. Very recently we did a relatively trivial and unrelated refactoring in this test. Previously, until that refactoring, we used to have a "@run shell" test action which used to generate the jar file with the main class and the javaagent class. That jar was then used when launching this test through a separate "@run main/othervm" action. In the refactoring, we removed the "@run shell" and instead introduced a "@run driver" which generates the same jar with the main class and javaagent and then uses the JDK test library's jdk.test.lib.process.ProcessTools.executeTestJava(...) method to launch the test. The intermittent test timeouts reported in this issue have started happening since that refactoring. For the rest of this discussion, we won't go into the refactoring part itself, since based on what I've narrowed it down to, it plays no role in this issue.

All these test timeouts noticed in our CI jobs have so far been when the test was launched with "-Duse.JTREG_TEST_THREAD_FACTORY=Virtual -XX:+UseZGC" (and any additional arguments). My investigation locally has confirmed that -XX:+UseZGC plays no role in this and I can reproduce this (locally) without -XX:+UseZGC too. So we will leave ZGC out of this discussion. So that leaves us with -Duse.JTREG_TEST_THREAD_FACTORY=Virtual. The use of -Duse.JTREG_TEST_THREAD_FACTORY=Virtual is indeed crucial and in fact is necessary and plays a crucial role in reproducing this issue. This system property is recognized by both jtreg framework as well as the previously noted JDK test library's jdk.test.lib.process.ProcessTools.executeTestJava(...). What this system property instructs the ProcessTools is that when a java application is launched, the java application's main class' main() method must be invoked through a virtual thread (instead of the usual platform thread). jtreg in its internal implementation of "@run main" knows how to do it (and does it internally in its com.sun.javatest.regtest.agent.MainWrapper class). 
We won't go into the jtreg part in this discussion, since we aren't using that here (remember we use "@run driver" in this test) and instead we use the ProcessTools.executeTestJava() method to launch the java application as a new Process. The implementation of ProcessTools.executeTestJava() internally launches a new "java" process whose main class is "jdk.test.lib.process.ProcessTools" (and has a main() method). ProcessTools.main() merely acts as a tiny wrapper around the actual application's main class' main() method. When this wrapper ProcessTools.main() notices the presence of -Duse.JTREG_TEST_THREAD_FACTORY=Virtual, in order to launch the target application's main() method it first creates a virtual thread and uses a Runnable which through reflection calls the actual application's main() method. The ProcessTools.main() then start()s the virtual thread and invokes (virtual)Thread.join() to wait for the completion of the application's main method. In the context of this test, the application's main method is the NativeMethodPrefixApp main method (which, as previously noted, has some trivial arbitrary code to exercise the ClassFileTransformer(s) that have been registered through the Instrumentation class for that java process). So ProcessTools.main() essentially does this (only the relevant snippet):

    if (testThreadFactoryName.equals("Virtual")) {
        ...
        Thread vthread = Thread.ofVirtual().unstarted(() -> {
            try {
                mainMethod.invoke(null, new Object[] { classArgs });
            } catch (InvocationTargetException e) {
                ...
            } catch (Throwable error) {
                ...
            }
        });
        Thread.currentThread().setName(OLD_MAIN_THREAD_NAME); // OLD_MAIN_THREAD_NAME = "old-m-a-i-n"
        vthread.setName("main");
        vthread.start();
        vthread.join();

So when the test is executing in this java process, there are 2 threads of interest here - one the platform thread which has now been named "old-m-a-i-n" and the other a virtual thread created and started by this platform thread and named "main". The platform thread waits on the virtual thread to complete. The virtual thread executes the "main()" method of the test whereas the platform thread merely waits for that virtual thread to complete running the test's main() method. However, since the java process in which both these threads are running was launched with a "-javaagent:...", execution of any code that leads to classloading in any of these 2 threads will execute the code in the 3 test specific ClassFileTransformer instances that we previously noted. As already noted, the code in the ClassFileTransformer (among other things) has a "System.out.println(...)" call. Effectively, this means that although the "old-m-a-i-n" platform thread won't be doing anything more than waiting on the virtual thread, any classloading within those calls like "vthread.start()" and "vthread.join()" can make its way into the test specific ClassFileTransformer's "System.out.println(...)" calls. For the virtual thread which is running the test's main() method, it's much more natural that any code in the test's main() method will make its way into the test specific ClassFileTransformer's "System.out.println(...)" calls (in fact, that's the whole point of the test - to make sure those transformers are exercised). What this brings us to is the fact that a platform thread and a virtual thread can concurrently end up in (the same or a different instance of) the test specific ClassFileTransformer's transform() method, which has calls to "System.out.println(...)". 
It's important to remember this part since this is one factor (of the 5 that we noted at the start) that contributes to this issue. Coming to the System.out.printXXX(...) implementation (and its corresponding System.err.printXXX()) - it internally uses an instance of java.util.concurrent.locks.ReentrantLock (System.out and System.err have a separate instance of ReentrantLock each) to serialize concurrent calls to printXXX methods. System.out/err can be configured (explicitly) to use monitors instead of ReentrantLock for the serializing the calls, but that's not relevant in the current discussion. Essentially the implementation in System.out.println(...) boils down to this (snippet): lock.lock(); // "lock" ultimately ends up being a java.util.concurrent.locks.ReentrantLock try { ... // write out the content } finally { lock.unlock(); } So every call to System.out.println(...) will end up first acquiring a lock, write the content and finally releasing the lock. Pretty basic. In the context of this issue, that translates to the "old-m-a-i-n" platform thread and the virtual thread concurrently trying to acquire a lock, print a message upon successfully acquiring the lock and then release the lock. Let's now take a look at a thread dump for one such timed out test: { "tid": "1", "name": "old-m-a-i-n", "stack": [ "java.base\/jdk.internal.misc.Unsafe.park(Native Method)", "java.base\/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)", "java.base\/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)", "java.base\/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)", "java.base\/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:154)", "java.base\/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:323)", "java.base\/jdk.internal.misc.InternalLock.lock(InternalLock.java:74)", "java.base\/java.io.PrintStream.writeln(PrintStream.java:826)", "java.base\/java.io.PrintStream.println(PrintStream.java:1170)", "NativeMethodPrefixAgent$Tr.transform(NativeMethodPrefixAgent.java:65)", "java.instrument\/java.lang.instrument.ClassFileTransformer.transform(ClassFileTransformer.java:242)", "java.instrument\/sun.instrument.TransformerManager.transform(TransformerManager.java:188)", "java.instrument\/sun.instrument.InstrumentationImpl.transform(InstrumentationImpl.java:610)", "java.base\/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:734)", "java.base\/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1099)", "java.base\/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:230)", "java.base\/java.lang.VirtualThread.joinNanos(VirtualThread.java:848)", "java.base\/java.lang.Thread.join(Thread.java:2000)", "java.base\/java.lang.Thread.join(Thread.java:2093)", "jdk.test.lib.process.ProcessTools.main(ProcessTools.java:1009)" ] }, { "tid": "23", "name": "Reference Handler", "stack": [ "java.base\/java.lang.ref.Reference.waitForReferencePendingList(Native Method)", "java.base\/java.lang.ref.Reference.processPendingReferences(Reference.java:246)", "java.base\/java.lang.ref.Reference$ReferenceHandler.run(Reference.java:208)" ] }, { "tid": "24", "name": "Finalizer", "stack": [ "java.base\/java.lang.Object.wait0(Native Method)", "java.base\/java.lang.Object.wait(Object.java:374)", "java.base\/java.lang.Object.wait(Object.java:348)", 
"java.base\/java.lang.ref.NativeReferenceQueue.await(NativeReferenceQueue.java:48)", "java.base\/java.lang.ref.ReferenceQueue.remove0(ReferenceQueue.java:166)", "java.base\/java.lang.ref.NativeReferenceQueue.remove(NativeReferenceQueue.java:89)", "java.base\/java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:173)" ] }, { "tid": "25", "name": "Signal Dispatcher", "stack": [ ] }, { "tid": "32", "name": "Common-Cleaner", "stack": [ "java.base\/jdk.internal.misc.Unsafe.park(Native Method)", "java.base\/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:269)", "java.base\/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1852)", "java.base\/java.lang.ref.ReferenceQueue.await(ReferenceQueue.java:79)", "java.base\/java.lang.ref.ReferenceQueue.remove0(ReferenceQueue.java:151)", "java.base\/java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:229)", "java.base\/jdk.internal.ref.CleanerImpl.run(CleanerImpl.java:140)", "java.base\/java.lang.Thread.run(Thread.java:1575)", "java.base\/jdk.internal.misc.InnocuousThread.run(InnocuousThread.java:186)" ] }, { "tid": "33", "name": "Notification Thread", "stack": [ ] }, { "tid": "39", "name": "Attach Listener", "stack": [ "java.base\/java.lang.Thread.getStackTrace(Thread.java:2389)", "java.base\/jdk.internal.vm.ThreadDumper.dumpThreadToJson(ThreadDumper.java:270)", "java.base\/jdk.internal.vm.ThreadDumper.dumpThreadsToJson(ThreadDumper.java:242)", "java.base\/jdk.internal.vm.ThreadDumper.dumpThreadsToJson(ThreadDumper.java:206)", "java.base\/jdk.internal.vm.ThreadDumper.dumpThreadsToFile(ThreadDumper.java:117)", "java.base\/jdk.internal.vm.ThreadDumper.dumpThreadsToJson(ThreadDumper.java:85)" ] } ], "threadCount": "7" The important detail in that thread dump is: "name": "old-m-a-i-n", "stack": [ "java.base\/jdk.internal.misc.Unsafe.park(Native Method)", "java.base\/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)", "java.base\/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)", "java.base\/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)", "java.base\/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:154)", "java.base\/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:323)", "java.base\/jdk.internal.misc.InternalLock.lock(InternalLock.java:74)", "java.base\/java.io.PrintStream.writeln(PrintStream.java:826)", "java.base\/java.io.PrintStream.println(PrintStream.java:1170)", "NativeMethodPrefixAgent$Tr.transform(NativeMethodPrefixAgent.java:65)", "java.instrument\/java.lang.instrument.ClassFileTransformer.transform(ClassFileTransformer.java:242)", "java.instrument\/sun.instrument.TransformerManager.transform(TransformerManager.java:188)", "java.instrument\/sun.instrument.InstrumentationImpl.transform(InstrumentationImpl.java:610)", "java.base\/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:734)", "java.base\/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1099)", "java.base\/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:230)", "java.base\/java.lang.VirtualThread.joinNanos(VirtualThread.java:848)", "java.base\/java.lang.Thread.join(Thread.java:2000)", "java.base\/java.lang.Thread.join(Thread.java:2093)", "jdk.test.lib.process.ProcessTools.main(ProcessTools.java:1009)" and the other important detail is that 
there's no sign of the virtual thread (named "main") that was supposed to be executing the test's main method and which this platform thread is waiting on through its call to (virtual)Thread.join() (the last frame in the stacktrace above: jdk.test.lib.process.ProcessTools.main(ProcessTools.java:1009)). For each of the test timeouts reported in this JBS issue, you will find this same pattern in the thread dumps. In this investigation I've used the output of the "jcmd Thread.dump_to_file" command to analyze the thread dumps, so the absence of the virtual thread in the thread dump isn't due to using a tool which isn't capable of listing virtual threads.

So what this thread stack of the "old-m-a-i-n" platform thread tells us is that, when ProcessTools.main invoked (virtual)Thread.join(), it internally (through the CountDownLatch.await and then the AbstractQueuedSynchronizer.acquire calls) led to a classload (of some class), which triggered the test specific ClassFileTransformer's transform() method, which in turn called System.out.println and that led to a call to ReentrantLock.lock(). You will notice that the call to ReentrantLock.lock() has parked the "old-m-a-i-n" platform thread, which implies that some other thread has acquired that ReentrantLock and hasn't released it; the platform thread stays parked until the lock is released, at which point it would be unparked and could move forward with its execution. From what we are seeing in these test timeouts, that never happens - the other thread doesn't appear to have ever released the lock.

Typically, you would consider this to be a programming error where some code didn't call unlock() on the lock that it acquired. In the context of this issue though, the lock belongs to System.out (a PrintStream instance), is an internal detail of that class, and is never handed out in any way for it to be abused. Plus, we have already seen that unlock() is correctly called in a finally block. Plus, this test isn't susceptible to StackOverflowError or OutOfMemoryError issues (I verified that). So the simple explanation - some other thread invoked a System.out.printXXX method and just forgot to release the lock - cannot be what happened here.

Furthermore, in this entire java process, the only 2 threads that call System.out.printXXX are the "old-m-a-i-n" platform thread and the virtual thread (named "main"). So it's either the "old-m-a-i-n" platform thread that acquired this lock, or the virtual thread. Since the thread dump shows the platform thread waiting for the acquisition, it must be the virtual thread which has been granted the lock and appears to have never released it. Unless of course there's a bug in the re-entrancy implementation of ReentrantLock - there isn't, so we won't get into the reentrancy part. In this thread dump (and all other thread dumps of these test timeouts) only the "old-m-a-i-n" platform thread is present and the virtual thread is nowhere to be seen. So we will focus on trying to figure out what happened to the virtual thread and why it never released that lock.

This has already become a long comment and I haven't yet explained some other details which contribute to this issue, so I'll stop this comment here and add additional details of the issue and the conclusion as separate comments. At this point what we have established is what causes the test timeout - the platform thread waits in the System.out.println(...)
call for acquisition of a ReentrantLock which is never granted. What causes this situation is what we will see in the next comments.
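Purely as an illustration of that symptom (and not as a root cause analysis, which follows in later comments), here is a minimal, self-contained sketch. The class name UnreleasedLockSketch and the thread names are made up; it only demonstrates that a ReentrantLock acquired by a virtual thread and never unlocked stays held even after that thread is gone, so a later lock() caller parks in the same Unsafe.park / AbstractQueuedSynchronizer frames seen in the "old-m-a-i-n" stack. It does not claim that this is how the virtual thread in the failing test ended up never releasing the lock.

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class UnreleasedLockSketch {
    public static void main(String[] args) throws Exception {
        ReentrantLock lock = new ReentrantLock();

        // A virtual thread takes the lock and (deliberately, for this sketch)
        // terminates without ever calling unlock(). ReentrantLock is not
        // released when its owner thread dies, so it stays locked forever.
        Thread vthread = Thread.ofVirtual().name("worker").start(() -> lock.lock());
        vthread.join();

        // The platform thread now tries to take the same lock, mirroring the
        // parked lock() call in the "old-m-a-i-n" stack above. A plain lock()
        // would park indefinitely; tryLock with a timeout is used here only so
        // that the sketch terminates.
        boolean acquired = lock.tryLock(2, TimeUnit.SECONDS);
        System.out.println("platform thread acquired the lock? " + acquired); // prints false
    }
}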
08-07-2024

I have been able to reproduce this locally and I see what's going on. I am investigating further to understand the root cause. I'll add some details soon. Initial analysis appears to suggest that this could be a genuine bug and not just a test issue.
03-07-2024

This was the test we had recently refactored. I'll take a look.
13-06-2024