JDK-8301341 : LinkedTransferQueue does not respect timeout for poll()
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.concurrent
  • Affected Version: 17
  • Priority: P4
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2023-01-27
  • Updated: 2024-05-17
  • Resolved: 2023-07-22
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 21 JDK 22
21.0.2Fixed 22 b08Fixed
Related Reports
Relates :  
Relates :  
Description
ADDITIONAL SYSTEM INFORMATION :
Linux 5.15.0-58-generic #64-Ubuntu SMP x86_64 GNU/Linux
Ubuntu 22.04.1 LTS
openjdk 17.0.5 2022-10-18
OpenJDK Runtime Environment (build 17.0.5+8-Ubuntu-2ubuntu122.04)
OpenJDK 64-Bit Server VM (build 17.0.5+8-Ubuntu-2ubuntu122.04, mixed mode, sharing)

A DESCRIPTION OF THE PROBLEM :
There's code that used a LinkedTransferQueue that is not returning from `poll` under certain circumstance. The code in question is https://github.com/openhab/openhab-core/blob/main/bundles/org.openhab.core/src/main/java/org/openhab/core/common/QueueingThreadPoolExecutor.java#L193. I modified it locally to include its own measurements and logging:

```
                            long start = System.nanoTime();
                            final Runnable runnable = taskQueue.poll(2, TimeUnit.SECONDS);
                            long end = System.nanoTime();
                            long duration = end - start;
                            if (duration > 3000000000L) {
                                logger.warn("Polling the task queue in thread pool '{}' took {}ms",
                                        threadPoolName, duration / 1000000L);
                            }
```

When it hangs, CPU use on that thread goes to 100%, and the stack trace is:
```
"safeCall-queue" Id=8239 in RUNNABLE
    at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:652)
    at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:616)
    at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.poll(LinkedTransferQueue.java:1294)
    at org.openhab.core.common.QueueingThreadPoolExecutor$1.run(QueueingThreadPoolExecutor.java:193)
    at java.base@17.0.5/java.lang.Thread.run(Thread.java:833)
```

This stack trace is consistent, no matter how many times I sample it, I've not caught it executing on a different line.

At this point, I used another thread to inspect the queue, and confirmed that from its point of view the queue is empty. I was unable to inspect the internal data structures of the queue via reflection, because `head` and `tail` are `volatile transient`. I _am_ able to force it to break out by forcing an item to be pushed onto the queue (line 118 in that file). At which point, my logs indicated the following:

```
2023-01-27 08:05:17.338 [WARN ] [re.common.QueueingThreadPoolExecutor] - Polling the task queue in thread pool 'safeCall' took 1377089ms
```

Definitely longer than 2s! I'm not sure if the `awaitMatch` method is messing up its time-keeping, or if it's getting confused by some rare intermediate state. It's not completely deadlocked, because pushing a new element allows it to break out. And it's definitely spinning chewing up CPU usage, and not parking itself.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Run openHAB 3.4.1 on my hardware, and I'm sure a smattering of my particular configuration. I can reproduce within a few hours of restarting the process, no less often than once per day.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
`poll` returns after 2 seconds, even if the queue is empty.
ACTUAL -
CPU usage gets pegged at 100%.

CUSTOMER SUBMITTED WORKAROUND :
I'm able to notice the situation, and trigger an item to be pushed onto the pertinent queue, dislodging it.

FREQUENCY : often



Comments
A pull request was submitted for review. URL: https://git.openjdk.org/jdk21u/pull/168 Date: 2023-09-15 18:39:02 +0000
15-09-2023

Changeset: 8d1ab570 Author: Doug Lea <dl@openjdk.org> Date: 2023-07-22 10:41:42 +0000 URL: https://git.openjdk.org/jdk/commit/8d1ab57065c7ebcc650b5fb4ae098f8b0a35f112
22-07-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/14317 Date: 2023-06-05 18:52:00 +0000
05-06-2023

Updated to use Thread.isVirtual to distinguish spin vs immediate block cases, enabling re-introduction of spin control from the previous version, removing anomalies like this one. There wasn't a stand-alone test case submitted for this, and I'm not sure how to create one that merely shows fewer performance glitches.
05-06-2023

This is a similar issue as https://bugs.openjdk.org/browse/JDK-8300663 - and might await non-preview status of loom to address
06-02-2023

Just to be clear, these kinds of timeouts only apply to the time a thread waits (ie parks) for some state to change, it does not apply to the whole operation. If the code enters a busy loop rather then parking, then the timeout will not apply.
31-01-2023

The problem was observed on openHAB 3.4.1
30-01-2023