JDK-8252090 : JFR: StreamWriterHost::write_unbuffered() stucks in an infinite loop OpenJDK (build 13.0.1+9)
  • Type: Bug
  • Component: hotspot
  • Sub-Component: jfr
  • Affected Version: 13
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: linux
  • CPU: x86_64
  • Submitted: 2020-08-19
  • Updated: 2021-04-21
  • Resolved: 2020-09-01
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 11 JDK 13 JDK 15 JDK 16
11.0.10-oracleFixed 13.0.6Fixed 15.0.4Fixed 16 b14Fixed
Description
ADDITIONAL SYSTEM INFORMATION :
$ /usr/lib/jvm/jdk-13/bin/java -version
openjdk version "13.0.1" 2019-10-15
OpenJDK Runtime Environment (build 13.0.1+9)
OpenJDK 64-Bit Server VM (build 13.0.1+9, mixed mode, sharing)

$ uname -a
Linux <hostname> 5.1.5-050105-generic #201905251333 SMP Wed Jun 19 09:49:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux



While the OpenJDK we use is somewhat outdated, the write_unbuffered() code is mostly unchanged and still seem to have the same issue.
https://github.com/openjdk/jdk/blob/51c8ac362e051c5e1978e1d696f537dbc0b29e03/src/hotspot/share/jfr/writers/jfrStreamWriterHost.inline.hpp#L112-L116

A DESCRIPTION OF THE PROBLEM :
We encountered an issue where JVM becomes completely unresponsive. Log indicates that the all threads except "JFR Recorder Thread" are paused in a safepoint, waiting for the JFR thread to reach the safepoint. 

This is a second time we saw this issue. The first time was in January 2020.

```
[2020-08-16T14:32:56.366+0900][warning][safepoint        ]
[2020-08-16T14:32:56.366+0900][warning][safepoint        ] # SafepointSynchronize::begin: Timeout detected:
[2020-08-16T14:32:56.366+0900][warning][safepoint        ] # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
[2020-08-16T14:32:56.366+0900][warning][safepoint        ] # SafepointSynchronize::begin: Threads which did not reach the safepoint:
[2020-08-16T14:32:56.366+0900][warning][safepoint        ] # "JFR Recorder Thread" #16 daemon prio=5 os_prio=0 cpu=5866422.36ms elapsed=6491407.13s tid=0x00007f285ee519b0 nid=0x465d runnable  [0x0000000000000000]
[2020-08-16T14:32:56.366+0900][warning][safepoint        ] java.lang.Thread.State: RUNNABLE
[2020-08-16T14:32:56.366+0900][warning][safepoint        ]
[2020-08-16T14:32:56.366+0900][warning][safepoint        ] #
SafepointSynchronize::begin: (End of list)
```

The JFR Recorder Thread was using 100% CPU. Here's the backtrace.

```
(gdb) bt
#0  0x00007f28637782b7 in write () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f2862790f19 in StreamWriterHost<MallocAdapter<1048576ul>, JfrCHeapObj>::write_unbuffered(void const*, unsigned long) () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#2  0x00007f28627908c2 in JfrStorage::write() () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#3  0x00007f2862783a93 in JfrRecorderService::pre_safepoint_write() () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#4  0x00007f2862784696 in JfrRecorderService::write() () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#5  0x00007f2862785c8b in JfrRecorderService::rotate(int) () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#6  0x00007f28627865b2 in recorderthread_entry(JavaThread*, Thread*) () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#7  0x00007f2862c76382 in JavaThread::thread_main_inner() () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#8  0x00007f2862c7afed in Thread::call_run() () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#9  0x00007f2862acd867 in thread_native_entry(Thread*) () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#10 0x00007f286376e6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#11 0x00007f286329388f in clone () from /lib/x86_64-linux-gnu/libc.so.6
```

We confirmed that the thread is running [87eee0, 87ef28] (marked by \*) endlessly in the assembly list (with comments) below. The corresponding source code is also listed below.

```asm
000000000087eea0 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm>:
  87eea0:       55                      push   %rbp
  87eea1:       48 89 e5                mov    %rsp,%rbp
  87eea4:       41 57                   push   %r15
  87eea6:       41 56                   push   %r14
  87eea8:       41 55                   push   %r13
  87eeaa:       49 89 fd                mov    %rdi,%r13
  87eead:       41 54                   push   %r12
  87eeaf:       49 89 d4                mov    %rdx,%r12
  87eeb2:       53                      push   %rbx
  87eeb3:       48 83 ec 18             sub    $0x18,%rsp
  87eeb7:       44 8b 7f 48             mov    0x48(%rdi),%r15d
  87eebb:       48 89 75 c8             mov    %rsi,-0x38(%rbp)
  87eebf:       41 83 ff ff             cmp    $0xffffffff,%r15d
  87eec3:       74 0c                   je     87eed1 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x31>
  87eec5:       4c 8b 37                mov    (%rdi),%r14
  87eec8:       48 8b 47 08             mov    0x8(%rdi),%rax
  87eecc:       4c 29 f0                sub    %r14,%rax
  87eecf:       75 6f                   jne    87ef40 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0xa0>
  87eed1:       4d 85 e4                test   %r12,%r12
  87eed4:       74 54                   je     87ef2a <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x8a>
  87eed6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  87eedd:       00 00 00
* 87eee0:       44 89 e3                mov    %r12d,%ebx ; p/x $r12 => 0xffffffff00000000 ; p/x $r12d => 0x0 ; n = len
* 87eee3:       41 81 fc ff ff ff 7f    cmp    $0x7fffffff,%r12d
* 87eeea:       b8 ff ff ff 7f          mov    $0x7fffffff,%eax
* 87eeef:       45 8b 75 48             mov    0x48(%r13),%r14d ; p $r14d => 3 ; this->_fd (/tmp/2020_06_02_11_22_49_17357/2020_08_16_14_31_56.part)
* 87eef3:       48 0f 43 d8             cmovae %rax,%rbx ; n = INT_MAX if INT_MAX < (int) len ; p $rbx => 0
* 87eef7:       eb 11                   jmp    87ef0a <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x6a>
* 87eef9:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
* 87ef00:       e8 7b fb 9f ff          callq  27ea80 <__errno_location@plt>
* 87ef05:       83 38 04                cmpl   $0x4,(%rax)
* 87ef08:       75 17                   jne    87ef21 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x81>
* 87ef0a:       48 8b 75 c8             mov    -0x38(%rbp),%rsi ; x/g $rbp-0x38 => 0x7f2637ffeb18: 0x00007f2838042290 ; buf of write(2)
* 87ef0e:       48 89 da                mov    %rbx,%rdx ; p $rdx => 0 ; count of write(2)
* 87ef11:       44 89 f7                mov    %r14d,%edi
                                        ; write(3, 0x00007f2838042290, 0) #=> 0 has been executed repeatedly.
* 87ef14:       e8 77 fc 9f ff          callq  27eb90 <write@plt>
* 87ef19:       49 89 c7                mov    %rax,%r15 ; p $rax => 0 ; ret of write(2)
* 87ef1c:       83 f8 ff                cmp    $0xffffffff,%eax
* 87ef1f:       74 df                   je     87ef00 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x60>
* 87ef21:       4d 01 7d 40             add    %r15,0x40(%r13)
* 87ef25:       49 29 dc                sub    %rbx,%r12 ; p/x $r12 => 0xffffffff00000000 ; len -= n
                                        ; We keep jumping back to 87eee0 as the upper 32 bits of %r12 (len) never reaches 0.
* 87ef28:       75 b6                   jne    87eee0 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x40>
  87ef2a:       48 83 c4 18             add    $0x18,%rsp
  87ef2e:       5b                      pop    %rbx
  87ef2f:       41 5c                   pop    %r12
  87ef31:       41 5d                   pop    %r13
  87ef33:       41 5e                   pop    %r14
  87ef35:       41 5f                   pop    %r15
  87ef37:       5d                      pop    %rbp
  87ef38:       c3                      retq
  87ef39:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  87ef40:       89 c0                   mov    %eax,%eax
  87ef42:       48 89 45 c0             mov    %rax,-0x40(%rbp)
  87ef46:       eb 12                   jmp    87ef5a <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0xba>
  87ef48:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  87ef4f:       00
  87ef50:       e8 2b fb 9f ff          callq  27ea80 <__errno_location@plt>
  87ef55:       83 38 04                cmpl   $0x4,(%rax)
  87ef58:       75 17                   jne    87ef71 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0xd1>
  87ef5a:       48 8b 55 c0             mov    -0x40(%rbp),%rdx
  87ef5e:       4c 89 f6                mov    %r14,%rsi
  87ef61:       44 89 ff                mov    %r15d,%edi
  87ef64:       e8 27 fc 9f ff          callq  27eb90 <write@plt>
  87ef69:       48 89 c3                mov    %rax,%rbx
  87ef6c:       83 f8 ff                cmp    $0xffffffff,%eax
  87ef6f:       74 df                   je     87ef50 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0xb0>
  87ef71:       49 8b 45 00             mov    0x0(%r13),%rax
  87ef75:       49 01 5d 40             add    %rbx,0x40(%r13)
  87ef79:       49 89 45 08             mov    %rax,0x8(%r13)
  87ef7d:       e9 4f ff ff ff          jmpq   87eed1 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x31>
```

```c++
template <typename Adapter, typename AP>
void StreamWriterHost<Adapter, AP>::write_unbuffered(const void* buf, size_t len) {
  this->flush();
  assert(0 == this->used_offset(), "can only seek from beginning");
  while (len > 0) {
    const int n = MIN2<int>((int)len, INT_MAX);
    _stream_pos += os::write(_fd, buf, n);
    len -= n;
  }
}
```

https://github.com/openjdk/jdk/blob/jdk-13%2B9/src/hotspot/share/jfr/writers/jfrStreamWriterHost.inline.hpp#L108-L117

Here are what we found:

* write(3, 0x00007f2838042290, 0) => 0 is invoked endlessly.
* The original `len` passed to write_unbuffered() was "negative" (`0xffffffffffffea74`). This value was recovered from `unflashed_size` at https://github.com/openjdk/jdk/blob/jdk-13%2B9/src/hotspot/share/jfr/recorder/storage/jfrStorageUtils.inline.hpp#L46. Fwiw, `t->pos()` was `0x7f2838040d04` and `current_top` was `0x7f2838042290`.

While we were unsuccessful at root-causing why `len` went negative, we identified a bug in write_unbuffered() that caused an infinite loop:
the loop in the function never ends if `len` has larger unsigned value than 0xffffffff, because `n = MIN2<int>((int)len, INT_MAX)` only sees the lower 32 bits of `len` and `n` becomes 0 when `len` still has non-zero upper bits.

write_unbuffered() also seems to have other issues that could potentially corrupt JFR recording files:

* os::write() can fail with -1.
* `len` should be decremented by the *actual* number of bytes written by os::write() which may be less than `n`, to avoid data loss.
* The current implementation could potentially write the same data repeatedly. `buf` must be advanced each time by the number of bytes written.

So, the function probably should look something like (I didn't test this code):

```c++
template <typename Adapter, typename AP>
void StreamWriterHost<Adapter, AP>::write_unbuffered(void* buf, size_t len) {
  this->flush();
  assert(0 == this->used_offset(), "can only seek from beginning");
  while (len > 0) {
    // As os::write() only accepts unsigned int (rather than size_t), we need to split the input into
    // INT_MAX sized chunks. The size of each chunk `n` is `len` clamped to INT_MAX. We can't truncate
    // `len` before comparing it to INT_MAX.
    const unsigned int n = (unsigned int) MIN2<size_t>(len, INT_MAX);
    size_t nr_written = os::write(_fd, buf, n);
    if (nr_written == -1) {
      // TODO: error handling
      return;
    }
    _stream_pos += nr_written;
    len -= nr_written;
    buf += nr_written;
  }
}
```

Even if write_unbuffered() is fixed, "negative" `len` (i.e. length larger than actual `buf`) would still corrupt JFR recordings until os::write() eventually fails with -1 (errno = EFAULT) or some other errors.
It might be good to add an assertion `t->pos() >= current_top` before calling write_unbuffered() to avoid this situation (especially, writing out a random memory location to a file is not good in terms of security).

https://github.com/openjdk/jdk/blob/jdk-13%2B9/src/hotspot/share/jfr/recorder/storage/jfrStorageUtils.inline.hpp#L46


FREQUENCY : rarely



Comments
Fix request (15u) Requesting backport to 15u for parity with 11u. The patch applies cleanly. Tested with tier1 and jdk/jfr tests.
20-04-2021

Fix request (13u) Requesting backport to 13u for parity with 11u. The patch doesn't apply cleanly due to multiple context differences in jfrChunkWriter.cpp, reapplied manually. Additionally 13u doesn't have JfrChunkHeadWriter::write_magic() method (JDK-8226511 is not in 13u), the corresponding changes are applied to JfrChunkWriter::open(). Tested with tier1 and jdk/jfr. RFR approval: http://mail.openjdk.java.net/pipermail/jdk-updates-dev/2020-December/004486.html
21-12-2020

[11u] Fix Request This fix resolves a potential infinite loop when running with JFR. Hitting this issue will render the JVM unresponsive. Therefore it sounds as the proper course of action having this fixed in the LTS version as well. RFR: https://mail.openjdk.java.net/pipermail/jdk-updates-dev/2020-October/003931.html The backport was successfully tested for tier1 and jdk_jfr tests.
07-10-2020

URL: https://hg.openjdk.java.net/jdk/jdk/rev/8051a4f3f57c User: mgronlun Date: 2020-09-01 16:07:11 +0000
01-09-2020

Additional information received from submitter <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Unfortunately, we haven't been unable to find a reproducer for this issue. We've seen this issue only two times so far this year. At a minimum, I believe the Flight Recorder must be enabled (-XX:StartFlightRecording) for this issue to happen. We use this command below to start the JVM, and this particular JVM process has been running for 2.5 months without a problem. ``` /usr/lib/jvm/jdk-13/bin/java -Dproc_regionserver -XX:OnOutOfMemoryError="kill -9 %p" -Djava.net.preferIPv4Stack=true -XX:MaxDirectMemorySize=13958643712 -Xms34359738368 -Xmx34359738368 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=(removed) -Dcom.sun.management.jmxremote.ssl=false -Dhbase.unsafe.enabled=false -Xlog:gc*=trace,safepoint*=debug,classhisto*=trace,heap*=trace,gc+tlab*=debug,gc+region*=debug,gc+metaspace*=debug,gc+nmethod*=info,gc+ref*=debug,gc+reloc*=debug:file=/var/log/hbase/jvm.log:t,l,tg:filecount=10,filesize=256M -XX:+AlwaysPreTouch -XX:+FlightRecorder -XX:FlightRecorderOptions=repository=/tmp -XX:-OmitStackTraceInFastThrow -XX:+PerfDisableSharedMem -XX:+PreserveFramePointer -XX:+PrintFlagsFinal -XX:StartFlightRecording=settings=/usr/lib/jvm/jdk-13/lib/jfr/profile.jfc,disk=true,maxage=24h,maxsize=16g -XX:+UnlockExperimentalVMOptions -XX:+UseTransparentHugePages -XX:+UseZGC -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=200 -javaagent:/opt/java-agents/java-procfs-mbeans-agent-0.0.4.jar -javaagent:/opt/java-agents/java-prometheus-metrics-agent-0.0.5.jar=(removed) -XX:MaxDirectMemorySize=17179869184 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/(removed).hprof -XX:OnOutOfMemoryError=/opt/cloudera/cm-agent/service/common/killparent.sh -Dhbase.log.dir=/var/log/hbase -Dhbase.log.file=(removed) -Dhbase.home.dir=(removed) -Dhbase.id.str= -Dhbase.root.logger=INFO,RFA -Djava.library.path=/opt/cloudera/parcels/(removed)/lib/hadoop/lib/native:/opt/cloudera/parcels/(removed)/lib/hbase/lib/native/Linux-amd64-64 -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.regionserver.HRegionServer start ``` <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< There is no reproducer shared but the submitter is able to identify the root cause <<<<<<<<<<<<<<<<<<<<<< https://github.com/openjdk/jdk/blob/jdk-13%2B9/src/hotspot/share/jfr/writers/jfrStreamWriterHost.inline.hpp#L108-L117 Here are what we found: * write(3, 0x00007f2838042290, 0) => 0 is invoked endlessly. * The original `len` passed to write_unbuffered() was "negative" (`0xffffffffffffea74`). This value was recovered from `unflashed_size` at https://github.com/openjdk/jdk/blob/jdk-13%2B9/src/hotspot/share/jfr/recorder/storage/jfrStorageUtils.inline.hpp#L46. Fwiw, `t->pos()` was `0x7f2838040d04` and `current_top` was `0x7f2838042290`. While we were unsuccessful at root-causing why `len` went negative, we identified a bug in write_unbuffered() that caused an infinite loop: the loop in the function never ends if `len` has larger unsigned value than 0xffffffff, because `n = MIN2<int>((int)len, INT_MAX)` only sees the lower 32 bits of `len` and `n` becomes 0 when `len` still has non-zero upper bits. <<<<<<<<<<<<<<<<<<<<<< A workaround is also proposed by the submitter but it is not tested. Moving this incident to dev team for further analysis.
20-08-2020

Requested submitter to provide additional information <<<<<<<<<<<<<<<<<<<<< To understand the issue better, could you please provide a standalone test case/ steps to reproduce this issue. <<<<<<<<<<<<<<<<<<<<<
19-08-2020