Bug ID: JDK-8258396 SIGILL in jdk.jfr.internal.PlatformRecorder.rotateDisk()

Type: Bug
Component: hotspot
Sub-Component: jfr
Affected Version: 11.0.8

Priority: P2
Status: Closed
Resolution: Fixed

Submitted: 2020-12-15
Updated: 2025-01-13
Resolved: 2020-12-21

JDK 11	JDK 13	JDK 15	JDK 16	JDK 17	Other
11.0.11-oracle Fixed	13.0.6 Fixed	15.0.4 Fixed	16 Fixed	17 b03 Fixed	openjdk8u292 Fixed

We are seeing intermittent crashes at a customer site when JFR is rotating chunks.

{noformat}
A fatal error has been detected by the Java Runtime Environment:
SIGILL (0x4) at pc=0x00007fa665cd4e5e, pid=1, tid=376
JRE version: OpenJDK Runtime Environment Zulu11.41+23-CA (11.0.8+10) (build 11.0.8+10-LTS)
Java VM: OpenJDK 64-Bit Server VM Zulu11.41+23-CA (11.0.8+10-LTS, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
Problematic frame:
V  [libjvm.so+0x8c9e5e]
Core dump will be written. Default location: //core
An error report file with more information is saved as:
/tmp/hs_err_pid1.log
{noformat}

Thanks to @evergizova, the culprit was identified as an erroneous memcpy in JfrStorage::flush_regular() or JfrStorage::flush_large(), in combination with the fortified string headers used with musl libc, which trap when the memcpy src and dst regions overlap (https://git.2f30.org/fortify-headers/file/include/string.h.html#l39).
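To make the trap condition concrete, here is a minimal sketch of the kind of overlap test a fortified memcpy wrapper performs before copying. The function name would_trap_on_overlap is hypothetical and is not part of fortify-headers or HotSpot. The sketch assumes the wrapper treats dst == src as harmless and traps only on a partial overlap, which is what turns into SIGILL on x86 when the trap instruction executes.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (illustration only): returns true when the regions
// [dst, dst + n) and [src, src + n) partially overlap. Identical pointers
// are deliberately not reported as an overlap.
bool would_trap_on_overlap(const void* dst, const void* src, std::size_t n) {
  // Compare addresses as integers so the comparison is well defined even
  // for pointers that do not belong to the same array.
  const std::uintptr_t d = reinterpret_cast<std::uintptr_t>(dst);
  const std::uintptr_t s = reinterpret_cast<std::uintptr_t>(src);
  return (d < s && d + n > s) || (s < d && s + n > d);
}
```

With dst at the buffer start and src at start + N, the check fires exactly when the copy length exceeds N.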

The problem boils down to this: for a non-empty buffer, JfrStorage::flush_regular_buffer() resets cur.pos() to the start offset while cur_pos stays at the start offset + N. The subsequent memcpy(cur.pos(), cur_pos, used) then has overlapping src and dest regions whenever used > N, and on Alpine Linux (musl libc) SIGILL is raised.
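The shift described above can be sketched outside the JVM as follows. This is an illustration under stated assumptions, not the HotSpot code: shift_to_start and its parameters are hypothetical names that stand in for moving `used` bytes from start + N down to start. It uses memmove, which is defined for overlapping regions, where memcpy is not.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for compacting a buffer: moves `used` bytes that
// currently begin at `cur_pos` down to `start` and returns the new write
// position. The regions overlap whenever used > (cur_pos - start), so
// memmove is required; memcpy would be undefined behavior in that case.
unsigned char* shift_to_start(unsigned char* start,
                              const unsigned char* cur_pos,
                              std::size_t used) {
  assert(start <= cur_pos);  // data is only ever shifted toward the start
  if (used != 0 && cur_pos != start) {
    std::memmove(start, cur_pos, used);
  }
  return start + used;
}
```

For example, with a 16-byte buffer, N = 3 and used = 8, the source range [3, 11) overlaps the destination range [0, 8), and memmove still produces the expected contents.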

Fix request (13u) Requesting backport to 13u, the issue is present there too. The patch applies cleanly. Tested with tier1 and jdk/jfr tests.
20-01-2021
Fix request for JDK 16 retroactively approved.
18-01-2021
[16] Fix Request Please, consider this fix for backporting - it prevents SIGILL crash on musl libc based systems when using JFR. The fix is trivial - changing memcpy to memmove to account for possibly overlapping memory regions (when 'resetting' local buffer the data is shifted from pos N to pos 0). The fix applies cleanly.
18-01-2021
Changeset: e85892bf Author: Jaroslav Bachorik <jbachorik@openjdk.org> Date: 2021-01-15 15:12:03 +0000 URL: https://git.openjdk.java.net/jdk/commit/e85892bf
16-01-2021
Only P1 and P2 bugs with approval can be fixed in RDP2: http://openjdk.java.net/jeps/3#rdp-2 You need to change the priority to P2 and add a label and comment for the JDK 16 fix request, similar to what is done for 15u. I will approve after that.
16-01-2021
[~jbachorik] and [~mgronlun] - This is a P3 bug and has been integrated after RDP2 which has limited rules for integration. See https://openjdk.java.net/jeps/3#Fix-Request-Process [~kvn] should be able to help with figuring out how to get retroactive approval for a P3 fix.
15-01-2021
Hi Jaroslav, no, it's all good I think. The PR has the "ready" label. So, just comment "/integrate" on the PR and you're done.
14-01-2021
Hi Christoph, I have created a backport for JDK16 - https://github.com/openjdk/jdk16/pull/111 and changed the fix version. I will need a proper approval on that PR. It is disallowed to push directly to the openjdk/jdk repo and the backport must go through the PR. What about 15u? Would the 15u maintainer mind approving that backport as well?
14-01-2021
Jaroslav, you can still push this to jdk16. That's allowed as per RDP rules (https://openjdk.java.net/jeps/3). JDK 16 is in RDP phase one, so P3 bugfixes can still be done. How to do it is described here (Skara backport process): https://wiki.openjdk.java.net/display/SKARA/Backports#Backports-CLI - you will need to run the commands manually though, as git backport and backporting by comment in git don't work yet. But you should be able to do it without involving any other OpenJDK committer as it should be a clean backport. And please change the version of JDK-8259607 from 16.0.1 to 16 so the skara update bot can pick it up :)
12-01-2021
Chris, no problem there. Should this be labelled with 'critical' request so it can get to 16.0? Or is 16.0 already considered to be 16u?
12-01-2021
Can consideration be given to also porting this issue to 16u? (The fixVersion indicates that it missed the cut-off?)
08-01-2021
Ah, I see. I got confused by the rampdown critical request process. Since this issue was spotted quite late in the dev cycle of the current updates and it is leading to a reliable crash under given conditions I thought it would be better to get it in now than to wait another 3 months. I will adjust the labels. Actually, it is already in JDK 16 as the push was done before cut-off (it seems)
06-01-2021
Is this a regression in 8u282? It doesn't immediately seem a candidate for a critical fix and perhaps should be jdk8u-fix-request instead?
22-12-2020
[~jbachorik], I think you mean jdk8u-fix-request as well as jdk15u-fix-request and not -critical-request. The naming of these labels is a bit confusing, as "critical" doesn't stand for the criticality of the issue here but rather for whether it should still be included in the rampdown releases (e.g. the January updates this time). On the other hand, I guess it would be nice if you could backport this fix to JDK16.
21-12-2020
[15u critical] Fix Request Please, consider this fix for backporting - it prevents SIGILL crash on musl libc based systems when using JFR. The fix is trivial - changing memcpy to memmove to account for possibly overlapping memory regions (when 'resetting' local buffer the data is shifted from pos N to pos 0). The fix applies cleanly.
21-12-2020
Approving for 11.0.11 (Push to jdk11u-dev repo) as we're already in rampdown for 11.0.10. Changed flag to jdk11u-fix-request accordingly.
21-12-2020
[8u critical] Fix Request Please, consider this fix for backporting - it prevents SIGILL crash on musl libc based systems when using JFR. The fix is trivial - changing memcpy to memmove to account for possibly overlapping memory regions (when 'resetting' local buffer the data is shifted from pos N to pos 0). The fix applies cleanly.
21-12-2020
[11u critical] Fix Request Please, consider this fix for backporting - it prevents SIGILL crash on musl libc based systems when using JFR. The fix is trivial - changing memcpy to memmove to account for possibly overlapping memory regions (when 'resetting' local buffer the data is shifted from pos N to pos 0). The fix applies cleanly.
21-12-2020
Changeset: a06cea50 Author: Jaroslav Bachorik <jbachorik@openjdk.org> Date: 2020-12-21 11:43:13 +0000 URL: https://git.openjdk.java.net/jdk/commit/a06cea50
21-12-2020