JDK-8323755 : JFR fails to get a stack trace of System.currentTimeMillis() on AArch64
  • Type: Bug
  • Component: hotspot
  • Sub-Component: jfr
  • Affected Version: 11.0.21,17.0.9,21.0.1,23
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • CPU: aarch64
  • Submitted: 2024-01-16
  • Updated: 2025-02-20
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other
tbdUnresolved
Related Reports
Relates :  
Relates :  
Description
The attached example calls System.currentTimeMillis() in a loop for 10 seconds. When profiling with JFR, we expect to get nearly 1000 execution samples, since the default CPU profiling interval is 10ms. In fact, we get only ~100 samples. This means, 90% of samples are lost.

Such inaccurate profiling is completely misleading: it gives user a wrong impression that `baseline` consumes 10x more CPU comparing to `timeLoop`, while the truth is that both methods are equally expensive.

Also, JFR output (bad.jfr) contains several samples with an invalid stack trace, which looks truncated and contains a duplicate `timeLoop` frame:
 [0] Test.time
 [1] Test.wrapper
 [2] Test.timeLoop
 [3] Test.timeLoop

This is likely caused by the same problem: when the stack walker attempts to get a sender frame at a wrong location, it may find garbage on the stack and silently discard the sample, or it may find something that resembles a normal frame.

How to run:
java -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -XX:StartFlightRecording=settings=profile,filename=bad.jfr Test
Comments
> Additional note: the comment in the code refers to `capture_last_Java_pc()` which does not exist anymore. JavaFrameAnchor::capture_last_Java_pc() has been inlined in JavaFrameAnchor::make_walkable() in JDK-8282477 (JDK 19, 17).
20-02-2025

I created another 2 issues related to this one. First is platform specific JDK-8350412. The stack layout for aarch64 has been fixed in JDK 24 and this fix can be useful for older versions - to have correct stack traces in JFR (and help external profilers as well). Second one is generic JDK-8350422. JFR can be enhanced to perform stack walking even when the pc does not belong to a code blob.
20-02-2025

[~apangin] Thanks for clarifying. If your theory is correct then yes a single bug causing two problems.
17-01-2024

[~apangin] we seem to have two distinct issues here, which would require two distinct bugs please.
16-01-2024

After applying a workaround in async-profiler that accounts for these extra two words on a stack, I can finally get correct stack traces - see attached good.html. `timeLoop` and `baseline` now indeed have the same weight in the profile. JFR still reports incorrect data.
16-01-2024

The bug was originally reported against async-profiler: https://github.com/async-profiler/async-profiler/issues/819 After investigation, it was found to be a HotSpot issue. When doing a runtime call from compiled code, JVM puts two words onto the stack: https://github.com/openjdk/jdk/blob/b3634722655901b8d3e43dd1f8aa2b4487509a34/src/hotspot/cpu/aarch64/aarch64.ad#L3777 // Leave a breadcrumb for JavaFrameAnchor::capture_last_Java_pc() __ stp(zr, rscratch2, Address(__ pre(sp, -2 * wordSize))); This breaks the invariant the stack walker relies on: the size of the compiled frame is constant (i.e. SP does not change) after `frame_complete_offset`. However, in this case, SP is temporarily adjusted, and therefore JVM cannot find the sender frame. Additional note: the comment in the code refers to `capture_last_Java_pc()` which does not exist anymore.
16-01-2024

[~dholmes] If by two issues you mean missed samples and an invalid stack trace, I believe both are manifestations of the same problem - I updated the bug description to reflect this.
16-01-2024