JDK-8261520 : JDK-8261302 breaks runtime/NMT/CheckForProperDetailStackTrace.java
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 17
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2021-02-10
  • Updated: 2021-03-03
  • Resolved: 2021-02-26
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 17
17 b12Fixed
Related Reports
Relates :  
Relates :  
Relates :  
Description
Since JDK-8261302, the test runtime/NMT/CheckForProperDetailStackTrace.java fails with

java.lang.RuntimeException: 'NativeCallStack::NativeCallStack' found in stdout

--

`NativeCallStack` contains a hash code as a convenience to place it into the malloc site hashmap. Before JDK-8261302, that hash code was calculated lazily in a non-inline hashcode getter. With JDK-8261302, the hash code calculation was moved into the `NativeCallStack` constructor and the getter was made inline.

The `NativeCallStack` constructor fills its stack frames by calling `os::get_native_stack()`. Before JDK-8261302, that call to `os::get_native_stack()` has been the last call in the constructor and has been optimized sometimes into a tail call. Whether or not its a tail call matters since it affects the number of stack frames the stack walker has to skip. Therefore, the constructor contains rather complex and fragile coding to predict whether the tail call optimization happens:

```
#if (defined(_NMT_NOINLINE_) || defined(_WINDOWS) || !defined(_LP64) || defined(PPC64))
  // Not a tail call.
  toSkip++;
#if (defined(_NMT_NOINLINE_) && defined(BSD) && defined(_LP64))
  // Mac OS X slowdebug builds have this odd behavior where NativeCallStack::NativeCallStack
  // appears as two frames, so we need to skip an extra frame.
  toSkip++;
#endif // Special-case for BSD.
#endif // Not a tail call.
```

This prediction was now wrong since the hash code calculation, which needs to happen after the stack was captured, happened afterwards. This causes the test error, since on some platforms (eg Linux x64) we now think we have  a tail call when we don't, which means we do not skip enough frames, and the NMT output contains call frames like "NativeCallStack::NativeCallStack()", which trips the test.

There are several ways to fix it:
1) leave it this way: live with the fact that we lose tail call optimization, and just always skip the frame. I am not sure how much that optimization really matters. This would have the big benefit of removing the need for the fragile guessing code which causes quite a bit of maintenance effort.
2) modify os::get_native_stack() to, in addition to returning the stack, also calculate and write the hash value. That would make os::get_native_stack() less general purpose. Currently its only used by NMT but in theory is general-purpose enough for the os:: namespace.
3) There is no good reason why class NativeCallStack even needs to hold the hash value. Arguably this is not its business. The fact that some caller (NMT) holds instances of it in a hash map should not concern it. However, moving the hash code out of NativeCallStack up into NMTs AllocationSite class would have some wider repercussions, which need checking.


Comments
Changeset: 722142ee Author: Thomas Stuefe <stuefe@openjdk.org> Date: 2021-02-26 06:46:20 +0000 URL: https://git.openjdk.java.net/jdk/commit/722142ee
26-02-2021

If anyone wants to speed this up, I need a second pair of eyes for https://github.com/openjdk/jdk/pull/2539. That is a preparatory patch for the solution I have in mind for this patch.
17-02-2021

ILW = MMM = P3
16-02-2021

The fix will be option (3), moving hash code calculation out of NativeCallStack altogether up into the MallocSiteTable. A preparatory patch is in RFR (JDK-8261644).
12-02-2021

Okay, I problemlisted it.
11-02-2021

This bug is causing repeating tier 2 failures (every run). Increasing priority. Please fix asap or problemlist the test.
11-02-2021