JDK-8269240 : java/foreign/stackwalk/TestAsyncStackWalk.java test failed with concurrent GC
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 17,18
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2021-06-23
  • Updated: 2021-08-07
  • Resolved: 2021-07-20
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 17 JDK 18
17 b32Fixed 18Fixed
Related Reports
Relates :  
Relates :  
Description
The following crash can be easily reproduced with Shenandoah and ZGC.

make CONF=linux-x86_64-server-fastdebug TEST_VM_OPTS="-XX:+UseShenandoahGC" run-test TEST=java/foreign/stackwalk/TestAsyncStackWalk.java
or 
 make CONF=linux-x86_64-server-fastdebug TEST_VM_OPTS="-XX:+UseZGC" run-test TEST=java/foreign/stackwalk/TestAsyncStackWalk.java



# To suppress the following error report, specify this argument
# after -XX: or in .hotspotrc:  SuppressErrorAt=/thread.cpp:1968
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/home/zgu/ws/jdk17/src/hotspot/share/runtime/thread.cpp:1968), pid=1704229, tid=1704245
#  assert((!has_last_Java_frame() && java_call_counter() == 0) || (has_last_Java_frame() && java_call_counter() > 0)) failed: wrong java_sp info!
#
# JRE version: OpenJDK Runtime Environment (17.0) (fastdebug build 17-internal+0-adhoc.zgu.jdk17)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 17-internal+0-adhoc.zgu.jdk17, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x1a6948c]  JavaThread::oops_do_no_frames(OopClosure*, CodeBlobClosure*)+0x1bc
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#

Comments
Verified by running the test in mach5 with -XX:+UseZGC flag
07-08-2021

[~jvernee] Please review the forwardport of this fix here: https://github.com/openjdk/jdk/pull/4842
20-07-2021

Fix Request The test failure uncovered a larger issue with how the Panama Linker API does upcalls. Particularly, the right suspension checks are not being done before entering Java, which besides the assertion failures likely has other unforeseen consequences. Without the fix, the return value of an upcall is also being corrupted in some cases. In other words: there's a risk of spurious crashes, as well as partial loss of function. The fix simplifies the generated upcall stubs somewhat, since the suspension logic is replaced by 2 calls to C++ helper methods (switching out macro assembler code for C++ code), whose implementations are taken from existing code in JavaCallWrapper's constructor and destructor, which do the right thing. The corrupted return value issue is resolved by saving and restoring the return value, which is straightforward since the surrounding code is already doing this for other values. The risk of this fix to other HotSpot code is minimal: the changed code is only run when using the Panama Linker API. This API is also still incubating. Besides the added tests that are part of the fix, combinatorial/stress testing was also added when the same fix was integrated into the panama-foreign repo [1], and these tests have gone through several CI testing cycles without other issues. The fix has been reviewed by Vladimir Ivanov, and David Holmes. Link to the Pull Request: https://github.com/openjdk/panama-foreign/pull/565 [1] : https://github.com/openjdk/panama-foreign/pull/565
20-07-2021

Fix request for JDK 17 is approved post fact.
20-07-2021

[~jvernee] The fix was integrated into JDK17 without approval which is required for current RDP2 phase: http://openjdk.java.net/jeps/3#Fix-Request-Process Please, add label and Fix request comment according to the process. And I will do my part.
20-07-2021

For the person porting this to the mainline (I think that's [~jwilhelm] ?), the patch adds a call to `current->as_Java_thread()` [1]. This should be changed to `JavaThread::cast(current)` for the mainline. As far as I know that is the only incompatibility. [1] : https://github.com/openjdk/jdk17/commit/845c31dc4b49dfbed25238a398c80b8cdd0a3997#diff-abdc409967d04172ecc20e3747aa55a79e755584d570b57c4d58902a9813d188R1600
20-07-2021

Changeset: 845c31dc Author: Jorn Vernee <jvernee@openjdk.org> Date: 2021-07-20 13:10:42 +0000 URL: https://git.openjdk.java.net/jdk17/commit/845c31dc4b49dfbed25238a398c80b8cdd0a3997
20-07-2021

ILW = HLH = P2
13-07-2021

Raising priority as the issue occurs regardless of which garbage collector is used (but with Shenandoah and ZGC the issue happens more frequently). Also, even when there is no hard JVM crash, this issue can manifest itself with an upcall values being discarded, which then results in spurious issues when interacting with native code - e.g. https://mail.openjdk.java.net/pipermail/panama-dev/2021-June/014150.html
12-07-2021

[~rehn] Hi Robbin, I agree with your assessment. The original code for doing Panama upcalls was written several years ago, and recently ported into the new branch by me. While I did re-examine what some of the original code was doing against JavaCallWrapper & generate_upcall_stub, I had assumed most things were correct. It seems that because of this some things were missed, such as this StackWatermarkSet call, as well as the call counter, and this was never caught during our own testing, or during code review (looking now it seems so obvious...). I think a re-write of some of the logic is warranted. The suspend check done by JavaCallWrapper is different to what the panama entry code does, and it seems there is no need to re-guard the stack as well. As far as the call to StackWatermarkSet on the back-edge in ~JavaCallWrapper is concerned, this is happening in an if-block guarded by `_thread->has_pending_exception()`, while Panama upcalls can never throw exceptions, so I think we are good there (but some confirmation of this would be nice). I'm still a bit confused about the assert that is triggered because of the call counter though, since it says there is an active last Java frame, but the call counter is 0, which seems like a valid condition on the calling thread when it is in native code. Though, on the thread doing the upcall, there should never be an active frame anchor AFAICS. (however, adding the call counter increment solves this issue as well).
25-06-2021

I had quick look, to me it looks the issue is that the up-calls was implemented by reversing the down-calls instead doing the same as JavaCallWrapper/~JavaCallWrapper. This means we do not increment/decrement the java call counter and check_special_condition_for_native_trans is called. check_special_condition_for_native_trans is written to be used on the back-edge of down calls, it is now used on front edge of up calls? So instead of running (on back-edge): if (_thread->has_pending_exception() && _thread->has_last_Java_frame()) { StackWatermarkSet::after_unwind(_thread); } We are now executing (on front-edge): StackWatermarkSet::before_unwind(thread); ? And never incrementing/decrementing the counter. You can reproduce without a concurrent GC by doing: -XX:GuaranteedSafepointInterval=1 -XX:+HandshakeALot The dummy handshake will do a sanity check on the JT and see the inconsistency between the call counter and last frame.
24-06-2021