JDK-8324345 : Stack overflow during C2 compilation when splitting memory phi
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 17,21,22,23
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2024-01-23
  • Updated: 2024-11-11
  • Resolved: 2024-07-31
JDK 24: 24 b09 (Fixed)
Related Reports
Relates :  
Relates :  
Description
# Failure analysis

ConnectionGraph::find_inst_mem contains recursive calls that can lead to a native C++ stack overflow in some cases.

# Original description

C2 crashes without generating an hs_err file.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/lmesnik/ws/jdk-jck/build/linux-x64/images/jdk/bin/java --enable-preview -'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fd8678a4e74 in PhiNode::verify_adr_type (this=this@entry=0x7fd7daf45928, recursive=recursive@entry=true) at /home/lmesnik/ws/jdk-jck/open/src/hotspot/share/opto/cfgnode.cpp:1188
1188	  if (VMError::is_error_reported())  return;  // muzzle asserts when debugging an error
[Current thread is 1 (Thread 0x7fd84496d640 (LWP 605955))]


With instrumentation mentioned in comments:

# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/opt/mach5/mesos/work_dir/slaves/0db9c48f-6638-40d0-9a4b-bd9cc7533eb8-S9922/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/bb2a42a8-b515-4ca4-871f-d848b5f874f4/runs/4c895d9f-7833-4a7a-bd1d-cca0ad6bce55/workspace/open/src/hotspot/share/opto/cfgnode.cpp:1162), pid=947068, tid=947140
# assert(count < 1000) failed: Stack overflow
#
# JRE version: Java(TM) SE Runtime Environment (23.0) (fastdebug build 23-internal-2024-01-23-1116378.tobias.hartmann.jdk3)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 23-internal-2024-01-23-1116378.tobias.hartmann.jdk3, compiled mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0x881018] check_for_stack_overflow() [clone .part.0]+0x28
#

Current CompileTask:
C2:62927 8732 b javax.swing.plaf.basic.BasicLookAndFeel::initComponentDefaults (14883 bytes)

Stack: [0x00007f34df8ff000,0x00007f34df9ff000], sp=0x00007f34df9c7b80, free space=802k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x881018] check_for_stack_overflow() [clone .part.0]+0x28 (cfgnode.cpp:1162)
V [libjvm.so+0x886150] PhiNode::slice_memory(TypePtr const*) const+0x0
V [libjvm.so+0x894636] PhiNode::adr_type() const+0x16
V [libjvm.so+0x138b6be] MergeMemNode::memory_at(unsigned int) const+0x2de
V [libjvm.so+0xbb58c7] ConnectionGraph::find_inst_mem(Node*, int, GrowableArray<PhiNode*>&)+0x327
V [libjvm.so+0xbb6508] ConnectionGraph::split_memory_phi(PhiNode*, int, GrowableArray<PhiNode*>&)+0x1e8
V [libjvm.so+0xbb5fb9] ConnectionGraph::find_inst_mem(Node*, int, GrowableArray<PhiNode*>&)+0xa19
V [libjvm.so+0xbb6508] ConnectionGraph::split_memory_phi(PhiNode*, int, GrowableArray<PhiNode*>&)+0x1e8
V [libjvm.so+0xbb5fb9] ConnectionGraph::find_inst_mem(Node*, int, GrowableArray<PhiNode*>&)+0xa19
V [libjvm.so+0xbb592b] ConnectionGraph::find_inst_mem(Node*, int, GrowableArray<PhiNode*>&)+0x38b
V [libjvm.so+0xbb6508] ConnectionGraph::split_memory_phi(PhiNode*, int, GrowableArray<PhiNode*>&)+0x1e8
V [libjvm.so+0xbb5fb9] ConnectionGraph::find_inst_mem(Node*, int, GrowableArray<PhiNode*>&)+0xa19
V [libjvm.so+0xbb6508] ConnectionGraph::split_memory_phi(PhiNode*, int, GrowableArray<PhiNode*>&)+0x1e8
V [libjvm.so+0xbb5fb9] ConnectionGraph::find_inst_mem(Node*, int, GrowableArray<PhiNode*>&)+0xa19
V [libjvm.so+0xbb6508] ConnectionGraph::split_memory_phi(PhiNode*, int, GrowableArray<PhiNode*>&)+0x1e8
V [libjvm.so+0xbb5fb9] ConnectionGraph::find_inst_mem(Node*, int, GrowableArray<PhiNode*>&)+0xa19
V [libjvm.so+0xbb592b] ConnectionGraph::find_inst_mem(Node*, int, GrowableArray<PhiNode*>&)+0x38b
V [libjvm.so+0xbb6508] ConnectionGraph::split_memory_phi(PhiNode*, int, GrowableArray<PhiNode*>&)+0x1e8
Comments
Changeset: fdb4350f Branch: master Author: Daniel Lundén <dlunden@openjdk.org> Date: 2024-07-31 16:05:42 +0000 URL: https://git.openjdk.org/jdk/commit/fdb4350fcecef1915cdbc27ece24153a1b6c884d
31-07-2024

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/20238 Date: 2024-07-18 15:24:33 +0000
18-07-2024

[~dlunden] Thank you for verifying that it will converge.

> I'll add a recursion depth limit and make a PR. Let me know if you have additional comments or want me to investigate more.

Good plan.
15-07-2024

The find_inst_mem method and split_memory_phi do not revisit nodes (checked experimentally and through source code review). Consequently, we are not splitting the same phi and will also converge (bounded by the number of nodes). The particular problematic case in this issue is compiling the method javax.swing.plaf.basic.BasicLookAndFeel::initComponentDefaults. It is a large method with ~95000 ideal graph nodes at the time of escape analysis, and find_inst_mem has to walk through very long chains of nodes with lots of phis, leading to the recursion-induced stack overflow. I'll add a recursion depth limit and make a PR. Let me know if you have additional comments or want me to investigate more.
15-07-2024
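
The recursion depth limit proposed above can be illustrated with a minimal sketch. This is not the code from the actual fix: `MemNode`, `Walker`, `make_chain`, and the limit value `kMaxDepth` are hypothetical names and values chosen for illustration only. The walker records a bailout instead of recursing further once the depth exceeds the limit, so the caller can abandon the analysis cleanly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a memory node in the ideal graph (not HotSpot's Node).
struct MemNode {
  std::vector<MemNode*> inputs;  // a phi would have several memory inputs
};

// Assumed limit for illustration; the value used by the real fix is not given here.
constexpr int kMaxDepth = 1000;

struct Walker {
  bool bailed_out = false;   // set when the walk is abandoned
  std::size_t visited = 0;   // number of nodes processed before any bailout

  // Recursively follows memory inputs, giving up once depth exceeds kMaxDepth.
  void walk(MemNode* n, int depth = 0) {
    if (bailed_out || n == nullptr) return;
    if (depth > kMaxDepth) {
      bailed_out = true;     // caller is expected to discard partial results
      return;
    }
    ++visited;
    for (MemNode* in : n->inputs) {
      walk(in, depth + 1);
    }
  }
};

// Builds a linear chain of `length` nodes inside `storage` and returns its head.
inline MemNode* make_chain(std::vector<MemNode>& storage, std::size_t length) {
  storage.assign(length, MemNode{});
  for (std::size_t i = 0; i + 1 < length; ++i) {
    storage[i].inputs.push_back(&storage[i + 1]);
  }
  return length ? &storage[0] : nullptr;
}
```

With these assumptions, a short chain is walked completely, while a chain longer than the limit sets `bailed_out` and stops. The trade-off is the one discussed in this thread: a bailout may give a worse analysis result, but it keeps native stack usage bounded.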

Reproduces with the simple attached test case (Test.java).
15-07-2024

Good plan. "If we do not care about getting a potentially worse result from EA". This is a corner case that is very rare in production. Run performance testing after the fix to make sure it does not affect our regular benchmarks. Note that we already have other bailouts from EA: there is a time limit, `EscapeAnalysisTimeout`, that aborts EA if it takes too long, and we limit the number of iterations over the ConnectionGraph with `GRAPH_BUILD_ITER_LIMIT`.
10-07-2024

Not yet, but working on it.
10-07-2024

Sounds good. Did you manage to extract a regression test for this?
10-07-2024

If we do not care about getting a potentially worse result from EA, bailing out based on recursion depth sounds like a much cleaner solution compared to the manual rewrite. Based on [~kvn]'s questions and comments, I will 1. check that we are not splitting the same Phi over and over again, 2. check that the process converges given unlimited stack space, and 3. add a reasonable EA bailout based on recursion depth (if still needed after 1 and 2). Thanks!
10-07-2024

First question: are we cycling through the same memory subgraph and splitting the same Phi over and over again? If so, we should catch and fix it. There should already be a check to catch this. Second, if it is not the same Phi, does the process converge given unlimited stack or memory? If so, we can simply bail out from EA by checking the recursion depth, for example. It is normal to bail out from EA if it consumes too many resources. I don't think we need to convert this from recursion to a manual stack.
09-07-2024

After JDK-8331185, my rewrite also crashes on debug builds with "Hit MemLimit (limit: 1073741824 now: 1073773536)". However, it does now generate an hs_err file since it is no longer a native C++ stack overflow. It should also run fine in release builds (but use a lot of memory). It is also possible to avoid the crash on debug builds with -XX:CompileCommand=MemLimit,*.*,0. [~kvn] [~thartmann] Any recommendations on how to proceed? I'm leaning towards going ahead and creating a PR with my rewrite above. We should also perhaps set -XX:CompileCommand=MemLimit,*.*,0 for debug build tests that potentially fail due to this issue.
09-07-2024

I've had a look at this now, and see no other (straightforward) alternative to rewriting the recursion with a manual stack. Here is a first attempt that seems to work fine (the issue no longer appears and there are no GHA failures): https://github.com/openjdk/jdk/compare/master...dlunde:jdk:stack-overflow-phi-8324345 Granted, it is less readable. But perhaps it is necessary.
30-05-2024
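
For readers unfamiliar with the technique, replacing recursion with a manual stack can be sketched generically as follows. This is not the code from the linked branch: `GraphNode` and `count_reachable` are hypothetical names, and the traversal is a plain reachability count rather than phi splitting. The point is only that pending work lives in a heap-allocated container, so the depth of the graph no longer translates into native stack frames.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical graph node for illustration (not HotSpot's Node).
struct GraphNode {
  std::vector<GraphNode*> inputs;
};

// Counts the nodes reachable from `root` through `inputs`, using an explicit
// stack of pending nodes instead of recursive calls.
inline std::size_t count_reachable(GraphNode* root) {
  std::vector<GraphNode*> pending;        // the manual stack, stored on the heap
  std::unordered_set<GraphNode*> seen;    // ensures each node is processed once
  if (root != nullptr) pending.push_back(root);

  while (!pending.empty()) {
    GraphNode* n = pending.back();
    pending.pop_back();
    if (!seen.insert(n).second) continue; // already processed
    for (GraphNode* in : n->inputs) {
      if (in != nullptr) pending.push_back(in);
    }
  }
  return seen.size();
}
```

A chain of hundreds of thousands of nodes is handled without deep call nesting, at the cost of extra heap memory and of code that is harder to follow than the recursive form.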

Targeting to JDK 24 for now since it's an old issue. Feel free to re-target to JDK 23 if the fix is ready in time.
28-05-2024

Please discuss ideas here before starting to implement them. As Xin Liu commented in JDK-8276219, it is not a simple problem.
14-02-2024

This came up before, see JDK-8276219 and the linked bugs.
09-02-2024

ILW = Stack overflow in C2 verification code, with JCK test and -Xcomp; only affects debug builds, no workaround but disable compilation of affected method = HLM = P3
23-01-2024

I can reproduce this with JDK 21u as well, so this is not a recent regression.
23-01-2024

I added some code to the problematic method to assert if the native stack is growing over 1000 frames (see overflow_check.patch) and it's indeed a stack overflow:

# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/opt/mach5/mesos/work_dir/slaves/0db9c48f-6638-40d0-9a4b-bd9cc7533eb8-S9922/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/bb2a42a8-b515-4ca4-871f-d848b5f874f4/runs/4c895d9f-7833-4a7a-bd1d-cca0ad6bce55/workspace/open/src/hotspot/share/opto/cfgnode.cpp:1162), pid=947068, tid=947140
# assert(count < 1000) failed: Stack overflow
#
# JRE version: Java(TM) SE Runtime Environment (23.0) (fastdebug build 23-internal-2024-01-23-1116378.tobias.hartmann.jdk3)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 23-internal-2024-01-23-1116378.tobias.hartmann.jdk3, compiled mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0x881018] check_for_stack_overflow() [clone .part.0]+0x28
#

Current CompileTask:
C2:62927 8732 b javax.swing.plaf.basic.BasicLookAndFeel::initComponentDefaults (14883 bytes)

Stack: [0x00007f34df8ff000,0x00007f34df9ff000], sp=0x00007f34df9c7b80, free space=802k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x881018] check_for_stack_overflow() [clone .part.0]+0x28 (cfgnode.cpp:1162)
V [libjvm.so+0x886150] PhiNode::slice_memory(TypePtr const*) const+0x0
V [libjvm.so+0x894636] PhiNode::adr_type() const+0x16
V [libjvm.so+0x138b6be] MergeMemNode::memory_at(unsigned int) const+0x2de
V [libjvm.so+0xbb58c7] ConnectionGraph::find_inst_mem(Node*, int, GrowableArray<PhiNode*>&)+0x327
V [libjvm.so+0xbb6508] ConnectionGraph::split_memory_phi(PhiNode*, int, GrowableArray<PhiNode*>&)+0x1e8
V [libjvm.so+0xbb5fb9] ConnectionGraph::find_inst_mem(Node*, int, GrowableArray<PhiNode*>&)+0xa19
V [libjvm.so+0xbb6508] ConnectionGraph::split_memory_phi(PhiNode*, int, GrowableArray<PhiNode*>&)+0x1e8
V [libjvm.so+0xbb5fb9] ConnectionGraph::find_inst_mem(Node*, int, GrowableArray<PhiNode*>&)+0xa19
V [libjvm.so+0xbb592b] ConnectionGraph::find_inst_mem(Node*, int, GrowableArray<PhiNode*>&)+0x38b
V [libjvm.so+0xbb6508] ConnectionGraph::split_memory_phi(PhiNode*, int, GrowableArray<PhiNode*>&)+0x1e8
V [libjvm.so+0xbb5fb9] ConnectionGraph::find_inst_mem(Node*, int, GrowableArray<PhiNode*>&)+0xa19
V [libjvm.so+0xbb6508] ConnectionGraph::split_memory_phi(PhiNode*, int, GrowableArray<PhiNode*>&)+0x1e8
V [libjvm.so+0xbb5fb9] ConnectionGraph::find_inst_mem(Node*, int, GrowableArray<PhiNode*>&)+0xa19
V [libjvm.so+0xbb6508] ConnectionGraph::split_memory_phi(PhiNode*, int, GrowableArray<PhiNode*>&)+0x1e8
V [libjvm.so+0xbb5fb9] ConnectionGraph::find_inst_mem(Node*, int, GrowableArray<PhiNode*>&)+0xa19
V [libjvm.so+0xbb592b] ConnectionGraph::find_inst_mem(Node*, int, GrowableArray<PhiNode*>&)+0x38b
V [libjvm.so+0xbb6508] ConnectionGraph::split_memory_phi(PhiNode*, int, GrowableArray<PhiNode*>&)+0x1e8
[...]

Unfortunately, replay compilation does not reproduce the issue.
23-01-2024
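
The contents of overflow_check.patch are not shown in this report. As a rough sketch of how such a frame-count check might look, the following uses an RAII guard that counts active frames on the current thread. `DepthGuard` and `descend` are hypothetical names, and the guard throws an exception rather than asserting (unlike the `assert(count < 1000)` in the log above) so that the behaviour can be exercised in a small test.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical guard: counts how many guarded frames are active on this thread.
class DepthGuard {
 public:
  explicit DepthGuard(int limit) {
    if (++depth() > limit) {
      --depth();  // the destructor will not run if the constructor throws
      throw std::runtime_error("recursion depth limit exceeded");
    }
  }
  ~DepthGuard() { --depth(); }

  DepthGuard(const DepthGuard&) = delete;
  DepthGuard& operator=(const DepthGuard&) = delete;

  // Number of guarded frames currently active on the calling thread.
  static int current() { return depth(); }

 private:
  static int& depth() {
    thread_local int d = 0;
    return d;
  }
};

// Hypothetical recursive routine instrumented with the guard. Returns the
// depth observed at the innermost call.
inline int descend(int remaining, int limit) {
  DepthGuard guard(limit);
  if (remaining == 0) return DepthGuard::current();
  return descend(remaining - 1, limit);
}
```

Because the counter is decremented in the destructor, it returns to zero after normal returns and after stack unwinding, so the check measures live nesting depth rather than the total number of calls.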

We need to double-check why no hs_err file is generated. I suspect that it's a stack overflow in C++ code (the failing method is recursive) and we can't recover from that to generate the file.
23-01-2024