JDK-8339349 : Crash in the GC running the DaCapo spring benchmark
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 22,23,24
  • Priority: P2
  • Status: In Progress
  • Resolution: Unresolved
  • Submitted: 2024-08-30
  • Updated: 2024-11-19
JDK 24: Unresolved
Description
I can intermittently (maybe 3% of the time) get a crash in the G1GC collector using OpenJDK-22.0.2+9 (from here: https://jdk.java.net/22/) to run the DaCapo 23.11-chopin "spring" benchmark (from here: https://github.com/dacapobench/dacapobench/releases).

I am running on an Ampere Computing Altra (aarch64).  I have not tried other platforms.

The command line is minimal:

    $ ${JAVA_HOME}/bin/java -jar ${DACAPO_HOME}/dacapo-23.11-chopin.jar --iterations 5 --size default spring

The crashed threads' stacks look like:

    Current thread (0x00004000d40146c0):  WorkerThread "GC Thread#50"   [id=3890367, stack(0x0000400498850000,0x0000400498a50000) (2048K)]

    Stack: [0x0000400498850000,0x0000400498a50000],  sp=0x0000400498a4da70,  free space=2038k
    Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
    V  [libjvm.so+0xae8918]  markWord::displaced_mark_helper() const+0x18
    V  [libjvm.so+0x735060]  G1ParCopyClosure<(G1Barrier)0, false>::do_oop(oopDesc**)+0x180
    V  [libjvm.so+0xb80b98]  InterpreterOopMap::iterate_oop(OffsetClosure*) const+0x114
    V  [libjvm.so+0x6a9e1c]  frame::oops_interpreted_do(OopClosure*, RegisterMap const*, bool) const+0x16c
    V  [libjvm.so+0x8187f0]  JavaThread::oops_do_frames(OopClosure*, CodeBlobClosure*) [clone .part.0]+0xd0
    V  [libjvm.so+0xd0aa3c]  Thread::oops_do(OopClosure*, CodeBlobClosure*)+0xbc
    V  [libjvm.so+0xd16a70]  Threads::possibly_parallel_oops_do(bool, OopClosure*, CodeBlobClosure*)+0x10c
    V  [libjvm.so+0x737980]  G1RootProcessor::process_java_roots(G1RootClosures*, G1GCPhaseTimes*, unsigned int)+0x80
    V  [libjvm.so+0x737a84]  G1RootProcessor::evacuate_roots(G1ParScanThreadState*, unsigned int)+0x64
    V  [libjvm.so+0x749c04]  G1EvacuateRegionsTask::scan_roots(G1ParScanThreadState*, unsigned int)+0x24
    ...

where the last frame is always at markWord::displaced_mark_helper() const+0x18

I have attached a sample hs_err_pid*.log file.  I have more of them if you want them.  I have also attached a file with the tops of the thread stacks from my 5 most recent hs_err_pid*.log files.

Comments
After more triage, 22-b8 did not crash in 3000 tries, but 22-b9 crashed at least twice in 1500 tries.
14-11-2024

This morning I got a similar crash in b12:

    # JRE version: Java(TM) SE Runtime Environment (22.0+12) (build 22-ea+12-877)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (22-ea+12-877, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
    # Problematic frame:
    # V  [libjvm.so+0x6cfbb4]  G1ConcurrentMark::mark_in_bitmap(unsigned int, oopDesc*) [clone .constprop.0]+0x44

B1 worked for 1500 loops. Trying b11 next.
07-11-2024

Assigning this to you [~thartmann], I'm focusing my work on the split out JBS issue JDK-8342498
01-11-2024

Some observations:
- Failures all happen quite early, 7-11 seconds after VM startup.
- We always crash while processing oops in an interpreter frame (frame::oops_interpreted_do -> InterpreterOopMap::iterate_oop).
- Most of the frames on the stack of the processed thread are either interpreted or C1 compiled.
- No relevant deopt from compiled code in the log.
- Happens both with G1 as well as with Parallel GC.
- Initial investigation suggests that the issue first reproduces with JDK 22 b15, potentially due to JDK-8308869, because it does not seem to reproduce anymore with -XX:TypeProfileSubTypeCheckCommonThreshold=100 (but that could just be due to different timing).
25-10-2024

FTR, Emanuel filed JDK-8342498 for the assert. We are currently trying to narrow down the GC crash.
24-10-2024

Ok, there seem to be 2 failure modes here, and possibly two separate bugs. The originally reported GC failure:

    Current thread (0x00004000d40146c0):  WorkerThread "GC Thread#50"   [id=3890367, stack(0x0000400498850000,0x0000400498a50000) (2048K)]

    Stack: [0x0000400498850000,0x0000400498a50000],  sp=0x0000400498a4da70,  free space=2038k
    Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
    V  [libjvm.so+0xae8918]  markWord::displaced_mark_helper() const+0x18
    V  [libjvm.so+0x735060]  G1ParCopyClosure<(G1Barrier)0, false>::do_oop(oopDesc**)+0x180
    V  [libjvm.so+0xb80b98]  InterpreterOopMap::iterate_oop(OffsetClosure*) const+0x114
    V  [libjvm.so+0x6a9e1c]  frame::oops_interpreted_do(OopClosure*, RegisterMap const*, bool) const+0x16c
    V  [libjvm.so+0x8187f0]  JavaThread::oops_do_frames(OopClosure*, CodeBlobClosure*) [clone .part.0]+0xd0
    V  [libjvm.so+0xd0aa3c]  Thread::oops_do(OopClosure*, CodeBlobClosure*)+0xbc
    V  [libjvm.so+0xd16a70]  Threads::possibly_parallel_oops_do(bool, OopClosure*, CodeBlobClosure*)+0x10c
    V  [libjvm.so+0x737980]  G1RootProcessor::process_java_roots(G1RootClosures*, G1GCPhaseTimes*, unsigned int)+0x80
    V  [libjvm.so+0x737a84]  G1RootProcessor::evacuate_roots(G1ParScanThreadState*, unsigned int)+0x64
    V  [libjvm.so+0x749c04]  G1EvacuateRegionsTask::scan_roots(G1ParScanThreadState*, unsigned int)+0x24

But the assert we trigger with the replay file is a second failure mode:

    # Error: assert(this_region != nullptr) failed

And in product we hit a SIGSEGV in a different place:

    Current CompileTask:
    C2:42341  186  b  4  org.apache.coyote.http11.Http11OutputBuffer::write (93 bytes)

    Stack: [0x00007fc30efaf000,0x00007fc30f0b0000],  sp=0x00007fc30f0ab760,  free space=1009k
    Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
    V  [libjvm.so+0x7910fa]  G1BarrierSetC2::eliminate_gc_barrier(PhaseMacroExpand*, Node*) const+0x22a  (node.hpp:406)
    V  [libjvm.so+0xbd119f]  PhaseMacroExpand::process_users_of_allocation(CallNode*)+0x6bf  (macro.cpp:159)
    V  [libjvm.so+0xbd6d0e]  PhaseMacroExpand::eliminate_allocate_node(AllocateNode*)+0x1ee  (macro.cpp:1100)
    V  [libjvm.so+0xbd6e92]  PhaseMacroExpand::eliminate_macro_nodes()+0x122  (macro.cpp:2386)
    V  [libjvm.so+0xbd6f39]  PhaseMacroExpand::expand_macro_nodes()+0x19  (macro.cpp:2434)
    V  [libjvm.so+0x641bee]  Compile::Optimize()+0x89e  (compile.cpp:2446)
    V  [libjvm.so+0x6432ad]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0xedd  (compile.cpp:857)
    V  [libjvm.so+0x56b091]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1f1  (c2compiler.cpp:134)
    V  [libjvm.so+0x648c71]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0xae1  (compileBroker.cpp:2299)
    V  [libjvm.so+0x64bd58]  CompileBroker::compiler_thread_loop()+0x498  (compileBroker.cpp:1958)
    V  [libjvm.so+0x909d38]  JavaThread::thread_main_inner() [clone .part.0]+0xb8  (javaThread.cpp:721)
    V  [libjvm.so+0xebcf7f]  Thread::call_run()+0x9f  (thread.cpp:220)
    V  [libjvm.so+0xce0485]  thread_native_entry(Thread*)+0xd5  (os_linux.cpp:789)
17-10-2024

CastP2X with NULL control is not GC related. That code was added during early HotSpot C2 development: https://github.com/openjdk/jdk/blame/302540691b288d181ecdde8a8a6de2b49760f111/hotspot/src/share/vm/opto/library_call.cpp#L2587 I see it in JDK 6 changes already. At that time `p` was a Load with a control edge, so we could skip control for CastP2X. In current mainline code, `p` could be a constant (load from a constant field) without control, or some "normalized" value with control. So I think not having a control edge for CastP2X is fine there.
16-10-2024

Just had an offline chat with [~thartmann]. It could well be that the original crash in the GC is not the same issue as the assert I am now looking at. Hmm. I removed asserts until I hit a SIGSEGV, and it was still in G1BarrierSetC2::eliminate_gc_barrier, not in G1GC or ParallelGC. That would indicate that we are possibly dealing with multiple issues here.
16-10-2024

To answer [~thartmann]:

> I narrowed it down. The issue is introduced/triggered by JDK-8308606 in JDK 22 b03 (see hs_err_pid1979173.log) and fixed/hidden by JDK-8310190 in JDK 23 b05.
> Emanuel, please have a look and verify that the issue was indeed introduced by JDK-8308606 and fixed by JDK-8310190. If so, we need to re-triage those bugs.

I think it may have been triggered/hidden by my changes, but not introduced/fixed, because both before and after my changes SuperWord uses CastP2X without ctrl. And that is exactly what the G1GC code asserts on: it expects a ctrl.

What I am not sure about: is the G1GC code wrong to assume that all CastP2X nodes must have ctrl, or does the SuperWord code make a mistake by not setting a ctrl on its CastP2X? I have been told (but may be wrong / have misunderstood) that the ctrl makes sure we do not evaluate the CastP2X before a safepoint and then use its value after the safepoint - where the underlying pointer/object location has potentially changed and the CastP2X value is now outdated. That could generally be a correctness problem. But in my understanding, the CastP2X use in SuperWord does not suffer a correctness problem (for SuperWord itself) - though the alignment may be suboptimal in those few cases where the array was moved underneath the execution, a rare and not very impactful consequence. Then the bug might just be in the G1GC code, because it makes a bad assumption about ctrl always being present.

Now Roberto has changed the G1GC code in JDK-8334060, and the assertion is not present in JDK 24 any more. I am not sure if that fixes a bad assert or just further hides a real issue.

These are the 2 uses of CastP2X without ctrl:

    src/hotspot/share/opto/library_call.cpp:  p = gvn().transform(new CastP2XNode(nullptr, p));
    src/hotspot/share/opto/superword.cpp:     Node* xbase = new CastP2XNode(nullptr, base);

If having no ctrl for CastP2X is generally an issue, then we should probably add an assert for that. Does anybody else have some ideas here?
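To make the staleness argument concrete, here is a toy model in plain Python (not C2 IR; the addresses and vector width are made up). It shows why a pointer "integerized" before a safepoint is only an alignment pessimization for a SuperWord-style use, but would be a real correctness bug if the integer were ever turned back into a pointer:

```python
# Toy model of the discussion above: a CastP2X-like "integerized oop" computed
# before a safepoint goes stale if the GC moves the object at the safepoint.
OBJ_BEFORE = 0x1000_0020  # made-up object address before the safepoint
OBJ_AFTER  = 0x2000_0008  # made-up address after the GC moved the object

def cast_p2x(address):
    """Stand-in for CastP2X: expose a pointer as a raw integer."""
    return address

def alignment_adjustment(address, vector_width=32):
    """SuperWord-style use: bytes until the next vector-aligned boundary."""
    return (-address) % vector_width

adjust_before = alignment_adjustment(cast_p2x(OBJ_BEFORE))  # -> 0
adjust_after = alignment_adjustment(cast_p2x(OBJ_AFTER))    # -> 24

# The pre-safepoint value is merely suboptimal for alignment purposes,
# but dereferencing it as a pointer after the move would be a crash.
assert adjust_before != adjust_after
```

The point of the sketch: the stale integer only changes which alignment adjustment gets picked, so the vector loop still computes correct results, just possibly with misaligned (slower) accesses.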
16-10-2024

[~rcastanedalo] answered me offline. In G1BarrierSetC2::post_barrier, we had this:

    // Convert the store obj pointer to an int prior to doing math on it
    // Must use ctrl to prevent "integerized oop" existing across safepoint
    Node* cast = __ CastPX(__ ctrl(), adr);
15-10-2024

[~rcastanedalo] just completely refactored this G1GC code with his new change 2 weeks ago: JDK-8334060: Implementation of Late Barrier Expansion for G1. Ha, he actually removed the assumption. [~rcastanedalo], do you think that code was ever correct?
14-10-2024

I can reproduce it. And it looks like we find something unexpected with the CastP2XNode added in SuperWord::align_initial_loop_index (in JDK 24 this is the method VTransform::adjust_pre_loop_limit_to_align_main_loop_vectors):

    Node* xbase = new CastP2XNode(nullptr, align_to_ref_p.adr());

We explicitly do not set any control. We never have, in any JDK version I have seen, so that is not the issue. But in src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp:

    void G1BarrierSetC2::eliminate_gc_barrier(PhaseMacroExpand* macro, Node* node) const {
      if (is_g1_pre_val_load(node)) {
        macro->replace_node(node, macro->zerocon(node->as_Load()->bottom_type()->basic_type()));
      } else {
        assert(node->Opcode() == Op_CastP2X, "ConvP2XNode required");
        assert(node->outcnt() <= 2, "expects 1 or 2 users: Xor and URShift nodes");
        // It could be only one user, URShift node, in Object.clone() intrinsic
        // but the new allocation is passed to arraycopy stub and it could not
        // be scalar replaced. So we don't check the case.
        // An other case of only one user (Xor) is when the value check for null
        // in G1 post barrier is folded after CCP so the code which used URShift
        // is removed.
        // Take Region node before eliminating post barrier since it also
        // eliminates CastP2X node when it has only one user.
        Node* this_region = node->in(0);
        assert(this_region != nullptr, "");

Not sure why we land here, but we seem to assume that the CastP2X has a ctrl, and that this is a Region.
    (rr) p res->dump_bfs(15,find_node(5251),"#-dc")
    dist dump
    ---------------------------------------------
      0  221  CheckCastPP === 382 383 [[ 978 974 2741 2449 2449 985 978 985 974 7094 ]] #java/nio/StringCharBuffer (java/lang/Comparable,java/lang/CharSequence,java/lang/Appendable,java/lang/Readable):NotNull:exact *,iid=772 Oop:java/nio/StringCharBuffer (java/lang/Comparable,java/lang/CharSequence,java/lang/Appendable,java/lang/Readable):NotNull:exact *,iid=772 !orig=5767 !jvms: CharBuffer::wrap @ bci:0 (line 548) CharBuffer::wrap @ bci:8 (line 569) Charset::encode @ bci:2 (line 946) MessageBytes::toBytes @ bci:47 (line 248) Http11OutputBuffer::write @ bci:9 (line 394)
      1  7094  CastP2X === _ 221 [[ 7095 ]]
      2  7095  ConvL2I === _ 7094 [[ 7106 ]] #int
      3  7106  URShiftI === _ 7095 813 [[ 7107 ]]
      4  7107  AndI === _ 7106 1484 [[ 7108 ]] !orig=[7098]
      5  7108  AddI === _ 5552 7107 [[ 7100 ]] !orig=[7099]
      6  7100  AddI === _ 7108 7093 [[ 7101 ]]
      7  7101  AndI === _ 7100 1484 [[ 7102 ]]
      8  7102  AddI === _ 5552 7101 [[ 7103 ]]
      9  7103  MinI === _ 7102 462 [[ 5228 ]] !orig=[5476]
     10  5228  CmpI === _ 5230 7103 [[ 5227 ]] !orig=4551,4341,[2016] !jvms: Buffer::hasRemaining @ bci:8 (line 533) ISO_8859_1$Encoder::encodeBufferLoop @ bci:6 (line 221) ISO_8859_1$Encoder::encodeLoop @ bci:24 (line 246) CharsetEncoder::encode @ bci:57 (line 586) CharsetEncoder::encode @ bci:51 (line 821) Charset::encode @ bci:17 (line 925) Charset::encode @ bci:5 (line 946) MessageBytes::toBytes @ bci:47 (line 248) Http11OutputBuffer::write @ bci:9 (line 394)
     11  5227  Bool === _ 5228 [[ 5251 ]] [lt] !orig=4550,4342,[1824] !jvms: Buffer::hasRemaining @ bci:8 (line 533) ISO_8859_1$Encoder::encodeBufferLoop @ bci:6 (line 221) ISO_8859_1$Encoder::encodeLoop @ bci:24 (line 246) CharsetEncoder::encode @ bci:57 (line 586) CharsetEncoder::encode @ bci:51 (line 821) Charset::encode @ bci:17 (line 925) Charset::encode @ bci:5 (line 946) MessageBytes::toBytes @ bci:47 (line 248) Http11OutputBuffer::write @ bci:9 (line 394)
     12  5251  CountedLoopEnd === 5250 5227 [[ 5252 5293 ]] [lt] P=0.500000, C=59974.000000 !orig=4591,4343,[1556] !jvms: ISO_8859_1$Encoder::encodeBufferLoop @ bci:9 (line 221) ISO_8859_1$Encoder::encodeLoop @ bci:24 (line 246) CharsetEncoder::encode @ bci:57 (line 586) CharsetEncoder::encode @ bci:51 (line 821) Charset::encode @ bci:17 (line 925) Charset::encode @ bci:5 (line 946) MessageBytes::toBytes @ bci:47 (line 248) Http11OutputBuffer::write @ bci:9 (line 394)
    $26 = void

We have some Allocate node, and its result is "221 CheckCastPP". This also happens to be the base pointer for which SuperWord tries to align. Somehow, the G1GC code seems to assume that if a use of CheckCastPP is a CastP2X, then it must have a ctrl. But this is a bit of a strong assumption... we can also use it for alignment in SuperWord without a ctrl. Why did my SuperWord changes have an effect on this? Need to investigate that... Maybe I need to extract a reproducer.
14-10-2024

I can reproduce the issue on both Linux x64 and AArch64 with the attached replay_pid3400217.log and the jars from jars.zip, from JDK 22 b26 up to JDK-8310190 in JDK 23 b05:

    java -XX:+ReplayCompiles -XX:+ReplayIgnoreInitErrors -XX:ReplayDataFile=replay_pid3400217.log -cp "jars/*"

    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  Internal Error (workspace/open/src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp:730), pid=3769317, tid=3769331
    #  Error: assert(this_region != nullptr) failed
    #
    # JRE version: Java(TM) SE Runtime Environment (23.0+4) (fastdebug build 23-ea+4-185)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 23-ea+4-185, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
    # Problematic frame:
    # V  [libjvm.so+0xc38091]  G1BarrierSetC2::eliminate_gc_barrier(PhaseMacroExpand*, Node*) const+0x411

Before JDK 22 b26, the replay file does not work because of profile information changes by JDK-8267532.
14-10-2024

Emanuel, please have a look and verify that the issue was indeed introduced by JDK-8308606 and fixed by JDK-8310190. If so, we need to re-triage those bugs.
09-10-2024

I narrowed it down. The issue is introduced/triggered by JDK-8308606 in JDK 22 b03 (see hs_err_pid1979173.log) and fixed/hidden by JDK-8310190 in JDK 23 b05.
09-10-2024

ILW = Crash in GC or assert during C2 compilation, intermittent with DaCapo benchmark but reproducible with replay file, -XX:-UseSuperWord or disable compilation of affected method = HMM = P2
08-10-2024

Not able to reproduce after 2500 runs. [~ecaspole] were you able to narrow this down a bit more?
07-10-2024

I tried the replay file that Eric provided but unfortunately it does not work:

    java -XX:+ReplayCompiles -XX:+ReplayIgnoreInitErrors -XX:ReplayDataFile=replay_pid3400217.log -cp dacapo-23.11-chopin/jar/spring/*:dacapo-23.11-chopin/jar/tomcat/*

    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  Internal Error (workspace/open/src/hotspot/share/ci/ciReplay.cpp:1439), pid=1383720, tid=1383734
    #  assert(m->_data_size + m->_extra_data_size == rec->_data_length * (int)sizeof(rec->_data[0]) || m->_data_size == rec->_data_length * (int)sizeof(rec->_data[0])) failed: must agree

Looks like we assert when compiling this method: https://github.com/apache/tomcat/blob/main/java/org/apache/coyote/http11/Http11OutputBuffer.java#L387

Trying to reproduce.
01-10-2024

Moving to compiler team since the assert is in optimization of the G1 barrier code in C2.
01-10-2024

Running with fastdebug builds I get an assert:

    # Error: assert(this_region != nullptr) failed
    #
    # JRE version: Java(TM) SE Runtime Environment (22.0.2+9) (fastdebug build 22.0.2+9-70)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 22.0.2+9-70, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
    # Problematic frame:
    # V  [libjvm.so+0xb03c40]  G1BarrierSetC2::eliminate_gc_barrier(PhaseMacroExpand*, Node*) const+0x3bc

    Stack: [0x0000fffc7e440000,0x0000fffc7e640000],  sp=0x0000fffc7e63ab70,  free space=2026k
    Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
    V  [libjvm.so+0xb03c40]  G1BarrierSetC2::eliminate_gc_barrier(PhaseMacroExpand*, Node*) const+0x3bc  (g1BarrierSetC2.cpp:730)
    V  [libjvm.so+0x116af40]  PhaseMacroExpand::process_users_of_allocation(CallNode*)+0x5c0  (macro.cpp:159)
    V  [libjvm.so+0x1175e34]  PhaseMacroExpand::eliminate_allocate_node(AllocateNode*)+0x284  (macro.cpp:1100)
    V  [libjvm.so+0x1176518]  PhaseMacroExpand::eliminate_macro_nodes()+0x3d8  (macro.cpp:2386)
    V  [libjvm.so+0x11767f4]  PhaseMacroExpand::expand_macro_nodes()+0x14  (macro.cpp:2434)
    V  [libjvm.so+0x8e3d7c]  Compile::Optimize()+0x98c  (compile.cpp:2439)
    V  [libjvm.so+0x8e6940]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x14a0  (compile.cpp:856)
    V  [libjvm.so+0x73a3f0]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x17c  (c2compiler.cpp:134)
    V  [libjvm.so+0x8f2354]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0x7d0  (compileBroker.cpp:2305)
    V  [libjvm.so+0x8f2f1c]  CompileBroker::compiler_thread_loop()+0x598  (compileBroker.cpp:1964)
    V  [libjvm.so+0xd73d6c]  JavaThread::thread_main_inner()+0xcc  (javaThread.cpp:721)
    V  [libjvm.so+0x15b1f80]  Thread::call_run()+0xac  (thread.cpp:225)
    V  [libjvm.so+0x1324134]  thread_native_entry(Thread*)+0x130  (os_linux.cpp:796)
    C  [libpthread.so.0+0x7950]  start_thread+0x190

I will switch to ParGC and see what happens.
13-09-2024

I ran the same command

    ${JVM_HOME}/OpenJDK/22.0.2+9/bin/java -XX:+UseG1GC ${DACAPO_HOME}/dacapo-23.11-chopin.jar --iterations 2 --size default spring

on a Graviton2 700 times, and got 4 hs_err_pid*.log files. Three of them look like the ones I attached earlier (at markWord::displaced_mark_helper() const+0x18), and I have attached one of those. But one of them says the SIGSEGV is at markWord::displaced_mark_helper() const+0x24, which is different enough that I have attached it also.
13-09-2024

Running with

    ${OPENJDK_HOME}/jdk-22.0.2/bin/java -XX:+UseG1GC -jar /tmp/pkessler/DaCapo_23.11-chopin/dacapo-23.11-chopin.jar --iterations 2 --size default spring

(that is, using G1GC and --iterations 2) works well (~2% of the time) to reproduce the crash. It is possible that the difference in failure rates between G1GC and ParallelGC is some heap policy that causes more frequent collections with G1GC. Which suggests that this issue might be susceptible to -XX:+ScavengeALot (or its friends) in a non-product build.
09-09-2024

That's 2 crashes in 1500 runs, if you are keeping score. Since these crashes seem to happen early in the runs, probably the `--iterations 5` on the command line could be changed to `--iterations 2` to save some time. But I haven't tried that.
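A retry loop along these lines is what such a reproduction campaign boils down to. This is a sketch, not the script actually used; the DaCapo command shown in the comment is a placeholder you would substitute (a SIGSEGV surfaces as a non-zero, in fact negative, return code on POSIX):

```python
# Sketch of a stress loop for an intermittent crash: run a command repeatedly
# and count the runs that die with a non-zero exit status.
import subprocess

def count_failures(cmd, runs):
    """Run `cmd` `runs` times; return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:  # a SIGSEGV shows up as a negative code
            failures += 1
    return failures

# Placeholder reproduction command (adjust paths; --iterations 2 per above):
# count_failures(["java", "-jar", "dacapo-23.11-chopin.jar",
#                 "--iterations", "2", "--size", "default", "spring"], 1500)
```

With a ~0.1-2% failure rate per run, on the order of a thousand runs are needed before a clean result says much, which is why the counts quoted in these comments are in the thousands.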
07-09-2024

I changed the title of the issue from G1GC to GC, since this seems not to be a G1GC issue.
07-09-2024

Good (!?) news. I was able to get one of these crashes with -XX:+UseParallelGC. I now have 2 hs_err_pid*.log files that say:

    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x0000400008ab8918, pid=440290, tid=440337
    #
    # JRE version: OpenJDK Runtime Environment (22.0.2+9) (build 22.0.2+9-70)
    # Java VM: OpenJDK 64-Bit Server VM (22.0.2+9-70, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, parallel gc, linux-aarch64)
    # Problematic frame:
    # V  [libjvm.so+0xae8918]  markWord::displaced_mark_helper() const+0x18
    ...
    ---------------  S U M M A R Y ------------
    Command Line: -XX:+UseParallelGC /home/pkessler/Work/DaCapo/DaCapo_23.11-chopin/dacapo-23.11-chopin.jar --iterations 5 --size default spring
    Host: AArch64, 160 cores, 509G, Fedora release 36 (Thirty Six)
    Time: Sat Sep 7 03:19:09 2024 PDT elapsed time: 8.101434 seconds (0d 0h 0m 8s)
    ...
    Stack: [0x0000400097af0000,0x0000400097cf0000],  sp=0x0000400097cedc00,  free space=2039k
    Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
    V  [libjvm.so+0xae8918]  markWord::displaced_mark_helper() const+0x18
    V  [libjvm.so+0xc0a088]  PSRootsClosure<false>::do_oop(oopDesc**)+0x48
    V  [libjvm.so+0xb80b98]  InterpreterOopMap::iterate_oop(OffsetClosure*) const+0x114
    V  [libjvm.so+0x6a9e1c]  frame::oops_interpreted_do(OopClosure*, RegisterMap const*, bool) const+0x16c
    V  [libjvm.so+0x8187f0]  JavaThread::oops_do_frames(OopClosure*, CodeBlobClosure*) [clone .part.0]+0xd0
    V  [libjvm.so+0xd0aa3c]  Thread::oops_do(OopClosure*, CodeBlobClosure*)+0xbc
    V  [libjvm.so+0xc08b30]  PSThreadRootsTaskClosure::do_thread(Thread*)+0x50
    V  [libjvm.so+0xd15e08]  Threads::possibly_parallel_threads_do(bool, ThreadClosure*)+0x108
    V  [libjvm.so+0xc09cbc]  ScavengeRootsTask::work(unsigned int)+0x12c
    V  [libjvm.so+0xdae9c8]  WorkerThread::run()+0x98
    V  [libjvm.so+0xd0af28]  Thread::call_run()+0xa8
    V  [libjvm.so+0xb93f10]  thread_native_entry(Thread*)+0xdc
    C  [libc.so.6+0x809b8]  start_thread+0x2c8

I have attached hs_err_pid440290.log and hs_err_pid537557.log.
07-09-2024

If you are running on a dual-processor Altra, it might make your run times more consistent to run with `numactl --cpunodebind=0 --membind=0 --`. The spring benchmark scales to 16 cores, but there is no point in giving it all 160 cores to get confused on.
03-09-2024

I realized I accidentally edited out the `-jar` option from the command line when I simplified it. I added it back in the description above.
03-09-2024

I apologize for not having a lib/hsdis-aarch64.so in place to decode the instructions in the hs_err_pid*.log file. I have rectified that now, but I will have to wait for a new crash to know if I did that correctly. [Update: I don't think hs_err processing tries to disassemble the instructions around the crash site.]
31-08-2024

I think that markWord::displaced_mark_helper() is https://github.com/openjdk/jdk/blob/master/src/hotspot/share/oops/markWord.cpp#L32

    markWord markWord::displaced_mark_helper() const {
      assert(has_displaced_mark_helper(), "check");
      if (has_monitor()) {
        // Has an inflated monitor. Must be checked before has_locker().
        ObjectMonitor* monitor = this->monitor();
        return monitor->header();
      }
      if (has_locker()) {  // has a stack lock
        BasicLock* locker = this->locker();
        return locker->displaced_header();
      }
      // This should never happen:
      fatal("bad header=" INTPTR_FORMAT, value());
      return markWord(value());
    }

and that the disassembly around the site of the segfault is

    0x40002ad78900 :   mov   x4, x0
    0x40002ad78904 :   ldr   x0, [x0]
    0x40002ad78908 :   and   x1, x0, #3
    0x40002ad7890c :   cmp   x1, #2
    0x40002ad78910 :   b.eq  #0x40002ad78920
    0x40002ad78914 :   cbnz  x1, #0x40002ad7892c
    0x40002ad78918 ->  ldr   x0, [x0]
    0x40002ad7891c :   ret
    0x40002ad78920 :   eor   x0, x0, #2
    0x40002ad78924 :   ldr   x0, [x0]
    0x40002ad78928 :   ret
    0x40002ad7892c :   ...

I think we have decided that has_monitor() is false, but has_locker() is true, and we are trying to return this->locker()->displaced_header(). But we segfault instead.
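For reference, the tag-bit test in that disassembly can be modeled outside the VM. This is a toy model, not HotSpot code, and it assumes the standard pre-Lilliput 64-bit mark word encoding of the low two lock bits (0b00 = stack-locked, 0b01 = unlocked, 0b10 = inflated monitor, 0b11 = marked):

```python
# Toy model of the 'and x1, x0, #3 / cmp x1, #2 / cbnz x1' decision tree in
# displaced_mark_helper(). Assumed lock-bit encoding, per markWord.hpp:
LOCK_MASK      = 0b11
LOCKED_VALUE   = 0b00  # mark word is a pointer to a BasicLock on a stack
UNLOCKED_VALUE = 0b01
MONITOR_VALUE  = 0b10  # mark word is (ObjectMonitor* | 2)
MARKED_VALUE   = 0b11

def classify_mark_word(mark: int) -> str:
    bits = mark & LOCK_MASK
    if bits == MONITOR_VALUE:
        # 'eor x0, x0, #2; ldr x0, [x0]': strip the tag, load monitor->header()
        return "inflated-monitor"
    if bits == LOCKED_VALUE:
        # 'ldr x0, [x0]' at the faulting pc: the mark word is treated as a
        # BasicLock* and dereferenced for the displaced header. A stale mark
        # word whose low bits happen to be 00 makes this load the SIGSEGV.
        return "stack-locked"
    # unlocked or marked: the fatal("bad header=...") path
    return "neither"

assert classify_mark_word(0x0000400012345000) == "stack-locked"
assert classify_mark_word(0x0000400012345002) == "inflated-monitor"
assert classify_mark_word(0x0000000000000001) == "neither"
```

This is consistent with the observation above: the faulting load only executes on the stack-locked (low bits 00) path, so the crash dereferences whatever the mark word holds as if it were a BasicLock pointer.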
31-08-2024

On my machine, one iteration of DaCapo spring takes about 4 seconds, so the 5 iterations I am running here, with harness setup, etc., take about 20 seconds. YMMV. The first iteration takes longer because of runtime compilation, etc. The command line I show causes the DaCapo harness to call System.gc() between each iteration, but there are 130 other collections during a 5-iteration run. The crashes all seem to be before (or during?) the first call to System.gc():

    $ for i in 20240830-*/hs_err_pid*.log ; do
    >     grep -e '^Event: [0-9][0-9.]* GC heap ' -e '{Heap .* GC invocations=' "${i}" /dev/null | head -2
    > done
    20240830-115604/hs_err_pid3737279.log:Event: 5.293 GC heap after
    20240830-115604/hs_err_pid3737279.log:{Heap after GC invocations=13 (full 0):
    20240830-115758/hs_err_pid3742736.log:Event: 4.734 GC heap after
    20240830-115758/hs_err_pid3742736.log:{Heap after GC invocations=10 (full 0):
    20240830-120325/hs_err_pid3758012.log:Event: 4.999 GC heap after
    20240830-120325/hs_err_pid3758012.log:{Heap after GC invocations=11 (full 0):
    20240830-122957/hs_err_pid3817928.log:Event: 4.899 GC heap after
    20240830-122957/hs_err_pid3817928.log:{Heap after GC invocations=12 (full 0):
    20240830-125621/hs_err_pid3889398.log:Event: 5.178 GC heap after
    20240830-125621/hs_err_pid3889398.log:{Heap after GC invocations=12 (full 0):

Since all the GC event traces start with "after" events, I think that implies that the previous event finished successfully. The Java thread stack traces (not shown here) make me think the JVM has started a young collection (in the middle of doing work) rather than a call to System.gc(). That is, all the crashes might be before the System.gc() call between the first and second iteration of the spring benchmark.
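The same triage as the grep above can be scripted. A small sketch (a hypothetical helper, assuming only the standard hs_err event format visible in the output above):

```python
# Sketch: pull the GC heap events out of hs_err text to check whether the
# last recorded collection completed (an "after" event), as argued above.
import re

GC_EVENT = re.compile(r"^Event: ([0-9][0-9.]*) GC heap (before|after)",
                      re.MULTILINE)

def gc_events(hs_err_text):
    """Return (timestamp_seconds, phase) for each GC heap event, in order."""
    return [(float(t), phase) for t, phase in GC_EVENT.findall(hs_err_text)]

sample = "Event: 5.120 GC heap before\nEvent: 5.293 GC heap after\n"
assert gc_events(sample) == [(5.12, "before"), (5.293, "after")]
assert gc_events(sample)[-1][1] == "after"  # last recorded GC finished
```

Run over a directory of hs_err_pid*.log files, this reproduces the observation that every crash log ends on an "after" event at roughly 5 seconds of elapsed time.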
31-08-2024