JDK-8270090 : C2: LCM may prioritize CheckCastPP nodes over projections
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 11.0.13,17,18,19
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • CPU: arm
  • Submitted: 2021-07-08
  • Updated: 2022-07-03
  • Resolved: 2022-04-11
  • JDK 11: fixed in 11.0.17-oracle
  • JDK 17: fixed in 17.0.5-oracle
  • JDK 19: fixed in 19 b18
Description
LCM should prioritize projections over other nodes when selecting from the ready list, to ensure that nodes are directly followed by their projections in local schedules.

Currently, LCM gives equal priority to projections, constants, CreateEx, and CheckCastPP nodes (see PhaseCFG::select()), effectively relying on the order in which these nodes appear in the ready list for tie-breaking.
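
The tie-breaking problem can be illustrated with a small model. This is a minimal sketch in Python, not HotSpot code: the `Node` class, the kind names, and both selection functions are hypothetical simplifications, and `select_projections_first` is only one possible way to remove the dependence on ready-list order, not necessarily what the fix does. The sketch also ignores other scheduling constraints, such as CreateEx having to start its block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    idx: int   # node number, as in the schedules shown in this report
    kind: str  # simplified node kind (assumed names), e.g. "MachProj"


# Kinds described above as sharing the same top priority in select().
ALWAYS_WIN = {"MachProj", "Con", "CreateEx", "CheckCastPP"}


def select_equal_priority(ready):
    """Model of the described behavior: first top-priority node in list order wins."""
    for node in ready:
        if node.kind in ALWAYS_WIN:
            return node
    return ready[0]  # fallback for this model only; real LCM applies more heuristics


def select_projections_first(ready):
    """Hypothetical variant: any ready projection beats the other top-priority kinds."""
    for node in ready:
        if node.kind == "MachProj":
            return node
    return select_equal_priority(ready)


# Assume 461 membar_storestore and 460 MachProj are already scheduled,
# and these two nodes are in the ready list in this order.
ready = [Node(504, "CheckCastPP"), Node(498, "MachProj")]
print(select_equal_priority(ready).idx)     # 504
print(select_projections_first(ready).idx)  # 498
```

With equal priority, the outcome depends only on which node happens to come first in the ready list; with projections first, the order no longer matters for this case.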

On ARM32, this leads to the assertion failure reported below, where a CheckCastPP node is scheduled between a node and its projection. Even though this assertion failure has only shown up on ARM32, it could potentially happen on other platforms if the order of insertion in the ready list is altered.

An example subgraph for x86-64 that could potentially suffer from the same issue is attached. Depending on the node order within the ready list, the following local schedule could be produced:

461 membar_storestore
460 MachProj
504 checkCastPP
498 MachProj

This local schedule is problematic because not all projections of 461 membar_storestore (460 MachProj, 498 MachProj) are scheduled directly after 461: 504 checkCastPP separates 498 MachProj from its parent and its sibling projection.
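
The invariant violated here is the one checked by the assertion in the original report: the predecessor of a projection must be its parent or a sibling projection. A minimal sketch of that check follows, in Python rather than HotSpot C++; the function name and the `parent_of` mapping are hypothetical and exist only for illustration.

```python
def projections_follow_parents(schedule, parent_of):
    """Check a local schedule against the projection-placement invariant.

    schedule:  node numbers in scheduled order.
    parent_of: maps each projection's node number to its parent's node number;
               nodes absent from the mapping are not projections.
    """
    for j, node in enumerate(schedule):
        if node not in parent_of:
            continue  # not a projection, nothing to check
        if j == 0:
            return False  # a projection cannot open the schedule
        parent = parent_of[node]
        pred = schedule[j - 1]
        # Mirrors: pred == parent || (pred->is_Proj() && pred->in(0) == parent)
        if not (pred == parent or parent_of.get(pred) == parent):
            return False
    return True


parent_of = {460: 461, 498: 461}  # both MachProj nodes project from 461
print(projections_follow_parents([461, 460, 504, 498], parent_of))  # False
print(projections_follow_parents([461, 460, 498, 504], parent_of))  # True
```

The first call uses the problematic schedule above; the second uses the same nodes with 504 checkCastPP moved after both projections.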

ORIGINAL REPORT:

As we're part of the OpenJDK Quality Outreach program, I recently started running the JaCoCo builds on a fastdebug build of OpenJDK 18-ea and observed the following crash during execution of the beanshell-maven-plugin task:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/workspace/src/hotspot/share/opto/block.cpp:1281), pid=7, tid=2051
#  assert(pred == parent || (pred->is_Proj() && pred->in(0) == parent)) failed: projections must follow their parents or other sibling projections
#
# JRE version: OpenJDK Runtime Environment (18.0) (fastdebug build 18-internal+0-adhoc..workspace)
# Java VM: OpenJDK Server VM (fastdebug 18-internal+0-adhoc..workspace, mixed mode, sharing, g1 gc, linux-arm)
# Problematic frame:
# V  [libjvm.so+0x22e72a]  PhaseCFG::verify() const+0x4e9
#
# Core dump will be written. Default location: /workspace/core
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#

The problem occurs reproducibly on our ARM32 build environment.
The JDK is based on commit 4fbcce119b1736455cb74d0a585097eca617593c


Comments
Fix request [11u] I backport this for parity with 11.0.17-oracle. Typical risk of a C2 fix. Needs follow-up 8285820. Clean backport except for Copyright. SAP nightly testing passed.
29-06-2022

A pull request was submitted for review. URL: https://git.openjdk.org/jdk11u-dev/pull/1178 Date: 2022-06-24 09:40:32 +0000
24-06-2022

A pull request was submitted for review. URL: https://git.openjdk.org/jdk18u/pull/167 Date: 2022-06-21 08:00:29 +0000
21-06-2022

A pull request was submitted for review. URL: https://git.openjdk.org/jdk18u/pull/153 Date: 2022-06-20 13:40:23 +0000
20-06-2022

Fix request [17u] I backport this for parity with 17.0.5-oracle. Needs a follow-up. Clean backport. SAP nightly testing passed.
09-06-2022

A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk17u-dev/pull/441 Date: 2022-06-08 10:19:58 +0000
08-06-2022

Changeset: 8ebea443
Author: Roberto Castañeda Lozano <rcastanedalo@openjdk.org>
Date: 2022-04-11 06:37:57 +0000
URL: https://git.openjdk.java.net/jdk/commit/8ebea443f333ecf79d6b0fc725ededb231e83ed5
11-04-2022

It would be great if an ARM32 tester/maintainer could extract and contribute a small regression test case out of the original failure.
28-03-2022

A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk/pull/7988 Date: 2022-03-28 10:18:31 +0000
28-03-2022

Thanks [~marchof], a patch is now submitted for review.
28-03-2022

Yes, it should be triggered automatically. Indeed, you can see your latest commit ID here: https://pici.beachhub.io/#/JDK-8270090-jacoco/20220325-131740
25-03-2022

Hi again [~marchof], I have pushed a new version of the patch to https://github.com/robcasloz/jdk/tree/JDK-8270090, and the build on ARM32 seems to succeed according to https://pici.beachhub.io/#/JDK-8270090. I am unsure whether the JaCoCo build in https://pici.beachhub.io/#/JDK-8270090-jacoco is triggered automatically. If not, could you please try another run?
25-03-2022

Great, thanks! I will assign this issue to myself and submit a PR with a more refined and better-tested version of the patch. This issue seems to only show up for ARM32 but it could potentially affect other platforms as well in the future, as LCM is currently relying on the node order within the ready list for correctness.
10-03-2022

I kicked off another build of my original scenario: It's green :) Looks like your fix works. Thanks [~rcastanedalo]!
09-03-2022

[~rcastanedalo] The original problem seems to be fixed, and I cannot see any JVM crashes. But the test subject (JaCoCo build) is *extremely* slow with the JDK built from your branch, and the build fails at a later point in time (probably due to some race condition). Maybe due to the additional logging? Here are the CI builds:
JDK build: https://pici.beachhub.io/#/JDK-8270090
JaCoCo build: https://pici.beachhub.io/#/JDK-8270090-jacoco
09-03-2022

[~rcastanedalo] You're right, the debug build on master crashes relatively early but still takes a long time. So it is probably due to the debug build.
09-03-2022

[~marchof] Great, thanks for trying out the branch! The additional logging in the branch is only emitted if an assertion fails, which does not seem to happen here (unless I am missing something in the JaCoCo build output log). Couldn't the slowness be due to the fact that it is a JDK debug version running on relatively slow hardware? I will do some performance regression testing of the branch on other systems, but I do not expect major regressions.
09-03-2022

Hi again, I pushed a new tentative fix to https://github.com/robcasloz/jdk/tree/JDK-8270090 (commit dd868a1). The JDK seems to build fine this time according to https://pici.beachhub.io/#/JDK-8270090/20220308-105241. [~marchof] could you please see if it addresses the original failure?
08-03-2022

Hi [~marchof], thanks for trying out the branch and for setting up the CI build, I will investigate further.
07-03-2022

Hi [~rcastanedalo], here is an ARM32 CI build for your branch if this helps: https://pici.beachhub.io/#/JDK-8270090
03-03-2022

Hi [~rcastanedalo], I tried to build your branch (commit 38fdaa9) on ARM32. The JDK build itself fails with the following error:
* For target jdk__optimize_image_exec:
# To suppress the following error report, specify this argument
# after -XX: or in .hotspotrc: SuppressErrorAt=/block.cpp:1236
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/workspace/src/hotspot/share/opto/block.cpp:1236), pid=16763, tid=16783
#  assert(j == 1 || block->get_node(j-1)->is_Phi()) failed: CreateEx must be first instruction in block
#
# JRE version: OpenJDK Runtime Environment (19.0) (fastdebug build 19-internal-adhoc..workspace)
# Java VM: OpenJDK Server VM (fastdebug 19-internal-adhoc..workspace, mixed mode, g1 gc, linux-arm)
# Problematic frame:
# V  [libjvm.so+0x2652b2]  PhaseCFG::verify() const+0x631
#
# Core dump will be written. Default location: /workspace/make/core
#
03-03-2022

As the original reporter, I will test the branch on ARM32. Please give me until the end of this week.
02-03-2022

Hi [~shade], based on the information you shared in your comment (2021-07-09 15:21), it seems that the assertion is failing because there is a "wild" projection node ("n") that is not scheduled right after its parent node ("parent"). JDK-8263227 added this assertion and actually relies on projection nodes being scheduled right after their parent nodes. I do not have access to an ARM32 system, so I cannot find out why "n" is not scheduled right after "parent". Looking at the source code in lcm.cpp, however, I realize now that CheckCastPP nodes (such as "pred" in this failure) might actually be selected for local scheduling earlier than projection nodes, and that *might* be the root cause of this failure. I was probably misled by the comments "Projections always win" and "Projections take top priority for correctness reasons" in lcm.cpp when I worked on JDK-8263227. [~shade] (or anyone with access to an ARM32 system): if you are still able to reproduce the failure, could you please see if the (tentative, not thoroughly tested) fix in https://github.com/robcasloz/jdk/tree/JDK-8270090 prevents it?
01-03-2022

Unfortunately, I'm not able to reproduce this on x86, even with -XX:+OptoScheduling (which is the default on ARM), and I don't have a 32-bit ARM system for testing. I'm unassigning this and deferring for now. [~shade] and other 32-bit ARM testers/maintainers, feel free to pick this up and re-target.
16-11-2021

Sorry for not following this up sooner. Please find the replay log attached now.
28-10-2021

Usually, only the classes corresponding to the compiled method (in this case "org.apache.tools.ant.types.Path::translatePath") and the inlined methods are required. Please share the replay file if you have one and I can give it a try as well.
16-08-2021

> You just need to make sure that all required classes are on the classpath.
That will be tough, it is actually a Maven build which is crashing. I don't think I can construct a class path for this.
13-08-2021

When the JVM crashes, it should print something like:
# An error report file with more information is saved as:
# [...]/hs_err_pid469621.log
[...]
#
# Compiler replay data is saved as:
# [...]/replay_pid469621.log
The replay_pid file can then be used to replay the compilation (and hopefully reproduce the crash):
java -XX:+ReplayIgnoreInitErrors -XX:+ReplayCompiles -XX:ReplayDataFile=replay_pid469621.log
You just need to make sure that all required classes are on the classpath. If that does not reproduce the issue, you can try:
java -XX:+ReplayIgnoreInitErrors -XX:+ReplayCompiles -XX:ReplayDataFile=replay_pid469621.log -XX:RepeatCompilation=1000 -XX:+StressIGVN -XX:+StressGCM -XX:+StressLCM
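
For repeated attempts, the two invocations above can be assembled by a small helper. This is only a convenience sketch: the function name and its parameters are hypothetical, while the `-XX` flags are exactly the ones quoted in this comment and `-cp` is the standard `java` classpath option.

```python
def replay_command(replay_file, classpath=None, stress=False, repeat=1000):
    """Build the java argument list for replaying a C2 compilation."""
    cmd = ["java"]
    if classpath:
        # All classes needed by the compiled method must be reachable.
        cmd += ["-cp", classpath]
    cmd += [
        "-XX:+ReplayIgnoreInitErrors",
        "-XX:+ReplayCompiles",
        f"-XX:ReplayDataFile={replay_file}",
    ]
    if stress:
        # Second variant from the comment: repeat with randomized C2 phases.
        cmd += [
            f"-XX:RepeatCompilation={repeat}",
            "-XX:+StressIGVN",
            "-XX:+StressGCM",
            "-XX:+StressLCM",
        ]
    return cmd


print(" ".join(replay_command("replay_pid469621.log")))
print(" ".join(replay_command("replay_pid469621.log", stress=True)))
```

The returned list can be passed directly to `subprocess.run` on a machine where a matching fastdebug `java` is on the PATH.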
13-08-2021

I can try with replay compilation. Do you have any pointers on how to enable this? Thx.
13-08-2021

[~shade], FYI, [~rcastanedalo] is on parental leave until the end of the year. Does this only reproduce on 32-bit ARM? Does it reproduce with replay compilation? And if so, could you please share the replay file?
13-08-2021

As JDK-8263227 was backported to Java 11 I tried to reproduce it with a fastdebug build of jdk11u-dev (commit id 93f952c95b1db5b7226b5255b61caa539225f3e2) but was not able to reproduce the problem.
15-07-2021

ILW = MMH = P3
12-07-2021

I was able to reproduce it, very very slowly. The "parent" and "pred" nodes in that assert are:
parent: 1167 safePoint_poll === 1171 0 2083 0 0 1168 112 0 134 227 1517 0 0 [[ 1169 1170 1164 2112 ]] !jvms: String::charAt @ bci:1 (line 1511) StringTokenizer::setMaxDelimCodePoint @ bci:38 (line 150) StringTokenizer::<init> @ bci:48 (line 200) PathTokenizer::<init> @ bci:34 (line 67) Path::translatePath @ bci:22 (line 397)
pred: 2112 checkCastPP === 1167 227 [[ 1166 2129 2119 ]] org/apache/tools/ant/PathTokenizer:NotNull * Oop:org/apache/tools/ant/PathTokenizer:NotNull * !jvms: StringTokenizer::<init> @ bci:39 (line 198) PathTokenizer::<init> @ bci:34 (line 67) Path::translatePath @ bci:22 (line 397)
[~rcastanedalo], any clues?
09-07-2021

1) Clone https://github.com/jacoco/jacoco
2) mvn clean verify -B -Dbytecode.version=17
09-07-2021

Marc, do you have a reproducer? E.g. what project do you build and which Maven invocation?
09-07-2021

That assert was added by JDK-8263227. It might not be a reason for failure, though.
09-07-2021

With JDK release builds (17+18) my test builds succeed without any issues.
08-07-2021

Same problem can be reproduced with Java 17-ea (based on commit 168af2e6b2343d6674fa053dcb09aca028e372bf):
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/workspace/src/hotspot/share/opto/block.cpp:1281), pid=6, tid=2157
#  assert(pred == parent || (pred->is_Proj() && pred->in(0) == parent)) failed: projections must follow their parents or other sibling projections
#
# JRE version: OpenJDK Runtime Environment (17.0) (fastdebug build 17-internal+0-adhoc..workspace)
# Java VM: OpenJDK Server VM (fastdebug 17-internal+0-adhoc..workspace, mixed mode, sharing, g1 gc, linux-arm)
# Problematic frame:
# V  [libjvm.so+0x23b1ea]  PhaseCFG::verify() const+0x4e9
#
# Core dump will be written. Default location: /workspace/core
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
08-07-2021