JDK-8289127 : Apache Lucene triggers: DEBUG MESSAGE: duplicated predicate failed which is impossible
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 17, 18, 19
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • OS: linux
  • CPU: x86_64
  • Submitted: 2022-06-24
  • Updated: 2022-08-04
  • Resolved: 2022-07-18
Fix Versions:
  • JDK 17: 17.0.5-oracle (Fixed)
  • JDK 19: 19 b32 (Fixed)
  • JDK 20: 20 (Fixed)
Description
When running Apache Lucene tests on the branch to introduce memory segment support in Apache Lucene (for Java 19 with --enable-preview), we have seen the following error:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (macroAssembler_x86.cpp:845), pid=3260852, tid=3260993
#  fatal error: DEBUG MESSAGE: duplicated predicate failed which is impossible
#
# JRE version: OpenJDK Runtime Environment (19.0+27) (build 19-ea+27-2074)
# Java VM: OpenJDK 64-Bit Server VM (19-ea+27-2074, mixed mode, sharing, tiered, compressed class ptrs, serial gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xb6e5b1]  MacroAssembler::debug64(char*, long, long*)+0x41
#

The Lucene branch used is this one:
https://github.com/uschindler/lucene/tree/draft/jdk-foreign-mmap-jdk19

When digging into the hs_err file, the following correlation to Project Panama seems to occur:

While hitting this error, it looks like the JVM is compiling org.apache.lucene.codecs.lucene90.ForUtil::decode14 (code:
https://github.com/uschindler/lucene/blob/draft/jdk-foreign-mmap-jdk19/lucene/core/src/java/org/apache/lucene/codecs/lucene90/ForUtil.java#L726-L740).
This calls DataInput::readLong, implemented by MemorySegmentIndexInput (https://github.com/uschindler/lucene/blob/draft/jdk-foreign-mmap-jdk19/lucene/core/src/java19/org/apache/lucene/store/MemorySegmentIndexInput.java#L182-L194).

In standard Lucene builds we have not seen this error, so it could indeed be related to Project Panama.
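For context, the method that triggers the crash unpacks bit-packed 14-bit values from longs. The sketch below is a hypothetical reconstruction of that code shape (it is not the actual Lucene source; class and method names are made up for illustration): a counted loop in which int shift arithmetic selects bits out of a long[], which is exactly the mix of int shifts feeding long arithmetic that C2's predicate handling trips over here.

```java
// Hypothetical sketch of a decode14-style bit-unpacking loop (not the
// actual Lucene ForUtil source). Each value occupies 14 bits; int
// arithmetic computes the source word and shift, and a value that
// straddles two longs is assembled from both.
public class Decode14Sketch {
    static final long MASK14 = (1L << 14) - 1;

    // Unpack `count` 14-bit values starting at bit 0 of packed[].
    static long[] decode14(long[] packed, int count) {
        long[] out = new long[count];
        for (int i = 0; i < count; i++) {
            int bit = i * 14;          // absolute bit offset (int math)
            int word = bit >>> 6;      // which source long
            int shift = bit & 63;      // bit offset inside that long
            long v = packed[word] >>> shift;
            if (shift > 64 - 14) {     // value straddles two longs
                v |= packed[word + 1] << (64 - shift);
            }
            out[i] = v & MASK14;
        }
        return out;
    }
}
```

The point of the sketch is the index/shift computation in int arithmetic inside a counted loop; this is the kind of node pattern the loop-predicate code in C2 has to follow.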
Comments
A pull request was submitted for review. URL: https://git.openjdk.org/jdk17u-dev/pull/579 Date: 2022-07-21 15:50:26 +0000
21-07-2022

17u backport request: I'd like to backport this to 17u. It applies cleanly. The bug does not reproduce with 17u out of the box, but I could get it to reproduce by tweaking C2. It's quite possible this would never happen in the wild, but it's a nasty bug that's hard to diagnose and causes crashes. The fix is low risk. Tested with tier1.
21-07-2022

Thanks for correcting me on the fact that it affects all builds. You're obviously right.
20-07-2022

> First, this bug only affects fastdebug builds. A release build wouldn't be affected.

We have seen the bug in standard builds downloaded from https://jdk.java.net/19/ - maybe there were some builds which had this assertion enabled, although I assumed that the EA builds are standard production builds! The exact version we used was posted above; see the hs_err.pid at the end, e.g. https://bugs.openjdk.org/secure/attachment/99766/hs_err_pid3047355.log

But anyway, thanks for backporting! Uwe
20-07-2022

First, this bug only affects fastdebug builds. A release build wouldn't be affected. The bug doesn't reproduce on 17 because there are other changes in 19 that affect loop optimizations and are needed for the bug to occur with this particular test case. So the bug is there in 17, doesn't reproduce with this test case but could possibly happen with another code shape. I think it's a nasty bug and it's better to have it backported anyway.
19-07-2022

[~uschindler] it doesn't seem to reproduce with 17 but could be by chance. Let me take a closer look.
18-07-2022

Changeset: 4f3f74c1 Author: Roland Westrelin <roland@openjdk.org> Date: 2022-07-18 07:08:49 +0000 URL: https://git.openjdk.org/jdk19/commit/4f3f74c14121d0a80f0dcf1d593b4cf1c3e4a64c
18-07-2022

[~uschindler], yes I tried 17.0.5, but please give it a try yourself. You might have better luck.
18-07-2022

Dean, I'd suggest leaving it up to Roland to clarify whether this is a regression introduced by JDK-8286625 or was caused by something else. JDK-8286625 was backported to 17.0.5 and 17.0.5-oracle (see JIRA), so it is important to check. I will try to reproduce it with 17.x, but I am a bit busy at the moment. Uwe
18-07-2022

Did you test 17.0.5? JDK-8286625 was backported to that branch, so 17.0.4 should not have the problem. I just want to make sure our users aren't affected by this bug if they upgrade. If it is unrelated and not a regression introduced by JDK-8286625, I don't care.
16-07-2022

[~uschindler], I haven't been able to reproduce this with jdk17. I would think the fix would be safe to backport, however.
16-07-2022

Thanks. Do we need to backport this to JDK 17? It looks like the change JDK-8286625, from the commit of Wed Jun 8 06:35:28 2022 +0000 (mentioned above by Dean), was backported to jdk-17. Or was this caused by something else?
15-07-2022

A pull request was submitted for review. URL: https://git.openjdk.org/jdk19/pull/143 Date: 2022-07-15 12:30:56 +0000
15-07-2022

Fix is this:

diff --git a/src/hotspot/share/opto/loopTransform.cpp b/src/hotspot/share/opto/loopTransform.cpp
index b3a7594ec88..739e92fb343 100644
--- a/src/hotspot/share/opto/loopTransform.cpp
+++ b/src/hotspot/share/opto/loopTransform.cpp
@@ -1390,6 +1390,7 @@ static bool skeleton_follow_inputs(Node* n, int op) {
          op == Op_OrL ||
          op == Op_RShiftL ||
          op == Op_LShiftL ||
+         op == Op_LShiftI ||
          op == Op_AddL ||
          op == Op_AddI ||
          op == Op_MulL ||

A predicate is not updated when copied above the main loop.
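The added Op_LShiftI case matters for loops where an int left shift feeds the index expression that a range-check predicate guards. A minimal, hypothetical Java shape (names are invented for illustration, not taken from Lucene) whose index math lowers to an LShiftI node that C2 must follow when copying predicates above the main loop:

```java
// Hypothetical loop shape: the array index contains an int left shift
// (i << 1), so the range-check predicate's input chain includes an
// LShiftI node -- the operation the fix adds to skeleton_follow_inputs.
public class ShiftIndexShape {
    // Sums a[0], a[2], a[4], ... a[2*(n-1)].
    static long sumEveryOther(long[] a, int n) {
        long s = 0;
        for (int i = 0; i < n; i++) {
            s += a[i << 1];   // index expression with an int shift
        }
        return s;
    }
}
```

Without following LShiftI, the predicate copied above the main loop could fail to be updated consistently, which is what the duplicated-predicate assertion caught.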
12-07-2022

When I bisect the failure, the changeset that seems to introduce the problem is this one:

Author: Roland Westrelin <roland@openjdk.org>
Date: Wed Jun 8 06:35:28 2022 +0000
8286625: C2 fails with assert(!n->is_Store() && !n->is_LoadStore()) failed: no node with a side effect
Reviewed-by: thartmann, chagedorn

[~roland], please take a look at this.
09-07-2022

That last command-line did the trick. I was able to reproduce it. Thanks.
05-07-2022

Hi again, when looking at the hs_err log files, it seems to always happen in Lucene's join module, in exactly one test execution. So it is better to run the tests of the join module rather than the Lucene core library (the join module adds more complexity, possibly making it fail earlier):

$ cd /path/to/lucene-checkout
$ export JAVA_HOME=/path/to/jdk17
$ export RUNTIME_JAVA_HOME=/path/to/jdk19-ea+27
$ ./gradlew :lucene:join:test -Ptests.multiplier=3

This may be annoying to run and it does a lot of stuff. As all failures were in one test and one test method, it is better to just run that one over and over, with additional data. Sorry for not giving that information before. Our Jenkins server already uses 3 times more random data during tests (parameter "-Ptests.multiplier=3"). To reproduce it more easily (it still takes some time), you may use the following command line, which only runs the single test that always fails, in a loop. The Gradle task is called "beast" for doing "beasting", named after the famous "beast" machine with 128 cores by Mike McCandless:

$ ./gradlew :lucene:join:beast -Ptests.multiplier=6 -Ptests.dups=100 -Ptests.jvmargs="-XX:CompileCommand=DumpReplay,org.apache.lucene.codecs.lucene90.ForUtil::decode14" --tests TestBlockJoin.testMultiChildQueriesOfDiffParentLevels

This also shows how to pass extra JVM command-line args. The above command line will do the following:
- Run the test "TestBlockJoin.testMultiChildQueriesOfDiffParentLevels" 100 times, each in a separate JVM
- Use a test multiplier of 6: this uses 6 times more random indexing/query data to add computational complexity to the test
- Pass the compile command to get a replay file

The replay and hs_err files will be in: ./lucene/join/build/tmp/tests-cwd

This command failed for me on the second iteration of 100, after about 2 minutes. I was doing this on my local Windows laptop, so it also affects Intel CPUs (the Jenkins server was an AMD Ryzen) and Windows.
02-07-2022

You need to run it in a loop, there are possibilities to do this also in an automated way. We had a similar problem with another bug, but Tobias Hartmann was able to reproduce and he works on a fix, see JDK-8285835 for details.
02-07-2022

I have tried both the jdk-foreign-mmap-jdk19 branch and non-Panama main branch with jdk19+27, and I can't reproduce the crash.
02-07-2022

I attached 3 files:
- hs_err_pid3047355.log
- replay_pid3047355_compid7571.log
- replay_pid3047355_compid7604.log

Hope this helps to reproduce or understand. To reproduce, just run Lucene's Gradle build with JAVA_HOME set to a JDK 17 (that's for bootstrapping the build, because unfortunately Gradle does not run with later versions) and RUNTIME_JAVA_HOME set to a JDK 19 instance. It will then run the test suite with RUNTIME_JAVA_HOME:

$ gradlew :lucene:core:test

If you repeat this often enough, it will crash as above.
01-07-2022

Hi, I added this to the Jenkins config file. Once another crash occurs I will upload the file. I will also do some local testing.
30-06-2022

My mistake, it's crashing while executing the compiled code, not while compiling the method. You can try "-XX:CompileCommand=DumpReplay,org.apache.lucene.codecs.lucene90.ForUtil::decode14" to capture a replay file for successful compiles.
29-06-2022

Hi, it did not create a replay file by default, only hs_err. What command-line parameters should I add to give you more insight? The error happens frequently (in roughly 50% of all Lucene builds with 19-ea+27), so it is easy to reproduce.
29-06-2022

[~uschindler], do you have any replay file from the crashes?
28-06-2022

ILW = crash; with Lucene; disable compilation of affected method = HMM = P2
28-06-2022

Always happens when compiling this method: org.apache.lucene.codecs.lucene90.ForUtil::decode14 See: https://github.com/apache/lucene/blob/3e74ebbc0d4b5ed1b1847d8ada9157586d774307/lucene/core/src/java/org/apache/lucene/codecs/lucene90/ForUtil.java#L726-L740 All other methods in this class (they are generated by a script) are not (yet) affected.
28-06-2022

It looks like this was introduced after JDK 19 build 24; it is now triggered regularly in Apache Lucene test runs with build 27.
28-06-2022

Hi, it also happens with JDK 19 preview builds on the non-Panama branches of Apache Lucene: https://jenkins.thetaphi.de/job/Lucene-main-Linux/35387/ I uploaded more hs_err files.
28-06-2022