Bug ID: JDK-8255763 C2: OSR miscompilation caused by invalid memory instruction placement

Type: Bug
Component: hotspot
Sub-Component: compiler
Affected Version: 6,7,8,9,10,11,12,13,14,15,16,17

Priority: P3
Status: Resolved
Resolution: Fixed

Submitted: 2020-11-02
Updated: 2025-02-12
Resolved: 2020-12-21

JDK 11	JDK 16	JDK 17
11.0.12-oracleFixed	16 b30Fixed	17Fixed

FAILURE:

The attached program Count.java (the most recent version, for some reason I
cannot remove the old one) is miscompiled by C2:

# Interpreted.
$ java -Xint Count.java
count: 0
count: 1
count: 2
count: 3

# Compiled with C1.
java -Xcomp -XX:CompileOnly=Count Count.java
count: 0
count: 1
count: 2
count: 3

# Compiled with C2: count jumps from 2 to 4.
java -Xcomp -XX:-TieredCompilation -XX:CompileOnly=Count Count.java
count: 0
count: 1
count: 2
count: 4

ANALYSIS:

The problem is introduced by an OSR compilation that triggers after "count: 2"
(entering right before the innermost loop). In the miscompiled program, the
load, increment, and store instructions corresponding to the "count++" statement
are wrongly moved into a loop header and end up being executed on two of the
three switch cases. The wrong placement can be seen in the attached CFG
(count-actual.pdf) generated after global code motion (some optimizations are
disabled for simplicity): nodes 113, 112, and 110 implement the "count++"
statement and are wrongly placed into block B8 (the header of loop L) instead of
the preheader block B7. This bug is due to the combination of two underlying
problems in C2.

Problem 1

The placement is wrongly allowed by the Ideal representation since 1) B7 and B8
are control equivalent (B7 dominates B8 and B8 postdominates B7) and 2) there is
no memory effect within L (and thus no phi instruction acting as a barrier in
L's header B8). Hence, global code motion wrongly considers B8 as a candidate
for placement of {113, 112, 110}. In general, there does not seem to exist any
mechanism in C2's intermediate representation to avoid moving "effectful" memory
instructions into the header block of "effectless" loops.

Problem 2

Typically, global code motion favors placing nodes outside of loops, which
explains why Problem 1 does not often cause miscompilations in practice.
However, in the OSR compilation of Count.java, B8 has a lower estimated
execution frequency than B7 (as reflected in the red-blue color scale in
count-actual.pdf), which leads to the placement of {113, 112, 110} into B8.
Since B8 is the single successor of B7, it should have at least the same
estimated execution frequency.

The misestimation is due to the assumption in CFGLoop::compute_freq() that the
underlying CFG is reducible, which is not the case for the OSR compilation of
Count.java. More specifically, compute_freq() computes the frequencies of each
block within a loop L in a single forward pass (in reverse DFS postorder) over
the members (blocks and loops) of L [1], assuming that the members form an
acyclic graph (except for L's back-edges). This assumption does not hold for
irreducible graphs, where there might be additional cycles corresponding to
non-natural loops. The result in this case is that the frequencies of {B8, B9,
B10, B11, B12, B13, B14, B16, B18, B20} are updated "too early" (before the
frequency of B7 is updated with that of B22) and never revisited.

A possible solution to Problem 2 is to transform the single forward pass into a
proper system of equations to be solved iteratively (see e.g. [2]), as in the
following Python/SymPy snippet:

from sympy import *
b3, l3, b5, b6, b7, l2, b14, b16, b18, b20, b21, b22, b23, b25 = symbols(['b3', 'l3', 'b5', 'b6', 'b7', 'l2', 'b14', 'b16', 'b18', 'b20', 'b21', 'b22', 'b23', 'b25'])
system = [
    Eq(b3,  1),
    Eq(l3,  b3 * 1),
    Eq(b5,  l3 * 1),
    Eq(b6,  b5 * 0),
    Eq(b7,  b6 * 1 + b22 * 1),
    Eq(l2,  b7 * 1),
    Eq(b14, l2 * 0.666),
    Eq(b16, b14 * 1),
    Eq(b18, b16 * 1),
    Eq(b20, b18 * 1),
    Eq(b21, b25 * 1 + b20 * 1),
    Eq(b22, b21 * 0.02),
    Eq(b23, b21 * 0.98),
    Eq(b25, b5 * 1),
    ]
soln = solve(system, [b3, l3, b5, b6, b7, l2, b14, b16, b18, b20, b21, b22, b23, b25])
print(soln)
{b3: 1.00000000000000, l3: 1.00000000000000, b5: 1.00000000000000, b6: 0.0, b7: 0.0202699963514007, l2: 0.0202699963514007, b14: 0.0134998175700328, b16: 0.0134998175700328, b18: 0.0134998175700328, b20: 0.0134998175700328, b21: 1.01349981757003, b22: 0.0202699963514007, b23: 0.993229821218632, b25: 1.00000000000000}

Using these frequencies instead of those computed by C2 results in global code
motion placing 'count++' into B7 (see count-expected.pdf), and the expected
output being produced by Count.java.

An alternative solution could be to simply ensure that all CFGs are reducible by
applying node-splitting before frequency computation (perhaps other
optimizations could also benefit from that?).

[1] https://github.com/openjdk/jdk/blob/4a267f1bc2b025aae2cb9df7283156aeb0282406/src/hotspot/share/opto/gcm.cpp#L1790

[2] https://dl.acm.org/doi/10.1145/178243.178251 (Section 5).

ORIGINAL REPORT FROM FUZZER TEST:

The attached fuzzer test produces a different result for C2 compared to C1/interpreter.

To reproduce:
$ java -Xint Test.java > Xint.log
$ java -Xmx1G -Xcomp -Xbatch -XX:-TieredCompilation -XX:CompileOnly=Test Test.java > c2.log

$ diff Xint.log c2.log
13c13
< i21 i22 i23 = -4427127,2,66
---
> i21 i22 i23 = -50828706,2,66
17,18c17,18
< Test.iFld byFld Test.fFld = -4427127,-77,1252881271
< Test.iFld1 Test.bArrFld iArrFld = -22136,13534,-43010
---
> Test.iFld byFld Test.fFld = -76243092,-77,1252384471
> Test.iFld1 Test.bArrFld iArrFld = -63536,13534,-43010
28c28
< Test.iFld byFld Test.fFld = -1140077,66,1258518506
---
> Test.iFld byFld Test.fFld = -1140077,66,1258249012
35c35
< i21 i22 i23 = -4428333,2,66
---
> i21 i22 i23 = -50833530,2,66
39,40c39,40
< Test.iFld byFld Test.fFld = -4428333,-77,1261269881
< Test.iFld1 Test.bArrFld iArrFld = -22142,13534,-86020
---
> Test.iFld byFld Test.fFld = -76250328,-77,1260607481
> Test.iFld1 Test.bArrFld iArrFld = -63542,13534,-86020
50c50
< Test.iFld byFld Test.fFld = -1138674,66,1264202151
---
> Test.iFld byFld Test.fFld = -1138674,66,1263539751
57c57
< i21 i22 i23 = -4440730,2,66
---
> i21 i22 i23 = -50827918,2,66
61,62c61,62
< Test.iFld byFld Test.fFld = -4440730,-77,1266817300
< Test.iFld1 Test.bArrFld iArrFld = -22204,13534,-129030
---
> Test.iFld byFld Test.fFld = -76241910,-77,1265877126
> Test.iFld1 Test.bArrFld iArrFld = -63535,13534,-129030
64,65c64,65
< iMeth_check_sum: -5029668300147928
< vMeth_check_sum: -607986478366232936
---
> iMeth_check_sum: -5029668300160304
> vMeth_check_sum: -607986478366244874
68c68
< i21 i22 i23 = -1149871,2,-77
---
> i21 i22 i23 = -1151077,2,-77
71,73c71,73
< Test.instanceCount Test.dFld Test.bFld = 6,-4591997026765212379,0
< Test.iFld byFld Test.fFld = -1149871,66,1268283435
< Test.iFld1 Test.bArrFld iArrFld = -5749,13534,-158758
---
> Test.instanceCount Test.dFld Test.bFld = 0,-4591997026765212379,0
> Test.iFld byFld Test.fFld = -1151077,66,1267744602
> Test.iFld1 Test.bArrFld iArrFld = -5755,13534,-158758
75,76c75,76
< iMeth_check_sum: -5867946350171353
< vMeth_check_sum: 8514054478760838568
---
> iMeth_check_sum: -5867946350186585
> vMeth_check_sum: 8514054478760823836
83c83
< Test.iFld byFld Test.fFld = -4438334,-77,1269659754
---
> Test.iFld byFld Test.fFld = -4438334,-77,1269120922
86,87c86,87
< iMeth_check_sum: -6706224400191446
< vMeth_check_sum: -810648637821638496
---
> iMeth_check_sum: -6706224400206202
> vMeth_check_sum: -810648637821652847
94c94
< Test.iFld byFld Test.fFld = -1150675,66,1271125889
---
> Test.iFld byFld Test.fFld = -1150675,66,1270587057
97,98c97,98
< iMeth_check_sum: -7544502450215823
< vMeth_check_sum: 8311392319305431865
---
> iMeth_check_sum: -7544502450230579
> vMeth_check_sum: 8311392319305417514
105c105
< Test.iFld byFld Test.fFld = -4440730,-77,1272502208
---
> Test.iFld byFld Test.fFld = -4440730,-77,1271963376
108,109c108,109
< iMeth_check_sum: -8382780500236868
< vMeth_check_sum: -1013310797277046215
---
> iMeth_check_sum: -8382780500251624
> vMeth_check_sum: -1013310797277060566

11u Fix Request Backporting this fix eliminates a bug that causes some code to execute incorrectly. Patch applies cleanly to 11u. New test fails without the patch, and passes with it. tier1 tests pass with the patch.

19-03-2021

Changeset: 4e8338eb Author: Roberto Castañeda Lozano <rcastanedalo@openjdk.org> Committer: Christian Hagedorn <chagedorn@openjdk.org> Date: 2020-12-21 13:04:24 +0000 URL: https://git.openjdk.java.net/jdk16/commit/4e8338eb

21-12-2020

ILW = Incorrect execution of C2 compiled code, only fails with single Java Fuzzer test (probably edge case) and not a recent regression, disable compilation of affected method = HLM = P3

02-11-2020

Relates :	JDK-8258894 - C2: Forbid GCM to move stores into loops
Relates :	JDK-8349930 - C2: verify memory liveness invariants after scheduling
Relates :	JDK-8258895 - C2: Refine execution frequency estimation for irreducible CFGs
Relates :	JDK-8259061 - C2: assert(found) failed: memory-writing node is not placed in its original loop or an ancestor of it