Relates :
|
|
Relates :
|
|
Relates :
|
|
Relates :
|
FAILURE: The attached program Count.java (the most recent version, for some reason I cannot remove the old one) is miscompiled by C2: # Interpreted. $ java -Xint Count.java count: 0 count: 1 count: 2 count: 3 # Compiled with C1. java -Xcomp -XX:CompileOnly=Count Count.java count: 0 count: 1 count: 2 count: 3 # Compiled with C2: count jumps from 2 to 4. java -Xcomp -XX:-TieredCompilation -XX:CompileOnly=Count Count.java count: 0 count: 1 count: 2 count: 4 ANALYSIS: The problem is introduced by an OSR compilation that triggers after "count: 2" (entering right before the innermost loop). In the miscompiled program, the load, increment, and store instructions corresponding to the "count++" statement are wrongly moved into a loop header and end up being executed on two of the three switch cases. The wrong placement can be seen in the attached CFG (count-actual.pdf) generated after global code motion (some optimizations are disabled for simplicity): nodes 113, 112, and 110 implement the "count++" statement and are wrongly placed into block B8 (the header of loop L) instead of the preheader block B7. This bug is due to the combination of two underlying problems in C2. Problem 1 The placement is wrongly allowed by the Ideal representation since 1) B7 and B8 are control equivalent (B7 dominates B8 and B8 postdominates B7) and 2) there is no memory effect within L (and thus no phi instruction acting as a barrier in L's header B8). Hence, global code motion wrongly considers B8 as a candidate for placement of {113, 112, 110}. In general, there does not seem to exist any mechanism in C2's intermediate representation to avoid moving "effectful" memory instructions into the header block of "effectless" loops. Problem 2 Typically, global code motion favors placing nodes outside of loops, which explains why Problem 1 does not often cause miscompilations in practice. However, in the OSR compilation of Count.java, B8 has a lower estimated execution frequency than B7 (as reflected in the red-blue color scale in count-actual.pdf), which leads to the placement of {113, 112, 110} into B8. Since B8 is the single successor of B7, it should have at least the same estimated execution frequency. The misestimation is due to the assumption in CFGLoop::compute_freq() that the underlying CFG is reducible, which is not the case for the OSR compilation of Count.java. More specifically, compute_freq() computes the frequencies of each block within a loop L in a single forward pass (in reverse DFS postorder) over the members (blocks and loops) of L [1], assuming that the members form an acyclic graph (except for L's back-edges). This assumption does not hold for irreducible graphs, where there might be additional cycles corresponding to non-natural loops. The result in this case is that the frequencies of {B8, B9, B10, B11, B12, B13, B14, B16, B18, B20} are updated "too early" (before the frequency of B7 is updated with that of B22) and never revisited. A possible solution to Problem 2 is to transform the single forward pass into a proper system of equations to be solved iteratively (see e.g. [2]), as in the following Python/SymPy snippet: from sympy import * b3, l3, b5, b6, b7, l2, b14, b16, b18, b20, b21, b22, b23, b25 = symbols(['b3', 'l3', 'b5', 'b6', 'b7', 'l2', 'b14', 'b16', 'b18', 'b20', 'b21', 'b22', 'b23', 'b25']) system = [ Eq(b3, 1), Eq(l3, b3 * 1), Eq(b5, l3 * 1), Eq(b6, b5 * 0), Eq(b7, b6 * 1 + b22 * 1), Eq(l2, b7 * 1), Eq(b14, l2 * 0.666), Eq(b16, b14 * 1), Eq(b18, b16 * 1), Eq(b20, b18 * 1), Eq(b21, b25 * 1 + b20 * 1), Eq(b22, b21 * 0.02), Eq(b23, b21 * 0.98), Eq(b25, b5 * 1), ] soln = solve(system, [b3, l3, b5, b6, b7, l2, b14, b16, b18, b20, b21, b22, b23, b25]) print(soln) {b3: 1.00000000000000, l3: 1.00000000000000, b5: 1.00000000000000, b6: 0.0, b7: 0.0202699963514007, l2: 0.0202699963514007, b14: 0.0134998175700328, b16: 0.0134998175700328, b18: 0.0134998175700328, b20: 0.0134998175700328, b21: 1.01349981757003, b22: 0.0202699963514007, b23: 0.993229821218632, b25: 1.00000000000000} Using these frequencies instead of those computed by C2 results in global code motion placing 'count++' into B7 (see count-expected.pdf), and the expected output being produced by Count.java. An alternative solution could be to simply ensure that all CFGs are reducible by applying node-splitting before frequency computation (perhaps other optimizations could also benefit from that?). [1] https://github.com/openjdk/jdk/blob/4a267f1bc2b025aae2cb9df7283156aeb0282406/src/hotspot/share/opto/gcm.cpp#L1790 [2] https://dl.acm.org/doi/10.1145/178243.178251 (Section 5). ORIGINAL REPORT FROM FUZZER TEST: The attached fuzzer test produces a different result for C2 compared to C1/interpreter. To reproduce: $ java -Xint Test.java > Xint.log $ java -Xmx1G -Xcomp -Xbatch -XX:-TieredCompilation -XX:CompileOnly=Test Test.java > c2.log $ diff Xint.log c2.log 13c13 < i21 i22 i23 = -4427127,2,66 --- > i21 i22 i23 = -50828706,2,66 17,18c17,18 < Test.iFld byFld Test.fFld = -4427127,-77,1252881271 < Test.iFld1 Test.bArrFld iArrFld = -22136,13534,-43010 --- > Test.iFld byFld Test.fFld = -76243092,-77,1252384471 > Test.iFld1 Test.bArrFld iArrFld = -63536,13534,-43010 28c28 < Test.iFld byFld Test.fFld = -1140077,66,1258518506 --- > Test.iFld byFld Test.fFld = -1140077,66,1258249012 35c35 < i21 i22 i23 = -4428333,2,66 --- > i21 i22 i23 = -50833530,2,66 39,40c39,40 < Test.iFld byFld Test.fFld = -4428333,-77,1261269881 < Test.iFld1 Test.bArrFld iArrFld = -22142,13534,-86020 --- > Test.iFld byFld Test.fFld = -76250328,-77,1260607481 > Test.iFld1 Test.bArrFld iArrFld = -63542,13534,-86020 50c50 < Test.iFld byFld Test.fFld = -1138674,66,1264202151 --- > Test.iFld byFld Test.fFld = -1138674,66,1263539751 57c57 < i21 i22 i23 = -4440730,2,66 --- > i21 i22 i23 = -50827918,2,66 61,62c61,62 < Test.iFld byFld Test.fFld = -4440730,-77,1266817300 < Test.iFld1 Test.bArrFld iArrFld = -22204,13534,-129030 --- > Test.iFld byFld Test.fFld = -76241910,-77,1265877126 > Test.iFld1 Test.bArrFld iArrFld = -63535,13534,-129030 64,65c64,65 < iMeth_check_sum: -5029668300147928 < vMeth_check_sum: -607986478366232936 --- > iMeth_check_sum: -5029668300160304 > vMeth_check_sum: -607986478366244874 68c68 < i21 i22 i23 = -1149871,2,-77 --- > i21 i22 i23 = -1151077,2,-77 71,73c71,73 < Test.instanceCount Test.dFld Test.bFld = 6,-4591997026765212379,0 < Test.iFld byFld Test.fFld = -1149871,66,1268283435 < Test.iFld1 Test.bArrFld iArrFld = -5749,13534,-158758 --- > Test.instanceCount Test.dFld Test.bFld = 0,-4591997026765212379,0 > Test.iFld byFld Test.fFld = -1151077,66,1267744602 > Test.iFld1 Test.bArrFld iArrFld = -5755,13534,-158758 75,76c75,76 < iMeth_check_sum: -5867946350171353 < vMeth_check_sum: 8514054478760838568 --- > iMeth_check_sum: -5867946350186585 > vMeth_check_sum: 8514054478760823836 83c83 < Test.iFld byFld Test.fFld = -4438334,-77,1269659754 --- > Test.iFld byFld Test.fFld = -4438334,-77,1269120922 86,87c86,87 < iMeth_check_sum: -6706224400191446 < vMeth_check_sum: -810648637821638496 --- > iMeth_check_sum: -6706224400206202 > vMeth_check_sum: -810648637821652847 94c94 < Test.iFld byFld Test.fFld = -1150675,66,1271125889 --- > Test.iFld byFld Test.fFld = -1150675,66,1270587057 97,98c97,98 < iMeth_check_sum: -7544502450215823 < vMeth_check_sum: 8311392319305431865 --- > iMeth_check_sum: -7544502450230579 > vMeth_check_sum: 8311392319305417514 105c105 < Test.iFld byFld Test.fFld = -4440730,-77,1272502208 --- > Test.iFld byFld Test.fFld = -4440730,-77,1271963376 108,109c108,109 < iMeth_check_sum: -8382780500236868 < vMeth_check_sum: -1013310797277046215 --- > iMeth_check_sum: -8382780500251624 > vMeth_check_sum: -1013310797277060566
|