JDK-4699924 : 64 bit jvm generates slow array copying code
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 1.4.1
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: solaris_2.5.1
  • CPU: sparc
  • Submitted: 2002-06-10
  • Updated: 2002-09-18
  • Resolved: 2002-08-28
Fixed in release: 1.4.2 (mantis)
Description

Name: ks84122			Date: 06/10/2002


The example below demonstrates that the 64 bit server compiler generates
poor array copying code. Reproducible on Solaris 9 and on Solaris 7 in
64 bit mode, with JDK 1.4.1 (solaris-sparcv9).

Original description from HP:
=============================

public class copybench
{
  static char[] source = new char[2000];
  static char[] target = new char[5000];

  public static long run(int loopcount, boolean usesystem, int source_off,
    int target_off, int len)
  {
    int indx, indx2;
    long starttime = System.currentTimeMillis();
   
    char[] s = source;
    char[] t = target;
    for (indx2=loopcount; --indx2 >= 0; )
    {
      if (! usesystem)
      {
        for (indx = 0; indx < len; ++indx)
        {
          t[target_off + indx] = s[source_off + indx];
        }
      }
      else
      {
        System.arraycopy(s, source_off, t, target_off, len);
      }
    }
    long endtime = System.currentTimeMillis();
    return endtime - starttime;
  }

  public static void main(String argv[])
  {
    int indx;
    for (indx=0; indx < source.length; ++indx)
    {
      source[indx] = (char) indx;
    }
    System.out.println((argv.length == 0) + " warmup " + (long) argv.length);
    run(2,true,0,0,2000);
    run(2,false,0,0,2000);
    for (indx=100000; indx > 0; --indx)
    {
      run(1,true,1,1,16); // warm up run
      run(1,false,1,1,16); // warm up run
    }
    System.out.println("start");
    for (indx=10; indx > 0; --indx)
    {
      System.out.println("system copy " + run(99990,true,240,200,20));
      System.out.println("java copy " + run(99990,false,240,200,20));
    }
  }

}

7% /export/home1/jgragg/jdk1.4.0_01/bin/java -server copybench
...
system copy 38
java copy 28
8% /export/home1/jgragg/jdk1.4.0_01/bin/java -d64 -server copybench
...
system copy 31
java copy 52    (52 ms vs 28 ms is almost 2x slower)

When I look at the generated assembly, the 32 bit jvm produces a clean
unrolled copying loop, while the 64 bit jvm produces code that is markedly
uglier and slower.
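
For reference, the unroller's effect on the inner copy loop corresponds
roughly to the hand-unrolled Java sketch below. This is only an
illustration: the 4-way factor and the method name are chosen for brevity
here, while the suggested-fix notes below indicate the compiler actually
unrolled this loop 8-way, with its own pre/post cleanup loops.

public class UnrolledSketch
{
  // Hand-unrolled (4-way) equivalent of copybench's inner loop. On a
  // 64-bit VM each variant index "off + indx + k" must be widened from
  // a 32-bit int to a 64-bit address operand; that widening is the
  // CastI2L/ConvI2L node discussed in the evaluation below.
  static void unrolledCopy(char[] s, char[] t,
                           int source_off, int target_off, int len)
  {
    int indx = 0;
    for (; indx + 4 <= len; indx += 4)
    {
      t[target_off + indx    ] = s[source_off + indx    ];
      t[target_off + indx + 1] = s[source_off + indx + 1];
      t[target_off + indx + 2] = s[source_off + indx + 2];
      t[target_off + indx + 3] = s[source_off + indx + 3];
    }
    for (; indx < len; ++indx)     // cleanup loop for the remainder
    {
      t[target_off + indx] = s[source_off + indx];
    }
  }
}

Ideally the four stores in the unrolled body share one widened index
computation and differ only in small constant displacements; the complaint
here is that the 64 bit jvm instead recomputes the widened index for every
access.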
(Review ID: 153345) 
======================================================================

Comments
CONVERTED DATA
BugTraq+ Release Management Values
COMMIT TO FIX: 1.4.1_02 mantis
FIXED IN: mantis
INTEGRATED IN: mantis
14-06-2004

SUGGESTED FIX
See attached file LONG-UNROLL-PATCH-2 for a suggested fix. This fix is currently under test for inclusion in Mantis. The main copybench loop before the change, with 8-way unrolling, had 83 instructions in 35 bundles; the average "java copy" time was 38 ms. After the fix, it is 27 instructions in 21 bundles, with an average "java copy" time of 32 ms. See the "before" and "after" examples in the attachments. ###@###.### 2002-07-23

The suggested fix has been amended (simplified) following engineering review. Before integrating any such fix, please contact me for the latest version. The attached LONG-UNROLL-PATCH-2 may be used for testing and risk assessment. ###@###.### 2002-07-25
25-07-2002

EVALUATION
###@###.### 2002-06-11

Reproduced with current 1.4.1 build on solaris_sparc.

Sample generated code for 32bit loadC and storeC:

  1cc     LDUH   [R_L2 + #12],R_L0
  1d0  +  STH    R_L0,[R_G4 + #12]    ! short

Sample generated code for 64bit storeC:

  180  +  ADD    R_G5,#3,R_O2
  18c     SRA    R_O2,0,R_O2          ! int->long
  198     SLLX   R_O2,#1,R_L1
  1ac     ADD    R_L6,R_L1,R_L1
  1b8  +  STH    R_O2,[R_L1 + #24]    ! short

Similar code is generated for LoadC. This is likely a matching problem where the CastI2L, inserted between the original offset expression and the array access, interferes with matching.
-----
###@###.### 2002-07-12

While changing match rules may be necessary, further investigation shows that the CastI2L also interferes with optimizations that clean up index expressions. This is done in PhaseIdealLoop::remix_address_expressions(Node*), which only works on address expressions consisting of integer math and needs to be generalized to work on long math as well.
-----
The root cause is a semantic mismatch between array indices (32 bits) and hardware addressing-mode operands (64 bits). This is patched up in the IR by an explicit conversion operation, which is generated on SPARC as "SRA %index, 0, %operand". Unfortunately, when the copy loop body is unrolled, all the interesting variant indexes look like "x[xo + i + 3]". Since they do not commute with the conversion operator, they cannot be commoned up with the displacement of the addressing mode.

The parser performs this generation of IR:

  x[xo + i + 3]  =>  *(jchar*)( &x + I2L(xo + i + 3)*2 + 24 )

Here is the optimization which is not done:

  *(jchar*)( &x + I2L(xo + i + 3)*2 + 24 )  =>  *(jchar*)( &x + I2L(xo + i)*2 + 3*2 + 24 )

Given such an optimization, the eventual assignment looks like this:

  ADD   %xo, %i, %xoi
  SRA   %xoi, 0, %xoi64    ! int->long
  SLLX  %xoi64, 1, %tem1
  ADD   %x, %tem1, %tem2
  STH   %char, [%tem2 + 30]

Note that in an unrolled loop all but the last instruction are shared by all instances of "STH", so the instruction costs are amortized by the unroll factor across several "STH" instructions.

Solution: Make ConvI2L nodes commute with addition, under carefully controlled circumstances (i.e., no 32-bit overflow). This can only be done by adding a target-type annotation to the ConvI2L nodes which is sufficient to prove, locally during optimization, that the 32-bit operands will not overflow and can therefore be added as 64-bit operands. This latter add (of a constant such as "3*sizeof(jchar)") can then be subsumed into the 64-bit addressing mode.

An annotated ConvI2L node will not common with another node of the same input but different annotation, and it should not. Left alone, however, this would lead to multiple equivalent values alive in the unrolled loop, multiple unshared "SRA" and "SLLX" instructions, and multiple live registers holding the same values. Therefore, we must add ConvI2L nodes to the IGVN worklist whenever we think the loop optimizations may be about to converge, to normalize the type annotations after all loop unrolling is done, and common up all equivalent "I2L" and "LShift" operations before matching and register allocation. ###@###.### 2002-07-25
25-07-2002
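
To make the commuting identity concrete, the following small Java program
(a sketch; the class and method names are invented for illustration) checks
the address arithmetic behind the annotated-ConvI2L transformation: when the
32-bit index expression provably does not overflow, widening before or after
adding the unroll constant yields the same 64-bit offset, so the constant
can be folded into the addressing-mode displacement.

public class ConvI2LIdentity
{
  // Parser's form: add the unroll constant in 32-bit math, then widen.
  // Mirrors  *(jchar*)( &x + I2L(xo + i + 3)*2 + 24 ).
  static long parserForm(int xo, int i)
  {
    return (long) (xo + i + 3) * 2 + 24;
  }

  // Desired form: widen first, then add the constant in 64-bit math.
  // Mirrors  *(jchar*)( &x + I2L(xo + i)*2 + 3*2 + 24 ), where the
  // "3*2" can be subsumed into the addressing-mode displacement.
  static long desiredForm(int xo, int i)
  {
    return (long) (xo + i) * 2 + 3 * 2 + 24;
  }

  public static void main(String[] args)
  {
    // In-range indices (the case the type annotation proves): identical.
    System.out.println(parserForm(240, 17) == desiredForm(240, 17));  // true

    // With 32-bit overflow the two forms diverge, which is why the
    // transformation is legal only under the no-overflow annotation.
    int big = Integer.MAX_VALUE - 1;
    System.out.println(parserForm(big, 1) == desiredForm(big, 1));    // false
  }
}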