JDK-8309267 : C2 SuperWord: some tests fail on KNL machines - fail to vectorize
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 17,20,21
  • Priority: P4
  • Status: Closed
  • Resolution: Duplicate
  • Submitted: 2023-06-01
  • Updated: 2024-06-06
  • Resolved: 2024-06-06
Other: tbd (Resolved)
Description
I found this while looking into JDK-8278920.

I have some examples that behave as follows:

AVX2: vectorize
KNL: do NOT vectorize -> why does it not just make use of its AVX2 capabilities?
AVX512: vectorize

I have not tested this on a proper KNL machine, but when I whitelist the KNL setting in the IR-Framework (see JDK-8309183), then some IR-tests begin to fail. And I think that is a clear indicator that the tests would fail on actual KNL machines.


--------------------------------------

For example, I extracted compiler.loopopts.superword.TestGeneralizedReductions.testMapReductionOnGlobalAccumulator:

./java -Xbatch -XX:CompileCommand=compileonly,Test1::test -XX:+TraceNewVectors -XX:+TraceSuperWord -XX:+Verbose -XX:+UseKNLSetting Test1.java

Unimplemented
 497 PopCountL === _ 498 [[ 496 ]] Type:int !orig=418,356,126 !jvms: Test1::test @ bci:18 (line 14)

The pack has 8 ops.

But if I run:

./java -Xbatch -XX:CompileCommand=compileonly,Test1::test -XX:+TraceNewVectors -XX:+TraceSuperWord -XX:+Verbose -XX:UseAVX=3 Test1.java

Then the pack with 8 PopCountL seems to create no issues, we vectorize.

Likewise, with:

./java -Xbatch -XX:CompileCommand=compileonly,Test1::test -XX:+TraceNewVectors -XX:+TraceSuperWord -XX:+Verbose -XX:UseAVX=2 Test1.java

I get a pack of 4 PopCountL, and that vectorizes.
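
For orientation, a reproducer of this shape could look like the sketch below. This is NOT the Test1.java attached to the report; it is a guess based on the test name (a map-reduction on a global accumulator) and on the PopCountL node in the trace. The field name acc, the array length and the iteration count are assumptions chosen for illustration.

```java
// Hypothetical reproducer sketch; not the file attached to this report.
class Test1 {
    // Global accumulator, as suggested by "testMapReductionOnGlobalAccumulator".
    static long acc;

    // Map (Long.bitCount) followed by a reduction (+=) over a long[].
    static long test(long[] array) {
        acc = 0;
        for (int i = 0; i < array.length; i++) {
            acc += Long.bitCount(array[i]);
        }
        return acc;
    }

    public static void main(String[] args) {
        long[] array = new long[1024]; // length is an arbitrary choice
        for (int i = 0; i < array.length; i++) {
            array[i] = i;
        }
        long expected = test(array);
        // Call test() repeatedly so that it gets hot enough for C2 to compile it;
        // with -Xbatch the compilation runs in the foreground.
        for (int i = 0; i < 10_000; i++) {
            if (test(array) != expected) {
                throw new AssertionError("result changed between runs");
            }
        }
        System.out.println("sum of bit counts: " + expected);
    }
}
```

With a file like this, the three command lines above differ only in the CPU feature flag (-XX:+UseKNLSetting, -XX:UseAVX=3, -XX:UseAVX=2).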

The problem is in src/hotspot/cpu/x86/x86.ad

    case Op_PopCountVI:
    case Op_PopCountVL: {
        if (!is_pop_count_instr_target(bt) &&
            (size_in_bits == 512) && !VM_Version::supports_avx512bw()) {
          return false;
        }
      }

The issue is that we have allowed the packing of 8 longs, which makes 512 bits. But under KNL we have no avx512bw support, so we say "Unimplemented", reject the pack, and end up with no vectorization. It would have been better to step down to AVX2 and pack 4 longs (256 bits) instead, because that should be able to vectorize!

One solution: we could retry vectorization at a smaller MaxVectorSize if it fails. I have seen some other compilers do that.

Another option: find the maximal vector width per instruction at the beginning of SuperWord, and limit the vectorization to the smallest one we find.
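
The second option can be illustrated with a small toy model. It is written in Java for readability and is not HotSpot code: the class, the method names and the string opcodes are all hypothetical, and supported() only imitates the x86.ad checks quoted in this report for a KNL-like CPU (512-bit vectors available, but no avx512bw and no avx512dq, and assuming the CPU is not a popcount instruction target).

```java
import java.util.List;

// Toy model only; none of these names exist in HotSpot.
final class VectorWidthLimitSketch {
    // Stand-in for a query like Matcher::match_rule_supported_vector on a
    // KNL-like CPU. The listed operations are rejected at 512 bits, in the
    // spirit of the x86.ad checks quoted in this report.
    static boolean supported(String op, int sizeInBits) {
        if (sizeInBits > 512) {
            return false;
        }
        switch (op) {
            case "PopCountVL":
            case "PopulateIndex":
            case "AbsVF":
            case "NegVF":
            case "AbsVD":
            case "NegVD":
                return sizeInBits < 512;
            default:
                return true;
        }
    }

    // Start at the maximal vector width and halve it until the operation is
    // supported. Returns 0 if not even a 2-element vector is supported.
    static int maxSupportedBits(String op, int maxVectorBits, int elementBits) {
        for (int bits = maxVectorBits; bits >= 2 * elementBits; bits /= 2) {
            if (supported(op, bits)) {
                return bits;
            }
        }
        return 0;
    }

    // Limit the whole loop to the smallest per-operation maximum.
    static int packWidthLimit(List<String> ops, int maxVectorBits, int elementBits) {
        int limit = maxVectorBits;
        for (String op : ops) {
            limit = Math.min(limit, maxSupportedBits(op, maxVectorBits, elementBits));
        }
        return limit;
    }
}
```

In this model, packWidthLimit(List.of("LoadVector", "PopCountVL", "AddVL"), 512, 64) yields 256, i.e. packs of 4 longs, which is the pack size that vectorizes with -XX:UseAVX=2.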


--------- List of failures below, may not be exhaustive ---------------


Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "private static long compiler.loopopts.superword.TestGeneralizedReductions.testMapReductionOnGlobalAccumulator(long[])" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(applyIfCPUFeatureAnd={}, phase={DEFAULT}, applyIfCPUFeatureOr={}, applyIf={}, applyIfCPUFeature={"avx2", "true"}, counts={"_#ADD_REDUCTION_VI#_", ">= 1", "_#POPCOUNT_VL#_", ">= 1"}, failOn={}, applyIfAnd={"SuperWordReductions", "true", "UsePopCountInstruction", "true"}, applyIfOr={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(AddReductionVI.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 >= 1 [given]
           - No nodes matched!
         * Constraint 2: "(\\d+(\\s){2}(PopCountVL.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 >= 1 [given]
           - No nodes matched!


Reason:

    case Op_PopCountVI:
    case Op_PopCountVL: {
        if (!is_pop_count_instr_target(bt) &&
            (size_in_bits == 512) && !VM_Version::supports_avx512bw()) {
          return false;
        }
      }


---------------------------------

I see similar failures like this, with KNL:

    case Op_PopulateIndex:
      if (size_in_bits > 256 && !VM_Version::supports_avx512bw()) {
        return false;
      }
      break;

Failed IR Rules (3) of Methods (3)
----------------------------------
1) Method "public void compiler.vectorization.TestPopulateIndex.exprWithIndex1()" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(applyIfCPUFeatureAnd={}, phase={DEFAULT}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#POPULATE_INDEX#_", "> 0"}, applyIfAnd={}, failOn={}, applyIfOr={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(PopulateIndex.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

2) Method "public void compiler.vectorization.TestPopulateIndex.exprWithIndex2()" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(applyIfCPUFeatureAnd={}, phase={DEFAULT}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#POPULATE_INDEX#_", "> 0"}, applyIfAnd={}, failOn={}, applyIfOr={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(PopulateIndex.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

3) Method "public void compiler.vectorization.TestPopulateIndex.indexArrayFill()" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(applyIfCPUFeatureAnd={}, phase={DEFAULT}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#POPULATE_INDEX#_", "> 0"}, applyIfAnd={}, failOn={}, applyIfOr={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(PopulateIndex.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!



--------------------

And this:

    case Op_AbsVF:
    case Op_NegVF:
      if ((vlen == 16) && (VM_Version::supports_avx512dq() == false)) {
        return false; // 512bit vandps and vxorps are not available
      }
      break;
    case Op_AbsVD:
    case Op_NegVD:
      if ((vlen == 8) && (VM_Version::supports_avx512dq() == false)) {
        return false; // 512bit vpmullq, vandpd and vxorpd are not available
      }
      break;

Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "public static float compiler.loopopts.superword.SumRedAbsNeg_Float.sumReductionImplement(float[],float[],float[],float)" - [Failed IR rules: 1]:
   * @IR rule 2: "@compiler.lib.ir_framework.IR(applyIfCPUFeatureAnd={}, phase={DEFAULT}, applyIfCPUFeatureOr={}, applyIf={}, applyIfCPUFeature={"sse2", "true"}, counts={"_#ADD_REDUCTION_VF#_", ">= 1", "_#ABS_V#_", ">= 1", "_#NEG_V#_", ">= 1"}, failOn={}, applyIfAnd={"SuperWordReductions", "true", "LoopMaxUnroll", ">= 8"}, applyIfOr={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(AddReductionVF.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 >= 1 [given]
           - No nodes matched!
         * Constraint 2: "(\\d+(\\s){2}(AbsV(B|S|I|L|F|D).*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 >= 1 [given]
           - No nodes matched!
         * Constraint 3: "(\\d+(\\s){2}(NegV(F|D).*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 >= 1 [given]
           - No nodes matched!

Comments
I verified it with the attached test. It now works, since we split the packs correctly for the available vector length, see JDK-8326139.
06-06-2024

Yes, an API like Matcher::max_vector_size(int opcode, BasicType bt) can be used during combine pack.
15-06-2023

A wholly different approach: Run SLP at a smaller MaxVectorSize if it failed. I know some compilers do things like that.
09-06-2023
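
The retry idea from the comment above could be sketched roughly as follows. This is a toy model in Java, not HotSpot code: the class and method names are hypothetical, and the tryVectorize callback stands in for a complete SLP attempt at a given maximal vector size in bytes (the unit that MaxVectorSize uses).

```java
import java.util.function.IntPredicate;

// Toy model only; none of these names exist in HotSpot.
final class RetrySmallerVectorSizeSketch {
    // Try to vectorize at maxVectorSizeBytes first, and halve the size after
    // every failed attempt. Returns the size that worked, or 0 if none did.
    static int vectorizeWithRetry(int maxVectorSizeBytes,
                                  int minVectorSizeBytes,
                                  IntPredicate tryVectorize) {
        if (minVectorSizeBytes < 1) {
            throw new IllegalArgumentException("minVectorSizeBytes must be positive");
        }
        for (int size = maxVectorSizeBytes; size >= minVectorSizeBytes; size /= 2) {
            if (tryVectorize.test(size)) {
                return size;
            }
        }
        return 0;
    }
}
```

For the KNL case in this report, an attempt at 64 bytes fails and the one at 32 bytes succeeds, so vectorizeWithRetry(64, 16, size -> size <= 32) returns 32. The cost of this approach is that every failed attempt repeats the whole analysis.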

[~jbhateja] We basically need a facility that tells us the maximum number of elements for an opc and bt, right? Do we have anything like that?
I think the issue is that max_vector_size_in_def_use_chain only finds the largest bt in the def-use chain. But any of the type/opc combinations in the def-use chain could have some odd constraint that allows them fewer than the expected number of elements in a vector.
I guess we could construct something like this:
Matcher::max_vector_size(int opcode, BasicType bt)
It would query this:
Matcher::match_rule_supported_vector(int opcode, int vlen, BasicType bt)
starting with vlen = max_vector_size(bt). If that does not work, we just divide vlen by 2 until it works.
What do you think?
09-06-2023

Looks like a problem during combine packs, which is agnostic to match_rule_supported_vector: it combines the packs based on the max vector size supported by the target. KNL does support 512-bit vectors of non-subword types (int/long/float/double); only during filter packs do we discard a non-implementable pack.
01-06-2023

PopCountV was added in JDK-19:
https://github.com/openjdk/jdk/commit/fde31498963d76630ada31bd0e0cf3035f87445b
The other operations go back further, I think. For example, Op_NegVD goes back to JDK-13, though I have not verified this by running the examples:
https://github.com/openjdk/jdk/commit/707c30fae6616fa603a0b45aae749b2fe137db5f#diff-d6a3624f0f0af65a98a47378a5c146eed5016ca09b4de1acd0a3acc823242e82
When going further back it may be difficult to reproduce the exact examples, because other things in SuperWord were not yet implemented.
01-06-2023

Actually, I think PopCountV was introduced in JDK-19, but there are other features still missing in JDK-19 to make Test1.java vectorize on AVX512, for example AddL and ConvI2L. But if I try with JDK-20 I get the same behavior as with current JDK-21.
01-06-2023

ILW = Loop not vectorized due to superword bailout, on KNL, no workaround = MLH = P4
01-06-2023