JDK-8309636 : Optimize double drem for Blender.java
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 22
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • CPU: generic
  • Submitted: 2023-06-07
  • Updated: 2023-07-20
Version (Other): tbd, Unresolved
Description
[Blender.java](https://www.graalvm.org/22.1/examples/java-performance-examples/#sunflow-example) is the kernel of Sunflow. It is essentially a microbenchmark for Partial Escape Analysis (PEA): the Color object allocated inside the loop exposes the PEA opportunity.
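For readers unfamiliar with the pattern, here is a minimal sketch of the kind of loop PEA targets. It is not the actual Blender.java: the class PeaLoopSketch, the method run, and the sink list are hypothetical names made up for illustration. Only the Color fields r/g/b and the `% 42 == 0` test are taken from the expression quoted in this report.

```java
import java.util.ArrayList;
import java.util.List;

class PeaLoopSketch {
    // Hypothetical stand-in for the benchmark's Color class.
    static final class Color {
        double r, g, b;
        Color(double r, double g, double b) { this.r = r; this.g = g; this.b = b; }
    }

    // Allocates a Color per iteration, but only lets it escape on a rare branch.
    static int run(int n, List<Color> sink) {
        int kept = 0;
        for (int i = 0; i < n; i++) {
            Color color = new Color(i, i + 1, i + 2);
            if ((color.r + color.g + color.b) % 42 == 0) {
                sink.add(color); // the only path on which the object escapes
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Color> sink = new ArrayList<>();
        System.out.println(run(100, sink) + " of 100 objects escaped");
    }
}
```

With partial escape analysis, the allocation can be sunk into the branch that calls sink.add, so the common path works on scalars only. What is left on that path is the double remainder itself.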

C2 with PEA makes Blender 38.58% faster. Graal (graalvm-ce-java17-22.3.1) makes it 47.35% faster. In other words, Graal is still 14.3% faster than C2 with PEA (reading both figures as run-time reductions: 1 - 0.5265/0.6142 ≈ 0.143). I profiled allocation using async-profiler, and I believe C2 PEA has the same effect as Graal. The remaining gap appears to come from the drem operation in this expression: (color.r + color.g + color.b) % 42 == 0

In output_c2.html, 66% of CPU time is spent at Blender.initialize@82, which is the drem bytecode. Even though Color.r/g/b are all doubles, their values only ever come from integers. In output_graal.html, bytecode @82 accounts for only 4.10%.
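To illustrate why integer-valued operands matter, here is a sketch of one conceivable fast path. It is an assumption for discussion, not a description of what Graal or C2 actually emits. The names DremIntegralSketch, fastRem and isSmallIntegral are hypothetical. The idea: when both operands are exact integers below 2^53 in magnitude, the double remainder equals the long remainder, with the sign of a zero result taken from the dividend.

```java
class DremIntegralSketch {
    // Below 2^53 in magnitude, every integral value is exactly representable as a double.
    private static final double EXACT_LIMIT = 9007199254740992.0;

    static boolean isSmallIntegral(double x) {
        // NaN and infinities fail the first comparison and take the slow path.
        return Math.abs(x) < EXACT_LIMIT && x == Math.rint(x);
    }

    // Hypothetical fast path for a % b; falls back to the normal drem otherwise.
    static double fastRem(double a, double b) {
        if (b != 0.0 && isSmallIntegral(a) && isSmallIntegral(b)) {
            long r = (long) a % (long) b;
            // drem gives a zero result the sign of the dividend (e.g. -42.0 % 42.0 is -0.0).
            return r == 0 ? Math.copySign(0.0, a) : (double) r;
        }
        return a % b;
    }

    public static void main(String[] args) {
        double r = 1, g = 2, b = 39;
        System.out.println(fastRem(r + g + b, 42) == 0); // same test as in the report
    }
}
```

Whether a compiler can prove or cheaply check integrality at this site is a separate question. The sketch only shows that the two remainders agree when the check succeeds.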

I think this is a good opportunity to optimize drem the way Graal does.

Comments
JDK-8312233 and JDK-8312188 are related as well.
20-07-2023

JDK-8308966 looks related.
14-06-2023

It looks to me like the parsing support already exists in C2:

    case Bytecodes::_drem:
      if (Matcher::has_match_rule(Op_ModD)) {
        // Generate a ModD node.
        b = pop_pair();
        a = pop_pair();
        // a % b
        c = _gvn.transform( new ModDNode(0,a,b) );
        d = dprecision_rounding(c);
        push_pair( d );
      } else {
        // Generate a call.
        modd();
      }
      break;

But the matching rule for ModD is only implemented, using the FPU, in x86_32.ad, not for x86_64:

    instruct modD_reg(regD dst, regD src0, regD src1, eAXRegI rax, eFlagsReg cr) %{
      predicate(UseSSE>=2);
      ...

Maybe it is worth sharing that implementation for x86_64.
08-06-2023