JDK-8039104 : Don't use Math.min/max intrinsic on x86
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 9,10
  • Priority: P4
  • Status: In Progress
  • Resolution: Unresolved
  • OS: generic
  • CPU: x86
  • Submitted: 2014-04-02
  • Updated: 2024-03-02
Description
The Math.min/max intrinsic unconditionally generates cmove instructions on x86, which may perform much worse than a branch depending on the branch's probability. We already have code in PhaseIdealLoop::conditional_move() that takes branch probability into account.
I suggest not using LibraryCallKit::inline_min_max() on CPUs that have good branch predictors (modern x86).
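The benchmark referenced in the comments below is not attached here; the following is a minimal sketch of the comparison being made, with illustrative names (MaxLoop, intrinsicMax, ownMax are not from the report):

```java
import java.util.Random;

public class MaxLoop {
    static final int[] TABLE = new int[1 << 16];
    static {
        Random random = new Random(42);
        for (int i = 0; i < TABLE.length; i++) TABLE[i] = random.nextInt();
    }

    // Math.max is handled by LibraryCallKit::inline_min_max(), which on
    // x86 unconditionally produces a CMove.
    static int intrinsicMax() {
        int result = Integer.MIN_VALUE;
        for (int x : TABLE) result = Math.max(result, x);
        return result;
    }

    // Hand-written max: C2 sees an ordinary branch and can use its profile.
    // The branch becomes highly predictable once the running maximum
    // stabilizes, so a jump tends to win on CPUs with good predictors.
    static int ownMax() {
        int result = Integer.MIN_VALUE;
        for (int x : TABLE) result = result >= x ? result : x;
        return result;
    }
}
```

Both methods compute the same result; only the generated code differs.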
Comments
JDK-8324655: The idea was to replace if-statements with min/max nodes when possible. But I found a regression case: https://github.com/openjdk/jdk/pull/17574#issuecomment-1973121951

It seems the main motivation was to have CMove/Min/Max nodes in loops so that we can vectorize. Currently we do not allow control flow in loops for vectorization. I hope to change that by doing if-conversion during AutoVectorization (see progress in JDK-8317424).

If we could assume that if-conversion in loops was completed, then the conversation about CMove/Max/Min vs If would be primarily about straight-line code, or loops that don't get vectorized. Here it is difficult to get a good cost function, but maybe something can be done about it. The cost function would of course be different on different platforms/architectures: it depends on the cost of individual instructions but also on how good the branch predictor is. And how well the branch predictor works may not depend only on the code around the if/CMove, but more generally on the overall pressure on the branch predictor. So it could turn out to be a problem that cannot be solved optimally in general, and we'd have to accept some heuristics that can be wrong in some cases.
02-03-2024

The Java code for this sequence comes from inlining Arrays::copyOf() when calling String::getBytes():

@ 35   java.lang.String::getBytes (15 bytes)   inline (hot)
  @ 0    java.nio.charset.Charset::defaultCharset (53 bytes)   inline (hot)
  @ 4    java.lang.String::coder (15 bytes)   inline (hot)
  @ 11   java.lang.String::encode (48 bytes)   inline (hot)
    @ 10   java.lang.String::encodeUTF8 (141 bytes)   inline (hot)
      @ 15   java.lang.StringCoding::hasNegatives (25 bytes)   (intrinsic)
      @ 24   java.util.Arrays::copyOf (19 bytes)   inline (hot)
        @ 11   java.lang.Math::min (11 bytes)   inline (hot)
        @ 14   java.lang.System::arraycopy (0 bytes)   (intrinsic)
01-04-2021

With the intrinsic disabled, the generated code contains an additional large code sequence (UseXMMForObjInit) to clear the allocated byte[], which is later overwritten by arraycopy():

    xorq    rax, rax    # ClearArray:
    cmp     InitArrayShortSize, rcx
    jg      LARGE

With the min() intrinsic, the ClearArray code is not generated.
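The shape of the code at issue can be sketched as follows. Per the comment above, with the min() intrinsic C2 elides the ClearArray zeroing of the freshly allocated array, while with the intrinsic disabled a large zeroing sequence is emitted; the method name here is illustrative, not from the JDK:

```java
public class CopyOfSketch {
    // Sketch of the shape of Arrays.copyOf(): a fresh array is allocated
    // and its prefix is immediately overwritten by arraycopy(). The
    // allocation's zeroing (ClearArray) is redundant for the overwritten
    // prefix, so eliminating it matters.
    static byte[] copyOfSketch(byte[] original, int newLength) {
        byte[] copy = new byte[newLength];               // zeroing candidate (ClearArray)
        int len = Math.min(original.length, newLength);  // intrinsic -> CMove on x86
        System.arraycopy(original, 0, copy, 0, len);     // overwrites copy[0 .. len)
        return copy;
    }
}
```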
01-04-2021

Another issue with this intrinsic: the min() intrinsic is used in Arrays.copyOfRange(), and as a result the array is not removed by Escape Analysis:

public class Test {
    public static void main(String[] args) {
        byte[] src = new byte[16];
        for (int i = 0; i < 12000; ++i) {
            test(src);
        }
    }

    private static int test(byte[] src) {
        byte[] dst = Arrays.copyOfRange(src, 1, 7);
        return dst[2] + dst[5];
    }
}

I think it is because the intrinsic leads to generation of a CMove node instead of a branch with an uncommon trap, and as a result the arraycopy length is not constant in copyOfRange():

    System.arraycopy(original, from, copy, 0, Math.min(original.length - from, newLength));

https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/Arrays.java#L3777
29-03-2021

I think this is still an issue. See the discussion in https://github.com/openjdk/jdk/pull/2570
24-03-2021

Thanks Vladimir for correcting my wrong assumption about jdk8u20 b11. With my own build of the JVM, with the patch, I still get similar results:

OpenJDK 64-Bit Server VM (build 25.20-b13-internal, mixed mode)

% java -jar target/microbenchmarks.jar ".*Max.*"
Benchmark                          Mode  Samples   Mean  Mean error  Units
o.o.j.s.JmhMaxBenchmark.intrinsic  avgt      100  0.854       0.000  ns/op
o.o.j.s.JmhMaxBenchmark.own        avgt      100  0.764       0.005  ns/op

% java -XX:-UseCBCond -jar target/microbenchmarks.jar ".*Max.*"
Benchmark                          Mode  Samples   Mean  Mean error  Units
o.o.j.s.JmhMaxBenchmark.intrinsic  avgt      100  0.817       0.000  ns/op
o.o.j.s.JmhMaxBenchmark.own        avgt      100  1.040       0.004  ns/op
26-04-2014

Interesting. We use MOVcc on SPARC for cmoves; that is why the intrinsic's performance is the same. jdk8u20 b11 does not have the fix for JDK-8038297; it should be in the next build, I think.
25-04-2014

Thanks Igor for the suggestion. I just ran the same tests with jdk8u20 b11 and got roughly the same results, so it doesn't seem to be directly related to JDK-8038297. It seems to me that CBcond helped the JIT'd code and hurt the intrinsic, which tipped the scale.
25-04-2014

It seems that on a 3 GHz SPARC T4, with JDK 8, the Math.max intrinsic is slower as well:

Benchmark                          Mode  Samples   Mean  Mean error  Units
o.o.j.s.JmhMaxBenchmark.intrinsic  avgt      100  0.854       0.000  ns/op
o.o.j.s.JmhMaxBenchmark.own        avgt      100  0.767       0.006  ns/op

But with -XX:-UseCBCond:

Benchmark                          Mode  Samples   Mean  Mean error  Units
o.o.j.s.JmhMaxBenchmark.intrinsic  avgt      100  0.817       0.000  ns/op
o.o.j.s.JmhMaxBenchmark.own        avgt      100  1.034       0.004  ns/op
25-04-2014

James, try the latest build of jdk8u20. It can be related to JDK-8038297.
25-04-2014

On 4/2/14 7:22 AM, Andrew Haley wrote:
> On 04/02/2014 02:03 PM, Vitaly Davidovich wrote:
>> You mean substantial unrolling occurs with certain probabilities?
>
> It seems to. My experiment was to make the branch totally unpredictable,
> in which case the intrinsic is no better than the user-defined Math.max.
>
>> I think that's where the cmov really hurts (for predictable branches)
>> as the cpu will be slowed down by dependency chains.
>>
>> As I mentioned in my other reply, it's hard to build a software model
>> of a branch prediction unit as they're not simple probability counters
>> but look at branch patterns. I think the best hotspot can do here is
>> just record highly likely/unlikely code paths, but leave the gray area
>> alone (i.e. prefer jumps).
>
> As far as I can see from x86, there aren't really any cases where the
> intrinsic is useful. In some cases HotSpot generates cmov with
> user-defined Math.max, depending on probability. Simply deleting the
> intrinsic would be no bad thing, judging from these results.
>
> Andrew.
02-04-2014

On 4/2/14 7:50 AM, Vitaly Davidovich wrote:
> I don't think it's reasonably possible to model hardware branch
> prediction in software. As you say, details are murky, the hardware
> changes and advances, etc.
>
> So clearly there will be cases where cmov will yield a speedup over
> branches. The question is whether the compiler can ascertain that with
> very good precision; otherwise code is penalized and doesn't take
> advantage of any advances in branch prediction in hardware.
>
> By removing branches in java I meant rewriting the code/algorithm to
> not branch (or branch more predictably); this is akin to what people do
> in other languages/domains (e.g. c or c++ on hardware with weak
> predictors). Granted the level of control in java is limited to some
> degree, but certainly avoiding branches or making them more predictable
> is feasible?
>
> Sent from my phone
>
> On Apr 2, 2014 10:11 AM, "Martin Grajcar" <maaartinus@gmail.com> wrote:
>
>     On Wed, Apr 2, 2014 at 2:55 PM, Vitaly Davidovich
>     <vitalyd@gmail.com> wrote:
>
>         I think hotspot profiling records cumulative probability of a
>         branch taken vs not taken - it doesn't record the pattern,
>         right? This basically means that the best hotspot can do is
>         determine if a jump is very likely taken or not, rather than
>         being able to answer whether something is highly unpredictable.
>         You can have a branch with a 50/50 chance of being taken or
>         not, but the pattern is just as important for hardware branch
>         prediction. For example, a loop runs for 100 iterations with
>         the first 50 taking the jump and the last 50 not; the branch
>         predictor should do fine here even though the frequency as
>         recorded by hotspot will simply say 50/50 (correct me if I'm
>         wrong though).
>
>     I'm afraid you're right.
>
>         So it really seems that hotspot can only use profiling to
>         determine very high probability of branches taken, but is not
>         able to say much else for other cases because it doesn't model
>         prediction the same way as the cpu does (nor should it).
>
>     Maybe it should, but I'm afraid it's just too complicated and
>     possibly not publicly known. I guess hotspot could recognize some
>     simple patterns, which are sure to be recognized by the CPU as
>     well.
>
>     Maybe considering the condition's origin might be a good heuristic:
>     conditions dependent on array content are probably less predictable
>     than those depending, let's say, on the counter only.
>
>         Given that, it seems like code gen should prefer to emit jumps
>         almost all the time.
>
>     I must disagree here. Even a 50% branch can be perfectly
>     predictable, but in case it isn't you lose too much. Just
>     extrapolate the slowdown from the first image on
>     http://stackoverflow.com/questions/19689214/strange-branching-performance
>     to 50%. Not using cmov would lead to a slowdown factor of 8.
>
>         There's the other aspect to this which is that if hotspot emits
>         jumps and profiling shows high branch misprediction, the
>         developer can possibly change their code to either remove
>         branches or make them more predictable. If cmov is emitted
>         though, then there's nothing the dev can do (unless you guys
>         modify the intrinsic for math/min to use profile info).
>
>         Am I missing something in my reasoning?
>
>     I'm afraid there's no good way to remove branches in Java. In
>     general you'll get something notably slower than cmov.
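The 50/50-but-predictable case from the thread can be made concrete; this toy example is an illustration, not code from the report:

```java
public class PatternDemo {
    // A branch taken on exactly half the iterations, but in a fixed
    // pattern (first half taken, second half not). HotSpot's profile
    // records only the aggregate ~50% taken frequency, yet a hardware
    // branch predictor handles this regular pattern almost perfectly.
    static int patterned(int[] data) {
        int sum = 0;
        for (int i = 0; i < data.length; i++) {
            if (i < data.length / 2) sum += data[i]; // predictable despite 50/50 profile
            else                     sum -= data[i];
        }
        return sum;
    }
}
```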
02-04-2014

On 4/1/14 4:31 PM, Vitaly Davidovich wrote:
> Thanks for putting the jmh code inline.
>
> Yes, I tend to agree with not forcing cmov in the intrinsic given
> modern hardware (unless, of course, profiling via the interpreter shows
> the branch is highly unpredictable). Perhaps the JIT should see if the
> min/max is executed in a loop body, and if so, consider it predictable
> (and generate jumps); if outside a loop, it probably doesn't matter for
> perf all that much whether it's cmov or jump.
>
> Sent from my phone
>
> On Apr 1, 2014 7:06 PM, "Martin Grajcar" <maaartinus@gmail.com> wrote:
>
>     Answered inline.
>
>     On Tue, Apr 1, 2014 at 11:58 PM, Vitaly Davidovich
>     <vitalyd@gmail.com> wrote:
>
>         Apologies, meant to reply to the list.
>
>         Sent from my phone
>
>         On Apr 1, 2014 5:48 PM, "Vitaly Davidovich"
>         <vitalyd@gmail.com> wrote:
>
>             I can't see the attachment (on my phone) but I'm guessing
>             the jumps generated by manual code are highly predicted?
>             What if you try it with an array of random values?
>
>     The input array is random:
>
>         @Setup public void setUp() {
>             final Random random = new Random();
>             for (int i = 0; i < table.length; ++i) table[i] = random.nextInt();
>         }
>
>     The whole benchmark is this loop:
>
>         @GenerateMicroBenchmark public int intrinsic() {
>             int result = Integer.MIN_VALUE;
>             for (final int x : table) result = Math.max(result, x);
>             return result;
>         }
>
>     The values are random, but the branch gets more and more
>     predictable as result approaches the real maximum.
>
>             I'm guessing the cmov based intrinsics only win on (a) cpus
>             with poor branch prediction, (b) unpredictable branches, or
>             (c) code with lots of branches clogging up the branch
>             history buffer.
>
>     Agreed, but my point was not to /force/ using cmov for Math.max
>     when the compiler can do it anyway (though there are cases when it
>     doesn't although it should, like
>     http://stackoverflow.com/questions/19689214/strange-branching-performance).
>
>             Also, is the compiler generating larger code when using
>             jumps? If so, icache pressure could be an issue; I don't
>             think a microbenchmark will capture that though.
>
>     I'd guess the code size is about the same. Anyway, this
>     microbenchmark is really tiny.
02-04-2014

On Apr 1, 2014 5:48 PM, "Vitaly Davidovich" <vitalyd@gmail.com> wrote:

    I can't see the attachment (on my phone) but I'm guessing the jumps
    generated by manual code are highly predicted? What if you try it
    with an array of random values?

    I'm guessing the cmov based intrinsics only win on (a) cpus with poor
    branch prediction, (b) unpredictable branches, or (c) code with lots
    of branches clogging up the branch history buffer.

    Also, is the compiler generating larger code when using jumps? If so,
    icache pressure could be an issue; I don't think a microbenchmark
    will capture that though.

    Sent from my phone

    On Apr 1, 2014 5:30 PM, "Martin Grajcar" <maaartinus@gmail.com> wrote:

        Done and attached. The results are pretty clear:

        Benchmark                          Mode  Samples   Mean  Mean error  Units
        o.o.j.s.JmhMaxBenchmark.intrinsic  avgt      100  0.922       0.001  ns/op
        o.o.j.s.JmhMaxBenchmark.own        avgt      100  0.408       0.005  ns/op

        openjdk version "1.9.0-internal"
        OpenJDK Runtime Environment (build 1.9.0-internal-maaartin_2014_02_18_06_05-b00)
        OpenJDK 64-Bit Server VM (build 25.0-b62, mixed mode)
        model name : Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz

        Regards,
        Martin.

        On Tue, Apr 1, 2014 at 10:42 AM, Andrew Haley <aph@redhat.com> wrote:

            On 04/01/2014 07:02 AM, Martin Grajcar wrote:
            > I wonder if these intrinsics still make sense when the
            > compiler can generate movcc itself. According to this SO
            > question
            > http://stackoverflow.com/questions/22752198/java-math-min-max-performance
            > it can lead to a factor two slowdown (I guess a JMH
            > benchmark would confirm this).

            Exactly. Will you do it?

            Andrew.
02-04-2014