JDK-5005861 : native libraries for java.lang.Math should be non-strict
  • Type: Enhancement
  • Component: core-libs
  • Sub-Component: java.lang
  • Affected Version: 1.4.2,5.0,6
  • Priority: P4
  • Status: Closed
  • Resolution: Won't Fix
  • OS: windows_2000,windows_xp
  • CPU: x86
  • Submitted: 2004-03-01
  • Updated: 2016-04-16
  • Resolved: 2016-04-16
Description
Name: rmT116609			Date: 03/01/2004


A DESCRIPTION OF THE REQUEST :
The transcendental functions in java.lang.Math, such as the trigonometric and
exponential methods, have dismal performance when compared to similar
functions from C/C++ libraries (or virtually any compiled language).
Java's performance is limited by the stringent requirements of IEEE
adherence, but this should impose only a modest overhead over other
languages. Investigating the SCSL sources, it's easy to see that the root of
all slowness lies in the fdlibm native library, which implements the
native methods of java.lang.StrictMath with an approach comparable
to emulated FP; that is, the library doesn't exploit the transcendental
opcodes of FPUs, not even for the "kernel" of the calculation that an FPU could do.

For example, java.lang.StrictMath.tan() calls a JNI stub, which calls
s_tan.c:tan(), which calls k_tan.c:__kernel_tan().  These two C functions
implement the tangent with a complicated algorithm that never makes use
of the FPU's tangent opcode (e.g., FPTAN on Intel chips). This doesn't make
the algorithm FPU-free, because the C compiler still generates the
architecture-specific opcodes for the fundamental operations used in fdlibm,
like +, -, *, /; it's only the critical FPTAN instruction that's avoided, but that's the
one instruction that could replace 90% of the C code -- and run an order
of magnitude faster than that code. In fact, stepping through a C/C++
compiler's equivalent tan() library function in a debugger, it's possible to see that FPTAN
does all the hard work (with a few extra instructions to load arguments or
handle special cases like infinities).
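
For the record, the strict entry point is just a native method; this sketch
paraphrases the shape of the code described above rather than quoting it:

 // java.lang.StrictMath (paraphrased): all the work happens in native fdlibm code.
 public static native double tan(double a); // JNI stub -> s_tan.c:tan() -> k_tan.c:__kernel_tan()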

This approach in fdlibm is probably the only way to implement "strict"
math, so that results are bit-for-bit identical on all platforms; the
microcode in FPUs often uses simpler algorithms than those in fdlibm,
so it may produce less precise results or fail to meet all requirements of
the Java spec, such as semi-monotonicity.

Unfortunately, all math is strict in Java, because the transcendental
methods in java.lang.Math are all stubs that invoke the same methods in
java.lang.StrictMath.  I can understand the tradeoff in StrictMath, but
in the standard ("loose") Math class, we should favor best performance.
Notice that most of the semantics dictated by the specification of these
functions (i.e., support for all special values) could still be implemented
with checks performed before invoking the FPU instruction.  It's possible
[I'm not a math expert] that 1-ulp precision and semi-monotonicity cannot
be implemented easily (in the general case (*)) around the FPU opcode;
but if that is a problem, the spec should be fixed: these requirements
clearly exceed the needs of loose math.
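
To illustrate, the current stub and the kind of loose fast path I'm suggesting
would look roughly like this (a sketch, not actual JDK code; fpuTan is a
hypothetical intrinsic, and the special-value checks are only indicative):

 // Today (paraphrased): Math.tan() simply forwards to the strict version.
 public static double tan(double a) {
  return StrictMath.tan(a);
 }

 // Suggested shape of a loose version: screen out special values in Java,
 // then let the JVM map the common case to the FPU opcode.
 public static double looseTan(double a) {
  if (Double.isNaN(a) || Double.isInfinite(a)) {
   return Double.NaN;  // spec: NaN or infinite argument yields NaN
  }
  return fpuTan(a);    // hypothetical intrinsic compiling down to FPTAN
 }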

In a similar fashion, the methods min() and max() should be relieved of
the demands of supporting NaN and signed zeros; because of these,
Java's min()/max() are a lot slower than in most other languages (where
similar functions are defined trivially, as "x >= y ? x : y").  In fact, most
Java developers completely ignore the overhead of min()/max(), because such
methods are so common in all languages that a newcomer to Java would
hardly care to read the docs or the sources (what could be simpler than
min and max?). But min/max can be critical in some tight loops (e.g., DSP).
Once again, I think java.lang.StrictMath is okay (it's great to have a
max() that reports that +0 is bigger than -0, if I ever need this); but in
java.lang.Math, the spec should enable the trivial implementation, as
sketched below.
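
A sketch of the difference (the conformant version paraphrases what the spec
requires, not the JDK's exact code):

 // Conformant max(): must propagate NaN and treat +0.0 as larger than -0.0.
 static double specMax(double a, double b) {
  if (Double.isNaN(a) || Double.isNaN(b)) return Double.NaN;
  if (a == 0.0 && b == 0.0) {
   // -0.0 == +0.0 numerically, so distinguish them by raw bits
   return (Double.doubleToRawLongBits(a) == Double.doubleToRawLongBits(-0.0d)) ? b : a;
  }
  return (a >= b) ? a : b;
 }

 // Trivial max(), as most languages define it: one compare, one branch.
 static double trivialMax(double a, double b) {
  return (a >= b) ? a : b;
 }

The two differ only at the edges: specMax(-0.0, +0.0) is +0.0 while
trivialMax(-0.0, +0.0) is -0.0, and only specMax propagates NaN.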

(*) When the FPU can do calculations with extended precision -- e.g., on the
Pentium, evaluating float operations in 64-bit precision or double operations
in 80-bit precision -- it's highly likely that the results of transcendental
opcodes will be exact at least to the last bit used by the Java type (32 bits
for floats, 64 for doubles). This means that the current requirements of
precision and semi-monotonicity would certainly be satisfied, and only
bit-for-bit reproducibility of results might be lost, so this optimization
could be performed without any spec changes, at least on architectures
supporting extended-precision FP. Even on platforms not supporting FP numbers
wider than 64 bits, the optimization could be done for floats at the very least.
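
The float case needs no special hardware at all: compute in double and round
once at the end. A minimal sketch of this "evaluate wide, round narrow" idea
(the name tanf is mine, not a JDK API):

 // The double result carries ~29 extra bits of precision, so the final
 // rounding to float is almost always correctly rounded, and monotonicity
 // of the double routine is preserved.
 static float tanf(float x) {
  return (float) Math.tan(x);
 }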

JUSTIFICATION :
Many years have gone by since the introduction of StrictMath/strictfp,
so the people who really care about such wizardry as signed zeroes or
multiplatform-reproducible results are either using StrictMath explicitly,
or deserving any problem they could have with the relaxation of the loose
methods.  On the other hand, the people who don't need those advanced
features but need maximum FP performance currently have no choice
(except for custom math libs, and most would require custom JNI code).
If there is a risk that further relaxing of java.lang.Math could break some
apps (that *shouldn't* be using this class in the first place), this is certainly
a lesser problem than the severe hit currently imposed for those who want
to code scientific and engineering apps, or advanced 3D games, in Java.



EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Performance should be similar to C/C++; for example, benchmarking tan()
shows a performance of 12 ns per call (VC++ 2003, Pentium IV 2.25 GHz).
ACTUAL -
The Java version of this benchmark shows 148 ns per call to tan() (with either
java.lang.Math or java.lang.StrictMath).  This is 12x slower (HotSpot 1.4.2 -server -Xbatch).  The same problem happens with several other methods in java.lang.Math.

---------- BEGIN SOURCE ----------
Java benchmark:

public class Tan {
 // Each bench runs 1e7 calls; elapsed ms * 1e6 (ns/ms) / 1e7 (calls) = elapsed/10 ns per call.
 // Results are intentionally discarded; see the note below on verifying that tan() is really called.
 static void benchLoose() {
  long t = System.currentTimeMillis();
  for (double i = 0; i < 10000000.0; ++i) Math.tan(i);
  System.out.println("Loose:  " + (System.currentTimeMillis() - t) / 10 + "ns/call");
 }
 static strictfp void benchStrict() {
  long t = System.currentTimeMillis();
  for (double i = 0; i < 10000000.0; ++i) StrictMath.tan(i);
  System.out.println("Strict: " + (System.currentTimeMillis() - t) / 10 + "ns/call");
 }
 public static void main(String[] args) {
  while (true) {
   benchLoose();
   benchStrict();
  }
 }
}

C++ benchmark:

#include <cmath>
#include <cstdio>
#include <ctime>

void bench() {
 std::clock_t t = std::clock();
 for (double i = 0; i < 10000000.0; ++i) std::tan(i);
 // Convert ticks to ms, then divide by 10: 1e7 calls -> elapsed_ms/10 ns per call.
 double ms = (std::clock() - t) * 1000.0 / CLOCKS_PER_SEC;
 std::printf("C++:  %fns/call\n", ms / 10.0);
}

int main() {
 while (true)
  bench();
}

Note: even though both are microbenchmarks and I tested with full optimization,
I have analysed the generated code to make sure that tan() is actually being
called and not optimized away.
---------- END SOURCE ----------
(Incident Review ID: 234064) 
======================================================================

Comments
CONVERTED DATA
BugTraq+ Release Management Values -- COMMIT TO FIX: mustang
27-09-2004
EVALUATION
Thank you for your interest in Java numerics. A few preliminaries:

1. The language's default floating-point semantics vs. strictfp floating-point semantics, and the semantics of the Math vs. StrictMath libraries, are separate issues that are not directly related.

2. Neither the original 1985 IEEE 754 floating-point standard nor the draft revised standard covers the behavior of any "interesting" math library functions other than square root. The 754 revision committee explicitly decided not to attempt to specify the behavior of "easy" functions like log and exp. However, the 754 revision has added min/max operations that require "proper" treatment of signed zeros, etc., similar to the java.lang.{Strict}Math specifications for those operations.

One of the driving design goals of Java's original floating-point semantics was bit-wise cross-platform reproducibility for both basic arithmetic operations and math library results. Specifying such a result for the basic arithmetic operations is relatively easy because of the ubiquity of 754 and because the "correct" answer is easy to understand and feasible to achieve: assuming the inputs are exact, return the floating-point number nearest to the exact mathematical result. While this notion of a "correctly rounded" function is as applicable to {+, -, *, /} as to sin, cos, exp, etc., creating usably fast correctly rounded implementations of the latter class of functions remains something of a research problem. Therefore, to achieve the same level of reproducibility for math library functions, it is necessary to use a particular algorithm. Java (and Matlab) use the fdlibm libraries for this purpose.

Starting in 1.3, the StrictMath class required use of the fdlibm algorithms, while the corresponding methods in the Math class were allowed to use any implementation that meets the stated quality-of-implementation requirements, usually in terms of accuracy and monotonicity. The fdlibm algorithms satisfy the Math quality-of-implementation requirements, so the current *source code*, which delegates Math calls to StrictMath, is correct. However, how an optimizing JVM, like HotSpot, implements the math library might bear little resemblance to the source code. For example, the sqrt method is commonly replaced by the square root instruction on the CPU in question. This "intrinsification" optimization is done in the JVM and is not represented in source code.

We are aware of the almabench results and an osnews article on trigonometric performance. However, the HotSpot implementation of sin/cos on x86 has for years used, and continues to use, the fsin/fcos x87 instructions in the range where those instructions meet the quality-of-implementation requirements, basically [-pi/4, pi/4]. Outside of that range, the results of fsin/fcos can be anywhere in the range [-1, 1] with little relation to the true sine/cosine of the argument. For example, fsin(Math.PI) only gets about half the digits of the result correct. The reason is that the fsin/fcos instruction implementations use a less-than-ideal algorithm for *argument reduction*; the argument reduction process is explained in bug 4857011. Although the range [-pi/4, pi/4] covers more than half of all floating-point values, osnews et al. benchmark outside of this range. For larger values, higher-precision arithmetic is needed to compute the (nearly) right answer. Therefore, outside of [-pi/4, pi/4], the speed of a compliant java.lang.Math sin/cos routine will largely be a function of the argument's exponent.

Neither osnews nor almabench examines the numerical results, so the quality of the implementations is not examined. Intrinsifying more math library functions should be investigated in a future release. ###@###.### 2004-03-11

Tan, ln, and log10 have all been intrinsified in Mustang. ###@###.### 2004-09-16
11-03-2004
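
To see the argument-reduction point concretely: Math.PI only approximates pi, so the true sine of the double value Math.PI is tiny but nonzero, roughly pi - Math.PI; reproducing it requires knowing pi to more than 64 bits during reduction. A quick check (an illustration added here, not part of the evaluation):

 public class SinPi {
  public static void main(String[] args) {
   // A compliant implementation prints ~1.2246467991473532E-16, the
   // correctly rounded sine of the double closest to pi. A bare x87
   // fsin of the same argument gets only about half the digits right,
   // as the evaluation notes.
   System.out.println(Math.sin(Math.PI));
  }
 }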