Name: rmT116609			Date: 03/01/2004


A DESCRIPTION OF THE REQUEST :
The transcendental functions in java.lang.Math, like the trigonometric and exponential methods, have dismal performance compared to the similar functions in C/C++ libraries (or virtually any compiled language). Java's performance is limited by the stringent requirements of IEEE adherence, but this should impose only a modest overhead over other languages.

Investigating the SCSL sources, it's easy to see that the root of the slowness lies in the native fdlibm library, which implements the native methods of java.lang.StrictMath with an approach comparable to software-emulated FP; that is, the library doesn't exploit the transcendental opcodes of FPUs, not even for the "kernel" of the calculation that an FPU could do. For example, java.lang.StrictMath.tan() calls a JNI stub, which calls s_tan.c:tan(), which calls k_tan.c:__kernel_tan(). These two C functions implement the tangent with a complicated algorithm that never uses the FPU's tangent opcode (e.g., FPTAN on Intel chips). This doesn't make the algorithm FPU-free, because the C compiler still generates the architecture-specific opcodes for the fundamental operations used in fdlibm, like +, -, * and /; only the critical FPTAN instruction is avoided -- and that is the one instruction that could replace 90% of the C code and run an order of magnitude faster than that code. In fact, stepping through a C/C++ compiler's own tan() library function in a debugger, it's possible to see that FPTAN does all the hard work (with a few extra instructions to load arguments and handle special cases like infinities).

This approach in fdlibm is probably the only way to implement "strict" math, so that results are bit-for-bit identical on all platforms; the microcode in FPUs often uses simpler algorithms than those in fdlibm, so it may produce less precise results or fail to meet all the requirements of the Java spec, such as semi-monotonicity. Unfortunately, all math in Java is strict, because the transcendental methods in java.lang.Math are all stubs that invoke the same methods in java.lang.StrictMath. I can understand the tradeoff in StrictMath, but in the standard ("loose") Math class, we should favor the best performance. Notice that most of the semantics dictated by the specification of these functions (i.e., support for all special values) could still be implemented with checks performed before invoking the FPU instruction; a tan() sketch of this idea appears below. It's possible [I'm not a Math expert] that 1-ulp precision and semi-monotonicity cannot be implemented easily (in the general case (*)) around the FPU opcode; but if this is a problem, the spec should be fixed: these requirements clearly exceed the needs of loose math.

In a similar fashion, the methods min() and max() should be relieved of the demands of supporting NaN and signed zeros; because of these, Java's min()/max() are a lot slower than in most other languages (where similar functions are defined trivially, as "x >= y ? x : y"). In fact, most Java developers completely ignore the overhead of min()/max(), because such methods are so common in all languages that a newcomer to Java would hardly care to read the docs or the sources (what could be simpler than min and max?). But min/max can be critical in some tight loops (e.g., DSP). Once again, I think java.lang.StrictMath is okay (it's great to have a max() that reports that +0 is bigger than -0, if I ever need this); but in java.lang.Math, the spec should enable the trivial implementation (see the min()/max() sketch below).

(*) When the FPU can do calculations in extended precision -- e.g., on the Pentium, computing float results in 64-bit registers or double results in 80-bit registers -- it's highly likely that the results of the transcendental opcodes will be exact at least to the last bit used by the Java type (32 bits for float, 64 for double). This means that the current precision and semi-monotonicity requirements would certainly be satisfied, and only bit-for-bit reproducibility of results might be lost; so this optimization could be performed without any spec change, at least on architectures supporting extended-precision FP. Even on platforms without FP registers wider than 64 bits, the optimization could at the very least be done for floats.
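To make the pre-check idea concrete, here is a minimal sketch (my illustration, not JDK code): the special values mandated by the Math.tan() spec are handled in Java before delegating to a fast kernel. fastTanKernel() is a hypothetical name; in a real implementation it would be a native method compiled down to the FPU's tangent instruction, and here StrictMath.tan() stands in only so the sketch compiles and runs.

public final class FastTrig {
    public static double tan(double a) {
        // Spec: a NaN or infinite argument yields NaN.
        if (Double.isNaN(a) || Double.isInfinite(a))
            return Double.NaN;
        // Spec: a zero argument yields a zero with the same sign.
        if (a == 0.0)
            return a; // preserves -0.0
        // Common case: let the FPU do the hard work.
        return fastTanKernel(a);
    }

    // Hypothetical fast path: a real implementation would be a native
    // kernel mapping to the FPU's tangent opcode (e.g., x87 FPTAN).
    private static double fastTanKernel(double a) {
        return StrictMath.tan(a); // placeholder only
    }
}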
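Likewise, a sketch of the min()/max() point: specMax() below paraphrases what the current Math.max(double, double) spec demands (it is not the actual JDK source), while looseMax() is the trivial definition most languages use and the loose Math class could adopt.

public final class MaxDemo {
    // Roughly what the spec demands today: NaN propagation, plus
    // treating +0.0 as greater than -0.0.
    static double specMax(double a, double b) {
        if (a != a) return a;              // propagate NaN
        if (b != b) return b;
        if (a == 0.0 && b == 0.0)          // +0.0 must win over -0.0
            return Double.doubleToRawLongBits(a) == 0L ? a : b;
        return a >= b ? a : b;
    }

    // The trivial version: a single compare, ignoring NaN and signed zeros.
    static double looseMax(double a, double b) {
        return a >= b ? a : b;
    }

    public static void main(String[] args) {
        System.out.println(specMax(-0.0, 0.0));   // 0.0 (+0.0 wins, per spec)
        System.out.println(looseMax(-0.0, 0.0));  // -0.0 (sign of zero ignored)
    }
}

In a tight loop, looseMax() is one compare and select, while specMax() adds branches and bit inspection to every call.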
JUSTIFICATION :
Many years have gone by since the introduction of StrictMath/strictfp, so the people who really care about such wizardry as signed zeros or multiplatform-reproducible results are either using StrictMath explicitly, or deserve any problems they may get from a relaxation of the loose methods. On the other hand, the people who don't need those advanced features but do need maximum FP performance currently have no choice (except for custom math libraries, most of which would require custom JNI code). If there is a risk that further relaxing java.lang.Math could break some apps (which *shouldn't* be using this class in the first place), that is certainly a lesser problem than the severe hit currently imposed on those who want to write scientific and engineering apps, or advanced 3D games, in Java.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Performance should be similar to C/C++; for example, benchmarking tan() shows a performance of 12ns per call (VC++ 2003, Pentium IV 2.25GHz).
ACTUAL -
The Java version of this benchmark shows 148ns per call to tan() (with either java.lang.Math or java.lang.StrictMath). This is 12X slower (HotSpot 1.4.2 -server -Xbatch). The same problem happens with several other methods in java.lang.Math.

---------- BEGIN SOURCE ----------
Java benchmark:

public class Tan {
    static void benchLoose() {
        long t = System.currentTimeMillis();
        for (double i = 0; i < 10000000.0; ++i)
            Math.tan(i);
        // 10^7 calls: elapsed ms * 10^6 ns / 10^7 calls = elapsed ms / 10 = ns/call
        System.out.println("Loose: " + (System.currentTimeMillis() - t) / 10 + "ns/call");
    }

    static strictfp void benchStrict() {
        long t = System.currentTimeMillis();
        for (double i = 0; i < 10000000.0; ++i)
            StrictMath.tan(i);
        System.out.println("Strict: " + (System.currentTimeMillis() - t) / 10 + "ns/call");
    }

    public static void main(String[] args) {
        while (true) {
            benchLoose();
            benchStrict();
        }
    }
}

C++ benchmark:

#include <cmath>
#include <cstdio>
#include <ctime>
using namespace std;

void bench() {
    clock_t t = clock();
    for (double i = 0; i < 10000000.0; ++i)
        tan(i);
    // Convert clock ticks to ms, then ms / 10 = ns/call for 10^7 calls.
    double ms = (clock() - t) * 1000.0 / CLOCKS_PER_SEC;
    printf("C++: %fns/call\n", ms / 10.0);
}

int main() {
    while (true)
        bench();
}

Note: even though both are microbenchmarks and I test with full optimization, I have analysed the generated code to be sure that tan() is really being called.
---------- END SOURCE ----------
(Incident Review ID: 234064)
======================================================================