Bug ID: JDK-6440250 On Windows System.nanoTime() may be 25x slower than System.currentTimeMillis()

FULL PRODUCT VERSION :
applies to all there:

java version "1.6.0-beta2-fastdebug"
Java(TM) SE Runtime Environment (build 1.6.0-beta2-fastdebug-b84)
Java HotSpot(TM) Server VM (build 1.6.0-beta2-fastdebug-b84-debug, mixed mode)

java version "1.6.0-beta2-fastdebug"
Java(TM) SE Runtime Environment (build 1.6.0-beta2-fastdebug-b84)
Java HotSpot(TM) Client VM (build 1.6.0-beta2-fastdebug-b84-debug, mixed mode)

java version "1.5.0_06"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)
Java HotSpot(TM) Client VM (build 1.5.0_06-b05, mixed mode)


ADDITIONAL OS VERSION INFORMATION :
Microsoft Windows XP [Version 5.1.2600]

EXTRA RELEVANT SYSTEM CONFIGURATION :
Athlon64 X2 4400+  with  /usepmtimer  in boot.ini

A DESCRIPTION OF THE PROBLEM :
System.nanoTime() is more then 25x times slower then System.currentTimeMillis().

If I use nanoTime() for benchmarking it slows down my application by a huge factor.
If I use currentTimeMillis() then I don't have enough time precision.


STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Run the test case

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
nanoTime:          169 ns
currentTimeMillis: 57 ns

ACTUAL -
nanoTime:          1659 ns
currentTimeMillis: 57 ns


REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
class PerfTest {
    private static final int COUNT = 1000000;
    public static void main(String[] args) {
        long start = System.nanoTime();
        long end = start;
        for(int i=0 ; i<COUNT ; i++) {
            end = System.nanoTime();
        }
        System.out.println("nanoTime:          " + (end-start) / COUNT + " ns");
        
        long dummy = 0;
        start = System.nanoTime();
        for(int i=0 ; i<COUNT ; i++) {
            dummy = System.currentTimeMillis();
        }
        end = System.nanoTime();
        System.out.println("currentTimeMillis: " + (end-start) / COUNT + " ns");
    }
}
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Use a JNI DLL which uses the processor counter (RDTSC) and normalize the counter after the measurement - but this has then JNI overhead which is not the case for currentTimeMillis() because it is inlined by the JVM.

EVALUATION See also: http://support.microsoft.com/?id=896256 It seems synchronization of the TSC on MP or HT systems is a hotfix for XP SP2. The general recommendation by Microsoft is to use the /usepmtimer switch to base QueryPerformanceCounter on the ACPI power management timer rather than the TSC.
21-06-2006
WORK AROUND If you are forcing use of the ACPI power management timer through the /usepmtimer boot.ini flag then you can switch back to using the TSC by removing this flag. Of course this means that any problems your system may experience through use of an unreliable TSC value will be experienced - so "caveat emptor": choosing speed over correctness is generally the wrong choice. In limited circumstances, for example if you have constrained the CPU affinity for the JVM threads to run on only one processor or core, and you know you have a stable TSC frequency, then you may be able to use the TSC safely. Of course this is a system wide setting so you need to ensure that all applications on your system will run correctly if the TSC is used.
21-06-2006
EVALUATION ------------------------------------------- This isn't a benchmarking problem - run with -Xint to see the same results. On Windows, System.currentTimeMillis() is implemented using GetSystemTimeAsFileTime, which essentially just reads the low resolution time-of-day value that windows maintains. Reading this global variable is naturally very quick - around 6 cycles according to reported information. In contrast System.nanoTime() is implemented using the QueryPerformanceCounter API (if available, else it returns currentTimeMillis*10^6). QueryPerformanceCounter is implemented in different ways depending on the hardware you are running on. The default implementation is determined by the HAL, but some systems allow you to explicitly control it using options in boot.ini, such as the /usepmtimer switch that the originator refers to. QueryPerformanceCounter will use either the programmable-interval-timer (PIT), or the ACPI power management timer (PMT), or the CPU-level timestamp-counter (TSC). Accessing the PIT/PMT requires execution of slow I/O port instructions and as a result the execution time for QPC is in the order of microseconds. In contrast reading the TSC is on the order of 100 clock cycles (to read the TSC from the chip and convert it to a time value based on the operating frequency). What the originator is observing (as they should expect given the use of /usepmtimer) is the use of the ACPI power management timer, which takes over a microsecond to read. You can tell if your system uses the ACPI PMT by checking if QueryPerformanceFrequency returns the signature value of 3,579,545 (ie 3.57MHz). If you see a value around 1.19Mhz then your system is using the old 8245 PIT chip. While the TSC is the fastest way to read a high resolution timestamp value, it has problems that make it an unreliable timing source on many systems: - its update rate is determined by the CPU frequency, which on many systems in not a constant (particularly laptops but also many desktop systems with advanced power management support - Windows XP uses ACPI to modify CPU frequency to adapt the power usage of the system to the current workload - hence the TSC should not be used on such systems) - on SMP systems the TSC on different CPU's can be different, hence corrections have to made to account for threads changing CPU's between two timestamp readings According to some sources Windows will synchronize the TSC across different processors, but those same sources also recommend ensuring your start and end timestamp occur on the same processor: http://www.intel.com/cd/ids/developer/asmo-na/eng/209859.htm?prn=y There are also techniques for dealing with frequency variation by using power management API's to disable frequency changing for the duration of the timed section of code, but that requires a paired API of the form: startTimedSection(); <code to measure>; endTimedSection(); which isn't applicable to a simple API like System.nanoTime() or QueryPerformanceCounter itself. Rather than try to solve the TSC problems itself the JVM simply relies on the implementation of QueryPerformanceCounter giving the "best" available result for that platform (Intel themselves recommend this over using RDTSC) - if Windows determines the TSC can be reliably used on your hardware then it will use it, else it will likely use the PMT. Note that QPC, even when not based on TSC can still have problems on some systems eg time leaping forward by several seconds. A good synopsis of the state of Windows timers is given at: http://www.gamedev.net/reference/programming/features/timing/ See also: http://www.mattwalsh.com/twiki/bin/view/Main/HighFrequencyCounterInC
20-06-2006
SUGGESTED FIX There is no suggested fix in general. QueryPerformanceCounter will use the available reliable high resolution time source on a given platform. If that is the TSC then you'll get timer overhead on the order of a few tens of nanoseconds; if it the PMT then you'll get microsecond overhead.
20-06-2006
SUGGESTED FIX Martin Buchholz wrote: > See > 6440250: System.nanoTime() is 25x slower than System.currentTimeMillis() > 6440430: c2 doesn't optimize away pointless calls to nanoTime, > currentTimeMillis > > Michael McMahon wrote: > >>Does anyone know if there is a more efficient way >>to measure elapsed time than System.currentTimeMillis() ? >> >>Trouble is that is a native method, and I believe it is quite >>costly to call it frequently. Or trivial for the compiler writers, but I'm not sure. And I don't know if this is a userspace perf counter or if it really is expensive to access. But John and/or Steve probably do. -Pete >> >>- Michael.
19-06-2006
EVALUATION Writing microbenchmarks is known to be very difficult. Here's a benchmark written using the lessons of years of mistakes which shows that once hotspot gets going, the two methods are approximately equally fast: (Lesson: use warmup like in the benchmark harness below) $ for f in -client -server; do jver 6 jr $f MicroBenchmark.java; done ==> javac -source 1.6 -Xlint:all MicroBenchmark.java ==> java -client -esa -ea MicroBenchmark Method Millis Ratio nanoTime 2692 1.000 currentTimeMillis 2965 1.101 ==> javac -source 1.6 -Xlint:all MicroBenchmark.java ==> java -server -esa -ea MicroBenchmark Method Millis Ratio nanoTime 2314 1.000 currentTimeMillis 1901 0.822 $ cat MicroBenchmark.java class MicroBenchmark { abstract static class Job { private final String name; Job(String name) { this.name = name; } String name() { return name; } abstract void work() throws Throwable; } private static void collectAllGarbage() { try { for (int i = 0; i < 2; i++) { System.gc(); System.runFinalization(); Thread.sleep(10); } } catch (InterruptedException e) { throw new Error(e); } } /** * Runs each job for long enough that all the runtime compilers * have had plenty of time to warm up, i.e. get around to * compiling everything worth compiling. * Returns array of average times per job per run. / private static long[] time0(Job ... jobs) throws Throwable { final long warmupNanos = 10L 1000L * 1000L * 1000L; long[] nanoss = new long[jobs.length]; for (int i = 0; i < jobs.length; i++) { collectAllGarbage(); long t0 = System.nanoTime(); long t; int j = 0; do { jobs[i].work(); j++; } while ((t = System.nanoTime() - t0) < warmupNanos); nanoss[i] = t/j; } return nanoss; } private static void time(Job ... jobs) throws Throwable { long[] warmup = time0(jobs); // Warm up run long[] nanoss = time0(jobs); // Real timing run final String nameHeader = "Method"; int nameWidth = nameHeader.length(); for (Job job : jobs) nameWidth = Math.max(nameWidth, job.name().length()); final String millisHeader = "Millis"; int millisWidth = millisHeader.length(); for (long nanos : nanoss) millisWidth = Math.max(millisWidth, String.format("%d", nanos/(1000L * 1000L)).length()); final String ratioHeader = "Ratio"; int ratioWidth = ratioHeader.length(); String format = String.format("%%-%ds %%%dd %%.3f%%n", nameWidth, millisWidth); String headerFormat = String.format("%%-%ds %%-%ds %%-%ds%%n", nameWidth, millisWidth, ratioWidth); System.out.printf(headerFormat, "Method", "Millis", "Ratio"); // Print out absolute and relative times, calibrated against first job for (int i = 0; i < jobs.length; i++) { long millis = nanoss[i]/(1000L * 1000L); double ratio = (double)nanoss[i] / (double)nanoss[0]; System.out.printf(format, jobs[i].name(), millis, ratio); } } private static int intArg(String[] args, int i, int defaultValue) { return args.length > i ? Integer.parseInt(args[i]) : defaultValue; } private static void deoptimize(long x) { if (x == 289374930472L) throw new Error(); } public static void main(String[] args) throws Throwable { final int iterations = intArg(args, 0, 10000000); time( new Job("nanoTime") { void work() { long x = 0; for (int i = 0 ; i < iterations; i++) x += System.nanoTime(); deoptimize(x);}}, new Job("currentTimeMillis") { void work() { long x = 0; for (int i = 0 ; i < iterations; i++) x += System.currentTimeMillis(); deoptimize(x);}} ); } }
19-06-2006
EVALUATION This is a valid request, but it's possible the impl of this interface is already fully tuned.
19-06-2006

Relates :	JDK-6878481 - Add performance counters in the JDK
Relates :	JDK-6440430 - C2 does not optimize away pointless calls to nanoTime, currentTimeMillis