JDK-8185891 : System.currentTimeMillis() is slow on Linux, especially with the HPET time source
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 8,9,10
  • Priority: P4
  • Status: Closed
  • Resolution: Won't Fix
  • OS: linux
  • CPU: generic
  • Submitted: 2017-08-05
  • Updated: 2023-01-06
  • Resolved: 2019-01-22
Description
A DESCRIPTION OF THE REQUEST :
See http://pzemtsov.github.io/2017/07/23/the-slow-currenttimemillis.html

System.currentTimeMillis() takes about 650 ns on Linux with the HPET time source, compared to 4 ns on Windows. 

The Linux implementation of currentTimeMillis calls gettimeofday. gettimeofday has higher precision than currentTimeMillis needs. clock_gettime with a COARSE clock would be much faster, which would help a lot with the HPET time source, but also with the TSC time source.
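As an illustration only (this is not the actual HotSpot code, and the function name and fallback policy are assumptions), a millisecond wall-clock read along the lines proposed above could look like this in C, preferring CLOCK_REALTIME_COARSE where Linux provides it and falling back to gettimeofday otherwise:

#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>   /* gettimeofday */
#include <time.h>       /* clock_gettime; CLOCK_REALTIME_COARSE on Linux */

/* Illustrative only: millisecond wall-clock read that prefers the coarse
 * clock (a cheap read of a kernel-maintained, tick-resolution value) and
 * falls back to gettimeofday where the coarse clock is not available. */
static int64_t wall_clock_millis(void) {
#ifdef CLOCK_REALTIME_COARSE
    struct timespec ts;
    if (clock_gettime(CLOCK_REALTIME_COARSE, &ts) == 0) {
        return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
    }
#endif
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (int64_t)tv.tv_sec * 1000 + tv.tv_usec / 1000;
}

int main(void) {
    printf("millis = %lld\n", (long long)wall_clock_millis());
    return 0;
}

The coarse clock only advances at kernel tick granularity (a few milliseconds, as the resolution figures later in this report show), which is exactly why it is so cheap to read; that would only be acceptable here because currentTimeMillis already reports whole milliseconds.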

JUSTIFICATION :
If System.currentTimeMillis is called on a performance-sensitive path, the slower behaviour on Linux can be damaging, especially if the HPET timer happens to be the time source. Using a faster implementation would be beneficial with no real drawback, since the granularity of the result is 1 ms anyway.


---------- BEGIN SOURCE ----------
See http://pzemtsov.github.io/2017/07/23/the-slow-currenttimemillis.html for extensive tests and analysis.
---------- END SOURCE ----------


Comments
Just as a reference point, on my Oracle Linux 7.9 cloud system the clock source is kvm-clock. The benchmark (as referenced above) is flawed, though, and can't report the COARSE variants because it runs too few samples for the low-res time to change, but uses that elapsed time to report the call overhead! It needs to use an independent clock to measure the elapsed time of all samples. I reworked an existing crude benchmark I already had and got the following:

CLOCK_MONOTONIC:        1000000 calls took 38470580 nanos (38 ns/call)
CLOCK_MONOTONIC_RAW:    1000000 calls took 302140353 nanos (302 ns/call)
CLOCK_MONOTONIC_COARSE: 1000000 calls took 5453367 nanos (5 ns/call)
CLOCK_REALTIME:         1000000 calls took 38415196 nanos (38 ns/call)
CLOCK_REALTIME_COARSE:  1000000 calls took 7201208 nanos (7 ns/call)
gettimeofday:           1000000 calls took 40051537 nanos (40 ns/call)
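For reference, a minimal sketch of a benchmark along those lines (the call count, output format, and use of CLOCK_MONOTONIC as the independent elapsed-time clock are assumptions; this is not the exact benchmark quoted above):

#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define CALLS 1000000

/* Time CALLS invocations of clock_gettime(id) using CLOCK_MONOTONIC as an
 * independent elapsed-time clock, so even a coarse clock whose value barely
 * changes during the run still gets a meaningful per-call cost. */
static void bench(const char *name, clockid_t id) {
    struct timespec start, end, ts;
    int64_t sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < CALLS; i++) {
        clock_gettime(id, &ts);
        sink += ts.tv_nsec;          /* keep the result live */
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    int64_t nanos = (int64_t)(end.tv_sec - start.tv_sec) * 1000000000L
                  + (end.tv_nsec - start.tv_nsec);
    printf("%s: %d calls took %lld nanos (%lld ns/call) [sink=%lld]\n",
           name, CALLS, (long long)nanos, (long long)(nanos / CALLS),
           (long long)sink);
}

int main(void) {
    bench("CLOCK_MONOTONIC",        CLOCK_MONOTONIC);
    bench("CLOCK_MONOTONIC_RAW",    CLOCK_MONOTONIC_RAW);     /* Linux-specific */
    bench("CLOCK_MONOTONIC_COARSE", CLOCK_MONOTONIC_COARSE);  /* Linux-specific */
    bench("CLOCK_REALTIME",         CLOCK_REALTIME);
    bench("CLOCK_REALTIME_COARSE",  CLOCK_REALTIME_COARSE);   /* Linux-specific */
    return 0;
}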
06-01-2023

Note that we stopped using gettimeofday quite some time ago and will use clock_gettime(CLOCK_REALTIME) if it is available, but that wasn't backported to 8u (something which could now be done, as we should have left behind the ancient Linux systems that didn't support it). But when it comes to the clocksource, as I wrote above in 2017:

> This is not a game that the JVM wants to be involved in at all! We rely on the OS APIs for these things and let it choose what is deemed best (HPET, TSC, ACPI-pm-timer etc).

We actually have code using rdtscp in the VM that is used by JFR for its timestamps, but it is not something considered stable enough to expose through a Java API. Virtualization makes this game even more unpredictable. We have looked at using CLOCK_MONOTONIC_COARSE in the past for nanoTime but didn't see any significant benefit in doing so.

> It seems silly that we have to implement our own time source to get the performance on Linux that we can get on the same hardware running Windows.

Windows chose a fast low-resolution solution for its time source; Linux did not. If you do want to do something different to what the OS will provide then rolling your own is a solution. Can you share (privately if necessary) what your time source solution is?

Adding a new API to the JDK is not something we do lightly and we need to be clear on the cross-platform semantics of the new API. A new API is also not generally a good solution for this kind of hardware issue, as developers have to buy in to using the new API and won't have any way to know when that new API might be preferred. Also, a new API can't be backported, so it would be years before software migrated to use it. In such cases a roll-your-own solution has many benefits.
05-01-2023

I understand that you can't change the behavior of currentTimeMillis. I would note that Windows has no problem implementing it performantly on the same hardware. Even accepting that it is impossible for Linux to implement it performantly, we would be happy to call another method. If such a method were added I would expect java.util.logging and some other internal users to adopt the new method as well.

My data comes from an Oracle Database JDBC performance test. This test creates a large number of worker threads that do multiple queries, updates, and inserts, simulating a web app. When we turn off our time source and use System.currentTimeMillis, the performance for this test drops measurably. It's not a lot, but it is consistent and measurable. Oracle Database customers, like all large-scale Java users, need all the performance they can get.

This article https://pzemtsov.github.io/2017/07/23/the-slow-currenttimemillis.html suggests that there may be a limit to the rate at which currentTimeMillis can run across an entire system. If so, tests which do not reach that limit may not see the performance falloff that occurs with more threads. I have no opinion on the accuracy of this article other than that our experience agrees with it, and our highly concurrent performance test says currentTimeMillis is a problem.

I don't know what the hardware is exactly. It's a remote host running Oracle Linux 6 in a data center. I'm pretty sure it's some flavor of x86. Oracle has migrated hosts multiple times and this issue shows up every time we look for it.

It seems silly that we have to implement our own time source to get the performance on Linux that we can get on the same hardware running Windows. And of course we are using the exact same .jar file on both Windows and Linux; Write Once, Run Anywhere.
05-01-2023

[~dsurber] we cannot replace System.currentTimeMillis() with something that is faster but has a much coarser resolution as that would potentially impact a lot of running code that relies on the existing resolution (rightly or wrongly). Do you have data on the cost of System.currentTimeMillis() and the exact hardware configurations? If you look at this PR: https://github.com/openjdk/jdk/pull/10123 we are talking about 35ns for the native part of the call, and that includes JNI overhead.
04-01-2023

System.currentTimeMillis is widely used to create timestamps, for example the millis field in java.util.logging.LogRecord. Oracle Database JDBC needs timestamps to record event times and to detect timeout and idle conditions. It is common for a single JVM to run tens or hundreds of threads that are all using java.util.logging and Oracle Database JDBC. As a result, System.currentTimeMillis shows up as an appreciable part of the load on Linux systems, though not on Windows. Clearly it is possible for currentTimeMillis to be faster on Linux. And Project Loom will only make this worse: thousands of threads rather than hundreds.

This is a big enough problem that Oracle has implemented a time source class that estimates currentTimeMillis, occasionally calling System.currentTimeMillis to resynchronize. This time source is neither accurate nor precise, but it's good enough. And it is 100 times faster than System.currentTimeMillis in our JDBC benchmarks. It seems silly that we have to provide our own implementation of something as basic as currentTimeMillis, especially since the Windows implementation is perfectly adequate.

Oracle Database JDBC does not need a particularly precise clock. The clock doesn't even need to be strictly monotonic. Ticking 100 times per second is sufficient and accuracy within a few hundredths of a second is adequate. Better is nice, but not at the expense of performance.
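To make the cached-clock idea concrete, here is a minimal C sketch of that pattern (the Oracle class described above is Java and its details are not shown here; the names, the 10 ms refresh period, and the use of a plain background thread are assumptions for illustration):

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

/* Cached wall-clock milliseconds, refreshed by a background thread.
 * Readers pay for a relaxed atomic load instead of a time call. */
static _Atomic int64_t cached_millis;

static int64_t read_system_millis(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (int64_t)tv.tv_sec * 1000 + tv.tv_usec / 1000;
}

static void *updater(void *arg) {
    (void)arg;
    for (;;) {
        atomic_store_explicit(&cached_millis, read_system_millis(),
                              memory_order_relaxed);
        usleep(10000);   /* resynchronize every ~10 ms (about 100 ticks/sec) */
    }
    return NULL;
}

/* Fast, coarse timestamp read for hot paths. */
static int64_t coarse_millis(void) {
    return atomic_load_explicit(&cached_millis, memory_order_relaxed);
}

int main(void) {
    pthread_t t;
    atomic_store(&cached_millis, read_system_millis());
    pthread_create(&t, NULL, updater, NULL);
    usleep(25000);
    printf("coarse millis = %lld\n", (long long)coarse_millis());
    return 0;
}

As the comment notes, such a clock is neither accurate nor strictly monotonic, but for timestamps that only need ~10 ms granularity a relaxed atomic load is far cheaper than calling gettimeofday or clock_gettime on every read, which is essentially the trade-off Windows makes globally.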
03-01-2023

Runtime Triage: This is not on our current list of priorities. We will consider this feature if we receive additional customer requirements.
22-01-2019

This isn't a bug, so I changed the issue to be an enhancement request. Also lowered the priority to P4. The impact is Low IMHO: at 650 ns we're talking about 0.065% overhead - hardly significant in the kind of code currentTimeMillis() is intended for - and it isn't intended for high-frequency timing loops.

Congratulations on a very good write-up. You uncovered all the necessary gory details. Windows is blindingly fast because you just read a global variable that is updated asynchronously. Linux could have done something similar, but they don't and instead have a very complex clock/timer subsystem. The TSC is a faster time source than HPET or the ACPI-pm-timer, but it has been very problematic. While raw hardware now often supports a frequency-invariant TSC, there can still be issues with synchronizing the TSC across cores/processors, and even worse, virtualized environments often break TSC emulation (adding back all the old problems that had been fixed). This is not a game that the JVM wants to be involved in at all! We rely on the OS APIs for these things and let the OS choose what is deemed best (HPET, TSC, ACPI-pm-timer etc.).

The use of gettimeofday for the millisecond time-of-day timer is historical: for a long time it was all there was, and even if clock_gettime(CLOCK_REALTIME) was available it was something we had to dynamically check for at runtime, and there was no real motivation or incentive to switch away from using gettimeofday. Even with JDK 10 we still have to account for older POSIX systems which may not have these APIs and/or don't support the required CLOCK_* variants. That said, if clock_gettime(CLOCK_REALTIME_COARSE) is significantly better than gettimeofday then we could look into using it. Though one would hope that if there is a significant difference in performance, and they both provide the same "time line", then Linux would ensure they function the same.

There is another excellent article on Linux clocks etc. at http://btorpey.github.io/blog/2014/02/18/clock-sources-in-linux/ including some benchmarking code.
Here's the results of that code (with the addition of gettimeofday) on my system: "Ubuntu 12.04 LTS" Linux xxxx 3.2.0-24-generic #39-Ubuntu SMP Mon May 21 16:52:17 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

+ ./clocks
clock                     res (ns)           secs        nsecs
gettimeofday                 1,000  1,502,087,989  149,067,000
CLOCK_REALTIME                   1  1,502,087,989  149,112,824
CLOCK_REALTIME_COARSE    4,000,250  1,502,087,989  145,767,803
CLOCK_MONOTONIC                  1     18,960,460  713,125,658
CLOCK_MONOTONIC_RAW              1     18,959,370  211,892,996
CLOCK_MONOTONIC_COARSE   4,000,250     18,960,460  709,741,445

+ taskset -c 1 ./ClockBench
Method                   samples     min      max     avg  median   stdev
gettimeofday                 255    0.00  1000.00   50.98  500.00  219.96
CLOCK_REALTIME               255   39.00    73.00   41.48   56.00    2.47
CLOCK_REALTIME_COARSE        255    0.00     0.00    0.00    0.00    0.00
CLOCK_MONOTONIC              255   42.00    76.00   42.74   59.00    2.14
CLOCK_MONOTONIC_RAW          255  145.00   178.00  147.39  161.50    2.32
CLOCK_MONOTONIC_COARSE       255    0.00     0.00    0.00    0.00    0.00
cpuid+rdtsc                  255   96.00   104.00  103.00  100.00    2.65
rdtscp                       255   32.00    40.00   34.01   36.00    3.47
rdtsc                        255   24.00    24.00   24.00   24.00    0.00
Using CPU frequency = 1.000000

ClockBench.java:
Method                   samples     min      max     avg  median   stdev
System.nanoTime              255  265.00   283.00  267.47  274.00    4.05
CLOCK_REALTIME               255  268.00   286.00  268.87  277.00    1.34
cpuid+rdtsc                  255  120.00   176.00  123.26  148.00    5.10
rdtscp                       255   56.00   112.00   58.16   84.00    4.81
rdtsc                        255   40.00    88.00   41.19   64.00    3.96
Using CPU frequency = 1.000000

So something is not right with the _COARSE variants on my machine! I didn't look too deeply into the methodology of the benchmark.
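The comment above mentions that clock_gettime had to be dynamically checked for at runtime on older Linux systems. As a rough illustration only (the dlopen/dlsym approach and the librt library name are assumptions here, not HotSpot's actual initialization code), such a probe could look like:

#define _GNU_SOURCE      /* for RTLD_DEFAULT */
#include <dlfcn.h>
#include <stdio.h>
#include <time.h>

typedef int (*clock_gettime_fn)(clockid_t, struct timespec *);

int main(void) {
    /* Probe for clock_gettime at runtime; on very old systems it lived in
     * librt rather than libc, or was missing entirely. */
    clock_gettime_fn cg =
        (clock_gettime_fn)dlsym(RTLD_DEFAULT, "clock_gettime");
    if (cg == NULL) {
        void *librt = dlopen("librt.so.1", RTLD_LAZY);
        if (librt != NULL) {
            cg = (clock_gettime_fn)dlsym(librt, "clock_gettime");
        }
    }
    if (cg != NULL) {
        struct timespec ts;
        if (cg(CLOCK_REALTIME, &ts) == 0) {
            printf("clock_gettime(CLOCK_REALTIME) = %lld.%09ld\n",
                   (long long)ts.tv_sec, ts.tv_nsec);
        }
    } else {
        printf("clock_gettime unavailable; fall back to gettimeofday\n");
    }
    return 0;
}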
07-08-2017

This is a very good report; it contains a good source of information (http://pzemtsov.github.io/2017/07/23/the-slow-currenttimemillis.html) to demonstrate the issue. I took a small test case (the attached one) and executed it on Windows and Linux under JDK 9 to get the following results:

Windows == Sum = 2634114345848286594; time = 1483; or 14.83 ns / iter
Linux   == Sum = 2634106325227250724; time = 15164; or 151.64 ns / iter

The result shows Linux took about 10 times longer than Windows.
07-08-2017