JDK-6900441 : PlatformEvent.park(millis) on Linux could still be affected by changes to the time-of-day clock
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: e5.0u21,hs24,hs25,6,6u29,7
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: linux,linux_ubuntu
  • CPU: generic,x86,ppc
  • Submitted: 2009-11-11
  • Updated: 2019-06-17
  • Resolved: 2013-09-24
Fix versions:
  • JDK 6: 6u71 (Fixed)
  • JDK 7: 7u60 (Fixed)
  • JDK 8: 8 b109 (Fixed)
  • Other: hs25 (Fixed)
This is a continuation of 6311057 which only partially "fixed" the problem because Linux itself is operating incorrectly in this area. I have it on good authority that "The futex wait implementation is historically wrong." and that "I think with newer kernel/glibc combinations it will be correct."

The real fix here is to change the pthread_cond so that it is associated with the monotonic clock (pthread_condattr_setclock(CLOCK_MONOTONIC) - and with pthread_cond_init being passed the attr object) and to calculate the absolute time as "millis from now" on the monotonic clock. This will make the PlatformEvent::park(millis) code immune to changes in the time-of-day clock.
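The approach described above can be sketched in C as follows. This is a minimal illustration, not the HotSpot code: the event_t type and the event_init, monotonic_deadline and event_park functions are hypothetical names, and the sketch assumes a Linux/glibc system where pthread_condattr_setclock and CLOCK_MONOTONIC are available.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <time.h>

#define NANOS_PER_SEC 1000000000L

/* Hypothetical event type for illustration; not HotSpot's PlatformEvent. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int             signaled;
} event_t;

/* Initialise the condition variable so that timed waits are measured
   against CLOCK_MONOTONIC instead of the default CLOCK_REALTIME. */
int event_init(event_t *ev) {
    pthread_condattr_t attr;
    int rc = pthread_condattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
    if (rc == 0) rc = pthread_cond_init(&ev->cond, &attr);
    pthread_condattr_destroy(&attr);
    if (rc != 0) return rc;
    ev->signaled = 0;
    return pthread_mutex_init(&ev->mutex, NULL);
}

/* Compute "millis from now" as an absolute time on the monotonic clock.
   Assumes millis >= 0. */
void monotonic_deadline(long millis, struct timespec *abst) {
    clock_gettime(CLOCK_MONOTONIC, abst);
    abst->tv_sec  += millis / 1000;
    abst->tv_nsec += (millis % 1000) * 1000000L;
    if (abst->tv_nsec >= NANOS_PER_SEC) {
        abst->tv_sec  += 1;
        abst->tv_nsec -= NANOS_PER_SEC;
    }
}

/* Wait until signaled or until millis have elapsed.
   Returns 0 if signaled, ETIMEDOUT on timeout, or another error code. */
int event_park(event_t *ev, long millis) {
    struct timespec abst;
    int rc = 0;
    monotonic_deadline(millis, &abst);
    pthread_mutex_lock(&ev->mutex);
    /* rc == 0 with no signal is a spurious wakeup: wait again. */
    while (!ev->signaled && rc == 0) {
        rc = pthread_cond_timedwait(&ev->cond, &ev->mutex, &abst);
    }
    if (ev->signaled) {
        ev->signaled = 0;
        rc = 0;
    }
    pthread_mutex_unlock(&ev->mutex);
    return rc;
}
```

Because both the deadline and the condition variable refer to the monotonic clock, stepping the time-of-day clock while a thread is inside event_park should not change how long it waits.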

This change is relatively simple but it does affect all timed-waiting calls that ultimately use PlatformEvent. This should be okay as the majority of Java APIs that involve timed-waiting all specify relative times the same as Thread.sleep. I believe this will cover Thread.sleep and Object.wait.

Note that there is a similar problem with the java.util.concurrent.locks.LockSupport.park methods which are implemented using the platform specific Parker class, which again uses a pthread_mutex and pthread_cond under the covers. However as they support both absolute and relative wait times the fix is more involved as we need to use the monotonic clock for relative waits, but the TOD clock for absolute ones! That would require using two different pthread_cond objects.
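The two-condition-variable idea for Parker might look roughly like the sketch below. All names here (parker_t, REL_INDEX, ABS_INDEX, parker_init, parker_park, parker_unpark) are hypothetical and chosen for illustration. The sketch assumes that an absolute time is given in milliseconds since the epoch, that a relative time is given in nanoseconds, that both are non-negative, and that only one thread parks on a given parker at a time.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <time.h>

enum { REL_INDEX = 0, ABS_INDEX = 1 };

/* Hypothetical parker with one condvar per clock. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond[2];
    int             permit;
} parker_t;

int parker_init(parker_t *p) {
    pthread_condattr_t attr;
    int rc = pthread_condattr_init(&attr);
    if (rc != 0) return rc;
    /* Relative waits: condvar bound to the monotonic clock. */
    rc = pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
    if (rc == 0) rc = pthread_cond_init(&p->cond[REL_INDEX], &attr);
    pthread_condattr_destroy(&attr);
    if (rc != 0) return rc;
    /* Absolute waits: default attributes, i.e. CLOCK_REALTIME (TOD). */
    rc = pthread_cond_init(&p->cond[ABS_INDEX], NULL);
    if (rc != 0) return rc;
    p->permit = 0;
    return pthread_mutex_init(&p->mutex, NULL);
}

/* is_absolute != 0: time is millis since the epoch (TOD clock).
   is_absolute == 0: time is a relative timeout in nanos.
   Returns 0 if a permit was consumed, ETIMEDOUT on timeout. */
int parker_park(parker_t *p, int is_absolute, long long time) {
    struct timespec abst;
    int idx = is_absolute ? ABS_INDEX : REL_INDEX;
    int rc = 0;
    if (is_absolute) {
        abst.tv_sec  = (time_t)(time / 1000);
        abst.tv_nsec = (long)(time % 1000) * 1000000L;
    } else {
        clock_gettime(CLOCK_MONOTONIC, &abst);
        abst.tv_sec  += (time_t)(time / 1000000000LL);
        abst.tv_nsec += (long)(time % 1000000000LL);
        if (abst.tv_nsec >= 1000000000L) {
            abst.tv_sec  += 1;
            abst.tv_nsec -= 1000000000L;
        }
    }
    pthread_mutex_lock(&p->mutex);
    while (!p->permit && rc == 0) {
        rc = pthread_cond_timedwait(&p->cond[idx], &p->mutex, &abst);
    }
    if (p->permit) {
        p->permit = 0;
        rc = 0;
    }
    pthread_mutex_unlock(&p->mutex);
    return rc;
}

/* The waiter is on at most one of the two condvars; signalling the
   other one is harmless. */
void parker_unpark(parker_t *p) {
    pthread_mutex_lock(&p->mutex);
    p->permit = 1;
    pthread_cond_signal(&p->cond[REL_INDEX]);
    pthread_cond_signal(&p->cond[ABS_INDEX]);
    pthread_mutex_unlock(&p->mutex);
}
```

With this layout a relative park is insulated from time-of-day changes, while an absolute park deliberately still follows the TOD clock, since its deadline is defined in terms of that clock.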

Update: Note that forward time jumps lead to early returns. For os::sleep there is already a guard in place to prevent this. For wait(ms) and park(nanos) such a guard could also be implemented but is unnecessary in practice because an early timeout cannot be distinguished from a "spurious wakeup", which is permitted by the spec and which Java code has to account for.
Note that even CLOCK_MONOTONIC is subject to some TOD adjustments. The solution on newer Linux systems is to use CLOCK_MONOTONIC_RAW if available. Update: the adjustments to CLOCK_MONOTONIC under ntp are not a problem and use of CLOCK_MONOTONIC_RAW is neither necessary nor desirable.
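A guard against early returns, of the kind mentioned above for os::sleep, can be sketched as a loop that re-checks the elapsed time on the monotonic clock and waits again for the remainder. This is only an illustration of the idea, not the os::sleep implementation; guarded_sleep and elapsed_nanos are hypothetical names, and nanosleep stands in for whatever underlying wait might return early.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Nanoseconds elapsed between two timespec values. */
long long elapsed_nanos(const struct timespec *start,
                        const struct timespec *now) {
    return (long long)(now->tv_sec - start->tv_sec) * 1000000000LL
         + (now->tv_nsec - start->tv_nsec);
}

/* Sleep for at least millis (assumed >= 0). If the underlying wait
   returns early, wait again for whatever time remains. */
void guarded_sleep(long millis) {
    struct timespec start, now, req;
    const long long total = millis * 1000000LL;
    long long remaining = total;
    clock_gettime(CLOCK_MONOTONIC, &start);
    while (remaining > 0) {
        req.tv_sec  = (time_t)(remaining / 1000000000LL);
        req.tv_nsec = (long)(remaining % 1000000000LL);
        nanosleep(&req, NULL);            /* may return early */
        clock_gettime(CLOCK_MONOTONIC, &now);
        remaining = total - elapsed_nanos(&start, &now);
    }
}
```

The same pattern is what Java code is expected to apply around wait(ms) and park(nanos): treat any return as possibly early, re-check the condition and the remaining time, and wait again if needed.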

The backport to JDK6 is turning out to be problematic. We are currently investigating options, but we will not be able to take this to the January CPU. Removing critical-watch.

There is no regression test for this as the available tests need to be run manually with superuser privileges on a system where we can mess with the system time.

Changed Fix Version from 9 to 8. Unfortunately the 9 caused a backport issue for hs25 to be created instead of using this as the main issue. This issue will now be used for 8.

Yes, I can test your fix if you need, I have a system that it reproduces on.

Thanks for the info. It is good in a way that the main distros have been slow to pick this up. I'm preparing a minimal fix (minimal in the sense that I'm avoiding doing any clean up work that this area is in desperate need of and just switching to use CLOCK_MONOTONIC). Are you in a position to test with a fixed bundle?

Yes, that is the main glibc development repo. Looking at the git history, it looks like the change was part of glibc 2.12, which was officially released sometime in the middle to late part of 2010 (at least that is what it looks like). https://sourceware.org/git/?p=glibc.git;a=commit;h=e28c88707ef0529593fccedf1a94c3fce3df0ef3 Then it took a while before Linux distributions upgraded to that version. RHEL looks to have picked it up in RHEL 6.4, which was released 2013-02-21. http://distrowatch.com/table.php?distribution=redhat Other distributions might have picked it up at other points in time. I have no real way of knowing when the 32-bit version will appear, but I'm guessing the process will be similar, so it might be a while before that shows up in Linux distros.

I can't tell from the links exactly what these were committed to - is this mainline? I'm not sure how glibc is developed and distributed. It is very surprising that this went in in 2009 but we did not see any reports until early 2012 and then now. Do we know what glibc version this corresponds to and which distributions it would have appeared in? Likewise do we know when the 32-bit version will appear?

The changes for x86_64 glibc that caused the problems were committed in 2009: https://sourceware.org/git/?p=glibc.git;a=commit;h=e88726b483a275824e852f64476087568dbae7bb The corresponding 32-bit changes were committed earlier this year (2013): https://sourceware.org/git/?p=glibc.git;a=commit;h=4f682b2ae941b9bacde6015799b7ae77301a6d87

Also note that it is tricky to make the necessary changes to this code because of all the baggage that still exists from the early NPTL days, in particular the WorkAroundNPTLTimedWaitHang. It would be best to strip out that code completely and simplify the overall logic. Of course we need to verify that the glibc bug it guards against is in fact gone. The shared form of the Solaris code, regarding the limitations of "now + 100,000,000", also complicates changing to CLOCK_REALTIME. Again we could strip out this code (the planned refactoring never occurred) but that again makes the fix more disruptive. It would be better, but of course there is more room for error to creep in.

The fix in the Linux kernel/glibc hit early 2012 (or late 2011) as near as I can tell, but only for 64-bit. A 32-bit version of the patch is also in the pipeline as I understand it. The issue has been raised on the mailing lists: http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2013-September/009104.html It is important to note that:
a) Not all 64-bit Linux systems are affected (though we don't know exactly which kernel/glibc versions are).
b) It is only large backward time jumps that cause significant problems (which might be caused by an initial forward jump followed by a backward correction). Small time changes will cause waits to return early or slightly late. These timed-wait functions use timeouts as heuristics and the exact time value is not part of the functional semantics, so in most cases these early/late returns are not even noticeable. It is only when a large time jump backwards occurs that the wait becomes correspondingly longer and so a "hang" is perceived.
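The effect of a time jump on a wait whose deadline is an absolute time-of-day value can be illustrated with a little arithmetic. The sketch below is a simplified model, not measured behaviour: it assumes the clock is stepped once, immediately after the deadline has been computed, and effective_wait_millis is a hypothetical helper name.

```c
/* Simplified model: the deadline is computed as TOD "now + timeout",
   then the TOD clock is stepped by jump_millis (positive = forward,
   negative = backward). Returns how long the wait actually lasts. */
long long effective_wait_millis(long long timeout_millis,
                                long long jump_millis) {
    long long wait = timeout_millis - jump_millis;
    return wait < 0 ? 0 : wait;   /* a forward jump can only end the wait early */
}
```

Under this model a 1000 ms wait followed by a one-hour backward step (jump_millis = -3600000) lasts 3601000 ms, which is what gets perceived as a hang, whereas a forward step of 5000 ms makes the same wait return immediately.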

Priority raised to P3
Impact = High - potential hang of application threads or excessive non-responsiveness
Likelihood = Low but rising - 64-bit Linux was modified and is now affected by this; 32-bit Linux will follow
Workaround = Medium - avoid changing the system time

Removed myself as assignee as this bug is not actively being worked on.

diffs for import to add sleep tracing

Sample invocation: sudo java -XX:+TraceSleep -cp ~/Test/8* TimeJumpWithWait 2>&1 | tee /tmp/TimeJumpWithWait.log

Java test code that will reproduce the park issue, also known as JDK-6900441. One testing the sleep interface. One testing the wait interface.

EVALUATION See description.