JDK-6900441 : PlatformEvent.park(millis) on Linux could still be affected by changes to the time-of-day clock

Details
Type:
Bug
Submit Date:
2009-11-11
Status:
Closed
Updated Date:
2014-06-27
Project Name:
JDK
Resolved Date:
2013-09-24
Component:
hotspot
OS:
linux_ubuntu,linux
Sub-Component:
runtime
CPU:
x86,ppc,generic
Priority:
P3
Resolution:
Fixed
Affected Versions:
e5.0u21,hs24,hs25,6,6u29,7
Fixed Versions:

Related Reports
Backport:
Backport:
Backport:
Backport:
Duplicate:
Duplicate:
Relates:
Relates:
Relates:

Sub Tasks

Description
This is a continuation of 6311057 which only partially "fixed" the problem because Linux itself is operating incorrectly in this area. I have it on good authority that "The futex wait implementation is historically wrong." and that "I think with newer kernel/glibc combinations it will be correct."

The real fix here is to associate the pthread_cond with the monotonic clock (by calling pthread_condattr_setclock with CLOCK_MONOTONIC and passing the attr object to pthread_cond_init) and to calculate the absolute time as "millis from now" on the monotonic clock. This will make the PlatformEvent::park(millis) code immune to changes in the time-of-day clock.

This change is relatively simple but it does affect all timed-waiting calls that ultimately use PlatformEvent. This should be okay as the majority of Java APIs that involve timed-waiting all specify relative times the same as Thread.sleep. I believe this will cover Thread.sleep and Object.wait.

Note that there is a similar problem with the java.util.concurrent.locks.LockSupport.park methods which are implemented using the platform specific Parker class, which again uses a pthread_mutex and pthread_cond under the covers. However as they support both absolute and relative wait times the fix is more involved as we need to use the monotonic clock for relative waits, but the TOD clock for absolute ones! That would require using two different pthread_cond objects.
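
To make the relative/absolute distinction concrete, here is a minimal Java-level sketch of the two LockSupport entry points. The class and method names (ParkStyles, parkRelative, parkAbsolute) are hypothetical and used only for illustration; the sketch shows how the two kinds of wait are requested, not how the HotSpot Parker implements them.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

// Hypothetical helper class, for illustration only.
final class ParkStyles {
    // Relative wait: "block for at most this long from now".
    // A change to the time-of-day clock should ideally not affect it.
    // Returns the elapsed time in nanoseconds as measured by System.nanoTime().
    static long parkRelative(long millis) {
        long start = System.nanoTime();
        LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(millis));
        return System.nanoTime() - start;
    }

    // Absolute wait: "block until this wall-clock instant".
    // The deadline is in milliseconds since the epoch, the same scale as
    // System.currentTimeMillis(), so it is tied to the time-of-day clock by definition.
    static long parkAbsolute(long deadlineEpochMillis) {
        long start = System.nanoTime();
        LockSupport.parkUntil(deadlineEpochMillis);
        return System.nanoTime() - start;
    }
}
```

Both calls are allowed to return early for no reason, so callers should not treat the returned elapsed time as a guaranteed lower bound.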

Update: Note that forward time jumps lead to early returns. For os::sleep there is already a guard in place to prevent this. For wait(ms) and park(nanos) such a guard could also be implemented, but it is unnecessary in practice because an early timeout cannot be distinguished from a "spurious wakeup", which is permitted by the spec and which Java code has to account for.
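
As an illustration of how Java code can account for spurious wakeups and early timeouts, the following sketch re-checks both the condition and the remaining time in a loop, using System.nanoTime() for the deadline. The class GuardedWait and its methods are hypothetical names; this is one common pattern, not the guard used inside os::sleep.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical example class, for illustration only.
final class GuardedWait {
    private final Object lock = new Object();
    private boolean ready = false;

    // Waits until signal() has been called or timeoutMillis has elapsed.
    // Returns true if the condition was met, false on timeout or interruption
    // (the interrupt flag is restored in the latter case).
    boolean await(long timeoutMillis) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        synchronized (lock) {
            while (!ready) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return false; // the full timeout has really elapsed
                }
                // Round up: wait(0) would mean "wait forever".
                long remainingMillis = TimeUnit.NANOSECONDS.toMillis(remainingNanos) + 1;
                try {
                    // May return early (spurious wakeup or early timeout);
                    // the loop then simply waits again for the remainder.
                    lock.wait(remainingMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        }
    }

    void signal() {
        synchronized (lock) {
            ready = true;
            lock.notifyAll();
        }
    }
}
```

This loop only protects against early returns. It does not help with the late returns caused by a backward time jump, because the thread is still blocked inside wait() while that happens.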
                                    

Comments
The backport to JDK6 is turning out to be problematic. We are currently investigating options, but we will not be able to take this to the January CPU. Removing critical-watch.
                                     
2013-10-24
URL:   http://hg.openjdk.java.net/jdk8/jdk8/hotspot/rev/2e6938dd68f2
User:  amurillo
Date:  2013-09-24 21:09:48 +0000

                                     
2013-09-24
Changed Fix Version from 9 to 8. Unfortunately the 9 caused a backport issue for hs25 to be created instead of using this as the main issue. This issue will now be used for 8.
                                     
2013-09-16
There is no regression test for this, as the available tests need to be run manually with superuser privileges on a system where we can change the system time.
                                     
2013-09-16
Yes, that is the main glibc development repo.
Looking at the git history, it looks like the change was part of glibc 2.12, which was officially released sometime in mid to late 2010 (at least that is what it looks like).
https://sourceware.org/git/?p=glibc.git;a=commit;h=e28c88707ef0529593fccedf1a94c3fce3df0ef3

Then it took a while before linux distributions upgraded to that version.
RHEL looks to have picked it up in RHEL 6.4, which was released 2013-02-21.
http://distrowatch.com/table.php?distribution=redhat
Other distributions might have picked it up at other points in time.

I have no real way of knowing when the 32 bit version will appear, but I'm guessing the process will be similar, so it might be a while before that shows up in linux distros.
                                     
2013-09-05
Yes, I can test your fix if you need, I have a system that it reproduces on.
                                     
2013-09-05
Thanks for the info. It is good in a way that the main distros have been slow to pick this up.

I'm preparing a minimal fix (minimal in the sense that I'm avoiding doing any clean up work that this area is in desperate need of and just switching to use CLOCK_MONOTONIC).

Are you in a position to test with a fixed bundle?
                                     
2013-09-05
Also note that it is tricky to make the necessary changes to this code because of all the baggage that still exists from the early NPTL days, in particular WorkAroundNPTLTimedWaitHang. It would be best to strip out that code completely and simplify the overall logic. Of course we need to verify that the glibc bug it guards against is in fact gone.

The shared form of the solaris code, regarding the limitations of "now + 100,000,000" also complicates changing to CLOCK_REALTIME. Again we could strip out this code (the planned refactoring never occurred) but that again makes the fix more disruptive. It would be better, but of course there is more room for error to creep in.
                                     
2013-09-04
I can't tell from the links exactly what these were committed to - is this mainline? I'm not sure how glibc is developed and distributed. It is very surprising that this went in during 2009 but we did not see any reports until early 2012 and then now. Do we know what glibc version this corresponds to and which distributions it would have appeared in?

Likewise do we know when the 32-bit version will appear?
                                     
2013-09-04
The changes for x86_64 glibc that caused the problems were committed in 2009:
https://sourceware.org/git/?p=glibc.git;a=commit;h=e88726b483a275824e852f64476087568dbae7bb

The corresponding 32-bit changes were committed earlier this year (2013):
https://sourceware.org/git/?p=glibc.git;a=commit;h=4f682b2ae941b9bacde6015799b7ae77301a6d87
                                     
2013-09-04
Priority raised to P3

Impact = High - potential hang of application threads or excessive non responsiveness
Likelihood = Low but rising - 64-bit linux was modified and is now affected by this; 32-bit linux will follow
Workaround = Medium - avoid changing the system time
                                     
2013-09-03
The fix in the linux kernel/glibc hit early 2012 (or late 2011) as near as I can tell but only for 64-bit. A 32-bit version of the patch is also in the pipeline as I understand.

The issue has been raised on the mailing lists:

http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2013-September/009104.html 

It is important to note that:

a) Not all 64-bit linux systems are affected (though we don't know exactly which kernel/glibc versions are)
b) It is only large backward time jumps that cause significant problems (which might be caused by an initial forward jump followed by a backward correction). Small time changes will cause waits to return early or slightly late. These timed-wait functions use timeouts as heuristics and the exact time value is not part of the functional semantics, so in most cases these early/late returns are not even noticeable. It is only when a large time jump backwards occurs that the wait becomes correspondingly longer and so a "hang" is perceived.
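
The arithmetic behind point (b) can be sketched with a deliberately simplified model. It assumes the deadline is computed once as "wall-clock now + timeout" and that the clock is stepped immediately after the wait begins. The class TimeJumpModel and its methods are hypothetical and do not correspond to any HotSpot code.

```java
// Hypothetical, simplified model of a timed wait; for illustration only.
final class TimeJumpModel {
    // Real time (ms) a thread stays blocked when its deadline was computed as
    // wallClockNow + timeoutMillis and the wall clock is then stepped by jumpMillis
    // (positive = forward, negative = backward) right after the wait starts.
    static long blockedMillisWithWallClockDeadline(long timeoutMillis, long jumpMillis) {
        // A forward jump eats into the wait (early return, never below zero);
        // a backward jump is added on top of the requested timeout.
        return Math.max(0L, timeoutMillis - jumpMillis);
    }

    // With a deadline on a clock that is not stepped, the jump is not observed.
    static long blockedMillisWithMonotonicDeadline(long timeoutMillis, long jumpMillis) {
        return timeoutMillis;
    }
}
```

Under this model, a 1,000 ms wait followed by a one-hour backward step (jumpMillis = -3,600,000) blocks for 3,601,000 ms when the deadline is on the wall clock, while a 5,000 ms forward step makes the same wait return immediately.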
                                     
2013-09-03
Removed myself as assignee as this bug is not actively being worked on.
                                     
2013-08-29
 Sample invocation
sudo java -XX:+TraceSleep -cp ~/Test/8* TimeJumpWithWait 2>&1 | tee /tmp/TimeJumpWithWait.log

                                     
2013-07-15
diffs for import to add sleep tracing

                                     
2013-07-15
Java test code that will reproduce the park issue, also known as JDK-6900441

One testing the sleep interface 
One testing the wait interface
                                     
2013-07-11
Note that even CLOCK_MONOTONIC is subject to some TOD adjustments. The solution on newer linuxes is to use CLOCK_MONOTONIC_RAW if available.
                                     
2013-06-27
EVALUATION

See description.
                                     
2009-11-11


