JDK-6939126 : System.nanoTime not increasing
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 6u18-rev
  • Priority: P2
  • Status: Closed
  • Resolution: Not an Issue
  • OS: solaris_10
  • CPU: sparc
  • Submitted: 2010-03-30
  • Updated: 2011-02-16
  • Resolved: 2010-05-07
Description
Customer noticed a problem where threads that called
java.util.concurrent.ArrayBlockingQueue.poll never returned, which
caused severe downstream problems in their server.  The customer isolated
the problem down to System.nanoTime not increasing: the value returned by
this method had not changed in several hours.

The files in this package demonstrate that System.nanoTime() is not
advancing and that calls to ArrayBlockingQueue.poll() do not return.

The TimeDebugTask class was loaded into the server and the runTask
method was invoked.  The output of this method can be seen in the
TimeDebugTask.stderr file.  You'll notice that all six calls to
System.nanoTime() returned the same value (5187510361303074) even
though System.currentTimeMillis() advanced by over a second.
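A minimal, self-contained probe along these lines (the actual TimeDebugTask logging code is not shown in this report; the class and method names below are illustrative, not the customer's code):

```java
public class NanoTimeProbe {
    // Samples both clocks across a short sleep and returns {nanoDelta, millisDelta}.
    static long[] sampleDeltas(long sleepMillis) throws InterruptedException {
        long n1 = System.nanoTime(), m1 = System.currentTimeMillis();
        Thread.sleep(sleepMillis);
        long n2 = System.nanoTime(), m2 = System.currentTimeMillis();
        return new long[] { n2 - n1, m2 - m1 };
    }

    public static void main(String[] args) throws InterruptedException {
        long[] d = sampleDeltas(200);
        System.out.println("nanoTime delta: " + d[0] + " ns, millis delta: " + d[1] + " ms");
        // On a healthy JVM both deltas advance together; on the affected
        // host the nanoTime delta stays 0 while the millis delta grows.
        if (d[0] == 0 && d[1] > 0) {
            System.out.println("System.nanoTime appears stuck");
        }
    }
}
```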

You'll also see that the call to sleeper.sleep(1000L), which
internally calls ArrayBlockingQueue.poll(), never returns until the
thread is forcibly interrupted.  Here is the stack trace of the Task
thread that was interrupted:

"Task Thread 1 (R5433)" prio=3 tid=0x00000000054f4800 nid=0xd0 waiting on condition [0xfffffd6937bb9000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0xfffffd69c79808c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963)
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342)
        at com.unboundid.util.WakeableSleeper.sleep(WakeableSleeper.java:104)
        at com.unboundid.directory.server.tasks.TimeDebugTask.runTask(TimeDebugTask.java:110)
        at com.unboundid.directory.server.backends.task.Task.execute(Task.java:1179)
        at com.unboundid.directory.server.backends.task.TaskThread.runThread(TaskThread.java:195)
        at com.unboundid.directory.server.api.DirectoryThread.run(DirectoryThread.java:184)
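The dependence of a timed poll() on the nanosecond clock can be modeled with a small sketch (an assumption-laden simplification of awaitNanos's deadline loop; timedWaitCompletes and the clock suppliers are illustrative names, not JDK code):

```java
import java.util.function.LongSupplier;

public class TimedWaitModel {
    // Models awaitNanos's deadline arithmetic: the wait finishes only
    // when the clock source actually reaches the computed deadline.
    static boolean timedWaitCompletes(LongSupplier nanoClock,
                                      long timeoutNanos, int maxPolls) {
        long deadline = nanoClock.getAsLong() + timeoutNanos;
        for (int i = 0; i < maxPolls; i++) {
            if (deadline - nanoClock.getAsLong() <= 0) {
                return true; // deadline reached, poll() would return
            }
            // real code would parkNanos(remaining) here
        }
        return false; // clock never advanced: the wait never ends
    }

    public static void main(String[] args) {
        long[] t = { 0 };
        LongSupplier advancing = () -> (t[0] += 1_000_000); // ticks 1 ms per read
        LongSupplier frozen = () -> 5187510361303074L;      // the stuck value above
        System.out.println(timedWaitCompletes(advancing, 10_000_000, 100)); // true
        System.out.println(timedWaitCompletes(frozen, 10_000_000, 100));    // false
    }
}
```

With a frozen clock the remaining time never shrinks, which matches the thread parked forever in TIMED_WAITING above.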

The JVM is still up and running in this state, and the customer has a
mechanism to load and run arbitrary code, so if you have some code that
would help diagnose this, please let us know.

The customer has a standalone Java server process running with JDK 1.6
update 18 on a Sun x4600 with Solaris 10 (SunOS ospccr17 5.10
Generic_127128-11 i86pc i386 i86pc).  The JVM is started with the
following arguments:

-d64 -server -Xmx89g -Xms89g -XX:MaxNewSize=2g -XX:NewSize=2g
 -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled
 -XX:+CMSParallelRemarkEnabled -XX:+CMSParallelSurvivorRemarkEnabled
 -XX:ParallelCMSThreads=4 -XX:CMSMaxAbortablePrecleanTime=10000
 -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseParNewGC -XX:+UseMembar
 -XX:+UseBiasedLocking -XX:+UseLargePages -XX:PermSize=64M
 -XX:+UseNUMA -XX:+HeapDumpOnOutOfMemoryError
 -XX:-ReduceInitialCardMarks

They have seen this problem on multiple x4600s.  They believe this issue was introduced in JDK 1.6.0_18 by the fix for CR

 6863420 os::javaTimeNanos() go backward on Solaris x86

and the problem seems to affect only a single process after it has been running for a long time.
They have sent the following GC snippet, which is very interesting:

  2010-04-10T00:28:45.082-0700: [CMS-concurrent-preclean: 18289640.709/18289642.986 secs] [Times: user=11.97 sys=7.69, real=18289642.99 secs]

The full GC log is attached.  (18,289,642 seconds of reported real time is roughly 212 days for a single preclean phase, which strongly suggests the time source feeding the GC timestamps is corrupted.)

Also, looking at the getTimeNanos() method implementation:

// gethrtime can move backwards if read from one cpu and then a different cpu
// getTimeNanos is guaranteed to not move backward on Solaris
inline hrtime_t getTimeNanos() {
  if (VM_Version::supports_cx8()) {
    const hrtime_t now = gethrtime();
    // Use atomic long load since 32-bit x86 uses 2 registers to keep long.
    const hrtime_t prev = Atomic::load((volatile jlong*)&max_hrtime);
    if (now <= prev)  return prev;   // same or retrograde time;
    const hrtime_t obsv = Atomic::cmpxchg(now, (volatile jlong*)&max_hrtime, prev);
    assert(obsv >= prev, "invariant");   // Monotonicity
    // If the CAS succeeded then we're done and return "now".
    // If the CAS failed and the observed value "obs" is >= now then
    // we should return "obs".  If the CAS failed and now > obs > prv then
    // some other thread raced this thread and installed a new value, in which case
    // we could either (a) retry the entire operation, (b) retry trying to install now
    // or (c) just return obs.  We use (c).   No loop is required although in some cases
    // we might discard a higher "now" value in deference to a slightly lower but freshly
    // installed obs value.   That's entirely benign -- it admits no new orderings compared
    // to (a) or (b) -- and greatly reduces coherence traffic.
    // We might also condition (c) on the magnitude of the delta between obs and now.
    // Avoiding excessive CAS operations to hot RW locations is critical.
    // See http://blogs.sun.com/dave/entry/cas_and_cache_trivia_invalidate
    return (prev == obsv) ? now : obsv ;
  } else {
    return oldgetTimeNanos();
  }
}

This method will return 'prev' until 'now > prev':

if (now <= prev)  return prev;   // same or retrograde time;

But according to the comment, gethrtime() can move backwards.  So if it moves backwards and then stays behind max_hrtime, the method will keep returning the same value (i.e. prev).  That would explain exactly the behavior they're seeing.  Am I missing something?
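The suspicion can be checked against a Java transliteration of the filter (a sketch only, assuming AtomicLong.compareAndSet adequately models Atomic::cmpxchg; MonotonicFilter is an illustrative name, not HotSpot code):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

public class MonotonicFilter {
    private final AtomicLong maxHrtime = new AtomicLong();
    private final LongSupplier rawClock; // stands in for gethrtime()

    MonotonicFilter(LongSupplier rawClock) { this.rawClock = rawClock; }

    long timeNanos() {
        long now = rawClock.getAsLong();
        long prev = maxHrtime.get();
        if (now <= prev) return prev;   // same or retrograde time
        // Try to install 'now'; if another thread raced us, return what it installed.
        return maxHrtime.compareAndSet(prev, now) ? now : maxHrtime.get();
    }

    public static void main(String[] args) {
        // Raw clock jumps forward once, then falls back and stays behind:
        long[] raw = { 100, 50, 60, 70, 80 };
        int[] i = { 0 };
        MonotonicFilter f = new MonotonicFilter(() -> raw[i[0]++]);
        for (int k = 0; k < raw.length; k++) {
            System.out.println(f.timeNanos()); // prints 100 every time
        }
    }
}
```

Once the raw clock falls behind the recorded maximum and never catches up, every caller gets the pinned maximum back, exactly as in the TimeDebugTask.stderr output.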

Comments
EVALUATION: Not a JVM issue; it is fixed in Solaris 10 by kernel patch 137138-09 (Update 6).
07-05-2010