JDK-6546278 : Synchronization problem in the pseudo memory barrier code
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 5.0u12,6,6u1,6u3
  • Priority: P4
  • Status: Closed
  • Resolution: Fixed
  • OS:
    linux,linux_redhat_4.0,solaris_8,solaris_10
  • CPU: x86,sparc
  • Submitted: 2007-04-16
  • Updated: 2011-03-08
  • Resolved: 2011-03-08
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

Other               JDK 6       JDK 7     Other
5.0u14,hs11 Fixed   6u4 Fixed   7 Fixed   hs11 Fixed
Related Reports
Duplicate :  
Duplicate :  
Relates :  
Description
FULL PRODUCT VERSION :
Hotspot/Java:

- 1.6.0 b105
- sources:
  jdk-6-fcs-bin-b105-jrl-29_nov_2006.jar
  jdk-6-fcs-src-b105-jrl-29_nov_2006.jar
- build options: STATIC_MOTIF=false

FULL OS VERSION :
- uname: Linux b1c1s9 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006
x86_64 x86_64 x86_64 GNU/Linux
- RHEL 4, (patch level 4)
- 2 x dual-core Intel Xeon CPUs (shows as 8-way machine)

A DESCRIPTION OF THE PROBLEM :
The problem shows up as relatively rare, random application pauses of
7-30 seconds. These typically occur once every 1-4 hours in production.
With application pause time tracking enabled, the pauses are easy to see
in the output logs as "application stopped" time. During each pause, one
full CPU is consumed in kernel mode.
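
One way to pick such pauses out of a log is sketched below. It is only an illustration: it assumes the pause tracking comes from HotSpot's -XX:+PrintGCApplicationStoppedTime option, which writes lines of the form "Total time for which application threads were stopped: N seconds". The report does not say which option was used, and the function name, the 7-second default threshold, and the sample lines in the usage note are all hypothetical.

```python
import re
from typing import Iterable, List

# Assumed log format (not quoted in the report): the line printed by
# -XX:+PrintGCApplicationStoppedTime for each stop-the-world interval.
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped:"
    r"\s*([0-9]+(?:\.[0-9]+)?)\s*seconds"
)


def long_pauses(lines: Iterable[str], threshold_s: float = 7.0) -> List[float]:
    """Return the stopped times (in seconds) that are >= threshold_s."""
    pauses = []
    for line in lines:
        match = STOPPED_RE.search(line)
        if match is None:
            continue  # not an "application stopped" line
        seconds = float(match.group(1))
        if seconds >= threshold_s:
            pauses.append(seconds)
    return pauses
```

For example, long_pauses(open("gc.log")) would list only the stops of 7 seconds or more, where "gc.log" stands for whatever file the JVM output was redirected to.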

After building the JVM from source and inserting debugging statements in
various places, we were able to determine that the pause was the result
of a synchronization problem in the pseudo memory barrier code that
attempts to control JVM safe point entry on multiprocessor machines.

We verified this by running with the reinstated -XX:+UseMembar option.
This did appear to clear the problem; however, the overall performance
of the system was not acceptable with this option enabled, since it uses
a true memory barrier instruction to synchronize the multiple
processors.

Further investigation into the problem pointed to a race condition and
associated thread starvation during entry into the JVM global safe
point. The pseudo memory barrier code depends on the SIGSEGV handling
that is triggered when a thread attempts to access a block of shared
memory protected by another thread. While one thread was blocked trying
to protect the shared memory to enter the safe point, another thread
looped repeatedly in the SIGSEGV handler code. This continued for random
lengths of time until the protecting thread managed to get a time slice
on the same CPU.

We believe this appears random because it only occurs on safe point
entry when there are other threads executing and when the thread trying
to force the safe point and the outstanding threads are on the same CPU.
The race itself appears to happen very frequently, but long pauses seem
to occur only rarely: often the number of iterations through the SIGSEGV
loop is fewer than 10 and the pause escapes detection.

THE PROBLEM WAS REPRODUCIBLE WITH -Xint FLAG: Did not try

THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Yes

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
See description

EXPECTED VERSUS ACTUAL BEHAVIOR :
See description
ERROR MESSAGES/STACK TRACES THAT OCCUR :
Not available

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
Not available
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
We can make available a patch that we are using successfully under production
loads. This patch tracks the number of times a thread iterates through
the SIGSEGV handler and yields the CPU to the safepoint serializing
thread if the count exceeds 10. This eliminates the longer pauses while
still allowing the loop to "spin" briefly, as it frequently does.
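
The counting policy described in this workaround can be modelled roughly as follows. This is a toy model, not the submitter's patch or HotSpot code: the class and function names are hypothetical, the reset of the counter after a yield is an assumption, and the "scheduler" is reduced to a single rule (the protecting thread runs either when the spinning thread yields or when its time slice of slice_len handler iterations expires).

```python
from dataclasses import dataclass
from typing import Callable

YIELD_THRESHOLD = 10  # from the report: yield once the count exceeds 10


@dataclass
class FaultSpinPolicy:
    """Counts trips through the (modelled) SIGSEGV handler."""
    yield_cpu: Callable[[], None]
    threshold: int = YIELD_THRESHOLD
    count: int = 0

    def on_fault(self) -> None:
        """Call once per handler entry while the page is still protected."""
        self.count += 1
        if self.count > self.threshold:
            self.yield_cpu()  # let the serializing thread run
            self.count = 0    # assumption: start counting afresh after a yield


def handler_iterations(slice_len: int, use_patch: bool) -> int:
    """Handler iterations before the protecting thread restores the page."""
    state = {"protected": True}

    def yield_cpu() -> None:
        # In this model, yielding lets the protecting thread run at once
        # and restore the page permissions.
        state["protected"] = False

    policy = FaultSpinPolicy(yield_cpu)
    iterations = 0
    while state["protected"]:
        iterations += 1
        if use_patch:
            policy.on_fault()
        if iterations >= slice_len:
            # Time slice expired: the protecting thread finally gets the CPU.
            state["protected"] = False
    return iterations
```

In this model a short spin (slice_len below the threshold) is unaffected by the policy, while a long one is cut off at the first iteration past the threshold instead of running until the time slice ends.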

We are not sure this is the optimal patch, but it does clearly
demonstrate the issue we were encountering with the pseudo memory
barrier implementation in our system environments.
Fixed mis-spelling of "pseudo" in Synopsis field.

Comments
WORK AROUND Try: -XX:+UseMembar
10-07-2007

EVALUATION The webrev is at: http://prt-web.sfbay.sun.com/net/prt-archiver.sfbay/data/archived_workspaces/main/rt_baseline/2007/20070620130324.xl116366.hotspot/workspace/webrevs/webrev-2007.06.20/index.html
22-06-2007

EVALUATION Even though the root cause could be the long pause of the mprotect call during safepoint (see bug 6336900), it is a reasonable workaround to issue a poll call inside the signal handler to yield to other threads, such as the VMThread, so that the memory serialize page's permission can be restored soon and both threads can eventually make progress. Siemens Networks has tested the fix and the long pause time is gone.
06-06-2007