Bug ID: JDK-6546278 Synchronization problem in the pseudo memory barrier code

Details
Type: Bug
Submit Date: 2007-04-16
Status: Closed
Updated Date: 2011-03-08
Project Name: JDK
Resolved Date: 2011-03-08
Component: hotspot
OS: solaris_8, linux_redhat_4.0, linux, solaris_10
Sub-Component: runtime
CPU: x86, sparc
Priority: P4
Resolution: Fixed
Affected Versions: 5.0u12, 6, 6u1, 6u3
Fixed Versions: hs11 (b01)

Related Reports
Backport:
Backport:
Backport:
Duplicate:
Duplicate:
Relates:

Sub Tasks

Description
FULL PRODUCT VERSION :
Hotspot/Java:

- 1.6.0 b105
- sources:
  jdk-6-fcs-bin-b105-jrl-29_nov_2006.jar
  jdk-6-fcs-src-b105-jrl-29_nov_2006.jar
- build options: STATIC_MOTIF=false

FULL OS VERSION :
- uname: Linux b1c1s9 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006
x86_64 x86_64 x86_64 GNU/Linux
- RHEL 4, (patch level 4)
- 2 x Dual Core Intel Xeon CPUs (shows as 8-way machine)

A DESCRIPTION OF THE PROBLEM :
The problem shows up as relatively rare, random application pauses of
7-30 seconds. Typically, these occur once every 1-4 hours in
production. With application pause time tracking enabled, the pauses
can easily be seen in the output logs as "application stopped" time. During
these stoppages, a full CPU is consumed in kernel mode.
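
The pauses above can be picked out of a log mechanically. The following is a minimal sketch only. It assumes the tracking was enabled with -XX:+PrintGCApplicationStoppedTime and that each record contains the text "Total time for which application threads were stopped: N seconds"; the report itself does not name the flag or quote the log format. The function names and the 7-second default threshold (the lower bound of the pauses reported here) are hypothetical.

```cpp
#include <cstdlib>
#include <string>

// Returns the stopped time in seconds if `line` looks like a stopped-time
// record, or -1.0 if it does not. The marker text is an assumption about
// the log format, not something quoted in the report.
double stopped_seconds(const std::string& line) {
    static const std::string marker =
        "Total time for which application threads were stopped: ";
    const std::string::size_type pos = line.find(marker);
    if (pos == std::string::npos) return -1.0;
    const char* start = line.c_str() + pos + marker.size();
    char* end = nullptr;
    const double value = std::strtod(start, &end);
    return end == start ? -1.0 : value;  // no digits after the marker
}

// True when the line records a pause at or above the threshold.
bool is_long_pause(const std::string& line, double threshold_seconds = 7.0) {
    return stopped_seconds(line) >= threshold_seconds;
}
```

Feeding each log line through is_long_pause would flag only the multi-second stalls and ignore the ordinary sub-millisecond safepoints.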

After building the JVM from source and inserting debugging statements in
various places, we were able to determine that the pause was the result
of a synchronization problem in the pseudo memory barrier code that
attempts to control multiple processor JVM safe point entry.

We verified this by using the reinstated -XX:+UseMembar
option. This did appear to clear the problem; however, the overall
performance of the system was not acceptable with this option enabled,
since it uses a true memory barrier instruction to synchronize the
multiple processors.

Further investigation into the problem pointed to a race condition and
associated thread starvation during entry into the JVM global safe
point. The pseudo memory barrier code depends on the SIGSEGV handling
triggered when a thread attempts to access a block of shared memory
protected by another thread. While one thread was blocked trying to
protect the shared memory to enter the safe point, another thread looped
repeatedly in the SIGSEGV handler code. This continued for random
lengths of time until the protecting thread managed to get a time slice
on the same CPU.
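
To make the fault-and-retry loop concrete, here is a minimal single-threaded sketch for Linux. It is not HotSpot's code: all names are hypothetical, and the handler itself restores the page permissions after a fixed number of faults as a stand-in for the other thread that would do so in the JVM. Calling mprotect from a signal handler is also an assumption that holds in practice on Linux rather than something POSIX guarantees. The point it illustrates is that a store to a protected page re-faults every time the handler returns, until the permissions change.

```cpp
#include <csignal>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t g_faults = 0;   // faults seen so far
static void* g_page = nullptr;               // the protected page
static size_t g_page_size = 0;
static const int kFaultsBeforeRestore = 3;   // stand-in for "other thread gets to run"

static void on_segv(int, siginfo_t* info, void*) {
    char* addr = static_cast<char*>(info->si_addr);
    char* base = static_cast<char*>(g_page);
    if (addr < base || addr >= base + g_page_size) _exit(2);  // unrelated crash
    g_faults = g_faults + 1;
    if (g_faults >= kFaultsBeforeRestore) {
        // In the JVM another thread would restore access; here the handler does.
        mprotect(g_page, g_page_size, PROT_READ | PROT_WRITE);
    }
    // Returning re-executes the faulting store.
}

// Returns how many times the store faulted before it finally succeeded,
// or -1 if the page could not be mapped.
int count_faults_until_store_succeeds() {
    g_page_size = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    g_page = mmap(nullptr, g_page_size, PROT_READ,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (g_page == MAP_FAILED) return -1;

    struct sigaction sa = {};
    struct sigaction old_sa = {};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_sa);

    g_faults = 0;
    *static_cast<volatile int*>(g_page) = 1;  // faults until access is restored

    sigaction(SIGSEGV, &old_sa, nullptr);
    munmap(g_page, g_page_size);
    return g_faults;
}
```

In the scenario the report describes, nothing bounds the number of retries: the faulting thread keeps re-entering the handler for as long as the protecting thread is starved of CPU time.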

We believe this appears random because it only occurs on safe point
entry when there are other threads executing and when the thread trying
to force the safe point and the outstanding threads are on the same CPU.
The race itself appears to happen very frequently, but long pauses occur
only rarely: often the number of iterations through the SIGSEGV loop is
fewer than 10 and the pause escapes detection.

THE PROBLEM WAS REPRODUCIBLE WITH -Xint FLAG: Did not try

THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Yes

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
See description

EXPECTED VERSUS ACTUAL BEHAVIOR :
See description

ERROR MESSAGES/STACK TRACES THAT OCCUR :
Not available

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
Not available
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
We can make available a patch that we are using successfully under production
loads. This patch tracks the number of times a thread iterates through
the SIGSEGV handler and yields the CPU to the safepoint serializing
thread if the count exceeds 10. This eliminates the longer pauses while
still allowing the loop to "spin" briefly, as it frequently does in normal operation.

We are not sure this is the optimal patch, but it does clearly
demonstrate the issue we were encountering with the pseudo memory
barrier implementation in our system environments.
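
The spin-then-yield idea described above can be sketched as follows. This is not the submitter's patch, which is not included in the report; the function and type names are hypothetical, and only the threshold of 10 comes from the text. The retry is modelled as a callable so the policy can be exercised without a real signal handler.

```cpp
#include <functional>
#include <thread>

struct SpinStats {
    int failed_attempts;  // retries that found the resource still unavailable
    int yields;           // how many times the CPU was given up
};

// Retries `try_access` until it succeeds. The first `spin_limit` failures
// spin freely; every failure after that yields the CPU so that the thread
// able to unblock us can be scheduled.
SpinStats spin_then_yield(const std::function<bool()>& try_access,
                          int spin_limit = 10) {
    SpinStats stats = {0, 0};
    while (!try_access()) {
        ++stats.failed_attempts;
        if (stats.failed_attempts > spin_limit) {
            std::this_thread::yield();
            ++stats.yields;
        }
    }
    return stats;
}
```

Keeping a short free-spinning phase preserves the common fast case, where the loop exits after a handful of iterations, and the yield only applies in the rare case where the waiting thread would otherwise starve the thread it is waiting on.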
Fixed mis-spelling of "pseudo" in Synopsis field.

                                    

Comments
EVALUATION

Even though the root cause could be the long pause of the mprotect call during safepoint (see bug 6336900), a reasonable workaround is to issue a poll call inside the signal handler so that it yields to other threads, such as the VMThread. The memory serialize page's permissions can then be restored sooner, and both threads can eventually make progress.

Siemens Networks has tested the fix and the long pause time is gone.
                                     
2007-06-06
EVALUATION

The webrev is at:
http://prt-web.sfbay.sun.com/net/prt-archiver.sfbay/data/archived_workspaces/main/rt_baseline/2007/20070620130324.xl116366.hotspot/workspace/webrevs/webrev-2007.06.20/index.html
                                     
2007-06-22
WORK AROUND

Try: -XX:+UseMembar
                                     
2007-07-10


