FULL PRODUCT VERSION :
- 1.6.0 b105
- build options: STATIC_MOTIF=false
FULL OS VERSION :
- uname: Linux b1c1s9 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006
x86_64 x86_64 x86_64 GNU/Linux
- RHEL 4, (patch level 4)
- 2 x dual-core Intel Xeon CPUs (shows as 8-way machine)
A DESCRIPTION OF THE PROBLEM :
The problem shows up as relatively rare, random application pauses of
7-30 seconds. Typically, these occur once every 1-4 hours in
production. With application pause time tracking enabled, the problem
can easily be seen in the output logs as "application stopped" time.
During these stoppages, a full CPU is consumed in kernel mode.
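For anyone trying to spot these stoppages in their own logs, the following is a minimal sketch of a filter for long pauses. It assumes the log line text that HotSpot prints with -XX:+PrintGCApplicationStoppedTime ("Total time for which application threads were stopped: N seconds"); this report does not quote the exact line, so treat the prefix as an assumption. The helper names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Assumed log text; the report itself only mentions "application stopped". */
static const char STOPPED_PREFIX[] =
    "Total time for which application threads were stopped: ";

/* Extracts the pause length in seconds from one log line.
   Returns 1 on success, 0 if the line is not a stopped-time record. */
static int parse_stopped_seconds(const char *line, double *seconds) {
    const char *p = strstr(line, STOPPED_PREFIX);
    if (p == NULL)
        return 0;
    /* Skip past the prefix (sizeof counts the trailing NUL, hence -1). */
    return sscanf(p + sizeof STOPPED_PREFIX - 1, "%lf", seconds) == 1;
}

/* Hypothetical helper: true if the line records a pause at or above
   the given threshold, e.g. 7.0 for the pauses described here. */
static int is_long_pause(const char *line, double threshold_seconds) {
    double s;
    return parse_stopped_seconds(line, &s) && s >= threshold_seconds;
}
```

Feeding each log line through is_long_pause(line, 7.0) would single out the 7-30 second pauses while ignoring the many short stops.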
After building the JVM from source and inserting debugging statements in
various places, we were able to determine that the pause was the result
of a synchronization problem in the pseudo memory barrier code that
attempts to control multiple processor JVM safe point entry.
We verified this by using the reinstated -XX:+UseMembar option. This
did appear to clear the problem; however, the overall performance of
the system was not acceptable with this option enabled, since it uses a
true memory barrier instruction for synchronization.
Further investigation into the problem pointed to a race condition and
associated thread starvation during entry into the JVM global safe
point. The pseudo memory barrier code is dependent on SIGSEGV error
processing generated while attempting to access a block of shared memory
protected by another thread. While one thread was blocked trying to
protect the shared memory to enter the safe point, another thread looped
repeatedly in the SIGSEGV handler code. This continued for random
lengths of time until the protecting thread managed to get a time slice
on the same CPU.
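The retry behaviour described above can be illustrated with a small, Linux-only sketch. It is not the HotSpot code: it is a single-threaded stand-in in which a write to a read-only page faults, the SIGSEGV handler returns, and the store is retried until the protection is lifted. All names are hypothetical, and FAULTS_BEFORE_RELEASE merely stands in for the delay before the protecting thread gets to run. Calling mprotect from a signal handler is not guaranteed async-signal-safe by POSIX; it is used here only to keep the sketch short.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS under strict -std modes */
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical: number of faults before the page is made writable again. */
#define FAULTS_BEFORE_RELEASE 5

static volatile int *shared_page;          /* hypothetical shared page */
static size_t page_size;
static volatile sig_atomic_t fault_count;

static void on_segv(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)shared_page;
    if (addr < base || addr >= base + page_size)
        _exit(1);                          /* unrelated fault: give up */
    fault_count++;
    if (fault_count >= FAULTS_BEFORE_RELEASE)
        mprotect((void *)shared_page, page_size, PROT_READ | PROT_WRITE);
    /* Returning restarts the faulting store; it faults again while the
       page is still read-only. This is the loop described in the report. */
}

/* Returns the number of faults taken before the store succeeded, or -1. */
static int run_demo(void) {
    struct sigaction sa, old;
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    shared_page = p;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    fault_count = 0;
    if (mprotect(p, page_size, PROT_READ) != 0)
        return -1;
    shared_page[0] = 1;                    /* retried until unprotected */

    sigaction(SIGSEGV, &old, NULL);        /* restore previous handler */
    return (int)fault_count;
}
```

In the real JVM the page is released by another thread, so the number of trips through the handler depends on when that thread is scheduled, which matches the random pause lengths we observed.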
We believe this appears random because it only occurs on safe point
entry when there are other threads executing and when the thread trying
to force the safe point and the outstanding threads are on the same CPU.
The race itself appears to happen very frequently, but long pauses seem
to occur only rarely: often the number of iterations through the SIGSEGV
loop is less than 10 and the pause escapes detection.
THE PROBLEM WAS REPRODUCIBLE WITH -Xint FLAG: Did not try
THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Yes
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
EXPECTED VERSUS ACTUAL BEHAVIOR :
ERROR MESSAGES/STACK TRACES THAT OCCUR :
This bug can be reproduced always.
---------- BEGIN SOURCE ----------
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
We can make available a patch that we are using successfully under production
loads. This patch tracks the number of times a thread iterates through
the SIGSEGV handler and yields the CPU to the safepoint serializing
thread if the count exceeds 10. This eliminates the longer pauses while
still allowing the loop to "spin" briefly, as it frequently does naturally.
We are not sure this is the optimal patch, but it does clearly
demonstrate the issue we were encountering with the pseudo memory
barrier implementation in our system environments.
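The policy used by the workaround can be sketched as follows. This is not the patch itself: the struct and function names are hypothetical, and the sketch assumes that every retry past the tenth yields the CPU with sched_yield(). Only the threshold of 10 comes from the description above.

```c
#include <sched.h>
#include <stdbool.h>

/* Threshold from the workaround description. */
#define SPIN_LIMIT 10

/* Hypothetical per-thread bookkeeping. */
typedef struct {
    unsigned retries;   /* consecutive trips through the fault handler */
    unsigned yields;    /* how many times the CPU was given up */
} retry_state;

/* Call on each trip through the handler while the page is still
   protected. Short spins proceed untouched; once the count exceeds
   SPIN_LIMIT the thread yields so the protecting thread can run.
   Returns true if this call yielded. */
static bool note_retry(retry_state *s) {
    s->retries++;
    if (s->retries > SPIN_LIMIT) {
        sched_yield();
        s->yields++;
        return true;
    }
    return false;
}

/* Call once the access finally succeeds. */
static void reset_retries(retry_state *s) {
    s->retries = 0;
}
```

Keeping the first ten retries as a pure spin preserves the common fast case, where the loop exits after a handful of iterations, and only the rare long loop pays the cost of a yield.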
Fixed mis-spelling of "pseudo" in Synopsis field.