FULL PRODUCT VERSION :
Hotspot/Java:
- 1.6.0 b105
- sources: jdk-6-fcs-bin-b105-jrl-29_nov_2006.jar jdk-6-fcs-src-b105-jrl-29_nov_2006.jar
- build options: STATIC_MOTIF=false

FULL OS VERSION :
- uname: Linux b1c1s9 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
- RHEL 4 (patch level 4)
- 2 x dual-core Intel Xeon CPUs (shows as 8-way machine)

A DESCRIPTION OF THE PROBLEM :
The problem shows up as relatively rare, random application pauses of 7-30 seconds. Typically, these occur once every 1-4 hours in production. With application pause time tracking enabled, the problem is easy to see in the output logs as "application stopped" time. During these stoppages, a full CPU is consumed in kernel mode.

After building the JVM from source and inserting debugging statements in various places, we determined that the pause results from a synchronization problem in the pseudo memory barrier code that controls multiprocessor entry into a JVM safe point. We verified this by using the reinstated -XX:+UseMembar option. This did appear to clear the problem, but overall system performance was not acceptable with the option enabled, since it uses a true memory barrier instruction to synchronize the processors.

Further investigation pointed to a race condition and associated thread starvation during entry into the JVM global safe point. The pseudo memory barrier code depends on SIGSEGV processing that is triggered when a thread accesses a block of shared memory protected by another thread. While one thread was blocked trying to protect the shared memory in order to enter the safe point, another thread looped repeatedly in the SIGSEGV handler code. This continued for random lengths of time until the protecting thread managed to get a time slice on the same CPU.
We believe this appears random because it only occurs on safe point entry when other threads are executing and when the thread trying to force the safe point and the outstanding threads are on the same CPU. The condition itself appears to happen very frequently, but long pauses seem to occur only rarely: often the number of iterations through the SIGSEGV loop is less than 10 and the pause escapes detection.

THE PROBLEM WAS REPRODUCIBLE WITH -Xint FLAG: Did not try

THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Yes

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
See description

EXPECTED VERSUS ACTUAL BEHAVIOR :
See description

ERROR MESSAGES/STACK TRACES THAT OCCUR :
Not available

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
Not available
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
We can make available a patch that we are using successfully under production loads. This patch tracks the number of times a thread iterates through the SIGSEGV handler and yields the CPU to the safepoint serializing thread if the count exceeds 10. This eliminates the longer pauses while still allowing the loop to "spin" briefly, as it frequently does in normal operation. We are not sure this is the optimal patch, but it clearly demonstrates the issue we were encountering with the pseudo memory barrier implementation in our system environments.

Fixed mis-spelling of "pseudo" in Synopsis field.
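
The counting logic described in the workaround can be sketched roughly as follows. This is not the submitter's patch, which is not included in the report; it is a minimal C++ model of "count passes through the handler, yield once the count exceeds 10". The names `kMaxFaultSpins`, `fault_spins`, `note_fault_and_maybe_yield`, and `reset_fault_spins` are hypothetical, and the sketch runs as ordinary code rather than inside a real SIGSEGV handler, so it makes no claim about signal safety or about how HotSpot structures its handler.

```cpp
#include <thread>

// Hypothetical threshold, taken from the "exceeds 10" figure in the report.
constexpr int kMaxFaultSpins = 10;

// Hypothetical per-thread count of consecutive passes through the fault path.
thread_local int fault_spins = 0;

// Called once per pass through the (modelled) fault handler.
// Returns true if this pass gave up the CPU, false if it kept spinning.
bool note_fault_and_maybe_yield() {
    if (++fault_spins > kMaxFaultSpins) {
        // Spun too long: reset the count and let another thread run,
        // for example the one trying to protect the shared page.
        fault_spins = 0;
        std::this_thread::yield();
        return true;
    }
    return false;
}

// Called when the thread leaves the fault loop normally.
void reset_fault_spins() {
    fault_spins = 0;
}
```

With this shape, the first ten consecutive passes spin as before and only the eleventh yields, which matches the report's observation that short spins are common and harmless while long ones cause the visible pauses.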