JDK-4844565 : UseParallelGC problem with 1.4.1_01 and 1.4.1_02 on IA-32 "Foster" chips.
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 1.4.1_02
  • Priority: P2
  • Status: Closed
  • Resolution: Cannot Reproduce
  • OS: windows_2000
  • CPU: x86
  • Submitted: 2003-04-07
  • Updated: 2005-06-13
  • Resolved: 2005-06-13
Description
Two reports follow, both on the same problem (escalation with sustaining
engineering might follow):


REPORT #1
---------

I would like to report a problem that we have been seeing with the Sun
JVM 1.4.1_01 and 1.4.1_02.

This problem can be reproduced by running the industry-standard benchmark
SPECjbb2000 (version 1.02).  The following command line can be used:

java -server -verbosegc -XX:NewSize=634m -XX:MaxNewSize=634m -Xms1600m
-Xmx1600m -XX:+UseParallelGC -Xbatch -Xss128k -cp
.\jbb.jar;.\jbb_no_precompile.jar;.\check.jar;.\reporter.jar;.
spec.jbb.JBBmain -propfile SPECjbb.props

There is nothing special about this command line other than that it turns
on UseParallelGC.  If UseParallelGC is not enabled, the problem does not
occur.  With UseParallelGC enabled, we have observed the problem only
on our servers when populated with IA-32 1.6GHz Xeon MPs ("Foster"):
if we keep the same system, backplane and all, but replace the CPUs with
anything other than Foster chips, we have not been able to reproduce
the problem.  When configured with IA-32 Foster CPUs, the problem occurs
at arbitrary points (although always during GC, as verified with verbosegc).
Sometimes the benchmark completes without encountering the faults,
so in order to reproduce the problem we have to run the test more than
once or lengthen the duration of the benchmark.
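
Because the failure is intermittent, a small harness that reruns the benchmark
until one run fails can save manual effort.  The following is only a sketch
under stated assumptions: the class name RerunUntilFailure and its methods are
hypothetical and not part of the original report, it treats any nonzero exit
code as a failure (which would catch a crash but not the "stuck in GC" symptom),
and it needs a modern JDK (8 or later) to run the harness itself, even though
the command it launches may be an older JVM.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntSupplier;

// Hypothetical helper: repeats a run until it fails or a run limit is reached.
class RerunUntilFailure {

    // Calls runOnce up to maxRuns times. Returns the 1-based index of the
    // first run that yields a nonzero exit code, or -1 if every run succeeds.
    static int firstFailingRun(IntSupplier runOnce, int maxRuns) {
        for (int run = 1; run <= maxRuns; run++) {
            if (runOnce.getAsInt() != 0) {
                return run;
            }
        }
        return -1;
    }

    // Wraps an external command as an IntSupplier that yields its exit code.
    // Assumption: a crashed JVM exits with a nonzero code.
    static IntSupplier process(List<String> command) {
        return () -> {
            try {
                Process p = new ProcessBuilder(command).inheritIO().start();
                return p.waitFor();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        };
    }

    // Usage: java RerunUntilFailure <maxRuns> <command> [args...]
    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: RerunUntilFailure <maxRuns> <command> [args...]");
            return;
        }
        int maxRuns = Integer.parseInt(args[0]);
        List<String> command = Arrays.asList(args).subList(1, args.length);
        int failed = firstFailingRun(process(command), maxRuns);
        System.out.println(failed < 0
                ? "no failure in " + maxRuns + " runs"
                : "first failure on run " + failed);
    }
}
```

The benchmark command line given above would be passed as the trailing
arguments, after the maximum number of runs.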

We have encountered a number of symptoms -- e.g. here are a couple of stack
traces at the point of failure:

MarkSweep::follow_stack() line 92 + 9 bytes
PSMarkSweep::mark_sweep_phase1(int & 0, int 0) line 255
PSMarkSweep::invoke_at_safepoint(int 0, int & 0) line 89 + 26 bytes
PSScavenge::invoke_at_safepoint(unsigned int 0, int 1, int & 0) line 383 +
11 bytes
ParallelScavengeHeap::collect_at_safepoint(ParallelScavengeHeap * const
0x004f005f, ParallelScavengeHeap::CollectionType MarkSweep, unsigned int 0,
int & 0) line 244 + 25 bytes
VM_ParallelScavengeGCCollect::doit(VM_ParallelScavengeGCCollect * const
0x004f005f) line 125
VM_Operation::evaluate(VM_Operation * const 0x004f005f) line 30
VMThread::evaluate_operation(VMThread * const 0x004f005f, VM_Operation *
0x6d630aab) line 258
VMThread::loop(VMThread * const 0x004f005f) line 334
VMThread::run(VMThread * const 0x004f005f) line 186
_start(Thread * 0x00000000) line 286


instanceKlass::oop_follow_contents(instanceKlass * const 0x2b7eb748, oopDesc
* 0x6d598ec0) line 986
MarkSweep::mark_and_follow(oopDesc * * 0x2b7eb708) line 58 + 13 bytes
objArrayKlass::oop_follow_contents(objArrayKlass * const 0x2b7eb748, oopDesc
* 0x2b7eb6f8) line 211 + 6 bytes
MarkSweep::follow_stack() line 94 + 12 bytes
PSMarkSweep::mark_sweep_phase1(int & 1, int 710017024) line 254
09e80000()
66d38778()

In addition, we have seen the problem apparently manifest itself as the JVM
becoming "stuck", repeatedly entering GC cycles while the benchmark throws
NullPointerExceptions.

It appears that as the number of CPUs (and hence the number of GC threads)
increases, the chance of the problem appearing also increases;
e.g. a 32x configuration appears to be more susceptible to the problem
than a 16x, which in turn is more susceptible than an 8x.

I don't know how difficult it might be for you to reproduce or recognize
the problem.  It is strange that we have only encountered it to date with
the Foster chips.

I'm guessing that some heap state has become corrupted, which later leads to
the faults, but I am not sure how to dig deeper in order to give you more
information.


REPORT #2:
---------

The problem looks similar to bug 4827353 ("atomic::membar doesn't on x86")
for 1.4.2, which has been fixed.  The patches were to the following two
files:

  \hotspot\src\cpu\i486\vm\assembler_i486.cpp
  \hotspot\src\os_cpu\win32_i486\vm\atomic_win32_i486.inline.hpp

I applied these changes to both the JVM port we have for our system,
as well as to the base Sun JVM, version 1.4.1_01 which is the base for
our current JVM.

Having tested these patches and observed the same failure with the
Sun JVM as before, it appears that the problem reported in 1.4.1_01 and
1.4.1_02 is not fixed by these patches.

Is it possible that these fixes that were added to 1.4.2 are dependent
on other changes made earlier in that stream, or that I overlooked 
additional changes specifically related to this bug fix?

The problem has only appeared when using 1.6GHz "Foster" IA-32 chips in a
multiprocessor system (8x, 16x, 32x, etc.).  Replace the Intel chips
with faster or slower ones and we see no signs of trouble.  It is still a
big mystery...

Comments
EVALUATION

Is this still a problem? Does the problem occur with JDK-1.4.2?
###@###.### 2003-08-21

I'm closing this bug.
###@###.### 2005-06-13 16:18:59 GMT