Bug ID: JDK-5024566 Object integrity maybe changing using ParallelGC when a Full GC occurs

Type: Bug
Component: hotspot
Sub-Component: gc
Affected Version: 1.4.2_01

Priority: P2
Status: Closed
Resolution: Fixed
OS: solaris_8
CPU: sparc

Submitted: 2004-03-31
Updated: 2004-06-11
Resolved: 2004-05-10

Other
1.4.2_05  05  Fixed

The customer is running Solaris 8 on an eight-CPU system.
When a Full GC occurs under 1.4.2, their transaction server
throws DataValidationExceptions complaining that the integrity of the
data has changed once the collection has finished. This causes rollbacks
of transaction requests, so trades go unfilled and money is lost.
At that time the server is using ParallelGC only for collecting the
young generation.

The only change they report in their trading environment is replacing
1.4.1_03 with 1.4.2_03. The transaction server is written in C++,
so they have native threads referencing Java objects. Turning
on the CMS collector together with UseParNewGC seems to hide the
problem, though: they never see the update exceptions after a Full GC
when using these options with 1.4.2.

The customer has run a couple of tests with the UseParNewGC collector for several
hours and did not experience any UpdateExceptions from their transaction
server after a Full GC occurred. The heap options are listed below:
 
command: /usr/local/j2sdk1.4.2_01/bin/java
-server -showversion -Xms512m
-Xmx512m -XX:NewSize=500m -XX:MaxNewSize=500m -XX:InitialSurvivorRatio=4
-XX:TargetSurvivorRatio=100 -XX:+PrintCompilation -XX:+UseParNewGC
-XX:MaxPermSize=256MB -XX:PermSize=3m -XX:MinPermHeapExpansion=1m
-XX:MaxPermHeapExpansion=10m -XX:-UseAdaptiveSizePolicy
-XX:+DisableExplicitGC -XX:+PrintTenuringDistribution
-XX:+PrintHeapAtGC -verbose:gc -XX:+PrintGCTimeStamps -Xnoclassgc

Why would there be such a difference in behavior?

Background 
-----------
The application is more like 98% Java, and 2% C++.  
The C++ code handles some of their ORB transport code (using 
a C-API to a 3rd-party sockets vendor). When the Java code talks 
to another process, it calls down to the C++ layer.  This C++ code 
establishes the connection to the outside, and creates a 
"Receive" thread to receive messages from the newly created socket.  
Or, if another process initiates contact, the same receive thread is
created for incoming messages.  

When a new message is received from a remote process, very
minimal processing is done at the C++ layer before the JNI up-call
takes place. The Java code invoked from JNI processes the ORB message
and figures out which handling thread (a pure Java thread) the
message should be dispatched to. The message is just put onto
an internal queue, and then the dispatch thread picks it up and
calls application code (such as plug-in code) to actually do the
application-level work.

Objects in Question
-------------------
So, the C++ stuff is pretty thin, and just interacts with the
older C-API to the 3rd party vendor software (which itself is
really just a layer on top of sockets).  The C++ threads that
are created are connected to the JVM so they can make calls to
the VM to create buffers, which the incoming messages are copied
into.  That buffer is basically the only Java object that the
C++ thread creates, and it is passed up during the JNI up-call.
This buffer is copied into separate objects created by the ORB
code, so after the JNI call, the C++ created buffers are no
longer referenced.

So, the objects that get modified (unexpectedly) after the few
Full GCs in 1.4.2 are neither created by the C++ code nor
stored or referenced in it.
----------

The GC output and log information is available in the attachments.

CONVERTED DATA
BugTraq+ Release Management Values
COMMIT TO FIX: 1.4.2_05 generic
FIXED IN: 1.4.2_05 tiger-beta2
INTEGRATED IN: 1.4.2_05 tiger-beta2
VERIFIED IN: 1.4.2_05
08-07-2004
EVALUATION

###@###.### 2004-04-18
PSPromotionManager::oop_promotion_failed() pushes an object that could not be promoted onto the claimed stack:
claimed_stack()->push(obj);
at or near line 345. The claimed stack is a per-thread stack that has a fixed size. The code does not check whether the push succeeded. If the push fails, the object may not be scanned during promotion failure handling. An application could then hold a reference to an object in eden or from-space that does not get updated to the object's final location. This is consistent with some of CBOE's observations. However, this should sometimes lead to a VM crash, which is not observed.

This problem exists in 1.4.2 and 1.5. 1.4.1 does not use a claimed stack but rather uses a GrowableArray for saving such objects, so it would not have this problem. I've built a 1.4.2 VM with a fix to see if it will eliminate the problem.

###@###.### 2004-04-19
This bug can lead to two copies of the same object: one in to-space and the other in eden or from-space. Suppose object A has a reference to X, and A is scanned so that X is copied from eden to to-space. Say object B also has a reference to X but is not scanned. Then B's reference still points to the copy in eden. When the full collection comes along to clean up after the failed promotion, A and B are both scanned and their respective copies of X (both appearing alive) survive the collection, resulting in two copies.
08-07-2004
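The failure mode described in the evaluation can be illustrated with a small standalone model. This is only a sketch under stated assumptions: Obj, FixedStack, scan() and references_agree_after_scan() are hypothetical names invented for illustration and are not HotSpot types or functions. The overflow list used in the checked path is likewise an assumption, loosely echoing the growable array that 1.4.1 is said to use; this report does not describe the fix that actually went into 1.4.2_05.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-in for a heap object; not a HotSpot type.
struct Obj {
    Obj* ref = nullptr;        // the single reference this object holds
    Obj* forwardee = nullptr;  // new location once the object has been copied
};

// Fixed-capacity stack: push() reports failure instead of growing.
class FixedStack {
public:
    explicit FixedStack(std::size_t capacity) : capacity_(capacity) {}

    bool push(Obj* obj) {
        if (items_.size() >= capacity_) return false;  // full: nothing stored
        items_.push_back(obj);
        return true;
    }

    const std::vector<Obj*>& items() const { return items_; }

private:
    std::size_t capacity_;
    std::vector<Obj*> items_;
};

// Scan one holder: copy its referent into to-space if that has not happened
// yet, then update the holder's reference to the new location.
void scan(Obj* holder, std::deque<Obj>& to_space) {
    Obj* target = holder->ref;
    if (target == nullptr) return;
    if (target->forwardee == nullptr) {
        to_space.emplace_back();  // deque keeps earlier elements in place
        target->forwardee = &to_space.back();
    }
    holder->ref = target->forwardee;
}

// A and B both refer to X. Both are queued for scanning on a stack that only
// has room for one entry. Returns true if A and B still agree on where X
// lives after the queued objects have been scanned.
bool references_agree_after_scan(bool check_push) {
    Obj x, a, b;
    a.ref = &x;
    b.ref = &x;
    Obj* pending[] = {&a, &b};

    FixedStack stack(1);
    std::vector<Obj*> overflow;  // assumed fallback, used only when checking
    for (Obj* o : pending) {
        bool pushed = stack.push(o);
        // Unchecked variant: a failed push is silently dropped.
        if (!pushed && check_push) overflow.push_back(o);
    }

    std::deque<Obj> to_space;
    for (Obj* o : stack.items()) scan(o, to_space);
    for (Obj* o : overflow) scan(o, to_space);
    return a.ref == b.ref;
}
```

In this model, references_agree_after_scan(false) leaves B pointing at the old X while A points at the copy, which corresponds to the two-copies situation described above; references_agree_after_scan(true) scans B from the overflow list, so both references end up at the same copy.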
WORK AROUND
Use -XX:+UseParNewGC instead of -XX:+UseParallelGC.
08-07-2004
PUBLIC COMMENTS
no comment
08-07-2004