Bug ID: JDK-5024566 Object integrity maybe changing using ParallelGC when a Full GC occurs

Details
Type:
Bug
Submit Date:
2004-03-31
Status:
Closed
Updated Date:
2004-06-11
Project Name:
JDK
Resolved Date:
2004-05-10
Component:
hotspot
OS:
solaris_8
Sub-Component:
gc
CPU:
sparc
Priority:
P2
Resolution:
Fixed
Affected Versions:
1.4.2_01
Fixed Versions:
1.4.2_05 (05)

Related Reports
Backport:

Sub Tasks

Description
The customer is running Solaris 8 on an eight-CPU system.
When a Full GC occurs under 1.4.2, their transaction server
throws DataValidationExceptions complaining that the integrity of the data
changed once the collection finished. This causes rollbacks of
transaction requests, so trades go unfilled and money is lost.
At that time the server is using ParallelGC only to collect the
young generation.

The customer says the only change in their trading environment was replacing
1.4.1_03 with 1.4.2_03. The transaction server is written in C++,
so they have native threads referencing Java objects. However, turning
on the CMS collector together with UseParNewGC seems to hide the problem: they
never see the update exceptions after a Full GC when using these options
with 1.4.2.

The customer has run a couple of tests with the UseParNewGC collector for several
hours and did not experience any UpdateExceptions from their transaction
server after a Full GC occurred. The heap options are listed below:
 
command: /usr/local/j2sdk1.4.2_01/bin/java
-server -showversion -Xms512m
-Xmx512m -XX:NewSize=500m -XX:MaxNewSize=500m -XX:InitialSurvivorRatio=4
-XX:TargetSurvivorRatio=100 -XX:+PrintCompilation -XX:+UseParNewGC
-XX:MaxPermSize=256MB -XX:PermSize=3m -XX:MinPermHeapExpansion=1m
-XX:MaxPermHeapExpansion=10m -XX:-UseAdaptiveSizePolicy
-XX:+DisableExplicitGC -XX:+PrintTenuringDistribution
-XX:+PrintHeapAtGC -verbose:gc -XX:+PrintGCTimeStamps -Xnoclassgc

Why would there be such a difference in behavior?

Background 
-----------
The application is more like 98% Java, and 2% C++.  
The C++ code handles some of their ORB transport code (using 
a C-API to a 3rd-party sockets vendor). When the Java code talks 
to another process, it calls down to the C++ layer.  This C++ code 
establishes the connection to the outside, and creates a 
"Receive" thread to receive messages from the newly created socket.  
Or, if another process initiates contact, the same receive thread is
created for incoming messages.  

When a new message is received from a remote process, very 
minimal processing is done at the C++ layer before the JNI UpCall 
takes place.  The Java code invoked from JNI processes the ORB message
and figures out which handling thread (a pure Java thread) the 
message should be dispatched to.  The message is just put onto 
an internal queue, and then the dispatch thread picks it up and 
calls application code (like plug-in code) to actually do 
application-level work.

Objects in Question
-------------------
So, the C++ stuff is pretty thin, and just interacts with the
older C-API to the 3rd party vendor software (which itself is
really just a layer on top of sockets).  The C++ threads that
are created are connected to the JVM so they can make calls to
the VM to create buffers, which the incoming messages are copied
into.  That buffer is basically the only Java object that the
C++ thread creates, and it is passed up during the JNI up-call.
This buffer is copied into separate objects created by the ORB
code, so after the JNI call, the C++ created buffers are no
longer referenced.

So, the objects that get modified (unexpectedly) after the few
Full GCs in 1.4.2 are not created by the C++ code, nor are they
stored or referenced in the C++ code.
----------

The GC output and log information is available in the attachments.



                                    

Comments
PUBLIC COMMENTS

no comment
                                     
2004-07-08
WORK AROUND

Use -XX:+UseParNewGC instead
                                     
2004-07-08
EVALUATION

###@###.### 2004-04-18

PSPromotionManager::oop_promotion_failed() pushes an object that
could not be promoted onto the claimed stack.

    claimed_stack()->push(obj);

at or near line 345.  The claimed stack is a per-thread stack that has
a fixed size.  The code does not check whether the push succeeded.  If
the push fails, the object may not be scanned during promotion failure
handling.
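
The failure mode can be illustrated with a minimal sketch. This is not the
HotSpot source: FixedStack, record_unchecked, record_checked and the
capacity are hypothetical names chosen for illustration, and plain ints
stand in for objects. It assumes only what is stated above, namely a
fixed-size stack whose push can fail and a caller that ignores the result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a fixed-size per-thread stack; not HotSpot code.
template <typename T, std::size_t Capacity>
class FixedStack {
public:
    // Returns false when the stack is full and the element was NOT stored.
    bool push(const T& value) {
        if (size_ == Capacity) return false;
        items_[size_++] = value;
        return true;
    }
    std::size_t size() const { return size_; }

private:
    T items_[Capacity];
    std::size_t size_ = 0;
};

// Unchecked pattern: the result of push() is ignored, so anything beyond
// the capacity is silently dropped and would never be scanned later.
template <std::size_t Capacity>
std::size_t record_unchecked(const std::vector<int>& objects) {
    FixedStack<int, Capacity> stack;
    for (int obj : objects) {
        stack.push(obj);  // a failed push goes unnoticed here
    }
    return stack.size();
}

// Checked pattern: objects that do not fit go to a growable overflow list.
template <std::size_t Capacity>
std::size_t record_checked(const std::vector<int>& objects) {
    FixedStack<int, Capacity> stack;
    std::vector<int> overflow;
    for (int obj : objects) {
        if (!stack.push(obj)) overflow.push_back(obj);
    }
    assert(stack.size() + overflow.size() == objects.size());  // nothing lost
    return stack.size() + overflow.size();
}
```

With six objects and a capacity of four, record_unchecked<4> records only
four of them while record_checked<4> accounts for all six. The overflow
list is just one possible fallback and is not claimed to be the fix that
was integrated.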

An application could have a reference to an object in eden or from-space that
does not get updated to the object's final location.  This is consistent with
some of CBOE's observations.  However, this should sometimes lead to a VM
crash, which has not been observed.

This problem exists in 1.4.2 and 1.5.

1.4.1 does not use a claimed stack but rather uses a GrowableArray for
saving such objects, so it would not have this problem.

I've built a 1.4.2 VM with a fix to see if it will eliminate the 
problem.

###@###.### 2004-04-19

This bug can lead to two copies of the same object - one in to-space and the 
other in eden or from-space.  Suppose object A has a reference
to X and A is scanned so that X is copied from eden to to-space. 
Say object B  has a reference to X but is not scanned.  Then
B's reference points to the copy in eden.  When the full collection 
comes along to clean up after the failed promotion, A and B are
both scanned and their respective copies of X (both appearing
alive) survive the collection resulting in two copies.
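
The A/B/X scenario can be reproduced in a toy model. The sketch below is
not HotSpot code and every name in it (Obj, Holder, ToyScavenger, the two
helper functions) is hypothetical. It assumes a copying step that leaves a
forwarding pointer in the original and updates only the holders it scans.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy model of a young-generation object; not HotSpot code.
struct Obj {
    int payload = 0;
    Obj* forwardee = nullptr;  // set once the object has been copied
};

// An object holding a reference to a young object (A or B in the text).
struct Holder {
    Obj* ref = nullptr;
};

class ToyScavenger {
public:
    // Scanning a holder copies its referent to to-space (at most once)
    // and updates the holder's reference to point at the copy.
    void scan(Holder& h) {
        assert(h.ref != nullptr);
        Obj* old = h.ref;
        if (old->forwardee == nullptr) {
            to_space_.push_back(std::unique_ptr<Obj>(new Obj(*old)));
            old->forwardee = to_space_.back().get();
        }
        h.ref = old->forwardee;
    }

private:
    std::vector<std::unique_ptr<Obj>> to_space_;
};

// B is never scanned, so A and B end up referring to different copies of X.
bool diverges_when_b_is_skipped() {
    Obj x;  // X, living in "eden"
    x.payload = 42;
    Holder a, b;
    a.ref = &x;
    b.ref = &x;
    ToyScavenger gc;
    gc.scan(a);           // A scanned: X copied, A updated
    a.ref->payload = 43;  // a write through A reaches only the to-space copy
    return a.ref != b.ref && b.ref->payload == 42;
}

// When both holders are scanned they agree on a single copy of X.
bool consistent_when_both_scanned() {
    Obj x;
    x.payload = 42;
    Holder a, b;
    a.ref = &x;
    b.ref = &x;
    ToyScavenger gc;
    gc.scan(a);
    gc.scan(b);
    a.ref->payload = 43;
    return a.ref == b.ref && b.ref->payload == 43;
}
```

In this model a write made through A is not visible through B once B has
been skipped. That is one way an application could see data that appears
to have changed across a collection, though the sketch does not establish
that this is exactly what the customer observed.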
                                     
2004-07-08
CONVERTED DATA

BugTraq+ Release Management Values

COMMIT TO FIX:
1.4.2_05
generic

FIXED IN:
1.4.2_05
tiger-beta2

INTEGRATED IN:
1.4.2_05
tiger-beta2

VERIFIED IN:
1.4.2_05


                                     
2004-07-08


