JDK-7011759 : 6u16: Recovering buffer manager read stream underflow from protocols are failing
  • Type: Bug
  • Component: other-libs
  • Sub-Component: corba:orb
  • Affected Version: 6u16
  • Priority: P2
  • Status: Closed
  • Resolution: Duplicate
  • OS: solaris_10
  • CPU: sparc
  • Submitted: 2011-01-12
  • Updated: 2012-10-11
  • Resolved: 2012-10-11
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 6 JDK 7
6-poolResolved 7Resolved
Description
Applied Materials has a major customer experienced an issue in their semiconductor production fab when one of java processes had to be restarted because of com.sun.corba.se.impl.encoding.BufferManagerReadStream underflow

This is on Solaris 10, Java 1.6.0_16. 

After this happens communication between those processes does not succeed. After restarting the process that logs the failure communication resumes normally.

It used to work on 1.4 and 1.5

Having checked with the code and CR#6372405, we had introduced private long FRAGMENT_TIMEOUT = 60000 in 6u10 onwards.

The old implementation was waited indefinitely.

Dec 23, 2010 11:38:17 AM com.sun.corba.se.impl.encoding.BufferManagerReadStream underflow
WARNING: "IOP00410217: (COMM_FAILURE) Timeout while reading data in buffer manager"
org.omg.CORBA.COMM_FAILURE:   vmcid: SUN  minor code: 217  completed: No
	at com.sun.corba.se.impl.logging.ORBUtilSystemException.bufferReadManagerTimeout(Unknown Source)
	at com.sun.corba.se.impl.logging.ORBUtilSystemException.bufferReadManagerTimeout(Unknown Source) 

There is a 60 second timeout when waiting for the data.


Dec 23, 2010 11:38:17 AM com.sun.corba.se.impl.interceptors.PIHandlerImpl peekClientRequestInfoImplStack
WARNING: "IOP00710817: (INTERNAL) Assertion failed: client request info stack is null"
org.omg.CORBA.INTERNAL:   vmcid: SUN  minor code: 817  completed: No
	at com.sun.corba.se.impl.logging.InterceptorsSystemException.clientInfoStackNull(Unknown Source)
	at com.sun.corba.se.impl.logging.InterceptorsSystemException.clientInfoStackNull(Unknown Source)
	at com.sun.corba.se.impl.interceptors.PIHandlerImpl.peekClientRequestInfoImplStack(Unknown Source) 

The cu asked why is the client request info stack null?  This stack trace is the same thread as what just logged the underflow?

We have the cursory glanced the code

765      * Convenience method to get the ClientRequestInfoImpl object off the 
766      * top of the ThreadLocal stack.  Throws an INTERNAL exception if 
767      * the Info stack is empty.
768      */
769     private ClientRequestInfoImpl peekClientRequestInfoImplStack() {
770         RequestInfoStack infoStack = 
771             (RequestInfoStack)threadLocalClientRequestInfoStack.get();
772         ClientRequestInfoImpl info = null;
773         if( !infoStack.empty() ) {
774         info = (ClientRequestInfoImpl)infoStack.peek();
775     } else {
776         throw wrapper.clientInfoStackNull() ;
777     }
778         
779         return info;
780     }


We are not sure this is related to the timeout issue or not.

We have suggested the customer to put

1. -ORBDebug transport,subcontract,transientObjectManager,invocationTiming to enable further debugging on the corba side
2. -verbose:gc -Xloggc:/path/to/gc-`/bin/date +%Y-%m-%d--%H-%M`.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps to check the jvm health
3. run snoop to capture the network packets

when the problem happens next time

Please review the attached logs and docs if this is a bug or not.

Thanks

Comments
SUGGESTED FIX The method in com.sun.corba.se.impl.orbutil.threadpool.ThreadPoolImpl needs to be modified as below > /** > * To be called from the workqueue when work is added to the > * workQueue. This method would create new threads if required > * or notify waiting threads on the queue for available work > */ > void notifyForAvailableWork(WorkQueue aWorkQueue) { > synchronized (aWorkQueue) { > if (availableWorkerThreads < aWorkQueue.workItemsInQueue() ) { > createWorkerThread(); > } else { > aWorkQueue.notify(); > } > } > }
14-03-2011

EVALUATION Below is the email communication from the customer explaining the cause and resolution. -------- Original Message -------- Subject: Re: BufferManagerReadStream underflow timeout - SOLVED Date: Fri, 4 Mar 2011 10:18:16 -0700 From: ###@###.### To: ###@###.### CC: ###@###.###, ###@###.###, ###@###.### I believe that I have finally figured this out. Since I sent this email to you early on I wanted to follow-up. I believe this is the problem that I have seen reported by others on the internet. This is in the method ThreadPoolImpl.notifyForAvailableWork(). /** * To be called from the workqueue when work is added to the * workQueue. This method would create new threads if required * or notify waiting threads on the queue for available work */ void notifyForAvailableWork(WorkQueue aWorkQueue) { synchronized (workQueue) { if (availableWorkerThreads == 0) { createWorkerThread(); } else { aWorkQueue.notify(); } } } In my case I had a lot of threads creating connections at the same time. I also had a single connection receiving messages. The problem is that these threads were stacked up waiting to get into the synchronized block here. The first thread gets in and finds that there is an available thread(the last one) and does a notify. Before the available thread pool thread wakes from the notify another thread gets into the block and does a notify. It happens again with another thread. Now the last available thread has been notified multiple times and gets the first element off the work queue. He then goes and listens on a socket. In fact all the threads are now busy on long lived jobs. Not until two minutes later in my case did something finally call notifyForAvailableWork again creating another thread allowing the work queue to finally be processed. This is a little funny to solve. It seems weird moving the availableWorkThread count to be adjusted in the notify but perhaps that is the right thing. Otherwise you need another counter that counts notifies being processed. Other options would involve reworking this I am afraid. I do hope that we can come to a solution quickly. Thanks for looking into this! Clint (P.S. It seems bad practice to synchronize on workQueue and then notify on aWorkQueue - I know they are the same object but it is pretty weird and will cause an exception if they aren't the same.)
07-03-2011