JDK-6304067 : 1.4.2xx only - Sun ORB Hang
  • Type: Bug
  • Component: other-libs
  • Sub-Component: corba:orb
  • Affected Version: 1.4.2_05
  • Priority: P1
  • Status: Resolved
  • Resolution: Fixed
  • OS: solaris_8
  • CPU: sparc
  • Submitted: 2005-07-29
  • Updated: 2010-02-17
  • Resolved: 2005-09-02
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other
1.4.2_10 b02Fixed
Related Reports
Relates :  
Description
Operating System:SUN Solaris
OS version:Solaris 8
Product Name:JRE (SUN ORB part of it anyway)
Product version:1.4.2_05 (SUN ORB part of it anyway)
Hardware platform:SunFire V880
Severity: 1

Short description: 
When the default values of (com.sun.CORBA.connection.ORBHighWaterMark | com.sun.CORBA.connection.ORBLowWaterMark) are exceeded, the "JavaIDL Listener" thread becomes permenantly (perhaps semi-permenantly) hung in trying to cleanup a particular IIOPConnection, which prevents any and all new  clients from successfully connecting to this CORBA Server (EventBrokerServer). The server eventually had to be killed in order to recover it from its problematic state. The "JavaIDL Listener" thread was stuck in the state described in the below stack trace. This problem has manifested itself on 2 other servers.

Full problem description:
When the default values of (com.sun.CORBA.connection.ORBHighWaterMark | com.sun.CORBA.connection.ORBLowWaterMark) are exceeded and the SUN ORB is trying to cleanup, 

1. the following "com.sun.corba.se.internal.iiop.IIOPConnection$DeleteConn" exceptions to be thrown and noticed in the file that is capturing the stdout and sdterr streams.

com.sun.corba.se.internal.iiop.IIOPConnection$DeleteConn
	at com.sun.corba.se.internal.iiop.IIOPConnection.delete(Unknown Source)
	at com.sun.corba.se.internal.iiop.IIOPConnection.writeLock(Unknown Source)
	at com.sun.corba.se.internal.iiop.IIOPConnection.purge_calls(Unknown Source)
	at com.sun.corba.se.internal.iiop.ReaderThread.run(Unknown Source)
com.sun.corba.se.internal.iiop.IIOPConnection$DeleteConn
	at com.sun.corba.se.internal.iiop.IIOPConnection.delete(Unknown Source)
	at com.sun.corba.se.internal.iiop.IIOPConnection.writeLock(Unknown Source)
	at com.sun.corba.se.internal.iiop.IIOPConnection.purge_calls(Unknown Source)
	at com.sun.corba.se.internal.iiop.ReaderThread.run(Unknown Source)
com.sun.corba.se.internal.iiop.IIOPConnection$DeleteConn
	at com.sun.corba.se.internal.iiop.IIOPConnection.delete(Unknown Source)

"JavaIDL Listener" daemon prio=5 tid=0x0056cbe0 nid=0xc waiting for monitor entry [ed6ff000..ed6ffc30]
	at com.sun.corba.se.internal.iiop.IIOPConnection.cleanUp(Unknown Source)
	- waiting to lock <0xf0c4f0f8> (a java.lang.Object)
	- locked <0xf0c4ebf0> (a com.sun.corba.se.internal.iiop.IIOPConnection)
	at com.sun.corba.se.internal.iiop.ConnectionTable.cleanUp(Unknown Source)
	at com.sun.corba.se.internal.iiop.ConnectionTable.checkConnectionTable(Unknown Source)
	at com.sun.corba.se.internal.iiop.ListenerThread.run(Unknown Source)

2. The "JavaIDL Reader" thread shown below in a permenant (possibly semi-permenant) hung state because it had synchronized on com.sun.corba.se.internal.iiop.IIOPConnection's "stateEvent" and within that synchronization block had synchronized on "writeEvent" and then was waiting to be notified on the "writeEvent" while holding on to the synchronization on "stateEvent". No one seems to be notifying it.

"JavaIDL Reader for 10.97.4.21:2783" daemon prio=5 tid=0x005dcc08 nid=0x24e2 in Object.wait() [e987f000..e987fc30]
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Unknown Source)
	at com.sun.corba.se.internal.iiop.IIOPConnection.writeLock(Unknown Source)
	- locked <0xf0c4f100> (a java.lang.Object)
	- locked <0xf0c4f0f8> (a java.lang.Object)
	at com.sun.corba.se.internal.iiop.IIOPConnection.purge_calls(Unknown Source)
	at com.sun.corba.se.internal.iiop.ReaderThread.run(Unknown Source)

3. At a later point in time, when the "JavaIDL Listener" thread detected that it needed to cleanup some connections and selected and tried to cleanup the above mentioned IIOPConnection, it eventually ended up calling writeLock() and getting permenantly (semi-permenantly) blocked on trying to synchronize on "stateEvent".

"JavaIDL Listener" daemon prio=5 tid=0x0056cbe0 nid=0xc waiting for monitor entry [ed6ff000..ed6ffc30]
	at com.sun.corba.se.internal.iiop.IIOPConnection.cleanUp(Unknown Source)
	- waiting to lock <0xf0c4f0f8> (a java.lang.Object)
	- locked <0xf0c4ebf0> (a com.sun.corba.se.internal.iiop.IIOPConnection)
	at com.sun.corba.se.internal.iiop.ConnectionTable.cleanUp(Unknown Source)
	at com.sun.corba.se.internal.iiop.ConnectionTable.checkConnectionTable(Unknown Source)
	at com.sun.corba.se.internal.iiop.ListenerThread.run(Unknown Source)

I can be noticed that the "JavaIDL Listener" has locked <0xf0c4ebf0> and is waiting to lock <0xf0c4f0f8> which has been locked by the "JavaIDL Reader for 10.97.4.21:2783", which has subsequently locked <0xf0c4f100> and is waiting for a notification on the same which never happens.

Code snipets from SUN's CORBA code.
com.sun.corba.se.internal.iiop.IIOPConnection

    /**
     * Cleans up this Connection.
     * This is called from ConnectionTable, from the ListenerThread.
     * Note:it is possible for this to be called more than once
     */
    public synchronized void cleanUp() throws Exception
    {
        writeLock();
	  ...
    }

    /**
     * Sets the writeLock for this connection.
     * If the writeLock is already set by someone else, block till the
     * writeLock is released and can set by us.
     * IMPORTANT: this connection's lock must be acquired before
     * setting the writeLock and must be unlocked after setting the writeLock.
     */
    protected boolean writeLock()
    {

        // Keep looping till we can set the writeLock.
        while ( true ) {
            synchronized ( stateEvent ){
                switch ( state ) {
                
                case OPENING:
                    try {
                        stateEvent.wait();
                    } catch (InterruptedException ie) {};
                    // Loop back
                    break;
                
                case ESTABLISHED:
                    synchronized (writeEvent) {
                        if (!writeLocked) {
                            writeLocked = true;
                            return true;
                        }
                    
                        try {
                            writeEvent.wait();
                        } catch (InterruptedException ie) {};
                    }
                    // Loop back
                    break;
                
                    //
                    // XXX
                    // Need to distinguish between client and server roles
                    // here probably.
                    //
                case ABORT:
                    throw new COMM_FAILURE( MinorCodes.WRITE_ERROR_SEND,
                                            CompletionStatus.COMPLETED_NO);
                     
                case CLOSE_RECVD:
                    // the connection has been closed or closing
                    // ==> throw rebind exception
                
                    throw new COMM_FAILURE( MinorCodes.CONN_CLOSE_REBIND,
                                            CompletionStatus.COMPLETED_NO);
                
                default:
                    if (orb.transportDebugFlag)
                        dprint("Connection:writeLock: weird state");
                
                    delete(Connection.CONN_ABORT);
                    return false;
                }
            }
        }
    }

    /**
     * Release the write lock on this connection.
     */
    protected void writeUnlock()
    {
        synchronized (writeEvent) {
            writeLocked = false;
            writeEvent.notify(); // wake up one guy waiting to write
        }
    }

"JavaIDL Reader" thread is stuck on writeEvent.wait() and 
"JavaIDL Listener" thread is stuck on synchronized (stateEvent).

Basically no one seems to be calling writeUnlock() which is the only way the "JavaIDL Reader" thread will release the lock on "stateEvent" and allow the "JavaIDL Listener " thread to proceed.

We have three thread dumps from 3 different servers that can be sent via email once you contact us.

Test case: 
We are working on it, but as you can imagine it is not very easy to create a test case for this since we are not as familiar with the internals of the SUN ORB and do not know exactly when and how the problem can be made to manifest itself. 
What we are sure of are the following things.
1. The problem seems to be manifesting itself only after the (com.sun.CORBA.connection.ORBHighWaterMark | com.sun.CORBA.connection.ORBLowWaterMark) are exceeded and cleaup code is executed.
2. It seems that the problem can happen regardless of who is executing the cleanup code i.e. "JavaIDL Listener" or some other thread.
3. Some "JavaIDL Reader" thread invariably gets stuck waiting for the "writeLocked" on its IIOPConnection to be set to false which seems to never happen. 
4. Since the thread mentioned in #3 also has other lock that eventually "JavaIDL Listener" thread gets stuck on, the problem manifests itself.

including herewith 3 files each of which represents one of our servers. Each file has multiple threaddumps and the last threaddump in each shows the exact scenario described as part of the problem report.
 
Also, we have been digging deeper and we have discovered that the following seems to be at the core of the issue.
 
It seems that when the "JavaIDL Listener" (com.sun.corba.se.internal.iiop.ListenerThread) thread is trying to establish a new connection and discovers that cleanup needs to be done, it calls com.sun.corba.se.internal.iiop.ConnectionTable.checkConnectionTable() which in turn calls com.sun.corba.se.internal.iiop.ConnectionTable.cleanup() which identifies the connection to cleanup and then calls com.sun.corba.se.internal.iiop.IIOPConnection.cleanup(). I have cut and pasted that code below. We believe that when this method calls sendCloseConnection(), this creates a classic race condifiton, because this method performs the writeLock() and then one of two things can happen.
1. The Listener thread will successfully lock the stateEvent before any other thread has a chance to lock it and in this case all the right things will happen. This is the green light scenario.
2. The sendCloseConnection() would cause the connection to be closed from the other end, which would cause the "JavaIDL Reader" (com.sun.corba.se.internal.iiop.ReaderThread) thread that is doing socketRead() to get a IOException, in which case it will end up calling the com.sun.corba.se.internal.iiop.IIOPConnection.purge_calls() with lockHeld set to false. I have also cut and pasted that code below. As you can see, since the lochHeld is false, purge_calls() calls com.sun.corba.se.internal.iiop.IIOPConnection.writeLock().  I have also cut and pasted that code below. As you can see, the com.sun.corba.se.internal.iiop.IIOPConnection.writeLock() method locks the stateEvent and then subsequently locks the writeEvent, and since writeLocked would be set to true becaue the Listener thread has already called writeLock(), it simply does a wait() on the writeEvent to be informed when the writeLocked gets set to false. But unfortunately this will be a permenant wait because only the Listener thread would call writeUnlocked() but since it first needs to lock stateEvent (which is currently locked by Reader thread) it is waiting for the stateEvent to be unlocked which will never be unlocked because the Reader thread has it locked and is waiting to be notified on the writeEvent and so on and so forth. So while not actually technically a deadlock situation, it has for all intents and purposes the same impact as a deadlock. This is the red light scenario. Basically if the reader thread beats the listener thread in locking the stateEvent, we are permenantly screwed. This is purely a timing window issue and is exactly what is happening in the problems that we are experiencing.
 
One of our engineers (Tai-Sin Quah) has developed a partial test case which while not reproducing the exact same situation, does reproduce something pretty similar and close to it. I am calling it partial because it only reporoduces the problem on Windows platform (all the time on Windows XP amd sometimes on Windows 2000) but not on SUN platform. BUT the reason I am mentioning this is that when he commented out the sendCloseConnection() (we are not suggesting that that is the right solution, we were simply hacking to prove our theory) the problem could not be reproduced at all thereby proving our analysis.
 
We would also like to point out further that the following code in the cleanup() method
            // stop the reader without causing it to do purge_calls
            Exception ex = new Exception();
            reader.stop(ex);
indicates that the developers were indeed aware of the race condition and were trying to prevent it by stopping the reader thread. Unfortunately there are two problems/concerns with this code.
1. A deprecated and unsafe method stop() is being used which SUN strongly encourages recommends be not used.
2. It is being done too late in the game to prevent the race condition mentioned above. In the event that listener thread wins the race condition to lock stateEvent (green light scenario) this may help, but in the event that the reader thread wins the race condition to lock stateEvent (red light scenario) the listener thread will never get to that safety net code and for all intents and purposes the listener thread and the reader thread will be lockedup for ever.
 
Hopes all this analysis helps us to get to a quick effecient and elegant resolution to the problem at hand.

cannot upgrade to 1.4.2_08 because the version of our product that the customer is running was QAed and certified only against 1.4.2_05 and also because it happens to be a 24X7 highly available production environment.
 
However, the testcase that was mentioned has exhibited the following behaviour so far.
 
1. On Windows XP, it reproduces the issue everytime with 1.4.2_06 as well as 1.4.2_08.
2. On Windows 2000, we have only tried 1.4.2_08. It reproduces the problem only if
    a) the stdout and stderr are redirected to a flat file
    b) the stderr is redirected to a flat file.
3. On Solaris, we have only tried 1.4.2_08 and the problem cannot be reproduced.

*** Attached 4 Files including 3 thread dumps and code snippet file having suspicion ***

Thanking you in anticipation.

Comments
EVALUATION It appears that this bug may be identical to bug 4752117, which was fixed some time ago in the app server deliveries. 4752117 is fixed in the main CORBA code base, in AS 8.x, and in J2SE 1.5.0 and later, but apparently not in J2SE 1.4.2_08. The change that was made for 4752117 is basically to avoid holding the lock on stateEvent while waiting to acquire the lock on writeEvent in the writeLock() method. Obviously the original implementation was bad, because the following could happen, possibly in multiple different ways: 1. Thread 1 returns from writeLock() 2. Thread 2 tries to call writeLock(), but gets blocked waiting for Thread 1 to call writeUnlock(). This causes Thread 2 to hold the lock on stateEvent (in the version in the 1.4.2_08 code). 3. Now thread 1 runs again, and does anything that requires a lock on stateEvent. At this point, these two threads are deadlocked. It appears from the changes that were made that there is no need to hold the stateEvent lock inside writeLock() while waiting for the writeEvent lock. This change was made in the app server releases as well as in JDK 5.0, but apparently not in J2SE 1.4.2. Bug 4758801 was filed about this same problem to make the fix in the ORB in the JDK releases, but it appears that this problem was only fixed in 1.4.1_05 and the fix was not applied to the 1.4.2_xx stream whenever that fix was made. Obviously we need to make this fix in 1.4.2 as well. We will continue evaluating this problem further, but I think we are close to a solution.
30-07-2005