JDK-8000729 : Race condition in CORBA code causes re-use of ABORTed connections
  • Type: Backport
  • Backport of: JDK-7056731
  • Component: other-libs
  • Sub-Component: corba:idl
  • Affected Version: 6u18,6u24
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2012-10-10
  • Updated: 2012-10-20
  • Resolved: 2012-10-10
JDK 6: 6u33 (Fixed)
Description
FULL PRODUCT VERSION :
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) Client VM (build 19.1-b02, mixed mode)


ADDITIONAL OS VERSION INFORMATION :
Linux xyz 2.6.18-238.5.1.el5PAE #1 SMP Mon Feb 21 06:01:16 EST 2011 i686 i686 i386 GNU/Linux

A DESCRIPTION OF THE PROBLEM :
A race condition in the CORBA code can leave ABORTed connections in the CORBA connection cache. An exception is thrown just after the connection has been marked as ABORT but before the connection is removed from the cache, so the removal never happens. This, and similar problems in which missing exception handling around ABORTed connections leaves the CORBA client in an inconsistent state, has been encountered in production environments.

Affects versions 1.6.0_24+



STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
In the standard flow of events we expect logic similar to the following:
1) A request is sent to the CORBA client code
2) The CORBA code creates a connection, stores it in the cache and sends the request to the remote server
3) While waiting for the response the CORBA client code puts the request thread in a 'waiting room'
4) When the response has been returned the CORBA client code unregisters the waiter from the waiting room and returns back to the requestor
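
The 'waiting room' in steps 3 and 4 can be pictured with the following minimal sketch. It is an illustration only: WaitingRoomSketch, its method signatures and the use of CompletableFuture are assumptions made for this example and do not reproduce the real CorbaResponseWaitingRoomImpl.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified waiting room; not the real CorbaResponseWaitingRoomImpl.
final class WaitingRoomSketch {
    // One pending reply per outstanding request id.
    private final Map<Integer, CompletableFuture<String>> waiters =
        new ConcurrentHashMap<Integer, CompletableFuture<String>>();

    // Step 3: the requesting thread registers itself before it starts waiting.
    void registerWaiter(int requestId) {
        waiters.put(requestId, new CompletableFuture<String>());
    }

    // Called by the connection's reader thread when a reply arrives.
    void responseReceived(int requestId, String reply) {
        CompletableFuture<String> waiter = waiters.get(requestId);
        if (waiter != null) {
            waiter.complete(reply);
        }
    }

    // Blocks the requesting thread until its reply has been delivered.
    String waitForResponse(int requestId) {
        return waiters.get(requestId).join();
    }

    // Step 4: the requesting thread leaves the waiting room.
    void unregisterWaiter(int requestId) {
        waiters.remove(requestId);
    }

    int waiterCount() {
        return waiters.size();
    }

    // Runs one request/response round trip and returns the reply.
    static String roundTrip() {
        final WaitingRoomSketch room = new WaitingRoomSketch();
        room.registerWaiter(1);
        // Stand-in for the thread that reads replies from the connection.
        Thread reader = new Thread(new Runnable() {
            public void run() {
                room.responseReceived(1, "Hello world !!");
            }
        });
        reader.start();
        String reply = room.waitForResponse(1);
        room.unregisterWaiter(1);
        return reply;
    }

    public static void main(String[] args) {
        System.out.println(roundTrip());
    }
}
```

In this sketch the interval between the reply being delivered and unregisterWaiter being called corresponds to the window in which the race described next can occur.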

The race condition can arise when the server is killed just after it has returned a response to the client but before the CORBA client code has had time to remove the thread from the 'waiting room':
1) A thread has been placed in the CORBA response 'waiting room'.
2) The CORBA server returns a response, the CORBA client code starts processing it, closes things up, and is currently at the line unregisterWaiter(orb) in com.sun.corba.se.impl.protocol.CorbaClientRequestDispatcherImpl (line 889).
3) Before it executes unregisterWaiter(orb), the CORBA server is killed.
4) At this point, the thread monitoring the connection cleans up by running the purgeCalls method of SocketOrChannelConnectionImpl.java. This method first sets the connection state to ABORT and then calls the signalExceptionToAllWaiters method of the CorbaResponseWaitingRoomImpl class. That method throws an exception because the response has actually completed successfully. The exception propagates out of purgeCalls because there is no exception handler. This leaves the connection in an inconsistent state: it is marked as ABORT but has not been removed from the connection cache.
5) When another request comes into the CORBA client code, it retrieves this ABORTed connection for use: the client then throws the exception:
org.omg.CORBA.COMM_FAILURE: vmcid: SUN minor code: 203 completed: No
when it tries to use this connection.
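
The outcome of steps 4 and 5 can be reproduced deterministically with a small single-threaded model. This is a sketch under stated assumptions: StaleConnectionDemo, its Connection type, the cache key and the IllegalStateException are hypothetical stand-ins, not the classes or exceptions used by the real ORB. Only the ordering (mark ABORT, signal waiters, remove from cache) is taken from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the connection cache; not the real com.sun.corba classes.
final class StaleConnectionDemo {
    enum State { OPEN, ABORT }

    static final class Connection {
        State state = State.OPEN;
        // True once the reply has been fully processed but the waiter
        // has not yet been unregistered (the race window).
        boolean responseCompleted;
    }

    private final Map<String, Connection> cache = new HashMap<String, Connection>();

    // Returns the cached connection for the key, creating one if needed.
    Connection getOrCreate(String key) {
        Connection c = cache.get(key);
        if (c == null) {
            c = new Connection();
            cache.put(key, c);
        }
        return c;
    }

    // Unguarded ordering as described in step 4.
    void purgeCalls(String key) {
        Connection c = cache.get(key);
        if (c == null) {
            return;
        }
        c.state = State.ABORT;
        signalExceptionToAllWaiters(c); // throws in the race window
        cache.remove(key);              // skipped when the line above throws
    }

    // Stand-in for the waiting room rejecting an already completed response.
    private static void signalExceptionToAllWaiters(Connection c) {
        if (c.responseCompleted) {
            throw new IllegalStateException("response already completed");
        }
    }

    // Returns true if the next request is handed the ABORTed connection.
    static boolean abortedConnectionIsReused() {
        StaleConnectionDemo orb = new StaleConnectionDemo();
        Connection first = orb.getOrCreate("localhost:1050");
        first.responseCompleted = true;
        try {
            orb.purgeCalls("localhost:1050");
        } catch (IllegalStateException expected) {
            // Nothing handles this inside purgeCalls, so cleanup is incomplete.
        }
        Connection next = orb.getOrCreate("localhost:1050");
        return next == first && next.state == State.ABORT;
    }

    public static void main(String[] args) {
        System.out.println("ABORTed connection reused: " + abortedConnectionIsReused());
    }
}
```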

This raises two issues with the CORBA code base:
1) Marking a connection as ABORT and removing the connection from the cache should be an all-or-nothing process. This can be achieved in SocketOrChannelConnectionImpl by adding a try/finally clause to the purgeCalls method and placing the cache remove code in the finally part (see the end for code details). With this change, the only way the connection can be left in an inconsistent state is if the socket close calls block (they are made right after the connection state is set to ABORT). This should not happen, however, because those calls are asynchronous.
2) Closing the Input/Output objects of the messageMediator in the com.sun.corba.se.impl.protocol.CorbaClientRequestDispatcherImpl.endRequest method needs to be handled atomically with the unregisterWaiter call. Note that with an exception handler in the SocketOrChannelConnectionImpl.purgeCalls method, this is not necessary to fix this exact issue.
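
The try/finally idea from issue 1 can be sketched as follows. This is not the attached patch and not the real purgeCalls body: GuardedPurgeSketch, its Connection type, the cache key and the IllegalStateException are hypothetical, and the socket close calls are left out. The sketch only shows how moving the cache removal into a finally block makes it run even when signalling the waiters throws.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the proposed fix; not the real SocketOrChannelConnectionImpl.
final class GuardedPurgeSketch {
    enum State { OPEN, ABORT }

    static final class Connection {
        State state = State.OPEN;
        boolean responseCompleted;
    }

    private final Map<String, Connection> cache = new HashMap<String, Connection>();

    // Returns the cached connection for the key, creating one if needed.
    Connection getOrCreate(String key) {
        Connection c = cache.get(key);
        if (c == null) {
            c = new Connection();
            cache.put(key, c);
        }
        return c;
    }

    boolean isCached(String key) {
        return cache.containsKey(key);
    }

    // Cache removal sits in finally, so it runs whether or not signalling throws.
    void purgeCalls(String key) {
        Connection c = cache.get(key);
        if (c == null) {
            return;
        }
        c.state = State.ABORT;
        try {
            signalExceptionToAllWaiters(c);
        } finally {
            cache.remove(key);
        }
    }

    // Stand-in for the waiting room rejecting an already completed response.
    private static void signalExceptionToAllWaiters(Connection c) {
        if (c.responseCompleted) {
            throw new IllegalStateException("response already completed");
        }
    }

    // Returns true if the next request gets a new OPEN connection.
    static boolean freshConnectionAfterFailedPurge() {
        GuardedPurgeSketch orb = new GuardedPurgeSketch();
        Connection first = orb.getOrCreate("localhost:1050");
        first.responseCompleted = true;
        try {
            orb.purgeCalls("localhost:1050");
        } catch (IllegalStateException stillPropagates) {
            // finally does not swallow the exception; it only guarantees cleanup.
        }
        Connection next = orb.getOrCreate("localhost:1050");
        return next != first && next.state == State.OPEN;
    }

    public static void main(String[] args) {
        System.out.println("fresh connection after failed purge: "
            + freshConnectionAfterFailedPurge());
    }
}
```

Note that in this sketch the exception still reaches the caller of purgeCalls. Whether the real fix should also catch it is a separate decision that the report does not settle.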

A test case has been constructed that contains a simple 'hello world' CORBA client and server. It can be compiled with the command:
./compile.sh
The client/server can be run 'normally' with the command (Java 1.6.0_24):
./correctRun.sh
This should return a series of 5 'Hello world !!'s. The race condition issue can be raised with the following command (Java 1.6.0_24):
./raceCondition.sh
This should return a series of debug information followed by 4 of the 'COMM_FAILURE' exceptions.



EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The race condition script does the following:
- starts orbd
- starts the CORBA server
- starts the CORBA client in the debugger (note execution does not start)
- connects to the debugger and places a break point in com.sun.corba.se.impl.protocol.CorbaClientRequestDispatcherImpl.unregisterWaiter
- starts the CORBA client code running: it will meet a break point
- the first break point corresponds to the CORBA request method "get": we continue execution
- the second break point corresponds to the CORBA request method "is_a": we continue execution
- the third break point corresponds to the CORBA request method "resolve_str": we continue execution
- the fourth break point corresponds to the CORBA request method "sayHello": since this is the actual request we are interested in, we induce the race condition failure here and do not continue execution right away
- suspend the main thread in the client (i.e. the code currently sitting at the breakpoint) and resume the rest of the threads (we are interested in the thread that manages the connection)
- kill the server process
- this leaves a completed response in the waiting room, which causes an exception to be thrown when the purgeCalls method runs. The connection will be marked as ABORT but not removed from the connection cache.
- clear the break point in the client and resume the main thread
- the remaining attempts will use the ABORT connection and throw the 'COMM_FAILURE' exception
ACTUAL -
The following has been encountered in a production environment (Java 1.6.0_24):
org.omg.CORBA.COMM_FAILURE: vmcid: SUN minor code: 203 completed: No
at com.sun.corba.se.impl.logging.ORBUtilSystemException.writeErrorSend(ORBUtilSystemException.java:2259)
at com.sun.corba.se.impl.logging.ORBUtilSystemException.writeErrorSend(ORBUtilSystemException.java:2281)
at com.sun.corba.se.impl.transport.SocketOrChannelConnectionImpl.writeLock(SocketOrChannelConnectionImpl.java:957)
at com.sun.corba.se.impl.encoding.BufferManagerWriteStream.sendFragment(BufferManagerWriteStream.java:86)
at com.sun.corba.se.impl.encoding.BufferManagerWriteStream.sendMessage(BufferManagerWriteStream.java:104)
at com.sun.corba.se.impl.encoding.CDROutputObject.finishSendingMessage(CDROutputObject.java:144)
at com.sun.corba.se.impl.protocol.CorbaMessageMediatorImpl.finishSendingRequest(CorbaMessageMediatorImpl.java:247)
at com.sun.corba.se.impl.protocol.CorbaClientRequestDispatcherImpl.marshalingComplete1(CorbaClientRequestDispatcherImpl.java:355)
at com.sun.corba.se.impl.protocol.CorbaClientRequestDispatcherImpl.marshalingComplete(CorbaClientRequestDispatcherImpl.java:336)
at com.sun.corba.se.impl.protocol.CorbaClientDelegateImpl.invoke(CorbaClientDelegateImpl.java:129)
at com.sun.corba.se.impl.protocol.CorbaClientDelegateImpl.is_a(CorbaClientDelegateImpl.java:213)
at org.omg.CORBA.portable.ObjectImpl._is_a(ObjectImpl.java:112)
at weblogic.corba.j2ee.naming.Utils.narrowContext(Utils.java:126)
at weblogic.corba.j2ee.naming.InitialContextFactoryImpl.getInitialContext(InitialContextFactoryImpl.java:94)
at weblogic.corba.j2ee.naming.InitialContextFactoryImpl.getInitialContext(InitialContextFactoryImpl.java:31)
at weblogic.jndi.WLInitialContextFactory.getInitialContext(WLInitialContextFactory.java:41)
at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:667)
at javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:288)
at javax.naming.InitialContext.init(InitialContext.java:223)
at javax.naming.InitialContext.<init>(InitialContext.java:197)

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
-- Hello.idl --
module HelloApp
{
  interface Hello
  {
    string sayHello();
    oneway void shutdown();
  };
};

-- HelloClient.java --
import HelloApp.*;
import org.omg.CosNaming.*;
import org.omg.CosNaming.NamingContextPackage.*;
import org.omg.CORBA.*;

public class HelloClient
{
    static Hello helloImpl;

    public static void main(String args[])
    {
	try{
	    // create and initialize the ORB
	    ORB orb = ORB.init(args, null);

	    // get the root naming context
	    org.omg.CORBA.Object objRef =
		orb.resolve_initial_references("NameService");
	    // Use NamingContextExt instead of NamingContext. This is
	    // part of the Interoperable naming Service.
	    NamingContextExt ncRef = NamingContextExtHelper.narrow(objRef);
 
	    // resolve the Object Reference in Naming
	    String name = "Hello";
	    helloImpl = HelloHelper.narrow(ncRef.resolve_str(name));

	    System.out.println("Obtained a handle on server object: " + helloImpl);
	    for (int i = 0; i < 5; i++) {
		try {
		    System.out.println(helloImpl.sayHello());
		} catch (Exception e) {
		    System.out.println("Exception: " + e.getMessage());
		    e.printStackTrace();
		}
		Thread.sleep(5000);
	    }

	} catch (Exception e) {
	    System.out.println("ERROR : " + e) ;
	    e.printStackTrace(System.out);
	}
    }

}

-- HelloServer.java --
import HelloApp.*;
import org.omg.CosNaming.*;
import org.omg.CosNaming.NamingContextPackage.*;
import org.omg.CORBA.*;
import org.omg.PortableServer.*;
import org.omg.PortableServer.POA;

import java.util.Properties;

class HelloImpl extends HelloPOA {
    private ORB orb;

    public void setORB(ORB orb_val) {
	orb = orb_val;
    }
    
    // implement sayHello() method
    public String sayHello() {
	return "\nHello world !!\n";
    }
    
    // implement shutdown() method
    public void shutdown() {
	orb.shutdown(false);
    }
}


public class HelloServer {

    public static void main(String args[]) {
	try{
	    // create and initialize the ORB
	    ORB orb = ORB.init(args, null);

	    // get reference to rootpoa & activate the POAManager
	    POA rootpoa = POAHelper.narrow(orb.resolve_initial_references("RootPOA"));
	    rootpoa.the_POAManager().activate();

	    // create servant and register it with the ORB
	    HelloImpl helloImpl = new HelloImpl();
	    helloImpl.setORB(orb);

	    // get object reference from the servant
	    org.omg.CORBA.Object ref = rootpoa.servant_to_reference(helloImpl);
	    Hello href = HelloHelper.narrow(ref);
	      
	    // get the root naming context
	    org.omg.CORBA.Object objRef =
		orb.resolve_initial_references("NameService");
	    // Use NamingContextExt which is part of the Interoperable
	    // Naming Service (INS) specification.
	    NamingContextExt ncRef = NamingContextExtHelper.narrow(objRef);

	    // bind the Object Reference in Naming
	    String name = "Hello";
	    NameComponent path[] = ncRef.to_name( name );
	    ncRef.rebind(path, href);

	    System.out.println("HelloServer ready and waiting ...");

	    // wait for invocations from clients
	    while (true) {
		orb.run();
	    }
	} catch (Exception e) {
	    System.out.println("ERROR: " + e);
	    e.printStackTrace(System.out);
	}
	  
	System.out.println("HelloServer Exiting ...");
	
    }
}

-- compile.sh --
#!/bin/bash
idlj -fall Hello.idl
javac *.java HelloApp/*.java

-- correctRun.sh --
#!/bin/bash
echo "starting orbd"
orbd -ORBInitialPort 1050 -ORBInitialHost localhost &
ORB_PROC=$!
sleep 5
echo "started orb"
echo "starting server"
java HelloServer -ORBInitialPort 1050 -ORBInitialHost localhost &
SERVER_PROC=$!
sleep 5
echo "started server"
echo "starting client"
java HelloClient -ORBInitialPort 1050 -ORBInitialHost localhost
kill -9 $SERVER_PROC
kill -9 $ORB_PROC
echo "finished!"

-- raceCondition.sh --
#!/bin/bash
echo "starting orbd"
orbd -ORBInitialPort 1050 -ORBInitialHost localhost &
ORB_PROC=$!
sleep 5 #give orbd time to start
echo "started orb"
echo "starting server"
java HelloServer -ORBInitialPort 1050 -ORBInitialHost localhost &
SERVER_PROC=$!
sleep 5 #give server time to start
echo "started server"
echo "starting client (debug mode)"
java -Xdebug -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=y HelloClient -ORBInitialPort 1050 -ORBInitialHost localhost &
JVM_PROC=$!
sleep 5 #give jvm/debugger/client time to start
echo "started client (debug mode)"
echo "starting debugger and issuing commands"
(sleep 5;
echo "stop in com.sun.corba.se.impl.protocol.CorbaClientRequestDispatcherImpl.unregisterWaiter";
sleep 5;
echo "run";
sleep 5;
echo "cont";
sleep 5;
echo "cont";
sleep 5;
echo "cont";
sleep 5;
echo "suspend 1";
sleep 5;
kill -9 $SERVER_PROC &> /dev/null;
sleep 5;
echo "cont";
sleep 5;
echo "thread 1"
sleep 5;
echo "clear com.sun.corba.se.impl.protocol.CorbaClientRequestDispatcherImpl.unregisterWaiter"
sleep 5;
echo "resume 1";
)| jdb -attach 8000

kill -9 $ORB_PROC

---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
-- Possible fix --

A try/finally clause was added to the purgeCalls method in com.sun.corba.se.impl.transport.SocketOrChannelConnectionImpl.java at line 1495 (using the latest OpenJDK version), with the finally part containing the cache remove code. The changed code has been attached. Note that this handles all exceptions, but it will still leave a connection in an inconsistent state if the socket blocks on close. This should not happen, however, because the close calls are all asynchronous.

Comments
verified with com/sun/corba/7056731/7056731.sh on stt-28.ru, jdk6u38b03
20-10-2012