JDK-6390007 : cacao crashed on both clusters while geo cluster was starting on the nodes
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 1.1
  • Priority: P3
  • Status: Closed
  • Resolution: Cannot Reproduce
  • OS: solaris_10
  • CPU: generic
  • Submitted: 2006-02-24
  • Updated: 2010-08-18
  • Resolved: 2006-03-29
Description
Running:

  - s10 + patches
  - sc31u4 + 120500-04 & 120489-01
  - cacao 1.1 + 120675-01
  - odyssey R2 2/21/06 nightly

Problem:

I was starting geo cluster on 2 clusters that have a partnership defined between them, and odyssey failed to start because cacao went down.  Console messages from each node are below; the corresponding cacao logs are attached.  Note that failover of the odyssey infrastructure to the backup node succeeded.  Some time later I switched the geo-infrastructure rg back to the nodes where the failure occurred, and odyssey started up fine at that time.

***On phys-sabre-1 (1st cluster) -

# geoadm start
... checking for management agent ...
... management agent check done ....
... starting product infrastructure ... please wait ...
#
[thread 144 also had an error]
# An unexpected error has been detected by HotSpot Virtual Machine:
#
#  SIGBUS (0xa) at pc=0xf03d8428, pid=27182, tid=45
#
# Java VM: Java HotSpot(TM) Server VM (1.5.0_06-b05 mixed mode)
# Problematic frame:
# C  [libscrgadm.so.1+0x8428]
#
# An error report file with more information is saved as hs_err_pid27182.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#
Feb 23 15:41:06 phys-sabre-1 cacao[27180]: SUNWcacao launcher : cacao exited abnormaly

Feb 23 15:41:06 phys-sabre-1 cacao[27180]: SUNWcacao launcher : no retries available, stop monitoring of cacao

Feb 23, 2006 3:41:06 PM GenericConenctor RequestHandler-connectionException
WARNING: java.io.EOFException
java.io.EOFException
        at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2502)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1267)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:339)
        at com.sun.jmx.remote.socket.SocketConnection.readMessage(SocketConnection.java:211)
        at com.sun.jmx.remote.generic.ClientSynchroMessageConnectionImpl$MessageReader.run(ClientSynchroMessageConnectionImpl.java:391)
        at com.sun.jmx.remote.opt.util.ThreadService$ThreadServiceJob.run(ThreadService.java:208)
        at com.sun.jmx.remote.opt.util.JobExecutor.run(JobExecutor.java:59)
Feb 23, 2006 3:41:06 PM ClientCommunicatorAdmin restart
WARNING: Failed to restart: java.net.ConnectException: Connection refused
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
        at $Proxy0.startFailoverGroup(Unknown Source)
        at ServiceControl.main(ServiceControl.java:96)
Caused by: javax.management.remote.generic.ConnectionClosedException: The connection has been closed by the server.
        at com.sun.jmx.remote.generic.ClientSynchroMessageConnectionImpl.close(ClientSynchroMessageConnectionImpl.java:338)
        at javax.management.remote.generic.GenericConnector.close(GenericConnector.java:276)
        at javax.management.remote.generic.GenericConnector.close(GenericConnector.java:231)
        at javax.management.remote.generic.ClientIntermediary$GenericClientCommunicatorAdmin.doStop(ClientIntermediary.java:839)
        at com.sun.jmx.remote.opt.internal.ClientCommunicatorAdmin.restart(ClientCommunicatorAdmin.java:133)
        at com.sun.jmx.remote.opt.internal.ClientCommunicatorAdmin.gotIOException(ClientCommunicatorAdmin.java:34)
        at javax.management.remote.generic.GenericConnector$RequestHandler.connectionException(GenericConnector.java:667)
        at com.sun.jmx.remote.generic.ClientSynchroMessageConnectionImpl$MessageReader.run(ClientSynchroMessageConnectionImpl.java:398)
        at com.sun.jmx.remote.opt.util.ThreadService$ThreadServiceJob.run(ThreadService.java:208)
        at com.sun.jmx.remote.opt.util.JobExecutor.run(JobExecutor.java:59)
Feb 23, 2006 3:41:08 PM ClientIntermediary close
INFO: java.io.IOException: The connection is not currently established.
java.io.IOException: The connection is not currently established.
        at com.sun.jmx.remote.generic.ClientSynchroMessageConnectionImpl.checkState(ClientSynchroMessageConnectionImpl.java:567)
        at com.sun.jmx.remote.generic.ClientSynchroMessageConnectionImpl.sendOneWay(ClientSynchroMessageConnectionImpl.java:161)
        at javax.management.remote.generic.GenericConnector.close(GenericConnector.java:260)
        at javax.management.remote.generic.GenericConnector.close(GenericConnector.java:231)
        at javax.management.remote.generic.ClientIntermediary$GenericClientCommunicatorAdmin.doStop(ClientIntermediary.java:839)
        at com.sun.jmx.remote.opt.internal.ClientCommunicatorAdmin.restart(ClientCommunicatorAdmin.java:133)
        at com.sun.jmx.remote.opt.internal.ClientCommunicatorAdmin.gotIOException(ClientCommunicatorAdmin.java:34)
        at javax.management.remote.generic.GenericConnector$RequestHandler.connectionException(GenericConnector.java:667)
        at com.sun.jmx.remote.generic.ClientSynchroMessageConnectionImpl$MessageReader.run(ClientSynchroMessageConnectionImpl.java:398)
        at com.sun.jmx.remote.opt.util.ThreadService$ThreadServiceJob.run(ThreadService.java:208)
        at com.sun.jmx.remote.opt.util.JobExecutor.run(JobExecutor.java:59)
Feb 23 15:41:08 phys-sabre-1 SC[SUNW.scmasa,geo-infrastructure,geo-failovercontrol,scmasa_svc_start]: Failed to start /usr/cluster/lib/rgm/rt/hamasa/cmas_service_ctrl_start geo-infrastructure.

Feb 23 15:41:08 phys-sabre-1 Cluster.RGM.rgmd: Method <scmasa_svc_start> failed on resource <geo-failovercontrol> in resource group <geo-infrastructure> [exit code <1>, time used: 6% of timeout <600 seconds>] 

Feb 23, 2006 3:41:10 PM ServiceControl main
WARNING: Unable to connect to the CACAO agent. The agent may be down or restarting
Feb 23 15:41:19 phys-sabre-1 ip: TCP_IOC_ABORT_CONN: local = 010.006.173.091:0, remote = 000.000.000.000:0, start = -2, end = 6

Feb 23 15:41:19 phys-sabre-1 ip: TCP_IOC_ABORT_CONN: aborted 0 connection 

Registering resource type <SUNW.HBmonitor>...done.
Resource type <SUNW.scmasa> has been registered already
Creating failover resource group <geo-clusterstate>...done.
Creating failover resource group <geo-infrastructure>...done.
Creating logical host resource <geo-clustername>...
Logical host resource created successfully ....
Creating resource <geo-hbmonitor> ...done.
Creating resource <geo-failovercontrol> ...done.
Bringing RG <geo-infrastructure> to managed state ...done.
Enabling resource <geo-clustername> ...done.
Enabling resource <geo-hbmonitor> ...done.
Enabling resource <geo-failovercontrol> ...done.
Node phys-sabre-1: Bringing resource group <geo-infrastructure> online ...scswitch: Resource group geo-infrastructure failed to start on chosen node and may fail over to other node(s)
FAILED: scswitch -z -g geo-infrastructure -h phys-sabre-1

# 

***On phys-sabre-3 (2nd cluster) -

# geoadm start
... checking for management agent ...
... management agent check done ....
... starting product infrastructure ... please wait ...
[thread 45 also had an error]#
# An unexpected error has been detected by HotSpot Virtual Machine:
#
#  SIGBUS (0xa) at pc=0xf04a8298, pid=17957, tid=157
#
# Java VM: Java HotSpot(TM) Server VM (1.5.0_06-b05 mixed mode)
# Problematic frame:
# C  [libscrgadm.so.1+0x8298]
#

# An error report file with more information is saved as hs_err_pid17957.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#
Feb 23 15:45:50 phys-sabre-3 cacao[17955]: SUNWcacao launcher : cacao exited abnormaly

Feb 23 15:45:50 phys-sabre-3 cacao[17955]: SUNWcacao launcher : no retries available, stop monitoring of cacao

Feb 23, 2006 3:45:50 PM GenericConenctor RequestHandler-connectionException
WARNING: java.io.EOFException
java.io.EOFException
        at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2502)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1267)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:339)
        at com.sun.jmx.remote.socket.SocketConnection.readMessage(SocketConnection.java:211)
        at com.sun.jmx.remote.generic.ClientSynchroMessageConnectionImpl$MessageReader.run(ClientSynchroMessageConnectionImpl.java:391)
        at com.sun.jmx.remote.opt.util.ThreadService$ThreadServiceJob.run(ThreadService.java:208)
        at com.sun.jmx.remote.opt.util.JobExecutor.run(JobExecutor.java:59)
Feb 23, 2006 3:45:50 PM ClientCommunicatorAdmin restart
WARNING: Failed to restart: java.net.ConnectException: Connection refused
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
        at $Proxy0.startFailoverGroup(Unknown Source)
        at ServiceControl.main(ServiceControl.java:96)
Caused by: javax.management.remote.generic.ConnectionClosedException: The connection has been closed by the server.
        at com.sun.jmx.remote.generic.ClientSynchroMessageConnectionImpl.close(ClientSynchroMessageConnectionImpl.java:338)
        at javax.management.remote.generic.GenericConnector.close(GenericConnector.java:276)
        at javax.management.remote.generic.GenericConnector.close(GenericConnector.java:231)
        at javax.management.remote.generic.ClientIntermediary$GenericClientCommunicatorAdmin.doStop(ClientIntermediary.java:839)
        at com.sun.jmx.remote.opt.internal.ClientCommunicatorAdmin.restart(ClientCommunicatorAdmin.java:133)
        at com.sun.jmx.remote.opt.internal.ClientCommunicatorAdmin.gotIOException(ClientCommunicatorAdmin.java:34)
        at javax.management.remote.generic.GenericConnector$RequestHandler.connectionException(GenericConnector.java:667)
        at com.sun.jmx.remote.generic.ClientSynchroMessageConnectionImpl$MessageReader.run(ClientSynchroMessageConnectionImpl.java:398)
        at com.sun.jmx.remote.opt.util.ThreadService$ThreadServiceJob.run(ThreadService.java:208)
        at com.sun.jmx.remote.opt.util.JobExecutor.run(JobExecutor.java:59)
Feb 23, 2006 3:45:52 PM ClientIntermediary close
INFO: java.io.IOException: The connection is not currently established.
java.io.IOException: The connection is not currently established.
        at com.sun.jmx.remote.generic.ClientSynchroMessageConnectionImpl.checkState(ClientSynchroMessageConnectionImpl.java:567)
        at com.sun.jmx.remote.generic.ClientSynchroMessageConnectionImpl.sendOneWay(ClientSynchroMessageConnectionImpl.java:161)
        at javax.management.remote.generic.GenericConnector.close(GenericConnector.java:260)
        at javax.management.remote.generic.GenericConnector.close(GenericConnector.java:231)
        at javax.management.remote.generic.ClientIntermediary$GenericClientCommunicatorAdmin.doStop(ClientIntermediary.java:839)
        at com.sun.jmx.remote.opt.internal.ClientCommunicatorAdmin.restart(ClientCommunicatorAdmin.java:133)
        at com.sun.jmx.remote.opt.internal.ClientCommunicatorAdmin.gotIOException(ClientCommunicatorAdmin.java:34)
        at javax.management.remote.generic.GenericConnector$RequestHandler.connectionException(GenericConnector.java:667)
        at com.sun.jmx.remote.generic.ClientSynchroMessageConnectionImpl$MessageReader.run(ClientSynchroMessageConnectionImpl.java:398)
        at com.sun.jmx.remote.opt.util.ThreadService$ThreadServiceJob.run(ThreadService.java:208)
        at com.sun.jmx.remote.opt.util.JobExecutor.run(JobExecutor.java:59)
Feb 23 15:45:52 phys-sabre-3 SC[SUNW.scmasa,geo-infrastructure,geo-failovercontrol,scmasa_svc_start]: Failed to start /usr/cluster/lib/rgm/rt/hamasa/cmas_service_ctrl_start geo-infrastructure.

Feb 23 15:45:52 phys-sabre-3 Cluster.RGM.rgmd: Method <scmasa_svc_start> failed on resource <geo-failovercontrol> in resource group <geo-infrastructure> [exit code <1>, time used: 15% of timeout <600 seconds>] 

Feb 23, 2006 3:45:54 PM ServiceControl main
WARNING: Unable to connect to the CACAO agent. The agent may be down or restarting
Feb 23 15:46:03 phys-sabre-3 ip: TCP_IOC_ABORT_CONN: local = 010.006.173.096:0, remote = 000.000.000.000:0, start = -2, end = 6

Feb 23 15:46:03 phys-sabre-3 ip: TCP_IOC_ABORT_CONN: aborted 0 connection 

Registering resource type <SUNW.HBmonitor>...done.
Resource type <SUNW.scmasa> has been registered already
Creating failover resource group <geo-clusterstate>...done.
Creating failover resource group <geo-infrastructure>...done.
Creating logical host resource <geo-clustername>...
Logical host resource created successfully ....
Creating resource <geo-hbmonitor> ...done.
Creating resource <geo-failovercontrol> ...done.
Bringing RG <geo-infrastructure> to managed state ...done.
Enabling resource <geo-clustername> ...done.
Enabling resource <geo-hbmonitor> ...done.
Enabling resource <geo-failovercontrol> ...done.
Node phys-sabre-3: Bringing resource group <geo-infrastructure> online ...scswitch: Resource group geo-infrastructure failed to start on chosen node and may fail over to other node(s)
FAILED: scswitch -z -g geo-infrastructure -h phys-sabre-3

#

Comments
EVALUATION Per the bug submitter, they can no longer reproduce the problem and no longer have the log files. If the problem arises again, please determine who owns the library in which the crash occurred. Closing as not reproducible.
29-03-2006

EVALUATION
# An unexpected error has been detected by HotSpot Virtual Machine:
#
#  SIGBUS (0xa) at pc=0xf03d8428, pid=27182, tid=45
#
# Java VM: Java HotSpot(TM) Server VM (1.5.0_06-b05 mixed mode)
# Problematic frame:
# C  [libscrgadm.so.1+0x8428]

This error message seems to indicate that the crash occurred in the scrgadm library (libscrgadm.so.1). If the responsible engineering team verifies this, this is not a cacao bug.
24-02-2006
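
The reasoning in the evaluation above (the problematic frame is native code in libscrgadm.so.1, not Java or VM code) can be checked mechanically from the crash banner. The following is a minimal sketch, not part of the original report: the function name `summarize_crash` and the dictionary keys are hypothetical, and the load-base calculation assumes that the offset printed in the bracketed frame is relative to the library's load address.

```python
import re

# Frame-type letters as listed in HotSpot error reports.
FRAME_KINDS = {
    "C": "native code",
    "j": "interpreted Java code",
    "J": "compiled Java code",
    "V": "VM code",
    "v": "VM-generated stub",
}

# "pc=0xf03d8428" on the signal line of the banner.
PC_RE = re.compile(r"pc=(0x[0-9a-fA-F]+)")
# "# C  [libscrgadm.so.1+0x8428]" under "Problematic frame:".
FRAME_RE = re.compile(r"^#\s*([CjJVv])\s+\[(.+)\+(0x[0-9a-fA-F]+)\]")


def summarize_crash(banner):
    """Return the frame kind, library, offset and load base, or None if no frame line is found."""
    pc = None
    for raw in banner.splitlines():
        line = raw.strip()
        pc_match = PC_RE.search(line)
        if pc_match and pc is None:
            pc = int(pc_match.group(1), 16)
        frame_match = FRAME_RE.match(line)
        if frame_match:
            offset = int(frame_match.group(3), 16)
            return {
                "kind": FRAME_KINDS[frame_match.group(1)],
                "library": frame_match.group(2),
                "offset": offset,
                # Assumption: pc minus the in-library offset gives the load address.
                "load_base": None if pc is None else pc - offset,
            }
    return None


if __name__ == "__main__":
    # Banner copied from the phys-sabre-1 console output in this report.
    banner = """\
# An unexpected error has been detected by HotSpot Virtual Machine:
#
#  SIGBUS (0xa) at pc=0xf03d8428, pid=27182, tid=45
#
# Java VM: Java HotSpot(TM) Server VM (1.5.0_06-b05 mixed mode)
# Problematic frame:
# C  [libscrgadm.so.1+0x8428]
"""
    info = summarize_crash(banner)
    print(info["kind"], info["library"], hex(info["offset"]), hex(info["load_base"]))
```

Run against the two banners in this report, such a check would show both crashes landing in native code in the same library at different offsets (0x8428 on phys-sabre-1, 0x8298 on phys-sabre-3), which is consistent with routing the bug to the owner of that library rather than to cacao or HotSpot.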