JDK-7059899 : Stack overflows in Java code cause 64-bit JVMs to exit due to SIGSEGV
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 6u26
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: linux_redhat_5.0
  • CPU: x86
  • Submitted: 2011-06-27
  • Updated: 2014-07-11
  • Resolved: 2012-03-24
Fix versions:
  • JDK 6: 6u32 (Fixed)
  • JDK 7: 7u4 (Fixed)
  • JDK 8: 8 (Fixed)
  • Other: hs20.7 (Fixed)
Description
FULL PRODUCT VERSION :
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

FULL OS VERSION :
Linux 2.6.18-194.17.4.el5 #1 SMP Wed Oct 20 13:03:08 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

EXTRA RELEVANT SYSTEM CONFIGURATION :
Java arguments/parameters: -d64 -server -Xms1024m -Xmx1024m -Xss320k -XX:MaxPermSize=384m

A DESCRIPTION OF THE PROBLEM :
As our clients have migrated to 64-bit JVMs, we have seen a significant increase in JVM crashes due to SIGSEGV on Unix platforms. The crashes are spontaneous and do not produce a HotSpot crash report. Each crash involved an application bug that caused deep recursion and should have resulted in a java.lang.StackOverflowError, for instance an infinite Struts forward or a search continuance/referral loop during LDAP authentication. This appears to affect both Solaris and Linux. We have only investigated further on Linux, but in both cases a 64-bit JVM was the common factor.

We referred to the Java SE Troubleshooting Guide, section '4.1.3 Crash due to Stack Overflow', and found that the StackShadowPages value is likely too small for these platforms. The guide discusses custom JNI libraries; however, we are seeing these crashes in 'normal' native calls, usually socket reads or writes. That led us to investigate further, as this should not be the case. According to the OpenJDK source, the default on x86 platforms is 3, doubled to 6 on AMD64. There is also a Solaris-specific AMD64 value (20), seemingly to accommodate C++ compiler bugs on that platform; however, our experience suggests that this value may be more broadly applicable:

http://hg.openjdk.java.net/jdk6/jdk6/hotspot/file/9b013e207574/src/cpu/x86/vm/globals_x86.hpp

       #ifdef AMD64
       // Very large C++ stack frames using solaris-amd64 optimized builds
       // due to lack of optimization caused by C++ compiler bugs
       define_pd_global(intx, StackShadowPages, SOLARIS_ONLY(20) NOT_SOLARIS(6) DEBUG_ONLY(+2));
       #else
       define_pd_global(intx, StackShadowPages, 3 DEBUG_ONLY(+5));
       #endif // AMD64

Lab testing indicates that 17 is the smallest StackShadowPages value that prevents the JVM from crashing with a segmentation fault. We have not confirmed the value on Solaris (UltraSPARC; we do not support our product on Solaris x86), but we have certainly seen these conditions affect both platforms. We can only conclude that either 64-bit native stack frames on AMD64 are generally far larger than their 32-bit equivalents, or there is a problem with the way the shadow zone size is calculated (I believe it is OS page size * StackShadowPages), allowing previously benign stack overflows in Java code to crash the JVM.
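To put the numbers above in perspective, the following is a rough sketch of the arithmetic only, taking the reporter's belief (OS page size * StackShadowPages) at face value. The 4 KiB page size is an assumption, not something stated in this report, and the class and method names are hypothetical.

```java
// Sketch: estimated shadow zone size, assuming size = page size * StackShadowPages.
final class ShadowZoneEstimate {
    // Assumed page size in bytes; the report does not state the actual value.
    static final long ASSUMED_PAGE_SIZE_BYTES = 4096;

    // Returns the estimated shadow zone size in bytes for a given page count.
    static long shadowZoneBytes(long pages, long pageSizeBytes) {
        return pages * pageSizeBytes;
    }

    public static void main(String[] args) {
        long threadStackBytes = 320L * 1024; // -Xss320k from the reported configuration
        long[] candidates = {3, 6, 17, 20};  // x86 default, AMD64 default, lab minimum, workaround
        for (long pages : candidates) {
            long zone = shadowZoneBytes(pages, ASSUMED_PAGE_SIZE_BYTES);
            double share = 100.0 * zone / threadStackBytes;
            System.out.printf("StackShadowPages=%d -> %d KiB (%.1f%% of a 320k stack)%n",
                    pages, zone / 1024, share);
        }
    }
}
```

Under these assumptions, raising the value from 6 to 20 reserves a noticeably larger share of a small thread stack, so the depth at which a StackOverflowError is thrown would be expected to drop accordingly.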

Others have also encountered 64-bit-specific SIGSEGVs, and bug 6346701 seems to report exactly this kind of condition. However, I could not find any discussion indicating that there could be a problem with the default shipping value, or with the calculation of the number of pages to look ahead before invoking native methods:

http://confluence.atlassian.com/display/GHKB/JIRA+with+GreenHopper+Crashes+Java+with+a+SIGSEGV+Fault+on+Linux+64bit+JVMs
http://fusesource.com/forums/thread.jspa?messageID=7830

We isolated the problem using core dumps, identified the offending threads with gdb, and then used jstack to generate thread dumps:
 
-- Crash 1 - Two application methods that call each other recursively, executing database statements, causing overflow during Oracle thin driver socket read:

Thread 18073: (state = IN_NATIVE)
- java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Compiled frame; information may be imprecise)
- java.net.SocketInputStream.read(byte[], int, int) @bci=84, line=129 (Compiled frame)
- oracle.net.ns.Packet.receive() @bci=31, line=240 (Compiled frame)
- oracle.net.ns.DataPacket.receive() @bci=1, line=92 (Compiled frame)
- oracle.net.ns.NetInputStream.getNextPacket() @bci=48, line=172 (Compiled frame)
- oracle.net.ns.NetInputStream.read(byte[], int, int) @bci=33, line=117 (Compiled frame)
- oracle.net.ns.NetInputStream.read(byte[]) @bci=5, line=92 (Compiled frame)
- oracle.jdbc.driver.T4CMAREngine.buffer2Value(byte) @bci=325, line=2320 (Compiled frame)
- oracle.jdbc.driver.T4CMAREngine.unmarshalUB4() @bci=2, line=1200 (Compiled frame)
- oracle.jdbc.driver.T4CTTIoer.unmarshal() @bci=200, line=270 (Compiled frame)
- oracle.jdbc.driver.T4C8Oall.receive() @bci=1507, line=1015 (Compiled frame)
- oracle.jdbc.driver.T4CPreparedStatement.doOall8(boolean, boolean, boolean, boolean) @bci=655, line=194 (Compiled frame)
- oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe() @bci=39, line=791 (Compiled frame)
- oracle.jdbc.driver.T4CPreparedStatement.executeMaybeDescribe() @bci=104, line=866 (Compiled frame)
- oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout() @bci=139, line=1186 (Compiled frame)
- oracle.jdbc.driver.OraclePreparedStatement.executeInternal() @bci=98, line=3387 (Compiled frame)
- oracle.jdbc.driver.OraclePreparedStatement.executeQuery() @bci=13, line=3431 (Compiled frame)
- oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery() @bci=4, line=1491 (Compiled frame)
- org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery() @bci=9, line=93 (Compiled frame)
- org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery() @bci=9, line=93 (Compiled frame)
...

-- Crash 2 - Infinite LDAP search referral/continuance due to incorrectly configured Active Directory server, overflowing during socket write:

Thread 11962: (state = IN_NATIVE)
 - java.net.SocketOutputStream.socketWrite0(java.io.FileDescriptor, byte[], int, int) @bci=0 (Interpreted frame)
 - java.net.SocketOutputStream.socketWrite(byte[], int, int) @bci=44, line=92 (Interpreted frame)
 - java.net.SocketOutputStream.write(byte[], int, int) @bci=4, line=136 (Interpreted frame)
 - java.io.BufferedOutputStream.flushBuffer() @bci=20, line=65 (Interpreted frame)
 - java.io.BufferedOutputStream.flush() @bci=1, line=123 (Interpreted frame)
 - com.sun.jndi.ldap.Connection.writeRequest(com.sun.jndi.ldap.BerEncoder, int, boolean) @bci=73, line=396 (Interpreted frame)
 - com.sun.jndi.ldap.LdapClient.ldapBind(java.lang.String, byte[], javax.naming.ldap.Control[], java.lang.String, boolean) @bci=196, line=334 (Interpreted frame)
 - com.sun.jndi.ldap.LdapClient.authenticate(boolean, java.lang.String, java.lang.Object, int, java.lang.String, javax.naming.ldap.Control[], java.util.Hashtable) @bci=315, line=192 (Interpreted frame)
 - com.sun.jndi.ldap.LdapCtx.connect(boolean) @bci=316, line=2694 (Interpreted frame)
 - com.sun.jndi.ldap.LdapCtx.<init>(java.lang.String, java.lang.String, int, java.util.Hashtable, boolean) @bci=390, line=293 (Interpreted frame)
 - com.sun.jndi.ldap.LdapCtxFactory.getUsingURL(java.lang.String, java.util.Hashtable) @bci=227, line=175 (Interpreted frame)
 - com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxInstance(java.lang.Object, java.util.Hashtable) @bci=12, line=134 (Interpreted frame)
 - com.sun.jndi.url.ldap.ldapURLContextFactory.getObjectInstance(java.lang.Object, javax.naming.Name, javax.naming.Context, java.util.Hashtable) @bci=17, line=35 (Interpreted frame)
 - javax.naming.spi.NamingManager.getURLObject(java.lang.String, java.lang.Object, javax.naming.Name, javax.naming.Context, java.util.Hashtable) @bci=62, line=584 (Interpreted frame)
 - javax.naming.spi.NamingManager.processURL(java.lang.Object, javax.naming.Name, javax.naming.Context, java.util.Hashtable) @bci=31, line=364 (Interpreted frame)
 - javax.naming.spi.NamingManager.processURLAddrs(javax.naming.Reference, javax.naming.Name, javax.naming.Context, java.util.Hashtable) @bci=56, line=344 (Interpreted frame)
 - javax.naming.spi.NamingManager.getObjectInstance(java.lang.Object, javax.naming.Name, javax.naming.Context, java.util.Hashtable) @bci=124, line=316 (Interpreted frame)
 - com.sun.jndi.ldap.LdapReferralContext.<init>(com.sun.jndi.ldap.LdapReferralException, java.util.Hashtable, javax.naming.ldap.Control[], javax.naming.ldap.Control[], java.lang.String, boolean, int) @bci=212, line=93 (Interpreted frame)
 - com.sun.jndi.ldap.LdapReferralException.getReferralContext(java.util.Hashtable, javax.naming.ldap.Control[]) @bci=38, line=132 (Interpreted frame)
 - com.sun.jndi.ldap.LdapCtx.searchAux(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, boolean, boolean, com.sun.jndi.toolkit.ctx.Continuation) @bci=269, line=1838 (Interpreted frame)
 - com.sun.jndi.ldap.LdapCtx.c_search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, com.sun.jndi.toolkit.ctx.Continuation) @bci=14, line=1749 (Interpreted frame)
 - com.sun.jndi.toolkit.ctx.ComponentDirContext.p_search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, com.sun.jndi.toolkit.ctx.Continuation) @bci=72, line=368 (Interpreted frame)
 - com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls) @bci=32, line=338 (Interpreted frame)
 - com.sun.jndi.ldap.LdapReferralContext.search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls) @bci=44, line=639 (Interpreted frame)
 - com.sun.jndi.ldap.LdapCtx.searchAux(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, boolean, boolean, com.sun.jndi.toolkit.ctx.Continuation) @bci=282, line=1844 (Interpreted frame)
 - com.sun.jndi.ldap.LdapCtx.c_search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, com.sun.jndi.toolkit.ctx.Continuation) @bci=14, line=1749 (Interpreted frame)
 - com.sun.jndi.toolkit.ctx.ComponentDirContext.p_search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, com.sun.jndi.toolkit.ctx.Continuation) @bci=72, line=368 (Interpreted frame)
 - com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls) @bci=32, line
...

THE PROBLEM WAS REPRODUCIBLE WITH -Xint FLAG: Did not try

THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Yes

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
We do not have an easily reducible use case to reproduce this problem. All issues involve our entire application stack.
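For orientation only, the sketch below shows the behaviour that is expected from runaway recursion: the thread receives a java.lang.StackOverflowError that Java code can catch. It is not a reproducer for this bug, since it performs no socket I/O or other native calls at depth, and the class and method names are hypothetical. The stack size passed to the Thread constructor is only a hint that some platforms may ignore.

```java
// Sketch: unbounded recursion on a thread with a small stack, expected to end
// in a catchable StackOverflowError rather than a process exit.
final class RecursionSketch {
    // Recurses without a base case, counting frames in counter[0].
    private static void recurse(int[] counter) {
        counter[0]++;
        recurse(counter);
    }

    // Returns the recursion depth reached when StackOverflowError was caught,
    // or -1 if the error was never caught.
    static int depthAtOverflow(long stackSizeBytes) throws InterruptedException {
        final int[] counter = {0};
        final int[] result = {-1};
        Runnable body = () -> {
            try {
                recurse(counter);
            } catch (StackOverflowError expected) {
                result[0] = counter[0];
            }
        };
        // The last argument is the requested stack size in bytes (a hint to the JVM).
        Thread worker = new Thread(null, body, "recursion-sketch", stackSizeBytes);
        worker.start();
        worker.join();
        return result[0];
    }

    public static void main(String[] args) throws InterruptedException {
        int depth = depthAtOverflow(320L * 1024); // mirrors -Xss320k from the report
        System.out.println("StackOverflowError caught at depth " + depth);
    }
}
```

The crashes described in this report differ from this sketch in that the overflow occurred while the thread was inside a native method such as socketRead0 or socketWrite0.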

ERROR MESSAGES/STACK TRACES THAT OCCUR :
None. JVM process exits due to signal 11 (SIGSEGV).

REPRODUCIBILITY :
This bug can be reproduced always.

CUSTOMER SUBMITTED WORKAROUND :
Set -XX:StackShadowPages=20
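One way to confirm that the workaround flag took effect in a running JVM is to read it back through the HotSpot diagnostic MXBean. This is a minimal sketch that assumes a HotSpot-based JVM where com.sun.management.HotSpotDiagnosticMXBean is available; the class name ShadowPagesCheck is hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Sketch: read the effective StackShadowPages value from the running JVM.
final class ShadowPagesCheck {
    // Returns the flag value as a string, or null if it cannot be read
    // (for example on a JVM that does not expose this option).
    static String stackShadowPages() {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption("StackShadowPages").getValue();
        } catch (RuntimeException unavailable) {
            return null;
        }
    }

    public static void main(String[] args) {
        String value = stackShadowPages();
        System.out.println("StackShadowPages = " + (value == null ? "unavailable" : value));
    }
}
```

Started with -XX:StackShadowPages=20, the program would be expected to print that value; without the flag it prints the platform default.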

Comments
EVALUATION http://hg.openjdk.java.net/lambda/lambda/hotspot/rev/849412a95e45
22-03-2012

EVALUATION http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/849412a95e45
18-02-2012

EVALUATION http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot/rev/849412a95e45
18-02-2012

EVALUATION http://hg.openjdk.java.net/hsx/hotspot-main/hotspot/rev/849412a95e45
17-02-2012

EVALUATION http://hg.openjdk.java.net/hsx/hotspot-rt/hotspot/rev/849412a95e45
14-02-2012

EVALUATION The stack overflows are caused by the change that increased the stack in socketWrite (see CRs 7079763 and 6807602). The default StackShadowPages value will be increased to 20.
10-02-2012