JDK-6336770 : Runtime.exec hangs on Solaris in opendir after fork when LD_PRELOAD=libmtmalloc.so
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.lang
  • Affected Version: 1.4.2_04,1.4.2_10
  • Priority: P3
  • Status: Closed
  • Resolution: Won't Fix
  • OS: solaris_9,solaris_10
  • CPU: generic,sparc
  • Submitted: 2005-10-13
  • Updated: 2010-05-08
  • Resolved: 2006-01-10
In Web Server, calling Runtime.exec can hang the child process following a fork because Java_java_lang_UNIXProcess_forkAndExec calls opendir(3C). opendir calls malloc(3C) and malloc attempts to acquire a mutex whose state is undefined following fork1(2) in a multithreaded program.

This problem is causing the NetConnect production outage described in Web Server CR 6325704.

The child process has the following stack:

8503:	webservd -r /opt/SUNWwbsvr -d /opt/SUNWwbsvr/https-srs/config -n https
 febf58f4 lwp_park (0, 0, 0)
 febf166c mutex_lock_queue (fec08b44, 0, 21480, fec08000, 0, 0) + 104
 febf206c slow_lock (21480, 917a4000, 40, 0, 0, 0) + 58
 ff3816f4 malloc_internal (2010, 21480, 0, 0, 0, 0) + 48
 fe04f3fc opendir  (8f81d848, fd1bff58, 2137, fd9b1d48, 0, 8f81e3c0) + 8
 fd1b7568 Java_java_lang_UNIXProcess_forkAndExec (0, 10f968a8, fd1d36a4, 8f81dee4, 0, 8f81dee0) + 674
 f840b96c ???????? (8f81def0, b7, 0, f84152a0, 0, 8f81ddf8)
 f8405774 ???????? (8f81df7c, 0, 0, f84160d0, 1c, 8f81de78)
 f840010c ???????? (8f81e008, 8f81e210, a, f567b530, 10, 8f81df10)
 fd55d48c __1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_ (8f81e208, 8f81e0bc, 8f81e114, 3166b48, 3166b48, fd5b7a54) + 27c
 fd582420 __1cUjni_invoke_nonstatic6FpnHJNIEnv__pnJJavaValue_pnI_jobject_nLJNICallType_pnK_jmethodID_pnSJNI_ArgumentPusher_pnGThread__v_ (53b2ae4, 0, 71c104, 2, 2275b80, 8f81e1ec) + 50c
 fd6a74f4 jni_NewObjectV (3166bdc, 71c100, 2275b80, 8f81e2d0, 0, 0) + 224
 fd1ac478 JNU_NewObjectByName (3166bdc, fd1bed34, fd1bed4c, 8f81e3c8, 0, 8f81e3c0) + b0
 fd1b12d8 Java_java_lang_Runtime_execInternal (3166bdc, 8f81e3cc, 8f81e3c8, 0, 8f81e3c0, 0) + 80
 f840b96c ???????? (8f81e3cc, b7, 0, 98b20020, 0, f4073eb8)
 f840010c ???????? (8f81f770, 8f81f950, a, f4143138, 30, 8f81f658)
 fd55d48c __1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_ (8f81f948, 8f81f824, 8f81f854, 3166b48, 3166b48, 6965006c) + 27c
 fd654af4 __1cRjni_invoke_static6FpnHJNIEnv__pnJJavaValue_pnI_jobject_nLJNICallType_pnK_jmethodID_pnSJNI_ArgumentPusher_pnGThread__v_ (3166bdc, 8f81f948, 0, 0, 42ca0, 8f81f92c) + 220
 fd7cafd0 jni_CallStaticIntMethodV (3166bdc, 717f64, 42ca0, 8f81fa10, 8f81f9ac, fd7d539c) + 120
 fdb97c04 __1cHJNIEnv_TCallStaticIntMethod6MpnH_jclass_pnK_jmethodID_E_i_ (3166bdc, 717f64, 42ca0, 0, 8f81fabc, 809e6c) + 24
 fdb949b8 __1cONSAPIConnectorHservice6MpnRJ2EEVirtualServer__i_ (8f81fabc, 809e6c, 718480, f66c9c0, 0, f66c9b4) + 458
 fdb92c10 service_j2ee (3da48, 109ee0, 109f58, 2710, fdba218b, 0) + 40
 ff1cf944 __1cNfunc_exec_str6FpnKFuncStruct_pnGpblock_pnHSession_pnHRequest__i_ (668, 3da48, 109ee0, 109f58, 0, 0) + 248
 ff1d0d64 INTobject_execute (772d8, 109ee0, 109f58, 0, 38020, 357ac0) + 5e8
 ff1d5d94 INTservact_service (109ee0, 109f58, ff2e79e4, 0, 0, ff2e79bc) + 4d8
 ff1d64a4 INTservact_handle_processed (109ee0, 109f58, 20, 2, 189e8d8, 76648) + 158
 ff2189a4 __1cLHttpRequestUUnacceleratedRespond6Mpc_v_ (109e40, ff2e7a08, 2f48, 50, 109f58, 109ee0) + 3c8
 ff218094 __1cLHttpRequestNHandleRequest6MpnGnetbuf__i_ (109e40, 189c090, 189e118, 189e108, 2000, 189c0f0) + 62c
 ff216490 __1cNDaemonSessionDrun6M_v_ (8a8020, 2000, ff2ed644, 0, 0, ff2ed5fc) + 17c
 ff106dec ThreadMain (8a8020, 789840, 3, 0, 400, 4d4) + 24
 feddfd64 _pt_root (789840, 0, 0, 0, 20000, fedf8c28) + d0
 febf57b4 _lwp_start (0, 0, 0, 0, 0, 0)

EVALUATION
Given that:
- the problem only occurs when the JVM is used in a highly unusual way (with LD_PRELOAD),
- there are two underlying bugs in Solaris,
- it would be very reasonable for Solaris to fix libmtmalloc and backport that fix,
- the problem will eventually go away on its own as Solaris users upgrade,
- there is a simple workaround, and
- the most reliable way to work around this in the Java code would mean either doing an extra exec on each subprocess invocation or using non-standard Solaris-specific APIs, making the JDK's rather brittle process code even less maintainable,

I am lowering the priority and recommend that this eventually be closed as Will Not Fix.

WORK AROUND
Do not set LD_PRELOAD=/usr/lib/libmtmalloc.so.1 on Solaris releases earlier than Solaris 10, or upgrade to Solaris 10.

EVALUATION
Hmmm...

4486978 libthread panic: fault in libthread critical section
states:

The problem is happening because there is a malloc call after a fork1() and before the exec(). This should not be done.

Our current understanding of the problem is that...
- webservd is invoked with an interposed malloc library, libmtmalloc, using LD_PRELOAD=...libmtmalloc.so
- libmtmalloc is not "fork1-safe", i.e. calling malloc() after fork() may deadlock.
- The JDK, in its implementation of Runtime.exec, calls fork(), then opendir() in the child, then exec().
- opendir() used to call malloc in Solaris releases before Solaris 10, but now calls a libc-private malloc implementation, lmalloc, which means that this problem should be gone on Solaris 10.
  4945570 deadlock due to interaction between fork() and interposed malloc()

I've filed bug
6350045: libmtmalloc should protect itself from deadlocks after fork1() by using pthread_atfork()
to address the bug in libmtmalloc.

A user can avoid deadlocks by either of:
- running their app on Solaris 10
- not using the unsafe libmtmalloc (that would entail a small change to the invocation of webservd)

The JDK could be made more bulletproof by using a helper program that closes all file descriptors and then execs the real target program, but that
- complicates the Runtime.exec machinery and makes it less maintainable
- slows down all users of Runtime.exec

We are inclined to not fix (workaround?) this bug in the JDK and instead encourage users of heavily multi-threaded applications to upgrade to Solaris 10 as quickly as possible. We also encourage the Solaris maintainers to make libmtmalloc fork1-safe, allowing legacy apps to run on older Solaris releases.

Arguably, the use of non-standard non-fork1-safe malloc libraries is not a supported configuration; it's a performance hack that can be made to work, but not on all possible Solaris platforms.

Also, the underlying problem is that the default malloc() on Solaris is deemed to be too slow in heavily multi-threaded applications. It seems worthwhile investing engineering resources on improving the default malloc implementation.
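
For illustration only, the helper-program idea mentioned above could look roughly like the following C sketch. This is not the JDK's code: the names close_fd_range and run_helper, the highfd parameter, and the exit codes 2 and 127 are assumptions made for the example. The point it shows is that the helper closes descriptors with plain close(2) calls, so nothing between process start and exec touches opendir() or malloc().

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: close every descriptor in [lowfd, highfd).
 * It calls only close(2), so neither opendir() nor malloc() is involved. */
void close_fd_range(int lowfd, int highfd)
{
    assert(lowfd >= 0 && lowfd <= highfd);
    for (int fd = lowfd; fd < highfd; fd++) {
        (void)close(fd);        /* EBADF on unused slots is expected and ignored */
    }
}

/* Hypothetical body of the helper program: argv[1..] names the real target.
 * highfd (assumed to be at least 3) is supplied by the caller; a real helper
 * might derive it from sysconf(_SC_OPEN_MAX).
 * Returns only if no target was given or if exec failed. */
int run_helper(int argc, char *const argv[], int highfd)
{
    if (argc < 2) {
        return 2;               /* no target program given */
    }
    close_fd_range(3, highfd);  /* keep stdin, stdout and stderr open */
    execvp(argv[1], &argv[1]);
    return 127;                 /* reached only when exec failed */
}
```

A main() for such a helper would simply return run_helper(argc, argv, limit) for some descriptor limit. The cost noted above is visible here: every Runtime.exec would pay for one extra exec plus a pass over the descriptor table.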

EVALUATION
Very interesting. I have not been aware of any deadlock. The man page for fork1(2) seems to give the answer:

fork() Safety
If a Solaris threads application calls fork1() or a POSIX threads application calls fork(), and the child does more than simply call exec(), there is a possibility of deadlock occurring in the child. The application should use pthread_atfork(3C) to ensure safety with respect to this deadlock. Should there be any outstanding mutexes throughout the process, the application should call pthread_atfork() to wait for and acquire those mutexes prior to calling fork() or fork1(). See "MT-Level of Libraries" on the attributes(5) manual page.
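
The pthread_atfork() pattern that the man page describes, and that 6350045 asks libmtmalloc to adopt, can be sketched in C as below. This is a minimal illustration, not libmtmalloc's code: heap_lock stands in for an allocator's internal mutex, and every function name here is hypothetical. The prepare handler acquires the lock before fork(), so no other thread can be holding it at the moment the child is created, and both parent and child release it afterwards.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for an allocator's internal lock. */
static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;

/* Runs in the forking thread before fork(): take the lock so that it is
 * in a known state (held by this thread) when the child is created. */
static void heap_prepare(void)
{
    int rc = pthread_mutex_lock(&heap_lock);
    assert(rc == 0);
    (void)rc;
}

/* Run after fork() in the parent and in the child: release the lock. */
static void heap_parent(void) { pthread_mutex_unlock(&heap_lock); }
static void heap_child(void)  { pthread_mutex_unlock(&heap_lock); }

/* Register the handlers once, e.g. at library initialisation.
 * Returns 0 on success. */
int heap_install_atfork(void)
{
    return pthread_atfork(heap_prepare, heap_parent, heap_child);
}

/* Stand-in for a malloc() critical section. Returns 0 on success. */
int heap_locked_op(void)
{
    int rc = pthread_mutex_lock(&heap_lock);
    if (rc != 0) {
        return rc;
    }
    /* ... allocator state would be manipulated here ... */
    return pthread_mutex_unlock(&heap_lock);
}

/* Fork, use the lock in the child (as opendir() -> malloc() would),
 * and return the child's exit status, or -1 on error. */
int fork_and_touch_heap(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        _exit(heap_locked_op() == 0 ? 0 : 1);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) {
        return -1;
    }
    return WEXITSTATUS(status);
}
```

Without the handlers, a child forked while another thread held heap_lock would inherit a locked mutex with no thread left to release it, which matches the lwp_park frame at the top of the stack above.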