JDK-6336770 : Runtime.exec hangs on Solaris in opendir after fork when LD_PRELOAD=libmtmalloc.so

Details
Type: Bug
Submit Date: 2005-10-13
Status: Closed
Updated Date: 2010-05-08
Project Name: JDK
Resolved Date: 2006-01-10
Component: core-libs
OS: solaris_9, solaris_10
Sub-Component: java.lang
CPU: generic, sparc
Priority: P3
Resolution: Won't Fix
Affected Versions: 1.4.2_04, 1.4.2_10
Fixed Versions:

Description
In Web Server, calling Runtime.exec can hang the child process following a fork because Java_java_lang_UNIXProcess_forkAndExec calls opendir(3C). opendir calls malloc(3C) and malloc attempts to acquire a mutex whose state is undefined following fork1(2) in a multithreaded program.

This problem is causing the NetConnect production outage described in Web Server CR 6325704.

The child process has the following stack:

8503:	webservd -r /opt/SUNWwbsvr -d /opt/SUNWwbsvr/https-srs/config -n https
 febf58f4 lwp_park (0, 0, 0)
 febf166c mutex_lock_queue (fec08b44, 0, 21480, fec08000, 0, 0) + 104
 febf206c slow_lock (21480, 917a4000, 40, 0, 0, 0) + 58
 ff3816f4 malloc_internal (2010, 21480, 0, 0, 0, 0) + 48
 fe04f3fc opendir  (8f81d848, fd1bff58, 2137, fd9b1d48, 0, 8f81e3c0) + 8
 fd1b7568 Java_java_lang_UNIXProcess_forkAndExec (0, 10f968a8, fd1d36a4, 8f81dee4, 0, 8f81dee0) + 674
 f840b96c ???????? (8f81def0, b7, 0, f84152a0, 0, 8f81ddf8)
 f8405774 ???????? (8f81df7c, 0, 0, f84160d0, 1c, 8f81de78)
 f840010c ???????? (8f81e008, 8f81e210, a, f567b530, 10, 8f81df10)
 fd55d48c __1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_ (8f81e208, 8f81e0bc, 8f81e114, 3166b48, 3166b48, fd5b7a54) + 27c
 fd582420 __1cUjni_invoke_nonstatic6FpnHJNIEnv__pnJJavaValue_pnI_jobject_nLJNICallType_pnK_jmethodID_pnSJNI_ArgumentPusher_pnGThread__v_ (53b2ae4, 0, 71c104, 2, 2275b80, 8f81e1ec) + 50c
 fd6a74f4 jni_NewObjectV (3166bdc, 71c100, 2275b80, 8f81e2d0, 0, 0) + 224
 fd1ac478 JNU_NewObjectByName (3166bdc, fd1bed34, fd1bed4c, 8f81e3c8, 0, 8f81e3c0) + b0
 fd1b12d8 Java_java_lang_Runtime_execInternal (3166bdc, 8f81e3cc, 8f81e3c8, 0, 8f81e3c0, 0) + 80
 f840b96c ???????? (8f81e3cc, b7, 0, 98b20020, 0, f4073eb8)
 ...
 f840010c ???????? (8f81f770, 8f81f950, a, f4143138, 30, 8f81f658)
 fd55d48c __1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_ (8f81f948, 8f81f824, 8f81f854, 3166b48, 3166b48, 6965006c) + 27c
 fd654af4 __1cRjni_invoke_static6FpnHJNIEnv__pnJJavaValue_pnI_jobject_nLJNICallType_pnK_jmethodID_pnSJNI_ArgumentPusher_pnGThread__v_ (3166bdc, 8f81f948, 0, 0, 42ca0, 8f81f92c) + 220
 fd7cafd0 jni_CallStaticIntMethodV (3166bdc, 717f64, 42ca0, 8f81fa10, 8f81f9ac, fd7d539c) + 120
 fdb97c04 __1cHJNIEnv_TCallStaticIntMethod6MpnH_jclass_pnK_jmethodID_E_i_ (3166bdc, 717f64, 42ca0, 0, 8f81fabc, 809e6c) + 24
 fdb949b8 __1cONSAPIConnectorHservice6MpnRJ2EEVirtualServer__i_ (8f81fabc, 809e6c, 718480, f66c9c0, 0, f66c9b4) + 458
 fdb92c10 service_j2ee (3da48, 109ee0, 109f58, 2710, fdba218b, 0) + 40
 ff1cf944 __1cNfunc_exec_str6FpnKFuncStruct_pnGpblock_pnHSession_pnHRequest__i_ (668, 3da48, 109ee0, 109f58, 0, 0) + 248
 ff1d0d64 INTobject_execute (772d8, 109ee0, 109f58, 0, 38020, 357ac0) + 5e8
 ff1d5d94 INTservact_service (109ee0, 109f58, ff2e79e4, 0, 0, ff2e79bc) + 4d8
 ff1d64a4 INTservact_handle_processed (109ee0, 109f58, 20, 2, 189e8d8, 76648) + 158
 ff2189a4 __1cLHttpRequestUUnacceleratedRespond6Mpc_v_ (109e40, ff2e7a08, 2f48, 50, 109f58, 109ee0) + 3c8
 ff218094 __1cLHttpRequestNHandleRequest6MpnGnetbuf__i_ (109e40, 189c090, 189e118, 189e108, 2000, 189c0f0) + 62c
 ff216490 __1cNDaemonSessionDrun6M_v_ (8a8020, 2000, ff2ed644, 0, 0, ff2ed5fc) + 17c
 ff106dec ThreadMain (8a8020, 789840, 3, 0, 400, 4d4) + 24
 feddfd64 _pt_root (789840, 0, 0, 0, 20000, fedf8c28) + d0
 febf57b4 _lwp_start (0, 0, 0, 0, 0, 0)

                                    

Comments
EVALUATION

Given that:
- the problem occurs only when the JVM is used in a highly unusual way
  (with LD_PRELOAD),
- there are two underlying bugs in Solaris,
- it would be very reasonable for Solaris to fix libmtmalloc and backport that fix,
- the problem will eventually go away on its own as Solaris users upgrade,
- there is a simple workaround, and
- the most reliable way to work around this in the Java code would mean either
  doing an extra exec on each subprocess invocation or using
  non-standard Solaris-specific APIs, making the JDK's rather brittle
  process code even less maintainable,

I am lowering the priority and recommend that this eventually be closed as
Will Not Fix.
                                     
2005-12-15
WORK AROUND

Do not set
LD_PRELOAD=/usr/lib/libmtmalloc.so.1
on Solaris releases before Solaris 10.

Alternatively, upgrade to Solaris 10.
                                     
2005-11-13
EVALUATION

Hmmm...
4486978 libthread panic: fault in libthread critical section 

states:

The problem is happening because there is a malloc call after a fork1()
and before the exec(). This should not be done.

Our current understanding of the problem is that...

webservd is invoked with an interposed malloc library, libmtmalloc,
using LD_PRELOAD=...libmtmalloc.so

libmtmalloc is not "fork1-safe", i.e. calling malloc() after fork()
may deadlock.

The JDK, in its implementation of Runtime.exec,
calls fork(), then opendir() in the child, then exec().

opendir() used to call malloc in Solaris releases before Solaris 10, 
but now calls a libc-private malloc implementation, lmalloc,
which means that this problem should be gone on Solaris 10.
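
As a rough illustration of how a child could avoid opendir() (and therefore malloc()) between fork() and exec(), the descriptors can be closed with a plain loop. This is only a sketch and not what the JDK does: the function name spawn_closing_fds and the max_fd parameter are hypothetical, and the caller is assumed to choose the upper bound in the parent before forking.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not JDK code): fork, then in the child close
 * descriptors 3 .. max_fd-1 and exec the target. The child calls only
 * close(), execv() and _exit(), none of which allocate memory.
 * Returns the child's pid in the parent, or -1 if fork() failed.
 */
pid_t spawn_closing_fds(const char *path, char *const argv[], int max_fd)
{
    pid_t pid = fork();
    if (pid == 0) {
        for (int fd = 3; fd < max_fd; fd++)
            close(fd);          /* errors for unopened fds are ignored */
        execv(path, argv);
        _exit(127);             /* reached only if execv() failed */
    }
    return pid;
}
```

The trade-off is that the loop issues one close() per possible descriptor instead of one per open descriptor, which is slow when the descriptor limit is large.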

See also:
4945570 deadlock due to interaction between fork() and interposed malloc()

I've filed bug
6350045: libmtmalloc should protect itself from deadlocks after fork1() by using pthread_atfork()
to address the bug in libmtmalloc.
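
The kind of protection requested in 6350045 could look roughly like the following. This is a minimal sketch, not libmtmalloc's code: heap_lock stands in for an allocator's internal mutex, and every function name here is hypothetical. The prepare handler takes the lock before fork(), so no other thread can hold it at the moment the address space is copied; the parent and child handlers then release it on each side.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for an allocator's internal lock. */
static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;

static void heap_prepare(void) { pthread_mutex_lock(&heap_lock); }   /* before fork() */
static void heap_parent(void)  { pthread_mutex_unlock(&heap_lock); } /* after, in parent */
static void heap_child(void)   { pthread_mutex_unlock(&heap_lock); } /* after, in child */

/* Register the handlers once, e.g. at library initialization.
 * Returns 0 on success, as pthread_atfork() does. */
int heap_install_fork_handlers(void)
{
    return pthread_atfork(heap_prepare, heap_parent, heap_child);
}

/* Usage check: fork and report whether the child can acquire the lock.
 * Returns 1 if it can, 0 otherwise. */
int child_can_lock(void)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(pthread_mutex_trylock(&heap_lock) == 0 ? 0 : 1);

    int status = 0;
    if (pid < 0 || waitpid(pid, &status, 0) != pid)
        return 0;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

A real allocator with several locks would have to acquire all of them in the prepare handler, in a fixed order, and release them in both of the other handlers.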

A user can avoid deadlocks by either of:
- running their app on Solaris 10
- not using the unsafe libmtmalloc (that would entail a small change to the
  invocation of webservd)

The JDK could be made more bulletproof by using a helper program that
closes all file descriptors and then execs the real target program, but that
- complicates the Runtime.exec machinery and makes it less maintainable
- slows down all users of Runtime.exec
We are inclined not to fix (or work around) this bug in the JDK
and instead to encourage users of heavily
multi-threaded applications to upgrade to Solaris 10 as quickly as possible.
We also encourage the Solaris maintainers to make libmtmalloc fork1-safe,
allowing legacy apps to run on older Solaris releases.
Arguably, the use of non-standard non-fork1-safe malloc libraries is not a
supported configuration; it's a performance hack that can be made to work,
but not on all possible Solaris platforms.

Also, the underlying problem is that the default malloc() on Solaris is deemed to
be too slow in heavily multi-threaded applications.  
It seems worthwhile investing engineering resources on improving the 
default malloc implementation.
                                     
2005-10-14
EVALUATION

Very interesting.  I have not been aware of any deadlock.
The man page for fork1(2) seems to give the answer:

  fork() Safety
     If a Solaris threads application calls fork1() or a POSIX
     threads application calls fork(), and the child does more
     than simply call exec(), there is a possibility of deadlock
     occurring in the child. The application should use
     pthread_atfork(3C) to ensure safety with respect to this
     deadlock. Should there be any outstanding mutexes throughout
     the process, the application should call pthread_atfork() to
     wait for and acquire those mutexes prior to calling fork()
     or fork1(). See "MT-Level of Libraries" on the attributes(5)
     manual page.
                                     
2005-10-14


