JDK-8215355 : Object monitor deadlock with no threads holding the monitor (using jemalloc 5.1)
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 7,8,11,12
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2018-12-12
  • Updated: 2022-06-27
  • Resolved: 2019-11-19
Fix versions:
  JDK 11: 11.0.7-oracle (Fixed)
  JDK 13: 13.0.4 (Fixed)
  JDK 14: 14 b24 (Fixed)
  JDK 8:  8u251 (Fixed)
  Other:  openjdk8u262 (Fixed)
Description
ADDITIONAL SYSTEM INFORMATION :
We've tested on the following operating systems: Debian 8 and 9, and Ubuntu 16.04, and on these OpenJDK JVM runtimes: multiple versions of 1.8 between 1.8.0.121 and 1.8.0.202, 1.9, and 1.11.

A DESCRIPTION OF THE PROBLEM :
We've run into an issue where our application deadlocks, and when looking at thread dumps we see that threads are waiting for an object monitor that no other thread is currently holding. The thread that the monitor references is parked in a threadpool waiting for its next task.

Test application - https://github.com/djscholl/jemalloc-java-deadlock
Example log output when a deadlock occurred - https://gist.github.com/djscholl/413071ef29671fb53b5a64e105421f1a

See https://github.com/jemalloc/jemalloc/issues/1392 for a more detailed description.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
https://github.com/djscholl/jemalloc-java-deadlock is a test application. There is a README with instructions.
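For illustration, here is a minimal sketch of the kind of reproducer described above (class name, thread count, iteration count and timeout are illustrative assumptions, not taken from the actual test application; the real reproducer and run instructions are in the linked repository). Many pool threads repeatedly synchronize on a single shared lock, and the main thread gives up and exits the loop early if an iteration times out:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Hypothetical sketch of the reported scenario, not the actual test application.
    public class DeadlockSketch {
        private static final Object LOCK = new Object();

        public static void main(String[] args) throws Exception {
            int threads = 1000; // the analysis comments mention ~1000 pool threads
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int iteration = 0; iteration < 10_000; iteration++) {
                List<Future<?>> futures = new ArrayList<>();
                for (int i = 0; i < threads; i++) {
                    futures.add(pool.submit(() -> {
                        synchronized (LOCK) {
                            // an empty critical section is enough to reproduce, per the comments below
                        }
                    }));
                }
                for (Future<?> f : futures) {
                    try {
                        f.get(60, TimeUnit.SECONDS);
                    } catch (TimeoutException e) {
                        // All remaining workers are blocked on LOCK even though no
                        // thread reports holding it: the deadlock described above.
                        System.err.println("Timed out at iteration " + iteration);
                        pool.shutdownNow();
                        return; // exit the loop early, as in the ACTUAL behavior
                    }
                }
            }
            pool.shutdown();
            System.out.println("All iterations completed");
        }
    }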

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The application should finish all iterations and not exit the loop early.
ACTUAL -
The application exits the loop early because of a timeout caused by the deadlock.

---------- BEGIN SOURCE ----------
https://github.com/djscholl/jemalloc-java-deadlock
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
So far we've worked around this by not upgrading to jemalloc 5.1. jemalloc 4.5 does not seem to cause the problem. Also glibc malloc does not seem to have the problem.

FREQUENCY : occasionally



Comments
Fix request (13u): I'd like to port it to 13u, too. The patch applies cleanly. Testing tier1.
02-06-2020

Probably an oversight, sorry. Fixed now.
23-03-2020

Why have you added both jdk8u-fix-request & jdk8u-critical-request?
18-03-2020

Fix Request (8u): Please approve backporting this to OpenJDK 8u. This patch is in Oracle 8u251. The patch didn't apply cleanly, but the conflicts were resolved trivially and the result was reviewed by David Holmes. I've managed to reproduce the issue prior to the patch, and it seems to be gone after it. Testing: tier 1, fastdebug. webrev: http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8215355/01/webrev/ Original change: https://hg.openjdk.java.net/jdk/jdk/rev/189f47d990b5 RFR: http://mail.openjdk.java.net/pipermail/jdk8u-dev/2020-March/011364.html
11-03-2020

Fix request (11u): I would like to downport this for parity with 11.0.7-oracle. Applies cleanly.
06-12-2019

URL: https://hg.openjdk.java.net/jdk/jdk/rev/189f47d990b5 User: dholmes Date: 2019-11-19 04:41:52 +0000
19-11-2019

This bug can be traced back to JDK-6699669. Prior to JDK-6699669 we used a "highest_locker" field to track whether a monitor may be stack-locked. That code was:

    if (_highest_lock >= adr && adr >= end) return true;

where the >= is correct. But JDK-6699669 introduced the more general check:

    return (_stack_base >= adr && adr >= (_stack_base - _stack_size));

where the use of >= in relation to _stack_base is incorrect.
18-11-2019

Thread::on_local_stack(address adr) does:

    return (_stack_base >= adr && adr >= stack_end());

but _stack_base is actually the first address beyond our stack, so the check should be:

    return (_stack_base > adr && adr >= stack_end());

The problem is likely only seen on jemalloc because it happens to hand out memory in such a way that we have JavaThread instances abutting memory allocated for thread stacks. This also explains why the problem disappeared when they added padding prior to the JavaThread instance.
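To make the off-by-one concrete, here is a small Java sketch of the comparison logic (the actual HotSpot code above is C++; the helper names here are made up for illustration). The stack occupies the half-open range [stack_end, _stack_base), so an address equal to _stack_base is not on this thread's stack, yet the old inclusive check accepts it. The addresses are the ones from the "A breakthrough" comment below (15-11-2019), where the monitor owner word equalled pool-1-thread-131's _stack_base, which was also the address of another JavaThread:

    // Illustrative sketch only; HotSpot's Thread::on_local_stack is C++.
    public class OnLocalStackSketch {

        // Old check: ">=" wrongly treats adr == stackBase as being on this stack.
        static boolean onLocalStackOld(long stackBase, long stackEnd, long adr) {
            return stackBase >= adr && adr >= stackEnd;
        }

        // Fixed check: stackBase is the first address beyond the stack, so exclude it.
        static boolean onLocalStackFixed(long stackBase, long stackEnd, long adr) {
            return stackBase > adr && adr >= stackEnd;
        }

        public static void main(String[] args) {
            // pool-1-thread-131: stack(0x00007f75d3f91000, 0x00007f75d4092000)
            long stackEnd  = 0x00007f75d3f91000L;
            long stackBase = 0x00007f75d4092000L;
            // The monitor owner word was exactly stackBase, which is also the
            // address of the JavaThread for pool-1-thread-135.
            long ownerWord = 0x00007f75d4092000L;

            // Signed comparison is fine here since these addresses fit in a positive long.
            System.out.println(onLocalStackOld(stackBase, stackEnd, ownerWord));   // true: misread as a stack lock
            System.out.println(onLocalStackFixed(stackBase, stackEnd, ownerWord)); // false: not on this stack
        }
    }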
15-11-2019

A breakthrough. I instrumented the ObjectMonitor code to watch the upgrade from a BasicLock (aka stack-lock) to the JavaThread*:

    exit BL: obj->mon->thr->stack 0x000000044f800590 -> 0x00007f75eb218180 -> 0x00007f75d41a0000, 0x00007f75cc092ea0
    enter BL: obj->mon->thr->stack 0x000000044f800590 -> 0x00007f75eb218180 -> 0x00007f75f85ef800, 0x00007f75d4092000

Here we see one thread doing a monitor_exit and finding it had the monitor stack-locked, but now it has been inflated, so it transitions to being locked by the thread reference. That thread is:

    0x00007f75d41a0000 JavaThread "pool-1-thread-255" [_thread_blocked, id=25737, stack(0x00007f75cbf95000,0x00007f75cc096000)]

Then we see a second thread operating on the same (object,monitor) pair doing a similar transition on monitor_enter. But that should be impossible, as only one thread can ever stack-lock an Object! The second thread is:

    0x00007f75f85ef800 JavaThread "pool-1-thread-131" [_thread_blocked, id=25613, stack(0x00007f75d3f91000,0x00007f75d4092000)]

but note that the "address" that indicated the Object was stack-locked by this thread is 0x00007f75d4092000 - the very end of its stack! But it gets worse - take a look here:

    0x00007f75d4092000 JavaThread "pool-1-thread-135" [_thread_blocked, id=25617, stack(0x00007f75d3b8d000,0x00007f75d3c8e000)]

That address is also the address of a JavaThread! So suppose pool-1-thread-135 actually owns the monitor when pool-1-thread-131 tries to lock it. thread-131 sees the owner field representing thread-135 but thinks that address is part of its own stack, and so concludes it has the monitor stack-locked! So now two threads "own" the monitor at the same time and chaos can ensue.
15-11-2019

I continue to easily reproduce the problem without any insight as to the cause. When we unlock the monitor we find:
- it is owned by a different thread; or
- the number of recursive locks is 1, not 0; or
- both of the above.
Contrary to what was posted in the jemalloc issue on GitHub, the problem is not related to the uses of UNSAFE via ThreadLocalRandom. I can reproduce with the synchronized code reduced to just: synchronized (lock) { }. I've checked monitor inflation and there is nothing untoward that I can see there. All threads are looping synchronizing on the same Object and thus with the same monitor.
15-11-2019

I'm reopening this based on the updated information in https://github.com/jemalloc/jemalloc/issues/1392 though I still cannot see how the findings there would cause these issues. They add padding before the JavaThread allocation and the problem goes away, but the issue appears to be with the ObjectMonitor memory not the JavaThread memory.
12-11-2019

Additional information from the submitter, requesting that this issue be reopened == A jemalloc developer has done significant debugging and has shown strong evidence that this is a Java issue. Here is his comment on the jemalloc issue - https://github.com/jemalloc/jemalloc/issues/1392#issuecomment-549554636
05-11-2019

My next failure was different:

    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    # Internal Error (objectMonitor.cpp:365), pid=17824, tid=18004
    # guarantee(_recursions == 0) failed: invariant

This is the kind of failure that explains what we see. If the recursion count starts off wrong then an unlock will be treated as a recursive unlock and leave the monitor locked forever. However the "unbalanced monitor locking" crash still occurred. I think the issue I am seeing is memory corruption of the ObjectMonitor instance. Next failure:

    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    # Internal Error (objectMonitor.cpp:366), pid=25105, tid=25285
    # guarantee(_owner == Self) failed: invariant
    #

So it definitely appears to be memory corruption to me. And this is failing much more often now.
10-10-2019

I've exhausted what I can do to investigate this. All appearances are that it is a jemalloc issue related to the thread caching. Closing as "External". Please reopen if further information comes to light.
10-10-2019

Problem still reproduces with jemalloc 5.2.1
09-10-2019

MALLOC_CONF=tcache:false seems to avoid the problem: no failures in 400 runs, whereas a failure normally happens in fewer than 30. Disabling biased locking does not fix the problem, but does seem to reduce the frequency. The problem still reproduces with -Xint, so it is not a JIT problem (like JDK-8213825). The problem still reproduces with SerialGC, so it is unlikely to be a GC problem.
09-10-2019

I just discovered that the tcache uses compiler-based TLS (i.e. __thread T t = T_INITIALIZER;), so I wonder if this issue relates to that somehow.
09-10-2019

Bizarre:

    # A fatal error has been detected by the Java Runtime Environment:
    #
    # Internal Error (objectMonitor.cpp:365), pid=30637, tid=30818
    # guarantee(_recursions == 0) failed: _recursions was 0x0000000000000000

When I read the value into a local I see:

    # guarantee(_r == 0) failed: _recursions was 0x0000000000000001

but when I debug the corefile _recursions is 0 again. So we seem to have multiple threads active within the same monitor, which should of course not be possible. BTW when we fail a guarantee it usually happens in less than a second.
08-10-2019

I changed a bunch of assertions to guarantees and got this:

    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    # Internal Error (objectMonitor.cpp:884), pid=29858, tid=30061
    # guarantee(false) failed: Non-balanced monitor enter/exit! Likely JNI locking
    #
    # JRE version: Java(TM) SE Runtime Environment (14.0) (build 14-internal+0-2019-10-08-0614107.daholme...)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (14-internal+0-2019-10-08-0614107.daholme..., mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
    # Problematic frame:
    # V  [libjvm.so+0xbaafc1]  ObjectMonitor::exit(bool, Thread*)+0x371

and again we have a truncated stack!

    ---------------  T H R E A D  ---------------

    Current thread (0x00007fcc27f6b000):  JavaThread "pool-1-thread-168" [_thread_in_Java, id=30061, stack(0x00007fcc27c68000,0x00007fcc27d69000)]

    Stack: [0x00007fcc27c68000,0x00007fcc27d69000],  sp=0x00007fcc27d65e50,  free space=1015k
    Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
    V  [libjvm.so+0xbaafc1]  ObjectMonitor::exit(bool, Thread*)+0x371
    V  [libjvm.so+0xc9560a]  SharedRuntime::complete_monitor_unlocking_C(oopDesc*, BasicLock*, JavaThread*)+0x3a
08-10-2019

I took another look at this today, but still feel it has to be a jemalloc issue. It still only seems to reproduce with product bits. On a failing run the first thread dump shows the owner thread actively owning the monitor while every other pool thread waits. Then the second dump shows the "owner" has completed its task and is back in the threadpool waiting for its next task. Meanwhile all the other threads remain blocked, and the thread dump shows that the "owner" thread still appears to be the owner of the monitor. However, I augmented the test so that after the loop with the synchronized block we check Thread.holdsLock(lock), and that indicates the thread is not still holding the lock. One curiosity is that in my current batch of testing it always seems to be the same thread that is involved: pool-1-thread-151.
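For reference, a minimal sketch of that augmentation (class name and iteration count are illustrative assumptions; the actual test application is the one linked in the description): each worker runs its loop of synchronized blocks and afterwards asks Thread.holdsLock(lock), which reports false even for the thread that the other threads' dumps still name as the monitor owner:

    // Hypothetical sketch of the Thread.holdsLock() check described above.
    public class HoldsLockCheckSketch {
        static final Object lock = new Object();
        static final int ITERATIONS = 1_000;

        static final Runnable task = () -> {
            for (int i = 0; i < ITERATIONS; i++) {
                synchronized (lock) {
                    // empty critical section, as in the reduced test
                }
            }
            // Sanity check after the loop: the thread dumps claim this thread still
            // owns the monitor, yet holdsLock() says it does not.
            if (Thread.holdsLock(lock)) {
                System.err.println(Thread.currentThread().getName()
                        + " unexpectedly still holds the lock");
            }
        };

        public static void main(String[] args) throws InterruptedException {
            Thread worker = new Thread(task, "worker");
            worker.start();
            worker.join();
        }
    }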
08-10-2019

ILW = HLM = P3
18-06-2019

Investigating. Initial logs show the lock owner thread is still runnable.
13-06-2019

So far the root cause is not clear. From my observations above it looks like this is an application issue. If this is found to be a locking bug, then it looks close to JDK-8213825, which is still in Open state. We may have to wait some more time until JDK-8213825 gets fixed.
10-01-2019

A couple of other observations with the new test case shared by the submitter, with reference to the attachments:

    log_threadstack1 - thread stack dump (collected during the hang)
    log_jstack_output - jstack output (collected during the hang: jhsdb jstack --mixed --pid)
    log_threadstack2 - thread stack dump (collected 10 minutes after the hang)

log_threadstack1 (collected during the hang) shows 999 threads waiting for "pool-1-thread-405", while "pool-1-thread-405" is runnable:

    Line 443:  "pool-1-thread-1" prio=5 Id=12 BLOCKED on java.lang.Object@15d0c81b owned by "pool-1-thread-405" Id=416
    Line 6099: "pool-1-thread-405" prio=5 Id=416 RUNNABLE

In log_threadstack2 (collected 10 minutes after the hang), 999 threads are still waiting for "pool-1-thread-405", and "pool-1-thread-405" is waiting on a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@80503:

    Line 5687: "pool-1-thread-405" prio=5 Id=416 WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@80503

Unfortunately I didn't collect the jstack output 10 minutes after the hang; that procedure itself requires a lot of time. David, let me know if I can be of any help on this.
21-12-2018

Another observation: the issue is not reproducible with biased locking disabled (-XX:-UseBiasedLocking). Tested with 2 rounds (1 hour).
21-12-2018

Please check if this reproduces with -Xint - thanks.
21-12-2018

"memory zapping" involves writing a known "bad pattern" through memory to trach down use of non-zero'd or other uninitialized memory.
21-12-2018

Additional information from submitter ==
Thanks for the work you all have done so far to look into this issue! I've got some comments/questions.

1. We've reproduced this without MessageDigest.getInstance(). I just updated the GitHub example application to simplify the code and remove MessageDigest. I also added some more output from the tool, this time with the simpler version. I took two stack dumps, since I noticed that the first one the application takes actually shows the thread still holding the monitor, but if you wait 60s (probably much less) and take another stack dump, you can see that it's no longer holding it, but everyone else is still waiting on it.

2. What do you mean by "memory zapping"?

3. We tried some more jemalloc configurations, and it reproduces on all of them. Not sure if these are in any way similar to what the debug version is doing with its "memory zapping":

    ./configure --enable-stats --enable-fill --with-malloc-conf=junk:true
    ./configure --enable-stats --enable-fill --with-malloc-conf=zero:true

I agree that this only appears to be reproducible when using jemalloc 5.1.
==
21-12-2018

The issue is reproducible with the -Xint option.
21-12-2018

Requested further information from the submitter ==
From our analysis we found that the issue doesn't reproduce on a debug build, as we do extra memory zapping there, whereas the issue is only reproducible on a product build; that could be due to a memory zeroing issue. Analysis with the Java debugging tools and gdb found that the stack traces are broken. From the issue discussion at https://github.com/jemalloc/jemalloc/issues/1392 it appears that the issue was not reproducible with glibc malloc or jemalloc 4.5.0. Certainly the https://github.com/jemalloc/jemalloc/commit/498856f44a30b31fe713a18eb2fc7c6ecf3a9f63 changeset in jemalloc 5.1 is causing this issue. Kindly investigate further with the changes added in that changeset, and provide us the exact line of change that shows the issue to be in Java, with a reduced test case (without any dependency on jemalloc).
20-12-2018

Thanks [~fmatte]. Unfortunately the logs are of no help - for some reason all the stacks are truncated both in jstack and gdb! Unsure if we can make any progress on this. It could be a memory zeroing issue ... hence the extra memory zapping done in fastdebug stops it from appearing.
19-12-2018

Hi David, please find the attachment jemalloc.log with complete stack information (jhsdb jstack --mixed --pid). The JVM has not detected any deadlock ("No deadlocks found."). Thread backtraces from gdb are in the attachment gdb-threads-bt.log. Do let me know if you need any other input.
19-12-2018

It stands to reason that if jemalloc is a malloc replacement then we're likely looking at some kind of memory corruption.
18-12-2018

[~fmatte] Thanks for clarifying that it does not reproduce in fastdebug. The stack traces in the log files are not complete - note the "..." at the end. I need to see the full stack, including native frames if possible. We probably need to attach a debugger.
18-12-2018

JDK 12 b20 also doesn't reproduce for me.
18-12-2018

JDK 12 b24 doesn't seem to reproduce this issue. But if the monitor concerned has been removed this is not a fix but an avoidance of the problem.
18-12-2018

The discussion on the jemalloc page suggests this is somehow an issue introduced by the jemalloc native code. Exactly how is very unclear. I tried to reproduce, but it seems my Linux version may be too old (or new?):

    seq: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.17' not found (required by /export/users/dh198349/9058520/libjemalloc.so)

[~fmatte] Can we get a complete stack - Java and native - of the thread that supposedly owns the monitor?
18-12-2018

A bottleneck is not the same as no-one owning the lock that is being waited on!
18-12-2018

Possible, I will try the test case and see if I can reproduce it with JDK 12.
18-12-2018

[~valeriep] Is it possible the fix for JDK-7092821 might also fix this issue?
17-12-2018

The issue is reproducible on 12 also; we could see multiple threads waiting for the same lock:

    "pool-1-thread-1" prio=5 Id=12 BLOCKED on sun.security.provider.Sun@1efed156 owned by "pool-1-thread-593" Id=604
        at java.base@12-ea/java.security.Provider.getService(Provider.java:1252)
        - blocked on sun.security.provider.Sun@1efed156
        at java.base@12-ea/sun.security.jca.ProviderList.getService(ProviderList.java:380)
        at java.base@12-ea/sun.security.jca.GetInstance.getInstance(GetInstance.java:157)
        at java.base@12-ea/java.security.Security.getImpl(Security.java:700)
        at java.base@12-ea/java.security.MessageDigest.getInstance(MessageDigest.java:178)
        at app//DeadlockTest.lambda$null$1(DeadlockTest.java:43)
        at app//DeadlockTest$$Lambda$11/0x0000000801177440.run(Unknown Source)
        at java.base@12-ea/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        ...

        Number of locked synchronizers = 1
        - java.util.concurrent.ThreadPoolExecutor$Worker@776ec8df

"pool-1-thread-2" (prio=5 Id=13, Worker@4eec7777), "pool-1-thread-3" (prio=5 Id=14, Worker@3b07d329) and "pool-1-thread-4" (prio=5 Id=15) show the identical stack, each BLOCKED on sun.security.provider.Sun@1efed156 owned by "pool-1-thread-593" Id=604.
13-12-2018