JDK-8028073 : race condition in ObjectMonitor implementation causing deadlocks
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: hs24,hs25,6u71,7u51
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2013-11-08
  • Updated: 2014-10-24
  • Resolved: 2014-02-10
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 7 JDK 8 JDK 9
7u51Fixed 8u20Fixed 9 b04Fixed
Related Reports
Relates :  
Relates :  
Description
The bug JDK-8014910 (and its dup: JDK-8015729) was fixed by applying
the anti-delta to the fix of the JDK-8008962.
In fact, this was just a work around the issue that remained in the ObjectMonitor implementation.
The above fix lowered the deadlocks probability but they still have a potential to happen.
We may have a long history of tests timeouts in the nightly caused by this issue.
These timeouts are normally hard to investigate, they cause tests instability and slowdowns.

Below is the email exchange on this issue and variants of fixes.

Hi Dan,

On 8/11/2013 12:15 PM, Daniel D. Daugherty wrote:
> Been reading through these e-mails and the code for a couple of hours.
> I concur that an early return when (_succ != NULL) is not 'safe' since
> that successor might be parked. However, I think I would handle it
> differently and clear the _succ field if that _succ was about to park
> itself so that _succ selection was not cached across the park().

I also suggested this then retracted it. There would be a race with clearing _succ that would cause the same hang as today (unless the check is also modified in some way).

That said, another possibility would be that if a thread sees that it is _succ then it uses the timed-park instead of the untimed one in ReenterI:

           if (SyncFlags & 1 || _succ == Self) {
              Self->_ParkEvent->park ((jlong)1000) ;
           } else {
              Self->_ParkEvent->park () ;
           }


> Removing the '_succ != NULL' fragment below pretty much means that
> the non-NULL _succ value will not keep you from doing all the work
> that is below those code blocks in ObjectMonitor::exit(). I think
> not doing that work is exactly what a non-NULL _succ was supposed
> to help you do.

I also considered that the "find the successor" logic could be short-circuited if _succ was already non-null. I don't think it can be critical to performance though because if it was often the case that _succ was non-null then we would hit this hang - and we don't. Hence it seems to me it must be rare to hit this case and so finding the next successor when we already have one will not impact overall performance.

That said it may be that we can simply do the following:

 if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0) {
    return ;
 }
 if (_succ != NULL) {
    _succ->event->unpark();
    return;
 }
 // else we must set up the successor

But I worry about making changes to an extremely complex protocol, especially when we seem to have found a condition that indicates that the detailed explanation of the successor protocol (lines 1003 - 1032) is in fact flawed.

David
-----

>
> Dan
>
>
> On 11/6/13 2:31 AM, serguei.spitsyn@oracle.com wrote:
>> On 11/5/13 8:49 PM, David Holmes wrote:
>>> Hi Serguei,
>>>
>>> I don't agree with your fix Please read on ...
>>>
>>> This is very tricky code ...
>>>
>>> We have in essence:
>>>
>>> if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0 || _succ != NULL)
>>>   return;
>>>
>>> So lets break this apart and consider each part one at a time:
>>>
>>> if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0)
>>>   // no one waiting for entry so nothing to do
>>>   return;
>>>
>>> No problem there. Otherwise we have at least one waiting thread
>>> either in the _EntryList or the _cxq.
>>
>>> (Note if _succ != NULL the queues can not be empty!) So we
>>> potentially have to wake a waiting thread up.
>>
>> Then there is no point to check the condition (_succ != NULL) as we
>> already checked
>> the condition (intptr_t(_EntryList)|intptr_t(_cxq)) == 0).
>> I see below, you've already come to the conclusion that the _succ
>> condition is not needed.
>>
>>
>>> Now if _succ == NULL then we definitely have to wake something up as
>>> we have to make it the new _succ! But if _succ != NULL then this has
>>> already been done - the successor has been chosen (not 100% sure by
>>> whom or when) so we "don't need to do that". In which case:
>>>
>>> if (_succ != NULL)
>>>   // someone already handled _succ
>>>   return;
>>>
>>> so the original code seems correct as it stands. But lets revisit the
>>> "don't need to do that" part.
>>>
>>> The key to the above is understanding how _succ already came to be
>>> set. So lets examine the scenario that was put on the open email
>>> thread. Thread t1 did a wait(). Thread t2 did a notify and exit,
>>> which unparked t1 after making it _succ. But thread t3 sneaks in and
>>> grabs the monitor via the fast-path. So t1 re-parks itself but is
>>> still _succ. Now t3 does the exit and it hits the logic we are
>>> discussing and it will just return because _succ != NULL. But we get
>>> a hang so this is "obviously" wrong so it should not have returned if
>>> _succ != NULL. But we previously established it should also not
>>> return if _succ == NULL (as the queues are not empty and a successor
>>> must be appointed!).
>>
>> My fix was based on the same judgement as above.
>> I agree, we can get rid of the _succ condition but it still can be
>> incomplete due to other aspects.
>>
>>>
>>> Based on that the "correct" form of the logic would seem to be simply:
>>>
>>> if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0)
>>>   // no one waiting for entry so nothing to do
>>>   return;
>>>
>>
>> 1000 runs of the test with the fix above did not hang.
>>
>>> Aside: Arguably when the code later finds a successor it could check
>>> that it isn't displacing the existing _succ thread (which would only
>>> be a fairness consideration).
>>>
>>> Alternatively I think the existing code would also work if when t3
>>> reparks it first clears itself from _succ. Or maybe t3 should have
>>> become _Responsible and so done a timed-park? It is hard to say what
>>> is the real underlying problem here because the protocol is so
>>> complicated. Any of the suggested fixes may be incomplete due to
>>> other aspects of that protocol.
>>>
>>> I'm assuming Dave Dice must be on vacation without email access,
>>> otherwise I'm sure he would have jumped in on this just because it is
>>> so much fun
>>
>> I hope, Dave will have this fun upon return from vacation.
>>
>>
>> Thanks for the nice analysis, David!
>> Serguei
>>
>>>
>>> Cheers,
>>> David
>>>
>>> On 6/11/2013 6:27 AM, serguei.spitsyn@oracle.com wrote:
>>>>
>>>> On 11/4/13 6:09 PM, serguei.spitsyn@oracle.com wrote:
>>>>> Hi David and Dave,
>>>>>
>>>>> (Added Staffan and Dan to the cc-list)
>>>>>
>>>>> It looks like I managed to solate the issue to the following fix:
>>>>
>>>> Webrev for convenience:
>>>> http://javaweb.sfbay.sun.com/java/svc/ss45998/webrevs/2013/hotspot/ObjectMonitor/
>>>>
>>>>
>>>> The first fragment under the condition "if (Knob_ExitPolicy == 0)" is
>>>> actual fix.
>>>> The other fragment is fixed for consistency in a case the
>>>> Knob_ExitPolicy is changed in the future.
>>>>
>>>> Thanks,
>>>> Serguei
>>>>
>>>>>
>>>>> % hg diff
>>>>> diff -r 04df110c8655 src/share/vm/runtime/objectMonitor.cpp
>>>>> --- a/src/share/vm/runtime/objectMonitor.cpp    Sat Nov 02 20:56:18
>>>>> 2013 +0100
>>>>> +++ b/src/share/vm/runtime/objectMonitor.cpp    Mon Nov 04 20:47:37
>>>>> 2013 -0800
>>>>> @@ -994,7 +994,7 @@ void ATTR ObjectMonitor::exit(bool not_s
>>>>>           // ST on x64, x86 and SPARC.
>>>>>           OrderAccess::release_store_ptr (&_owner, NULL) ; // drop the
>>>>> lock
>>>>>           OrderAccess::storeload() ; // See if
>>>>> we need to wake a successor
>>>>> -         if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0 || _succ !=
>>>>> NULL) {
>>>>> +         if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0 && _succ ==
>>>>> NULL) {
>>>>>              TEVENT (Inflated exit - simple egress) ;
>>>>>              return ;
>>>>>           }
>>>>> @@ -1042,7 +1042,7 @@ void ATTR ObjectMonitor::exit(bool not_s
>>>>>           }
>>>>>           TEVENT (Exit - Reacquired) ;
>>>>>        } else {
>>>>> -         if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0 || _succ !=
>>>>> NULL) {
>>>>> +         if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0 && _succ ==
>>>>> NULL) {
>>>>>              OrderAccess::release_store_ptr (&_owner, NULL) ;   //
>>>>> drop the lock
>>>>>              OrderAccess::storeload() ;
>>>>>              // Ratify the previously observed values.
>>>>>
>>>>>
>>>>> With this fix above the hang does not happen anymore (I test it with
>>>>> the Ioi's fix rolled back).
>>>>> The test passed in 2000 runs on Solaris-amd64.
>>>>> I'm currently verifying this fix with the nsk.jvmti and nsk.jdi tests.
>>>>> They look good so far.
>>>>>
>>>>> Additionally, it looks like the test MonitorEvents002 is executed
>>>>> noticeably faster,
>>>>> but need to measure it more carefully.
>>>>>
>>>>> The Ioi's fix of the JDK-8015729 was just a work around the existing
>>>>> issue.
>>>>> We may have other deadlocks in Nightlies that are caused by this
>>>>> issue.
>>>>>
>>>>> I will file a bug on this if it is Ok with you.
>>>>> Please, let me know your opinion and if you have any concerns.
>>>>>
>>>>> Thanks,
>>>>> Serguei
>>>>>
>>>>>
>>>>> On 10/29/13 10:49 AM, serguei.spitsyn@oracle.com wrote:
>>>>>> On 10/29/13 4:19 AM, David Holmes wrote:
>>>>>>> Dave,
>>>>>>>
>>>>>>> FYI this bug seems related to a query on the openlists about how
>>>>>>> succ_ was used. I haven't had time to try and analyze it in detail.
>>>>>>> Serguei has some comments below.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>>
>>>>>>> -------- Original Message --------
>>>>>>> Subject: [JBS] (JDK-8015729)
>>>>>>> nsk/jdi/stress/MonitorEvents/MonitorEvents002 hangs
>>>>>>> Date: Tue, 29 Oct 2013 10:23:33 +0000 (UTC)
>>>>>>> From: Serguei Spitsyn (JBS) <do-not-reply@openjdk.java.net>
>>>>>>> To: david.holmes@oracle.com
>>>>>>>
>>>>>>>
>>>>>>>     [
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8015729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421337#comment-13421337
>>>>>>>
>>>>>>> ]
>>>>>>>
>>>>>>> Serguei Spitsyn edited comment on JDK-8015729 at 10/29/13 10:22 AM:
>>>>>>> --------------------------------------------------------------------
>>>>>>>
>>>>>>> Focusing on the following fragment:
>>>>>>>
>>>>>>> void ATTR ObjectMonitor::exit(bool not_suspended, TRAPS) {
>>>>>>>   . . .
>>>>>>>    for (;;) {
>>>>>>>       . . .
>>>>>>>       if (Knob_ExitPolicy == 0) {
>>>>>>>          . . .
>>>>>>>          if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0 || _succ !=
>>>>>>> NULL) {
>>>>>>>             TEVENT (Inflated exit - simple egress) ;
>>>>>>>             return ;
>>>>>>>          }
>>>>>>>
>>>>>>> The line with "_succ != NULL" is a suspect.
>>>>>>> Is a possible unparking of the _succ thread missed above?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>       was (Author: sspitsyn):
>>>>>>>     Focusing on the following fragment:
>>>>>>>
>>>>>>> void ATTR ObjectMonitor::exit(bool not_suspended, TRAPS) {
>>>>>>>   . . .
>>>>>>>    for (;;) {
>>>>>>>       . . .
>>>>>>>       if (Knob_ExitPolicy == 0) {
>>>>>>>          . . .
>>>>>>>          if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0 || _succ !=
>>>>>>> NULL) {
>>>>>>>             TEVENT (Inflated exit - simple egress) ;
>>>>>>>             return ;
>>>>>>>          }
>>>>>>>
>>>>>>> The line with "_succ != NULL" is a suspect.
>>>>>>> Just wonder if the condition is correct. Should we replace the '||'
>>>>>>> with '&&' ?
>>>>>>
>>>>>> Please, skip this comment.
>>>>>> Most likely, the line is correct as the test hangs reliably with the
>>>>>> replace of '||' to '&&'.
>>>>>>
>>>>>>
>>>>>>> Is the unparking of the _succ thread missed above?
>>>>>>
>>>>>> This is a concern.
>>>>>> It seems there is a synchronization whole allowing a race between
>>>>>> the ObjectMonitor::exit (T#3) and re-parking of the notified
>>>>>> thread T#1
>>>>>> blocked in ObjectMonitor::ReenterI().
>>>>>> We may need a better prove and details of what exactly happens in
>>>>>> such hangs.
>>>>>>
>>>>>> Thanks,
>>>>>> Serguei

Comments
Added 'noreg-other'. There is a test case attached to the bug report, but it requires use of a native JVM/TI agent. That's not easily/reliably done with the current JTREG mechanisms so there is no JTREG test.
28-02-2014

Here are the JDK9 webrev URLs: OpenJDK: http://cr.openjdk.java.net/~dcubed/8028073-webrev/0-jdk9-hs-runtime/ Oracle internal: http://javaweb.us.oracle.com/~ddaugher/8028073-webrev/0-jdk9-hs-runtime/
02-02-2014

I've attached the StessMonitorWait test that can be used to reproduce the hang and verify the fix.
31-01-2014

Also reproduced the hang using the following bits on svc6: $ uname -a SunOS svc6 5.11 11.1 i86pc i386 i86pc java version "1.7.0_51" Java(TM) SE Runtime Environment (build 1.7.0_51-b14) Java HotSpot(TM) Server VM (build 24.51-b03, mixed mode) Java HotSpot(TM) Server VM (24.51-b03) for solaris-x86 JRE (1.7.0_51-b14), built on Jan 13 2014 19:25:05 by "" with Sun Studio 12u1 Also reproduced the hang on oliver-mm1: $ uname -a Darwin oliver-mm1 11.4.2 Darwin Kernel Version 11.4.2: Thu Aug 23 16:25:48 PDT 2012; root:xnu-1699.32.7~1/RELEASE_X86_64 x86_64 java version "1.8.0-ea" Java(TM) SE Runtime Environment (build 1.8.0-ea-b124) Java HotSpot(TM) 64-Bit Server VM (build 25.0-b66, mixed mode) Java HotSpot(TM) 64-Bit Server VM (25.0-b66) for bsd-amd64 JRE (1.8.0-ea-b124), built on Jan 17 2014 09:42:15 by "java_re" with gcc 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00) It takes more runs to reproduce the hang with StressMonitorWait on MacOS 10.7.5. Based on the loop counts for the runs that didn't fail, it looks like MacOS X thread scheduling is not as "fair" as Solaris: SMW_WorkerThread: 'T1' is about to wait() in stress loop. StressMonitorWait: starting T2 StressMonitorWait: starting T3 SMW_WorkerThread: 'T2' is running. StressMonitorWait: starting T4 SMW_WorkerThread: 'T2' is about to enter-notify-exit in stress loop. SMW_WorkerThread: 'T3' is running. SMW_WorkerThread: 'T3' is about to enter-exit in stress loop. SMW_WorkerThread: 'T4' is running. SMW_WorkerThread: 'T4' is about to enter-delay-exit in stress loop. StressMonitorWait: done with stress testing. SMW_WorkerThread: 'T4' is done with enter-delay-exit in stress loop. SMW_WorkerThread: 'T3' is done with enter-exit in stress loop. SMW_WorkerThread: 'T2' is done with enter-notify-exit in stress loop. SMW_WorkerThread: 'T2' completed 2115 loops. SMW_WorkerThread: 'T2' is done. SMW_WorkerThread: 'T3' completed 1848 loops. SMW_WorkerThread: 'T3' is done. SMW_WorkerThread: 'T4' completed 120 loops. SMW_WorkerThread: 'T4' is done. SMW_WorkerThread: 'T1' is done with wait() in stress loop. SMW_WorkerThread: 'T1' completed 2 loops. SMW_WorkerThread: 'T1' is done. Bye from StressMonitorWait! status=0 In the above MacOS X run, T1 only completed 2 loops in 60 seconds. Also reproduced the hang using the following bits on svc6: java version "1.6.0_71" Java(TM) SE Runtime Environment (build 1.6.0_71-b12) Java HotSpot(TM) Server VM (build 20.71-b01, mixed mode) Java HotSpot(TM) Server VM (20.71-b01) for solaris-x86 JRE (1.6.0_71-b12), built on Dec 13 2013 11:34:12 by "" with Workshop 5.8 Update: forgot to mention that the 1.6.0_71 test run took a few more runs to get the hang. Like MacOS X 10.7.5, it looks like the thread scheduling is not as "fair": java version "1.6.0_71" Java(TM) SE Runtime Environment (build 1.6.0_71-b12) Java HotSpot(TM) Server VM (build 20.71-b01, mixed mode) Java HotSpot(TM) Server VM (20.71-b01) for solaris-x86 JRE (1.6.0_71-b12), built on Dec 13 2013 11:34:12 by "" with Workshop 5.8 /java/re/jdk/1.6.0_71/promoted/latest/binaries/solaris-amd64/bin/java -showversion -agentlib:StressMonitorWait StressMonitorWait JVMTI_EVENT_VM_INIT: event posted. java version "1.6.0_71" Java(TM) SE Runtime Environment (build 1.6.0_71-b12) Java HotSpot(TM) Server VM (build 20.71-b01, mixed mode) Hello from StressMonitorWait! StressMonitorWait: stress testing for 60 seconds. StressMonitorWait: starting T1 SMW_WorkerThread: 'T1' is running. SMW_WorkerThread: 'T1' has monitor. SMW_WorkerThread: 'T1' is about to wait() in stress loop. StressMonitorWait: starting T2 StressMonitorWait: starting T3 SMW_WorkerThread: 'T2' is running. SMW_WorkerThread: 'T3' is running. StressMonitorWait: starting T4 SMW_WorkerThread: 'T3' is about to enter-exit in stress loop. SMW_WorkerThread: 'T2' is about to enter-notify-exit in stress loop. SMW_WorkerThread: 'T4' is running. SMW_WorkerThread: 'T4' is about to enter-delay-exit in stress loop. StressMonitorWait: done with stress testing. SMW_WorkerThread: 'T4' is done with enter-delay-exit in stress loop. SMW_WorkerThread: 'T3' is done with enter-exit in stress loop. SMW_WorkerThread: 'T4' completed 118 loops. SMW_WorkerThread: 'T4' is done. SMW_WorkerThread: 'T2' is done with enter-notify-exit in stress loop. SMW_WorkerThread: 'T2' completed 20584 loops. SMW_WorkerThread: 'T2' is done. SMW_WorkerThread: 'T3' completed 2178 loops. SMW_WorkerThread: 'T3' is done. SMW_WorkerThread: 'T1' is done with wait() in stress loop. SMW_WorkerThread: 'T1' completed 2 loops. SMW_WorkerThread: 'T1' is done. Bye from StressMonitorWait! status=0 In the above 1.6.0_71 run, T1 only completed 2 loops in 60 seconds. Also reproduced the hang on my DevOps Linux box: $ uname -a Linux slc06fdn 2.6.39-400.21.1.el6uek.x86_64 #1 SMP Thu Apr 4 03:49:00 PDT 2013 x86_64 x86_64 x86_64 GNU/Linux java version "1.8.0" Java(TM) SE Runtime Environment (build 1.8.0-b127) Java HotSpot(TM) 64-Bit Server VM (build 25.0-b69, mixed mode) Java HotSpot(TM) 64-Bit Server VM (25.0-b69) for linux-amd64 JRE (1.8.0-b127), built on Jan 29 2014 20:15:38 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8) Also reproduced the hang on a Ghostbox Solaris SPARC machine: $ uname -a SunOS terminus 5.11 11.0 sun4u sparc SUNW,SPARC-Enterprise java version "1.8.0" Java(TM) SE Runtime Environment (build 1.8.0-b127) Java HotSpot(TM) 64-Bit Server VM (build 25.0-b69, mixed mode) Java HotSpot(TM) 64-Bit Server VM (25.0-b69) for solaris-sparc JRE (1.8.0-b127), built on Jan 29 2014 21:52:06 by "" with Sun Studio 12u1 Also reproduced the hang using Win7-64/Cygwin under a VMware client: $ uname -a CYGWIN_NT-6.1-WOW64 DDAUGHER-PC 1.7.17(0.262/5/3) 2012-10-19 14:39 i686 Cygwin java version "1.8.0" Java(TM) SE Runtime Environment (build 1.8.0-b127) Java HotSpot(TM) Client VM (build 25.0-b69, mixed mode) Java HotSpot(TM) Client VM (25.0-b69) for windows-x86 JRE (1.8.0-b127), built on Jan 29 2014 21:57:10 by "java_re" with MS VC++ 10.0 (VS2010) The hang does not reproduce with these bits on svc6: java version "1.5.0_61" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_61-b11) Java HotSpot(TM) Server VM (build 1.5.0_61-b11, mixed mode) Java HotSpot(TM) Server VM (1.5.0_61-b11) for solaris-x86, built on Dec 16 2013 12:58:21 by unknown with unknown Workshop:0x550 /java/re/jdk/1.5.0_61/promoted/latest/binaries/solaris-amd64/bin/java -showversion -agentlib:StressMonitorWait StressMonitorWait 43200 JVMTI_EVENT_VM_INIT: event posted. java version "1.5.0_61" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_61-b11) Java HotSpot(TM) Server VM (build 1.5.0_61-b11, mixed mode) Hello from StressMonitorWait! StressMonitorWait: stress testing for 43200 seconds. StressMonitorWait: starting T1 SMW_WorkerThread: 'T1' is running. SMW_WorkerThread: 'T1' has monitor. SMW_WorkerThread: 'T1' is about to wait() in stress loop. StressMonitorWait: starting T2 StressMonitorWait: starting T3 SMW_WorkerThread: 'T2' is running. SMW_WorkerThread: 'T2' is about to enter-notify-exit in stress loop. StressMonitorWait: starting T4 SMW_WorkerThread: 'T3' is running. SMW_WorkerThread: 'T3' is about to enter-exit in stress loop. SMW_WorkerThread: 'T4' is running. SMW_WorkerThread: 'T4' is about to enter-delay-exit in stress loop. StressMonitorWait: done with stress testing. SMW_WorkerThread: 'T2' is done with enter-notify-exit in stress loop. SMW_WorkerThread: 'T3' is done with enter-exit in stress loop. SMW_WorkerThread: 'T3' completed 333 loops. SMW_WorkerThread: 'T3' is done. SMW_WorkerThread: 'T2' completed 747 loops. SMW_WorkerThread: 'T2' is done. SMW_WorkerThread: 'T4' is done with enter-delay-exit in stress loop. SMW_WorkerThread: 'T4' completed 84468 loops. SMW_WorkerThread: 'T4' is done. SMW_WorkerThread: 'T1' is done with wait() in stress loop. SMW_WorkerThread: 'T1' completed 6 loops. SMW_WorkerThread: 'T1' is done. Bye from StressMonitorWait! status=0
31-01-2014

I'm working on a standalone reproducer without any code flow debug hooks. So far the results are promising. Thread T1 is trying to reenter the monitor after a monitor.wait() call: THREAD t@47 t@47(l@47) stopped in ___lwp_cond_wait at 0xfffffd7fff2735ea 0xfffffd7fff2735ea: ___lwp_cond_wait+0x000a: jae ___lwp_cond_wait+0x18 [ 0xfffffd7fff2735f8, .+0xe ] current thread: t@47 [1] ___lwp_cond_wait(0x9ca050, 0x9ca038, 0x0, 0x0, 0xfffffd7fff258720, 0x9ca00 0), at 0xfffffd7fff2735ea [2] __lwp_cond_wait(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff25873c =>[3] os::Solaris::cond_wait(cv = 0x9ca050, mx = 0x9ca038), line 254 in "os_sola ris.hpp" [4] os::PlatformEvent::park(this = 0x9ca000), line 6087 in "os_solaris.cpp" [5] ObjectMonitor::ReenterI(this = 0x72aa48, Self = 0x9ca800, SelfNode = 0xfff ffd7fc0231228), line 761 in "objectMonitor.cpp" [6] ObjectMonitor::wait(this = 0x72aa48, millis = 0, interruptible = true, __t he_thread__ = 0x9ca800), line 1615 in "objectMonitor.cpp" [7] ObjectSynchronizer::wait(obj = CLASS, millis = 0, __the_thread__ = 0x9ca80 0), line 387 in "synchronizer.cpp" [8] JVM_MonitorWait(env = 0x9caa20, handle = 0xfffffd7fc02314f8, ms = 0), line 522 in "jvm.cpp" [9] 0xfffffd7fed62a07d(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fed62a07d [10] 0xfffffd7fed62a07d(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fed62a07d [11] 0xfffffd7fed607520(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fed607520 [12] 0xfffffd7fed607520(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fed607520 [13] 0xfffffd7fed600671(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fed600671 [14] JavaCalls::call_helper(result = 0xfffffd7fc0231bd8, m = 0xfffffd7fc02319c 0, args = 0xfffffd7fc0231ae8, __the_thread__ = 0x9ca800), line 393 in "javaCalls .cpp" [15] os::os_exception_wrapper(f = 0xfffffd7ffde63ed0 = &JavaCalls::call_helper (JavaValue*,methodHandle*,JavaCallArguments*,Thread*), value = 0xfffffd7fc0231bd 8, method = 0xfffffd7fc02319c0, args = 0xfffffd7fc0231ae8, thread = 0x9ca800), l ine 4395 in "os_solaris.cpp" [16] JavaCalls::call(result = 0xfffffd7fc0231bd8, method = CLASS, args = 0xfff ffd7fc0231ae8, __the_thread__ = 0x9ca800), line 307 in "javaCalls.cpp" [17] JavaCalls::call_virtual(result = 0xfffffd7fc0231bd8, spec_klass = CLASS, name = 0x4b8658, signature = 0x4b9e18, args = 0xfffffd7fc0231ae8, __the_thread__ = 0x9ca800), line 204 in "javaCalls.cpp" [18] JavaCalls::call_virtual(result = 0xfffffd7fc0231bd8, receiver = CLASS, sp ec_klass = CLASS, name = 0x4b8658, signature = 0x4b9e18, __the_thread__ = 0x9ca8 00), line 210 in "javaCalls.cpp" [19] thread_entry(thread = 0x9ca800, __the_thread__ = 0x9ca800), line 2807 in "jvm.cpp" [20] JavaThread::thread_main_inner(this = 0x9ca800), line 1699 in "thread.cpp" [21] JavaThread::run(this = 0x9ca800), line 1679 in "thread.cpp" [22] java_start(thread_addr = 0x9ca800), line 979 in "os_solaris.cpp" [23] _thr_setup(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff26d9db [24] _lwp_start(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff26dc10 Current function is os::Solaris::cond_wait 254 static int cond_wait(cond_t *cv, mutex_t *mx) { return _cond_wait(cv, mx); } It is blocked in the park() call in ReenterI(). (dbx) frame 5 Current function is ObjectMonitor::ReenterI 761 Self->_ParkEvent->park () ; (dbx) print Self Self = 0x9ca800 (dbx) print *this *this = { _header = 0x1 _object = 0xfffffd7e6a2b6ff0 SharingPad = (-7.4786357953084e+240) _owner = (nil) _previous_owner_tid = 0 _recursions = 0 OwnerIsThread = 1 _cxq = 0xfffffd7ffc99e350 _EntryList = 0xfffffd7fc0231228 _succ = 0x9ca800 _Responsible = (nil) _PromptDrain = -235802127 _Spinner = -235802127 _SpinFreq = 0 _SpinClock = 0 _SpinDuration = 5000 _SpinState = -1012762419733073423 _count = 3 _waiters = 1 _WaitSet = (nil) _WaitSetLock = 0 _QMix = -235802127 FreeNext = (nil) StatA = -1012762419733073423 StatsB = -1012762419733073423 SpinCallbackFunction = (nil) SpinCallbackArgument = 0 _sync_ContendedLockAttempts = 0x423c98 _sync_FutileWakeups = 0x423d78 _sync_Parks = 0x423e58 _sync_EmptyNotifications = 0x423ed8 _sync_Notifications = 0x423fb8 _sync_SlowEnter = 0x424098 _sync_SlowExit = 0x424118 _sync_SlowNotify = 0x424198 _sync_SlowNotifyAll = 0x424218 _sync_FailedSpins = (nil) _sync_SuccessfulSpins = 0x4243d8 _sync_PrivateA = 0x4244b8 _sync_PrivateB = 0x424538 _sync_MonInCirculation = 0x4245b8 _sync_MonScavenged = 0x424698 _sync_Inflations = 0x4237c8 _sync_Deflations = 0x423c18 _sync_MonExtant = 0x424778 Knob_Verbose = 0 Knob_SpinLimit = 5000 } The monitor has no owner: _owner = (nil) and the current thread is the successor: (dbx) print Self Self = 0x9ca800 _succ = 0x9ca800 So far the hang reproduces with: java version "1.8.0" Java(TM) SE Runtime Environment (build 1.8.0-b126) Java HotSpot(TM) 64-Bit Server VM (build 25.0-b67, mixed mode) and with the current RT_Baseline dropped into the above JDK: java version "1.8.0" Java(TM) SE Runtime Environment (build 1.8.0-b126) Java HotSpot(TM) 64-Bit Server VM (build 25.0-b62-bh_hsx_rt_latest_base_dcubed-p roduct-debug, mixed mode) Java HotSpot(TM) 64-Bit Server VM (25.0-b62-bh_hsx_rt_latest_base_dcubed-product -debug) for solaris-amd64 JRE (1.8.0), built on Jan 29 2014 16:18:37 by "dcubed" with Sun Studio 12u1 I have the new test running in a 12 hour loop _without_ a JVM/TI agent using these bits: java version "1.8.0" Java(TM) SE Runtime Environment (build 1.8.0-b126) Java HotSpot(TM) 64-Bit Server VM (build 25.0-b62-bh_hsx_rt_latest_base_dcubed-p roduct, mixed mode) Java HotSpot(TM) 64-Bit Server VM (25.0-b62-bh_hsx_rt_latest_base_dcubed-product ) for solaris-amd64 JRE (1.8.0), built on Jan 29 2014 16:17:37 by "dcubed" with Sun Studio 12u1 I do not expect the non-JVM/TI agent version to hang at all. Update: the non-JVM/TI agent version of the test didn't hang in 12 hours: $ cat doit-no-jvmti.log java version "1.8.0" Java(TM) SE Runtime Environment (build 1.8.0-b126) Java HotSpot(TM) 64-Bit Server VM (build 25.0-b62-bh_hsx_rt_latest_base_dcubed-product, mixed mode) Java HotSpot(TM) 64-Bit Server VM (25.0-b62-bh_hsx_rt_latest_base_dcubed-product) for solaris-amd64 JRE (1.8.0), built on Jan 29 2014 16:17:37 by "dcubed" with Sun Studio 12u1 /work/local/jdk/1.8.0/solaris-x64/bin/java -showversion -server_bh_hsx_rt_latest_base_dcubed-prod StressMonitorWait 43200 java version "1.8.0" Java(TM) SE Runtime Environment (build 1.8.0-b126) Java HotSpot(TM) 64-Bit Server VM (build 25.0-b62-bh_hsx_rt_latest_base_dcubed-product, mixed mode) Hello from StressMonitorWait! StressMonitorWait: stress testing for 43200 seconds. StressMonitorWait: starting T1 SMW_WorkerThread: 'T1' is running. SMW_WorkerThread: 'T1' has monitor. SMW_WorkerThread: 'T1' is about to wait() in stress loop. StressMonitorWait: starting T2 StressMonitorWait: starting T3 StressMonitorWait: starting T4 SMW_WorkerThread: 'T2' is running. SMW_WorkerThread: 'T2' is about to enter-notify-exit in stress loop. SMW_WorkerThread: 'T3' is running. SMW_WorkerThread: 'T3' is about to enter-exit in stress loop. SMW_WorkerThread: 'T4' is running. SMW_WorkerThread: 'T4' is about to enter-delay-exit in stress loop. StressMonitorWait: done with stress testing. SMW_WorkerThread: 'T4' is done with enter-delay-exit in stress loop. SMW_WorkerThread: 'T1' is done with wait() in stress loop. SMW_WorkerThread: 'T3' is done with enter-exit in stress loop. SMW_WorkerThread: 'T1' completed 1666 loops. SMW_WorkerThread: 'T1' is done. SMW_WorkerThread: 'T2' is done with enter-notify-exit in stress loop. SMW_WorkerThread: 'T3' completed 48462 loops. SMW_WorkerThread: 'T2' completed 33383 loops. SMW_WorkerThread: 'T4' completed 84706 loops. SMW_WorkerThread: 'T3' is done. SMW_WorkerThread: 'T4' is done. SMW_WorkerThread: 'T2' is done. Bye from StressMonitorWait! status=0
30-01-2014

RE: JDK-8015729 I have to trust that JDK-8015729 was closed as a duplicate of JDK-8014910 in good faith and for good reason. When I read the analysis back then, it made sense to me, but maybe I missed something. In any case, the disposition of JDK-8015729 does not affect the bug that I'm hunting here. > // KEY-EVENT-4b: T1 bails from TrySpin without touching _succ. > > If we see the owner as offproc then it seems we would follow: > > if (Knob_OState && NotRunnable (Self, ox)) { > TEVENT (Spin abort - notrunnable); > goto Abort ; > } The above code is from lines 2166-2169. Those are not the lines associated with KEY-EVENT-4b. Here are those lines: 2023 if (Knob_OState && NotRunnable (Self, (Thread *) _owner)) { 2024 TEVENT (Spin abort - notrunnable [TOP]); 2025 return 0 ; 2026 } This is mentioned in the "Successor Protocol for Contended Java Monitor Combo Operations" section of successor.notes near the end of the table for that section: : | : | : | | T1 | | | | T1-opt: line 2023-2025 in TrySpin checks : | : | : | | | | | | for a NotRunnable(_own) and returns without : | : | : | | | | | | setting _succ=NULL but those three small lines are buried near the end of a 16+ page document. I don't blame you if you missed it or if fell asleep before getting that far. :-)
30-01-2014

> The failure of MonitorEvents002 that is described by JDK-8015729 is due to the > regression that was fixed by JDK-8014910. I can not conclude that was indeed the case. Although the hang disappeared with that change I can not see evidence that everything was explained as a simple deadlock. Further if that issue was indeed a simple deadlock involving the init_lock then what on has that go to do with this issue in the objectMonitor code ??? > See my note above dated 2014-01-27 16:56 for the exact sequence of events > necessary to reproduce this particular hang. I can't quite see the path that leads to this: // KEY-EVENT-4b: T1 bails from TrySpin without touching _succ. If we see the owner as offproc then it seems we would follow: if (Knob_OState && NotRunnable (Self, ox)) { TEVENT (Spin abort - notrunnable); goto Abort ; } where abort is: Abort: if (MaxSpin >= 0) Adjust (&_Spinner, -1) ; if (sss && _succ == Self) { _succ = NULL ; // Invariant: after setting succ=null a contending thread // must recheck-retry _owner before parking. This usually happens // in the normal usage of TrySpin(), but it's safest // to make TrySpin() as foolproof as possible. OrderAccess::fence() ; if (TryLock(Self) > 0) return 1 ; } return 0 ; Given sss is 1 this would seem to clear _succ
30-01-2014

The failure of MonitorEvents002 that is described by JDK-8015729 is due to the regression that was fixed by JDK-8014910. See my note above dated 2014-01-27 16:56 for the exact sequence of events necessary to reproduce this particular hang.
30-01-2014

It seems quite obvious that IF between waking up from the wait, and re-parking for the monitor in ReenterI, we call into JVM TI code that blocks on a RawMonitor then the unpark directed at _succ can be consumed by the RawMonitor code, and hence that might lead to a later park call blocking forever. However ReenterI checks the state of the lock so if it had been released we would not park and the consumed unpark would not be relevant. Secondly I can't locate anything in the description of 8015729 that indicates there is any JVM TI code that would block in response to any monitor related events. What I do see in 8015729 is the use of the debugger so I would expect this to tie in with blocking related to debugger suspend requests.
30-01-2014

Added doit-jvmg.log and threads.list.jvmg from previous comment.
30-01-2014

Re: adding another event object As Dave Dice stated in other email there are not big assumptions of there only being one event per thread as there are three distinct events currently used for three different APIs (sleep, object monitors, j.u.c park/unpark). As RawMonitors are a fourth API they really deserve a distinct event too.
29-01-2014

More on JVMTI_EVENT_MONITOR_CONTENDED_ENTERED... To reiterate, the MONITOR_CONTENDED_ENTERED event is posted by a thread that owns the monitor. Since the calling thread already owns the monitor, it doesn't matter if another thread tries to unpark() it as part of the ExitEpilog() dance. The thread that owns the monitor isn't going to call park() so there is no risk of losing an unpark() to the MONITOR_CONTENDED_ENTERED event handler.
29-01-2014

More on JVMTI_EVENT_MONITOR_CONTENDED_ENTER... To reiterate, the MONITOR_CONTENDED_ENTER event is posted by a thread that does not own the monitor. And the event is posted before the ObjectMonitor::EnterI() function is called so the contending thread hasn't been added to any of the queues used to decide which thread to make the successor. The optimized successor protocol issues an unpark() in ExitEpilog() to one of the threads picked from one of the internal queues. That unpark() call is made by the thread that previously owned the monitor. However, since the contending thread that is posting the MONITOR_CONTENDED_ENTER event is not yet on one of the internal queues, the unpark() cannot be consumed by the MONITOR_CONTENDED_ENTER event handler.
29-01-2014

More on JVMTI_EVENT_MONITOR_WAIT... This event is the only one of the four monitor related JVM/TI events to not be posted by code in src/share/vm/runtime/objectMonitor.cpp. That makes it fairly easy to forget about when analyzing ObjectMonitor::wait()... The event is posted from JVM_MonitorWait() in src/share/vm/prims/jvm.cpp and has to go through the ObjectSynchronizer layer before it gets down to ObjectMonitor::wait(). To reiterate, the MONITOR_WAIT event is posted by a thread that owns the monitor. And it is not possible for a properly constructed program to call monitor.notify()/monitor.notifyAll()/unpark() on the monitor while the owning thread is posting the MONITOR_WAIT event. In a properly constructed program: Thread T1 -------------- someCondition = false; synchronized (monitor) { while (!someCondition) { monitor.wait(); } // do the work associated with someCondition becoming true } Thread T2 -------------- // queue up some work that needs to be handled by T1 synchronized (monitor) { someCondition = true; monitor.notify(); // let T1 know the work is there } In Java, neither monitor.notify() or monitor.notifyAll() has any meaning if there is no other thread or threads in monitor.wait(). So for the above fragment, T1 has to enter the monitor first for T2's monitor.wait() call to have any meaning. Since T1 owns the monitor and is posting the MONITOR_WAIT event on the way to exiting that monitor, there is no way for another Java thread to correctly jump in with a monitor.notify() or monitor.notifyAll() and make the resulting unpark() call. That means there is no way for the MONITOR_WAIT event handler to consume an unpark() that was meant to bring T1 out of its wait() call. This analysis rules out the possibility that this code block: src/share/vm/runtime/objectMonitor.cpp: 1502 // As soon as the ObjectMonitor's ownership is dropped in the exit() 1503 // call above, another thread can enter() the ObjectMonitor, do the 1504 // notify(), and exit() the ObjectMonitor. If the other thread's 1505 // exit() call chooses this thread as the successor and the unpark() 1506 // call happens to occur while this thread is posting a 1507 // MONITOR_CONTENDED_EXIT event, then we run the risk of the event 1508 // handler using RawMonitors and consuming the unpark(). 1509 // 1510 // To avoid the problem, we re-post the event. This does no harm 1511 // even if the original unpark() was not consumed because we are the 1512 // chosen successor for this monitor. 1513 if (node._notified != 0 && _succ == Self) { 1514 node._event->unpark(); 1515 } is actually meant to prevent the MONITOR_WAIT event handler from accidentally consuming the unpark(). JVMTI_EVENT_MONITOR_WAITED is already analyzed in the previous couple of notes so I (probably) won't be adding a note for that event.
28-01-2014

Re: adding another event object I think there are some big assumptions in the system that a thread only has one event structure associated with it. It would be very risky to evolve away from that assumption. Re: event handling code There are four JVM/TI monitor events: JVMTI_EVENT_MONITOR_WAIT = 73, JVMTI_EVENT_MONITOR_WAITED = 74, JVMTI_EVENT_MONITOR_CONTENDED_ENTER = 75, JVMTI_EVENT_MONITOR_CONTENDED_ENTERED = 76, MONITOR_CONTENDED_ENTER is posted by thread that is trying to enter a monitor and finds that monitor to be contended; the thread posting the event does not own the monitor. MONITOR_CONTENDED_ENTERED is posted by a thread that has entered a contended monitor; the thread posting the event owns the monitor. MONITOR_WAIT is posted by a thread that owns a monitor and is about to call monitor.wait(); when the thread posts the event it still owns the monitor. MONITOR_WAITED is posted by a thread that previously called monitor.wait() and has been notified via monitor.notify() or monitor.notifyAll(); when the thread posts the event it does not own the monitor.
28-01-2014

Could there be multiple places in event handling code that could in theory consume unparks and so require the re-issue of an unpark() from different locations in the code? Seems to me that perhaps raw monitors - given they can be entered from within the normal monitor code - should have their own _event object per thread, so that this accidental consumption of unparks can not occur.
28-01-2014

I do think that this code: src/share/vm/runtime/objectMonitor.cpp: 1502 // As soon as the ObjectMonitor's ownership is dropped in the exit() 1503 // call above, another thread can enter() the ObjectMonitor, do the 1504 // notify(), and exit() the ObjectMonitor. If the other thread's 1505 // exit() call chooses this thread as the successor and the unpark() 1506 // call happens to occur while this thread is posting a 1507 // MONITOR_CONTENDED_EXIT event, then we run the risk of the event 1508 // handler using RawMonitors and consuming the unpark(). 1509 // 1510 // To avoid the problem, we re-post the event. This does no harm 1511 // even if the original unpark() was not consumed because we are the 1512 // chosen successor for this monitor. 1513 if (node._notified != 0 && _succ == Self) { 1514 node._event->unpark(); 1515 } is the right fix for the problem, but it happens to be in the wrong place. Also, MONITOR_CONTENDED_EXIT is the wrong name for the event. It is JVMTI_EVENT_MONITOR_WAITED.
28-01-2014

The original version of the fix from 5030359 had these lines: 3004 JvmtiExport::post_monitor_waited(jt, this, (ret == OS_TIMEOUT) ? true : false); 3005 if (_succ == Self) _succ = NULL ; This reset of _succ to NULL reduces, but does not completely fix the race that is happening in the successor protocol. The setting of '_succ = NULL' was removed as part of this delta: $ sp -r1.117 src/share/vm/runtime/synchronizer.cpp src/share/vm/runtime/SCCS/s.synchronizer.cpp: D 1.117 07/09/29 02:14:43 ss45998 234 233 00039/00024/04683 MRs: COMMENTS: 6562569: REGRESSION: can not see BLOCKED state change after patching 5104215 diff. The '_succ = NULL' isn't complete fix for the potential hang that is coming because another thread could simply made the event posting thread the 'successor' thread again.
28-01-2014

The particular hang that I've run into is supposed to already be solved by the following code: src/share/vm/runtime/objectMonitor.cpp: 1502 // As soon as the ObjectMonitor's ownership is dropped in the exit() 1503 // call above, another thread can enter() the ObjectMonitor, do the 1504 // notify(), and exit() the ObjectMonitor. If the other thread's 1505 // exit() call chooses this thread as the successor and the unpark() 1506 // call happens to occur while this thread is posting a 1507 // MONITOR_CONTENDED_EXIT event, then we run the risk of the event 1508 // handler using RawMonitors and consuming the unpark(). 1509 // 1510 // To avoid the problem, we re-post the event. This does no harm 1511 // even if the original unpark() was not consumed because we are the 1512 // chosen successor for this monitor. 1513 if (node._notified != 0 && _succ == Self) { 1514 node._event->unpark(); 1515 } $ hg annot src/share/vm/runtime/objectMonitor.cpp | grep 'MONITOR_CONTENDED_EXIT event' 1787: // MONITOR_CONTENDED_EXIT event, then we run the risk of the event $ hg log -r 1787 src/share/vm/runtime/objectMonitor.cpp changeset: 1787:fa83ab460c54 parent: 1785:75ab0162aa84 user: acorn date: Fri Oct 22 15:59:34 2010 -0400 summary: 6988353: refactor contended sync subsystem The src/share/vm/runtime/objectMonitor.cpp did not exist in changeset 1785 so we have to switch files: $ hg annot -r 1785 src/share/vm/runtime/synchronizer.cpp | grep 'MONITOR_CONTENDED_EXIT event' 0: // MONITOR_CONTENDED_EXIT event, then we run the risk of the event Changeset 0 indicates that we inherited this line from the TeamWare repo in HSX-12: src/share/vm/runtime/synchronizer.cpp: 3921 // As soon as the ObjectMonitor's ownership is dropped in the exit() 3922 // call above, another thread can enter() the ObjectMonitor, do the 3923 // notify(), and exit() the ObjectMonitor. If the other thread's 3924 // exit() call chooses this thread as the successor and the unpark() 3925 // call happens to occur while this thread is posting a 3926 // MONITOR_CONTENDED_EXIT event, then we run the risk of the event 3927 // handler using RawMonitors and consuming the unpark(). 3928 // 3929 // To avoid the problem, we re-post the event. This does no harm 3930 // even if the original unpark() was not consumed because we are the 3931 // chosen successor for this monitor. 3932 if (node._notified != 0 && _succ == Self) { 3933 node._event->unpark(); 3934 } and here's the JVM/TI event posting code and re-enter code for that version: 4007 // Reentry phase -- reacquire the monitor. 4008 // re-enter contended monitor after object.wait(). 4009 // retain OBJECT_WAIT state until re-enter successfully completes 4010 // Thread state is thread_in_vm and oop access is again safe, 4011 // although the raw address of the object may have changed. 4012 // (Don't cache naked oops over safepoints, of course). 4013 4014 // post monitor waited event. Note that this is past-tense, we are done waiting. 4015 if (JvmtiExport::should_post_monitor_waited()) { 4016 JvmtiExport::post_monitor_waited(jt, this, ret == OS_TIMEOUT); 4017 } 4018 OrderAccess::fence() ; 4019 4020 assert (Self->_Stalled != 0, "invariant") ; 4021 Self->_Stalled = 0 ; 4022 4023 assert (_owner != Self, "invariant") ; 4024 ObjectWaiter::TStates v = node.TState ; 4025 if (v == ObjectWaiter::TS_RUN) { 4026 enter (Self) ; 4027 } else { 4028 guarantee (v == ObjectWaiter::TS_ENTER || v == ObjectWaiter::TS _CXQ, "invariant") ; 4029 ReenterI (Self, &node) ; 4030 node.wait_reenter_end(this); 4031 } TeamWare attributes line 3926 to this delta: D 1.71 05/08/17 13:46:10 dice 151 145 03440/00406/00689 MRs: COMMENTS: Accepted child'\''s version in workspace "/net/saros/uu/160sas". so you have to manually track it back to real delta: $ sgv -r1.69.1.6 src/share/vm/runtime/synchronizer.cpp | grep 'MONITOR_CONTENDED_EXIT event' 1.69.1.1 // MONITOR_CONTENDED_EXIT event, then we run the risk of the event $ sp -r1.69.1.1 src/share/vm/runtime/synchronizer.cpp src/share/vm/runtime/SCCS/s.synchronizer.cpp: D 1.69.1.1 05/08/08 03:00:53 dice 144 143 03588/00396/00422 MRs: COMMENTS: Accepted child'\''s version in workspace "/net/saros/uu/160sas". And still not quite there so we track back more: $ sgv -r1.67.1.10 src/share/vm/runtime/synchronizer.cpp | grep 'MONITOR_CONTENDED_EXIT event' 1.63.1.1 // MONITOR_CONTENDED_EXIT event, then we run the risk of the event $ sp -r1.63.1.1 src/share/vm/runtime/synchronizer.cpp src/share/vm/runtime/SCCS/s.synchronizer.cpp: D 1.63.1.1 05/07/04 03:26:33 dice 131 124 03324/00368/00362 MRs: COMMENTS: 5030359 -- back-end synchonization improvements - adaptive spinning, etc So it looks like this fix was made by JDK-5030359. Here are the relevant line numbers from that version: 2913 // As soon as the ObjectMonitor's ownership is dropped in the exit() 2914 // call above, another thread can enter() the ObjectMonitor, do the 2915 // notify(), and exit() the ObjectMonitor. If the other thread's 2916 // exit() call chooses this thread as the successor and the unpark() 2917 // call happens to occur while this thread is posting a 2918 // MONITOR_CONTENDED_EXIT event, then we run the risk of the event 2919 // handler using RawMonitors and consuming the unpark(). 2920 // 2921 // To avoid the problem, we re-post the event. This does no harm 2922 // even if the original unpark() was not consumed because we are the 2923 // chosen successor for this monitor. 2924 if (node._notified != 0 && _succ == Self) { 2925 node._event->unpark(); 2926 } and here's the JVM/TI event posting code and the reenter code in that version in that version: 2995 // Reentry phase -- reacquire the monitor. 2996 // re-enter contended monitor after object.wait(). 2997 // retain OBJECT_WAIT state until re-enter successfully completes 2998 // Thread state is thread_in_vm and oop access is again safe, 2999 // although the raw address of the object may have changed. 3000 // (Don't cache naked oops over safepoints, of course). 3001 3002 // post monitor waited event. Note that this is past-tense, we are done waiting. 3003 if (JvmtiExport::should_post_monitor_waited()) { 3004 JvmtiExport::post_monitor_waited(jt, this, (ret == OS_TIMEOUT) ? true : false); 3005 if (_succ == Self) _succ = NULL ; 3006 } 3007 3008 assert (Self->_Stalled != 0, "invariant") ; 3009 Self->_Stalled = 0 ; 3010 3011 assert (_owner != Self, "invariant") ; 3012 ObjectWaiter::TStates v = node.TState ; 3013 if (v == ObjectWaiter::TS_RUN) { 3014 enter (Self) ; 3015 } else { 3016 guarantee (v == ObjectWaiter::TS_ENTER || v == ObjectWaiter::TS _CXQ, "invariant") ; 3017 ReenterI (Self, &node) ; 3018 node.TState = ObjectWaiter::TS_RUN ; 3019 }
28-01-2014

I've created a scenario that reproduces this hang: T1 - enters monitor and calls monitor.wait() T2 - enters the monitor, calls monitor.notify() and exits the monitor T3 - enters and exits the monitor T4 - enters the monitor, delays for 5 seconds, exits the monitor a JVM/TI agent that enables JVMTI_EVENT_MONITOR_WAITED and as a handler that: enters a raw monitor, waits for 1ms, exits a raw monitor. Here are the six events necessary to make this hang happen: // KEY-EVENT-1a: After being unparked(), T1 has cleared the _succ field, but // KEY-EVENT-1b: T3 is exiting the monitor and makes T1 the successor again. // KEY-EVENT-2a: The unpark() done by T3 when it made T1 the successor // KEY-EVENT-2b: is consumed by the JVM/TI event handler. // KEY-EVENT-3a: T3 made T1 the successor // KEY-EVENT-3b: but before T1 could reenter the monitor T4 grabbed it. // KEY-EVENT-4a: T1's TrySpin() call sees T4 as NotRunnable so // KEY-EVENT-4b: T1 bails from TrySpin without touching _succ. // KEY-EVENT-5a: T4 sees that T1 is still the successor so // KEY-EVENT-5b: T4 takes the quick exit path (no ExitEpilog) // KEY-EVENT-6a: T1 is about to park and it is the successor, but // KEY-EVENT-6b: T3's unpark has been eaten by the JVM/TI event handler // KEY-EVENT-6c: and T4 took the quick exit path. T1 is about to be stuck. I created a new debug kit for controlling program flow between T1-T4 when they are using the target Java monitor. I haven't ironed out all the details, but these debug.hpp additions give a glimpse: $ hg diff src/share/vm/utilities/debug.hpp diff -r 4c241e42a3e4 src/share/vm/utilities/debug.hpp --- a/src/share/vm/utilities/debug.hpp Mon Jan 27 10:57:52 2014 +0100 +++ b/src/share/vm/utilities/debug.hpp Mon Jan 27 16:55:01 2014 -0700 @@ -264,4 +264,15 @@ void pd_ps(frame f); void pd_ps(frame f); void pd_obfuscate_location(char *buf, size_t buflen); +bool debug_sync_check_match(JavaThread * jt, const char * t_name, void * obj); +void debug_sync_dec(JavaThread * jt, const char * t_name, void * obj); +int debug_sync_get(JavaThread * jt, const char * t_name, void * obj); +void debug_sync_inc(JavaThread * jt, const char * t_name, void * obj); +void debug_sync_pause(JavaThread * jt, const char * t_name, void * obj, + int target_val); +void debug_sync_set(JavaThread * jt, const char * t_name, void * obj, + int new_val); +void * debug_sync_get_obj(); +void debug_sync_set_obj(jobject jobj); + #endif // SHARE_VM_UTILITIES_DEBUG_HPP
27-01-2014

Attached the latest version of my successor.notes file.
27-01-2014

Update successor.notes attachment with the following changes: changeset: 15:610efdb57a33 tag: tip user: dcubed date: Mon Dec 16 20:47:15 2013 -0700 summary: Add notes about JVM/TI events. Clarify calls from enter(line 398) to EnterI() which sets _cxq. changeset: 14:93f5abf774ee user: dcubed date: Mon Dec 16 16:04:31 2013 -0700 summary: Fix typo. Remove extraneous semi-colons. Redo doStress timeout checks so that all threads are at a well known point for a clean shutdown.
17-12-2013

David, I confirm, you are correct in your first statement. The MonitorEvents002 test hangs only in the condition when there is no token and current thread is a successor: (_succ == Self && Self->_ParkEvent->fired() == 0) Still need to check the same for the stand-alone test without the debugger involved. Also, I was not able to prove the JVMTI callout theory and catch the condition when the number of park tokens is different before and after the JVMTI event posts. It is always the same, even before the monitor re-entering thread hangs on park() forever. Note, all the JVMTI events related to the ObjectMonitor (and their parking tickets) are traced. I'm puzzled what can be missed. Otherwise, the JVMTI callout theory is wrong here.
22-11-2013

Release team: Approved for deferral.
21-11-2013

I've finally finished my first attempt to describe and analyze the successor protocol. Please see the "Successor Protocol for Contended Java Monitor Enter Operations" section in the attached successor.notes file. You'll need a window at least 160 columns wide to see the table.
20-11-2013

Serguei, as Dave pointed out earlier the condition you are finding in your trace does not necessarily reflect a problem if an unpark has already been issued for the _succ thread - you need to see inside the ParkEvent and see if the token is present at this time. If there is no token then we have a bug. Meanwhile we need to verify the JVMTI callout theory - which should be very easy to check: look for a park token before the callout and see if it is still there after.
16-11-2013

To answer Dave and David on their last comments about my tracing. Sorry, guys. I was trying to catch when the _succ == Self in the ReenterI loop just before parking, but forgot to check this condition so that my last log above is useless. However, I've caught it now: MYTRACE: #23: 0x6f3800, lock: 0xfffffd7f4372de40, succ: 0x6f3800, _cxq: 0, state: 4, ReenterI parking As I understand, it should not happen because the re-entering thread must clear the _succ, if _succ == Self. It is still a puzzle to me how this really happens. Please, note, this happens with my new standalone test that does NO debugging. Also, I want to copy the Dave's comment from the email to bug report as the fact is really interesting: "In the past when we���ve encountered progress bugs related to debuggers and event callbacks where some thread T1 would be parked (and not have a pending unpark) but _succ identified T1. This was caused by the debuggers consuming a park ���ticket��� during callbacks. Search for ���MONITOR_CONTENDED_EXIT��� in ObjectMonitor::wait() for the comments. This is has been an ongoing source of trouble. I wouldn���t be surprised if we didn���t get all the cases." This is an interesting fragment from the ObjectMonitor::wait() function that relates to this comment: // As soon as the ObjectMonitor's ownership is dropped in the exit() // call above, another thread can enter() the ObjectMonitor, do the // notify(), and exit() the ObjectMonitor. If the other thread's // exit() call chooses this thread as the successor and the unpark() // call happens to occur while this thread is posting a // MONITOR_CONTENDED_EXIT event, then we run the risk of the event // handler using RawMonitors and consuming the unpark(). // // To avoid the problem, we re-post the event. This does no harm // even if the original unpark() was not consumed because we are the // chosen successor for this monitor. if (node._notified != 0 && _succ == Self) { node._event->unpark(); } The possibility described in the fragment above looks scary. It has a potential to impact a number of jvmti and debugger SQE tests.
16-11-2013

OK. I've re-read my notes, moved the data analysis section that was dropped in the wrong place and verified all the line numbers to match the current version of the code. I've also ripped out the part where I mixed in an e-mail conversation with David H. with my possible identification of a race. If the mis-paste of the data analysis didn't mess up anyone else's brain, then the mixing of scenarios should have done the trick. :-) At this point (late on a Friday :-)), I still think that I've identified a possible race that requires interactions between three threads. My next step is to create my usual transaction diagrams to show the order of operations and see if that holds up to scrutiny. David H., sorry for getting things confused and jumbled together when I pasted the original set of notes from my BOS->DEN flight. I should have re-read the notes first. I guess I wasn't as recovered on Sunday as I thought... But hey, I did get _some_ sleep...
15-11-2013

// DDD - These notes are based on the following version of the file: // // hotspot/src/share/vm/runtime/objectMonitor.cpp: // changeset: 4801:f2110083203d // user: sla // date: Mon Jun 10 11:30:51 2013 +0200 // summary: 8005849: JEP 167: Event-Based JVM Tracing // // Normally the exiting thread is responsible for ensuring succession, // but if other successors are ready or other entering threads are spin ning // then this thread can simply store NULL into _owner and exit without // waking a successor. The existence of spinners or ready successors // guarantees proper succession (liveness). Responsibility passes to t he // ready or running successors. The exiting thread delegates the duty. // More precisely, if a successor already exists this thread is absolve d // of the responsibility of waking (unparking) one. // DDD - 8028073 is about the possibility that the successor thread is // parked when this thread exits the monitor so the above might not be // 100% correct. The current working theory is this: // // - A thread holding the lock is about to exit, picks a successor thread // and unparks it: // - see line 1313 for the set of _succ in ExitEpilog // - see lines 1164-1178 for the picking of the successor from the // _EntryList and the ExitEpilog call; // - see lines 1248-1251 for the second pass at picking of the successor // from the _EntryList (after _cxq is drained -> _EntryList) and the // ExitEpilog call; // - there is a third call to ExitEpilog called when Knob_QMode == 2 // which looks at _cxq before _EntryList (line 1087); Knob_QMode is // zero by default // - successor is unparked on line 1330 // // - However, the successor is racing with any newly entering thread that // hasn't parked itself. Here's the successor thread: // - see line 398 for the successor's call to EnterI and line 609 for // where the successor is parked; there's a timed park on line 603 // when SyncFlags has 0x1 set; SyncFlags is zero by default // - lines 593 and 612 are where the successor tries to grab the lock, // but that might fail // - line 643-646 the successor didn't get the lock, realizes that it // is the successor and sets _succ to NULL; fence() is called to // flush the _succ field. // RACE PART1 HERE // // - Here's the newly entering thread: // - the newly entering thread can get lucky and catch the contended // monitor with a NULL owner on lines 322-328 // - When the freshly entering thread exits, it sees that successor is // already set on lines 997-999 and does not unpark the successor // RACE PART2 HERE // // - We no longer have proper succession so have broken the liveness // guarantee. // // NEXT STEP: I need to create transaction diagrams for the possible RACE. // // DDD - Here are some notes about the _succ field; who sets it to NULL // and who sets it to a non-NULL value. // // - the objectMonitor.hpp file defines a succ_offset_in_bytes() function // so that field is probably queried and possibly set by either assembly // code in the interpreter and/or the compilers. // - _succ is set to NULL in objectMonitor.hpp by the constructor and // by the Recycle() function. // - _succ is set to NULL in objectMonitor.cpp in these places: // - before self-suspending in ObjectMonitor::enter() on lines 409-412 // Resolution: OK: self-suspending thread is owner // - in EnterI() after a failed retry on line 643 // Resolution: BUG: see RACE comments above // - in EnterI() after getting the lock on line 664 // Resolution: OK: thread is owner // - before self-suspending in ReenterI() on lines 767-769 // Resolution: BUG: self-suspending thread is not the owner; the // exiting thread might have just made the self-suspending thread the // successor so now we have no successor; // - after the self-suspend check in ReenterI() on line 788; after this // point the ReenterI() thread could be renamed the sucessor again // by the owning thread when it is exiting // Resolution: BUG: looks like this ReenterI() code path has the same // issue as line 643 above // - in RenterI() after getting the lock on line 810 // Resolution: OK: thread is owner // - in ExitEpilog if Knob_SuccEnabled is false on line 1313 // Resolution: OK: thread is owner // - in ObjectMonitor::wait() on line 1585 before we try to reenter // the monitor // Resolution: NOT SURE: We are the successor, but we clear that // because we have all of the enter protocol to do. The thread that // owns the monitor could exit and name us the successor again, but // in the long enter protocol we have several places to clear that. // - in TrySpin_VaryDuration() after getting the lock on line 2121 // Resolution: OK: This set to NULL isn't bad because the spinner is // an active and not a parked thread. // - in TrySpin_VaryDuration() when aborting the spin on line 2191 // Resolution: OK: This set to NULL isn't bad because the spinner is // an active and not a parked thread. // - _succ is set to non-NULL in objectMonitor.cpp in these places: // - in TrySpin_VaryDuration() when caller observes _succ == NULL // it makes itself the successor on line 2047 and line 2170; the // setter doesn't own the lock in either of these locations. Is // it possible for the spinner to race with the lock owner setting // _succ on exit? // Resolution: OK: This potential overwrite isn't bad because the // spinner is an active and not a parked thread. // - in ExitEpilog if Knob_SuccEnabled is true on line 1313 // Resolution: OK: This is the normal/safe place. // // The _succ variable is critical to reducing futile wakeup frequency. // _succ identifies the "heir presumptive" thread that has been made // ready (unparked) but that has not yet run. We need only one such // successor thread to guarantee progress. // See http://www.usenix.org/events/jvm01/full_papers/dice/dice.pdf // section 3.3 "Futile Wakeup Throttling" for details. // // Note that spinners in Enter() also set _succ non-null. // In the current implementation spinners opportunistically set // _succ so that exiting threads might avoid waking a successor. // Another less appealing alternative would be for the exiting thread // to drop the lock and then spin briefly to see if a spinner managed // to acquire the lock. If so, the exiting thread could exit // immediately without waking a successor, otherwise the exiting // thread would need to dequeue and wake a successor. // (Note that we'd need to make the post-drop spin short, but no // shorter than the worst-case round-trip cache-line migration time. // The dropped lock needs to become visible to the spinner, and then // the acquisition of the lock by the spinner must become visible to // the exiting thread). // // It appears that an heir-presumptive (successor) must be made ready. // Only the current lock owner can manipulate the EntryList or // drain _cxq, so we need to reacquire the lock. If we fail // to reacquire the lock the responsibility for ensuring succession // falls to the new owner. //
15-11-2013

I agree with the deferral at this point. So far we do not have a clear picture of how the problem is arising though I suspect Dave Dice's comments regarding the JVMTI issue may well be the key. Because the basic issue is a parked thread not getting unparked we could insert extra unparks at various places to "fix" this - the original suggestion skipped a quick exit and so added the additional unpark that way; Dave has suggested an explicit unpark on the JVMTI path. These are all "safe" in that extra unparks can only improve liveness and do not affect correctness (as park is permitted to return spuriously) but they may not be the right fix at the right place and the extra unpark could potentially impact performance. I think we can readily investigate Dave's theory by examining the park permit before and after the JVMTI event notification (or just additional tracing) to see if the park permit is being lost during the callout. And if this is constrained to being a JVMTI issue, as it appears, then this is not a critical fix and we can readily address it in 8u.
15-11-2013

Just had a long conversation with Dave Dice. He has not been able to reproduce this problem, but by looking at the stack traces and the source code, and understanding of earlier timing holes that he fixed, he suggested that we need to look closely at the logic in which a thread calls out to jvmti just before calling park. There is a concern: T1 in wait, releases the monitor for a contended lock, has been notified, is _succ, is about to park T2: owns lock, notifies, unparks T1 T1: calls out to jvmti, which could run java code and could "eat" the unpark - each thread only gets one ParkEvent for objectMonitor synchronization T1: comes back and calls unpark, from which it never returns Suggestion is that T1 would need to clear _succ (should be self), fence, and unpark itself Given that we need to fully understand the problem, experiment with possible solutions, if this is the likely solution, ensure it is in all needed locations, seriously reanalyze code paths, seriously stress test this - i.e. risk is high AND Given that this problem has been there since 2006, and requires jvmti notifications to happen, and is hard to duplicate - i.e. reproducibility is hard -- we propose deferring this out of JDK8 and experimenting in JDK9
15-11-2013

Dave Dice responds: A thread that���s identified as _succ can call park() as long as it has a pending unpark ���permit��� in the ParkEvent. That���s safe and normal. --- The tracing needs to show whether the park permit exists when unpark is called - or show that the park in fact returned. Also note that in standalone test case _succ != Self either so there is no problem regardless.
15-11-2013

I wrote a standalone test with class initializations with no debugger involved. Was not able to reproduce the hang yet but got the symptom with non-NULL _succ at ReenterI parking: % grep ReenterI ??? | grep parking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393:MYTRACE: #22: 0x8dd800, lock: 0xfffffd7f41b2aca8, succ: 0, _cxq: 0, state: 4, ReenterI parking 393:MYTRACE: #22: 0x8dd800, lock: 0xfffffd7f41b2aca8, succ: 0x8df800, _cxq: 0, state: 4, ReenterI parking <<==== _succ != NULL ====>> 394:MYTRACE: #22: 0x8de000, lock: 0xfffffd7f41b2acc8, succ: 0, _cxq: 0, state: 4, ReenterI parking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 874:MYTRACE: #22: 0x8e6000, lock: 0xfffffd7f41b2c4b8, succ: 0, _cxq: 0, state: 4, ReenterI parking 874:MYTRACE: #22: 0x8e6000, lock: 0xfffffd7f41b2c4b8, succ: 0x8e7800, _cxq: 0, state: 4, ReenterI parking <<==== _succ != NULL ====>> 877:MYTRACE: #23: 0x8df800, lock: 0xfffffd7f41b2aca8, succ: 0, _cxq: 0, state: 4, ReenterI parking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
15-11-2013

Reproduced it again. This is a filtered tracing (added another event: "init before UNLOCK" but not everyware): debugee.stdout> MYTRACE: #32: 0xd7a000, lock: 0xfffffd7deb98af80, init STEP 1 - LOCKED debugee.stdout> MYTRACE: #32: 0xd7a000, lock: 0xfffffd7deb98af80, init before UNLOCK debugee.stdout> MYTRACE: #32: 0xd7a000, lock: 0xfffffd7deb98af80, succ: 0, _cxq: 0, exit shortcut debugee.stdout> MYTRACE: #31: 0xd77800, lock: 0xfffffd7deb98af80, succ: 0, _cxq: 0xfffffd7d9ceecbd0, state: 5, EnterI park 1 debugee.stdout> MYTRACE: #33: 0xd7c000, lock: 0xfffffd7deb98af80, succ: 0, _cxq: 0, EnterI done 2 debugee.stdout> MYTRACE: #32: 0xd7a000, lock: 0xfffffd7deb98af80, succ: 0, _cxq: 0xfffffd7d9cdebaa0, state: 5, EnterI park 1 debugee.stdout> MYTRACE: #33: 0xd7c000, lock: 0xfffffd7deb98af80, init STEP 1 - LOCKED debugee.stdout> MYTRACE: #33: 0xd7c000, lock: 0xfffffd7deb98af80, init STEP 2 - WAITING debugee.stdout> MYTRACE: #33: 0xd7c000, lock: 0xfffffd7deb98af80, succ: 0xd7a000, _cxq: 0, exit shortcut debugee.stdout> MYTRACE: #32: 0xd7a000, lock: 0xfffffd7deb98af80, succ: 0xd7a000, _cxq: 0, state: 4, EnterI unpark debugee.stdout> MYTRACE: #32: 0xd7a000, lock: 0xfffffd7deb98af80, succ: 0, _cxq: 0, state: 2, EnterI done 4 debugee.stdout> MYTRACE: #32: 0xd7a000, lock: 0xfffffd7deb98af80, init STEP 9 - NOTIFY debugee.stdout> MYTRACE: #32: 0xd7a000, lock: 0xfffffd7deb98af80, init STEP 12 - FINISHED debugee.stdout> MYTRACE: #31: 0xd77800, lock: 0xfffffd7deb98af80, succ: 0xd77800, _cxq: 0xfffffd7d9cceac60, state: 4, EnterI unpark debugee.stdout> MYTRACE: #32: 0xd7a000, lock: 0xfffffd7deb98af80, klass_at_impl 1 - LOCKED debugee.stdout> MYTRACE: #32: 0xd7a000, lock: 0xfffffd7deb98af80, succ: 0, _cxq: 0xfffffd7d9cceac60, exit shortcut debugee.stdout> MYTRACE: #31: 0xd77800, lock: 0xfffffd7deb98af80, succ: 0, _cxq: 0xfffffd7d9cceac60, state: 2, EnterI done 4 debugee.stdout> MYTRACE: #31: 0xd77800, lock: 0xfffffd7deb98af80, init STEP 1 - LOCKED debugee.stdout> MYTRACE: #31: 0xd77800, lock: 0xfffffd7deb98af80, init STEP 4 - RETURN debugee.stdout> MYTRACE: #31: 0xd77800, lock: 0xfffffd7deb98af80, klass_at_impl 1 - LOCKED debugee.stdout> MYTRACE: #31: 0xd77800, lock: 0xfffffd7deb98af80, succ: 0xd7c000, _cxq: 0, exit shortcut debugee.stdout> MYTRACE: #31: 0xd77800, lock: 0xfffffd7deb98af80, klass_at_impl 3 - LOCKED debugee.stdout> MYTRACE: #31: 0xd77800, lock: 0xfffffd7deb98af80, succ: 0xd7c000, _cxq: 0, exit shortcut debugee.stdout> MYTRACE: #31: 0xd77800, lock: 0xfffffd7deb98af80, klass_at_impl 1 - LOCKED debugee.stdout> MYTRACE: #31: 0xd77800, lock: 0xfffffd7deb98af80, klass_at_impl 3 - LOCKED debugee.stdout> MYTRACE: #31: 0xd77800, lock: 0xfffffd7deb98af80, succ: 0xd7c000, _cxq: 0, exit shortcut debugee.stdout> MYTRACE: #32: 0xd7a000, lock: 0xfffffd7deb98af80, succ: 0xd7c000, _cxq: 0, EnterI done 1 debugee.stdout> MYTRACE: #33: 0xd7c000, lock: 0xfffffd7deb98af80, succ: 0xd7c000, _cxq: 0, state: 4, ReenterI parking <<<< FOREVER !!! debugee.stdout> MYTRACE: #32: 0xd7a000, lock: 0xfffffd7deb98af80, klass_at_impl 3 - LOCKED debugee.stdout> MYTRACE: #32: 0xd7a000, lock: 0xfffffd7deb98af80, succ: 0xd7c000, _cxq: 0, exit shortcut This is a stack trace of the the T#33 indefinitely parked in the ReenterI() call: ----------------- lwp# 33 / thread# 33 -------------------- fffffd7fff2bcafa lwp_cond_wait (d7d150, d7d138, 0, 0) fffffd7ffe06ce71 void os::PlatformEvent::park() () + cd fffffd7ffe01064a void ObjectMonitor::ReenterI(Thread*,ObjectWaiter*) () + 306 fffffd7ffe01518a void ObjectMonitor::wait(long,bool,Thread*) () + 752 fffffd7ffe449eb8 void ObjectSynchronizer::waitUninterruptibly(Handle,long,Thread*) () + 144 fffffd7ffd1a34f3 void InstanceKlass::initialize_impl(instanceKlassHandle,Thread*) () + 39f fffffd7ffd1a0f6a void InstanceKlass::initialize(Thread*) () + 92 fffffd7ffd44c041 void InterpreterRuntime::_new(JavaThread*,ConstantPool*,int) () + 9e9 fffffd7fec465aff * nsk/share/jdi/MonitorWaitExecutor.monitorWait()V+31 (line 186) fffffd7fec40661e * nsk/share/jdi/MonitorWaitExecutor.doEventAction()V+26 (line 161) fffffd7fec40661e * nsk/share/jdi/JDIEventsDebuggee$EventActionsThread.run()V+34 (line 428) fffffd7fec400671 * nsk/share/jdi/JDIEventsDebuggee$EventActionsThread.run()V+10064 fffffd7ffd49287e void JavaCalls::call_helper(JavaValue*,methodHandle*,JavaCallArguments*,Thread*) () + 1622 fffffd7ffd49121b void JavaCalls::call(JavaValue*,methodHandle,JavaCallArguments*,Thread*) () + 3f fffffd7ffd48e5c8 void JavaCalls::call_virtual(JavaValue*,KlassHandle,Symbol*,Symbol*,JavaCallArguments*,Thread*) () + 77c fffffd7ffd48ee19 void JavaCalls::call_virtual(JavaValue*,Handle,KlassHandle,Symbol*,Symbol*,Thread*) () + ed fffffd7ffd864773 void thread_entry(JavaThread*,Thread*) () + c7 fffffd7ffe4e9709 void JavaThread::thread_main_inner() () + 521 fffffd7ffe4e8e1f void JavaThread::run() () + 84f fffffd7ffe0548d2 java_start () + 1ce fffffd7fff2b6ecb _thr_setup () + 5b fffffd7fff2b7100 _lwp_start () The threads T#31 and T#32 are gone (not present) in the pstack dump.
14-11-2013

I was going to say that beware that your tracing output is not atomic, so you may not be reading consistent values of the different variables.
14-11-2013

Ok, it is possible that the outputs above are not clean. Let me double-check everything and reproduce this tracing again.
14-11-2013

These outputs were posted to show that when a thread is re-parked at ReenterI() forever, the succ is always non-null and pointing to the re-parked thread. The corresponding pstack dump always shows that the re-entering thread has been parked but other two threads are gone.
14-11-2013

Probably, not all the event are present in the output above. My understanding is that a shortcut exit is an unlock. Not sure about last questions yet. Another sample output with a slightly different format: initialize_impl: #32: 0xd74800, lock: 0xfffffd7deb98af80, STEP 1 ::exit shortcut: #32: 0xd74800, lock: 0xfffffd7deb98af80, succ: 0xd76800, _cxq: 0 ::EnterI park 1: #31: 0xd72000, lock: 0xfffffd7deb98af80, succ: 0xd74800, _cxq: 0xfffffd7d9ceec510 ::EnterI done 2: #33: 0xd76800, lock: 0xfffffd7deb98af80, succ: 0, _cxq: 0 ::EnterI park 1: #32: 0xd74800, lock: 0xfffffd7deb98af80, succ: 0, _cxq: 0xfffffd7d9cdeb820 initialize_impl: #33: 0xd76800, lock: 0xfffffd7deb98af80, STEP 1 initialize_impl: #33: 0xd76800, lock: 0xfffffd7deb98af80, step 2 WAITING ::EnterI unpark: #32: 0xd74800, lock: 0xfffffd7deb98af80, succ: 0xd74800, _cxq: 0 ::EnterI done 4: #32: 0xd74800, lock: 0xfffffd7deb98af80, succ: 0, _cxq: 0 initialize_impl: #32: 0xd74800, lock: 0xfffffd7deb98af80, step 9 NOTIFY initialize_impl: #32: 0xd74800, lock: 0xfffffd7deb98af80, step 12 FINISHED ::EnterI unpark: #31: 0xd72000, lock: 0xfffffd7deb98af80, succ: 0xd72000, _cxq: 0xfffffd7d9cceaae0 ::EnterI done 4: #31: 0xd72000, lock: 0xfffffd7deb98af80, succ: 0, _cxq: 0xfffffd7d9cceaae0 ::EnterI park 1: #32: 0xd74800, lock: 0xfffffd7deb98af80, succ: 0, _cxq: 0xfffffd7d9cde9c20 initialize_impl: #31: 0xd72000, lock: 0xfffffd7deb98af80, STEP 1 ::EnterI unpark: #32: 0xd74800, lock: 0xfffffd7deb98af80, succ: 0xd74800, _cxq: 0 initialize_impl: #31: 0xd72000, lock: 0xfffffd7deb98af80, step 4 RETURN ::exit shortcut: #31: 0xd72000, lock: 0xfffffd7deb98af80, succ: 0xd74800, _cxq: 0 ::EnterI done 4: #32: 0xd74800, lock: 0xfffffd7deb98af80, succ: 0, _cxq: 0 klass_at_impl 1: #32: 0xd74800, lock: 0xfffffd7deb98af80, LOCKED klass_at_impl 3: #32: 0xd74800, lock: 0xfffffd7deb98af80, LOCKED ::exit shortcut: #32: 0xd74800, lock: 0xfffffd7deb98af80, succ: 0xd76800, _cxq: 0 ::EnterI done 1: #31: 0xd72000, lock: 0xfffffd7deb98af80, succ: 0xd76800, _cxq: 0 klass_at_impl 1: #31: 0xd72000, lock: 0xfffffd7deb98af80, LOCKED ::EnterI done 1: #32: 0xd74800, lock: 0xfffffd7deb98af80, succ: 0xd76800, _cxq: 0 ReenterI parkng: #33: 0xd76800, lock: 0xfffffd7deb98af80, succ: 0xd76800, _cxq: 0 // Parked forever klass_at_impl 1: #32: 0xd74800, lock: 0xfffffd7deb98af80, LOCKED ::exit shortcut: #32: 0xd74800, lock: 0xfffffd7deb98af80, succ: 0xd76800, _cxq: 0 ::EnterI done 1: #31: 0xd72000, lock: 0xfffffd7deb98af80, succ: 0xd76800, _cxq: 0 klass_at_impl 1: #31: 0xd72000, lock: 0xfffffd7deb98af80, LOCKED ::exit shortcut: #31: 0xd72000, lock: 0xfffffd7deb98af80, succ: 0xd76800, _cxq: 0 klass_at_impl 3: #31: 0xd72000, lock: 0xfffffd7deb98af80, LOCKED ::exit shortcut: #31: 0xd72000, lock: 0xfffffd7deb98af80, succ: 0xd76800, _cxq: 0 ::EnterI done 1: #32: 0xd74800, lock: 0xfffffd7deb98af80, succ: 0xd76800, _cxq: 0 klass_at_impl 3: #32: 0xd74800, lock: 0xfffffd7deb98af80, LOCKED ::exit shortcut: #32: 0xd74800, lock: 0xfffffd7deb98af80, succ: 0xd76800, _cxq: 0
14-11-2013

In fact, it is exactly what really happens. The thread at Reenter() is reparking with the _succ that was not cleared and pointing to the re-entering (current) thread T#31. Below is an output with my tracing that I collected earlier: debugee.stdout> MYTRACE: #32: 0xd61000, lock: 0xfffffd7deb98af80, init STEP 1 - LOCKED debugee.stdout> MYTRACE: #32: 0xd61000, lock: 0xfffffd7deb98af80, succ: 0xd5e800, _cxq: 0, exit shortcut debugee.stdout> MYTRACE: #31: 0xd5e800, lock: 0xfffffd7deb98af80, init STEP 1 - LOCKED debugee.stdout> MYTRACE: #31: 0xd5e800, lock: 0xfffffd7deb98af80, init STEP 2 - WAITING debugee.stdout> MYTRACE: #31: 0xd5e800, lock: 0xfffffd7deb98af80, succ: 0xd61000, _cxq: 0, exit shortcut debugee.stdout> MYTRACE: #32: 0xd61000, lock: 0xfffffd7deb98af80, init STEP 9 - NOTIFY debugee.stdout> MYTRACE: #32: 0xd61000, lock: 0xfffffd7deb98af80, init STEP 12 - FINISHED debugee.stdout> MYTRACE: #32: 0xd61000, lock: 0xfffffd7deb98af80, klass_at_impl 1 - LOCKED debugee.stdout> MYTRACE: #32: 0xd61000, lock: 0xfffffd7deb98af80, klass_at_impl 3 - LOCKED debugee.stdout> MYTRACE: #32: 0xd61000, lock: 0xfffffd7deb98af80, succ: 0xd5e800, _cxq: 0, exit shortcut debugee.stdout> MYTRACE: #32: 0xd61000, lock: 0xfffffd7deb98af80, klass_at_impl 1 - LOCKED debugee.stdout> MYTRACE: #32: 0xd61000, lock: 0xfffffd7deb98af80, succ: 0xd5e800, _cxq: 0, exit shortcut debugee.stdout> MYTRACE: #32: 0xd61000, lock: 0xfffffd7deb98af80, klass_at_impl 3 - LOCKED debugee.stdout> MYTRACE: #32: 0xd61000, lock: 0xfffffd7deb98af80, succ: 0xd5e800, _cxq: 0, exit shortcut debugee.stdout> MYTRACE: #33: 0xd63000, lock: 0xfffffd7deb98af80, succ: 0xd5e800, _cxq: 0, EnterI done 1 debugee.stdout> MYTRACE: #31: 0xd5e800, lock: 0xfffffd7deb98af80, succ: 0xd5e800, _cxq: 0, state: 4, ReenterI parking <<Forever!!>> Please, note above that the _succ == 0xd5e800 (which is current thread #31) just before reparking! debugee.stdout> MYTRACE: #33: 0xd63000, lock: 0xfffffd7deb98af80, succ: 0xd5e800, _cxq: 0, exit shortcut debugee.stdout> MYTRACE: #33: 0xd63000, lock: 0xfffffd7deb98af80, init STEP 1 - LOCKED debugee.stdout> MYTRACE: #33: 0xd63000, lock: 0xfffffd7deb98af80, init STEP 4 - RETURN debugee.stdout> MYTRACE: #33: 0xd63000, lock: 0xfffffd7deb98af80, succ: 0xd5e800, _cxq: 0, exit shortcut But I'm still puzzled how this could happen. The ObectMonitor::wait() does this: OrderAccess::loadload() ; if (_succ == Self) _succ = NULL ; before call to the ReenterI(): OrderAccess::fence() ; . . . . . . . . . . . . . . . . . . . . if (v == ObjectWaiter::TS_RUN) { enter (Self) ; } else { guarantee (v == ObjectWaiter::TS_ENTER || v == ObjectWaiter::TS_CXQ, "invariant") ; ReenterI (Self, &node) ; node.wait_reenter_end(this); } The only idea is that at the line if (_succ == Self) _succ = NULL ; the thread somehow does not really see that_succ == Self so that it does not clear it.
14-11-2013

I don't understand the traces. Why do we have a fast exit with a successor after a LOCKED action - it should be an unlock: debugee.stdout> MYTRACE: #32: 0xd61000, lock: 0xfffffd7deb98af80, init STEP 1 - LOCKED debugee.stdout> MYTRACE: #32: 0xd61000, lock: 0xfffffd7deb98af80, succ: 0xd5e800, _cxq: 0, exit shortcut Note also that when thread 31 waits it sees thread 32 as succ_ and yet thread 32 then sees thread 31 as _succ before doing the notify - which doesn't make sense to me. thread 32 is in the wait-queue until the notify moves it to the cxq/entry-queue and only then can it become _succ. ???
14-11-2013

Dave Dice has now explained that the proposed bug that this change "fixed" is not in fact possible. The _succ thread can not re-park without checking _owner and without clearing _succ; when the previous owner sees _succ != NULL it has already relinquished ownership so if it sees a _succ then the _succ will see the change of ownership and so will acquire the lock. If another thread barges in again the _succ thread will clear succ_ before parking again - either way the new owner ends up in the same state as the previous owner racing with the "succ" thread but not in a way that jeopardizes liveness.
14-11-2013

Re: this comment from David H: > Also some of the analysis seems incomplete, in particular: That would be why I said this when I pasted that analysis: > I wrote a bunch of notes on the code on my flight from Boston -> Denver. > Putting them here before I forget. They haven't been sanity checked yet. I can already see that part of my analysis was splatted into the middle of another section... I'll revisit these notes shortly.
13-11-2013

I agree with David that dropping the _succ!=NULL check in the fast exit path is quite safe and low risk. Local testing shows that the test nsk/jdi/stress/MonitorEvents/MonitorEvents002 does not hang with that fix. But we need Dave Dice to look at it anyway.
12-11-2013

Release team: Based on input from Alan Bateman we're putting this on NMI list as David Holmes were discussing this with David Dice.
12-11-2013

Karen says: Please get Dave Dice to review the proposed fix.
12-11-2013

This needs to be fixed ASAP once the details have been worked out. Deferring to 9 is way too far in the future. It is likely this bug will only increase in visibility as we continue to see larger scale multi-core devices that will change the timing of these kinds of races. The actual proposed fix of just dropping the _succ!=NULL check is quite safe and very low risk as we are now doing more conservative actions rather than taking an optimized "quick exit" path.
12-11-2013

Requesting deferral to 9, any change to this area of the code is too risky at this point. We haven't changed this code in jdk8.
11-11-2013

Dan, some of the "races" are known to be possible that is why there is also the notion of the Responsible thread that will do a timed rather than un-timed park. I think what we may have identified is a flaw in the Responsible protocol more than the _succ handling protocol per-se but it is very hard to say for sure. Also some of the analysis seems incomplete, in particular: // - line 643-646- the successor didn't get the lock, realizes that it // is the successor and sets _succ to NULL; fence() is called to // flush the _succ field, but I don't think that'll help if the // current owning thread is exiting and setting the _succ field // to us again. This is a race that David H identified with the // proposed solution to set '_succ = NULL' before repark; I didn't // and David H. probably didn't realize that we already have that // race! You seemed to have missed the comment after we set _succ to NULL: if (_succ == Self) _succ = NULL ; // Invariant: after clearing _succ a thread *must* retry _owner before parking. OrderAccess::fence() ; we loop back around and try to acquire the lock again. If the owner appointed us _succ after we cleared it then it will have unparked us, so if we find a third thread has reclaimed the monitor, we will park _but_ this will return immediately because of the previous park. Hence no bug. The race I identified was a race between the exiting thread checking _succ!=NULL and doing a simple return, and the _succ thread clearing _succ. In this case there is no unpark so the nominal successor will wait indefinitely in the park().
10-11-2013

ILW: 3=MLH (I=M - rare deadlocks - no known test failures yet, L=L - intermittent with low frequency, W=H - no workaround)
08-11-2013

Another useful piece of information below: On 10/24/13 4:45 AM, David Holmes (JBS) wrote: By strange coincidence a question was asked recently about how the ObjectMonitor code worked in a particular scenario: http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2013-October/009869.html I couldn't answer specifically as you have to basically replay all the events described in the scenario and work out exactly how the data structures would look. But the scenario described has some similarity with what we appear to see here. Or it may be completely unrelated. I think the key here is that we need to track the thread that completes the initialization and see how the notify_all behaves.
08-11-2013

I temporarily assigned this bug to myself. It is Ok to change the ownership if necessary.
08-11-2013