JDK-6535709 : interrupt of wait()ing thread isn't triggering InterruptedException - test intwait3
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 5.0u17,6,6u1,7
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: generic,windows_2000,windows_2003,windows_xp
  • CPU: generic,x86
  • Submitted: 2007-03-19
  • Updated: 2019-08-22
  • Resolved: 2011-04-25
JDK 7: 7 (Fixed)
Other: hs21 (Fixed)
Related Reports
Duplicate :  
Relates :  
Relates :  
Description
The NSK test runtime/threads/intwait3 has been failing occasionally in the nightlies:

http://gtee.sfbay/gtee/results/MUSTANG/NIGHTLY/VM-MAIN/2007-03-12/RT_Baseline/vm/64BITWIN-AMD64/server/mixed/vm-vm_6.0_server_mixed_64BITWIN-AMD642007-03-12-20-22-33/ResultDir/intwait3/intwait3.log

http://gtee.sfbay/gtee/results/MUSTANG/NIGHTLY/VM-MAIN/2007-02-24/Main_Baseline/vm/64BITWIN-AMD64/server/mixed/vm-vm_6.0_server_mixed_64BITWIN-AMD642007-02-24-20-23-52/ResultDir//intwait3//intwait3.log

Both failures occurred on machine icemaker.

The test basically generates 5000 threads, each of which does this.wait() and expects to receive an InterruptedException. The main thread starts each thread by doing:

t.start();
t.interrupt();

The target thread expects to get an InterruptedException and if not then an error will be reported.

There are two observed failure modes:

1. In the nightly failures, a thread has completed wait() without encountering the InterruptedException.

2. When reproducing the failure on icemaker I also found that sometimes the test will hang. jstack shows that in that case the main thread is waiting in a join() on the target, while the target is still blocked in this.wait() - indicating that the interruption got lost somewhere.


The test uses wait() in a way that is sensitive to spurious wakeups, but it's odd that it suddenly started to fail. Spurious wakeups in the VM have known causes and this scenario does not seem to fit. Spurious wakeups would not account for the hang either.
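For comparison, the usual way to make a waiter insensitive to spurious wakeups is to wait in a loop on a guarded condition. The following is only a sketch of that idiom, not a change to the NSK test: the class GuardedWaiter, its fields and its methods are hypothetical names introduced here. It tolerates a spurious return from wait(), but it does nothing about the hang, where the interrupt itself is lost.

```java
// Hypothetical example class; not part of the NSK intwait3 test.
class GuardedWaiter extends Thread {
    private final Object lock = new Object();
    private boolean released = false;        // condition, guarded by lock
    volatile boolean sawInterrupt = false;   // set when wait() throws

    @Override
    public void run() {
        synchronized (lock) {
            try {
                // Re-check the condition after every wakeup, so a spurious
                // wakeup simply goes back to waiting.
                while (!released) {
                    lock.wait();
                }
            } catch (InterruptedException e) {
                sawInterrupt = true;
            }
        }
    }

    // Normal (non-interrupt) way to let the waiter finish.
    void release() {
        synchronized (lock) {
            released = true;
            lock.notifyAll();
        }
    }

    // Same start/interrupt/join sequence as the test, for a single thread.
    // The join is bounded and the thread is a daemon so that a lost
    // interrupt shows up as a false result rather than a stuck process.
    static boolean startThenInterrupt() {
        GuardedWaiter t = new GuardedWaiter();
        t.setDaemon(true);
        t.start();
        t.interrupt();
        try {
            t.join(10_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return t.sawInterrupt;
    }
}
```

With this shape the only ways out of the loop are release() or an InterruptedException, so a return from wait() that is neither is harmless.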

The failure has only been seen on the one Windows 2003 Server machine. Changing the Java code to trace what was happening seemed to prevent the original failure mode from manifesting. Running two instances of the test in parallel seems to allow the hang to be readily reproduced (the test is being run by the shell in a continuous loop until failure).

Comments
The wait logic actually contains three is_interrupted calls, two of which clear the interrupt. Hence the current failure can be explained, as can the hangs - see the evaluation.
22-08-2019

Checked the object file disassembly using dumpbin /disasm and it shows everything working correctly and everything being issued in the right order. It seems unlikely that we would get a memory order/visibility problem with a dual-core system, and in any case adding an intermediate fence() does not solve the problem.
22-08-2019

I found the problem that causes sleep to not be interrupted - a simple race in is_interrupted(true) - see JDK-6498581 - the fix for which also fixes the current problem. However it does not explain the current problem. My current theory is that the win32 C compiler is not honouring ordering relating to volatile variables.
22-08-2019

I instrumented os::interrupt and os::is_interrupted to report which thread was being interrupted and when the interrupt was being cleared etc. Much to my surprise this actually introduced a hang into this intwait3 test program - which indicates that we have a race condition somewhere. The tracing showed a couple of interesting things:

1. We saw multiple interrupts on the same thread. This is not part of the test program - there is only one interrupt per thread. I can only assume that the JavaThread memory is being quickly reused and so what appears to be the same thread is a different Java thread.

2. Just before the hang we see:

set interrupt on thread xxxx
Checking interrupt: clear for thread XXX

The set message comes at the end of the os::interrupt method after interruption is complete, whereas the is_interrupted check comes after the check for osthread->interrupted() and before the clearing of the interrupt if requested. So not only does it appear that the interrupt state was not seen to be set, the unpark() performed as part of os::interrupt did not seem to stick because the thread blocked in the wait() call. This isn't making any sense at all.
22-08-2019

As suspected Thread::is_interrupted is returning false.
22-08-2019

To summarise, as I'm looking into this problem yet again. The interrupt() is somehow triggering a "spurious wakeup" from the wait() rather than throwing the InterruptedException. Tracing shows that in ObjectMonitor::wait we hit this path:

// check if the notification happened
if (!WasNotified) {
  // no, it could be timeout or Thread.interrupt() or both
  // check for interrupt event, otherwise it is timeout
  if (interruptible && Thread::is_interrupted(Self, true) && !HAS_PENDING_EXCEPTION) {
    TEVENT (Wait - throw IEX from epilog) ;
    THROW(vmSymbols::java_lang_InterruptedException());
  }
  <=== we reach here: so it appears to be spurious
}

Further testing is trying to identify which conditions are not holding. I suspect it will be the is_interrupted check.
22-08-2019

See comments in 6498581 for more info. This may well be a duplicate.
22-08-2019

Here's the failing test program. It can run standalone.

class intwait3 extends Thread {
    int name;
    boolean interrupted = false;

    intwait3(int n) {
        name = n;
    }

    public synchronized void run() {
        try {
            wait();
        } catch (InterruptedException e) {
            interrupted = true;
        }
    }

    public static void main(String args[]) {
        intwait3[] joinThread = new intwait3[5000];
        int k;
        for (k = 0; k < 5000; k++) {
            joinThread[k] = new intwait3(k);
        }
        for (k = 0; k < 5000; k++) {
            joinThread[k].start();
            joinThread[k].interrupt();
            try {
                joinThread[k].join();
            } catch (InterruptedException e) {
                ; // empty
            }
        }
        for (k = 0; k < 5000; k++) {
            if (!joinThread[k].interrupted) {
                System.out.println("Error: interrupted exception for " + "joinThread " + joinThread[k].name);
                System.exit(1);
            }
        }
        System.exit(0);
    }
}
22-08-2019

EVALUATION http://hg.openjdk.java.net/jdk7/hotspot-comp/hotspot/rev/083f13976b51
25-03-2011

EVALUATION http://hg.openjdk.java.net/jdk7/hotspot/hotspot/rev/083f13976b51
25-03-2011

EVALUATION http://hg.openjdk.java.net/jdk7/hotspot-rt/hotspot/rev/083f13976b51
22-03-2011

SUGGESTED FIX

bool os::is_interrupted(Thread* thread, bool clear_interrupted) {
  ...
  OSThread* osthread = thread->osthread();
  bool interrupted = osthread->interrupted();
! if (interrupted && clear_interrupted) {
    osthread->set_interrupted(false);
    ResetEvent(osthread->interrupt_event());
  } // Otherwise leave the interrupted state alone
  return interrupted;
}
20-03-2011

EVALUATION

There is a bug in the win32 version of os::is_interrupted which differs from the Solaris and Linux versions which do not have this bug.

bool os::is_interrupted(Thread* thread, bool clear_interrupted) {
  ...
  OSThread* osthread = thread->osthread();
  bool interrupted = osthread->interrupted();
  if (clear_interrupted) {
    osthread->set_interrupted(false);
    ResetEvent(osthread->interrupt_event());
  } // Otherwise leave the interrupted state alone
  return interrupted;
}

Here we return the original value of interrupted, but that value could have changed between the read and the subsequent clear - in which case the intermediate interrupt is lost. In contrast on Solaris we have:

bool os::is_interrupted(Thread* thread, bool clear_interrupted) {
  ...
  OSThread* osthread = thread->osthread();
  bool res = osthread->interrupted();

  // NOTE that since there is no "lock" around these two operations,
  // there is the possibility that the interrupted flag will be
  // "false" but that the interrupt event will be set. This is
  // intentional. The effect of this is that Object.wait() will appear
  // to have a spurious wakeup, which is not harmful, and the
  // possibility is so rare that it is not worth the added complexity
  // to add yet another lock. It has also been recommended not to put
  // the interrupted flag into the os::Solaris::Event structure,
  // because it hides the issue.
  if (res && clear_interrupted) {
    osthread->set_interrupted(false);
  }
  return res;
}

Note that we only clear the interrupt if we saw that we were interrupted. In this way we guarantee that if we return false we did not modify the interrupt state - or for the Windows case the interrupt event state - hence an interrupt can not be lost.

This bug can manifest as a spurious wakeup (as per the intwait3 test described in this CR) or as a hang (as reported in 6741489 but also possible with intwait3 if the timing changes - eg by adding tracing code).
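The lost-interrupt window can be shown deterministically with a toy model in Java. This is only a sketch under stated assumptions: InterruptFlagModel and all of its members are hypothetical names, the AtomicBoolean stands in for the osthread interrupted flag, and the Runnable hook stands in for an os::interrupt that lands between the read and the clear. It is not HotSpot code and does not model the win32 interrupt event.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model only; names are hypothetical and this is not HotSpot code.
class InterruptFlagModel {
    final AtomicBoolean interrupted = new AtomicBoolean(false);

    // Stand-in for the flag update done by an interrupting thread.
    void interrupt() {
        interrupted.set(true);
    }

    // Pre-fix win32 shape: clears whenever asked, whatever value was read.
    boolean isInterruptedBuggy(boolean clear, Runnable betweenReadAndClear) {
        boolean seen = interrupted.get();
        betweenReadAndClear.run();   // a racing interrupt lands here
        if (clear) {
            interrupted.set(false);
        }
        return seen;
    }

    // Solaris/Linux shape, and the suggested fix: clears only if set was seen.
    boolean isInterruptedFixed(boolean clear, Runnable betweenReadAndClear) {
        boolean seen = interrupted.get();
        betweenReadAndClear.run();   // a racing interrupt lands here
        if (seen && clear) {
            interrupted.set(false);
        }
        return seen;
    }

    // Runs one clearing check with an interrupt racing in the window.
    // Returns true if the interrupt is still observable afterwards.
    static boolean interruptSurvives(boolean useFixed) {
        InterruptFlagModel m = new InterruptFlagModel();
        boolean first = useFixed
                ? m.isInterruptedFixed(true, m::interrupt)
                : m.isInterruptedBuggy(true, m::interrupt);
        // 'first' is false in both variants: the read preceded the interrupt.
        return !first && m.interrupted.get();
    }
}
```

In this model interruptSurvives(false) is false, because the unconditional clear wipes out the interrupt that arrived after the read, while interruptSurvives(true) is true, because a check that returned false never touched the flag.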
The logic in ObjectMonitor::wait essentially does the following:

if (os::is_interrupted(true))
  throw InterruptedException;
release_monitor();
thread->_parkEvent->reset();
if (!os::is_interrupted(false))
  thread->_parkEvent->park();
reacquire_monitor();
if (notNotified) {
  if (os::is_interrupted(true))
    throw InterruptedException
  else
    // spurious wakeup
} else {
  // normal notification
}

If the interrupt hits during the initial is_interrupted(true) call then we can get the situation where it will return false, but meanwhile the interrupt issued a _ParkEvent->unpark(). If the unpark() happens before the reset() above then the thread does not return from the park() and we get the hang. If the unpark() happens after the reset() then we immediately return from park(), but the final is_interrupted(true) returns false (as it was previously cleared) and so we get the spurious wakeup.

The spurious wakeup is not in fact a bug as it is permitted behaviour from Object.wait(), and in that sense the intwait3 test is invalid. However the hang is definitely a bug. Regardless, one fix addresses both issues, and while spurious wakeups are permitted, we try to minimize them, and we always want to know under what circumstances they can occur.
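The two interleavings can be walked through with a single-threaded toy model in Java. This is a sketch under stated assumptions, not the VM's implementation: WaitModel, Outcome and every member are hypothetical names, the park event is reduced to a single boolean permit, the racing interrupt is assumed to set the flag inside the first is_interrupted(true) call, and there is no notify and no second interrupt.

```java
// Single-threaded toy model of the wait logic; all names are hypothetical.
class WaitModel {
    enum Outcome { INTERRUPTED_EXCEPTION, SPURIOUS_WAKEUP, HANG }

    private boolean interruptFlag = false;
    private boolean parkPermit = false;   // stand-in for the park event
    private final boolean fixed;          // true: clear only if set was seen

    WaitModel(boolean fixed) {
        this.fixed = fixed;
    }

    // interruptRacesHere simulates the interrupt setting the flag
    // between the read and the (possible) clear.
    private boolean isInterrupted(boolean clear, boolean interruptRacesHere) {
        boolean seen = interruptFlag;
        if (interruptRacesHere) {
            interruptFlag = true;
        }
        if (clear && (seen || !fixed)) {
            interruptFlag = false;
        }
        return seen;
    }

    Outcome waitOnce(boolean unparkBeforeReset) {
        if (isInterrupted(true, true)) {
            return Outcome.INTERRUPTED_EXCEPTION;
        }
        if (unparkBeforeReset) {
            parkPermit = true;            // interrupt's unpark() arrives early
        }
        parkPermit = false;               // reset()
        if (!unparkBeforeReset) {
            parkPermit = true;            // interrupt's unpark() arrives late
        }
        if (!isInterrupted(false, false)) {
            if (!parkPermit) {
                return Outcome.HANG;      // park() would block forever here
            }
            parkPermit = false;           // park() returns immediately
        }
        // Not notified: decide between interrupt and spurious wakeup.
        if (isInterrupted(true, false)) {
            return Outcome.INTERRUPTED_EXCEPTION;
        }
        return Outcome.SPURIOUS_WAKEUP;
    }
}
```

In this model the unconditional clear gives HANG when the unpark precedes the reset and SPURIOUS_WAKEUP when it follows it, while the conditional clear gives INTERRUPTED_EXCEPTION in both orderings, which matches the observation that one fix addresses both failure modes.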
19-03-2011