FULL PRODUCT VERSION :
This bug is present in all NIO releases. Verified on 1.5 and 1.6 beta.
ADDITIONAL OS VERSION INFORMATION :
EXTRA RELEVANT SYSTEM CONFIGURATION :
2-CPU box (4 virtual CPUs total due to hyperthreading)
A DESCRIPTION OF THE PROBLEM :
A bug in Sun's WindowsSelectorImpl can cause a concurrent Selector.register and SelectionKey.interestOps to ignore the interestOps update. I suspect that a concurrent deregister and SelectionKey.interestOps can cause the same problem (although I did not debug that failure mode carefully).
The problem happens when WindowsSelectorImpl tries to grow the natively allocated FD array (via PollWrapper.grow()). To do this, a new, bigger array is allocated, the data from the old array is copied into the new one, and the new one is assigned to be used by the selector.
However, if a change to the interestOps happens while the process above is in progress, the new interest ops can be written to the OLD array after that channel's record has been copied to the new array but before the copying is complete. The change to the interest ops is then lost.
The way deregister moves the last channel into the deleted slot also seems to open a window to lose an interestOps update to that last channel (the one being moved).
Basically, changes to the interest ops must be synchronized with the growing and reorganization of the FD array.
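The lost-update interleaving can be replayed deterministically. The sketch below is a toy model under my own naming (it is NOT the real WindowsSelectorImpl code): a grow copies slot 0 into the new array, the concurrent interestOps write then lands in the old array, and the remainder of the copy never picks it up.

```java
// Toy model of the race: replay, step by step, the interleaving in which an
// interestOps update is written to the old poll array after its slot has
// already been copied into the grown array. (All names are illustrative.)
public class LostUpdateDemo {

    /** Returns the interest value the selector ends up seeing for channel 0. */
    static int replay() {
        int[] oldArr = {1, 1, 1};    // per-channel interest ops (1 = OP_READ)
        int[] newArr = new int[6];   // the grown array

        newArr[0] = oldArr[0];       // grow(): channel 0's slot copied first

        oldArr[0] = 0;               // concurrent interestOps(0) writes to the
                                     // OLD array -- its slot was already copied

        newArr[1] = oldArr[1];       // grow() finishes copying the rest
        newArr[2] = oldArr[2];

        return newArr[0];            // selector switches to newArr: update lost
    }

    public static void main(String[] args) {
        System.out.println("selector sees interest = " + replay());  // prints 1, not 0
    }
}
```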
As side points:
1) Why is replaceEntry in PollWrapper NOT a static function?
2) Why is PollWrapper.grow so inefficient at copying data? It performs several function calls and a lot of arithmetic per record copied. This hurts precisely when you have a LOT of connections and need to double the array -- the very case that NIO was designed to handle efficiently.
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
I have an NIO server that I am stress testing to accept over 50K connections as rapidly as possible, while doing full-duplex communication over those connections.
I implement this by having one thread accept all IO requests and save the keys away for a worker thread pool to execute. To prevent a saved key from being enqueued more than once at a time, I set its interest ops to 0 when I put it on the queue (this is done in the selector's thread, so it is synchronous with other selector operations).
When a worker thread is done with the IO operation on the socket channel, it restores the key's interest to the "interested" operations. That is done asynchronously, from a thread other than the selector's.
Hence, I am performing a lot of IO and interest-set changes while other socket channels are being registered.
This problem reproduces almost every run.
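The park-and-restore pattern used in the reproduction can be sketched as follows. This is a minimal illustration under my own class and method names (not code from the original report), using a Pipe as a stand-in for a socket channel: the selector thread sets interest ops to 0 before handing the key off, and a worker thread restores them asynchronously.

```java
import java.io.IOException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Sketch of the dispatch pattern from the reproduction steps (names are
// illustrative): park a ready key with interestOps(0) on the selector
// thread, restore it later from a worker thread.
public class ParkRestoreSketch {

    /** Parks a key, restores it from another thread, returns the final ops. */
    static int parkAndRestore() throws IOException, InterruptedException {
        Selector selector = Selector.open();
        Pipe pipe = Pipe.open();                      // stand-in for a socket channel
        pipe.source().configureBlocking(false);
        SelectionKey key = pipe.source().register(selector, SelectionKey.OP_READ);

        key.interestOps(0);                           // selector thread: park the key
        // ...here the key would be handed to a worker queue...

        Thread worker = new Thread(() -> {
            key.interestOps(SelectionKey.OP_READ);    // worker: restore interest
            selector.wakeup();                        // nudge a blocked select()
        });
        worker.start();
        worker.join();

        int ops = key.interestOps();
        selector.close();
        pipe.source().close();
        pipe.sink().close();
        return ops;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("restored ops = " + parkAndRestore());
    }
}
```

Under the bug described above, the worker's restore can be silently dropped if it races with the selector growing its FD array.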
EXPECTED VERSUS ACTUAL BEHAVIOR :
When I reset the interest back to, say, READ, I expect the channel to be selected soon after new data arrives (or has already arrived) on the socket.
Instead, some of the channels whose interest set was changed while the selector was growing its array would never be selected again, because the change of their interest from 0 to READ was lost. (The very inefficient copy makes the race window even wider.)
This bug can be reproduced often.
---------- BEGIN SOURCE ----------
Just read the description in "Steps to Reproduce".
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
Externally synchronize key.interestOps() with selector.register and selector.deregister.
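The workaround can be sketched as a small wrapper that routes every register() and interestOps() call through one guard lock, so interest updates can never overlap a registration (and hence a poll-array grow). The class and lock names below are illustrative, not from the original report.

```java
import java.nio.channels.ClosedChannelException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Sketch of the submitted workaround (names are illustrative): a single
// guard object serializes registration against interest-ops updates.
public class GuardedSelectorOps {
    private final Object guard = new Object();
    private final Selector selector;

    public GuardedSelectorOps(Selector selector) {
        this.selector = selector;
    }

    /** Register under the guard, so no interestOps() update can run mid-grow. */
    public SelectionKey register(SelectableChannel ch, int ops)
            throws ClosedChannelException {
        synchronized (guard) {
            return ch.register(selector, ops);
        }
    }

    /** Update interest ops under the same guard. */
    public void setInterestOps(SelectionKey key, int ops) {
        synchronized (guard) {
            key.interestOps(ops);
        }
    }

    /** Cancel (deregister) under the same guard. */
    public void cancel(SelectionKey key) {
        synchronized (guard) {
            key.cancel();
        }
    }
}
```

All threads (the selector thread and the workers) must go through this wrapper for the lock to be effective.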