JDK-6785442 : ConcurrentLinkedQueue.remove() and poll() can both remove the same element
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.concurrent
  • Affected Version: 6u10
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: linux
  • CPU: x86
  • Submitted: 2008-12-16
  • Updated: 2010-04-04
  • Resolved: 2009-08-14
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 7
7 b70Fixed
Description
FULL PRODUCT VERSION :
java version "1.6.0_10"
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b15, mixed mode)


ADDITIONAL OS VERSION INFORMATION :
Linux 2.6.18-ts15 #3 SMP PREEMPT Thu Jun 12 15:18:57 GMT 2008 x86_64 x86_64 x86_64 GNU/Linux

A DESCRIPTION OF THE PROBLEM :
There appears to be a race condition in ConcurrentLinkedQueue which can cause a single item to be removed twice.  The problem is related to an interaction between poll() and remove().

remove() traverses the list of Nodes and returns true if it can CAS a Node's item from non-null to null. If other threads are simultaneously poll()ing during this traversal, it is possible that the remove() thread could fall behind, and end up iterating over nodes that have already been unlinked.

poll() unlinks the head Node of the list, reads the node's item value, sets the reference to null, and then returns the value.  However when setting the reference to null, it does not do a CAS, so it is possible that remove() could have set the reference to null after poll() read it but before poll() sets it.  If this happens, both functions will believe they have successfully removed the same value from the list.


STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Apply the given patch to ConcurrentLinkedQueue and then run the associated CLQTest class.

Note that the patch simply causes Thread.sleep() to be caused under certain (contrived) circumstances in order to reliably expose the race condition.  It does not affect the correctness of the code.

CLQTest is fairly self-explanatory. It creates a new ConcurrentLinkedQueue, adds two elements to it, starts a thread to remove one of the elements(), and then calls poll() twice.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
First poll returned: A
Second poll returned: null
Async remove("B") returned: true
PASS

ACTUAL -
First poll returned: A
Second poll returned: B
Async remove("B") returned: true
FAIL


REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
public class CLQTest
        implements Runnable
{

    private final ConcurrentLinkedQueue<String> queue =
            new ConcurrentLinkedQueue<String>();
    private volatile boolean removeWorked;

    public CLQTest()
            throws InterruptedException
    {
        queue.add("A");
        queue.add("B");
        Thread t = new Thread(this);
        t.start();
        Thread.sleep(1000);
        String a = queue.poll();
        System.out.println("First poll returned: " + a);
        String b = queue.poll();
        System.out.println("Second poll returned: " + b);
        t.join();
        System.out.println("Async remove(\"B\") returned: " + removeWorked);
        if (removeWorked ^ "B".equals(b)) {
            System.out.println("PASS");
        } else {
            System.out.println("FAIL");
        }
    }

    public void run() {
        removeWorked = queue.remove("B");
    }

    public static void main(String[] args) throws Exception {
        new CLQTest();
    }

}


patch file:
--- jdk1.6.0_10_src/java/util/concurrent/ConcurrentLinkedQueue.java	2008-09-26 07:16:03.000000000 +0000
+++ ConcurrentLinkedQueue.java	2008-12-11 22:21:07.390740309 +0000
@@ -218,6 +218,13 @@
                         casTail(t, first);
                 } else if (casHead(h, first)) {
                     E item = first.getItem();
+                    if ("B".equals(item)) {
+                        try {
+                            Thread.sleep(2000);
+                        } catch (InterruptedException e) {
+                            // ignore
+                        }
+                    }
                     if (item != null) {
                         first.setItem(null);
                         return item;
@@ -345,6 +352,13 @@
         if (o == null) return false;
         for (Node<E> p = first(); p != null; p = p.getNext()) {
             E item = p.getItem();
+            if ("A".equals(item)) {
+                try {
+                    Thread.sleep(2000);
+                } catch (InterruptedException e) {
+                    // ignore
+                }
+            }
             if (item != null &&
                 o.equals(item) &&
                 p.casItem(item, null))

---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
@@ -218,8 +218,7 @@
                         casTail(t, first);
                 } else if (casHead(h, first)) {
                     E item = first.getItem();
-                    if (item != null) {
-                        first.setItem(null);
+                    if ((item != null) && first.casItem(item, null)) {
                         return item;
                     }
                     // else skip over deleted item, continue loop,

Comments
EVALUATION changeset for this fix: http://hg.openjdk.java.net/jdk7/tl/jdk/rev/12e479399ced
07-08-2009

EVALUATION This problem was discussed on the concurrency-interest mailing list: http://cs.oswego.edu/pipermail/concurrency-interest/2009-February/005880.html the issue is: "poll() should use casItem rather than setItem(null) so that it can detect a concurrent remove()." and the fix is being provided by Martin Buchholz and Doug Lea.
09-06-2009