JDK-8027747 : Regex: odd behavior of capturing group under possessive quantifier
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Affected Version: 7u9
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: os_x
  • Submitted: 2013-03-20
  • Updated: 2015-01-13
Related Reports
Duplicate :  
Description
FULL PRODUCT VERSION :
java version "1.7.0_09"
Java(TM) SE Runtime Environment (build 1.7.0_09-b05)
Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode)

ADDITIONAL OS VERSION INFORMATION :
Mac OS X 10.7.5

A DESCRIPTION OF THE PROBLEM :
When matching against a regular expression that has a capturing group inside a possessive quantifier, the group sometimes shows captured input, even if it was not part of the final match.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
See source code below.

(1) compile pattern "([abc]+?)(b)?+(d)"
(2) match against "abcd"
(3) check the value of matcher.group(2)

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
For this regex, with Matcher "m", I would expect m.group(2) to be null. More generally, since the three groups do not overlap, I would expect "m.group(1) + m.group(2) + m.group(3)" to equal the input string (whenever m.group(2) is not null).
ACTUAL -
Instead, m.group(2)="b", even though m.group(1)="abc" and m.group(3)="d".
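
The expectation above (adjacent groups whose spans add up to the whole match) can be checked mechanically with Matcher.start(int) and Matcher.end(int). The following is only a sketch: the class name GroupTiling and the helper groupsTileMatch are made up for illustration and are not part of the original report. It is only meaningful for patterns whose groups are written side by side with nothing between them, as in this report.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupTiling {
    // Hypothetical helper: returns true if the groups that took part in the
    // match are adjacent, in order, and together span exactly group 0.
    static boolean groupsTileMatch(Matcher m) {
        int pos = m.start();
        for (int g = 1; g <= m.groupCount(); g++) {
            if (m.start(g) == -1) {
                continue; // group did not participate (group(g) is null)
            }
            if (m.start(g) != pos) {
                return false; // gap or overlap with the preceding group
            }
            pos = m.end(g);
        }
        return pos == m.end();
    }

    public static void main(String[] args) {
        // Pattern and input taken from this report.
        Matcher m = Pattern.compile("([abc]+?)(b)?+(d)").matcher("abcd");
        if (m.matches()) {
            System.out.println("group 2 span: [" + m.start(2) + ", " + m.end(2) + ")");
            System.out.println("groups tile the match: " + groupsTileMatch(m));
        } else {
            System.out.println("does not match");
        }
    }
}
```

On a JDK showing the behavior described here, group 2 would be reported inside the span of group 1, so the check should fail; a JDK without the problem should leave group 2 unset and the check should pass.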

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
import java.util.regex.*;

public class TestRegex {
    public static void main(String[] args) {
        // Group 2 sits under a possessive quantifier: (b)?+
        Pattern pattern = Pattern.compile("([abc]+?)(b)?+(d)");
        Matcher m = pattern.matcher("abcd");
        if (m.matches()) {
            System.out.println(m.group(0));
            // Per this report, group 2 prints as "b" here although null is expected.
            System.out.println(m.group(1) + "|" + m.group(2) + "|" + m.group(3));
        } else {
            System.out.println("does not match");
        }
    }
}

---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Don't use possessive quantifiers around capturing groups.
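
One possible way to apply this workaround, sketched here under assumptions rather than taken from the report, is to move the possessive quantifier inside the capturing group, giving ([abc]+?)(b?+)(d). The class name PossessiveWorkaround and the helper groups are hypothetical. Note the semantic difference: when "b" is not part of the match, group 2 is the empty string rather than null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PossessiveWorkaround {
    // Same shape as the pattern in this report, but the possessive quantifier
    // now sits inside capturing group 2 instead of wrapping it.
    static final Pattern REWRITTEN = Pattern.compile("([abc]+?)(b?+)(d)");

    // Hypothetical helper: returns groups 1..3, or null if the whole input
    // does not match.
    static String[] groups(String input) {
        Matcher m = REWRITTEN.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] g = groups("abcd");
        // Prints "abc||d": group 2 matched the empty string.
        System.out.println(g == null ? "does not match" : g[0] + "|" + g[1] + "|" + g[2]);
    }
}
```

With this rewrite, concatenating the three groups always reproduces the matched input, at the cost of no longer being able to tell "b absent" apart from an empty capture via a null check.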
Comments
The "tail" node of the possessive group is never backtracked into (backed off), so it does not reset the captured result.
14-11-2014

As evidence for the issue, consider Perl's behavior:
> perl -e 'print "$1|$2|$3\n" if "abcd" =~ /([abc]+?)(b)?+(d)/;'
prints 'abc||d', as expected.
02-11-2013