JDK-6520207 : Dollar/UnixDollar bad behavior shouldn't match twice in "\n"
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Affected Version: 7
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: linux
  • CPU: x86
  • Submitted: 2007-02-01
  • Updated: 2015-01-13
Description
FULL PRODUCT VERSION :
java version "1.7.0-ea"
Java(TM) SE Runtime Environment (build 1.7.0-ea-b06)
Java HotSpot(TM) Client VM (build 1.7.0-ea-b06, mixed mode, sharing)


ADDITIONAL OS VERSION INFORMATION :
Linux helium 2.6.17-10-generic #2 SMP Tue Dec 5 22:28:26 UTC 2006 i686 GNU/Linux

A DESCRIPTION OF THE PROBLEM :
Pattern.compile("$").matcher("a\nb\nc\n") matches twice instead of once.

http://elliotth.blogspot.com/2007/01/what-do-anchors-and-mean-in-regular.html

The first match is immediately before the final line terminator. The second match is at the end of input.

In MULTILINE mode this is unfortunate, because it is not Perl-compatible and should be listed among the incompatibilities with Perl 5 in the documentation. It is understandable, though, because of the "or" in the definition of what MULTILINE causes $ to match.

In non-MULTILINE mode, however, this is incorrect, in that I don't see how the documentation specifies it.
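
To make the match positions visible, here is a minimal sketch that lists the start offset of every match of $ with and without MULTILINE. The class name DollarOffsets and the helper offsets are illustrative only and are not part of the submitted test case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DollarOffsets {
    // Collects the start offset of every match of regex in input.
    static List<Integer> offsets(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        List<Integer> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "a\nb\nc\n"; // 6 chars; final line terminator at offset 5
        System.out.println("default:   " + offsets("$", 0, input));
        System.out.println("MULTILINE: " + offsets("$", Pattern.MULTILINE, input));
    }
}
```

On a build that exhibits this behavior, the default case should report offsets 5 and 6 (before the final line terminator, then at the end of input), and the MULTILINE case should report 1, 3, 5 and 6.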

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Run the supplied test case.


REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
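import java.util.regex.*;

public class test {
    public static void main(String[] args) {
        // Non-MULTILINE "$": expected to match once, at the end of input.
        Pattern p = Pattern.compile("$");
        Matcher m = p.matcher("a\nb\nc\nhello\nworld\n");
        // Count every successive match found by find().
        int count = 0;
        while (m.find()) {
            ++count;
        }
        // Prints 2 on affected builds; 1 is expected.
        System.err.println(count);
    }
}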
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
I would have suggested using \Z, but that's broken too ;-)
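
For comparison, the sketch below counts matches of $, \Z and \z against the test input. The \z anchor is not mentioned in the original report; it is included here on the assumption that a single end-of-input match is what is wanted, since \z is documented as matching only at the end of the input. The class name EndAnchorCount and the helper count are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EndAnchorCount {
    // Counts how many times regex matches in input using repeated find().
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            ++n;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "a\nb\nc\nhello\nworld\n";
        System.out.println("$  : " + count("$", input));
        System.out.println("\\Z : " + count("\\Z", input));
        System.out.println("\\z : " + count("\\z", input));
    }
}
```

On a build that exhibits this behavior, $ and \Z should each report 2 and \z should report 1. Note that \z does not match before a final line terminator, so it is only a substitute when that position is not needed.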
Copied from http://bugs.openjdk.java.net/show_bug.cgi?id=100084#c0
Description From ###@###.### 2009-07-09 01:40:09 PDT

Created an attachment (id=99)
contains the exported diff and a jtreg testcase

sunbug=6520207

Pattern.compile("$").matcher("a\nb\nc\n") matches twice instead of once.
--------------------------------------------

The attached diff adds a simple check in the Pattern$Dollar class to avoid
matching without any content.

Comments
SUGGESTED FIX See attached patch from OpenJDK bugzilla:
23-07-2012