JDK-8215626 : The '^' operator (negation in char classes) in regex does not work properly
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Affected Version: 11,12
  • Priority: P3
  • Status: Closed
  • Resolution: Not an Issue
  • OS: linux_ubuntu
  • CPU: x86_64
  • Submitted: 2018-12-17
  • Updated: 2020-07-16
  • Resolved: 2019-01-08
Related Reports
Relates :  
Description
A DESCRIPTION OF THE PROBLEM :
Hi, the '^' operator (negation in a character class) does not seem to work.
The source code example below behaves completely differently in Java 8 and Java 11.


REGRESSION : Last worked in version 8u191

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Encoding : UTF-8
The output is ooerdqK$Fop22{78ae������������
ACTUAL -
Encoding : UTF-8
The output is ooerdqKFop22{78ae

---------- BEGIN SOURCE ----------
import java.text.Normalizer;
import java.util.regex.Pattern;

/**
 *
 * @author Andres Bel Alonso
 */
public class BugExample {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        // need UTF-8 encoding, ensure it
        System.out.println("Encoding : " + System.getProperty("file.encoding"));
        
        // I want to clean this input by deleting the characters that are neither ASCII nor combining diacritical marks,
        // but keep ������ and $ 

        String input = "oo����er����������dqK$F����o����p����2������2����{78a������������e������������";
        String str = Normalizer.normalize(input, Normalizer.Form.NFD);
        Pattern pattern = Pattern.compile("[^\\p{ASCII}&&[^\\p{InCombiningDiacriticalMarks}]&&[^������$]]");
        // Build the cleaned string
        String out = pattern.matcher(str).replaceAll("");
        
        // Java 8 output : ooerdqK$Fop22{78ae������������
        // Java 11 output : ooerdqKFop22{78ae
        // The Java 11 output is wrong because it removes the characters I wanted to keep. The Java 8 output is OK.
        System.out.println("The output is " + out);
        
        // Finally, using the regex [\\P{ASCII}&&[\\P{InCombiningDiacriticalMarks}]&&[^������$]] works correctly in Java 11
    }
    
}

---------- END SOURCE ----------
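
The workaround mentioned in the last source comment (using \P{...} instead of a leading '^') can be checked in isolation. The sketch below is only an illustration, not the reporter's code: the characters to keep are unreadable in this report, so it assumes U+00DF, U+20AC and '$' as the kept set, and the class name, method name and sample input are hypothetical.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class KeepSetCleanup {
    // Assumed keep-set: U+00DF (sharp s), U+20AC (euro sign) and '$'.
    // The characters used in the original report are not recoverable.
    private static final Pattern CLEANUP = Pattern.compile(
            "[\\P{ASCII}&&[\\P{InCombiningDiacriticalMarks}]&&[^\u00DF\u20AC$]]");

    // Decompose the input (NFD), then delete every character that is
    // non-ASCII, not a combining diacritical mark, and not in the keep-set.
    static String clean(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return CLEANUP.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample input: "caf" + U+00E9, the kept characters, and two CJK characters.
        String cleaned = clean("caf\u00E9 \u00DF\u20AC$ \u65E5\u672C");
        // The CJK characters are removed; the decomposed accent and the keep-set stay.
        System.out.println(cleaned.equals("cafe\u0301 \u00DF\u20AC$ ")); // prints true
    }
}
```

Because this pattern has no leading '^' at the top level, it does not depend on the precedence of '^'.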

FREQUENCY : always



Comments
In JDK 9 the precedence of the "^" operator was corrected to match the design, under issue https://bugs.openjdk.java.net/browse/JDK-6609854; the same question was also raised in https://bugs.openjdk.java.net/browse/JDK-8189343. The discussion that explains the change is here: http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-June/006957.html
To explain in relation to this issue, the example shown can be simplified logically to:
    [^a&&[^b]&&[^cde]]
In JDK 8 this expanded to the same as:
    [[^a]&&[^b]&&[^cde]]
which was incorrect, because "^" has the lowest precedence of all the operators and can only be used directly after "[" to negate a character class. Hence from JDK 9 onwards it correctly expands to the same as:
    [^[a&&[^b]&&[^cde]]]
In the context of this particular example, the pattern therefore removes anything that is:
    Not(ASCII intersect Not(DiacriticMark) intersect Not(�������$))
which is the same as Not(ASCII) plus '$', as the missing '$' in the reported Java 11 output shows. It therefore leaves only plain ASCII characters other than '$' in the input string.
To achieve what you want, you need to put the first term in its own character class with []:
    "[[^\\p{ASCII}]&&[^\\p{InCombiningDiacriticalMarks}]&&[^�������$]]"
08-01-2019
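
The simplified patterns from the comment above can be compared directly. This is a minimal sketch; the class name, helper method and sample input "abcdef" are hypothetical and chosen only for illustration.

```java
import java.util.regex.Pattern;

class NegationPrecedenceDemo {
    // Remove every character of the input that the character class matches.
    static String strip(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "abcdef";

        // Ambiguous form: the result depends on the JDK version, as described above.
        System.out.println(strip("[^a&&[^b]&&[^cde]]", input));

        // JDK 9+ reading written out explicitly: negate the whole intersection.
        // On JDK 9 and later this removes everything except 'a'.
        System.out.println(strip("[^[a&&[^b]&&[^cde]]]", input));

        // Intended reading: the first term gets its own character class.
        // Removes only 'f', leaving "abcde".
        System.out.println(strip("[[^a]&&[^b]&&[^cde]]", input));
    }
}
```

On JDK 9 and later the first two lines of output should match each other; only the last form states the intent without relying on how a leading '^' binds.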

I will investigate this
02-01-2019

Possibly a difference caused by the change in Unicode version? The composition of the various Unicode blocks might have changed between 8 and 11.
19-12-2018

To reproduce the issue, run the attached test case.
JDK 8u191 - Pass
JDK 11.0.1 - Fail
JDK 12-ea+21 - Fail
Output on JDK 8u191:
Encoding : UTF-8
The output is ooA��erA��a�� dqK$FoA��pA��2a��2A��{78aa��s��i����ea�������a�������
Output on JDK 12-ea:
Encoding : UTF-8
The output is ooAerAa dqKFoApA2a2A{78aasieaa
19-12-2018