JDK-8354490 : Pattern.CANON_EQ causes a pattern to not match a string with a UNICODE variation
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Affected Version: 18,25
  • Priority: P3
  • Status: Closed
  • Resolution: Not an Issue
  • OS: generic
  • CPU: generic
  • Submitted: 2025-04-12
  • Updated: 2025-05-21
  • Resolved: 2025-05-21
Description
ADDITIONAL SYSTEM INFORMATION :
> uname -a
Linux 0be9c4498283 6.12.5-linuxkit #1 SMP Tue Jan 21 10:23:32 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

> java -version
openjdk 24 2025-03-18
OpenJDK Runtime Environment (build 24+36-3646)
OpenJDK 64-Bit Server VM (build 24+36-3646, mixed mode, sharing)

A DESCRIPTION OF THE PROBLEM :
Enabling **canonical equivalence** causes a pattern to not match a string with a **variation selector**.  
`Pattern.compile("^[^/]*\\.[^/]*$", Pattern.CANON_EQ)` does not match a string containing a variation selector (e.g., `U+FE0F`),  
whereas `Pattern.compile("^[^/]*\\.[^/]*$")` matches the same string.

The workaround is to remove variation selectors from the string.
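
The workaround above can be sketched as follows. This is a minimal illustration, not code from the JDK: the class name `StripVariationSelectors` and the helper `stripVariationSelectors` are hypothetical, and the sketch assumes that removing the code points U+FE00..U+FE0F and U+E0100..U+E01EF is acceptable for the strings being matched.

```java
import java.util.regex.Pattern;

public class StripVariationSelectors {
    // Hypothetical helper pattern: the BMP variation selectors (U+FE00..U+FE0F)
    // and the supplementary ones (U+E0100..U+E01EF).
    private static final Pattern VARIATION_SELECTORS =
            Pattern.compile("[\\uFE00-\\uFE0F\\x{E0100}-\\x{E01EF}]");

    // Returns a copy of s with every variation selector removed.
    static String stripVariationSelectors(String s) {
        return VARIATION_SELECTORS.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        Pattern canonEq = Pattern.compile("^[^/]*\\.[^/]*$", Pattern.CANON_EQ);

        // Heart (U+2764) followed by the variation selector U+FE0F.
        String original = "\u2764\uFE0F file.txt";
        String stripped = stripVariationSelectors(original);

        // The report observes false for the original string.
        System.out.println(canonEq.matcher(original).matches());
        // With the selector removed, the CANON_EQ pattern matches.
        System.out.println(canonEq.matcher(stripped).matches());
    }
}
```

Note that stripping changes the string itself, so this only suits cases where the match result is needed and the original text is kept separately.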

The described bug is the root cause of the following problem:  
On macOS, when a path matcher is created using the pattern `glob:*.*`,  
it gets converted to the regular expression `^[^/]*\.[^/]*$`,  
and the `Pattern.CANON_EQ` flag is passed during pattern compilation.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1. Compile a pattern using the expression "^[^/]*\\.[^/]*$" and the flag `Pattern.CANON_EQ`.
2. Match a string that contains a variation selector.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The pattern matches the string
ACTUAL -
The pattern does not match the string

---------- BEGIN SOURCE ----------
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // there is a variation selector (U+FE0F) after the heart emoji
        var strWithVariantSelector = "❤️ file.txt";

        var expr = "^[^/]*\\.[^/]*$";
        var pattern = Pattern.compile(expr);
        var patternWithCanonEq = Pattern.compile(expr, Pattern.CANON_EQ);

        var patternMatches = pattern.matcher(strWithVariantSelector).matches();
        var patternWithCannonMatches = patternWithCanonEq.matcher(strWithVariantSelector).matches();

        System.out.println(patternMatches); //true
        System.out.println(patternWithCannonMatches); //false
    }
}
---------- END SOURCE ----------


Comments
After a final inspection and a review of the Unicode Regex spec as it relates to this question, this does not appear to be a bug after all. The request is effectively for a user-defined character class to match extended grapheme clusters. This isn't supported in Java regex, and isn't called for in the Unicode spec. If you want your matcher to be sensitive to extended grapheme clusters instead of individual code points, you could use something like the following as a workaround instead of the string provided in the bug: "^(?:(?!/)\\X)*\\.(?:(?!/)\\X)*$". In Java regex, \X matches an extended grapheme cluster.
21-05-2025
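
The `\X`-based workaround from the closing comment can be tried with a sketch like the one below. The regex is quoted from the comment; the class name `GraphemeAwareGlob` and the helper `matches` are hypothetical names used only for illustration. It assumes JDK 9 or later, where `\X` is available.

```java
import java.util.regex.Pattern;

public class GraphemeAwareGlob {
    // Regex from the comment: each "[^/]" is replaced by "an extended
    // grapheme cluster whose first character is not '/'".
    static final String EXPR = "^(?:(?!/)\\X)*\\.(?:(?!/)\\X)*$";

    // Compiles EXPR with the given flags and tests the whole input.
    static boolean matches(String s, int flags) {
        return Pattern.compile(EXPR, flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        // Heart (U+2764) followed by the variation selector U+FE0F.
        String name = "\u2764\uFE0F file.txt";

        System.out.println(matches(name, 0));
        System.out.println(matches(name, Pattern.CANON_EQ));
        // A '/' is still rejected, as with the original "[^/]" classes.
        System.out.println(matches("dir/file.txt", Pattern.CANON_EQ));
    }
}
```

Because this expression contains no bracketed character class, the heart and its variation selector are consumed together by a single `\X`.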

I suspect that there may be a bug in the normalization of characters and comparison logic that is occurring here and it's getting tripped up in the CANON_EQ logic. Looking into it.
15-05-2025

This issue could well be a duplicate of JDK-8354659, or rather JDK-8354659 is a duplicate of this issue.
16-04-2025

I think the change of the encoding is a red herring - it appears to work, but it does not. All it does is that the bytes in the source code are interpreted differently. To show the effect, this is a JShell example:

jshell> byte[] data = new byte[] {(byte) 0xe2, (byte) 0x9d, (byte) 0xa4, (byte) 0xef, (byte) 0xb8, (byte) 0x8f};
data ==> byte[6] { -30, -99, -92, -17, -72, -113 }
jshell> new String(data, java.nio.charset.Charset.forName("UTF-8"))
$2 ==> "❤"
jshell> new String(data, java.nio.charset.Charset.forName("ISO-8859-1"))
$3 ==> "â\235¤ï¸\217"

I.e. with ISO-8859-1 the interpretation of the String is (presumably) incorrect. There is an easy way to make the original example resilient to input javac encoding changes, try this:

```
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // there is the variant selector (U+FE0F) after the heart emoji
        var strWithVariantSelector = "\u2764\uFE0F file.txt";

        var expr = "^[^/]*\\.[^/]*$";
        var pattern = Pattern.compile(expr);
        var patternWithCanonEq = Pattern.compile(expr, Pattern.CANON_EQ);

        var patternMatches = pattern.matcher(strWithVariantSelector).matches();
        var patternWithCannonMatches = patternWithCanonEq.matcher(strWithVariantSelector).matches();

        System.out.println(patternMatches); //true
        System.out.println(patternWithCannonMatches); //false
    }
}
```

This should fail even when using the ISO-8859-1 encoding:

$ /usr/lib/jvm/java-11-openjdk-amd64/bin/java -Dfile.encoding=ISO-8859-1 /tmp/Main.java
true
false
14-04-2025

Using the following command, the problem is gone:
>c:\jdk-18eab13\bin\java -Dfile.encoding=ISO-8859-1 Main.java
true
true
14-04-2025

I don't think this is related to javac. This is the behavior of the regexp Pattern. The reason why it "works" with `-encoding ISO-8859-1` is that the Unicode characters are (mis-)interpreted as ISO-8859-1 characters, and there's no variation selector in the string literal anymore. And the default encoding was changed to UTF-8 in JDK 18 (https://openjdk.org/jeps/400), which explains the observed behavior on Windows. Observed behavior on Linux is the same on JDK 11 and JDK 24:

$ /usr/lib/jvm/java-11-openjdk-amd64/bin/java /tmp/Main.java
true
false
$ ~/tools/jdk/jdk-24/bin/java /tmp/Main.java
true
false
14-04-2025
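
The point about the variation selector disappearing under ISO-8859-1 can be checked directly. The sketch below is illustrative only: the class name `EncodingRedHerring` and the helper `hasVariationSelector` are hypothetical. It decodes the same six bytes with both charsets and looks for U+FE0F in the result.

```java
import java.nio.charset.StandardCharsets;

public class EncodingRedHerring {
    // UTF-8 encoding of U+2764 (heart) followed by U+FE0F (variation selector).
    static final byte[] DATA = {
        (byte) 0xE2, (byte) 0x9D, (byte) 0xA4,
        (byte) 0xEF, (byte) 0xB8, (byte) 0x8F
    };

    // True if the string contains the variation selector U+FE0F.
    static boolean hasVariationSelector(String s) {
        return s.indexOf('\uFE0F') >= 0;
    }

    public static void main(String[] args) {
        String utf8 = new String(DATA, StandardCharsets.UTF_8);
        String latin1 = new String(DATA, StandardCharsets.ISO_8859_1);

        // Decoded as UTF-8: two chars, the second being U+FE0F.
        System.out.println(utf8.length() + " " + hasVariationSelector(utf8));     // 2 true
        // Decoded as ISO-8859-1: six unrelated chars, no variation selector.
        System.out.println(latin1.length() + " " + hasVariationSelector(latin1)); // 6 false
    }
}
```

With ISO-8859-1 every byte becomes its own character, so the literal no longer contains a variation selector and the `CANON_EQ` pattern has nothing to trip over.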

The observations on Windows 11:
JDK 8: Passed, returned true, true.
JDK 11 and JDK 17: Passed.
JDK 18+12: Passed.
JDK 18+13: Failed, returned true, false.
JDK 24, 25ea+6: Failed.
But if we compile the code with -encoding ISO-8859-1, the problem is gone. Perhaps it is a javac issue.
>"c:\jdk-25ea+6"\bin\javac -encoding ISO-8859-1 Main.java
>"c:\jdk-25ea+6"\bin\java Main
true
true
14-04-2025