Bug ID: JDK-8225061 Performance regression in Regex

JDK-8225061 : Performance regression in Regex

Type: Bug
Component: core-libs
Sub-Component: java.util.regex
Affected Version: 13

Priority: P2
Status: Closed
Resolution: Fixed
OS: os_x
CPU: generic

Submitted: 2019-05-30
Updated: 2020-03-17
Resolved: 2019-06-01

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

JDK 13	JDK 14
13 b24Fixed	14Fixed

Related Reports

Relates :	JDK-8221431 - Support for Unicode 12.1
Relates :	JDK-8241037 - GraphemeTest test does not verify Grapheme

Description

Recent Unicode upgrade to 12.1 caused performance regression in regex.
It is reported that the build has slowed a bit, seems to be spending a lot of time in the grapheme support looking for boundaries
```java.lang.Thread.State: RUNNABLE
    at java.util.regex.Grapheme.getType(java.base/Grapheme.java:168)
    at java.util.regex.Grapheme.nextBoundary(java.base/Grapheme.java:73)
    at java.util.regex.Pattern$NFCCharProperty.match(java.base/Pattern.java:3971)
    at java.util.regex.Pattern$Curly.match0(java.base/Pattern.java:4380)
    at java.util.regex.Pattern$Curly.match(java.base/Pattern.java:4354)
    at java.util.regex.Pattern$GroupHead.match(java.base/Pattern.java:4779)
    at java.util.regex.Pattern$Branch.match(java.base/Pattern.java:4724)
    at java.util.regex.Pattern$GroupHead.match(java.base/Pattern.java:4779)
    at java.util.regex.Pattern$CharPropertyGreedy.match(java.base/Pattern.java:4281)
    at java.util.regex.Pattern$Begin.match(java.base/Pattern.java:3672)
    at java.util.regex.Matcher.match(java.base/Matcher.java:1756)
    at java.util.regex.Matcher.matches(java.base/Matcher.java:713)
    at sun.nio.fs.UnixFileSystem$3.matches(java.base/UnixFileSystem.java:308)
    at jdk.tools.jmod.JmodTask$JmodFileWriter.matches(jdk.jlink/JmodTask.java:808)
    at jdk.tools.jmod.JmodTask$JmodFileWriter$3.visitFile(jdk.jlink/JmodTask.java:788)
    at jdk.tools.jmod.JmodTask$JmodFileWriter$3.visitFile(jdk.jlink/JmodTask.java:777)
    at java.nio.file.Files.walkFileTree(java.base/Files.java:2803)
    at jdk.tools.jmod.JmodTask$JmodFileWriter.processSection(jdk.jlink/JmodTask.java:776)
    at jdk.tools.jmod.JmodTask$JmodFileWriter.processClasses(jdk.jlink/JmodTask.java:752)
    at jdk.tools.jmod.JmodTask$JmodFileWriter.write(jdk.jlink/JmodTask.java:484)
    at jdk.tools.jmod.JmodTask.create(jdk.jlink/JmodTask.java:438)
    at jdk.tools.jmod.JmodTask.run(jdk.jlink/JmodTask.java:208)
    at jdk.tools.jmod.Main.main(jdk.jlink/Main.java:34)```

Comments

http://cr.openjdk.java.net/~redestad/8225061/open.01/ With an ASCII fast-path and a few other simplifications I got time down for the Mac OS X-style jmods regex to ~420 us/op (2x regression), then [~naoto] pointed out we could revert to the old, simpler boundary check in NFCCharProperty, used before JDK-8221431, which gets rid of the Mac build regression in its entirety for the type of matching done by jmod. The optimizations to Grapheme are useful, and improves performance on micros targetting grapheme clusters specifically by about 2x
31-05-2019
The regression was caused by the call to Grapheme.nextBoundary() in NFCCharProperty.match() method, which got slower with the fix to JDK-8221431 / JDK-8222978 (Unicode 12.1 / Grapheme 12.0 support). The purpose of issuing nextBoundary() is to detect whether to call (much heavy weight) Normalizer.normalize() call or not. Since this fast check does not require fully fledged boundary detection, including stateful segmentation check such as Emoji sequence, simply checking the break possibility between two code points as before should suffice. Suggested fix is to bring back the isBoundary(cp1, cp2) method from the previous revision in Grapheme.java, and issue it only from NFCCharProperty.match() method for the fast check.
31-05-2019
Correction: timings for the micro with Pattern.CANON_EQ on 13-b22 is ~240 us/op, and ~6 ms/op on 13-b23. I have experimented with a few improvements to Grapheme that gets the timing down to ~0.4 ms/op, and might get a bit more by ensuring ASCII fast-paths in more places.
31-05-2019
The issue is only reproducible on macOS as its file names require canonical equivalence for matching. There's a code in Pattern.NFCCharProperty.match() whether to issue Normalizer or not, by fast checking the length with Grapheme. That is the location of the regression.
31-05-2019
Just a quick update on this. The PathMatcher on macOS compiles patterns with canonical equivalence, that seems to be where the regression is. Claes has created a micro benchmark that shows a regression from ~65000ns/op to 58899989.176ns/op in jdk-13+23.
31-05-2019
The stack trace is correct, there is one jmod command in particular that spins with this stack trace for >1 minute. I have not duplicated it on other platforms yet.
31-05-2019
I can't see a regression on linux, and isolating the regex and trying it on some inputs doesn't indicate a regression (or use of the Grapheme pieces). The makefiles could generate several --exclude's to jmod, are you sure this is the one being executed in the stacktrace above?
31-05-2019
The build slow down is very significant on macOS. -r fb0cfce19262 (change-set before Unicode upgrade) make; time make jmods real 0m27.633s user 1m40.857s sys 0m26.470s -r 8dae495a59e7 (change-set with Unicode upgrade) make; time make jmods real 5m52.304s user 20m2.583s sys 0m28.617s The pattern that jmod is to using to match each class and resource is: ^.(?:(?:_the\.[^/])\|(?:_[^/]\.marker)\|(?:[^/]\.diz)\|(?:[^/]\.debuginfo)\|(?:[^/]\.dSYM/.)\|(?:[^/]\.dSYM)\|(?:[^/]\.pdb)\|(?:[^/]\.map))$
31-05-2019